linhr opened a new issue, #14529:
URL: https://github.com/apache/iceberg/issues/14529

   ### Apache Iceberg version
   
   1.10.0 (latest release)
   
   ### Query engine
   
   Spark
   
   ### Please describe the bug 🐞
   
   I was trying out Iceberg in PySpark by running the following command to 
start the PySpark shell:
   
   ```bash
   pyspark --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.10.0
   ```
   
   The PySpark version is 3.5.5.
   
   I'd like to write data in Iceberg format without a catalog involved. Here is the code I have:
   
   ```python
   spark.range(10).write.format("iceberg").save("/tmp/test")
   ```
   
   The above code failed. Here is a sample of the error message (with the 
lengthy Java stacktrace removed and personal information redacted):
   
   ```text
   WARN Query: Query for candidates of 
org.apache.hadoop.hive.metastore.model.MDatabase and subclasses resulted in no 
possible candidates
   Required table missing : "DBS" in Catalog "" Schema "". DataNucleus requires 
this table to perform its persistence operations. Either your MetaData is 
incorrect, or you need to enable "datanucleus.schema.autoCreateTables"
   org.datanucleus.store.rdbms.exceptions.MissingTableException: Required table 
missing : "DBS" in Catalog "" Schema "". DataNucleus requires this table to 
perform its persistence operations. Either your MetaData is incorrect, or you 
need to enable "datanucleus.schema.autoCreateTables"
   
   WARN Query: Query for candidates of 
org.apache.hadoop.hive.metastore.model.MTableColumnStatistics and subclasses 
resulted in no possible candidates
   Required table missing : "CDS" in Catalog "" Schema "". DataNucleus requires 
this table to perform its persistence operations. Either your MetaData is 
incorrect, or you need to enable "datanucleus.schema.autoCreateTables"
   org.datanucleus.store.rdbms.exceptions.MissingTableException: Required table 
missing : "CDS" in Catalog "" Schema "". DataNucleus requires this table to 
perform its persistence operations. Either your MetaData is incorrect, or you 
need to enable "datanucleus.schema.autoCreateTables"
   
   WARN Query: Query for candidates of 
org.apache.hadoop.hive.metastore.model.MPartitionColumnStatistics and 
subclasses resulted in no possible candidates
   Required table missing : "CDS" in Catalog "" Schema "". DataNucleus requires 
this table to perform its persistence operations. Either your MetaData is 
incorrect, or you need to enable "datanucleus.schema.autoCreateTables"
   org.datanucleus.store.rdbms.exceptions.MissingTableException: Required table 
missing : "CDS" in Catalog "" Schema "". DataNucleus requires this table to 
perform its persistence operations. Either your MetaData is incorrect, or you 
need to enable "datanucleus.schema.autoCreateTables"
   
   WARN Query: Query for candidates of 
org.apache.hadoop.hive.metastore.model.MConstraint and subclasses resulted in 
no possible candidates
   Required table missing : "CDS" in Catalog "" Schema "". DataNucleus requires 
this table to perform its persistence operations. Either your MetaData is 
incorrect, or you need to enable "datanucleus.schema.autoCreateTables"
   org.datanucleus.store.rdbms.exceptions.MissingTableException: Required table 
missing : "CDS" in Catalog "" Schema "". DataNucleus requires this table to 
perform its persistence operations. Either your MetaData is incorrect, or you 
need to enable "datanucleus.schema.autoCreateTables"
   
   WARN MetaStoreDirectSql: Self-test query [select "DB_ID" from "DBS"] failed; 
direct SQL is disabled
   javax.jdo.JDODataStoreException: Error executing SQL query "select "DB_ID" 
from "DBS"".
   NestedThrowablesStackTrace:
   java.sql.SQLSyntaxErrorException: Table/View 'DBS' does not exist.
   Caused by: ERROR 42X05: Table/View 'DBS' does not exist.
        ... 108 more
   
   WARN Query: Query for candidates of 
org.apache.hadoop.hive.metastore.model.MVersionTable and subclasses resulted in 
no possible candidates
   Required table missing : "VERSION" in Catalog "" Schema "". DataNucleus 
requires this table to perform its persistence operations. Either your MetaData 
is incorrect, or you need to enable "datanucleus.schema.autoCreateTables"
   org.datanucleus.store.rdbms.exceptions.MissingTableException: Required table 
missing : "VERSION" in Catalog "" Schema "". DataNucleus requires this table to 
perform its persistence operations. Either your MetaData is incorrect, or you 
need to enable "datanucleus.schema.autoCreateTables"
   
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "/venv/lib/python3.11/site-packages/pyspark/sql/readwriter.py", line 
1463, in save
       self._jwrite.save(path)
     File 
"/venv/lib/python3.11/site-packages/pyspark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py",
 line 1322, in __call__
     File 
"/venv/lib/python3.11/site-packages/pyspark/errors/exceptions/captured.py", 
line 179, in deco
       return f(*a, **kw)
              ^^^^^^^^^^^
     File 
"/venv/lib/python3.11/site-packages/pyspark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py",
 line 326, in get_return_value
   py4j.protocol.Py4JJavaError: An error occurred while calling o43.save.
   : org.apache.iceberg.hive.RuntimeMetaException: Failed to connect to Hive 
Metastore
   Caused by: java.lang.RuntimeException: Unable to instantiate 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient
        ... 52 more
   Caused by: java.lang.reflect.InvocationTargetException
        ... 64 more
   Caused by: MetaException(message:Version information not found in metastore. 
)
        ... 70 more
   Caused by: MetaException(message:Version information not found in metastore. 
)
        ... 73 more
   ```
   
   I can see that the examples in the documentation mostly involve catalog table names rather than table paths. However, reading a table path directly seems to work for me:
   
   ```python
   spark.read.format("iceberg").load("/some/existing/table").show()
   ```
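
   I would also expect an append to an existing table path to go through the same path-based route as the read, though I have not verified this. A sketch:

   ```python
   # Untested sketch: append to an Iceberg table that already exists at this
   # path, skipping the create-table step where the metastore connection
   # appears to happen in the failing case above.
   spark.range(10).write.format("iceberg").mode("append").save("/some/existing/table")
   ```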
   
   So I'm wondering why the writer does not work out of the box.
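
   One way I can imagine narrowing this down (a sketch, untested; the catalog name `local` and the warehouse path are arbitrary, and the config keys are the documented Iceberg Spark options) is to configure an explicit Hadoop-type catalog, which uses a plain filesystem warehouse instead of a Hive Metastore, and see whether the write succeeds through it:

   ```python
   # Sketch: register a Hadoop-type Iceberg catalog on the running session;
   # "local" and the warehouse path are arbitrary placeholder names.
   spark.conf.set("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
   spark.conf.set("spark.sql.catalog.local.type", "hadoop")
   spark.conf.set("spark.sql.catalog.local.warehouse", "/tmp/warehouse")

   # Create the table through the catalog, then append with the same v1 API
   # that fails in the catalog-less case.
   spark.sql("CREATE TABLE local.db.test (id BIGINT) USING iceberg")
   spark.range(10).write.format("iceberg").mode("append").save("local.db.test")
   ```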
   
   Also, in the 
[documentation](https://iceberg.apache.org/docs/nightly/spark-writes/#writing-with-dataframes)
 I noticed this sentence:
   
   > The v1 DataFrame `write` API is still supported, but is not recommended.
   
   Is there a reason for this?
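
   For comparison, my understanding is that the recommended v2 API the docs refer to would look roughly like this (a sketch, assuming the hypothetical `local` Hadoop-type catalog from above):

   ```python
   # v2 DataFrameWriterV2 API: create() makes the table, append() adds rows.
   spark.range(10).writeTo("local.db.test2").create()
   spark.range(10).writeTo("local.db.test2").append()
   ```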
   
   ### Willingness to contribute
   
   - [ ] I can contribute a fix for this bug independently
   - [ ] I would be willing to contribute a fix for this bug with guidance from 
the Iceberg community
   - [x] I cannot contribute a fix for this bug at this time

