linhr opened a new issue, #14529:
URL: https://github.com/apache/iceberg/issues/14529

   ### Apache Iceberg version
   
   1.10.0 (latest release)
   
   ### Query engine
   
   Spark
   
   ### Please describe the bug 🐞
   
   I was trying out Iceberg in PySpark by running the following command to 
start the PySpark shell:
   
   ```bash
   pyspark --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.10.0
   ```
   
   The PySpark version is 3.5.5.
   
   I'd like to write data in Iceberg format without a catalog involved. Here is the code I have:
   
   ```python
   spark.range(10).write.format("iceberg").save("/tmp/test")
   ```
   
   The above code failed. Here is a sample of the error message (with the 
lengthy Java stacktrace removed and personal information redacted):
   
   ```text
   WARN Query: Query for candidates of 
org.apache.hadoop.hive.metastore.model.MDatabase and subclasses resulted in no 
possible candidates
   Required table missing : "DBS" in Catalog "" Schema "". DataNucleus requires 
this table to perform its persistence operations. Either your MetaData is 
incorrect, or you need to enable "datanucleus.schema.autoCreateTables"
   org.datanucleus.store.rdbms.exceptions.MissingTableException: Required table 
missing : "DBS" in Catalog "" Schema "". DataNucleus requires this table to 
perform its persistence operations. Either your MetaData is incorrect, or you 
need to enable "datanucleus.schema.autoCreateTables"
   
   WARN Query: Query for candidates of 
org.apache.hadoop.hive.metastore.model.MTableColumnStatistics and subclasses 
resulted in no possible candidates
   Required table missing : "CDS" in Catalog "" Schema "". DataNucleus requires 
this table to perform its persistence operations. Either your MetaData is 
incorrect, or you need to enable "datanucleus.schema.autoCreateTables"
   org.datanucleus.store.rdbms.exceptions.MissingTableException: Required table 
missing : "CDS" in Catalog "" Schema "". DataNucleus requires this table to 
perform its persistence operations. Either your MetaData is incorrect, or you 
need to enable "datanucleus.schema.autoCreateTables"
   
   WARN Query: Query for candidates of 
org.apache.hadoop.hive.metastore.model.MPartitionColumnStatistics and 
subclasses resulted in no possible candidates
   Required table missing : "CDS" in Catalog "" Schema "". DataNucleus requires 
this table to perform its persistence operations. Either your MetaData is 
incorrect, or you need to enable "datanucleus.schema.autoCreateTables"
   org.datanucleus.store.rdbms.exceptions.MissingTableException: Required table 
missing : "CDS" in Catalog "" Schema "". DataNucleus requires this table to 
perform its persistence operations. Either your MetaData is incorrect, or you 
need to enable "datanucleus.schema.autoCreateTables"
   
   WARN Query: Query for candidates of 
org.apache.hadoop.hive.metastore.model.MConstraint and subclasses resulted in 
no possible candidates
   Required table missing : "CDS" in Catalog "" Schema "". DataNucleus requires 
this table to perform its persistence operations. Either your MetaData is 
incorrect, or you need to enable "datanucleus.schema.autoCreateTables"
   org.datanucleus.store.rdbms.exceptions.MissingTableException: Required table 
missing : "CDS" in Catalog "" Schema "". DataNucleus requires this table to 
perform its persistence operations. Either your MetaData is incorrect, or you 
need to enable "datanucleus.schema.autoCreateTables"
   
   WARN MetaStoreDirectSql: Self-test query [select "DB_ID" from "DBS"] failed; 
direct SQL is disabled
   javax.jdo.JDODataStoreException: Error executing SQL query "select "DB_ID" 
from "DBS"".
   NestedThrowablesStackTrace:
   java.sql.SQLSyntaxErrorException: Table/View 'DBS' does not exist.
   Caused by: ERROR 42X05: Table/View 'DBS' does not exist.
        ... 108 more
   
   WARN Query: Query for candidates of 
org.apache.hadoop.hive.metastore.model.MVersionTable and subclasses resulted in 
no possible candidates
   Required table missing : "VERSION" in Catalog "" Schema "". DataNucleus 
requires this table to perform its persistence operations. Either your MetaData 
is incorrect, or you need to enable "datanucleus.schema.autoCreateTables"
   org.datanucleus.store.rdbms.exceptions.MissingTableException: Required table 
missing : "VERSION" in Catalog "" Schema "". DataNucleus requires this table to 
perform its persistence operations. Either your MetaData is incorrect, or you 
need to enable "datanucleus.schema.autoCreateTables"
   
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "/venv/lib/python3.11/site-packages/pyspark/sql/readwriter.py", line 
1463, in save
       self._jwrite.save(path)
     File 
"/venv/lib/python3.11/site-packages/pyspark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py",
 line 1322, in __call__
     File 
"/venv/lib/python3.11/site-packages/pyspark/errors/exceptions/captured.py", 
line 179, in deco
       return f(*a, **kw)
              ^^^^^^^^^^^
     File 
"/venv/lib/python3.11/site-packages/pyspark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py",
 line 326, in get_return_value
   py4j.protocol.Py4JJavaError: An error occurred while calling o43.save.
   : org.apache.iceberg.hive.RuntimeMetaException: Failed to connect to Hive 
Metastore
   Caused by: java.lang.RuntimeException: Unable to instantiate 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient
        ... 52 more
   Caused by: java.lang.reflect.InvocationTargetException
        ... 64 more
   Caused by: MetaException(message:Version information not found in metastore. 
)
        ... 70 more
   Caused by: MetaException(message:Version information not found in metastore. 
)
        ... 73 more
   ```
   
   I can see that the examples in the documentation mostly involve catalog table names rather than table paths. However, reading a table path directly seems to work for me:
   
   ```python
   spark.read.format("iceberg").load("/some/existing/table").show()
   ```
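
   I would also expect an append to an existing table path to go through the same path-based route as the read, though I have not verified this. A sketch:

   ```python
   # Untested sketch: append to an Iceberg table that already exists at this
   # path, skipping the create-table step where the metastore connection
   # appears to happen in the failing case above.
   spark.range(10).write.format("iceberg").mode("append").save("/some/existing/table")
   ```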
   
   So I'm wondering why the writer does not work out of the box.
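
   One way I can imagine narrowing this down (a sketch, untested; the catalog name `local` and the warehouse path are arbitrary, and the config keys are the documented Iceberg Spark options) is to configure an explicit Hadoop-type catalog, which uses a plain filesystem warehouse instead of a Hive Metastore, and see whether the write succeeds through it:

   ```python
   # Sketch: register a Hadoop-type Iceberg catalog on the running session;
   # "local" and the warehouse path are arbitrary placeholder names.
   spark.conf.set("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
   spark.conf.set("spark.sql.catalog.local.type", "hadoop")
   spark.conf.set("spark.sql.catalog.local.warehouse", "/tmp/warehouse")

   # Create the table through the catalog, then append with the same v1 API
   # that fails in the catalog-less case.
   spark.sql("CREATE TABLE local.db.test (id BIGINT) USING iceberg")
   spark.range(10).write.format("iceberg").mode("append").save("local.db.test")
   ```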
   
   Also, in the 
[documentation](https://iceberg.apache.org/docs/nightly/spark-writes/#writing-with-dataframes)
 I noticed this sentence:
   
   > The v1 DataFrame `write` API is still supported, but is not recommended.
   
   Is there a reason for this?
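
   For comparison, my understanding is that the recommended v2 API the docs refer to would look roughly like this (a sketch, assuming the hypothetical `local` Hadoop-type catalog from above):

   ```python
   # v2 DataFrameWriterV2 API: create() makes the table, append() adds rows.
   spark.range(10).writeTo("local.db.test2").create()
   spark.range(10).writeTo("local.db.test2").append()
   ```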
   
   ### Willingness to contribute
   
   - [ ] I can contribute a fix for this bug independently
   - [ ] I would be willing to contribute a fix for this bug with guidance from 
the Iceberg community
   - [x] I cannot contribute a fix for this bug at this time

