pvary commented on pull request #1505: URL: https://github.com/apache/iceberg/pull/1505#issuecomment-700649623
> > We need the build changes, because the spark sql uses Hive code to interact with hive tables.
>
> That sounds like a bug to me. We want to ensure that Spark does not interact with Iceberg tables using Hive. It should use Iceberg directly through the v2 interface instead. We'll need to find out why Spark is trying to load the table incorrectly. What catalog were you using?

I made the following 2 changes in the Iceberg code and ran the spark3 tests:
- Removed the additional dependencies, and
- Set the default value to `false`.

One such test is `TestCreateTable.testCreateTable`. The drop table in the "after" step fails with the exception below. It uses HiveCatalog. See: [code](https://github.com/apache/iceberg/blob/1772f4f27b8a12d3e89a7f65b8b600b717e1f09d/spark/src/test/java/org/apache/iceberg/spark/SparkTestBase.java#L63)

The test tries to run `DROP TABLE IF EXISTS testhive.default.table` using `spark.sql`. My understanding is that this is Spark code, which delegates the call to the embedded `org.apache.hadoop.hive.ql.metadata.Table` class; that class tries to load the StorageHandler and fails. This means that when Iceberg tables use the HiveCatalog and Spark accesses them through Spark SQL, several commands will fail unless `HiveIcebergStorageHandler` is on the classpath. I am not sure we can change this part of Spark. I think:
- We either have to add a note to the docs that an additional jar is needed when Spark SQL and HiveCatalog are used together, or
- We have to add the jars to our runtime jars.

Your thoughts?
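For context, the failing call in `Table.getStorageHandler` boils down to a reflective class load of the handler name stored in the table metadata. A minimal sketch of that pattern (simplified and hypothetical — `StorageHandlerLookup` is not the actual Hive code) shows how a missing `HiveIcebergStorageHandler` on the classpath surfaces as the `RuntimeException` in the trace below:

```java
// Hedged sketch of how a storage handler lookup by class name fails when
// the class is not on the classpath. The class and method names here are
// illustrative stand-ins, not the real Hive implementation.
public class StorageHandlerLookup {

    static Object loadStorageHandler(String className) {
        try {
            // Hive resolves the handler named in the table's metadata via reflection.
            Class<?> clazz = Class.forName(className);
            return clazz.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            // A missing class (ClassNotFoundException) is wrapped and rethrown,
            // which is roughly the error message seen in the stack trace.
            throw new RuntimeException("Error in loading storage handler." + className, e);
        }
    }

    public static void main(String[] args) {
        try {
            loadStorageHandler("org.apache.iceberg.mr.hive.HiveIcebergStorageHandler");
        } catch (RuntimeException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Since the handler class is only ever referenced by name, neither Spark nor Hive can detect the missing jar at compile time; the failure only shows up at runtime, on the first command that touches the table metadata.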
The exception stack trace:
```
java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Error in loading storage handler.org.apache.iceberg.mr.hive.HiveIcebergStorageHandler
	at org.apache.hadoop.hive.ql.metadata.Table.getStorageHandler(Table.java:297)
	at org.apache.spark.sql.hive.client.HiveClientImpl.convertHiveTableToCatalogTable(HiveClientImpl.scala:465)
	at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$getTableOption$3(HiveClientImpl.scala:424)
	at scala.Option.map(Option.scala:230)
	at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$getTableOption$1(HiveClientImpl.scala:424)
	at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:294)
	at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:227)
	at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:226)
	at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:276)
	at org.apache.spark.sql.hive.client.HiveClientImpl.getTableOption(HiveClientImpl.scala:422)
	at org.apache.spark.sql.hive.client.HiveClient.getTable(HiveClient.scala:90)
	at org.apache.spark.sql.hive.client.HiveClient.getTable$(HiveClient.scala:89)
	at org.apache.spark.sql.hive.client.HiveClientImpl.getTable(HiveClientImpl.scala:90)
	at org.apache.spark.sql.hive.HiveExternalCatalog.getRawTable(HiveExternalCatalog.scala:120)
	at org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$getTable$1(HiveExternalCatalog.scala:719)
	at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:99)
	at org.apache.spark.sql.hive.HiveExternalCatalog.getTable(HiveExternalCatalog.scala:719)
	at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.getTable(ExternalCatalogWithListener.scala:138)
	at org.apache.spark.sql.catalyst.catalog.SessionCatalog.getTableMetadata(SessionCatalog.scala:446)
	at org.apache.spark.sql.execution.command.DropTableCommand.run(ddl.scala:226)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
	at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:229)
	at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3616)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3614)
	at org.apache.spark.sql.Dataset.<init>(Dataset.scala:229)
	at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763)
	at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97)
	at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:606)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763)
	at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:601)
	at org.apache.iceberg.spark.SparkTestBase.sql(SparkTestBase.java:83)
	at org.apache.iceberg.spark.sql.TestCreateTable.dropTestTable(TestCreateTable.java:47)
```

----------------------------------------------------------------
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
