codejoyan opened a new issue #2852: URL: https://github.com/apache/hudi/issues/2852
I have a requirement to read a Hudi table from Hive. The documentation (https://hudi.apache.org/docs/querying_data.html#hive) says we have to copy `hudi-hadoop-mr-bundle-x.y.z-SNAPSHOT.jar` into the aux jars path of the HiveServer2 host. I wanted to know what the role of this jar is and what happens internally when Hive is started with it.

When I run hive sync without copying the jar to the aux path, I get the error below. In the error I can see that it tries to create an external table. What if I create the external table manually? Will I then be able to read the Hudi table from Hive, or will I be missing out on some additional features? Please let me know if you have any questions.

```scala
scala> transformedDF.write.format("org.apache.hudi").
     |   options(getQuickstartWriteConfigs).
     |   option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "col_9").
     |   option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "col_2,col_1,col_3").
     |   option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
     |   option(DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY, "org.apache.hudi.keygen.ComplexKeyGenerator").
     |   option("hoodie.upsert.shuffle.parallelism", "2").
     |   option("hoodie.insert.shuffle.parallelism", "2").
     |   option(HoodieWriteConfig.TABLE_NAME, "TestTableHudiHive").
     |   option("hoodie.datasource.hive_sync.enable", true).
     |   option("hoodie.datasource.hive_sync.jdbcurl", "jdbc:hive2://hive_server2_host:10001/default;principal=hive/_HOST@[email protected];transportMode=http;httpPath=cliservice").
     |   option("hoodie.datasource.hive_sync.database", "default").
     |   option("hoodie.datasource.hive_sync.table", "TestTableHudiHive").
     |   option("hoodie.datasource.hive_sync.assume_date_partitioning", false).
     |   option("hoodie.datasource.hive_sync.partition_fields", "partitionpath").
     |   mode(SaveMode.Append).
     |   save(targetPath)
```

```
21/04/15 18:15:21 ERROR HiveSyncTool: Got runtime exception when hive syncing
org.apache.hudi.hive.HoodieHiveSyncException: Failed in executing SQL CREATE EXTERNAL TABLE IF NOT EXISTS `default`.`TestTableHudiHive`( `_hoodie_commit_time` string, `_hoodie_commit_seqno` string, `_hoodie_record_key` string, `_hoodie_partition_path` string, `_hoodie_file_name` string, `col_1` string, `col_2` int, `col_3` int, `col_4` string, `col_5` string, `col_6` int, `col_7` bigint, `col_8` string, `col_9` bigint, `col_10` string, `cntry_cd` string, `bus_dt` DATE) PARTITIONED BY (`partitionpath` string) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'org.apache.hudi.hadoop.HoodieParquetInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' LOCATION 'gs://xxxxxxxxxxxxxxxxx1919010xxxxxxx/test_table_tgt_04142021_1'
	at org.apache.hudi.hive.HoodieHiveClient.updateHiveSQL(HoodieHiveClient.java:369)
	at org.apache.hudi.hive.HoodieHiveClient.createTable(HoodieHiveClient.java:263)
	at org.apache.hudi.hive.HiveSyncTool.syncSchema(HiveSyncTool.java:181)
	at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:136)
	at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:94)
	at org.apache.hudi.HoodieSparkSqlWriter$.org$apache$hudi$HoodieSparkSqlWriter$$syncHive(HoodieSparkSqlWriter.scala:355)
	at org.apache.hudi.HoodieSparkSqlWriter$$anonfun$metaSync$2.apply(HoodieSparkSqlWriter.scala:403)
	at org.apache.hudi.HoodieSparkSqlWriter$$anonfun$metaSync$2.apply(HoodieSparkSqlWriter.scala:399)
	at scala.collection.mutable.HashSet.foreach(HashSet.scala:78)
	at org.apache.hudi.HoodieSparkSqlWriter$.metaSync(HoodieSparkSqlWriter.scala:399)
	at org.apache.hudi.HoodieSparkSqlWriter$.commitAndPerformPostOperations(HoodieSparkSqlWriter.scala:460)
	at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:217)
	at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:134)
	at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:46)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
	at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
	at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
	at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:654)
	at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:654)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
	at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:654)
	at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:273)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:267)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:225)
	at $line23.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:54)
	at $line23.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:59)
	at $line23.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:61)
	at $line23.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:63)
	at $line23.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:65)
	at $line23.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:67)
	at $line23.$read$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:69)
	at $line23.$read$$iw$$iw$$iw$$iw$$iw.<init>(<console>:71)
	at $line23.$read$$iw$$iw$$iw$$iw.<init>(<console>:73)
	at $line23.$read$$iw$$iw$$iw.<init>(<console>:75)
	at $line23.$read$$iw$$iw.<init>(<console>:77)
	at $line23.$read$$iw.<init>(<console>:79)
	at $line23.$read.<init>(<console>:81)
	at $line23.$read$.<init>(<console>:85)
	at $line23.$read$.<clinit>(<console>)
	at $line23.$eval$.$print$lzycompute(<console>:7)
	at $line23.$eval$.$print(<console>:6)
	at $line23.$eval.$print(<console>)
```

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected]
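To make the question concrete, the manual workaround I have in mind would be to register the bundle jar in my Hive session and then run the same DDL that the sync tool generated. A sketch (the `ADD JAR` path is a placeholder for wherever the bundle jar actually lives; the `CREATE TABLE` statement is copied verbatim from the error above):

```sql
-- Placeholder path: adjust to the actual location of the bundle jar
ADD JAR /path/to/hudi-hadoop-mr-bundle-x.y.z-SNAPSHOT.jar;

-- The DDL the sync tool tried to execute, copied from the error above
CREATE EXTERNAL TABLE IF NOT EXISTS `default`.`TestTableHudiHive` (
  `_hoodie_commit_time` string,
  `_hoodie_commit_seqno` string,
  `_hoodie_record_key` string,
  `_hoodie_partition_path` string,
  `_hoodie_file_name` string,
  `col_1` string,
  `col_2` int,
  `col_3` int,
  `col_4` string,
  `col_5` string,
  `col_6` int,
  `col_7` bigint,
  `col_8` string,
  `col_9` bigint,
  `col_10` string,
  `cntry_cd` string,
  `bus_dt` DATE)
PARTITIONED BY (`partitionpath` string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS
  INPUTFORMAT 'org.apache.hudi.hadoop.HoodieParquetInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 'gs://xxxxxxxxxxxxxxxxx1919010xxxxxxx/test_table_tgt_04142021_1';
```

My assumption is that even if this works, the sync tool also keeps the table's partitions registered on each commit (e.g. via `ALTER TABLE ... ADD PARTITION`), which I would then have to manage manually — is that the feature I would be missing?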
