aswin-mp opened a new issue, #8009:
URL: https://github.com/apache/hudi/issues/8009
**Describe the problem you faced**
We have a Spark application that successfully reads an AWS Glue table and
deletes records that match certain search criteria. However, when the
application then tries to sync the metadata with the Hive sync mode set to
HMS, it fails with: "Could not sync using the meta sync class
org.apache.hudi.hive.HiveSyncTool." The error occurs in the last stage of
processing, after the records have been deleted from the table.
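For context, the delete is issued through a plain DataFrame write. The snippet below is a simplified sketch of that step, not our actual code; the base path, filter column, and values are placeholders, and the Hudi options mirror the app properties listed below.

```java
import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class DeleteAndSync {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("delete-and-sync").getOrCreate();
        String basePath = "s3://<bucket>/<table-base-path>";  // placeholder

        // Read the Hudi table registered in Glue and select the records to delete.
        Dataset<Row> toDelete = spark.read().format("hudi").load(basePath)
                .filter(col("<search column>").equalTo("<search value>"));  // placeholder criteria

        // Issue the delete; meta sync to HMS (Glue) runs at the end of this write.
        toDelete.write().format("hudi")
                .option("hoodie.table.name", "<Table Name>")
                .option("hoodie.datasource.write.operation", "delete")
                .option("hoodie.datasource.write.recordkey.field", "<Key>")
                .option("hoodie.datasource.write.partitionpath.field", "<Partition field>")
                .option("hoodie.datasource.write.keygenerator.class",
                        "org.apache.hudi.keygen.TimestampBasedKeyGenerator")
                .option("hoodie.datasource.meta.sync.enable", "true")
                .option("hoodie.datasource.hive_sync.enable", "true")
                .option("hoodie.datasource.hive_sync.mode", "hms")
                .option("hoodie.datasource.hive_sync.database", "<DB name>")
                .option("hoodie.datasource.hive_sync.table", "<Table Name>")
                .option("hoodie.datasource.hive_sync.partition_fields", "<Partition Fields>")
                .mode(SaveMode.Append)
                .save(basePath);
    }
}
```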
App properties:
hoodie.datasource.write.recordkey.field=<Key>
hoodie.datasource.write.partitionpath.field=<Partition field>
hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.TimestampBasedKeyGenerator
hoodie.deltastreamer.keygen.timebased.timestamp.type=SCALAR
hoodie.deltastreamer.keygen.timebased.timestamp.scalar.time.unit=MICROSECONDS
hoodie.deltastreamer.keygen.timebased.output.dateformat=yyyy/MM/dd
hoodie.deltastreamer.keygen.timebased.timezone=GMT
hoodie.deltastreamer.source.kafka.topic=<topicName>
hoodie.datasource.hive_sync.database=<DB name>
hoodie.datasource.hive_sync.table=<Table Name>
hoodie.datasource.write.hive_style_partitioning=false
hoodie.datasource.hive_sync.partition_fields=<Partition Fields>
hoodie.datasource.meta.sync.enable=true
hoodie.datasource.hive_sync.mode=hms
hoodie.datasource.hive_sync.enable=true
Table properties:
hoodie.datasource.write.recordkey.field=<Key>
hoodie.datasource.write.partitionpath.field=<Partition field>
hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.TimestampBasedKeyGenerator
hoodie.deltastreamer.keygen.timebased.timestamp.type=EPOCHMILLISECONDS
hoodie.deltastreamer.keygen.timebased.output.dateformat=yyyy/MM/dd
hoodie.deltastreamer.source.kafka.topic=<topicName>
hoodie.datasource.hive_sync.database=<DB name>
hoodie.datasource.hive_sync.table=<Table Name>
hoodie.datasource.write.hive_style_partitioning=false
hoodie.datasource.hive_sync.partition_fields=<Partition Fields>
The --enable-sync option is set in Deltastreamer.
We are facing this issue on tables that are written by Deltastreamer.
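To make the partition layout concrete: as we understand the TimestampBasedKeyGenerator settings above, the partition field (an epoch value) is formatted with yyyy/MM/dd in GMT, and with hive-style partitioning disabled that yields a three-level folder path on S3. A minimal illustration (the epoch value here is an example we made up, not taken from our data):

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class PartitionPathSketch {
    public static void main(String[] args) {
        long epochMillis = 1677000000000L;  // example value of <Partition field>
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy/MM/dd");
        fmt.setTimeZone(TimeZone.getTimeZone("GMT"));
        // With hoodie.datasource.write.hive_style_partitioning=false this becomes
        // three nested folders on S3, e.g. s3://<bucket>/<table>/2023/02/21/
        System.out.println(fmt.format(new Date(epochMillis)));  // prints 2023/02/21
    }
}
```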
**To Reproduce**
Steps to reproduce the behavior:
1.
2.
3.
4.
**Expected behavior**
The Spark application should delete the targeted records and then successfully
sync the metadata to the Hive metastore (Glue) in HMS mode, so that the AWS
Glue catalog reflects the latest state of the data and any subsequent queries
or analysis based on the Glue catalog use up-to-date information.
**Environment Description**
* Hudi version : 0.11.0
* Spark version : 3.2
* Hive version : 3.1.2
* Hadoop version : Amazon 3.2.1
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : no
**Additional context**
Add any other context about the problem here.
**Stacktrace**
org.apache.hudi.exception.HoodieException: Could not sync using the meta sync class org.apache.hudi.hive.HiveSyncTool
at org.apache.hudi.sync.common.util.SyncUtilHelpers.runHoodieMetaSync(SyncUtilHelpers.java:61)
at org.apache.hudi.HoodieSparkSqlWriter$.$anonfun$metaSync$2(HoodieSparkSqlWriter.scala:623)
at org.apache.hudi.HoodieSparkSqlWriter$.$anonfun$metaSync$2$adapted(HoodieSparkSqlWriter.scala:622)
at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
at org.apache.hudi.HoodieSparkSqlWriter$.metaSync(HoodieSparkSqlWriter.scala:622)
at org.apache.hudi.HoodieSparkSqlWriter$.commitAndPerformPostOperations(HoodieSparkSqlWriter.scala:681)
at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:315)
at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:171)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
at org.apache.spark.sql.execution.QueryExecution$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:115)
at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
at org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:110)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:135)
at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:135)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:253)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:134)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:68)
at org.apache.spark.sql.execution.QueryExecution$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:112)
at org.apache.spark.sql.execution.QueryExecution$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:108)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:519)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:83)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:519)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$super$transformDownWithPruning(LogicalPlan.scala:30)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:495)
at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:108)
at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:95)
at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:93)
at org.apache.spark.sql.execution.QueryExecution.assertCommandExecuted(QueryExecution.scala:136)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:848)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:382)
at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:355)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:239)
at com.comparethemarket.spark.io.Writer.write(Writer.java:33)
at com.comparethemarket.spark.io.Writer.write(Writer.java:45)
at com.comparethemarket.spark.job.SparkJob.run(SparkJob.java:60)
at com.comparethemarket.spark.job.SparkJobExecutor.execute(SparkJobExecutor.java:51)
at com.comparethemarket.spark.ModelAnonymizer.executeSparkJob(ModelAnonymizer.java:228)
at com.comparethemarket.spark.ModelAnonymizer.lambda$createAndExecuteSparkJob$0(ModelAnonymizer.java:168)
at com.comparethemarket.spark.ModelAnonymizer.lambda$createAndExecuteSparkJob$5(ModelAnonymizer.java:194)
at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
at java.util.Iterator.forEachRemaining(Iterator.java:116)
at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:485)
at com.comparethemarket.spark.ModelAnonymizer.createAndExecuteSparkJob(ModelAnonymizer.java:194)
at com.comparethemarket.spark.ModelAnonymizer.createAndExecuteSparkJobsForModels(ModelAnonymizer.java:144)
at com.comparethemarket.spark.ModelAnonymizer.main(ModelAnonymizer.java:94)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$anon$2.run(ApplicationMaster.scala:740)
Caused by: java.lang.IllegalArgumentException: Number of table partition keys must match number of partition values
at com.google.common.base.Preconditions.checkArgument(Preconditions.java:92)
at com.amazonaws.glue.catalog.metastore.GlueMetastoreClientDelegate.validateInputForBatchCreatePartitions(GlueMetastoreClientDelegate.java:832)
at com.amazonaws.glue.catalog.metastore.GlueMetastoreClientDelegate.batchCreatePartitions(GlueMetastoreClientDelegate.java:766)
at com.amazonaws.glue.catalog.metastore.GlueMetastoreClientDelegate.addPartitions(GlueMetastoreClientDelegate.java:748)
at com.amazonaws.glue.catalog.metastore.AWSCatalogMetastoreClient.add_partitions(AWSCatalogMetastoreClient.java:339)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient$SynchronizedHandler.invoke(HiveMetaStoreClient.java:2350)
at com.sun.proxy.$Proxy131.add_partitions(Unknown Source)
at org.apache.hudi.hive.ddl.HMSDDLExecutor.addPartitionsToTable(HMSDDLExecutor.java:199)
at org.apache.hudi.hive.HoodieHiveClient.addPartitionsToTable(HoodieHiveClient.java:98)
at org.apache.hudi.hive.HiveSyncTool.syncPartitions(HiveSyncTool.java:397)
... 71 more
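Our working theory, which we have not confirmed: because the output date format contains slashes and hive-style partitioning is disabled, each partition folder is a three-level path such as 2023/02/21, so during sync the partition path may be split into three values while the Glue table only declares one partition key, which would trip exactly this precondition. A rough sketch of the count mismatch as we picture it (illustrative names, not taken from the Glue client source):

```java
import java.util.Arrays;
import java.util.List;

public class PartitionMismatchSketch {
    public static void main(String[] args) {
        // One partition key declared on the Glue table.
        List<String> declaredPartitionKeys = Arrays.asList("<Partition field>");
        // Non-hive-style storage path produced by the key generator.
        String storagePartitionPath = "2023/02/21";
        // If the sync splits the path on '/', it ends up with three values.
        List<String> partitionValues = Arrays.asList(storagePartitionPath.split("/"));
        // Glue rejects the batchCreatePartitions call when these counts differ, which
        // matches "Number of table partition keys must match number of partition values".
        System.out.println(declaredPartitionKeys.size() + " key(s) vs " + partitionValues.size() + " value(s)");
    }
}
```

If that theory is right, the hoodie.datasource.hive_sync.partition_extractor_class setting (for example org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor, which collapses a yyyy/mm/dd path into a single value) would probably be relevant here, but we have not verified this.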