aswin-mp opened a new issue, #8009:
URL: https://github.com/apache/hudi/issues/8009
**Describe the problem you faced**
We have a Spark application that successfully reads an AWS Glue table and
deletes records that match certain search criteria. However, when the
application then tries to sync the metadata with the Hive sync mode set to
HMS, it fails with: "Could not sync using the meta sync class
org.apache.hudi.hive.HiveSyncTool." The error occurs in the last stage of
processing, after the records have been deleted from the table.
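For context, the delete is issued through a plain DataFrame write. The snippet below is a simplified sketch of that step, not our actual code; the base path, filter column, and values are placeholders, and the Hudi options mirror the app properties listed below.

```java
import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class DeleteAndSync {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("delete-and-sync").getOrCreate();
        String basePath = "s3://<bucket>/<table-base-path>";  // placeholder

        // Read the Hudi table registered in Glue and select the records to delete.
        Dataset<Row> toDelete = spark.read().format("hudi").load(basePath)
                .filter(col("<search column>").equalTo("<search value>"));  // placeholder criteria

        // Issue the delete; meta sync to HMS (Glue) runs at the end of this write.
        toDelete.write().format("hudi")
                .option("hoodie.table.name", "<Table Name>")
                .option("hoodie.datasource.write.operation", "delete")
                .option("hoodie.datasource.write.recordkey.field", "<Key>")
                .option("hoodie.datasource.write.partitionpath.field", "<Partition field>")
                .option("hoodie.datasource.write.keygenerator.class",
                        "org.apache.hudi.keygen.TimestampBasedKeyGenerator")
                .option("hoodie.datasource.meta.sync.enable", "true")
                .option("hoodie.datasource.hive_sync.enable", "true")
                .option("hoodie.datasource.hive_sync.mode", "hms")
                .option("hoodie.datasource.hive_sync.database", "<DB name>")
                .option("hoodie.datasource.hive_sync.table", "<Table Name>")
                .option("hoodie.datasource.hive_sync.partition_fields", "<Partition Fields>")
                .mode(SaveMode.Append)
                .save(basePath);
    }
}
```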
App properties:
hoodie.datasource.write.recordkey.field=<Key>
hoodie.datasource.write.partitionpath.field=<Partition field>
hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.TimestampBasedKeyGenerator
hoodie.deltastreamer.keygen.timebased.timestamp.type=SCALAR
hoodie.deltastreamer.keygen.timebased.timestamp.scalar.time.unit=MICROSECONDS
hoodie.deltastreamer.keygen.timebased.output.dateformat=yyyy/MM/dd
hoodie.deltastreamer.keygen.timebased.timezone=GMT
hoodie.deltastreamer.source.kafka.topic=<topicName>
hoodie.datasource.hive_sync.database=<DB name>
hoodie.datasource.hive_sync.table=<Table Name>
hoodie.datasource.write.hive_style_partitioning=false
hoodie.datasource.hive_sync.partition_fields=<Partition Fields>
hoodie.datasource.meta.sync.enable=true
hoodie.datasource.hive_sync.mode=hms
hoodie.datasource.hive_sync.enable=true
Table properties:
hoodie.datasource.write.recordkey.field=<Key>
hoodie.datasource.write.partitionpath.field=<Partition field>
hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.TimestampBasedKeyGenerator
hoodie.deltastreamer.keygen.timebased.timestamp.type=EPOCHMILLISECONDS
hoodie.deltastreamer.keygen.timebased.output.dateformat=yyyy/MM/dd
hoodie.deltastreamer.source.kafka.topic=<topicName>
hoodie.datasource.hive_sync.database=<DB name>
hoodie.datasource.hive_sync.table=<Table Name>
hoodie.datasource.write.hive_style_partitioning=false
hoodie.datasource.hive_sync.partition_fields=<Partition Fields>
The --enable-sync option is set in Deltastreamer.
We are facing this issue on tables that are written by Deltastreamer.
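To make the partition layout concrete: as we understand the TimestampBasedKeyGenerator settings above, the partition field (an epoch value) is formatted with yyyy/MM/dd in GMT, and with hive-style partitioning disabled that yields a three-level folder path on S3. A minimal illustration (the epoch value here is an example we made up, not taken from our data):

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class PartitionPathSketch {
    public static void main(String[] args) {
        long epochMillis = 1677000000000L;  // example value of <Partition field>
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy/MM/dd");
        fmt.setTimeZone(TimeZone.getTimeZone("GMT"));
        // With hoodie.datasource.write.hive_style_partitioning=false this becomes
        // three nested folders on S3, e.g. s3://<bucket>/<table>/2023/02/21/
        System.out.println(fmt.format(new Date(epochMillis)));  // prints 2023/02/21
    }
}
```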
**To Reproduce**
Steps to reproduce the behavior:
1.
2.
3.
4.
**Expected behavior**
The Spark application should delete the targeted records and then successfully
sync the metadata to the Hive metastore (Glue) in HMS mode, so that the AWS
Glue catalog reflects the latest state of the data and any subsequent queries
or analysis based on the Glue catalog use up-to-date information.
**Environment Description**
* Hudi version : 0.11.0
* Spark version : 3.2
* Hive version : 3.1.2
* Hadoop version : Amazon 3.2.1
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : no
**Additional context**
Add any other context about the problem here.
**Stacktrace**
org.apache.hudi.exception.HoodieException: Could not sync using the meta sync class org.apache.hudi.hive.HiveSyncTool
at org.apache.hudi.sync.common.util.SyncUtilHelpers.runHoodieMetaSync(SyncUtilHelpers.java:61)
at org.apache.hudi.HoodieSparkSqlWriter$.$anonfun$metaSync$2(HoodieSparkSqlWriter.scala:623)
at org.apache.hudi.HoodieSparkSqlWriter$.$anonfun$metaSync$2$adapted(HoodieSparkSqlWriter.scala:622)
at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
at org.apache.hudi.HoodieSparkSqlWriter$.metaSync(HoodieSparkSqlWriter.scala:622)
at org.apache.hudi.HoodieSparkSqlWriter$.commitAndPerformPostOperations(HoodieSparkSqlWriter.scala:681)
at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:315)
at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:171)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
at org.apache.spark.sql.execution.QueryExecution$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:115)
at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
at org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:110)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:135)
at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:135)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:253)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:134)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:68)
at org.apache.spark.sql.execution.QueryExecution$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:112)
at org.apache.spark.sql.execution.QueryExecution$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:108)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:519)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:83)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:519)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$super$transformDownWithPruning(LogicalPlan.scala:30)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:495)
at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:108)
at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:95)
at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:93)
at org.apache.spark.sql.execution.QueryExecution.assertCommandExecuted(QueryExecution.scala:136)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:848)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:382)
at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:355)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:239)
at com.comparethemarket.spark.io.Writer.write(Writer.java:33)
at com.comparethemarket.spark.io.Writer.write(Writer.java:45)
at com.comparethemarket.spark.job.SparkJob.run(SparkJob.java:60)
at com.comparethemarket.spark.job.SparkJobExecutor.execute(SparkJobExecutor.java:51)
at com.comparethemarket.spark.ModelAnonymizer.executeSparkJob(ModelAnonymizer.java:228)
at com.comparethemarket.spark.ModelAnonymizer.lambda$createAndExecuteSparkJob$0(ModelAnonymizer.java:168)
at com.comparethemarket.spark.ModelAnonymizer.lambda$createAndExecuteSparkJob$5(ModelAnonymizer.java:194)
at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
at java.util.Iterator.forEachRemaining(Iterator.java:116)
at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:485)
at com.comparethemarket.spark.ModelAnonymizer.createAndExecuteSparkJob(ModelAnonymizer.java:194)
at com.comparethemarket.spark.ModelAnonymizer.createAndExecuteSparkJobsForModels(ModelAnonymizer.java:144)
at com.comparethemarket.spark.ModelAnonymizer.main(ModelAnonymizer.java:94)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$anon$2.run(ApplicationMaster.scala:740)
Caused by: java.lang.IllegalArgumentException: Number of table partition keys must match number of partition values
at com.google.common.base.Preconditions.checkArgument(Preconditions.java:92)
at com.amazonaws.glue.catalog.metastore.GlueMetastoreClientDelegate.validateInputForBatchCreatePartitions(GlueMetastoreClientDelegate.java:832)
at com.amazonaws.glue.catalog.metastore.GlueMetastoreClientDelegate.batchCreatePartitions(GlueMetastoreClientDelegate.java:766)
at com.amazonaws.glue.catalog.metastore.GlueMetastoreClientDelegate.addPartitions(GlueMetastoreClientDelegate.java:748)
at com.amazonaws.glue.catalog.metastore.AWSCatalogMetastoreClient.add_partitions(AWSCatalogMetastoreClient.java:339)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient$SynchronizedHandler.invoke(HiveMetaStoreClient.java:2350)
at com.sun.proxy.$Proxy131.add_partitions(Unknown Source)
at org.apache.hudi.hive.ddl.HMSDDLExecutor.addPartitionsToTable(HMSDDLExecutor.java:199)
at org.apache.hudi.hive.HoodieHiveClient.addPartitionsToTable(HoodieHiveClient.java:98)
at org.apache.hudi.hive.HiveSyncTool.syncPartitions(HiveSyncTool.java:397)
... 71 more
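Our working theory, which we have not confirmed: because the output date format contains slashes and hive-style partitioning is disabled, each partition folder is a three-level path such as 2023/02/21, so during sync the partition path may be split into three values while the Glue table only declares one partition key, which would trip exactly this precondition. A rough sketch of the count mismatch as we picture it (illustrative names, not taken from the Glue client source):

```java
import java.util.Arrays;
import java.util.List;

public class PartitionMismatchSketch {
    public static void main(String[] args) {
        // One partition key declared on the Glue table.
        List<String> declaredPartitionKeys = Arrays.asList("<Partition field>");
        // Non-hive-style storage path produced by the key generator.
        String storagePartitionPath = "2023/02/21";
        // If the sync splits the path on '/', it ends up with three values.
        List<String> partitionValues = Arrays.asList(storagePartitionPath.split("/"));
        // Glue rejects the batchCreatePartitions call when these counts differ, which
        // matches "Number of table partition keys must match number of partition values".
        System.out.println(declaredPartitionKeys.size() + " key(s) vs " + partitionValues.size() + " value(s)");
    }
}
```

If that theory is right, the hoodie.datasource.hive_sync.partition_extractor_class setting (for example org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor, which collapses a yyyy/mm/dd path into a single value) would probably be relevant here, but we have not verified this.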