noahtaite opened a new issue, #9805:
URL: https://github.com/apache/hudi/issues/9805

   
   **Describe the problem you faced**
   
I'm running Hudi 0.13.1 on AWS EMR 6.12.
   
   We recently ran a `delete_partition` operation to clean up specific partitions after bad data was ingested, then re-ingested the correct data.
   
   Now, our sync to the Hive metastore using **AwsGlueCatalogSyncTool** is failing with the following error:
   ```
   partitionsToDelete' failed to satisfy constraint: Member must have length 
less than or equal to 25 (Service: AWSGlue; Status Code: 400; Error Code:
   ```
   
1 - How can we work around this validation constraint when legitimately deleting 25+ partitions in Glue?
   
   2 - These partitions should not be deleted from Glue at all. They were re-created by the subsequent good ingestion, and my users rely on Glue as their metastore.
   
**When I run a Glue sync manually using ./hudi-sync-tool, those partitions are actually removed. It appears the `delete_partition` replacecommit overrides the later deltacommit that re-ingested those partitions.**
   
   This appears to be a bug unless I am missing something with how 
`delete_partition` is expected to behave.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
1. Generate a Hudi table with multiple partitions using bulk_insert, e.g. `datasource=1/year=2000/month=1`.
   2. Run a delete_partition operation to delete all partitions matching `datasource=1/*`.
   3. Re-generate new partitions for datasource=1 with correct data.
   4. Hive sync fails when trying to delete 25+ partitions.
   5. A manual hive sync leaves the Glue table with only 1 new partition (`datasource=1/year=2023/month=10`).
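   For context, step 2 uses the Hudi datasource `delete_partition` write operation. A minimal sketch of the write options (the table name and partition spec are illustrative placeholders, not our actual job):

```python
# Minimal sketch of the delete_partition write from step 2.
# Table name and partition spec below are illustrative placeholders.
hudi_options = {
    "hoodie.table.name": "table_all",  # placeholder name
    "hoodie.datasource.write.operation": "delete_partition",
    # Wildcards are resolved against the partitions present on storage.
    "hoodie.datasource.write.partitions.to.delete": "datasource=1/*",
}

# In the actual Spark job these options are passed to the Hudi writer, e.g.:
# df.write.format("hudi").options(**hudi_options).mode("append").save(base_path)
```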
   
   **Expected behavior**
   
   I expect the following:
1 - The AWS Glue sync should not fail when the request contains more than 25 partitions in total; it should batch the delete calls properly.
   2 - My Glue table should not be deleting these partitions at all. The final state **should** contain all the partitions for datasource=1, but that is not being respected (the delete_partition replacecommit seems to take precedence)!
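   To illustrate expectation 1: the sync client could split the partition list into batches of at most 25 before calling Glue, since BatchDeletePartition rejects larger requests. A minimal sketch (the boto3 call in the comment is illustrative, not Hudi's actual code path):

```python
def chunked(items, size=25):
    """Split a list into batches of at most `size` entries; Glue's
    BatchDeletePartition API rejects requests with more than 25."""
    return [items[i:i + size] for i in range(0, len(items), size)]

# Hypothetical usage with a boto3 Glue client (names are placeholders):
# for batch in chunked(partitions_to_delete):
#     glue.batch_delete_partition(
#         DatabaseName="dms_hudi_db",
#         TableName="table_all",
#         PartitionsToDelete=[{"Values": values} for values in batch],
#     )
```

   With batching like this, the oversized delete in the stacktrace below would go out as several 25-entry requests instead of a single call that trips the validation.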
   
   **Environment Description**
   
   * Hudi version : 0.13.1-amzn-0
   
   * Spark version : 3.4.0
   
   * Hive version : 3.1.3
   
   * Hadoop version : 3.3.3
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : No
   
   
   **Additional context**
   
   Running on AWS EMR 6.12.
   
Many of my consumers use Glue as a data catalog. Will the missing partitions degrade performance or prevent new data from being accessed via Glue directly?
    
   **Stacktrace**
   
   ```
   23/09/28 16:32:47 ERROR Client: Application diagnostics message: User class 
threw exception: org.apache.hudi.exception.HoodieException: Could not sync 
using the meta sync class org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool
        at 
org.apache.hudi.sync.common.util.SyncUtilHelpers.runHoodieMetaSync(SyncUtilHelpers.java:61)
        at 
org.apache.hudi.HoodieSparkSqlWriter$.$anonfun$metaSync$2(HoodieSparkSqlWriter.scala:888)
        at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
        at 
org.apache.hudi.HoodieSparkSqlWriter$.metaSync(HoodieSparkSqlWriter.scala:886)
        at 
org.apache.hudi.HoodieSparkSqlWriter$.commitAndPerformPostOperations(HoodieSparkSqlWriter.scala:984)
        at 
org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:381)
        at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:150)
        at 
org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:47)
        at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
        at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
        at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
        at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:104)
        at 
org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
        at 
org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:250)
        at 
org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:123)
        at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$9(SQLExecution.scala:160)
        at 
org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
        at 
org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:250)
        at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$8(SQLExecution.scala:160)
        at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:271)
        at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:159)
        at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:827)
        at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:69)
        at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:101)
        at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:97)
        at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:554)
        at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:107)
        at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:554)
        at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:32)
        at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
        at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
        at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:32)
        at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:32)
        at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:530)
        at 
org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:97)
        at 
org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:84)
        at 
org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:82)
        at 
org.apache.spark.sql.execution.QueryExecution.assertCommandExecuted(QueryExecution.scala:142)
        at 
org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:856)
        at 
org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:387)
        at 
org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:360)
        at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:239)
        at 
com.example.spark.datalake.hudi.HudiDatalake.persist(HudiDatalake.java:62)
        at 
com.example.spark.datalake.hudi.HudiDatalake.save(HudiDatalake.java:39)
        at 
com.example.spark.datalake.FilteredDatalake.save(FilteredDatalake.java:24)
        at 
com.example.spark.tier2.datalake.HudiDatalakeUpdater.saveToHudi(HudiDatalakeUpdater.java:86)
        at 
com.example.spark.tier2.datalake.HudiDatalakeUpdater.upsert(HudiDatalakeUpdater.java:61)
        at 
com.example.spark.tier2.extractor.BaseExtractor.extract(BaseExtractor.java:58)
        at java.util.ArrayList.forEach(ArrayList.java:1259)
        at 
com.example.spark.tier2.extractor.KViewsExtractor.extract(KViewsExtractor.java:35)
        at 
com.example.spark.tier2.DmsTierTwoExtractorRunner.run(DmsTierTwoExtractorRunner.java:239)
        at 
com.example.spark.tier2.DmsTierTwoExtractorRunner.main(DmsTierTwoExtractorRunner.java:138)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at 
org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:760)
   Caused by: org.apache.hudi.exception.HoodieException: Got runtime exception 
when hive syncing table_all
        at 
org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:165)
        at 
org.apache.hudi.sync.common.util.SyncUtilHelpers.runHoodieMetaSync(SyncUtilHelpers.java:59)
        ... 56 more
   Caused by: org.apache.hudi.hive.HoodieHiveSyncException: Failed to sync 
partitions for table table_all
        at 
org.apache.hudi.hive.HiveSyncTool.syncPartitions(HiveSyncTool.java:429)
        at 
org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:280)
        at org.apache.hudi.hive.HiveSyncTool.doSync(HiveSyncTool.java:188)
        at 
org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:162)
        ... 57 more
   Caused by: org.apache.hudi.aws.sync.HoodieGlueSyncException: Fail to drop 
partitions to dms_hudi_db.table_all
        at 
org.apache.hudi.aws.sync.AWSGlueCatalogSyncClient.dropPartitions(AWSGlueCatalogSyncClient.java:222)
        at 
org.apache.hudi.hive.HiveSyncTool.syncPartitions(HiveSyncTool.java:457)
        at 
org.apache.hudi.hive.HiveSyncTool.syncPartitions(HiveSyncTool.java:424)
        ... 60 more
   Caused by: 
org.apache.hudi.com.amazonaws.services.glue.model.ValidationException: 1 
validation error detected: Value 
'[PartitionValueList(values=[[email protected], 2021, 11]), 
PartitionValueList(values=[[email protected], 2021, 12]), 
PartitionValueList(values=[[email protected], 2021, 10]), 
PartitionValueList(values=[[email protected], 2021, 9]), 
PartitionValueList(values=[[email protected], 2021, 8]), 
PartitionValueList(values=[[email protected], 2021, 7]), 
PartitionValueList(values=[[email protected], 2021, 6]), 
PartitionValueList(values=[[email protected], 2019, 2]), 
PartitionValueList(values=[[email protected], 2021, 5]), 
PartitionValueList(values=[[email protected], 2021, 4]), 
PartitionValueList(values=[[email protected], 2019, 1]), 
PartitionValueList(values=[[email protected], 2018, 10]), 
PartitionValueList(values=[[email protected], 2019, 4]), 
PartitionValueList(values=[[email protected], 2019, 3]), Part
 itionValueList(values=[[email protected], 2019, 6]), 
PartitionValueList(values=[[email protected], 2019, 5]), 
PartitionValueList(values=[[email protected], 2019, 8]), 
PartitionValueList(values=[[email protected], 2019, 7]), 
PartitionValueList(values=[[email protected], 2018, 12]), 
PartitionValueList(values=[[email protected], 2018, 11]), 
PartitionValueList(values=[[email protected], 2019, 9]), 
PartitionValueList(values=[[email protected], 2017, 12]), 
PartitionValueList(values=[[email protected], 2017, 10]), 
PartitionValueList(values=[[email protected], 2017, 11]), 
PartitionValueList(values=[[email protected], 2021, 3]), 
PartitionValueList(values=[[email protected], 2021, 2]), 
PartitionValueList(values=[[email protected], 2021, 1]), 
PartitionValueList(values=[[email protected], 2016, 2]), 
PartitionValueList(values=[[email protected], 2016, 3]), 
PartitionValueList(values=[[email protected], 2016, 4]), Part
 itionValueList(values=[[email protected], 2016, 5]), 
PartitionValueList(values=[[email protected], 2016, 1]), 
PartitionValueList(values=[[email protected], 2016, 6]), 
PartitionValueList(values=[[email protected], 2016, 7]), 
PartitionValueList(values=[[email protected], 2016, 8]), 
PartitionValueList(values=[[email protected], 2016, 9]), 
PartitionValueList(values=[[email protected], 2022, 6]), 
PartitionValueList(values=[[email protected], 2022, 5]), 
PartitionValueList(values=[[email protected], 2022, 4]), 
PartitionValueList(values=[[email protected], 2022, 3]), 
PartitionValueList(values=[[email protected], 2022, 9]), 
PartitionValueList(values=[[email protected], 2022, 8]), 
PartitionValueList(values=[[email protected], 2022, 7]), 
PartitionValueList(values=[[email protected], 2022, 2]), 
PartitionValueList(values=[[email protected], 2022, 1]), 
PartitionValueList(values=[[email protected], 2001, 1]), Partition
 ValueList(values=[[email protected], 2017, 5]), 
PartitionValueList(values=[[email protected], 2017, 6]), 
PartitionValueList(values=[[email protected], 2017, 7]), 
PartitionValueList(values=[[email protected], 2017, 8]), 
PartitionValueList(values=[[email protected], 2017, 9]), 
PartitionValueList(values=[[email protected], 2017, 1]), 
PartitionValueList(values=[[email protected], 2017, 2]), 
PartitionValueList(values=[[email protected], 2017, 3]), 
PartitionValueList(values=[[email protected], 2017, 4]), 
PartitionValueList(values=[[email protected], 2014, 9]), 
PartitionValueList(values=[[email protected], 2018, 9]), 
PartitionValueList(values=[[email protected], 2018, 8]), 
PartitionValueList(values=[[email protected], 2018, 5]), 
PartitionValueList(values=[[email protected], 2018, 4]), 
PartitionValueList(values=[[email protected], 2018, 7]), 
PartitionValueList(values=[[email protected], 2018, 6]), PartitionValue
 List(values=[[email protected], 2018, 1]), 
PartitionValueList(values=[[email protected], 2018, 3]), 
PartitionValueList(values=[[email protected], 2018, 2]), 
PartitionValueList(values=[[email protected], 2016, 10]), 
PartitionValueList(values=[[email protected], 2016, 12]), 
PartitionValueList(values=[[email protected], 2016, 11]), 
PartitionValueList(values=[[email protected], 2020, 8]), 
PartitionValueList(values=[[email protected], 2020, 7]), 
PartitionValueList(values=[[email protected], 2020, 6]), 
PartitionValueList(values=[[email protected], 2022, 10]), 
PartitionValueList(values=[[email protected], 2022, 11]), 
PartitionValueList(values=[[email protected], 2020, 5]), 
PartitionValueList(values=[[email protected], 2022, 12]), 
PartitionValueList(values=[[email protected], __HIVE_DEFAULT_PARTITION__, 
__HIVE_DEFAULT_PARTITION__]), 
PartitionValueList(values=[[email protected], 2020, 9]), 
PartitionValueList(values=[e
 [email protected], 2023, 1]), 
PartitionValueList(values=[[email protected], 2023, 2]), 
PartitionValueList(values=[[email protected], 2023, 3]), 
PartitionValueList(values=[[email protected], 2023, 4]), 
PartitionValueList(values=[[email protected], 2023, 5]), 
PartitionValueList(values=[[email protected], 2023, 6]), 
PartitionValueList(values=[[email protected], 2023, 7]), 
PartitionValueList(values=[[email protected], 2023, 8]), 
PartitionValueList(values=[[email protected], 2023, 9]), 
PartitionValueList(values=[[email protected], 2020, 4]), 
PartitionValueList(values=[[email protected], 2020, 3]), 
PartitionValueList(values=[[email protected], 2020, 2]), 
PartitionValueList(values=[[email protected], 2020, 1]), 
PartitionValueList(values=[[email protected], 2019, 12]), 
PartitionValueList(values=[[email protected], 2019, 11]), 
PartitionValueList(values=[[email protected], 2019, 10]), 
PartitionValueList(values=[exa
 [email protected], 2020, 11]), 
PartitionValueList(values=[[email protected], 2020, 10]), 
PartitionValueList(values=[[email protected], 2015, 4]), 
PartitionValueList(values=[[email protected], 2020, 12]), 
PartitionValueList(values=[[email protected], 2015, 1]), 
PartitionValueList(values=[[email protected], 2027, 10]), 
PartitionValueList(values=[[email protected], 2012, 12])]' at 
'partitionsToDelete' failed to satisfy constraint: Member must have length less 
than or equal to 25 (Service: AWSGlue; Status Code: 400; Error Code: 
ValidationException; Request ID: xxx; Proxy: null)
        at 
org.apache.hudi.com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1879)
        at 
org.apache.hudi.com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleServiceErrorResponse(AmazonHttpClient.java:1418)
        at 
org.apache.hudi.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1387)
        at 
org.apache.hudi.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1157)
        at 
org.apache.hudi.com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:814)
        at 
org.apache.hudi.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:781)
        at 
org.apache.hudi.com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:755)
        at 
org.apache.hudi.com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:715)
        at 
org.apache.hudi.com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:697)
        at 
org.apache.hudi.com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:561)
        at 
org.apache.hudi.com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:541)
        at 
org.apache.hudi.com.amazonaws.services.glue.AWSGlueClient.doInvoke(AWSGlueClient.java:13784)
        at 
org.apache.hudi.com.amazonaws.services.glue.AWSGlueClient.invoke(AWSGlueClient.java:13751)
        at 
org.apache.hudi.com.amazonaws.services.glue.AWSGlueClient.invoke(AWSGlueClient.java:13740)
        at 
org.apache.hudi.com.amazonaws.services.glue.AWSGlueClient.executeBatchDeletePartition(AWSGlueClient.java:406)
        at 
org.apache.hudi.com.amazonaws.services.glue.AWSGlueClient.batchDeletePartition(AWSGlueClient.java:375)
        at 
org.apache.hudi.aws.sync.AWSGlueCatalogSyncClient.dropPartitions(AWSGlueCatalogSyncClient.java:214)
        ... 62 more
   
   Exception in thread "main" org.apache.spark.SparkException: Application 
application_1695917956184_0001 finished with failed status
        at org.apache.spark.deploy.yarn.Client.run(Client.scala:1337)
        at 
org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1770)
        at 
org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:1066)
        at 
org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:192)
        at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:215)
        at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:91)
        at 
org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1158)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1167)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
   ```
   
   

