adityaverma1997 opened a new issue, #9257:
URL: https://github.com/apache/hudi/issues/9257

   **Describe the problem you faced**
   
   Hi team, we are facing an issue where a failed clean operation has still deleted 
data files.
   
   We are writing to a COW table with the insert operation. After running 
successfully for a day or so, the job started failing on the clustering operation 
(stacktrace given in a separate section below). While debugging, we found that the 
actual issue was a failed clean operation that ran 4-5 commits before the 
clustering.
   
   The clean operation was not able to complete and failed, but there are two 
issues here:
   1. The clean failed, yet the data files it was supposed to clean were removed.
   2. Although the clean failed, the Spark job still succeeded.
   
   Because the files were cleaned up (which was not expected) and those same files 
were later used by clustering for stitching, the job runs into an error state.
   
   After the failed clean there is also a successful rollback and another failed 
clean operation.
   Attaching screenshots from the Hoodie timeline, where S represents a successful 
operation and F represents a failed operation:
   ![image 
(16)](https://github.com/apache/hudi/assets/34742014/fbabee22-ea9f-488c-b0d7-180d8b0eca89)
   
   Cleaner operation screenshot showing failed cleaning:
   ![image 
(17)](https://github.com/apache/hudi/assets/34742014/4cc9e7f0-45d2-4ae7-a238-38c361cc76dd)
   
   Screenshot for successful rollback:
   ![image 
(18)](https://github.com/apache/hudi/assets/34742014/b4766899-518f-4841-84b2-48619a1dd388)
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Create a new COW Hudi table with the given set of configurations (refer to 
the additional context section).
   2. Run it at an hourly frequency and monitor it for a couple of days.
   3. The cleaner will log a warning and ultimately fail, but the Spark job will 
still succeed, with the following error:
   ```
   Failed to perform previous clean operation, instant: 
[==>20230716130816302__clean__INFLIGHT]
   ```
   The subsequent clustering operation will also fail after that, because it tries 
to stitch a parquet file that was already deleted by the failed clean operation, 
leading to inconsistency.
   Please refer to the stacktrace section below for both the cleaner and 
clustering stacktraces.
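   To confirm the state we observed, one can list the timeline files under `.hoodie/` and look for clean instants that never reached the completed state. A minimal sketch in plain Python, run over a local copy of the timeline directory (it assumes Hudi's standard instant-file naming, `<timestamp>.clean[.requested|.inflight]`; the function name is ours):

```python
from pathlib import Path
from collections import defaultdict

def incomplete_cleans(hoodie_dir: str):
    """Return clean instants that have a requested/inflight file
    but no completed (.clean) file in the timeline directory."""
    states = defaultdict(set)
    for f in Path(hoodie_dir).iterdir():
        parts = f.name.split(".")
        if len(parts) >= 2 and parts[1] == "clean":
            instant = parts[0]
            # "<ts>.clean" (no third part) is the completed instant file
            state = parts[2] if len(parts) > 2 else "completed"
            states[instant].add(state)
    return sorted(i for i, s in states.items() if "completed" not in s)
```

   In our case this kind of listing (after syncing `.hoodie/` down from S3) shows the `20230716130816302` clean stuck with only requested/inflight files.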
   
   **Expected behavior**
   
   As per the logic, we expected that the cleaner would not delete any data files 
when the clean operation fails; deleting them anyway is what led to the subsequent 
clustering failure.
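   For context on why this failure mode is dangerous: the clean flow deletes data files first and only then transitions the clean instant to completed on the timeline. The following is a deliberately simplified model (not Hudi's actual code) of that ordering, showing how a failure in the final transition leaves files already gone while the instant stays inflight:

```python
class TimelineError(Exception):
    pass

def run_clean(files_to_delete, storage, timeline, fail_on_commit=False):
    """Simplified model: physical deletes happen before the timeline
    transition, so a late failure leaves the deletions un-undone."""
    timeline["state"] = "inflight"
    for f in files_to_delete:
        storage.discard(f)          # physical delete, not transactional
    if fail_on_commit:              # e.g. the IllegalArgumentException in transitionState
        raise TimelineError("failed to transition clean to completed")
    timeline["state"] = "completed"

storage = {"a.parquet", "b.parquet"}
timeline = {"state": "requested"}
try:
    run_clean({"a.parquet"}, storage, timeline, fail_on_commit=True)
except TimelineError:
    pass
# a.parquet is gone even though the clean never completed,
# so a later clustering plan referencing it hits "Path does not exist".
```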
   
   **Environment Description**
   
   * Hudi version : 0.13.0 / 0.13.1
   
   * Spark version : 3.3.2
   
   * EMR version: 6.11.0
   
   * Hive version : N/A
   
   * Hadoop version : N/A
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : no
   
   
   **Additional context**
   
   Hudi configurations to be used:
   ```
   {
     "hoodie.database.name": <db_name>,
     "hoodie.table.name": <table_name>,
     "hoodie.datasource.write.recordkey.field": "id",
     "hoodie.datasource.write.precombine.field": "created_dt",
     "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
     "hoodie.datasource.hive_sync.enable": "true",
     "hoodie.datasource.hive_sync.database": <db_name>,
     "hoodie.datasource.hive_sync.table": <table_name>,
     "hoodie.datasource.hive_sync.support_timestamp": "true",
     "hoodie.datasource.hive_sync.use_jdbc": "false",
     "hoodie.datasource.hive_sync.mode": "hms",
     "hoodie.datasource.meta_sync.condition.sync": "true",
     "hoodie.schema.on.read.enable": "true",
     "hoodie.datasource.write.reconcile.schema": "true",
     "hoodie.copyonwrite.record.size.estimate": 100,
     "hoodie.merge.allow.duplicate.on.inserts": "false",
     "hoodie.metadata.enable": "true",
     "hoodie.metadata.index.column.stats.enable": "true",
     "hoodie.enable.data.skipping": "true",
     "hoodie.datasource.write.operation": "insert",
     "hoodie.insert.shuffle.parallelism": 0,
     "hoodie.datasource.write.partitionpath.field": "created_dt_yyyy_mm_dd_ist",
     "hoodie.datasource.hive_sync.partition_fields": 
"created_dt_yyyy_mm_dd_ist",
     "hoodie.datasource.write.hive_style_partitioning": "true",
     "hoodie.clean.async": "true",
     "hoodie.clean.max.commits": 10,
     "hoodie.cleaner.policy": "KEEP_LATEST_COMMITS",
     "hoodie.cleaner.commits.retained": 2,
     "hoodie.clustering.inline": "true",
     "hoodie.clustering.inline.max.commits": 4,
     "hoodie.clustering.plan.strategy.sort.columns": "battery_event_type_id",
     "hoodie.clustering.plan.partition.filter.mode": "RECENT_DAYS",
     "hoodie.clustering.execution.strategy.class": 
"org.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy",
     "hoodie.clustering.plan.strategy.target.file.max.bytes": 134217728,
     "hoodie.clustering.plan.strategy.small.file.limit": 104857600,
     "hoodie.parquet.small.file.limit": 0,
     "hoodie.parquet.max.file.size": 134217728,
     "hoodie.archive.async": "true",
     "hoodie.keep.min.commits": 10,
     "hoodie.keep.max.commits": 15,
     "hoodie.commits.archival.batch": 15,
     "hoodie.archive.merge.enable": "true",
     "hoodie.archive.merge.files.batch.size": 15,
     "hoodie.archive.merge.small.file.limit.bytes": 20971520
   }
   ```
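   A side note on the retention settings above, not a diagnosis, just the numbers as configured: the cleaner retains only the last 2 commits while inline clustering fires every 4 commits and the clean runs async, which narrows the window in which clustering can still read the files it planned over. A small check over the relevant subset of the config:

```python
import json

# Relevant subset of the configuration above (values as given).
config = json.loads("""
{
  "hoodie.cleaner.policy": "KEEP_LATEST_COMMITS",
  "hoodie.cleaner.commits.retained": 2,
  "hoodie.clean.async": "true",
  "hoodie.clustering.inline": "true",
  "hoodie.clustering.inline.max.commits": 4
}
""")
retained = config["hoodie.cleaner.commits.retained"]
clustering_every = config["hoodie.clustering.inline.max.commits"]
# Fewer commits retained than the clustering trigger interval.
print(retained < clustering_every)
```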
   
   **Stacktrace**
   
   Cleaning error stacktrace:
   ```
   [2023-07-16, 14:07:21 UTC] {subprocess.py:89} INFO - 23/07/16 14:07:21 WARN 
CleanActionExecutor: Failed to perform previous clean operation, instant: 
[==>20230716130816302__clean__INFLIGHT]
   [2023-07-16, 14:07:21 UTC] {subprocess.py:89} INFO - 
java.lang.IllegalArgumentException: null
   [2023-07-16, 14:07:21 UTC] {subprocess.py:89} INFO -         at 
org.apache.hudi.common.util.ValidationUtils.checkArgument(ValidationUtils.java:31)
 ~[hudi-aws-bundle-0.13.0-amzn-0.jar:0.13.0-amzn-0]
   [2023-07-16, 14:07:21 UTC] {subprocess.py:89} INFO -         at 
org.apache.hudi.common.table.timeline.HoodieActiveTimeline.transitionState(HoodieActiveTimeline.java:633)
 ~[hudi-aws-bundle-0.13.0-amzn-0.jar:0.13.0-amzn-0]
   [2023-07-16, 14:07:21 UTC] {subprocess.py:89} INFO -         at 
org.apache.hudi.common.table.timeline.HoodieActiveTimeline.transitionState(HoodieActiveTimeline.java:614)
 ~[hudi-aws-bundle-0.13.0-amzn-0.jar:0.13.0-amzn-0]
   [2023-07-16, 14:07:21 UTC] {subprocess.py:89} INFO -         at 
org.apache.hudi.common.table.timeline.HoodieActiveTimeline.saveAsComplete(HoodieActiveTimeline.java:223)
 ~[hudi-aws-bundle-0.13.0-amzn-0.jar:0.13.0-amzn-0]
   [2023-07-16, 14:07:21 UTC] {subprocess.py:89} INFO -         at 
org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.commit(BaseSparkCommitActionExecutor.java:305)
 ~[hudi-spark3-bundle_2.12-0.13.0-amzn-0.jar:0.13.0-amzn-0]
   [2023-07-16, 14:07:21 UTC] {subprocess.py:89} INFO -         at 
org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.commit(BaseSparkCommitActionExecutor.java:291)
 ~[hudi-spark3-bundle_2.12-0.13.0-amzn-0.jar:0.13.0-amzn-0]
   [2023-07-16, 14:07:21 UTC] {subprocess.py:89} INFO -         at 
org.apache.hudi.table.action.commit.BaseCommitActionExecutor.autoCommit(BaseCommitActionExecutor.java:190)
 ~[hudi-spark3-bundle_2.12-0.13.0-amzn-0.jar:0.13.0-amzn-0]
   [2023-07-16, 14:07:21 UTC] {subprocess.py:89} INFO -         at 
org.apache.hudi.table.action.commit.BaseCommitActionExecutor.commitOnAutoCommit(BaseCommitActionExecutor.java:174)
 ~[hudi-spark3-bundle_2.12-0.13.0-amzn-0.jar:0.13.0-amzn-0]
   [2023-07-16, 14:07:21 UTC] {subprocess.py:89} INFO -         at 
org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.updateIndexAndCommitIfNeeded(BaseSparkCommitActionExecutor.java:273)
 ~[hudi-spark3-bundle_2.12-0.13.0-amzn-0.jar:0.13.0-amzn-0]
   [2023-07-16, 14:07:21 UTC] {subprocess.py:89} INFO -         at 
org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.execute(BaseSparkCommitActionExecutor.java:178)
 ~[hudi-spark3-bundle_2.12-0.13.0-amzn-0.jar:0.13.0-amzn-0]
   [2023-07-16, 14:07:21 UTC] {subprocess.py:89} INFO -         at 
org.apache.hudi.table.action.deltacommit.SparkUpsertPreppedDeltaCommitActionExecutor.execute(SparkUpsertPreppedDeltaCommitActionExecutor.java:44)
 ~[hudi-spark3-bundle_2.12-0.13.0-amzn-0.jar:0.13.0-amzn-0]
   [2023-07-16, 14:07:21 UTC] {subprocess.py:89} INFO -         at 
org.apache.hudi.table.HoodieSparkMergeOnReadTable.upsertPrepped(HoodieSparkMergeOnReadTable.java:111)
 ~[hudi-spark3-bundle_2.12-0.13.0-amzn-0.jar:0.13.0-amzn-0]
   [2023-07-16, 14:07:21 UTC] {subprocess.py:89} INFO -         at 
org.apache.hudi.table.HoodieSparkMergeOnReadTable.upsertPrepped(HoodieSparkMergeOnReadTable.java:80)
 ~[hudi-spark3-bundle_2.12-0.13.0-amzn-0.jar:0.13.0-amzn-0]
   [2023-07-16, 14:07:21 UTC] {subprocess.py:89} INFO -         at 
org.apache.hudi.client.SparkRDDWriteClient.upsertPreppedRecords(SparkRDDWriteClient.java:154)
 ~[hudi-spark3-bundle_2.12-0.13.0-amzn-0.jar:0.13.0-amzn-0]
   [2023-07-16, 14:07:21 UTC] {subprocess.py:89} INFO -         at 
org.apache.hudi.metadata.SparkHoodieBackedTableMetadataWriter.commit(SparkHoodieBackedTableMetadataWriter.java:186)
 ~[hudi-spark3-bundle_2.12-0.13.0-amzn-0.jar:0.13.0-amzn-0]
   [2023-07-16, 14:07:21 UTC] {subprocess.py:89} INFO -         at 
org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.processAndCommit(HoodieBackedTableMetadataWriter.java:830)
 ~[hudi-spark3-bundle_2.12-0.13.0-amzn-0.jar:0.13.0-amzn-0]
   [2023-07-16, 14:07:21 UTC] {subprocess.py:89} INFO -         at 
org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.update(HoodieBackedTableMetadataWriter.java:910)
 ~[hudi-spark3-bundle_2.12-0.13.0-amzn-0.jar:0.13.0-amzn-0]
   [2023-07-16, 14:07:21 UTC] {subprocess.py:89} INFO -         at 
org.apache.hudi.table.action.BaseActionExecutor.lambda$writeTableMetadata$1(BaseActionExecutor.java:68)
 ~[hudi-spark3-bundle_2.12-0.13.0-amzn-0.jar:0.13.0-amzn-0]
   [2023-07-16, 14:07:21 UTC] {subprocess.py:89} INFO -         at 
org.apache.hudi.common.util.Option.ifPresent(Option.java:97) 
~[hudi-aws-bundle-0.13.0-amzn-0.jar:0.13.0-amzn-0]
   [2023-07-16, 14:07:21 UTC] {subprocess.py:89} INFO -         at 
org.apache.hudi.table.action.BaseActionExecutor.writeTableMetadata(BaseActionExecutor.java:68)
 ~[hudi-spark3-bundle_2.12-0.13.0-amzn-0.jar:0.13.0-amzn-0]
   [2023-07-16, 14:07:21 UTC] {subprocess.py:89} INFO -         at 
org.apache.hudi.table.action.clean.CleanActionExecutor.runClean(CleanActionExecutor.java:221)
 ~[hudi-spark3-bundle_2.12-0.13.0-amzn-0.jar:0.13.0-amzn-0]
   [2023-07-16, 14:07:21 UTC] {subprocess.py:89} INFO -         at 
org.apache.hudi.table.action.clean.CleanActionExecutor.runPendingClean(CleanActionExecutor.java:187)
 ~[hudi-spark3-bundle_2.12-0.13.0-amzn-0.jar:0.13.0-amzn-0]
   [2023-07-16, 14:07:21 UTC] {subprocess.py:89} INFO -         at 
org.apache.hudi.table.action.clean.CleanActionExecutor.lambda$execute$8(CleanActionExecutor.java:256)
 ~[hudi-spark3-bundle_2.12-0.13.0-amzn-0.jar:0.13.0-amzn-0]
   [2023-07-16, 14:07:21 UTC] {subprocess.py:89} INFO -         at 
java.util.ArrayList.forEach(ArrayList.java:1259) ~[?:1.8.0_372]
   [2023-07-16, 14:07:21 UTC] {subprocess.py:89} INFO -         at 
org.apache.hudi.table.action.clean.CleanActionExecutor.execute(CleanActionExecutor.java:250)
 ~[hudi-spark3-bundle_2.12-0.13.0-amzn-0.jar:0.13.0-amzn-0]
   [2023-07-16, 14:07:21 UTC] {subprocess.py:89} INFO -         at 
org.apache.hudi.table.HoodieSparkCopyOnWriteTable.clean(HoodieSparkCopyOnWriteTable.java:263)
 ~[hudi-spark3-bundle_2.12-0.13.0-amzn-0.jar:0.13.0-amzn-0]
   [2023-07-16, 14:07:21 UTC] {subprocess.py:89} INFO -         at 
org.apache.hudi.client.BaseHoodieTableServiceClient.clean(BaseHoodieTableServiceClient.java:554)
 ~[hudi-spark3-bundle_2.12-0.13.0-amzn-0.jar:0.13.0-amzn-0]
   [2023-07-16, 14:07:21 UTC] {subprocess.py:89} INFO -         at 
org.apache.hudi.client.BaseHoodieWriteClient.clean(BaseHoodieWriteClient.java:765)
 ~[hudi-spark3-bundle_2.12-0.13.0-amzn-0.jar:0.13.0-amzn-0]
   [2023-07-16, 14:07:21 UTC] {subprocess.py:89} INFO -         at 
org.apache.hudi.client.BaseHoodieWriteClient.clean(BaseHoodieWriteClient.java:737)
 ~[hudi-spark3-bundle_2.12-0.13.0-amzn-0.jar:0.13.0-amzn-0]
   [2023-07-16, 14:07:21 UTC] {subprocess.py:89} INFO -         at 
org.apache.hudi.async.AsyncCleanerService.lambda$startService$0(AsyncCleanerService.java:55)
 ~[hudi-spark3-bundle_2.12-0.13.0-amzn-0.jar:0.13.0-amzn-0]
   [2023-07-16, 14:07:21 UTC] {subprocess.py:89} INFO -         at 
java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)
 ~[?:1.8.0_372]
   [2023-07-16, 14:07:21 UTC] {subprocess.py:89} INFO -         at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
~[?:1.8.0_372]
   [2023-07-16, 14:07:21 UTC] {subprocess.py:89} INFO -         at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
~[?:1.8.0_372]
   [2023-07-16, 14:07:21 UTC] {subprocess.py:89} INFO -         at 
java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_372]
   ```
   
   Clustering error stacktrace:
   ```
   [2023-07-16 20:07:27,119] {subprocess.py:89} INFO - : 
java.util.concurrent.CompletionException: 
org.apache.spark.sql.AnalysisException: Path does not exist: 
s3://<bucket_name>/<folder_name>/<table_name>/created_dt_yyyy_mm_dd_ist=2023-07-16/44824999-78d5-4ea7-b5aa-0f8b5df80671-0_0-28-85_20230716040832186.parquet
   [2023-07-16 20:07:27,119] {subprocess.py:89} INFO -     at 
java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:273)
   [2023-07-16 20:07:27,119] {subprocess.py:89} INFO -     at 
java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:280)
   [2023-07-16 20:07:27,119] {subprocess.py:89} INFO -     at 
java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1606)
   [2023-07-16 20:07:27,119] {subprocess.py:89} INFO -     at 
java.util.concurrent.CompletableFuture$AsyncSupply.exec(CompletableFuture.java:1596)
   [2023-07-16 20:07:27,119] {subprocess.py:89} INFO -     at 
java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
   [2023-07-16 20:07:27,119] {subprocess.py:89} INFO -     at 
java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
   [2023-07-16 20:07:27,120] {subprocess.py:89} INFO -     at 
java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
   [2023-07-16 20:07:27,120] {subprocess.py:89} INFO -     at 
java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:175)
   [2023-07-16 20:07:27,120] {subprocess.py:89} INFO - Caused by: 
org.apache.spark.sql.AnalysisException: Path does not exist: 
s3://<bucket_name>/<folder_name>/<table_name>/created_dt_yyyy_mm_dd_ist=2023-07-16/44824999-78d5-4ea7-b5aa-0f8b5df80671-0_0-28-85_20230716040832186.parquet
   [2023-07-16 20:07:27,120] {subprocess.py:89} INFO -     at 
org.apache.spark.sql.errors.QueryCompilationErrors$.dataPathNotExistError(QueryCompilationErrors.scala:1016)
   [2023-07-16 20:07:27,120] {subprocess.py:89} INFO -     at 
org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$checkAndGlobPathIfNecessary$4(DataSource.scala:785)
   [2023-07-16 20:07:27,120] {subprocess.py:89} INFO -     at 
org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$checkAndGlobPathIfNecessary$4$adapted(DataSource.scala:782)
   [2023-07-16 20:07:27,120] {subprocess.py:89} INFO -     at 
org.apache.spark.util.ThreadUtils$.$anonfun$parmap$2(ThreadUtils.scala:372)
   [2023-07-16 20:07:27,120] {subprocess.py:89} INFO -     at 
scala.concurrent.Future$.$anonfun$apply$1(Future.scala:659)
   [2023-07-16 20:07:27,120] {subprocess.py:89} INFO -     at 
scala.util.Success.$anonfun$map$1(Try.scala:255)
   [2023-07-16 20:07:27,121] {subprocess.py:89} INFO -     at 
scala.util.Success.map(Try.scala:213)
   [2023-07-16 20:07:27,121] {subprocess.py:89} INFO -     at 
scala.concurrent.Future.$anonfun$map$1(Future.scala:292)
   [2023-07-16 20:07:27,121] {subprocess.py:89} INFO -     at 
scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:33)
   [2023-07-16 20:07:27,121] {subprocess.py:89} INFO -     at 
scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:33)
   [2023-07-16 20:07:27,121] {subprocess.py:89} INFO -     at 
scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
   [2023-07-16 20:07:27,121] {subprocess.py:89} INFO -     at 
java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(ForkJoinTask.java:1402)
   [2023-07-16 20:07:27,121] {subprocess.py:89} INFO -     ... 4 more
   ```
   

