adityaverma1997 opened a new issue, #9257: URL: https://github.com/apache/hudi/issues/9257
**Describe the problem you faced**

Hi team, we are facing an issue where a failed clean operation still removes data files. We are writing to a COW table with the insert operation. After running successfully for a day or so, the job started failing on the clustering operation (stacktrace given in a separate section below). While debugging, we found that the actual issue was a failed clean operation performed 4-5 commits before the clustering. The clean operation could not complete, and there are two problems here:

1. The clean failed, but the data files it was supposed to clean were removed anyway.
2. Although the clean failed, the Spark job still succeeded.

Because the files were cleaned up (which was not expected) and the same files were later used by clustering for stitching, the job runs into an error state. After the failed clean, the timeline also shows a successful rollback and another failed clean operation. Attaching screenshots from the hoodie timeline, where S represents a successful operation and F a failed operation:

Cleaner operation screenshot showing the failed clean: *(screenshot)*

Screenshot of the successful rollback: *(screenshot)*

**To Reproduce**

Steps to reproduce the behavior:

1. Create a new COW Hudi table with the given set of configurations (refer to the additional context section for the same).
2. Run it at an hourly frequency and monitor it for a couple of days.
3. The cleaner will give a warning and ultimately fail, but the Spark job will succeed, with the following error:

```
Failed to perform previous clean operation, instant: [==>20230716130816302__clean__INFLIGHT]
```

The subsequent clustering operation will also fail after that, because it tries to stitch a parquet file that was removed by the failed clean operation, leading to an inconsistency. Please refer to the stacktrace section below for both the cleaner and clustering stacktraces.

**Expected behavior**

As per the logic, we expected that the data files would not be cleaned up when the clean operation failed; their removal led to the subsequent clustering failure.
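For anyone triaging a similar table, the stuck clean can be spotted directly from the `.hoodie` timeline directory: a `<instant>.clean.inflight` / `<instant>.clean.requested` file without a matching completed `<instant>.clean` file. A minimal sketch (assuming the instant-file naming convention used by Hudi 0.13.x; `pending_cleans` is a hypothetical helper, not a Hudi API):

```python
# Sketch: list clean instants that never reached the COMPLETED state,
# based on the <instant>.<action>[.requested|.inflight] file naming
# convention of the Hudi 0.13.x active timeline (an assumption here).
import os
import tempfile

def pending_cleans(hoodie_dir):
    """Return clean instant timestamps with no completed <instant>.clean file."""
    pending, completed = set(), set()
    for name in os.listdir(hoodie_dir):
        parts = name.split(".")
        if len(parts) >= 2 and parts[1] == "clean":
            if parts[-1] in ("requested", "inflight"):
                pending.add(parts[0])
            else:
                completed.add(parts[0])  # plain <instant>.clean => completed
    return sorted(pending - completed)

# Simulated timeline: one completed clean, one clean stuck INFLIGHT
tmp = tempfile.mkdtemp()
for f in ["20230716040832186.clean",
          "20230716130816302.clean.requested",
          "20230716130816302.clean.inflight"]:
    open(os.path.join(tmp, f), "w").close()

print(pending_cleans(tmp))  # -> ['20230716130816302']
```

On the timeline shown above, this is exactly the `20230716130816302__clean__INFLIGHT` instant the cleaner keeps retrying.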
**Environment Description**

* Hudi version : 0.13.0 / 0.13.1
* Spark version : 3.3.2
* EMR version : 6.11.0
* Hive version : N/A
* Hadoop version : N/A
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : no

**Additional context**

Hudi configurations used:

```
{
  "hoodie.database.name": <db_name>,
  "hoodie.table.name": <table_name>,
  "hoodie.datasource.write.recordkey.field": "id",
  "hoodie.datasource.write.precombine.field": "created_dt",
  "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
  "hoodie.datasource.hive_sync.enable": "true",
  "hoodie.datasource.hive_sync.database": <db_name>,
  "hoodie.datasource.hive_sync.table": <table_name>,
  "hoodie.datasource.hive_sync.support_timestamp": "true",
  "hoodie.datasource.hive_sync.use_jdbc": "false",
  "hoodie.datasource.hive_sync.mode": "hms",
  "hoodie.datasource.meta_sync.condition.sync": "true",
  "hoodie.schema.on.read.enable": "true",
  "hoodie.datasource.write.reconcile.schema": "true",
  "hoodie.copyonwrite.record.size.estimate": 100,
  "hoodie.merge.allow.duplicate.on.inserts": "false",
  "hoodie.metadata.enable": "true",
  "hoodie.metadata.index.column.stats.enable": "true",
  "hoodie.enable.data.skipping": "true",
  "hoodie.datasource.write.operation": "insert",
  "hoodie.insert.shuffle.parallelism": 0,
  "hoodie.datasource.write.partitionpath.field": "created_dt_yyyy_mm_dd_ist",
  "hoodie.datasource.hive_sync.partition_fields": "created_dt_yyyy_mm_dd_ist",
  "hoodie.datasource.write.hive_style_partitioning": "true",
  "hoodie.clean.async": "true",
  "hoodie.clean.max.commits": 10,
  "hoodie.cleaner.policy": "KEEP_LATEST_COMMITS",
  "hoodie.cleaner.commits.retained": 2,
  "hoodie.clustering.inline": "true",
  "hoodie.clustering.inline.max.commits": 4,
  "hoodie.clustering.plan.strategy.sort.columns": "battery_event_type_id",
  "hoodie.clustering.plan.partition.filter.mode": "RECENT_DAYS",
  "hoodie.clustering.execution.strategy.class": "org.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy",
  "hoodie.clustering.plan.strategy.target.file.max.bytes": 134217728,
  "hoodie.clustering.plan.strategy.small.file.limit": 104857600,
  "hoodie.parquet.small.file.limit": 0,
  "hoodie.parquet.max.file.size": 134217728,
  "hoodie.archive.async": "true",
  "hoodie.keep.min.commits": 10,
  "hoodie.keep.max.commits": 15,
  "hoodie.commits.archival.batch": 15,
  "hoodie.archive.merge.enable": "true",
  "hoodie.archive.merge.files.batch.size": 15,
  "hoodie.archive.merge.small.file.limit.bytes": 20971520
}
```

**Stacktrace**

Cleaning error stacktrace:

```
23/07/16 14:07:21 WARN CleanActionExecutor: Failed to perform previous clean operation, instant: [==>20230716130816302__clean__INFLIGHT]
java.lang.IllegalArgumentException: null
	at org.apache.hudi.common.util.ValidationUtils.checkArgument(ValidationUtils.java:31) ~[hudi-aws-bundle-0.13.0-amzn-0.jar:0.13.0-amzn-0]
	at org.apache.hudi.common.table.timeline.HoodieActiveTimeline.transitionState(HoodieActiveTimeline.java:633) ~[hudi-aws-bundle-0.13.0-amzn-0.jar:0.13.0-amzn-0]
	at org.apache.hudi.common.table.timeline.HoodieActiveTimeline.transitionState(HoodieActiveTimeline.java:614) ~[hudi-aws-bundle-0.13.0-amzn-0.jar:0.13.0-amzn-0]
	at org.apache.hudi.common.table.timeline.HoodieActiveTimeline.saveAsComplete(HoodieActiveTimeline.java:223) ~[hudi-aws-bundle-0.13.0-amzn-0.jar:0.13.0-amzn-0]
	at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.commit(BaseSparkCommitActionExecutor.java:305) ~[hudi-spark3-bundle_2.12-0.13.0-amzn-0.jar:0.13.0-amzn-0]
	at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.commit(BaseSparkCommitActionExecutor.java:291) ~[hudi-spark3-bundle_2.12-0.13.0-amzn-0.jar:0.13.0-amzn-0]
	at org.apache.hudi.table.action.commit.BaseCommitActionExecutor.autoCommit(BaseCommitActionExecutor.java:190) ~[hudi-spark3-bundle_2.12-0.13.0-amzn-0.jar:0.13.0-amzn-0]
	at org.apache.hudi.table.action.commit.BaseCommitActionExecutor.commitOnAutoCommit(BaseCommitActionExecutor.java:174) ~[hudi-spark3-bundle_2.12-0.13.0-amzn-0.jar:0.13.0-amzn-0]
	at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.updateIndexAndCommitIfNeeded(BaseSparkCommitActionExecutor.java:273) ~[hudi-spark3-bundle_2.12-0.13.0-amzn-0.jar:0.13.0-amzn-0]
	at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.execute(BaseSparkCommitActionExecutor.java:178) ~[hudi-spark3-bundle_2.12-0.13.0-amzn-0.jar:0.13.0-amzn-0]
	at org.apache.hudi.table.action.deltacommit.SparkUpsertPreppedDeltaCommitActionExecutor.execute(SparkUpsertPreppedDeltaCommitActionExecutor.java:44) ~[hudi-spark3-bundle_2.12-0.13.0-amzn-0.jar:0.13.0-amzn-0]
	at org.apache.hudi.table.HoodieSparkMergeOnReadTable.upsertPrepped(HoodieSparkMergeOnReadTable.java:111) ~[hudi-spark3-bundle_2.12-0.13.0-amzn-0.jar:0.13.0-amzn-0]
	at org.apache.hudi.table.HoodieSparkMergeOnReadTable.upsertPrepped(HoodieSparkMergeOnReadTable.java:80) ~[hudi-spark3-bundle_2.12-0.13.0-amzn-0.jar:0.13.0-amzn-0]
	at org.apache.hudi.client.SparkRDDWriteClient.upsertPreppedRecords(SparkRDDWriteClient.java:154) ~[hudi-spark3-bundle_2.12-0.13.0-amzn-0.jar:0.13.0-amzn-0]
	at org.apache.hudi.metadata.SparkHoodieBackedTableMetadataWriter.commit(SparkHoodieBackedTableMetadataWriter.java:186) ~[hudi-spark3-bundle_2.12-0.13.0-amzn-0.jar:0.13.0-amzn-0]
	at org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.processAndCommit(HoodieBackedTableMetadataWriter.java:830) ~[hudi-spark3-bundle_2.12-0.13.0-amzn-0.jar:0.13.0-amzn-0]
	at org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.update(HoodieBackedTableMetadataWriter.java:910) ~[hudi-spark3-bundle_2.12-0.13.0-amzn-0.jar:0.13.0-amzn-0]
	at org.apache.hudi.table.action.BaseActionExecutor.lambda$writeTableMetadata$1(BaseActionExecutor.java:68) ~[hudi-spark3-bundle_2.12-0.13.0-amzn-0.jar:0.13.0-amzn-0]
	at org.apache.hudi.common.util.Option.ifPresent(Option.java:97) ~[hudi-aws-bundle-0.13.0-amzn-0.jar:0.13.0-amzn-0]
	at org.apache.hudi.table.action.BaseActionExecutor.writeTableMetadata(BaseActionExecutor.java:68) ~[hudi-spark3-bundle_2.12-0.13.0-amzn-0.jar:0.13.0-amzn-0]
	at org.apache.hudi.table.action.clean.CleanActionExecutor.runClean(CleanActionExecutor.java:221) ~[hudi-spark3-bundle_2.12-0.13.0-amzn-0.jar:0.13.0-amzn-0]
	at org.apache.hudi.table.action.clean.CleanActionExecutor.runPendingClean(CleanActionExecutor.java:187) ~[hudi-spark3-bundle_2.12-0.13.0-amzn-0.jar:0.13.0-amzn-0]
	at org.apache.hudi.table.action.clean.CleanActionExecutor.lambda$execute$8(CleanActionExecutor.java:256) ~[hudi-spark3-bundle_2.12-0.13.0-amzn-0.jar:0.13.0-amzn-0]
	at java.util.ArrayList.forEach(ArrayList.java:1259) ~[?:1.8.0_372]
	at org.apache.hudi.table.action.clean.CleanActionExecutor.execute(CleanActionExecutor.java:250) ~[hudi-spark3-bundle_2.12-0.13.0-amzn-0.jar:0.13.0-amzn-0]
	at org.apache.hudi.table.HoodieSparkCopyOnWriteTable.clean(HoodieSparkCopyOnWriteTable.java:263) ~[hudi-spark3-bundle_2.12-0.13.0-amzn-0.jar:0.13.0-amzn-0]
	at org.apache.hudi.client.BaseHoodieTableServiceClient.clean(BaseHoodieTableServiceClient.java:554) ~[hudi-spark3-bundle_2.12-0.13.0-amzn-0.jar:0.13.0-amzn-0]
	at org.apache.hudi.client.BaseHoodieWriteClient.clean(BaseHoodieWriteClient.java:765) ~[hudi-spark3-bundle_2.12-0.13.0-amzn-0.jar:0.13.0-amzn-0]
	at org.apache.hudi.client.BaseHoodieWriteClient.clean(BaseHoodieWriteClient.java:737) ~[hudi-spark3-bundle_2.12-0.13.0-amzn-0.jar:0.13.0-amzn-0]
	at org.apache.hudi.async.AsyncCleanerService.lambda$startService$0(AsyncCleanerService.java:55) ~[hudi-spark3-bundle_2.12-0.13.0-amzn-0.jar:0.13.0-amzn-0]
	at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604) ~[?:1.8.0_372]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_372]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_372]
	at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_372]
```

Clustering error stacktrace:

```
[2023-07-16 20:07:27,119] {subprocess.py:89} INFO - :
java.util.concurrent.CompletionException: org.apache.spark.sql.AnalysisException: Path does not exist: s3://<bucket_name>/<folder_name>/<table_name>/created_dt_yyyy_mm_dd_ist=2023-07-16/44824999-78d5-4ea7-b5aa-0f8b5df80671-0_0-28-85_20230716040832186.parquet
	at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:273)
	at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:280)
	at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1606)
	at java.util.concurrent.CompletableFuture$AsyncSupply.exec(CompletableFuture.java:1596)
	at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
	at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
	at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
	at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:175)
Caused by: org.apache.spark.sql.AnalysisException: Path does not exist: s3://<bucket_name>/<folder_name>/<table_name>/created_dt_yyyy_mm_dd_ist=2023-07-16/44824999-78d5-4ea7-b5aa-0f8b5df80671-0_0-28-85_20230716040832186.parquet
	at org.apache.spark.sql.errors.QueryCompilationErrors$.dataPathNotExistError(QueryCompilationErrors.scala:1016)
	at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$checkAndGlobPathIfNecessary$4(DataSource.scala:785)
	at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$checkAndGlobPathIfNecessary$4$adapted(DataSource.scala:782)
	at org.apache.spark.util.ThreadUtils$.$anonfun$parmap$2(ThreadUtils.scala:372)
	at scala.concurrent.Future$.$anonfun$apply$1(Future.scala:659)
	at scala.util.Success.$anonfun$map$1(Try.scala:255)
	at scala.util.Success.map(Try.scala:213)
	at scala.concurrent.Future.$anonfun$map$1(Future.scala:292)
	at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:33)
	at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:33)
	at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
	at java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(ForkJoinTask.java:1402)
	... 4 more
```
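The two stacktraces together point at an ordering problem: data files are physically deleted before the clean instant is transitioned to COMPLETED, so a failure at that final step leaves the timeline saying the clean never finished while the files are already gone. A minimal sketch of that failure mode (an illustrative model only, not Hudi's actual code; `clean` and `complete_instant` are hypothetical names):

```python
# Illustrative model of the reported behavior: deletion happens in step 1,
# completing the clean instant in step 2; a step-2 failure cannot undo step 1.
import os
import tempfile

def clean(files_to_clean, complete_instant, fail_on_commit=False):
    deleted = []
    for path in files_to_clean:
        os.remove(path)          # step 1: physical deletion of data files
        deleted.append(path)
    if fail_on_commit:           # step 2: marking the instant complete can fail
        raise RuntimeError("Failed to perform previous clean operation")
    complete_instant()
    return deleted

tmp = tempfile.mkdtemp()
f = os.path.join(tmp, "old_file.parquet")
open(f, "w").close()
try:
    clean([f], complete_instant=lambda: None, fail_on_commit=True)
except RuntimeError:
    pass

print(os.path.exists(f))  # -> False: file is gone even though the clean "failed"
```

This matches what clustering then observes: the plan still references the deleted parquet file, and the read fails with `Path does not exist`.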
