ehurheap opened a new issue, #9796:
URL: https://github.com/apache/hudi/issues/9796

   **Describe the problem you faced**
   
   We run our cleaner in an async process, but the cleaner is now failing because it starts performing rollback actions and runs into:
   
   ```
   org.apache.hudi.exception.HoodieRollbackException: Failed to rollback for instant [==>20230920133304588__deltacommit__INFLIGHT]
        at org.apache.hudi.table.action.rollback.BaseRollbackHelper.lambda$maybeDeleteAndCollectStats$309309f3$1(BaseRollbackHelper.java:145)
        at org.apache.hudi.client.common.HoodieSparkEngineContext.lambda$flatMap$7d470b86$1(HoodieSparkEngineContext.java:137)
   ...
   Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: File already exists:s3://<table-path-redacted>/<partition-path>/.874bd745-8788-4afa-a933-9d96cbad87af-0_20230703205316197.log.13_1-0-1
        at com.amazon.ws.emr.hadoop.fs.s3.upload.plan.RegularUploadPlanner.checkExistenceIfNotOverwriting(RegularUploadPlanner.java:36)
        at com.amazon.ws.emr.hadoop.fs.s3.upload.plan.RegularUploadPlanner.plan(RegularUploadPlanner.java:30)
        ...
   ```
   The `FileAlreadyExistsException` occurs for several files in the deltacommit, which touches 381 partitions and more than 176,000 files.
   
   Does this indicate that the rollback partially completed at some point? How can we successfully roll back this deltacommit and unblock our cleaner?
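For context on the colliding name: Hudi's MOR log files carry a versioned name, roughly `.<fileId>_<baseInstantTime>.log.<version>_<writeToken>` (our reading of the convention, not verified against 0.13.0 internals). A small sketch that splits the failing file name from the stack trace into those parts:

```python
import re

# The failing file name from the stack trace (path redactions preserved).
LOG_NAME = (".874bd745-8788-4afa-a933-9d96cbad87af-0"
            "_20230703205316197.log.13_1-0-1")

# Assumed layout: .<fileId>_<baseInstantTime>.log.<version>_<writeToken>
pattern = re.compile(
    r"^\.(?P<file_id>.+)_(?P<base_instant>\d+)"
    r"\.log\.(?P<version>\d+)_(?P<write_token>[\d-]+)$"
)

parts = pattern.match(LOG_NAME).groupdict()
print(parts)
```

Note log version `13` with write token `1-0-1`: if an earlier rollback attempt already created this exact log version before dying, a retry producing the same name would hit `FileAlreadyExistsException` on S3, which would be consistent with a partially completed rollback.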
   
   **To Reproduce**
   It is unknown how this situation arose. One possibility is that a cleaner run that failed partway through a rollback left the table in this state.
   
   **Expected behavior**
   The rollback should succeed.
   
   **Environment Description**
   
   * Hudi version : 0.13.0
   * Spark version : 3.3.0
   * Hive version : n/a
   * Hadoop version : 3.2.1
   * Storage (HDFS/S3/GCS..) : s3
   * Running on Docker? (yes/no) : no
   
   **Additional context**
   
   Here are the configs used in the cleaner run:
   ```
   spark-submit \
   --class org.apache.hudi.utilities.HoodieCleaner \
   --deploy-mode cluster \
   --master yarn \
   -- <executor-configs-redacted> \
   <path-to-file-redacted>/hudi-utilities-bundle_2.12-0.13.0.jar \
   --target-base-path <table-path-redacted> \
   --hoodie-conf hoodie.metadata.enable=false \
   --hoodie-conf hoodie.cleaner.policy=KEEP_LATEST_COMMITS \
   --hoodie-conf hoodie.cleaner.commits.retained=500 \
   --hoodie-conf hoodie.keep.min.commits=510 \
   --hoodie-conf hoodie.keep.max.commits=520 \
   --hoodie-conf hoodie.cleaner.policy.failed.writes=LAZY \
   --hoodie-conf hoodie.cleaner.parallelism=200 \
   --hoodie-conf hoodie.clean.allow.multiple=false \
   --hoodie-conf hoodie.embed.timeline.server=false \
   --hoodie-conf hoodie.archive.async=false \
   --hoodie-conf hoodie.archive.automatic=false \
   --hoodie-conf hoodie.write.concurrency.mode=optimistic_concurrency_control \
   --hoodie-conf hoodie.write.lock.provider=org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider \
   --hoodie-conf hoodie.write.lock.dynamodb.table=<tablename-redacted> \
   --hoodie-conf hoodie.write.lock.dynamodb.partition_key=<tablekey-redacted> \
   --hoodie-conf hoodie.write.lock.dynamodb.region=us-east-1
   ```
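If the pending deltacommit just needs to be rolled back before cleaning can proceed, one option we are considering is triggering the rollback manually from hudi-cli. A sketch only; the command syntax below is our reading of the Hudi CLI docs and has not been verified against 0.13.0:

```shell
# Inside the hudi-cli shell (ships with the Hudi release bundle):
connect --path s3://<table-path-redacted>

# Roll back the stuck inflight deltacommit by its instant time;
# this launches a Spark job under the hood.
commit rollback --commit 20230920133304588
```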
   
   
   **Stacktrace**
   
   ```
   org.apache.hudi.exception.HoodieRollbackException: Failed to rollback for instant [==>20230920133304588__deltacommit__INFLIGHT]
        at org.apache.hudi.table.action.rollback.BaseRollbackHelper.lambda$maybeDeleteAndCollectStats$309309f3$1(BaseRollbackHelper.java:145)
        at org.apache.hudi.client.common.HoodieSparkEngineContext.lambda$flatMap$7d470b86$1(HoodieSparkEngineContext.java:137)
        at org.apache.spark.api.java.JavaRDDLike.$anonfun$flatMap$1(JavaRDDLike.scala:125)
        at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
        at scala.collection.Iterator.foreach(Iterator.scala:943)
        at scala.collection.Iterator.foreach$(Iterator.scala:943)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
        at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
        at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
        at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
        at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
        at scala.collection.TraversableOnce.to(TraversableOnce.scala:366)
        at scala.collection.TraversableOnce.to$(TraversableOnce.scala:364)
        at scala.collection.AbstractIterator.to(Iterator.scala:1431)
        at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:358)
        at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:358)
        at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1431)
        at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:345)
        at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:339)
        at scala.collection.AbstractIterator.toArray(Iterator.scala:1431)
        at org.apache.spark.rdd.RDD.$anonfun$collect$2(RDD.scala:1021)
        at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2269)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:138)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1516)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
   Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: File already exists:s3://<table-path-redacted>/<partition-path>/.874bd745-8788-4afa-a933-9d96cbad87af-0_20230703205316197.log.13_1-0-1
        at com.amazon.ws.emr.hadoop.fs.s3.upload.plan.RegularUploadPlanner.checkExistenceIfNotOverwriting(RegularUploadPlanner.java:36)
        at com.amazon.ws.emr.hadoop.fs.s3.upload.plan.RegularUploadPlanner.plan(RegularUploadPlanner.java:30)
        at com.amazon.ws.emr.hadoop.fs.s3.upload.plan.UploadPlannerChain.plan(UploadPlannerChain.java:37)
        at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.create(S3NativeFileSystem.java:339)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1125)
        at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.create(EmrFileSystem.java:255)
        at org.apache.hudi.common.fs.HoodieWrapperFileSystem.lambda$create$9(HoodieWrapperFileSystem.java:290)
        at org.apache.hudi.common.fs.HoodieWrapperFileSystem.executeFuncWithTimeMetrics(HoodieWrapperFileSystem.java:114)
        at org.apache.hudi.common.fs.HoodieWrapperFileSystem.create(HoodieWrapperFileSystem.java:288)
        at org.apache.hudi.common.table.log.HoodieLogFormatWriter.createNewFile(HoodieLogFormatWriter.java:234)
        at org.apache.hudi.common.table.log.HoodieLogFormatWriter.getOutputStream(HoodieLogFormatWriter.java:121)
        at org.apache.hudi.common.table.log.HoodieLogFormatWriter.appendBlocks(HoodieLogFormatWriter.java:144)
        at org.apache.hudi.common.table.log.HoodieLogFormatWriter.appendBlock(HoodieLogFormatWriter.java:135)
        at org.apache.hudi.table.action.rollback.BaseRollbackHelper.lambda$maybeDeleteAndCollectStats$309309f3$1(BaseRollbackHelper.java:140)
        ... 30 more
   ```
   
   

