ehurheap opened a new issue, #8636: URL: https://github.com/apache/hudi/issues/8636
**Describe the problem you faced**

There is a pending commit in the timeline over 1 month old. The cleaner does not seem able to advance beyond the instant of the pending commit. We run the cleaner asynchronously. `hoodie.cleaner.commits.retained` is set to a value that should clean deltacommits that occurred after the pending commit.

**To Reproduce**

- hoodie.table.type `MERGE_ON_READ`
- Streaming ingestion using Spark streaming
- Assuming the pending commit is the result of a failed write during ingest, first reproduce a failing write, followed by several successful writes.
- Then run a cleaner configured with `hoodie.cleaner.commits.retained` small enough to point to an instant after the pending commit.

**Expected behavior**

- The cleaner will roll back, or in some other way handle, old pending commits so that they do not block cleaning files according to the given cleaner configuration.

**Environment Description**

* Hudi version : 0.13.0
* Spark version : 3.3
* Hive version : n/a
* Hadoop version : n/a
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : no

**Additional context**

Streaming ingest write configs:

```
hoodie.archive.automatic -> false
hoodie.metadata.enable -> false
hoodie.datasource.write.operation -> insert
hoodie.compact.schedule.inline -> false
hoodie.datasource.write.table.type -> MERGE_ON_READ
hoodie.clean.automatic -> false
hoodie.compact.inline -> false
```

We had to turn off automatic cleaning in our ingestion process because it was taking too long and running into memory limitations.
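For context, Hudi records each timeline action as files under `.hoodie/` (e.g. `<instant>.deltacommit.requested`, `<instant>.deltacommit.inflight`, and the completed `<instant>.deltacommit`); a pending commit is an instant with a requested/inflight file but no completed file. A minimal sketch of that check over listed filenames (the instants below are illustrative, not taken from our table):

```python
# Identify pending instants from Hudi timeline filenames (sketch; filenames illustrative).
def pending_instants(timeline_files):
    completed = set()   # instants with a completed action file
    started = set()     # instants with a .requested or .inflight file
    for name in timeline_files:
        instant = name.split(".")[0]
        if name.endswith((".requested", ".inflight")):
            started.add(instant)
        else:
            completed.add(instant)
    # Pending = started but never completed
    return sorted(started - completed)

files = [
    "20230301101500000.deltacommit.requested",
    "20230301101500000.deltacommit.inflight",
    "20230301101500000.deltacommit",
    "20230302093000000.deltacommit.requested",
    "20230302093000000.deltacommit.inflight",
]
print(pending_instants(files))  # → ['20230302093000000']
```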
We set up a regular async cleaning job using this spark-submit command:

```
spark-submit --deploy-mode cluster \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.driver.memory=40G \
  --conf spark.executor.cores=2 \
  --conf spark.executor.instances=500 \
  --conf spark.executor.memory=20G \
  --conf spark.driver.maxResultSize=16G \
  --conf spark.app.name=hudi_cleaner \
  --conf spark.kryoserializer.buffer.max=256m \
  --conf "spark.executor.extraJavaOptions=-XX:-UseConcMarkSweepGC -XX:-CMSClassUnloadingEnabled -XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 -XX:ConcGCThreads=20" \
  --class org.apache.hudi.utilities.HoodieCleaner s3://bucket-location/hudi-utilities-bundle_2.12-0.13.0.jar \
  --target-base-path s3://path-to-table \
  --hoodie-conf hoodie.metadata.enable=false \
  --hoodie-conf hoodie.cleaner.policy=KEEP_LATEST_COMMITS \
  --hoodie-conf hoodie.cleaner.commits.retained=3000 \
  --hoodie-conf hoodie.keep.min.commits=3010 \
  --hoodie-conf hoodie.keep.max.commits=3020 \
  --hoodie-conf hoodie.cleaner.parallelism=1000 \
  --hoodie-conf hoodie.clean.allow.multiple=false \
  --hoodie-conf hoodie.embed.timeline.server=false \
  --hoodie-conf hoodie.archive.async=false \
  --hoodie-conf hoodie.archive.automatic=false
```
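Our reading of the observed behavior, as a toy model (an illustration of our understanding, not Hudi's actual `CleanPlanner` code): under `KEEP_LATEST_COMMITS` the cleaner computes an earliest-commit-to-retain from `hoodie.cleaner.commits.retained`, but never advances past the earliest pending instant, so a month-old inflight commit pins the clean boundary no matter how small the retained count is:

```python
# Toy model of the behavior we observe (not Hudi source code):
# the effective clean boundary is capped at the earliest pending instant.
def effective_clean_boundary(completed, pending, commits_retained):
    # Instants are sortable timestamp strings, oldest first.
    if len(completed) >= commits_retained:
        earliest_retained = completed[-commits_retained]
    else:
        earliest_retained = completed[0]
    if pending:
        return min(earliest_retained, min(pending))
    return earliest_retained

completed = [f"{i:03d}" for i in range(1, 11)]            # "001" .. "010"
print(effective_clean_boundary(completed, ["003"], 3))    # pending instant caps the boundary: 003
print(effective_clean_boundary(completed, [], 3))         # no pending commits: 008
```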
