ehurheap opened a new issue, #8636:
URL: https://github.com/apache/hudi/issues/8636

   **Describe the problem you faced**
   
   There is a pending commit in the timeline that is over one month old. The cleaner 
does not seem able to advance beyond the instant of that pending commit. 
   
   We run the cleaner asynchronously. `hoodie.cleaner.commits.retained` is set 
to a value that should clean deltacommits that occurred after the pending commit.
   
   **To Reproduce**
   
   - hoodie.table.type `MERGE_ON_READ`
   - Streaming ingestion using spark streaming
   - Assuming the pending commit is the result of a failed write during ingest, 
first reproduce a failing write, followed by several successful writes.
   - Then run the cleaner with `hoodie.cleaner.commits.retained` set small enough 
that the retention boundary falls after the pending commit.
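   To confirm that an instant is pending, the timeline under `.hoodie/` can be 
listed; pending instants are the ones still carrying `.inflight` or `.requested` 
suffixes. A minimal local sketch (the directory and instant times below are made 
up for illustration; against S3 the equivalent would be 
`aws s3 ls s3://path-to-table/.hoodie/`):
   ```shell
   # Simulate a Hudi timeline listing locally; instant times are illustrative.
   mkdir -p /tmp/hoodie_demo/.hoodie
   touch /tmp/hoodie_demo/.hoodie/20230301120000000.deltacommit
   touch /tmp/hoodie_demo/.hoodie/20230302120000000.deltacommit.inflight
   # Pending instants carry .inflight or .requested suffixes.
   ls /tmp/hoodie_demo/.hoodie | grep -E '\.(inflight|requested)$'
   ```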
   
   **Expected behavior**
   
   - Cleaner should roll back or otherwise handle old pending commits so that 
they do not block cleaning files according to the configured retention policy.
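   My understanding of why cleaning stalls (an assumption about the cleaner's 
behavior, not a quote of Hudi source) is that the effective cleaning boundary is 
the earlier of the retention-based instant and the earliest pending instant. A 
small illustrative sketch with made-up instant times:
   ```shell
   # Illustrative only: the cleaner cannot advance past the earliest pending
   # instant, regardless of hoodie.cleaner.commits.retained.
   completed="20230101 20230102 20230104 20230105"
   pending="20230103"   # the month-old inflight commit
   retained=2
   # Nth-latest completed instant that commits.retained alone would allow
   by_retention=$(echo $completed | tr ' ' '\n' | sort | tail -n $retained | head -n 1)
   earliest_pending=$(echo $pending | tr ' ' '\n' | sort | head -n 1)
   # Effective boundary is the minimum of the two
   printf '%s\n%s\n' "$by_retention" "$earliest_pending" | sort | head -n 1
   ```
   With these inputs the boundary is `20230103`: cleaning stops at the pending 
instant even though retention would permit `20230104`.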
   
   **Environment Description**
   * Hudi version : 0.13.0
   * Spark version : 3.3
   * Hive version : n/a
   * Hadoop version : n/a
   * Storage (HDFS/S3/GCS..) : S3
   * Running on Docker? (yes/no) : no
   
   **Additional context**
   Streaming Ingest Write Configs:
   ```
   hoodie.archive.automatic -> false
   hoodie.metadata.enable -> false
   hoodie.datasource.write.operation -> insert
   hoodie.compact.schedule.inline -> false
   hoodie.datasource.write.table.type -> MERGE_ON_READ
   hoodie.clean.automatic -> false
   hoodie.compact.inline -> false
   
   ```
   
   We had to turn off automatic cleaning in our ingestion process because it 
was taking too long and running into memory limitations. We set up a regular 
async cleaning job using this spark-submit command:
   ```
   spark-submit --deploy-mode cluster \
   --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
   --conf spark.driver.memory=40G \
   --conf spark.executor.cores=2 \
   --conf spark.executor.instances=500 \
   --conf spark.executor.memory=20G \
   --conf spark.driver.maxResultSize=16G \
   --conf spark.app.name=hudi_cleaner \
   --conf spark.kryoserializer.buffer.max=256m \
   --conf "spark.executor.extraJavaOptions=-XX:-UseConcMarkSweepGC 
-XX:-CMSClassUnloadingEnabled -XX:+UseG1GC 
-XX:InitiatingHeapOccupancyPercent=35 -XX:ConcGCThreads=20" \
   --class org.apache.hudi.utilities.HoodieCleaner 
s3://bucket-location/hudi-utilities-bundle_2.12-0.13.0.jar \
   --target-base-path s3://path-to-table \
   --hoodie-conf hoodie.metadata.enable=false \
   --hoodie-conf hoodie.cleaner.policy=KEEP_LATEST_COMMITS \
   --hoodie-conf hoodie.cleaner.commits.retained=3000 \
   --hoodie-conf hoodie.keep.min.commits=3010 \
   --hoodie-conf hoodie.keep.max.commits=3020 \
   --hoodie-conf hoodie.cleaner.parallelism=1000 \
   --hoodie-conf hoodie.clean.allow.multiple=false \
   --hoodie-conf hoodie.embed.timeline.server=false \
   --hoodie-conf hoodie.archive.async=false \
   --hoodie-conf hoodie.archive.automatic=false
   
   ```
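   As a possible workaround we are considering manually rolling back the pending 
instant with hudi-cli before running the cleaner. Sketch of the session (command 
names per my reading of the Hudi CLI docs; `<pending_instant>` is a placeholder, 
left unresolved on purpose):
   ```
   connect --path s3://path-to-table
   commit rollback --commit <pending_instant>
   ```
   It would still be preferable for the cleaner itself to handle this case.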

