ehurheap opened a new issue, #8209: URL: https://github.com/apache/hudi/issues/8209
**Describe the problem you faced**

We are running spark streaming ingestion with the following cleaner configs:

```
(hoodie.clean.automatic -> true)
(hoodie.clean.max.commits -> 30)
(hoodie.cleaner.hours.retained -> 24)
(hoodie.cleaner.parallelism -> 256)
(hoodie.cleaner.policy -> KEEP_LATEST_BY_HOURS)
```

Ingestion commits happen about every 20-30 minutes. However using the hudi-cli I can see that the cleans occur far less frequently, and at some point about 3 weeks ago cleans stopped happening altogether.

When the ingestion was restarted, it stalled on `Generating list of file slices to be cleaned:`, and eventually the executors ran out of memory and the job failed.

To allow ingestion to proceed we redeployed with automatic cleaner disabled.

Questions:

- Why did the cleaner stop running?
- Is it expected that the cleans happen less frequently than commits?
- Is cleaning impacted by not using the metadata table?
- What is the best approach to catch up on all the files to be cleaned?

**To Reproduce**

Steps to reproduce the behavior:

1. Deploy ingestion with above write configs
2. observe cleans in hudi-cli `cleans show`
3. Redeploy ingestion after cleaner has stopped for some time

**Expected behavior**

The cleaner table service is invoked immediately after each commit.

**Environment Description**

* Hudi version : 0.12.1
* Spark version : 3.3
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : no

**Additional context**

additional write configs include

```
(hoodie.compact.inline -> false)
(hoodie.compact.schedule.inline -> false)
(hoodie.datasource.compaction.async.enable -> false)
(hoodie.metadata.enable -> false)
(hoodie.write.concurrency.mode,optimistic_concurrency_control)
(hoodie.write.lock.dynamodb.partition_key,key1)
(hoodie.write.lock.dynamodb.region,us-east-1)
(hoodie.write.lock.dynamodb.table,datalake-locks)
(hoodie.write.lock.provider,org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider)
```
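
As a rough back-of-the-envelope check on the cleaner settings in the description, the sketch below estimates how often a clean would even be attempted. It assumes that `hoodie.clean.max.commits` is the number of commits that must accumulate after the last clean before a new one is scheduled (my reading of the config's documented meaning, not something verified against the 0.12.1 source), and that commits arrive at a fixed cadence. The function names are hypothetical helpers for illustration only.

```python
# Values taken from the issue description.
COMMIT_INTERVAL_MINUTES = (20, 30)  # ingestion commits every 20-30 minutes
CLEAN_MAX_COMMITS = 30              # hoodie.clean.max.commits
HOURS_RETAINED = 24                 # hoodie.cleaner.hours.retained


def hours_between_clean_attempts(commit_interval_minutes, max_commits):
    """Hours needed to accumulate `max_commits` commits at a fixed cadence.

    Assumption: a clean is only attempted once this many commits have
    landed since the previous clean.
    """
    return commit_interval_minutes * max_commits / 60


def commits_in_retention_window(commit_interval_minutes, hours_retained):
    """Number of commits that fall inside the retention window."""
    return hours_retained * 60 // commit_interval_minutes


fastest, slowest = COMMIT_INTERVAL_MINUTES
for interval in (fastest, slowest):
    gap = hours_between_clean_attempts(interval, CLEAN_MAX_COMMITS)
    kept = commits_in_retention_window(interval, HOURS_RETAINED)
    print(f"commit every {interval} min: clean attempted about every "
          f"{gap} h, about {kept} commits inside the retention window")
```

Under these assumptions a clean would be attempted only every 10 to 15 hours rather than after each commit, which may account for cleans appearing less often than commits. It does not explain why cleans stopped entirely.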
