beyond1920 opened a new issue, #9591: URL: https://github.com/apache/hudi/issues/9591
I found that resources are not released in time in a Spark compaction job. In the screenshot below, only a few tasks are running, but none of the executors are released even though all of their tasks have finished. The root cause is that the write status RDD of the compaction is persisted here.

<img width="1026" alt="image" src="https://github.com/apache/hudi/assets/1525333/ad2e28ed-53c4-4f2d-9ba3-fc957ec067a6">

I checked the compaction code path: the write status RDD is only used once, and it is never reused or triggered again. (This differs from the normal insert/upsert workflow, where the write status is used multiple times.) I set `hoodie.write.status.storage.level=NONE` in the compaction job to skip persisting the write status RDD. With persistence disabled, the job's resource cost was only 30% of the original job that used the default value (`MEMORY_AND_DISK_SER`). The benefit is huge.

Could we remove the persist of the write status RDD in the compaction workflow? It would decrease resource costs, and the RDD is never used again.
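For anyone hitting the same issue, the workaround can be sketched as a spark-submit for an offline compaction job. This is only a sketch: the jar name, paths, and table name are placeholders, and it assumes the `HoodieCompactor` utility from the hudi-utilities bundle accepts `--hoodie-conf` overrides the way other Hudi utilities do; adjust to your deployment.

```shell
# Sketch: run offline compaction with write-status persistence disabled.
# All paths and names below are illustrative placeholders.
spark-submit \
  --class org.apache.hudi.utilities.HoodieCompactor \
  hudi-utilities-bundle.jar \
  --base-path hdfs:///tables/my_table \
  --table-name my_table \
  --hoodie-conf hoodie.write.status.storage.level=NONE
```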
