beyond1920 opened a new issue, #9591: URL: https://github.com/apache/hudi/issues/9591
I found that resources are not released in time in a Spark compaction job. In the screenshot below, only a few tasks are running, but none of the executors are released even though all of their tasks have finished. The root cause is that the write status RDD of the compaction is persisted here.

<img width="1026" alt="image" src="https://github.com/apache/hudi/assets/1525333/ad2e28ed-53c4-4f2d-9ba3-fc957ec067a6">

I checked the compaction code path: the write status RDD is only used once, and it is never reused or triggered again. (This differs from the normal insert/upsert workflow, where the write status is used multiple times.) I set `hoodie.write.status.storage.level=NONE` in the compaction job to skip persisting the write status RDD. With persistence disabled, the job's resource cost was only 30% of the original job that used the default value (`MEMORY_AND_DISK_SER`). The benefit is huge.

Could we remove the persist of the write status RDD in the compaction workflow? It would decrease resource costs, and the RDD is never used again.
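For anyone hitting the same issue, the workaround can be sketched as a spark-submit for an offline compaction job. This is only a sketch: the jar name, paths, and table name are placeholders, and it assumes the `HoodieCompactor` utility from the hudi-utilities bundle accepts `--hoodie-conf` overrides the way other Hudi utilities do; adjust to your deployment.

```shell
# Sketch: run offline compaction with write-status persistence disabled.
# All paths and names below are illustrative placeholders.
spark-submit \
  --class org.apache.hudi.utilities.HoodieCompactor \
  hudi-utilities-bundle.jar \
  --base-path hdfs:///tables/my_table \
  --table-name my_table \
  --hoodie-conf hoodie.write.status.storage.level=NONE
```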
