[PR] Reduce the data amount collected on spark driver [hudi]

via GitHub Mon, 06 Nov 2023 09:39:46 -0800


linliu-code opened a new pull request, #9995:
URL: https://github.com/apache/hudi/pull/9995


   ### Change Logs
   
   When building the profile, the Spark driver only needs the data distribution 
over (partition, instant_time, file_id), not over (partition, instant_time, 
file_id, record_position). Dropping record_position from the key reduces the 
amount of data collected on the driver.
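
   The effect of the key change can be illustrated with a minimal, self-contained sketch. It is plain Python rather than the actual Hudi or Spark code, and the `RecordLocation` type, the helper functions, and the sample values are hypothetical names introduced only for illustration. It shows that counting per (partition, instant_time, file_id) yields one entry per file group, while including record_position yields one entry per record.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class RecordLocation(NamedTuple):
    # Hypothetical stand-in for a record's location; not a Hudi class.
    partition: str
    instant_time: str
    file_id: str
    record_position: int


def profile_with_position(locations: Iterable[RecordLocation]) -> Counter:
    # Key includes record_position, so there is one entry per record:
    # the collected result grows with the number of records.
    return Counter(locations)


def profile_without_position(locations: Iterable[RecordLocation]) -> Counter:
    # Key is (partition, instant_time, file_id) only, so the collected
    # result grows with the number of file groups, not with record count.
    return Counter(
        (loc.partition, loc.instant_time, loc.file_id) for loc in locations
    )


# 10,000 sample records spread evenly over 4 hypothetical file groups.
locations = [
    RecordLocation("2023/11/06", "20231106093946", f"file-{i % 4}", i)
    for i in range(10_000)
]

fine = profile_with_position(locations)
coarse = profile_without_position(locations)
print(len(fine), len(coarse))  # prints "10000 4"
```

   Both profiles account for the same 10,000 records; only the number of entries that would have to be held on the driver differs.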
   
   TESTS:
   1. Without removing the record position, the driver consistently ran out of 
memory (OOM).
   2. After removing the record position, both the 500GB and the 1TB queries 
finished successfully.
   
   ### Impact
   
   This fix addresses a stability regression that affected large queries.
   
   ### Risk level (write none, low, medium or high below)
   
   Low.
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
