[PR] [HUDI-7906] improve the parallelism deduce in rdd write [hudi]

via GitHub Tue, 18 Jun 2024 08:55:42 -0700


KnightChess opened a new pull request, #11470:
URL: https://github.com/apache/hudi/pull/11470


   ### Change Logs
   
   As described in https://github.com/apache/hudi/issues/11274 and 
https://github.com/apache/hudi/pull/11463, there are two problems:
   
   - If the RDD is an input RDD that has not been shuffled, its partition 
number can be too large or too small.
   - Users cannot easily control it.
     - In some cases, users can change it by setting `spark.default.parallelism`.
     - In some cases, users cannot change it because the value is hard-coded.
     - In Spark, the preferred way to control it is `spark.default.parallelism` 
or `spark.sql.shuffle.partitions`; the other settings are advanced options in Hudi.
   
   ### Impact
   
   In operations such as dedup that use the new deduction logic, users can 
control the parallelism with `spark.sql.shuffle.partitions` or 
`spark.default.parallelism`.
   For special scenarios, the advanced parameters can also be used.
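
The fallback described above can be sketched roughly as follows. This is only an illustration, not the code in this PR: the function name `deduce_parallelism`, the `advanced_parallelism` argument, and the precedence order (advanced parameter first, then `spark.sql.shuffle.partitions`, then `spark.default.parallelism`, then the input partition count) are all assumptions made for the example.

```python
from typing import Mapping, Optional


def deduce_parallelism(
    spark_conf: Mapping[str, str],
    input_partitions: int,
    advanced_parallelism: Optional[int] = None,
) -> int:
    """Pick a write parallelism (hypothetical precedence, not Hudi's code)."""
    # 1. Assumption: an explicitly set, positive advanced parameter wins.
    if advanced_parallelism is not None and advanced_parallelism > 0:
        return advanced_parallelism
    # 2. Otherwise fall back to the standard Spark settings (assumed order).
    for key in ("spark.sql.shuffle.partitions", "spark.default.parallelism"):
        value = spark_conf.get(key)
        if value is not None and int(value) > 0:
            return int(value)
    # 3. With nothing configured, keep the input RDD's partition count.
    return max(input_partitions, 1)


conf = {"spark.default.parallelism": "100"}
print(deduce_parallelism(conf, input_partitions=8))
print(deduce_parallelism(conf, input_partitions=8, advanced_parallelism=16))
```

In this sketch a user who sets only the standard Spark options never needs a Hudi-specific setting, while the advanced parameter remains available as an explicit override.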
   
   ### Risk level (write none, low medium or high below)
   
   low
   
   ### Documentation Update
   
   None
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


