waitingF opened a new pull request, #8376: URL: https://github.com/apache/hudi/pull/8376
### Change Logs

For the Kafka source, the default parallelism when pulling data from Kafka is the number of Kafka partitions. This causes problems in two cases:

1. Pulling a large amount of data from Kafka (e.g. maxEvents=100000000) when there are not enough Kafka partitions: the pull takes too much time.
2. Heavy data skew between Kafka partitions: the pull is blocked by the slowest partition.

To solve these cases, I add a parameter `hoodie.deltastreamer.source.kafka.per.partition.maxEvents` to control the maximum number of events per Kafka partition input. The default, Long.MAX_VALUE, means the feature is not turned on.

Here is a comparison with this feature (max executor cores = 128):

before:

<img width="1370" alt="image" src="https://user-images.githubusercontent.com/19326824/228461120-033f0e98-b170-46f4-8380-c0da33f4ad58.png">

after:

<img width="1319" alt="image" src="https://user-images.githubusercontent.com/19326824/228461364-2d6932cd-66bf-4bd8-806c-a669520af255.png">

The performance improvement is about 3x (and can improve further given more cores).

### Impact

_Default should be no impact._

### Risk level (write none, low medium or high below)

_none._

### Documentation Update

- `hoodie.deltastreamer.source.kafka.per.partition.maxEvents`: controls the maximum number of events per Kafka partition input when pulling data from the Kafka source.

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
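To illustrate the idea behind the new config, here is a minimal, hypothetical sketch (not Hudi's actual implementation): capping events per Kafka partition amounts to splitting each partition's offset range into chunks of at most `perPartitionMaxEvents`, so a skewed or oversized partition can be pulled by several Spark tasks in parallel instead of one. The class and method names below are illustrative only.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: split Kafka offset ranges so no chunk exceeds
// perPartitionMaxEvents. Each long[] is {partitionId, fromOffset, untilOffset}.
public class OffsetRangeSplitter {

  public static List<long[]> split(List<long[]> ranges, long perPartitionMaxEvents) {
    List<long[]> out = new ArrayList<>();
    for (long[] r : ranges) {
      long from = r[1];
      // Emit successive chunks until the partition's until-offset is reached.
      while (from < r[2]) {
        long until = Math.min(from + perPartitionMaxEvents, r[2]);
        out.add(new long[] {r[0], from, until});
        from = until;
      }
    }
    return out;
  }

  public static void main(String[] args) {
    // One skewed partition (offsets 0..1000) and one small one (0..100),
    // capped at 400 events per chunk: partition 0 splits into 3 chunks,
    // partition 1 stays as 1 chunk.
    List<long[]> ranges = new ArrayList<>();
    ranges.add(new long[] {0, 0, 1000});
    ranges.add(new long[] {1, 0, 100});
    List<long[]> chunks = split(ranges, 400);
    System.out.println(chunks.size()); // → 4
  }
}
```

With `perPartitionMaxEvents` left at Long.MAX_VALUE, every partition stays a single chunk, which matches the PR's "no impact by default" claim.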
