waitingF opened a new pull request, #8313: URL: https://github.com/apache/hudi/pull/8313
### Change Logs

For the Kafka source, the default parallelism when pulling data from Kafka is the number of Kafka partitions. There are cases where this falls short:

1. When pulling a large amount of data from Kafka (e.g. maxEvents=100000000) and the number of Kafka partitions is not large enough, the pull takes too much time.
2. When there is heavy data skew between Kafka partitions, the pull is blocked by the slowest partition.

To solve those cases, I added a parameter `hoodie.deltastreamer.kafka.per.batch.maxEvents` to control the maximum number of events in one Kafka batch. The default, Long.MAX_VALUE, means the feature is not turned on.

### Impact

_Default should be no impact._

### Risk level (write none, low medium or high below)

_none._

### Documentation Update

- `hoodie.deltastreamer.kafka.per.batch.maxEvents`: controls the maximum number of events pulled from the Kafka source in one batch

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
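As a sketch of how the new knob might be set (only `hoodie.deltastreamer.kafka.per.batch.maxEvents` comes from this PR; the other property names and all values are illustrative assumptions), a DeltaStreamer properties file could cap each Kafka batch like so:

```properties
# Illustrative Kafka source settings (topic/servers values are placeholders)
hoodie.deltastreamer.source.kafka.topic=my_topic
bootstrap.servers=kafka-broker:9092

# Total events to ingest in one sync round (pre-existing limit)
hoodie.deltastreamer.kafka.source.maxEvents=100000000

# New in this PR: cap events per Kafka batch so one slow or
# under-partitioned pull does not stall the whole round.
# Default Long.MAX_VALUE leaves the feature off; 5000000 is a
# hypothetical value chosen for illustration.
hoodie.deltastreamer.kafka.per.batch.maxEvents=5000000
```

With the per-batch cap below the total maxEvents, a single large pull would be split across several smaller batches instead of being bounded only by the number of Kafka partitions.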
