waitingF opened a new pull request, #8313: URL: https://github.com/apache/hudi/pull/8313
### Change Logs

For the Kafka source, the default parallelism when pulling data from Kafka is the number of Kafka partitions. There are cases where this falls short:

1. When pulling a large amount of data from Kafka (e.g. maxEvents=100000000) and the number of Kafka partitions is not large enough, the pull takes too much time.
2. When there is heavy data skew between Kafka partitions, the pull is blocked by the slowest partition.

To solve those cases, I added a parameter `hoodie.deltastreamer.kafka.per.batch.maxEvents` to control the maximum number of events in one Kafka batch. The default, Long.MAX_VALUE, means the feature is not turned on.

### Impact

_Default should be no impact._

### Risk level (write none, low medium or high below)

_none._

### Documentation Update

- `hoodie.deltastreamer.kafka.per.batch.maxEvents`: controls the maximum number of events pulled from the Kafka source in one batch

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
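As a sketch of how the new knob might be set (only `hoodie.deltastreamer.kafka.per.batch.maxEvents` comes from this PR; the other property names and all values are illustrative assumptions), a DeltaStreamer properties file could cap each Kafka batch like so:

```properties
# Illustrative Kafka source settings (topic/servers values are placeholders)
hoodie.deltastreamer.source.kafka.topic=my_topic
bootstrap.servers=kafka-broker:9092

# Total events to ingest in one sync round (pre-existing limit)
hoodie.deltastreamer.kafka.source.maxEvents=100000000

# New in this PR: cap events per Kafka batch so one slow or
# under-partitioned pull does not stall the whole round.
# Default Long.MAX_VALUE leaves the feature off; 5000000 is a
# hypothetical value chosen for illustration.
hoodie.deltastreamer.kafka.per.batch.maxEvents=5000000
```

With the per-batch cap below the total maxEvents, a single large pull would be split across several smaller batches instead of being bounded only by the number of Kafka partitions.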
