waitingF opened a new pull request, #8376: URL: https://github.com/apache/hudi/pull/8376
### Change Logs

For the Kafka source, the default parallelism when pulling data from Kafka is the number of Kafka partitions. This causes problems in two cases:

1. Pulling a large amount of data from Kafka (e.g. maxEvents=100000000) when there are not enough Kafka partitions: the pull takes too much time.
2. Heavy data skew between Kafka partitions: the pull is blocked by the slowest partition.

To solve these cases, I add a parameter `hoodie.deltastreamer.source.kafka.per.partition.maxEvents` to control the maximum number of events per Kafka partition input. The default, Long.MAX_VALUE, means the feature is not turned on.

Here is a comparison with this feature (max executor cores = 128):

before:

<img width="1370" alt="image" src="https://user-images.githubusercontent.com/19326824/228461120-033f0e98-b170-46f4-8380-c0da33f4ad58.png">

after:

<img width="1319" alt="image" src="https://user-images.githubusercontent.com/19326824/228461364-2d6932cd-66bf-4bd8-806c-a669520af255.png">

The performance improvement is about 3x (and can improve further given more cores).

### Impact

_Default should be no impact._

### Risk level (write none, low medium or high below)

_none._

### Documentation Update

- `hoodie.deltastreamer.source.kafka.per.partition.maxEvents`: controls the maximum number of events per Kafka partition input when pulling data from the Kafka source.

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
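To illustrate the idea behind the new config, here is a minimal, hypothetical sketch (not Hudi's actual implementation): capping events per Kafka partition amounts to splitting each partition's offset range into chunks of at most `perPartitionMaxEvents`, so a skewed or oversized partition can be pulled by several Spark tasks in parallel instead of one. The class and method names below are illustrative only.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: split Kafka offset ranges so no chunk exceeds
// perPartitionMaxEvents. Each long[] is {partitionId, fromOffset, untilOffset}.
public class OffsetRangeSplitter {

  public static List<long[]> split(List<long[]> ranges, long perPartitionMaxEvents) {
    List<long[]> out = new ArrayList<>();
    for (long[] r : ranges) {
      long from = r[1];
      // Emit successive chunks until the partition's until-offset is reached.
      while (from < r[2]) {
        long until = Math.min(from + perPartitionMaxEvents, r[2]);
        out.add(new long[] {r[0], from, until});
        from = until;
      }
    }
    return out;
  }

  public static void main(String[] args) {
    // One skewed partition (offsets 0..1000) and one small one (0..100),
    // capped at 400 events per chunk: partition 0 splits into 3 chunks,
    // partition 1 stays as 1 chunk.
    List<long[]> ranges = new ArrayList<>();
    ranges.add(new long[] {0, 0, 1000});
    ranges.add(new long[] {1, 0, 100});
    List<long[]> chunks = split(ranges, 400);
    System.out.println(chunks.size()); // → 4
  }
}
```

With `perPartitionMaxEvents` left at Long.MAX_VALUE, every partition stays a single chunk, which matches the PR's "no impact by default" claim.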
