chenshzh opened a new pull request, #7019: URL: https://github.com/apache/hudi/pull/7019
### Change Logs

`CopyOnWriteInputFormat#createInputSplits` is invoked by `org.apache.flink.runtime.executiongraph.ExecutionJobVertex` in the JobManager to create file input splits synchronously. In batch mode this was found to take the largest share of job submission time. This PR optimizes it by creating input splits asynchronously in a thread pool executor.

### Impact

Speeds up job submission by reducing the time spent creating input splits.

### Risk level (write none, low medium or high below)

_If medium or high, explain what verification was done to mitigate the risks._

### Documentation Update

- `read.cow.create-split-async.enabled`: Whether to create input splits asynchronously when reading with `CopyOnWriteInputFormat`. Default: true.
- `read.cow.create-split-async.min-parallelism`: Minimum parallelism used to derive the actual parallelism of the thread pool that `CopyOnWriteInputFormat` uses to create input splits asynchronously. Take the number of processors actually available to the JobManager into account before increasing it.
- `read.cow.create-split-async.max-parallelism`: Maximum parallelism used to derive the actual parallelism of the thread pool that `CopyOnWriteInputFormat` uses to create input splits asynchronously. The file count is used instead when it is smaller than this value.

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
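
### Illustrative sketch

To make the approach described under Change Logs and Documentation Update concrete, here is a minimal, self-contained Java sketch of creating splits per file in a bounded thread pool. It is not the code from this PR: the class `AsyncSplitSketch`, the methods `resolveParallelism`, `splitFile` and `createSplits`, and the string form of a split are all hypothetical. The way the min and max options combine (file count capped by the max, then raised to at least the min) is one plausible reading of the option descriptions, not confirmed behavior.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch; names and clamping rule are assumptions, not Hudi's API.
class AsyncSplitSketch {

    // Assumed rule: use the file count when it is below max, never go below min (or 1).
    static int resolveParallelism(int fileCount, int minParallelism, int maxParallelism) {
        int capped = Math.min(fileCount, maxParallelism);
        return Math.max(1, Math.max(minParallelism, capped));
    }

    // Cuts one file into fixed-size byte ranges, rendered as "path:start+length".
    static List<String> splitFile(String path, long fileSize, long splitSize) {
        if (splitSize <= 0) {
            throw new IllegalArgumentException("splitSize must be positive");
        }
        List<String> splits = new ArrayList<>();
        for (long start = 0; start < fileSize; start += splitSize) {
            long length = Math.min(splitSize, fileSize - start);
            splits.add(path + ":" + start + "+" + length);
        }
        return splits;
    }

    // Submits one task per file and gathers results in submission order.
    static List<String> createSplits(Map<String, Long> fileSizes, long splitSize,
                                     int minParallelism, int maxParallelism) {
        List<String> all = new ArrayList<>();
        if (fileSizes.isEmpty()) {
            return all;
        }
        int parallelism = resolveParallelism(fileSizes.size(), minParallelism, maxParallelism);
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (Map.Entry<String, Long> entry : fileSizes.entrySet()) {
                final String path = entry.getKey();
                final long size = entry.getValue();
                futures.add(pool.submit(() -> splitFile(path, size, splitSize)));
            }
            for (Future<List<String>> future : futures) {
                all.addAll(future.get());
            }
            return all;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while creating splits", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("Split creation failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        Map<String, Long> files = new LinkedHashMap<>();
        files.put("part-0001.parquet", 250L);
        files.put("part-0002.parquet", 100L);
        for (String split : createSplits(files, 100L, 1, 4)) {
            System.out.println(split);
        }
    }
}
```

Collecting the futures in submission order keeps the resulting split list deterministic regardless of which task finishes first, and shutting the pool down in `finally` avoids leaking threads in a long-lived JobManager process.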
