chenshzh opened a new pull request, #7019:
URL: https://github.com/apache/hudi/pull/7019

   ### Change Logs
   
   `CopyOnWriteInputFormat#createInputSplits` is invoked by 
`org.apache.flink.runtime.executiongraph.ExecutionJobVertex` in the JobManager 
to create file input splits synchronously. 
   
   In batch mode, this step was found to take the largest share of job 
submission time.
   
   This PR optimizes it by creating the input splits asynchronously in a 
thread pool executor.
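
For readers unfamiliar with the pattern, the idea can be sketched as follows. This is a minimal illustration only, not the code in this PR: the class `AsyncSplitSketch`, the method `createSplitForFile`, and the use of plain strings for paths and splits are hypothetical stand-ins for the per-file work done inside `CopyOnWriteInputFormat`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class AsyncSplitSketch {

    // Hypothetical stand-in for the per-file work (file status lookup, split building).
    static String createSplitForFile(String path) {
        return "split:" + path;
    }

    // Submits one task per file to a fixed-size pool and collects results in input order.
    static List<String> createSplits(List<String> paths, int parallelism)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        try {
            List<Callable<String>> tasks = new ArrayList<>();
            for (String path : paths) {
                tasks.add(() -> createSplitForFile(path));
            }
            List<String> splits = new ArrayList<>();
            // invokeAll blocks until every task completes; futures keep the task order.
            for (Future<String> future : pool.invokeAll(tasks)) {
                splits.add(future.get());
            }
            return splits;
        } finally {
            // Always release the worker threads, even if a task failed.
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<String> paths = Arrays.asList("f1.parquet", "f2.parquet", "f3.parquet");
        System.out.println(createSplits(paths, 2));
    }
}
```

The caller still waits for all splits before returning, so the gain comes from overlapping the per-file work rather than from deferring it.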
   
   ### Impact
   
   Speeds up job submission by reducing the time spent creating input splits.
   
   ### Risk level (write none, low, medium or high below)
   
   _If medium or high, explain what verification was done to mitigate the 
risks._
   
   ### Documentation Update
   
   `read.cow.create-split-async.enabled`: Whether to create input splits 
asynchronously for CopyOnWriteInputFormat reads. Default: true.
   `read.cow.create-split-async.min-parallelism`: Minimum parallelism used to 
derive the actual parallelism of the thread pool that creates input splits 
asynchronously for CopyOnWriteInputFormat.
   Take the number of processors actually available to the JobManager into 
account before increasing it.
   `read.cow.create-split-async.max-parallelism`: Maximum parallelism used to 
derive the actual parallelism of the thread pool that creates input splits 
asynchronously for CopyOnWriteInputFormat. 
   The file count is used instead when it is smaller than this value.
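
One plausible reading of how the two bounds combine is sketched below. Only the "use the file count when it is smaller than the max" rule comes from the description above; treating the min option as a lower bound on the result is an assumption, and the class and method names are hypothetical rather than taken from the PR.

```java
public class SplitParallelismSketch {

    // Hypothetical helper: derives the thread pool size from the two options and the file count.
    static int resolveParallelism(int minParallelism, int maxParallelism, int fileCount) {
        // From the option description: the file count wins when it is below the max.
        int capped = Math.min(maxParallelism, fileCount);
        // Assumption: the min option acts as a floor on the resulting parallelism.
        return Math.max(minParallelism, capped);
    }

    public static void main(String[] args) {
        System.out.println(resolveParallelism(2, 8, 100));
        System.out.println(resolveParallelism(2, 8, 5));
        System.out.println(resolveParallelism(2, 8, 1));
    }
}
```

Under this reading, a table with many files is bounded by the max option, while a table with only a handful of files gets at most one thread per file, subject to the assumed floor.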
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed

