rkhachatryan opened a new pull request #18989:
URL: https://github.com/apache/flink/pull/18989
## What is the purpose of the change
Currently, all tasks perform materialization at roughly the same time.
This creates a spike in state deletion requests from JM to DFS.
That can delay new checkpoints because of how JM IO tasks are scheduled:
- every deletion is a separate task in the IO thread pool queue
- the queue is FIFO (unbounded)
- the default number of threads in the pool equals the number of cores
- so the new checkpoint has to wait for an available thread to initialize
its location
Note that while the checkpoint is waiting, it is already "triggered" on the JM
but has not yet been broadcast to any TM.
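The queueing effect described above can be reproduced with a plain JDK thread pool. The following is a minimal sketch, not Flink's actual IO executor: `IoQueueDemo` and its method are hypothetical names, the sleep stands in for a DFS delete call, and the pool is assumed to behave like `Executors.newFixedThreadPool` (fixed thread count, unbounded FIFO queue).

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical demo class; it does not exist in Flink. */
final class IoQueueDemo {

    private IoQueueDemo() {}

    /**
     * Queues {@code deletions} slow tasks followed by one "checkpoint init" task.
     * Returns how many deletions had completed when the checkpoint task started.
     */
    static int deletionsDoneBeforeCheckpointInit(int threads, int deletions, long deletionMs) {
        // newFixedThreadPool is backed by an unbounded FIFO LinkedBlockingQueue.
        ExecutorService ioPool = Executors.newFixedThreadPool(threads);
        AtomicInteger done = new AtomicInteger();
        try {
            for (int i = 0; i < deletions; i++) {
                ioPool.execute(() -> {
                    try {
                        Thread.sleep(deletionMs); // stand-in for a DFS delete request
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                    done.incrementAndGet();
                });
            }
            // Submitted last, so it is dequeued only after every deletion was picked up.
            Callable<Integer> checkpointInit = done::get;
            return ioPool.submit(checkpointInit).get(30, TimeUnit.SECONDS);
        } catch (Exception e) {
            throw new IllegalStateException("demo run failed", e);
        } finally {
            ioPool.shutdownNow();
        }
    }
}
```

With `threads = 4` and `deletions = 40`, the checkpoint task only starts once at least 37 deletions have finished, because at most three other threads can still be busy with a deletion at that point. The larger the burst of deletions, the longer the new checkpoint waits.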
This PR introduces a random delay in materialization so that tasks no longer
materialize at roughly the same time.
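As an illustration of the idea only, the sketch below draws a random initial delay per task and then runs materialization at a fixed period. It is not the code from this PR: `MaterializationJitter`, `initialDelayMs`, and `schedule` are hypothetical names, and drawing the delay uniformly from one full period is an assumption.

```java
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper; it does not exist in Flink. */
final class MaterializationJitter {

    private MaterializationJitter() {}

    /** Returns a random delay in [0, periodMs) for a task's first materialization. */
    static long initialDelayMs(long periodMs) {
        if (periodMs <= 0) {
            throw new IllegalArgumentException("periodMs must be positive");
        }
        // nextLong(bound) is uniform over [0, bound).
        return ThreadLocalRandom.current().nextLong(periodMs);
    }

    /** Schedules periodic materialization with a randomized first run (assumed policy). */
    static ScheduledFuture<?> schedule(
            ScheduledExecutorService timer, Runnable materialize, long periodMs) {
        return timer.scheduleAtFixedRate(
                materialize, initialDelayMs(periodMs), periodMs, TimeUnit.MILLISECONDS);
    }
}
```

Because each task draws its own offset, the resulting state deletions reach the JM spread over the whole period instead of arriving as one burst.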
## Does this pull request potentially affect one of the following parts:
- Dependencies (does it add or upgrade a dependency): no
- The public API, i.e., is any changed class annotated with
`@Public(Evolving)`: no
- The serializers: no
- The runtime per-record code paths (performance sensitive): no
- Anything that affects deployment or recovery: JobManager (and its
components), Checkpointing, Kubernetes/Yarn/Mesos, ZooKeeper: no
- The S3 file system connector: no
## Documentation
- Does this pull request introduce a new feature? no
- If yes, how is the feature documented? no