pltbkd opened a new pull request #16318:
URL: https://github.com/apache/flink/pull/16318
## What is the purpose of the change
This PR clears the pending requests in ExecutionSlotAllocator when a job
transitions to CANCELING or FAILING, to avoid a StackOverflowError when a
large-scale job is canceling or failing.
The pending requests in ExecutionSlotAllocator are not cleared when a job
transitions to CANCELING or FAILING, even though all vertices are canceled and
their assigned slots are returned. A returned slot can be used to fulfill the
pending request of a CANCELED vertex; the assignment then fails immediately,
the slot is returned again, and it is used to fulfill the request of another
vertex, recursively. With many vertices this recursion can overflow the stack,
which is a fatal error and crashes the JM.
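
The recursion described above can be illustrated with a small toy model. This is only a sketch, not Flink code: `ToySlotPool`, `returnSlot`, `cancelAllPending` and `maxDepthOnCancel` are hypothetical names invented for the illustration, and the model assumes that every pending request belongs to an already canceled vertex and that a failed assignment returns the slot synchronously.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model only; none of these types exist in Flink.
class PendingRequestSketch {

    /** Hypothetical pool: a returned slot is offered to the next pending request right away. */
    static final class ToySlotPool {
        final Deque<Integer> pending = new ArrayDeque<>();
        int maxDepth = 0;
        private int depth = 0;

        void returnSlot() {
            depth++;
            maxDepth = Math.max(maxDepth, depth);
            Integer canceledVertex = pending.poll();
            if (canceledVertex != null) {
                // Assumption: the vertex is already CANCELED, so the assignment
                // fails at once and the slot is returned again in the same call stack.
                returnSlot();
            }
            depth--;
        }

        /** Dropping every pending request up front breaks the chain. */
        void cancelAllPending() {
            pending.clear();
        }
    }

    /** Returns the deepest nesting of returnSlot() reached for one returned slot. */
    static int maxDepthOnCancel(int pendingCount, boolean clearFirst) {
        ToySlotPool pool = new ToySlotPool();
        for (int i = 0; i < pendingCount; i++) {
            pool.pending.add(i);
        }
        if (clearFirst) {
            pool.cancelAllPending();
        }
        pool.returnSlot();
        return pool.maxDepth;
    }

    public static void main(String[] args) {
        System.out.println(maxDepthOnCancel(100, false)); // prints 101
        System.out.println(maxDepthOnCancel(100, true));  // prints 1
    }
}
```

In this model the call depth grows linearly with the number of pending requests, which is why a large enough job can exhaust the stack, while clearing the requests first keeps the depth constant.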
We also tried to improve the call stack of slot assignment, but found it too
complex to maintain the task and slot status if the steps that notify new
slots or assign new resources are made asynchronous, so we gave up on that plan
and only clear all pending requests to avoid entering the recursive loop.
## Brief change log
- Added SchedulerBase#cancelAllPendingSlotRequestsInternal(), which is
called when the execution version is updated due to canceling or failing
- Added DefaultScheduler#cancelAllPendingSlotRequestsInternal() as an
implementation
The changes do not apply to AdaptiveScheduler, which means the issue
may still occur when using AdaptiveScheduler.
## Verifying this change
This change added tests and can be verified as follows:
- Added a test that validates that all pending requests are canceled when the
job is canceling or failing
## Does this pull request potentially affect one of the following parts:
- Dependencies (does it add or upgrade a dependency): no
- The public API, i.e., is any changed class annotated with
`@Public(Evolving)`: no
- The serializers: no
- The runtime per-record code paths (performance sensitive): no
- Anything that affects deployment or recovery: JobManager (and its
components), Checkpointing, Kubernetes/Yarn/Mesos, ZooKeeper: yes
- The S3 file system connector: no
## Documentation
- Does this pull request introduce a new feature? no