jinxing64 opened a new pull request #16118:
URL: https://github.com/apache/flink/pull/16118
## What is the purpose of the change
When a TM disconnects (`JobMaster#disconnectTaskManager`),
`JobMasterPartitionTrackerImpl` stops tracking all the partitions the TM ever
produced, i.e. the lifecycle of shuffle data is bound to the computing resource
(TM). This works fine for the internal shuffle service, but not for a remote
shuffle service. If shuffle data is stored remotely, the lifecycle of a
completed partition can be decoupled from the TM: the TM can safely be
released once no computing task runs on it, and further shuffle read
requests can be directed to the remote shuffle cluster.
This PR proposes to fix `JobMasterPartitionTrackerImpl` so that it handles the
above scenario properly.
## Brief change log
- Fix `JobMasterPartitionTrackerImpl#stopTrackingPartitionsFor` -- tracking is
stopped only for internal partitions, i.e. those stored locally on the TM.
- Fix `JobMaster#jobStatusChanged` -- the current logic traverses
`registeredTaskManagers` and releases or promotes the related partitions, which
is incorrect. With the above change, external partitions, which are stored on
an external shuffle service, will NOT be released when a TM disconnects.
Releasing partitions by traversing the existing `registeredTaskManagers` could
therefore leak external partitions.
- Exclude internal partitions when invoking
`internalReleasePartitionsOnShuffleMaster`
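
To illustrate the intent of the first two items, here is a minimal, self-contained sketch. It is not the code in this PR and does not use Flink's actual classes: `PartitionTrackerSketch`, `Location`, `TrackedPartition`, and `releaseAllOnJobFinished` are hypothetical names made up for illustration, and partitions and TMs are identified by plain strings. It only shows the idea that a TM disconnect drops internal partitions, while the remaining (external) partitions are released from the tracker's own state rather than by walking the registered TMs.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical, simplified tracker; NOT Flink's JobMasterPartitionTrackerImpl. */
public class PartitionTrackerSketch {

    /** Where a partition's data is stored (illustrative enum, not a Flink type). */
    enum Location { INTERNAL, EXTERNAL }

    /** Minimal bookkeeping entry for one tracked partition. */
    static final class TrackedPartition {
        final String partitionId;
        final String producerTmId;
        final Location location;

        TrackedPartition(String partitionId, String producerTmId, Location location) {
            this.partitionId = partitionId;
            this.producerTmId = producerTmId;
            this.location = location;
        }
    }

    // Insertion-ordered so that the returned lists are deterministic.
    private final Map<String, TrackedPartition> tracked = new LinkedHashMap<>();

    void startTracking(String partitionId, String producerTmId, Location location) {
        tracked.put(partitionId, new TrackedPartition(partitionId, producerTmId, location));
    }

    /** On TM disconnect: stop tracking only partitions stored locally on that TM. */
    List<String> stopTrackingPartitionsFor(String tmId) {
        List<String> stopped = new ArrayList<>();
        tracked.values().removeIf(p -> {
            boolean boundToTm = p.producerTmId.equals(tmId) && p.location == Location.INTERNAL;
            if (boundToTm) {
                stopped.add(p.partitionId);
            }
            return boundToTm;
        });
        return stopped;
    }

    /**
     * On job termination: release everything still tracked, based on the tracker's
     * own state, so partitions of already-disconnected TMs are not missed.
     */
    List<String> releaseAllOnJobFinished() {
        List<String> released = new ArrayList<>(tracked.keySet());
        tracked.clear();
        return released;
    }

    boolean isTracked(String partitionId) {
        return tracked.containsKey(partitionId);
    }

    public static void main(String[] args) {
        PartitionTrackerSketch tracker = new PartitionTrackerSketch();
        tracker.startTracking("p-local", "tm-1", Location.INTERNAL);
        tracker.startTracking("p-remote", "tm-1", Location.EXTERNAL);

        System.out.println(tracker.stopTrackingPartitionsFor("tm-1")); // [p-local]
        System.out.println(tracker.releaseAllOnJobFinished());         // [p-remote]
    }
}
```

In this sketch, `p-remote` survives the disconnect of `tm-1` and is only released when the job finishes, which is the leak-free behaviour the change log describes for external partitions.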
## Verifying this change
Added new tests in `JobMasterPartitionTrackerImplTest`.
## Does this pull request potentially affect one of the following parts:
- Dependencies (does it add or upgrade a dependency): (yes / **no**)
- The public API, i.e., is any changed class annotated with
`@Public(Evolving)`: (yes / **no**)
- The serializers: (yes / **no** / don't know)
- The runtime per-record code paths (performance sensitive): (yes / **no**
/ don't know)
- Anything that affects deployment or recovery: JobManager (and its
components), Checkpointing, Kubernetes/Yarn/Mesos, ZooKeeper: (yes / **no** /
don't know)
- The S3 file system connector: (yes / **no** / don't know)
## Documentation
- Does this pull request introduce a new feature? (yes / **no**)
- If yes, how is the feature documented? (**not applicable** / docs /
JavaDocs / not documented)