Jin Xing created FLINK-22676:
--------------------------------
Summary: The partition tracker should support remote shuffle properly
Key: FLINK-22676
URL: https://issues.apache.org/jira/browse/FLINK-22676
Project: Flink
Issue Type: Sub-task
Components: Runtime / Network
Reporter: Jin Xing
In current Flink, a data partition is bound to the ResourceID of a TM in
Execution#startTrackingPartitions, and the partition tracker stops tracking
the corresponding partitions when that TM disconnects
(JobMaster#disconnectTaskManager), i.e. the lifecycle of shuffle data is bound
to the computing resource (TM). This works fine for the internal shuffle
service, but not for a remote shuffle service. Because the shuffle data is
stored remotely, the lifecycle of a completed partition can be decoupled from
the TM, i.e. a TM can safely be released once no computing task is running on
it, and further shuffle read requests can be directed to the remote shuffle
cluster. In addition, when a TM is lost, its completed data partitions on the
remote shuffle cluster do not need to be reproduced.
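
The intended behaviour could be sketched roughly as follows. This is only an
illustration and not Flink's actual partition tracker: the class
PartitionTrackerSketch, the Location enum, and the methods startTracking,
onTaskManagerDisconnected and isTracked are all hypothetical names, and
partitions and TMs are identified by plain strings for brevity. The idea is
that a TM disconnect only drops partitions whose data lives on that TM, while
partitions stored on a remote shuffle cluster stay tracked.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; none of these names are Flink APIs.
class PartitionTrackerSketch {

    // Assumption: each partition knows where its shuffle data is stored.
    public enum Location { LOCAL_TM, REMOTE_SHUFFLE }

    // partitionId -> ResourceID of the TM that produced the partition
    private final Map<String, String> producerTm = new HashMap<>();
    // partitionId -> where the shuffle data is stored
    private final Map<String, Location> location = new HashMap<>();

    public void startTracking(String partitionId, String tmResourceId, Location loc) {
        producerTm.put(partitionId, tmResourceId);
        location.put(partitionId, loc);
    }

    // Called when a TM disconnects. Only partitions whose data is stored on
    // that TM are dropped; remotely stored partitions remain readable and
    // therefore stay tracked. Returns the ids of the dropped partitions.
    public List<String> onTaskManagerDisconnected(String tmResourceId) {
        List<String> dropped = new ArrayList<>();
        Iterator<Map.Entry<String, String>> it = producerTm.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, String> entry = it.next();
            String partitionId = entry.getKey();
            boolean producedOnTm = entry.getValue().equals(tmResourceId);
            if (producedOnTm && location.get(partitionId) == Location.LOCAL_TM) {
                it.remove();
                location.remove(partitionId);
                dropped.add(partitionId);
            }
        }
        return dropped;
    }

    public boolean isTracked(String partitionId) {
        return producerTm.containsKey(partitionId);
    }

    public static void main(String[] args) {
        PartitionTrackerSketch tracker = new PartitionTrackerSketch();
        tracker.startTracking("p-local", "tm-1", Location.LOCAL_TM);
        tracker.startTracking("p-remote", "tm-1", Location.REMOTE_SHUFFLE);
        System.out.println("dropped: " + tracker.onTaskManagerDisconnected("tm-1"));
        System.out.println("p-remote tracked: " + tracker.isTracked("p-remote"));
    }
}
```

In this sketch the decision is made per partition rather than per TM, so a job
could mix internal and remote shuffle. Whether the real fix should key on the
shuffle descriptor, the shuffle service, or something else is left open here.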
--
This message was sent by Atlassian Jira
(v8.3.4#803005)