curcur opened a new pull request #13880:
URL: https://github.com/apache/flink/pull/13880


   ## What is the purpose of the change
   
   This PR includes three changes:
   - Enables downstream failover for approximate local recovery. That is, if a 
task fails, the task itself and all of its downstream tasks restart. This is 
achieved by reusing the existing `RestartPipelinedRegionFailoverStrategy` and 
treating each individual task connected by 
`ResultPartitionType.PIPELINED_APPROXIMATE` as a separate region.
   - Exposes the approximate downstream failover flag as an internal feature 
flag in CheckpointConfig: `approximateLocalRecovery`.
   - Adds ITCases for downstream failover.
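
To illustrate the restart rule described above (the failed task plus all of its downstream tasks), here is a minimal, self-contained sketch. It is not the code in this PR and does not use any Flink classes: `DownstreamFailoverSketch`, `tasksToRestart`, and the string task names are hypothetical, and the job graph is assumed to be given as a simple producer-to-consumers map.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; names do not come from the Flink code base.
public class DownstreamFailoverSketch {

    /**
     * Returns the failed task plus every task reachable from it by
     * following producer -> consumer edges (breadth-first).
     */
    static Set<String> tasksToRestart(Map<String, List<String>> consumers, String failedTask) {
        Set<String> toRestart = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        toRestart.add(failedTask);
        queue.add(failedTask);
        while (!queue.isEmpty()) {
            String task = queue.poll();
            for (String consumer : consumers.getOrDefault(task, List.of())) {
                // add() returns false for tasks already scheduled for restart
                if (toRestart.add(consumer)) {
                    queue.add(consumer);
                }
            }
        }
        return toRestart;
    }

    public static void main(String[] args) {
        // Assumed topology: source -> map -> sink
        Map<String, List<String>> consumers = Map.of(
                "source", List.of("map"),
                "map", List.of("sink"));
        // A failure in "map" restarts "map" and "sink", but not "source".
        System.out.println(tasksToRestart(consumers, "map"));
    }
}
```

In this sketch the upstream task keeps running, which matches the intent of approximate local recovery as described in this PR.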
   
   
   ## Brief change log
   
   - Introduce an attribute "reconnectable" in ResultPartitionType to indicate 
whether the partition is reconnectable. Note that this is only a temporary 
solution for now. It will be removed after:
         - 1. Approximate local recovery has its own failover strategy to 
restart the failed set of tasks, instead of
              restarting the downstream of failed tasks by relying on 
`RestartPipelinedRegionFailoverStrategy`.
         - 2. FLINK-19895: Unify the life cycle of the ResultPartitionType 
Pipelined family. There is also a good discussion on this in FLINK-19632.
   - Change PipelinedRegionComputeUtil#buildRawRegions to build each task 
connected by PIPELINED_APPROXIMATE as its own region.
   - Change JobMasterPartitionTrackerImpl#startTrackingPartition to track 
PIPELINED_APPROXIMATE partitions.
   - Introduce an internal flag `approximateLocalRecovery` in CheckpointConfig 
to enable the feature.
   - Add ITCases that also serve as examples of how to use approximate 
downstream failover.
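
The region-building idea in the change log can be sketched as follows. This is a simplified, hypothetical model, not the actual `PipelinedRegionComputeUtil` code: `RawRegionSketch`, `PartitionType`, `Edge`, and `buildRegions` are illustrative names, and the `reconnectable` values shown for the two partition types are assumptions made for the example. Tasks joined by a non-reconnectable edge are merged into one region, while a reconnectable edge never merges regions, so a task connected only through PIPELINED_APPROXIMATE stays in a region of its own.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; not the Flink implementation.
public class RawRegionSketch {

    // Assumed flag values: only the approximate type is reconnectable here.
    enum PartitionType {
        PIPELINED(false),
        PIPELINED_APPROXIMATE(true);

        final boolean reconnectable;

        PartitionType(boolean reconnectable) {
            this.reconnectable = reconnectable;
        }
    }

    static final class Edge {
        final String producer;
        final String consumer;
        final PartitionType type;

        Edge(String producer, String consumer, PartitionType type) {
            this.producer = producer;
            this.consumer = consumer;
            this.type = type;
        }
    }

    /** Groups tasks into regions; reconnectable edges do not merge regions. */
    static Set<Set<String>> buildRegions(List<String> tasks, List<Edge> edges) {
        Map<String, Set<String>> regionOf = new LinkedHashMap<>();
        for (String task : tasks) {
            Set<String> region = new LinkedHashSet<>();
            region.add(task);
            regionOf.put(task, region); // every task starts in its own region
        }
        for (Edge edge : edges) {
            if (edge.type.reconnectable) {
                continue; // producer and consumer stay in separate regions
            }
            Set<String> merged = regionOf.get(edge.producer);
            Set<String> absorbed = regionOf.get(edge.consumer);
            if (merged != absorbed) {
                merged.addAll(absorbed);
                for (String task : absorbed) {
                    regionOf.put(task, merged);
                }
            }
        }
        return new LinkedHashSet<>(regionOf.values());
    }

    public static void main(String[] args) {
        List<String> tasks = List.of("a", "b", "c");
        List<Edge> edges = List.of(
                new Edge("a", "b", PartitionType.PIPELINED),
                new Edge("b", "c", PartitionType.PIPELINED_APPROXIMATE));
        // "a" and "b" share a region; "c" forms a region by itself.
        System.out.println(buildRegions(tasks, edges));
    }
}
```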
   
   ## Verifying this change
   
   Unit tests + ITCases
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): (no)
     - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: (yes)
     - The serializers: (no)
     - The runtime per-record code paths (performance sensitive): (no)
     - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Kubernetes/Yarn/Mesos, ZooKeeper: (yes)
     - The S3 file system connector: (no)
   
   ## Documentation
   
     - Does this pull request introduce a new feature? (no)
        An internal feature only, not ready for public usage.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

