Thesharing opened a new pull request #15312:
URL: https://github.com/apache/flink/pull/15312


   ## What is the purpose of the change
   
   *This pull request introduces the optimization of calculating tasks to 
restart in RestartPipelinedRegionFailoverStrategy. The related method is 
`RestartPipelinedRegionFailoverStrategy#getRegionsToRestart`.*
   
   *RestartPipelinedRegionFailoverStrategy is used to calculate the tasks to 
restart when a task failure occurs. It contains two parts: firstly calculate 
the regions to restart; then add all the tasks in these regions to the 
restarting queue.*
   
   *The bottleneck is mainly in the first part. This part traverses all the 
upstream and downstream regions of the failed region to determine whether they 
should be restarted or not.*
   
   *For the current failed region, if its consumed result partition is not 
available, the owner, i.e., the upstream region should restart. Also, since the 
failed region needs to restart, its result partition won't be available, all 
the downstream regions need to restart, too.*
   
   *Based on FLINK-21328 and FLINK-21330, the consumed result partitions of a 
pipelined region and the consumer vertices of a DefaultResultPartition are 
already grouped. Here we can use a HashSet to record the visited groups. For 
vertices connected with all-to-all edges, they will only need to traverse the 
group once. This decreases the time complexity from O(N^2) to O(N).*
   
   *For more details, please check FLINK-21331.*
   
   ## Brief change log
   
     - *Optimize RestartPipelinedRegionFailoverStrategy#getRegionsToRestart*
   
   ## Verifying this change
   
   *Since this optimization does not change the original logic of calculating 
tasks to restart in RestartPipelinedRegionFailoverStrategy, we believe that 
this change is already covered by existing test: 
RestartPipelinedRegionFailoverStrategyTest.*
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): (yes / **no**)
     - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: (yes / **no**)
     - The serializers: (yes / **no** / don't know)
     - The runtime per-record code paths (performance sensitive): (yes / **no** 
/ don't know)
     - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Kubernetes/Yarn/Mesos, ZooKeeper: (**yes** / no / 
don't know)
     - The S3 file system connector: (yes / **no** / don't know)
   
   ## Documentation
   
     - Does this pull request introduce a new feature? (yes / **no**)
     - If yes, how is the feature documented? (**not applicable** / docs / 
JavaDocs / not documented)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to