[jira] [Commented] (FLINK-19693) Scheduler Change for Approximate Local Recovery to Restart Downstream of a Failed Task

Yuan Mei (Jira) Mon, 09 Nov 2020 03:02:19 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-19693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17228517#comment-17228517
 ]


Yuan Mei commented on FLINK-19693:
----------------------------------

Hey [[email protected]], thanks so much for the write-up and question! Sorry 
for the late response; I was quite occupied before the code freeze :(

 

These are really great points, and I had similar considerations when 
I introduced the new ResultPartitionType PIPELINED_APPROXIMATE and the 
corresponding `reconnectable` attribute. Some thoughts here:

 

Each shuffle mode may require subtly different
 # scheduling strategy
 # failover strategy
 # lifecycle management
 # runtime implementations

Today, it seems quite difficult to extend some (if not all) of the above and 
to tie items 1-4 together as a whole. This is one of the most valuable lessons 
I learned from implementing approximate local recovery.

So, my question for [~trohrmann], [~pnowojski], and [~sewen] is:

Do we have plans to expose more interfaces to ease this? Here are some 
preliminary thoughts, which would probably also be useful if we later want to 
support channel data stored in DSTL:
 # User-defined/configurable Result Partition Types with configurable attributes
 # Lifecycle management for different Result Partition Types that can be 
registered with the JobMaster
 # User-defined scheduling strategy based on Result Partition Type
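
To make the ideas above a bit more concrete, here is a very rough sketch of what a registrable partition type could look like. Everything in it is hypothetical: `PartitionType`, `RestartPolicy`, and `PartitionTypeRegistrySketch` are names invented for illustration and are not existing Flink interfaces. The only point is that attributes such as `reconnectable` become data on the type, and the strategy is looked up per type instead of being hard-coded.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch only; none of these types exist in Flink.
final class PartitionTypeRegistrySketch {

    /** Hypothetical user-defined partition type with configurable attributes. */
    static final class PartitionType {
        final String name;
        final boolean reconnectable;

        PartitionType(String name, boolean reconnectable) {
            this.name = name;
            this.reconnectable = reconnectable;
        }
    }

    /** Hypothetical hook that a scheduler could consult for a partition type. */
    interface RestartPolicy {
        String describe();
    }

    // Policies keyed by partition type name.
    private final Map<String, RestartPolicy> policies = new HashMap<>();

    /** Registers the policy to use for the given partition type. */
    void register(PartitionType type, RestartPolicy policy) {
        policies.put(type.name, policy);
    }

    /** Looks up the policy for a type and fails loudly if none was registered. */
    RestartPolicy policyFor(PartitionType type) {
        RestartPolicy policy = policies.get(type.name);
        if (policy == null) {
            throw new IllegalStateException("No policy registered for " + type.name);
        }
        return policy;
    }

    public static void main(String[] args) {
        PartitionTypeRegistrySketch registry = new PartitionTypeRegistrySketch();
        PartitionType approximate = new PartitionType("PIPELINED_APPROXIMATE", true);
        registry.register(approximate, () -> "restart the failed task and its downstream tasks");
        System.out.println(registry.policyFor(approximate).describe());
    }
}
```

In a real design the policy would of course be a proper strategy object rather than a description string; the string only keeps the sketch short.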

 

> Scheduler Change for Approximate Local Recovery to Restart Downstream of a 
> Failed Task
> --------------------------------------------------------------------------------------
>
>                 Key: FLINK-19693
>                 URL: https://issues.apache.org/jira/browse/FLINK-19693
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Runtime / Coordination
>            Reporter: Yuan Mei
>            Assignee: Yuan Mei
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.12.0
>
>
> Enables downstream failover for approximate local recovery.
> That is, if a task fails, the task itself and all of its downstream tasks 
> restart. This is achieved by reusing the existing 
> {{RestartPipelinedRegionFailoverStrategy}} and treating each individual task 
> connected by ResultPartitionType.PIPELINED_APPROXIMATE as a separate region.
>  
> It introduces an attribute "reconnectable" in ResultPartitionType to indicate 
> whether the partition is reconnectable. Note that this is only a temporary 
> solution. It will be removed after:
>  # Approximate local recovery has its own failover strategy to restart the 
> failed set of tasks instead of restarting downstream of failed tasks 
> depending on {@link RestartPipelinedRegionFailoverStrategy}
>  # FLINK-19895: Unify the life cycle of ResultPartitionType Pipelined Family. 
> There is also a good discussion on this in FLINK-19632.
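
The restart rule in the description above (a failed task restarts together with everything downstream of it) can be sketched as a plain graph traversal. This is only an illustration under simplifying assumptions: tasks are identified by strings, the topology is a hypothetical adjacency map, and `tasksToRestart` is an invented helper. It is not the code of `RestartPipelinedRegionFailoverStrategy`, which works on pipelined regions.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only; not Flink's failover strategy implementation.
final class DownstreamRestartSketch {

    /**
     * Returns the failed task plus every task reachable from it by following
     * downstream edges (breadth-first), in discovery order.
     */
    static Set<String> tasksToRestart(Map<String, List<String>> downstream, String failedTask) {
        Set<String> toRestart = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(failedTask);
        while (!queue.isEmpty()) {
            String task = queue.poll();
            // add() returns false for tasks already visited, so each task is expanded once.
            if (toRestart.add(task)) {
                queue.addAll(downstream.getOrDefault(task, List.of()));
            }
        }
        return toRestart;
    }

    public static void main(String[] args) {
        // Hypothetical topology: source -> map -> sink
        Map<String, List<String>> downstream = Map.of(
                "source", List.of("map"),
                "map", List.of("sink"));
        // The upstream "source" keeps running; only "map" and "sink" restart.
        System.out.println(tasksToRestart(downstream, "map")); // prints [map, sink]
    }
}
```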



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
