[jira] [Commented] (FLINK-18113) Single Task Failure Recovery API Abstraction

Yuan Mei (Jira) Mon, 07 Dec 2020 07:47:06 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-18113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17245294#comment-17245294
 ]


Yuan Mei commented on FLINK-18113:
----------------------------------

Some thoughts here:

Each shuffle mode may require a subtly different:
 # scheduling strategy
 # failover strategy
 # lifecycle management
 # runtime implementations

To some extent, these four items are correlated in some way or the other.

For example, PIPELINED_APPROXIMATE is *pipelined* in the sense that downstream 
tasks can start consuming data before upstream tasks finish. However, it is 
also *blocking* in the sense that its result partitions are re-connectable 
(but not re-consumable, so strictly speaking it is not blocking either). 
Its runtime implementation differs slightly from pipelined because it has to 
handle partial records, as done in FLINK-19547. It needs a dedicated failover 
strategy that restarts only the failed tasks, and the existing scheduling 
strategy may or may not be reusable (FLINK-20048). 
This is what I mean by *"correlated in some way or the other"*.
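
The trait combinations in the example above can be written down as a minimal 
Java sketch. Everything here is hypothetical and only for illustration: the 
Trait and Mode names are taken from the wording of this comment, and this is 
not Flink's actual ResultPartitionType.

```java
import java.util.EnumSet;
import java.util.Set;

class ShuffleModeTraitsSketch {
    /** Hypothetical attributes, named after the wording of this comment. */
    enum Trait { PIPELINED, RECONNECTABLE, RECONSUMABLE }

    /** Illustrative modes only; this is not Flink's ResultPartitionType. */
    enum Mode {
        PIPELINED(EnumSet.of(Trait.PIPELINED)),
        BLOCKING(EnumSet.of(Trait.RECONNECTABLE, Trait.RECONSUMABLE)),
        PIPELINED_APPROXIMATE(EnumSet.of(Trait.PIPELINED, Trait.RECONNECTABLE));

        private final Set<Trait> traits;

        Mode(Set<Trait> traits) {
            this.traits = traits;
        }

        boolean has(Trait trait) {
            return traits.contains(trait);
        }
    }

    /** Downstream tasks may start before upstream tasks finish. */
    static boolean consumerCanStartEarly(Mode mode) {
        return mode.has(Trait.PIPELINED);
    }

    /** A restarted consumer can attach to the existing partition again. */
    static boolean consumerCanReconnect(Mode mode) {
        return mode.has(Trait.RECONNECTABLE);
    }

    /** A restarted consumer can read the already produced data again. */
    static boolean consumerCanReplay(Mode mode) {
        return mode.has(Trait.RECONSUMABLE);
    }

    public static void main(String[] args) {
        for (Mode mode : Mode.values()) {
            System.out.println(mode
                    + " startEarly=" + consumerCanStartEarly(mode)
                    + " reconnect=" + consumerCanReconnect(mode)
                    + " replay=" + consumerCanReplay(mode));
        }
    }
}
```

In this sketch PIPELINED_APPROXIMATE shares one trait with each of the other 
two modes, which is why it does not fit cleanly into either category.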

Hence, I do not think the four items above can be treated as completely 
independent of each other. However, today it seems quite difficult to extend 
some (if not all) of them so that items 1-4 work together as a whole, and 
this is one of the most valuable things I've learned from implementing 
approximate local recovery.

So, my question is: 

Do we have plans to expose more interfaces to ease extension? Here are some 
early thoughts, which would probably also be useful if we later want to 
support channel data stored in DSTL:
 # User-defined/configurable Result Partition Type with configurable attributes
 # Lifecycle management for each Result Partition Type, which can be 
registered with the JobMaster
 # User-defined scheduling strategy based on Result Partition Type
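
As a rough illustration of the list above, here is a minimal Java sketch of 
a registry that ties a user-defined partition type to its lifecycle and 
scheduling behavior. All names (PartitionTypeSpec, LifecycleHook, 
SchedulingHook, the registry itself) are hypothetical and do not exist in 
Flink. The values registered for PIPELINED_APPROXIMATE are assumptions that 
follow from the "re-connectable" and "pipelined" description earlier in this 
comment.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class PartitionTypeRegistrySketch {
    /** Hypothetical lifecycle hook; not a Flink interface. */
    interface LifecycleHook {
        boolean releaseOnConsumerFailure();
    }

    /** Hypothetical scheduling hook; not a Flink interface. */
    interface SchedulingHook {
        boolean startConsumerBeforeProducerFinishes();
    }

    /** Bundles a partition type name with its pluggable behavior. */
    static final class PartitionTypeSpec {
        final String name;
        final LifecycleHook lifecycle;
        final SchedulingHook scheduling;

        PartitionTypeSpec(String name, LifecycleHook lifecycle, SchedulingHook scheduling) {
            this.name = name;
            this.lifecycle = lifecycle;
            this.scheduling = scheduling;
        }
    }

    private final Map<String, PartitionTypeSpec> specs = new LinkedHashMap<>();

    /** Registers a partition type; rejects a second spec under the same name. */
    void register(PartitionTypeSpec spec) {
        if (specs.putIfAbsent(spec.name, spec) != null) {
            throw new IllegalArgumentException("duplicate partition type: " + spec.name);
        }
    }

    /** Looks up a registered partition type by name. */
    PartitionTypeSpec lookup(String name) {
        PartitionTypeSpec spec = specs.get(name);
        if (spec == null) {
            throw new IllegalArgumentException("unknown partition type: " + name);
        }
        return spec;
    }

    public static void main(String[] args) {
        PartitionTypeRegistrySketch registry = new PartitionTypeRegistrySketch();
        // Assumption: the partition is kept on consumer failure so that the
        // restarted task can reconnect, and consumers may start early.
        registry.register(new PartitionTypeSpec(
                "PIPELINED_APPROXIMATE", () -> false, () -> true));

        PartitionTypeSpec spec = registry.lookup("PIPELINED_APPROXIMATE");
        System.out.println(spec.name
                + " releaseOnConsumerFailure=" + spec.lifecycle.releaseOnConsumerFailure()
                + " startEarly=" + spec.scheduling.startConsumerBeforeProducerFinishes());
    }
}
```

The point of keeping the hooks on one spec object is that the correlated 
pieces travel together, so a new partition type cannot be registered with 
only some of them defined.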

> Single Task Failure Recovery API Abstraction
> --------------------------------------------
>
>                 Key: FLINK-18113
>                 URL: https://issues.apache.org/jira/browse/FLINK-18113
>             Project: Flink
>          Issue Type: New Feature
>          Components: API / Core, Runtime / Checkpointing, Runtime / 
> Coordination, Runtime / Network
>            Reporter: Yuan Mei
>            Priority: Major
>
> Overall, I would like to keep track of the discussion on API changes needed 
> based on FLINK-18112.
> A similar discussion can be found in FLINK-20038.
> The discussion is around:
>  # scheduling strategy
>  # failover strategy
>  # lifecycle management
>  # runtime implementations



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
