[
https://issues.apache.org/jira/browse/FLINK-18112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17127912#comment-17127912
]
Yuan Mei commented on FLINK-18112:
----------------------------------
Hey [~klion26], thank you so much for the positive feedback! Unfortunately, we
do not have any documentation ready for public sharing yet.
One of the main purposes of this Jira is to produce such a FLIP. We will let
you know once the document is ready.
> Single Task Failure Recovery Prototype
> --------------------------------------
>
> Key: FLINK-18112
> URL: https://issues.apache.org/jira/browse/FLINK-18112
> Project: Flink
> Issue Type: New Feature
> Components: Runtime / Checkpointing, Runtime / Coordination, Runtime
> / Network
> Affects Versions: 1.12.0
> Reporter: Yuan Mei
> Assignee: Yuan Mei
> Priority: Major
> Fix For: 1.12.0
>
>
> Build a prototype of single task failure recovery to address and answer the
> following questions:
> *Step 1*: Scheduling part, restart a single node without restarting the
> upstream or downstream nodes.
> *Step 2*: Checkpointing part. Based on my understanding of how regional
> failover works, this part might not need modification.
> *Step 3*: Network part
> - how the recovered node is able to link to the upstream ResultPartitions and
> continue getting data
> - how the downstream node is able to link to the recovered node and continue
> getting data
> - how different Netty transit modes affect the results
> - what happens if the failed node's buffered data pool is full
> *Step 4*: Failover process verification