[ https://issues.apache.org/jira/browse/FLINK-4256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16909980#comment-16909980 ]
Stephan Ewen commented on FLINK-4256:
-------------------------------------

This is in fact working for streaming as well, not only for batch. In both cases it works at the granularity of "pipelined regions".

With blocking "batch" shuffles, a batch job decomposes into many small pipelined regions, each of which can be recovered individually. Streaming programs decompose into multiple pipelined regions only when they do not have an all-to-all shuffle ({{keyBy()}} or {{rebalance()}}).

Anything beyond that, such as more fine-grained recovery of streaming jobs, is not in scope here, because it would need a mechanism different from Flink's current checkpointing mechanism.

> Fine-grained recovery
> ---------------------
>
> Key: FLINK-4256
> URL: https://issues.apache.org/jira/browse/FLINK-4256
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Coordination
> Affects Versions: 1.1.0
> Reporter: Stephan Ewen
> Assignee: Stephan Ewen
> Priority: Major
> Fix For: 1.9.0
>
>
> When a task fails during execution, Flink currently resets the entire
> execution graph and triggers complete re-execution from the last completed
> checkpoint. This is more expensive than just re-executing the failed tasks.
> In many cases, more fine-grained recovery is possible.
> The full description and design are in the corresponding FLIP.
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-1+%3A+Fine+Grained+Recovery+from+Task+Failures
> The detailed design for version 1 is at
> https://docs.google.com/document/d/1_PqPLA1TJgjlqz8fqnVE3YSisYBDdFsrRX_URgRSj74/edit#
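
To illustrate the point in the comment about how jobs decompose into pipelined regions, here is a minimal sketch. It assumes a simplified model in which a region is a set of tasks connected by pipelined edges, while blocking edges separate regions. This is not Flink's implementation; the function names ({{pipelined_regions}}, {{forward}}, {{all_to_all}}) and the two-task topology are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def pipelined_regions(tasks, edges):
    """Group tasks into regions: connected components over pipelined edges.

    edges is an iterable of (producer, consumer, kind) tuples, where kind is
    "pipelined" or "blocking". Blocking edges do not join regions.
    """
    parent = {t: t for t in tasks}

    def find(t):
        # Union-find lookup with path halving.
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    for producer, consumer, kind in edges:
        if kind == "pipelined":
            parent[find(producer)] = find(consumer)

    groups = defaultdict(set)
    for t in tasks:
        groups[find(t)].add(t)
    return list(groups.values())


def forward(producers, consumers, kind):
    """Pointwise connection: producer i feeds only consumer i."""
    return [(p, c, kind) for p, c in zip(producers, consumers)]


def all_to_all(producers, consumers, kind):
    """Every producer feeds every consumer, as with keyBy() or rebalance()."""
    return [(p, c, kind) for p in producers for c in consumers]


# Hypothetical job: two operators, each with parallelism 2.
sources = [("source", 0), ("source", 1)]
sinks = [("sink", 0), ("sink", 1)]
tasks = sources + sinks

# Streaming job without an all-to-all shuffle: one region per parallel pipeline.
streaming_forward = pipelined_regions(tasks, forward(sources, sinks, "pipelined"))

# Streaming job with a pipelined all-to-all shuffle: a single region.
streaming_shuffle = pipelined_regions(tasks, all_to_all(sources, sinks, "pipelined"))

# Batch job with a blocking all-to-all shuffle: every task is its own region.
batch_shuffle = pipelined_regions(tasks, all_to_all(sources, sinks, "blocking"))

print(len(streaming_forward), len(streaming_shuffle), len(batch_shuffle))
```

In this model a failure would restart only the region containing the failed task. That is why a pipelined all-to-all shuffle leaves a streaming job with a single region, so recovery still restarts the whole job, while the blocking variant splits the same topology into independently recoverable pieces.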