Hi,

We are using Flink for streaming and find its "stop-the-world" recovery behavior prohibitive for use cases that prioritize availability. Partial recovery, as outlined in FLIP-1, would probably alleviate these concerns.
https://cwiki.apache.org/confluence/display/FLINK/FLIP-1+%3A+Fine+Grained+Recovery+from+Task+Failures

Looking at the subtasks in https://issues.apache.org/jira/browse/FLINK-4256, it appears that much of the work has already been done, but there seems to have been little recent progress. What is missing (for streaming)? How close is version 2 (recovery from limited intermediate results)?

Thanks!
Thomas