Hi,

1. Currently, much of the work in FLINK-4256 is about failover improvements in the bounded dataset scenario.
2. For the streaming scenario, a new shuffle plugin plus a proper failover strategy could avoid the "stop-the-world" recovery.
3. We have already done a lot of work on a new shuffle in the old Flink shuffle architecture, because many of our customers have this concern. We plan to move that work to the new Flink pluggable shuffle architecture.

Best,
Guowei

Thomas Weise <t...@apache.org> wrote on Fri, Jul 26, 2019 at 8:54 AM:

> Hi,
>
> We are using Flink for streaming and find the "stop-the-world" recovery
> behavior of Flink prohibitive for use cases that prioritize availability.
> Partial recovery as outlined in FLIP-1 would probably alleviate these
> concerns.
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-1+%3A+Fine+Grained+Recovery+from+Task+Failures
>
> Looking at the subtasks in
> https://issues.apache.org/jira/browse/FLINK-4256 it
> appears that much of the work was already done but not much recent
> progress? What is missing (for streaming)? How close is version 2 (recovery
> from limited intermediate results)?
>
> Thanks!
> Thomas