Hi, I was involved in some off-line discussions and some early stages of the discussions on this topic. In recent weeks Yuan has also prepared a good looking prototype and draft changes that look good to me, so I would be +1 for accepting this FLIP.
I guess as there were no objections here, I would suggest starting a vote thread for this proposal. If someone finds a problem with this FLIP, please vote -1 there or let us know here. Best, Piotrek niedz., 16 sie 2020 o 17:02 Yuan Mei <yuanmei.w...@gmail.com> napisaĆ(a): > Hi Devs, > > I would like to start a formal discussion about "FLIP-135 Approximate > Task-Local Recovery" [1] which supports approximate single task failure > recovery without restarting the entire streaming job. > > Flink is no longer a pure streaming engine as it was born and has been > extended to fit into many different scenarios over time: batch, AI, > event-driven applications, e.t.c. Approximate task-local recovery is one of > the attempts to fulfill these diversified scenarios and trade data > consistency for fast failure recovery. More specifically, if a task fails, > only the failed task restarts without affecting the rest of the job. > Approximate task-local recovery is similar to > RestartPipelinedRegionFailoverStrategy [2], with two major differences: > - Instead of restarting a connected region, approximate task-local recovery > restarts only the failed task(s) for a streaming job. > - RestartPipelinedRegionFailoverStrategy is exactly-once, while approximate > task-local recovery expects data loss and a bit of data duplication when > sources fail. > > Approximate task-local recovery is useful in scenarios where a certain > amount of data loss is tolerable, but a full pipeline restart is not > affordable. A typical use case is online training. Online training jobs are > usually complicated with all-to-all task connections, and a single task > failure with RestartPipelinedRegionFailoverStrategy may result in a > complete restart of the whole pipeline. Besides, the initialization is > time-consuming, including the procedure of loading training models and > starting Python subprocesses, etc. The initialization may take minutes to > complete on average. Hence, we introduce an approximate task-local recovery > strategy to only restart failed tasks. > > To ease the discussion, we divide the problem of approximate task-local > recovery into three steps with each step only focusing on addressing a set > of problems, sketched as follows: > 1). Sink Recovery, 2). Downstream Recovery and 3). Single Task Recovery. > > This FLIP focuses on tackling issues related to the first step "Sink > Recovery", and we would like to collect broader feedback in this dedicated > mail thread. > > Best, > > Yuan > > [1] > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-135+Approximate+Task-Local+Recovery > [2] > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-1+%3A+Fine+Grained+Recovery+from+Task+Failures >