Thanks for summarizing the current state of FLIP-1 and outlining the way to move forward with it, Chesnay.
I think we should implement the first version of the backtracking logic using the DataConsumptionException (FLINK-6227) to signal that an intermediate result partition has been lost. Moreover, I think it would be best to base the new implementation on the refined FailoverStrategy interface proposed by the scheduler refactorings [1]. We could have an adaptor to make it work with the existing code for testing purposes and until the scheduler interfaces have been introduced.

Apart from that, +1 for completing Flink's first improvement proposal :-)

[1] https://docs.google.com/document/d/1fstkML72YBO1tGD_dmG2rwvd9bklhRVauh4FSsDDwXU/edit?usp=sharing

Cheers,
Till

On Sun, Apr 14, 2019 at 8:20 PM Chesnay Schepler <ches...@apache.org> wrote:

> Hello everyone,
>
> Till, Zhu Zhu and myself have prepared a Design Document
> <https://docs.google.com/document/d/1YHOpMLdC-dtgjcM-EDn6v-oXgsEQKXSoMjqRcYVbJA8>
> for introducing backtracking for failover regions. This is an
> optimization of the failure handling logic for jobs with blocking result
> partitions (which primarily exist in batch jobs), where only part of the
> job has to be restarted.
> This is a continuation of the FLIP-1
> <https://cwiki.apache.org/confluence/display/FLINK/FLIP-1+%3A+Fine+Grained+Recovery+from+Task+Failures>
> efforts to introduce fine-grained recovery from task failures.
> The associated JIRA can be found here
> <https://issues.apache.org/jira/browse/FLINK-12068>.
>
> Any feedback is highly appreciated.
>
> Regards,
> Chesnay