Hi Till, thanks for drafting the FLIP, it looks really good. I did a quick pass over the PR and it seems to be heading in the right direction.

> It might be required to introduce a graceful shutdown of the TaskExecutor
> in order to support proper cleanup of resources.

This is actively being worked on by Niklas in FLINK-25277 [1].

In the PR, I've seen that you're also replacing the directories for storing the local state with the working directory. Should this be a concern? Is RocksDB, for example, able to leverage multiple mount paths to spread the load?

I'd also be in favor of introducing a proper (evolving) serialization format right away instead of Java serialization, but no hard feelings if we don't.

[1] https://issues.apache.org/jira/browse/FLINK-25277

Best,
D.

On Wed, Dec 29, 2021 at 4:58 PM Till Rohrmann <trohrm...@apache.org> wrote:

> I've created draft PR for the desired changes [1]. It might be easier to
> take a look at than the branch.
>
> [1] https://github.com/apache/flink/pull/18237
>
> Cheers,
> Till
>
> On Tue, Dec 28, 2021 at 3:22 PM Till Rohrmann <trohrm...@apache.org>
> wrote:
>
> > Hi everyone,
> >
> > I would like to start a discussion about using the working directory to
> > persist local state for faster recovery (FLIP-201) [1]. Persisting the
> > local state will be beneficial if a crashed process is restarted with the
> > same working directory. In this case, Flink does not have to download the
> > state artifacts again and can recover locally.
> >
> > A POC can be found here [2].
> >
> > Looking forward to your feedback.
> >
> > [1] https://cwiki.apache.org/confluence/x/wJuqCw
> > [2] https://github.com/tillrohrmann/flink/tree/FLIP-201
> >
> > Cheers,
> > Till