Yes let's move on, already cast my vote in the voting thread. Thanks all for the patience answering / addressing my belated questions!
Best Regards, Yu On Sun, 27 Sep 2020 at 20:00, Stephan Ewen <se...@apache.org> wrote: > Good to see this FLIP moving. > > From what I understand, the remaining questions are mainly about how to > express the roles of the CheckpointStorage and specifically the behavior of > JMCheckpointStorage and FsCheckpointStorage in the docs. > This sounds like details we should discuss over the concrete text proposals > in the PR. > > On Sun, Sep 27, 2020 at 5:38 AM Yu Li <car...@gmail.com> wrote: > > > Thanks Seth, the updated FLIP overall LGTM, and I've left some inline > > comments in the doc. > > > > Best Regards, > > Yu > > > > > > On Fri, 25 Sep 2020 at 20:58, Seth Wiesman <sjwies...@gmail.com> wrote: > > > > > Done > > > > > > Seth > > > > > > On Fri, Sep 25, 2020 at 2:47 AM Yu Li <car...@gmail.com> wrote: > > > > > > > *bq. I think it might help to highlight specific stumbling blocks > users > > > > have today and why I believe this change addresses those issues.* > > > > Thanks for adding more details, I believe adding these blocks to the > > FLIP > > > > doc could make the motivation more vivid and convincing. > > > > > > > > *bq. To be concrete I think the JavaDoc for setCheckpointStorage > would > > be > > > > something like...* > > > > I could see this definition extracts the existing description from > the > > > > current `StateBackend` interface, it's a valid option, and let me > quote > > > it > > > > again: > > > > - CheckpointStorage defines how checkpoint snapshots are persisted > for > > > > fault tolerance. Various implementations store their checkpoints in > > > > different fashions and have different requirements and availability > > > > guarantees. > > > > - JobManagerCheckpointStorage stores checkpoints in the memory of the > > > > JobManager. It is lightweight and without additional dependencies but > > is > > > > not highly available. > > > > - FileSystemCheckpointStorage stores checkpoints in a file system. > For > > > > systems like HDFS, NFS Drives, S3, and GCS, this storage policy > > supports > > > > large state size, in the magnitude of many terabytes while providing > a > > > > highly available foundation for stateful applications. This > checkpoint > > > > storage policy is recommended for most production deployments. > > > > > > > > Sticking to this definition, I think we should have the below > migration > > > > plans for existing backends: > > > > - `MemoryStateBackend(null, String savepointPath)` to > > > > `HashMapStateBackend() + JobManagerCheckpointStorage()` > > > > - `MemoryStateBackend(<non-null-checkpoint-path>, String > > savepointPath)` > > > to > > > > `HashMapStateBackend() + FileSystemCheckpointStorage()` > > > > in addition of the existing: > > > > - `MemoryStateBackend()` to `HashMapStateBackend() + > > > > JobManagerCheckpointStorage()` > > > > and could be summarized as: > > > > - MemoryStateBackend with checkpoint path: `HashMapStateBackend() + > > > > FileSystemCheckpointStorage()` > > > > - MemoryStateBackend w/o checkpoint path: `HashMapStateBackend() + > > > > JobManagerCheckpointStorage()` > > > > > > > > And I believe adding the above highlighted blocks to the FLIP doc > (the > > > "New > > > > StateBackend User API" and "Migration Plan" sections, separately) > could > > > > make it more complete. > > > > > > > > PS. Please note that although the current javadoc of `StateBackend` > > > states > > > > "MemoryStateBackend is not highly available", it actually supports > > > > persisting the checkpoint data to DFS when checkpoint path is given, > so > > > the > > > > mapping between old and new APIs are not that straight-forward and > need > > > > some clear clarifications, from my point of view. > > > > > > > > Best Regards, > > > > Yu > > > > > > > > > > > > On Fri, 25 Sep 2020 at 08:33, Seth Wiesman <sjwies...@gmail.com> > > wrote: > > > > > > > > > Hi Yu, > > > > > > > > > > bq* I thought the FLIP aims at resolving some *existing* confusion, > > > i.e. > > > > > the durability mystery to users. > > > > > > > > > > I think it might help to highlight specific stumbling blocks users > > have > > > > > today and why I believe this change addresses those issues. Some > > > frequent > > > > > things I've heard over the past several years include: > > > > > > > > > > 1) "We use RocksDB because we don't need fault tolerance." > > > > > 2) "We don't use RocksDB because we don't want to manage an > external > > > > > database." > > > > > 3) Believing RocksDB is reading and writing directly with S3 or > HDFS > > > (vs. > > > > > local disk) > > > > > 4) Believing FsStateBackend spills to disk or has anything to do > with > > > the > > > > > local filesystem > > > > > 5) Pointing RocksDB at network-attached storage, believing that the > > > state > > > > > backend needs to be fault-tolerant > > > > > > > > > > This question from the ml is very representative of where users are > > > > > struggling[1]. Many of these questions were not from new users but > > from > > > > > organizations that were in production! Just yesterday I was on the > > > phone > > > > > with a company that didn't realize they were in production without > > > > > checkpointing; honestly, you would be shocked how often this > happens. > > > The > > > > > current state backend abstraction is to complex for many of our > > users. > > > > What > > > > > all these questions have in common is misunderstanding the > > relationship > > > > > between how data is stored locally on TMs vs how checkpoints make > > that > > > > > state durable. > > > > > > > > > > The FLIP aims actively help users by allowing them to reason about > > > state > > > > > backends separately from checkpoint durability. In the future, a > > state > > > > > backend only defines where and how state is stored locally on the > TM > > > > while > > > > > checkpoint storage defines where and how checkpoints are stored for > > > > > recovery. To be concrete I think the JavaDoc for > setCheckpointStorage > > > > would > > > > > be something like: > > > > > > > > > > ```java > > > > > /** > > > > > * CheckpointStorage defines how checkpoint snapshots are persisted > > for > > > > > fault tolerance > > > > > *. Various implementations store their checkpoints in different > > > fashions > > > > > and have different requirements and > > > > > * availability guarantees. > > > > > * > > > > > *<p>For example, JobManagerCheckpointStorage stores checkpoints in > > the > > > > > memory of the JobManager. > > > > > * It is lightweight and without additional dependencies but is not > > > > highly > > > > > available > > > > > * and only supports small state sizes. This checkpoint storage > > policy > > > is > > > > > convenient for > > > > > * local testing and development. > > > > > * > > > > > *<p>FileSystemCheckpointStorage stores checkpoints in a > filesystem. > > > For > > > > > systems like > > > > > * HDFS, NFS Drives, S3, and GCS, this storage policy supports > large > > > > state > > > > > size, > > > > > * in the magnitude of many terabytes while providing a highly > > > available > > > > > foundation > > > > > * for stateful applications. This checkpoint storage policy is > > > > recommended > > > > > for most > > > > > * production deployments. > > > > > */ > > > > > void setCheckpointStorage(CheckpointStorage storage) {} > > > > > ``` > > > > > > > > > > Seth > > > > > > > > > > [1] > > > > > > > > > > > > > > > > > > > > http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/State-Storage-Questions-td37919.html > > > > > [2] Also naming, but we're aligned here > > > > > > > > > > On Thu, Sep 24, 2020 at 10:24 AM Yu Li <car...@gmail.com> wrote: > > > > > > > > > > > And to make it clear, I'm +1 on the idea of decoupling state > > backends > > > > > with > > > > > > checkpointing. I don't have any question about making it clear > that > > > > > > heap/RocksDB is where we serve the routine state read/write and > > where > > > > to > > > > > > put the checkpoint data is another story. My only concern lies in > > the > > > > > newly > > > > > > introduced setCheckpointStorage API and how we define its > > semantics, > > > > and > > > > > > not sure whether it's due to my ignorance. > > > > > > > > > > > > Best Regards, > > > > > > Yu > > > > > > > > > > > > > > > > > > On Thu, 24 Sep 2020 at 23:11, Yu Li <car...@gmail.com> wrote: > > > > > > > > > > > > > *bq. What new confusion would be introduced here?* > > > > > > > No *new* confusion introduced, but as mentioned at the very > > > beginning > > > > > of > > > > > > > the motivation ("Apache Flink's durability story is a mystery > to > > > many > > > > > > > users"), I thought the FLIP aims at resolving some *existing* > > > > > > > confusions, i.e. the durability mystery to users. > > > > > > > > > > > > > > For me, I'm not 100% clear about how to write the javadoc of > the > > > > > > > setCheckpointStorage API. Would it be like "specify where the > > > > > checkpoint > > > > > > > data is stored"? If so, do we need to explain the fact that > when > > a > > > > > > > checkpoint path is given, JM will also persist the checkpoint > > data > > > to > > > > > > DFS? > > > > > > > It's true that such confusion also exists today, but would the > > > > > > introduction > > > > > > > of the new API expose it further? > > > > > > > > > > > > > > IMHO we need to document the newly introduced API / classes and > > > their > > > > > > > semantics clearly in the FLIP to make sure everyone is on the > > same > > > > > page, > > > > > > > but if we feel such work / discussions are all details and only > > > need > > > > to > > > > > > > happen at the documenting and release note phase, it's also > fine > > to > > > > me. > > > > > > > > > > > > > > And if I'm the only one who has such questions / concerns on > the > > > new > > > > > > > `setCheckpointStorage` API and most of others feel its semantic > > is > > > > > sound > > > > > > > and clear, then please just ignore me and move on. > > > > > > > > > > > > > > Thanks. > > > > > > > > > > > > > > Best Regards, > > > > > > > Yu > > > > > > > > > > > > > > > > > > > > > On Wed, 23 Sep 2020 at 17:08, Stephan Ewen <se...@apache.org> > > > wrote: > > > > > > > > > > > > > >> I am confused now with the concerns here. This is very much > from > > > the > > > > > > user > > > > > > >> perspective (which is partially also the developer perspective > > > which > > > > > is > > > > > > >> the > > > > > > >> sign of an intuitive abstraction). > > > > > > >> > > > > > > >> Of course, there will be docs describing what > > JMCheckpointStorage > > > > and > > > > > > >> FsCheckpointStorage are. > > > > > > >> And having release notes that describe that > > > > > > >> RocksDBStateBackend("s3://...") > > > > > > >> now corresponds to a combination of "RocksDBBackend" and > > > > > > >> "FsCheckpointStorage" is also straightforward. > > > > > > >> > > > > > > >> We said to keep the old RocksDBStateBackend class and let it > > > > implement > > > > > > >> both > > > > > > >> interfaces such that the old code still works exactly as > before. > > > > > > >> > > > > > > >> What new confusion would be introduced here? > > > > > > >> Understanding the difference between JMCheckpointStorage and > > > > > > >> FsCheckpointStorage was always necessary when one needed to > > > > understand > > > > > > the > > > > > > >> difference between MemoryStateBackend and FsStateBackend. It > > > should > > > > be > > > > > > >> easier to define this after this change, because it is the > only > > > > thing > > > > > > that > > > > > > >> we describe when explaining what checkpoint storage to use > > (rather > > > > > than > > > > > > >> also having the choice of index structure coupled to that). > > > > > > >> > > > > > > >> > > > > > > >> On Wed, Sep 23, 2020 at 10:39 AM Aljoscha Krettek < > > > > > aljos...@apache.org> > > > > > > >> wrote: > > > > > > >> > > > > > > >> > On 23.09.20 04:40, Yu Li wrote: > > > > > > >> > > To be specific, with the old API users don't need to set > > > > > checkpoint > > > > > > >> > > storage, instead they only need to pass the checkpoint > path > > > w/o > > > > > > caring > > > > > > >> > > about the storage. The new APIs are forcing users to set > the > > > > > storage > > > > > > >> so > > > > > > >> > > they have to know the difference between different > storages. > > > > It's > > > > > > not > > > > > > >> an > > > > > > >> > > implementation change, but an API change that users have > to > > > > > > understand > > > > > > >> > and > > > > > > >> > > follow-up. > > > > > > >> > > > > > > > >> > I think the main point of the FLIP is to make it more > obvious > > to > > > > > users > > > > > > >> > what is happening. > > > > > > >> > > > > > > > >> > With current Flink, they would do a `setStateBackend(new > > > > > > >> > FsStateBackend(<path>))`. What the user is actually "saying" > > > with > > > > > this > > > > > > >> > is: I want to keep state on heap but store checkpoints in > DFS. > > > > They > > > > > > are > > > > > > >> > not actually changing the "State Backend", the thing that > > keeps > > > > > state > > > > > > in > > > > > > >> > operators, but only where state is checkpointed. The thing > > that > > > is > > > > > > used > > > > > > >> > for local state storage in operators is still the "Heap > > > Backend". > > > > > > >> > > > > > > > >> > With the proposed FLIP, a user would do a > > > > `setCheckpointStorage(new > > > > > > >> > FsStorage(<path>))`. Which makes it obvious that they're > > > changing > > > > > > where > > > > > > >> > checkpoints are stored but not the actual "State Backend", > > which > > > > is > > > > > > >> > still "Heap Backend" (the default). > > > > > > >> > > > > > > > >> > I do understand Yu's point, though, that this will be > > confusing > > > > for > > > > > > >> > current Flink users. They are used to setting a "State > > Backend" > > > > > > if/when > > > > > > >> > they want to change the storage location. To fit the new > model > > > > they > > > > > > >> > would have to change the call from `setStateBackend()` to > > > > > > >> > `setCheckpointStorage()`. > > > > > > >> > > > > > > > >> > I think we need to life with this short-term confusion > because > > > in > > > > > the > > > > > > >> > long run the proposed split between checkpoint location and > > > state > > > > > > >> > backend makes sense and will be more straightforward for > users > > > to > > > > > > >> > understand. > > > > > > >> > > > > > > > >> > Best, > > > > > > >> > Aljoscha > > > > > > >> > > > > > > > >> > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > > > > >