Thanks Paul for bringing this discussion. Added some my thoughts, please correct me if I'm wrong since I'm not very familiar with Flink.
> Savepoint operation is not SQL, thus can’t be passed to Flink engine like a > normal statement, thus we may need a new operation type. Technically, it's an easy way to implement, since we already found a way to extend the protocol and meanwhile keep it compatible, an example is the LaunchEngine operator. But I'm thinking if we can introduce a SQL layer to handle it? >From the previous discussion[1], Kyuubi definitely requires a SQL layer, and with the SQL layer, we can even make the savepoint queryable in SQL-like syntax! > Beeline requires a SIGINT (CTRL + C) to trigger a query cancel [2], but this > doesn’t work for async queries, which is very common in streaming scenarios. > Thus we may need to extend beeline to support a cancel command. LGTM, we can implement a `!cancel` command in beeline, besides, we can introduce a SQL-like syntax to achieve it, e.g. `CALL cancel(query_id='12345')` [1] https://github.com/apache/incubator-kyuubi/pull/2048#issuecomment-1059980949 Thanks, Cheng Pan On Fri, Mar 18, 2022 at 5:06 PM Paul Lam <paullin3...@gmail.com> wrote: > > Hi team, > > As we aimed to make Flink engine production-ready [1], Flink > savepoint/checkpoint management > is a currently missing but crucial part, which we should prioritize. > Therefore, I start this thread to > discus the implementation of Flink savepoint/checkpoint management. > > There’re mainly three questions we need to think about: > 1. how to trigger a savepoint? > 2. how to find the available savepoints/checkpoints for a job? > 3. how to specify a savepoint/checkpoint for restore? > > # 1. how to trigger a savepoint > Apart from the automatic checkpoint, Flink allows user to manually trigger a > savepoint, > either during job running period or on stopping. To support that, there’re > two prerequisites. > > 1) support savepoint/cancel operations > Savepoint operation is not SQL, thus can’t be passed to Flink engine like a > normal statement, > thus we may need a new operation type. > > Cancel query operation is supported currently, but it’s not exposed to > beeline in a production- > ready way. Beeline requires a SIGINT (CTRL + C) to trigger a query cancel > [2], but this doesn’t > work for async queries, which is very common in streaming scenarios. Thus we > may > need to extend beeline to support a cancel command. > > To sum up, I think we might need to add a `savepoint` operation in Kyuubi > Server, and expose > savepoint and cancel operations in Beeline (maybe JDBC as well if possible). > > 2) expose Flink Job ID > To track an async query, we need an ID. It could be the operation ID of > Kyuubi or job ID of Flink > (maybe Flink cluster id as well). Since Kyuubi doesn’t persist metadata, we > may lose track the > Flink jobs after a restart. So I propose to expose Flink Job IDs to users, to > let users bookkeep > the IDs manually, and supports built-in Job management after Kyuubi has > metadata persistence. > > The users’ workflow should be like: > 1. execute a Flink query which creates a Flink job and returns the Job ID > 2. trigger a savepoint using the Job ID which returns a savepoint path > 3. cancel the query using the Job ID (or just cancel-with-savepoint) > > # 2. how to find available savepoints/checkpoints for a job > In some cases, Flink job crushes and we need to find an available savepoint > or checkpoint for > job restoring. This can be done via Flink history server or searching the > checkpoint/savepoint > directories. > > I think it’s good to support automatically searching for the available > savepoint/checkpoints, but > at the early phase, it’s okay to let users to do it manually. It doesn’t > block Flink engine to be > production-ready. > > # 3. how to specify a savepoint/checkpoint for restore > This question is relatively simple: add a new configuration option for users > to set savepoint/ > checkpoint path for restore before we provide the automatic search, and > optionally automatically > set the path after that . > > 1. https://github.com/apache/incubator-kyuubi/issues/2100 > <https://github.com/apache/incubator-kyuubi/issues/2100> > 2. > https://cwiki.apache.org/confluence/display/hive/hiveserver2+clients#HiveServer2Clients-CancellingtheQuery > > Best, > Paul Lam >