Hi Paul,

Filed a ticket: https://issues.apache.org/jira/browse/FLINK-27977 for savepoints housekeeping.
Best regards,
Jing

On Thu, Jun 9, 2022 at 10:37 AM Martijn Visser <martijnvis...@apache.org> wrote:

> Hi Paul,
>
> That's a fair point, but I still think we should not offer that capability
> via the CLI either. But that's a different discussion :)
>
> Thanks,
>
> Martijn
>
> On Thu, Jun 9, 2022 at 10:08, Paul Lam <paullin3...@gmail.com> wrote:
>
>> Hi Martijn,
>>
>> I think the `DROP SAVEPOINT` statement would not conflict with NO_CLAIM
>> mode, since the statement is triggered by users instead of the Flink runtime.
>>
>> We're simply providing a tool for users to clean up their savepoints, just
>> like `bin/flink savepoint -d :savepointPath` in the Flink CLI [1].
>>
>> [1] https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/ops/state/savepoints/#disposing-savepoints
>>
>> Best,
>> Paul Lam
>>
>> On Jun 9, 2022, at 15:41, Martijn Visser <martijnvis...@apache.org> wrote:
>>
>> Hi all,
>>
>> I would not include a DROP SAVEPOINT syntax. With the recently introduced
>> CLAIM/NO CLAIM mode, I would argue that we've just clarified snapshot
>> ownership, and if you have a savepoint established "with NO_CLAIM it creates
>> its own copy and leaves the existing one up to the user." [1] We shouldn't
>> then again make it fuzzy by making it possible for Flink to remove
>> snapshots.
>>
>> Best regards,
>>
>> Martijn
>>
>> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-193%3A+Snapshots+ownership
>>
>> On Thu, Jun 9, 2022 at 09:27, Paul Lam <paullin3...@gmail.com> wrote:
>>
>>> Hi team,
>>>
>>> It's great to see our opinions are finally converging!
>>>
>>> `STOP JOB <job_id> [WITH SAVEPOINT] [WITH DRAIN]`
>>>
>>> LGTM. Adding it to the FLIP.
>>>
>>> To Jark,
>>>
>>> We can simplify the statement to "CREATE SAVEPOINT FOR JOB <job_id>"
>>>
>>> Good point. The default savepoint dir should be enough for most cases.
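As an aside, the two savepoint statements discussed here map naturally onto Flink's REST API: triggering corresponds to POST /jobs/:jobid/savepoints and disposal to POST /savepoint-disposal. A minimal Python sketch of assembling those requests (endpoint paths per the Flink REST docs; the function names and the sample job id are hypothetical, and nothing here contacts a real cluster):

```python
# Sketch: build the REST requests behind CREATE SAVEPOINT FOR JOB <job_id>
# and DROP SAVEPOINT <savepoint_path>. We only assemble (method, path, body)
# tuples; sending them is left to an HTTP client.

def create_savepoint_request(job_id, target_directory=None):
    """POST /jobs/:jobid/savepoints. Omitting target-directory lets the
    cluster fall back to state.savepoints.dir, as Jark suggests above."""
    body = {"cancel-job": False}
    if target_directory is not None:
        body["target-directory"] = target_directory
    return ("POST", f"/jobs/{job_id}/savepoints", body)

def drop_savepoint_request(savepoint_path):
    """POST /savepoint-disposal, the REST twin of `bin/flink savepoint -d`."""
    return ("POST", "/savepoint-disposal", {"savepoint-path": savepoint_path})

method, path, body = create_savepoint_request("a1b2c3d4")  # hypothetical job id
print(method, path, body)
```

Both calls are asynchronous in the real API (they return a trigger/request id to poll), which a full implementation would have to handle.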
>>>
>>> To Jing,
>>>
>>> DROP SAVEPOINT ALL
>>>
>>> I think it's valid to have such a statement, but I have two concerns:
>>>
>>> - `ALL` is already an SQL keyword, thus it may cause ambiguity.
>>> - The Flink CLI and REST API don't provide the corresponding
>>> functionalities, and we'd better keep them aligned.
>>>
>>> How about making this statement a follow-up task, which would also touch the
>>> REST API and Flink CLI?
>>>
>>> Best,
>>> Paul Lam
>>>
>>> On Jun 9, 2022, at 11:53, godfrey he <godfre...@gmail.com> wrote:
>>>
>>> Hi all,
>>>
>>> Regarding `PIPELINE`, it comes from the flink-core module; see the
>>> `PipelineOptions` class for more details.
>>> `JOBS` is a more generic concept than `PIPELINES`. I'm also fine with
>>> `JOBS`.
>>>
>>> +1 to discuss JOBTREE in another FLIP.
>>>
>>> +1 to `STOP JOB <job_id> [WITH SAVEPOINT] [WITH DRAIN]`
>>>
>>> +1 to `CREATE SAVEPOINT FOR JOB <job_id>` and `DROP SAVEPOINT <savepoint_path>`
>>>
>>> Best,
>>> Godfrey
>>>
>>> On Thu, Jun 9, 2022 at 01:48, Jing Ge <j...@ververica.com> wrote:
>>>
>>> Hi Paul, Hi Jark,
>>>
>>> Re JOBTREE: agreed that it is out of the scope of this FLIP.
>>>
>>> Re `RELEASE SAVEPOINT ALL`: if the community prefers `DROP`, then `DROP
>>> SAVEPOINT ALL` for housekeeping. WDYT?
>>>
>>> Best regards,
>>> Jing
>>>
>>> On Wed, Jun 8, 2022 at 2:54 PM Jark Wu <imj...@gmail.com> wrote:
>>>
>>> Hi Jing,
>>>
>>> Regarding JOBTREE (job lineage), I agree with Paul that this is out of the scope
>>> of this FLIP and can be discussed in another FLIP.
>>>
>>> Job lineage is a big topic that may involve many problems:
>>> 1) How to collect and report job entities, attributes, and lineages?
>>> 2) How to integrate with data catalogs, e.g. Apache Atlas, DataHub?
>>> 3) How does the Flink SQL CLI/Gateway know the lineage information and show the jobtree?
>>> 4) ...
>>>
>>> Best,
>>> Jark
>>>
>>> On Wed, 8 Jun 2022 at 20:44, Jark Wu <imj...@gmail.com> wrote:
>>>
>>> Hi Paul,
>>>
>>> I'm fine with using JOBS.
>>> The only concern is that this may conflict with displaying more detailed
>>> information for queries (e.g. query content, plan) in the future, e.g.
>>> SHOW QUERIES EXTENDED in ksqlDB [1].
>>> This is not a big problem, as we can introduce SHOW QUERIES in the future
>>> if necessary.
>>>
>>> STOP JOBS <job_id> (with options `table.job.stop-with-savepoint` and
>>> `table.job.stop-with-drain`)
>>>
>>> What about STOP JOB <job_id> [WITH SAVEPOINT] [WITH DRAIN]?
>>> It might be trivial and error-prone to set configuration before
>>> executing a statement, and the configuration will affect all statements after that.
>>>
>>> CREATE SAVEPOINT <savepoint_path> FOR JOB <job_id>
>>>
>>> We can simplify the statement to "CREATE SAVEPOINT FOR JOB <job_id>"
>>> and always use the configuration "state.savepoints.dir" as the default
>>> savepoint dir.
>>> The concern with using "<savepoint_path>" is that the parameter here should be
>>> a savepoint dir, while the savepoint_path is the returned value.
>>>
>>> I'm fine with the other changes.
>>>
>>> Thanks,
>>> Jark
>>>
>>> [1]: https://docs.ksqldb.io/en/latest/developer-guide/ksqldb-reference/show-queries/
>>>
>>> On Wed, 8 Jun 2022 at 15:07, Paul Lam <paullin3...@gmail.com> wrote:
>>>
>>> Hi Jing,
>>>
>>> Thank you for your inputs!
>>>
>>> TBH, I hadn't considered the ETL scenario that you mentioned. I think
>>> those jobs are managed just like other jobs in terms of job lifecycles (please
>>> correct me if I'm wrong).
>>>
>>> WRT the SQL statements about SQL lineage, I think it might be a
>>> little bit out of the scope of this FLIP, since it's mainly about
>>> lifecycles. By the way, do we have these functionalities in the Flink CLI or
>>> REST API already?
>>>
>>> WRT `RELEASE SAVEPOINT ALL`, I'm sorry for the outdated FLIP docs; the
>>> community is more in favor of `DROP SAVEPOINT <savepoint_path>`. I'm
>>> updating the FLIP according to the latest discussions.
>>>
>>> Best,
>>> Paul Lam
>>>
>>> On Jun 8, 2022, at 07:31, Jing Ge <j...@ververica.com> wrote:
>>>
>>> Hi Paul,
>>>
>>> Sorry that I am a little bit late to join this thread. Thanks for
>>> driving this and starting this informative discussion. The FLIP looks
>>> really interesting. It will help us a lot to manage Flink SQL jobs.
>>>
>>> Have you considered the ETL scenario with Flink SQL, where multiple SQL
>>> statements build a DAG, with many such DAGs in a deployment?
>>>
>>> 1)
>>> +1 for SHOW JOBS. I think sooner or later we will start to discuss how
>>> to support ETL jobs. Briefly speaking, the SQL statements used to build the DAG are
>>> responsible for *producing* data as the result (cube, materialized view, etc.)
>>> for future consumption by queries. The INSERT INTO SELECT FROM example
>>> in the FLIP and CTAS are typical SQL statements in this case. I would prefer to call
>>> them jobs instead of queries.
>>>
>>> 2)
>>> Speaking of the ETL DAG, we might want to see the lineage. Is it possible to
>>> support syntax like:
>>>
>>> SHOW JOBTREE <job_id> // shows the downstream DAG from the given job_id
>>> SHOW JOBTREE <job_id> FULL // shows the whole DAG that contains the given job_id
>>> SHOW JOBTREES // shows all DAGs
>>> SHOW ANCESTORS <job_id> // shows all parents of the given job_id
>>>
>>> 3)
>>> Could we also support savepoint housekeeping syntax? We ran into the
>>> issue that a lot of savepoints had been created by customers (via their
>>> apps), and it takes extra (hacky) effort to clean them up.
>>>
>>> RELEASE SAVEPOINT ALL
>>>
>>> Best regards,
>>> Jing
>>>
>>> On Tue, Jun 7, 2022 at 2:35 PM Martijn Visser <martijnvis...@apache.org> wrote:
>>>
>>> Hi Paul,
>>>
>>> I'm still doubting the keyword for the SQL applications. SHOW QUERIES could
>>> imply that this will actually show the query, but we're returning the IDs of
>>> the running applications.
>>> At first I was also not very much in favour of
>>> SHOW JOBS, since I prefer calling them 'Flink applications' and not 'Flink
>>> jobs', but the glossary [1] made me reconsider. I would +1 SHOW/STOP JOBS.
>>>
>>> Also +1 for the CREATE/SHOW/DROP SAVEPOINT syntax.
>>>
>>> Best regards,
>>>
>>> Martijn
>>>
>>> [1] https://nightlies.apache.org/flink/flink-docs-master/docs/concepts/glossary
>>>
>>> On Sat, Jun 4, 2022 at 10:38, Paul Lam <paullin3...@gmail.com> wrote:
>>>
>>> Hi Godfrey,
>>>
>>> Sorry for the late reply, I was on vacation.
>>>
>>> It looks like we have a variety of preferences on the syntax; how about we
>>> choose the most acceptable one?
>>>
>>> WRT the keyword for SQL jobs, we use JOBS, thus the statements related to jobs
>>> would be:
>>>
>>> - SHOW JOBS
>>> - STOP JOBS <job_id> (with options `table.job.stop-with-savepoint` and
>>> `table.job.stop-with-drain`)
>>>
>>> WRT savepoints for SQL jobs, we use the `CREATE/DROP` pattern with `FOR JOB`:
>>>
>>> - CREATE SAVEPOINT <savepoint_path> FOR JOB <job_id>
>>> - SHOW SAVEPOINTS FOR JOB <job_id> (show savepoints the current job
>>> manager remembers)
>>> - DROP SAVEPOINT <savepoint_path>
>>>
>>> cc @Jark @ShengKai @Martijn @Timo.
>>>
>>> Best,
>>> Paul Lam
>>>
>>> On Mon, May 23, 2022 at 21:34, godfrey he <godfre...@gmail.com> wrote:
>>>
>>> Hi Paul,
>>>
>>> Thanks for the update.
>>>
>>> 'SHOW QUERIES' lists all jobs in the cluster, no limit on APIs
>>> (DataStream or SQL) or clients (SQL client or CLI).
>>>
>>> Is a DataStream job a QUERY? I think not.
>>> For a QUERY, the most important concept is the statement, but the
>>> result does not contain this info.
>>> If we need to contain all jobs in the cluster, I think the name should
>>> be JOB or PIPELINE.
>>> I lean towards SHOW PIPELINES and STOP PIPELINE [IF RUNNING] id.
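Whichever keyword wins (JOBS, QUERIES, or PIPELINES), the listing itself can be served from the jobs overview that the REST API and ClusterClient already expose. A hedged sketch of shaping such a response into SHOW JOBS rows; the sample payload and field names are modeled on the /jobs/overview endpoint but should be treated as assumptions, not a captured response:

```python
# Sketch: turn a /jobs/overview-style response into SHOW JOBS rows
# (job id, job name, status, start time). Sample data is made up.

def show_jobs(overview):
    rows = []
    for job in overview.get("jobs", []):
        rows.append((job["jid"], job["name"], job["state"], job["start-time"]))
    return rows

sample = {
    "jobs": [
        {"jid": "deadbeef0001", "name": "insert-into_sink",
         "state": "RUNNING", "start-time": 1654761600000},
        {"jid": "deadbeef0002", "name": "ctas_cube",
         "state": "FINISHED", "start-time": 1654761700000},
    ]
}
for row in show_jobs(sample):
    print(row)
```

Note this deliberately returns every job, DataStream or SQL, which matches Paul's position that non-SQL jobs may be listed (and managed) from the SQL client too.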
>>>
>>> SHOW SAVEPOINTS
>>>
>>> To list the savepoints for a specific job, we need to specify a
>>> specific pipeline; the syntax should be SHOW SAVEPOINTS FOR PIPELINE id.
>>>
>>> Best,
>>> Godfrey
>>>
>>> On Fri, May 20, 2022 at 11:25, Paul Lam <paullin3...@gmail.com> wrote:
>>>
>>> Hi Jark,
>>>
>>> WRT "DROP QUERY", I agree that it's not very intuitive, and that's
>>> part of the reason why I proposed "STOP/CANCEL QUERY" at the
>>> beginning. The downside is that it's not ANSI-SQL compatible.
>>>
>>> Another question is, what should be the syntax for ungracefully
>>> canceling a query? As ShengKai pointed out in an offline discussion,
>>> "STOP QUERY" and "CANCEL QUERY" might confuse SQL users.
>>> The Flink CLI has both stop and cancel, mostly due to historical reasons.
>>>
>>> WRT "SHOW SAVEPOINT", I agree it's a missing part. My concern is
>>> that savepoints are owned by users and live beyond the lifecycle of a Flink
>>> cluster. For example, a user might take a savepoint at a custom path
>>> that's different from the default savepoint path; I think the jobmanager would
>>> not remember that, not to mention the jobmanager may be a fresh new
>>> one after a cluster restart. Thus if we support "SHOW SAVEPOINT", it's
>>> probably a best-effort one.
>>>
>>> WRT savepoint syntax, I'm thinking of the semantics of the savepoint.
>>> Savepoints are aliases for nested transactions in the DB area [1], and there are
>>> correspondingly global transactions. If we consider Flink jobs as
>>> global transactions and Flink checkpoints as nested transactions,
>>> then the savepoint semantics are close; thus I think the savepoint syntax
>>> in the SQL standard could be considered. But again, I don't have a very
>>> strong preference.
>>>
>>> Pinging @Timo to get more inputs.
>>>
>>> [1] https://en.wikipedia.org/wiki/Nested_transaction
>>>
>>> Best,
>>> Paul Lam
>>>
>>> On May 18, 2022, at 17:48, Jark Wu <imj...@gmail.com> wrote:
>>>
>>> Hi Paul,
>>>
>>> 1) SHOW QUERIES
>>> +1 to add the finished time, but it would be better to call it "end_time" to
>>> keep it aligned with the names in the Web UI.
>>>
>>> 2) DROP QUERY
>>> I think we shouldn't throw exceptions for batch jobs; otherwise, how to
>>> stop batch queries?
>>> At present, I don't think "DROP" is a suitable keyword for this statement.
>>> From the perspective of users, "DROP" sounds like the query should be
>>> removed from the list of "SHOW QUERIES". However, it isn't. Maybe "STOP QUERY"
>>> is more suitable and compliant with the commands of the Flink CLI.
>>>
>>> 3) SHOW SAVEPOINTS
>>> I think this statement is needed; otherwise, savepoints are lost after the
>>> SAVEPOINT command is executed. Savepoints can be retrieved from the REST API
>>> "/jobs/:jobid/checkpoints" [1]
>>> with the filter "checkpoint_type"="savepoint". It's also worth considering
>>> providing "SHOW CHECKPOINTS" to list all checkpoints.
>>>
>>> 4) SAVEPOINT & RELEASE SAVEPOINT
>>> I'm a little concerned with the SAVEPOINT and RELEASE SAVEPOINT statements
>>> now.
>>> In the other vendors, the parameters of SAVEPOINT and RELEASE SAVEPOINT are
>>> both the same savepoint id.
>>> However, in our syntax, the first one is a query id, and the second one is a
>>> savepoint path, which is confusing and not consistent. When I came across
>>> SHOW SAVEPOINT, I thought maybe they should be in the same syntax set.
>>> For example, CREATE SAVEPOINT FOR [QUERY] <query_id> & DROP SAVEPOINT <sp_path>.
>>> That means we don't follow the majority of vendors in the SAVEPOINT commands. I
>>> would say the purpose is different in Flink.
>>> What are others' opinions on this?
>>>
>>> Best,
>>> Jark
>>>
>>> [1]: https://nightlies.apache.org/flink/flink-docs-master/docs/ops/rest_api/#jobs-jobid-checkpoints
>>>
>>> On Wed, 18 May 2022 at 14:43, Paul Lam <paullin3...@gmail.com> wrote:
>>>
>>> Hi Godfrey,
>>>
>>> Thanks a lot for your inputs!
>>>
>>> 'SHOW QUERIES' lists all jobs in the cluster, no limit on APIs (DataStream or SQL)
>>> or clients (SQL client or CLI). Under the hood, it's based on
>>> ClusterClient#listJobs, the same as the Flink CLI. I think it's okay to have
>>> non-SQL jobs listed in the SQL client, because these jobs can be managed via
>>> the SQL client too.
>>>
>>> WRT the finished time, I think you're right. Adding it to the FLIP. But I'm a
>>> bit afraid that the rows would be too long.
>>>
>>> WRT 'DROP QUERY':
>>>
>>> What's the behavior for batch jobs and non-running jobs?
>>>
>>> In general, the behavior would be aligned with the Flink CLI. Triggering a
>>> savepoint for a non-running job would cause errors, and the error message would
>>> be printed to the SQL client. Triggering a savepoint for batch (unbounded) jobs in
>>> streaming execution mode would be the same as for streaming jobs. However, for
>>> batch jobs in batch execution mode, I think there would be an error, because
>>> batch execution doesn't support checkpoints currently (please correct me if
>>> I'm wrong).
>>>
>>> WRT 'SHOW SAVEPOINTS', I've thought about it, but the Flink clusterClient/
>>> jobClient doesn't have such a functionality at the moment, and neither does the
>>> Flink CLI. Maybe we could make it a follow-up FLIP, which includes the
>>> modifications to the clusterClient/jobClient and Flink CLI. WDYT?
>>>
>>> Best,
>>> Paul Lam
>>>
>>> On May 17, 2022, at 20:34, godfrey he <godfre...@gmail.com> wrote:
>>>
>>> Godfrey
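For completeness, the best-effort SHOW SAVEPOINTS approach Jark sketched earlier in the thread (query the checkpoint history at /jobs/:jobid/checkpoints and keep entries whose checkpoint_type is "savepoint") can be prototyped as below. The payload shape is an assumption modeled on that endpoint, not a captured response, and the paths are invented:

```python
# Sketch: best-effort SHOW SAVEPOINTS by filtering a job's checkpoint
# history for completed entries of type "savepoint". This only sees what
# the current jobmanager remembers, matching Paul's caveat above.

def show_savepoints(history):
    return [
        (cp["id"], cp["external_path"])
        for cp in history.get("history", [])
        if cp.get("checkpoint_type") == "savepoint"
        and cp.get("status") == "COMPLETED"
    ]

sample = {
    "history": [
        {"id": 41, "checkpoint_type": "checkpoint", "status": "COMPLETED",
         "external_path": "hdfs:///checkpoints/chk-41"},
        {"id": 42, "checkpoint_type": "savepoint", "status": "COMPLETED",
         "external_path": "hdfs:///savepoints/savepoint-42"},
    ]
}
print(show_savepoints(sample))  # only the savepoint entry survives
```

Savepoints taken to custom paths by a previous jobmanager would be invisible here, which is exactly why the thread treats SHOW SAVEPOINTS as best-effort.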