Hi Paul, hi Jark,

Re JOBTREE, I agree that it is out of the scope of this FLIP.
Re `RELEASE SAVEPOINT ALL`, if the community prefers 'DROP', then 'DROP SAVEPOINT ALL' could do the housekeeping. WDYT?

Best regards,
Jing

On Wed, Jun 8, 2022 at 2:54 PM Jark Wu <imj...@gmail.com> wrote:

> Hi Jing,
>
> Regarding JOBTREE (job lineage), I agree with Paul that this is out of the scope of this FLIP and can be discussed in another FLIP.
>
> Job lineage is a big topic that may involve many problems:
> 1) how to collect and report job entities, attributes, and lineages?
> 2) how to integrate with data catalogs, e.g. Apache Atlas, DataHub?
> 3) how does Flink SQL CLI/Gateway know the lineage information and show the jobtree?
> 4) ...
>
> Best,
> Jark
>
> On Wed, 8 Jun 2022 at 20:44, Jark Wu <imj...@gmail.com> wrote:
>
>> Hi Paul,
>>
>> I'm fine with using JOBS. The only concern is that this may conflict with displaying more detailed information for queries (e.g. query content, plan) in the future, e.g. SHOW QUERIES EXTENDED in ksqldb[1]. This is not a big problem as we can introduce SHOW QUERIES in the future if necessary.
>>
>> > STOP JOBS <job_id> (with options `table.job.stop-with-savepoint` and `table.job.stop-with-drain`)
>> What about STOP JOB <job_id> [WITH SAVEPOINT] [WITH DRAIN]? It might be tedious and error-prone to set configuration before executing a statement, and the configuration will affect all statements after that.
>>
>> > CREATE SAVEPOINT <savepoint_path> FOR JOB <job_id>
>> We can simplify the statement to "CREATE SAVEPOINT FOR JOB <job_id>", and always use configuration "state.savepoints.dir" as the default savepoint dir. The concern with using "<savepoint_path>" is that it should be a savepoint dir, while savepoint_path is the returned value.
>>
>> I'm fine with other changes.
>>
>> Thanks,
>> Jark
>>
>> [1]: https://docs.ksqldb.io/en/latest/developer-guide/ksqldb-reference/show-queries/
>>
>> On Wed, 8 Jun 2022 at 15:07, Paul Lam <paullin3...@gmail.com> wrote:
>>
>>> Hi Jing,
>>>
>>> Thank you for your inputs!
>>>
>>> TBH, I haven’t considered the ETL scenario that you mentioned. I think they’re managed just like other jobs in terms of job lifecycles (please correct me if I’m wrong).
>>>
>>> WRT the SQL statements about SQL lineages, I think it might be a little bit out of the scope of the FLIP, since it’s mainly about lifecycles. By the way, do we have these functionalities in the Flink CLI or REST API already?
>>>
>>> WRT `RELEASE SAVEPOINT ALL`, I’m sorry for the deprecated FLIP docs; the community is more in favor of `DROP SAVEPOINT <savepoint_path>`. I’m updating the FLIP according to the latest discussions.
>>>
>>> Best,
>>> Paul Lam
>>>
>>> On Jun 8, 2022, at 07:31, Jing Ge <j...@ververica.com> wrote:
>>>
>>> Hi Paul,
>>>
>>> Sorry that I am a little bit too late to join this thread. Thanks for driving this and starting this informative discussion. The FLIP looks really interesting. It will help us a lot to manage Flink SQL jobs.
>>>
>>> Have you considered the ETL scenario with Flink SQL, where multiple SQLs build a DAG, for many DAGs?
>>>
>>> 1)
>>> +1 for SHOW JOBS. I think sooner or later we will start to discuss how to support ETL jobs. Briefly speaking, the SQLs that are used to build the DAG are responsible for *producing* data as the result (cube, materialized view, etc.) for future consumption by queries. The INSERT INTO SELECT FROM example in the FLIP and CTAS are typical SQL in this case. I would prefer to call them Jobs instead of Queries.
>>>
>>> 2)
>>> Speaking of the ETL DAG, we might want to see the lineage.
Is it possible to support syntax like:
>>>
>>> SHOW JOBTREE <job_id> // shows the downstream DAG from the given job_id
>>> SHOW JOBTREE <job_id> FULL // shows the whole DAG that contains the given job_id
>>> SHOW JOBTREES // shows all DAGs
>>> SHOW ANCESTORS <job_id> // shows all parents of the given job_id
>>>
>>> 3)
>>> Could we also support savepoint housekeeping syntax? We ran into the issue that a lot of savepoints have been created by customers (via their apps). It will take extra (hacking) effort to clean them up.
>>>
>>> RELEASE SAVEPOINT ALL
>>>
>>> Best regards,
>>> Jing
>>>
>>> On Tue, Jun 7, 2022 at 2:35 PM Martijn Visser <martijnvis...@apache.org> wrote:
>>>
>>>> Hi Paul,
>>>>
>>>> I'm still doubting the keyword for the SQL applications. SHOW QUERIES could imply that this will actually show the query, but we're returning IDs of the running applications. At first I was also not very much in favour of SHOW JOBS since I prefer calling it 'Flink applications' and not 'Flink jobs', but the glossary [1] made me reconsider. I would +1 SHOW/STOP JOBS.
>>>>
>>>> Also +1 for the CREATE/SHOW/DROP SAVEPOINT syntax.
>>>>
>>>> Best regards,
>>>>
>>>> Martijn
>>>>
>>>> [1] https://nightlies.apache.org/flink/flink-docs-master/docs/concepts/glossary
>>>>
>>>> On Sat, Jun 4, 2022 at 10:38, Paul Lam <paullin3...@gmail.com> wrote:
>>>>
>>>> > Hi Godfrey,
>>>> >
>>>> > Sorry for the late reply, I was on vacation.
>>>> >
>>>> > It looks like we have a variety of preferences on the syntax; how about we choose the most acceptable one?
>>>> >
>>>> > WRT the keyword for SQL jobs, we use JOBS; thus the statements related to jobs would be:
>>>> >
>>>> > - SHOW JOBS
>>>> > - STOP JOBS <job_id> (with options `table.job.stop-with-savepoint` and `table.job.stop-with-drain`)
>>>> >
>>>> > WRT savepoints for SQL jobs, we use the `CREATE/DROP` pattern with `FOR JOB`:
>>>> >
>>>> > - CREATE SAVEPOINT <savepoint_path> FOR JOB <job_id>
>>>> > - SHOW SAVEPOINTS FOR JOB <job_id> (show savepoints the current job manager remembers)
>>>> > - DROP SAVEPOINT <savepoint_path>
>>>> >
>>>> > cc @Jark @ShengKai @Martijn @Timo .
>>>> >
>>>> > Best,
>>>> > Paul Lam
>>>> >
>>>> > godfrey he <godfre...@gmail.com> wrote on Mon, May 23, 2022 at 21:34:
>>>> >
>>>> >> Hi Paul,
>>>> >>
>>>> >> Thanks for the update.
>>>> >>
>>>> >> > 'SHOW QUERIES' lists all jobs in the cluster, no limit on APIs (DataStream or SQL) or clients (SQL client or CLI).
>>>> >>
>>>> >> Is a DataStream job a QUERY? I think not. For a QUERY, the most important concept is the statement. But the result does not contain this info. If we need to include all jobs in the cluster, I think the name should be JOB or PIPELINE. I lean to SHOW PIPELINES and STOP PIPELINE [IF RUNNING] id.
>>>> >>
>>>> >> > SHOW SAVEPOINTS
>>>> >> To list the savepoints for a specific job, we need to specify a specific pipeline; the syntax should be SHOW SAVEPOINTS FOR PIPELINE id.
>>>> >>
>>>> >> Best,
>>>> >> Godfrey
>>>> >>
>>>> >> Paul Lam <paullin3...@gmail.com> wrote on Fri, May 20, 2022 at 11:25:
>>>> >> >
>>>> >> > Hi Jark,
>>>> >> >
>>>> >> > WRT “DROP QUERY”, I agree that it’s not very intuitive, and that’s part of the reason why I proposed “STOP/CANCEL QUERY” at the beginning. The downside of it is that it’s not ANSI-SQL compatible.
>>>> >> >
>>>> >> > Another question is, what should be the syntax for ungracefully canceling a query?
As ShengKai pointed out in an offline discussion, “STOP QUERY” and “CANCEL QUERY” might confuse SQL users. The Flink CLI has both stop and cancel, mostly due to historical problems.
>>>> >> >
>>>> >> > WRT “SHOW SAVEPOINT”, I agree it’s a missing part. My concern is that savepoints are owned by users and beyond the lifecycle of a Flink cluster. For example, a user might take a savepoint at a custom path that’s different from the default savepoint path; I think the jobmanager would not remember that, not to mention the jobmanager may be a fresh new one after a cluster restart. Thus if we support “SHOW SAVEPOINT”, it's probably a best-effort one.
>>>> >> >
>>>> >> > WRT savepoint syntax, I’m thinking of the semantics of the savepoint. Savepoints are aliases for nested transactions in the DB area[1], and there are correspondingly global transactions. If we consider Flink jobs as global transactions and Flink checkpoints as nested transactions, then the savepoint semantics are close; thus I think the savepoint syntax in the SQL standard could be considered. But again, I don’t have a very strong preference.
>>>> >> >
>>>> >> > Ping @Timo to get more inputs.
>>>> >> >
>>>> >> > [1] https://en.wikipedia.org/wiki/Nested_transaction
>>>> >> >
>>>> >> > Best,
>>>> >> > Paul Lam
>>>> >> >
>>>> >> > > On May 18, 2022, at 17:48, Jark Wu <imj...@gmail.com> wrote:
>>>> >> > >
>>>> >> > > Hi Paul,
>>>> >> > >
>>>> >> > > 1) SHOW QUERIES
>>>> >> > > +1 to add finished time, but it would be better to call it "end_time" to keep aligned with the names in the Web UI.
>>>> >> > >
>>>> >> > > 2) DROP QUERY
>>>> >> > > I think we shouldn't throw exceptions for batch jobs; otherwise, how do we stop batch queries?
>>>> >> > > At present, I don't think "DROP" is a suitable keyword for this statement. From the perspective of users, "DROP" sounds like the query should be removed from the list of "SHOW QUERIES". However, it isn't. Maybe "STOP QUERY" is more suitable and compliant with the commands of the Flink CLI.
>>>> >> > >
>>>> >> > > 3) SHOW SAVEPOINTS
>>>> >> > > I think this statement is needed; otherwise, savepoints are lost after the SAVEPOINT command is executed. Savepoints can be retrieved from the REST API "/jobs/:jobid/checkpoints" by filtering on "checkpoint_type"="savepoint". It's also worth considering providing "SHOW CHECKPOINTS" to list all checkpoints.
>>>> >> > >
>>>> >> > > 4) SAVEPOINT & RELEASE SAVEPOINT
>>>> >> > > I'm a little concerned with the SAVEPOINT and RELEASE SAVEPOINT statements now. In other vendors' databases, the parameters of SAVEPOINT and RELEASE SAVEPOINT are both the same savepoint id. However, in our syntax, the first one is a query id, and the second one is a savepoint path, which is confusing and not consistent. When I came across SHOW SAVEPOINT, I thought maybe they should be in the same syntax set. For example, CREATE SAVEPOINT FOR [QUERY] <query_id> & DROP SAVEPOINT <sp_path>. That means we don't follow the majority of vendors in SAVEPOINT commands; I would say the purpose is different in Flink. What are others' opinions on this?
>>>> >> > >
>>>> >> > > Best,
>>>> >> > > Jark
>>>> >> > >
>>>> >> > > [1]: https://nightlies.apache.org/flink/flink-docs-master/docs/ops/rest_api/#jobs-jobid-checkpoints
>>>> >> > >
>>>> >> > > On Wed, 18 May 2022 at 14:43, Paul Lam <paullin3...@gmail.com> wrote:
>>>> >> > >
>>>> >> > >> Hi Godfrey,
>>>> >> > >>
>>>> >> > >> Thanks a lot for your inputs!
>>>> >> > >>
>>>> >> > >> 'SHOW QUERIES' lists all jobs in the cluster, no limit on APIs (DataStream or SQL) or clients (SQL client or CLI). Under the hood, it’s based on ClusterClient#listJobs, the same as the Flink CLI. I think it’s okay to have non-SQL jobs listed in the SQL client, because these jobs can be managed via the SQL client too.
>>>> >> > >>
>>>> >> > >> WRT finished time, I think you’re right. Adding it to the FLIP. But I’m a bit afraid that the rows would be too long.
>>>> >> > >>
>>>> >> > >> WRT ‘DROP QUERY’,
>>>> >> > >>> What's the behavior for batch jobs and the non-running jobs?
>>>> >> > >>
>>>> >> > >> In general, the behavior would be aligned with the Flink CLI. Triggering a savepoint for a non-running job would cause errors, and the error message would be printed to the SQL client. Triggering a savepoint for batch (bounded) jobs in streaming execution mode would be the same as for streaming jobs. However, for batch jobs in batch execution mode, I think there would be an error, because batch execution doesn’t support checkpoints currently (please correct me if I’m wrong).
>>>> >> > >>
>>>> >> > >> WRT ‘SHOW SAVEPOINTS’, I’ve thought about it, but the Flink clusterClient/jobClient doesn’t have such functionality at the moment, and neither does the Flink CLI. Maybe we could make it a follow-up FLIP, which includes the modifications to the clusterClient/jobClient and the Flink CLI. WDYT?
>>>> >> > >>
>>>> >> > >> Best,
>>>> >> > >> Paul Lam
>>>> >> > >>
>>>> >> > >>> On May 17, 2022, at 20:34, godfrey he <godfre...@gmail.com> wrote:
>>>> >> > >>>
>>>> >> > >>> Godfrey
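[Editor's note: for readers skimming the archive, the statements being converged on in this thread can be summarized as the sketch below. This is only an illustration of the syntax proposals quoted above — the FLIP itself is the source of truth, and the exact keywords (JOB vs. QUERY vs. PIPELINE) were still under debate at this point in the discussion.]

```sql
-- List all jobs known to the cluster, SQL and non-SQL alike
-- (JOBS rather than QUERIES, per the glossary-based consensus above)
SHOW JOBS;

-- Jark's suggested inline form, replacing the configuration options
-- `table.job.stop-with-savepoint` / `table.job.stop-with-drain`
STOP JOB '<job_id>' [WITH SAVEPOINT] [WITH DRAIN];

-- Jark's simplification: the savepoint dir defaults to
-- `state.savepoints.dir`, and the savepoint path is the returned value
CREATE SAVEPOINT FOR JOB '<job_id>';

-- Best-effort: lists only savepoints the current JobManager remembers
SHOW SAVEPOINTS FOR JOB '<job_id>';

-- Deletes a savepoint by the path returned from CREATE SAVEPOINT
DROP SAVEPOINT '<savepoint_path>';
```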