Re: [DISCUSS] FLIP-145: Support SQL windowing table-valued function

Danny Chan Fri, 09 Oct 2020 23:44:34 -0700

Thanks for driving this Jark ~

The FLIP overall looks good, i think the window TVFs would be the main "entry 
point" syntax for our NTR use cases.
The syntax originated from the "One SQL To Rule Them All" paper and i think we 
have reached an agreement.


I want to make some additions to the window TVF syntax here

- We support the standard syntax of polymorphic table functions with named 
parameters. e.g.

select *
from table(
tumble(
  DATA => table Shipments,
  TIMECOL => descriptor(rowtime),
  SIZE => INTERVAL '1' MINUTE))

- The first parameter can also be any form of sub-query, e.g.

select *
from table(
  tumble((select * from Shipments), descriptor(rowtime), INTERVAL '1' MINUTE))

The current proposed syntax is a simplified one and for most of the cases it is 
more easy to use.

The question asked by Benchao is reasonable, as we suggest the window TVF as a 
normal UDTF, it can be chained with any kind of other non-windowed operators, 
we may need some time to give clear semantics for them (especially it is good 
if there are real use cases), as a start, let us focus the windowed operations 
first.

I'm also +1 for voting ~

Best,
Danny Chan
在 2020年10月10日 +0800 PM2:03，Benchao Li <libenc...@apache.org>，写道：
> Hi Jark,
>
> Thanks for your reply, this makes sense to me.
>
> The scenario I used above is just a case to explain what I'm concerned
> about,
> not necessarily a production use case. We can leave it to the future to see
> whether
> other users have these use cases.
>
> Then I have no other concerns, +1 to start the VOTE.
>
>
> Jark Wu <imj...@gmail.com> 于2020年10月10日周六 下午1:44写道：
>
> > Hi Benchao,
> >
> > That's a good question.
> >
> > IMO, the new windowed operators and the current time operators are two
> > different sets of functions,
> > just like time operators and non-time operators are two different sets of
> > functions.
> > I think it's fine if we don't support integrating them, just like time
> > operators can't be applied on non-windowed aggregate.
> > If users want to use time operators in the whole pipeline, then he/she can
> > use the grouped window aggregates instead of the window TVFs.
> >
> > The key idea of window TVF is that all the operators in the pipeline are
> > based on the **windows**.
> > In terms of syntax, if the key clause (e.g. group by, partitioned by, join
> > on, order by) contains window_start and window_end,
> > it can be translated into windowed operators.
> > Thus, we will have windowed CEP, windowed sort, windowed over aggregate in
> > the future to make it possible to build a windowed pipeline.
> >
> > But I think we can elaborate the integration more in the future if users
> > need it. Actually, I don't fully understand the scenario of integrating
> > window TVF and time operators at this point.
> > For example, interval join an input stream and a window join result. I
> > don't see why it can't be expressed by nested window join and why users
> > have to use interval join here.
> > Maybe we can wait for more inputs from users when the window TVF is
> > released and we can elaborate it again.
> >
> > Best,
> > Jark
> >
> > On Sat, 10 Oct 2020 at 12:01, 刘 芃成 <pengchengliucr...@gmail.com> wrote:
> >
> > > Hi, Benchao,
> > > I think I got your point, actually, in current implementation for
> > > group window aggregation, the value of time attributes(e.g.
> > > TUMBLE_ROWTIME/TUMBLE_PROCTIME) is calculated as (window_end – 1), so I
> > > think we can just use it directly if you need this. But I think this time
> > > attributes is mainly suggested to use in case of cascaded window 
> > > operations.
> > > Regarding the example you provided, I think the semantics of the SQL in
> > > your example which doing interval join(e.g. with TUMBLE_ROWTIME) after
> > > window aggregation is not clear in the current implementation, and I think
> > > that’s a strong reason why we need the new TVFs syntax.
> > > With the new syntax, users should understand which time column to
> > > use and how to generate it when doing interval join and etc.
> > >
> > > Best,
> > > Pengcheng
> > >
> > > 发件人: Benchao Li <libenc...@apache.org>
> > > 日期: 2020年10月10日 星期六 上午11:02
> > > 收件人: pengcheng Liu <pengchengliucr...@gmail.com>
> > > 抄送: dev <dev@flink.apache.org>
> > > 主题: Re: [DISCUSS] FLIP-145: Support SQL windowing table-valued function
> > >
> > > Hi pengcheng,
> > >
> > > Thanks for your response.
> > > I knew that the original time attribute column will be retained after the
> > > TVF,
> > > what I'm questioning is how do we get the time attribute column after
> > > Aggregation.
> > > Your answer did not remove my doubts about this.
> > >
> > > It's ok if we did not plan to integrate new TVF aggregate with old "time
> > > attribute scenarios"
> > > listed in my previous email in this FLIP. However it's good to elaborate
> > > more on this, and
> > > leave it to the future plan.
> > >
> > > pengcheng Liu <pengchengliucr...@gmail.com<mailto:
> > > pengchengliucr...@gmail.com>> 于2020年10月10日周六 上午10:45写道：
> > > Hi，Benchao,
> > > In TVFs, the time attributes is just passed through from parent rels,
> > > and the TVFs just add two
> > > additional window attributes(i.e. window_start & window_end). Also, I
> > > think the time columns can be not only a time attribute
> > > with type of `TimeIndicatorType` but also a regular column with type
> > > of `Timestamp`.
> > >
> > > For cascaded window operations, we can use window_start/window_end of
> > > the previous window result directly to
> > > indicate operating on the same window, or use new DESCRIPTOR column
> > > to assign new windows, in case of the change of
> > > the time column(e.g. in some case, the original timestamp is
> > > inaccurate and need some conversion to be used).
> > >
> > > You can check the definition or signature of these TVFs in the FLIP.
> > > e.g.
> > > SELECT * FROM TABLE(
> > > TUMBLE(TABLE Bid, DESCRIPTOR(bidtime), INTERVAL '10' MINUTES))
> > > In the example, the `bidtime` is the time attribute column, which is
> > > the first operand of the DESCRIPTOR function.
> > >
> > > +1 start voting.
> > >
> > > Benchao Li <libenc...@apache.org<mailto:libenc...@apache.org>>
> > > 于2020年10月10日周六 上午10:08写道：
> > > Hi Jark,
> > >
> > > 2 & 3 sounds good to me.
> > >
> > > Regarding time attribute,
> > > I still have some questions, I knew it's easy to support cascaded window
> > > aggregate using new TVFs.
> > > However there are some other places where need time attribute:
> > > - CEP
> > > - interval join
> > > - order by
> > > - over window
> > > If there is no time attribute column, how do we integrate these old
> > > features with the new TVFs.
> > > E.g.
> > > StreamA -> new window aggregate -> interval join -> Sink
> > > /
> > > StreamB -----------------------------------
> > >
> > >
> > > Jark Wu <imj...@gmail.com<mailto:imj...@gmail.com>> 于2020年10月9日周五
> > > 下午11:51写道：
> > > Hi Benchao,
> > >
> > > 1) time attribute
> > > Yes. We don't need time attribute auxiliary function. Because the new
> > > window operations are all based on the
> > > window_start and window_end columns instead of on the time attributes. So
> > > we don't need to propagate time attributes.
> > > Cascaded window aggregate can be expressed by simply GROUP BY the
> > > window_start and window_end of the previous window result.
> > > I have added a cascaded window aggregate example in the Tumbling Window
> > > section in the FLIP.
> > > If you want to define proctime window aggregate, the time column in TVF
> > > should be a proctime attribute field (or PROCTIME() function).
> > >
> > > 2) batch support
> > > Yes. The proposed syntax/API are unified for batch and streaming. Batch
> > > support is in the plan, but may not have enough time to catch up 1.12.
> > >
> > > 3) support `grouping sets`
> > > This is not included in the FLIP, but I think it's great if we can support
> > > `grouping sets`.
> > > The existing window impl doesn't support this because we convert the
> > > LogicalAggregate into WindowAggregate in the beginning,
> > > the expand grouping sets rule can't be applied in this situation.
> > > Fortunately, with the new window impl, the conversion to WindowAggregate
> > > will happen at the end, so I think the expand rule can be
> > > applied and support this feature naturally.
> > > Therefore, IMO, we don't need to include this feature in this FLIP to
> > > avoid
> > > the FLIP being too large.
> > > This can be a follow-up issue (maybe just add tests and docs) after the
> > > FLIP.
> > >
> > > Best,
> > > Jark
> > >
> > >
> > > On Fri, 9 Oct 2020 at 19:09, 刘 芃成 <pengchengliucr...@gmail.com<mailto:
> > > pengchengliucr...@gmail.com>> wrote:
> > >
> > > > Hi，Benchao,
> > > > Welcome to join the discussion, yes, this new syntax can make
> > > SQL
> > > > more clear and simpler.
> > > > For your first question, the `window_start` and `window_end`
> > > > columns will be added automatically,
> > > > so we don't need to use auxiliary group functions to infer or
> > > > access the window properties.
> > > >
> > > > For the `grouping sets` on TVFs, I think it's interesting if we
> > > > can support it, as we already supported `grouping sets`
> > > > on streaming aggregates in blink planner. But I'm not sure if it
> > > > will be included into this FLIP.
> > > >
> > > > cc @Jark Wu
> > > >
> > > > Best,
> > > > Pengcheng
> > > >
> > > >
> > > > 在 2020/10/9 下午5:25，“Benchao Li”<libenc...@apache.org<mailto:
> > > libenc...@apache.org>> 写入:
> > > >
> > > > Thanks Jark for bringing this discussion, I like this FLIP very
> > > much.
> > > >
> > > > Especially the cumulate window, it's much like the current TUMBLE
> > > > window +
> > > > Fast Emit (which is an undocumented experimental feature), however,
> > > > it's
> > > > more powerful.
> > > >
> > > > And This will make the SQL semantic more standard, especially for
> > > the
> > > > HOPPING window.
> > > >
> > > > Regarding time attribute,
> > > > It seems that we don't need a specific function to infer the time
> > > > attribute
> > > > like
> > > > `TUMBLE_ROWTIME` / `TUMBLE_PROCTIME`. Then are `window_start` and
> > > > `window_end`
> > > > column a time attribute column automatically?
> > > > - If not, what will be the time attribute of the result relation of
> > > > these
> > > > TVFs?
> > > > Especially after the window aggregation.
> > > > - If yes, then how do we handle proctime?
> > > >
> > > > Regarding batch operators,
> > > > It's great to hear that we can reuse the batch operators in
> > > continuous
> > > > batch mode
> > > > as you mentioned in the FLIP.
> > > > Current window aggregate could also be used in batch mode with
> > > > rowtime. Do
> > > > you plan
> > > > to support these TVFs for batch mode in this FLIP? Hence the
> > > Table/SQL
> > > > is a
> > > > unified
> > > > API, it's great if we can keep the features complete both in
> > > streaming
> > > > and
> > > > batch mode.
> > > >
> > > > There is one more question, I don't know whether it should be
> > > > considered in
> > > > this FLIP.
> > > > Does the new window support `grouping sets`? (It's not supported in
> > > old
> > > > window impl).
> > > >
> > > > Jark Wu <imj...@gmail.com<mailto:imj...@gmail.com>> 于2020年10月9日周五
> > > 下午4:14写道：
> > > >
> > > > > Hi all,
> > > > >
> > > > > I know we have a lot of discussion and development on going right
> > > > now but
> > > > > it would be great if we can get FLIP-145 into a votable state.
> > > > > If there are no objections, I would like to start voting in the
> > > next
> > > > days.
> > > > >
> > > > > Best,
> > > > > Jark
> > > > >
> > > > > On Thu, 1 Oct 2020 at 14:29, Jark Wu <imj...@gmail.com<mailto:
> > > imj...@gmail.com>> wrote:
> > > > >
> > > > > > Hi everyone,
> > > > > >
> > > > > > I have added a section for Performance Optimization to describe
> > > > how to
> > > > > > improve the performance in the short-term and long-term
> > > > > > and sketch the future performance potential under the new window
> > > > API.
> > > > > > Introducing the window API is just the first step, we will
> > > > > > continuously improve the performance to make it powerful and
> > > > useful.
> > > > > >
> > > > > > Best,
> > > > > > Jark
> > > > > >
> > > > > > On Thu, 1 Oct 2020 at 14:28, Jark Wu <imj...@gmail.com<mailto:
> > > imj...@gmail.com>> wrote:
> > > > > >
> > > > > > > Hi Pengcheng,
> > > > > > >
> > > > > > > Yes, the window TVF is part of the FLIP. Welcome to contribute
> > > > and join
> > > > > > > the discussion.
> > > > > > > Regarding the SESSION window aggregation, users can use the
> > > > existing
> > > > > > > grouped session window function.
> > > > > > >
> > > > > > > Best,
> > > > > > > Jark
> > > > > > >
> > > > > > > On Sun, 27 Sep 2020 at 21:24, liupengcheng <
> > > > pengchengliucr...@gmail.com<mailto:pengchengliucr...@gmail.com>
> > > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi Jark,
> > > > > > > > Thanks for reply, yes, I think it's a good feature, it
> > > > can
> > > > > > > > improve the NRT scenarios
> > > > > > > > as you mentioned in the FLIP. Also, I think it can
> > > > improve the
> > > > > > > > streaming SQL greatly,
> > > > > > > > it can support richer window operations in flink SQL
> > > and
> > > > bring
> > > > > > > > great convenience to users.
> > > > > > > > (we are now only supported group window in flink).
> > > > > > > >
> > > > > > > > Regarding the SESSION window, I think it's especially
> > > > useful
> > > > > for
> > > > > > > > user behavior analysis(e.g.
> > > > > > > > counting user visits on a news website or social
> > > > platform), but
> > > > > > > > I agree that we can keep it
> > > > > > > > out of the FLIP now to catch up 1.12.
> > > > > > > >
> > > > > > > > Recently, I've done some work on the stream planner
> > > with
> > > > the
> > > > > > > > TVFs, and I'm willing to contribute
> > > > > > > > to this part. Is it in the plan of this FLIP?
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > PengchengLiu
> > > > > > > >
> > > > > > > >
> > > > > > > > 在 2020/9/26 下午11:09，“Jark Wu”<imj...@gmail.com<mailto:
> > > imj...@gmail.com>> 写入:
> > > > > > > >
> > > > > > > > Hi pengcheng,
> > > > > > > >
> > > > > > > > That's great to see you also have the need of window join.
> > > > > > > > You are right, the windowing TVF is a powerful feature
> > > which
> > > > can
> > > > > > > > support
> > > > > > > > more operations in the future.
> > > > > > > > I think it as of the date time "partition" selection in
> > > > batch SQL
> > > > > > > > jobs,
> > > > > > > > with this new syntax, I think it is possible
> > > > > > > > to migrate traditional batch SQL jobs to Flink SQL by
> > > > changing a
> > > > > > > > few lines.
> > > > > > > >
> > > > > > > > Regarding the SESSION window, this is on purpose to keep
> > > it
> > > > out of
> > > > > > > > the
> > > > > > > > FLIP, because we want to keep the
> > > > > > > > FLIP small to catch up 1.12 and SESSION TVF is rarely
> > > useful
> > > > (e.g.
> > > > > > > > session
> > > > > > > > window join?).
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Jark
> > > > > > > >
> > > > > > > > On Fri, 25 Sep 2020 at 22:59, liupengcheng <
> > > > > > > > pengchengliucr...@gmail.com<mailto:
> > > pengchengliucr...@gmail.com>>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi, Jark,
> > > > > > > > > I'm very interested in this feature, and I'm
> > > also
> > > > working
> > > > > > > > on this
> > > > > > > > > recently.
> > > > > > > > > I just have a glance at the FLIP, it's good,
> > > but I
> > > > found
> > > > > > > > that
> > > > > > > > > there is no plan to add SESSION windows.
> > > > > > > > > Also, I think there can be more things we can do
> > > > based on
> > > > > > > > this new
> > > > > > > > > syntax. For example,
> > > > > > > > > - window sort support
> > > > > > > > > - window union/intersect/minus support
> > > > > > > > > - Improve dimension table join
> > > > > > > > > We can have more deep discussion on this new
> > > > feature
> > > > > later
> > > > > > > > .
> > > > > > > > > I've also opened an jira that is related to this
> > > > feature
> > > > > > > > recently:
> > > > > > > > > https://issues.apache.org/jira/browse/FLINK-18830
> > > > > > > > >
> > > > > > > > > Best!
> > > > > > > > > PengchengLiu
> > > > > > > > >
> > > > > > > > > 在 2020/9/25 下午10:30，“Jark Wu”<imj...@gmail.com<mailto:
> > > imj...@gmail.com>> 写入:
> > > > > > > > >
> > > > > > > > > Hi everyone,
> > > > > > > > >
> > > > > > > > > I want to start a FLIP about supporting windowing
> > > > > table-valued
> > > > > > > > > functions
> > > > > > > > > (TVF).
> > > > > > > > > The main purpose of this FLIP is to improve the near
> > > > > real-time
> > > > > > > > (NRT)
> > > > > > > > > experience of Flink.
> > > > > > > > >
> > > > > > > > > FLIP-145:
> > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > >
> > > >
> > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-145%3A+Support+SQL+windowing+table-valued+function
> > > > > > > > >
> > > > > > > > > We want to introduce TUMBLE, HOP, CUMULATE windowing
> > > > TVFs,
> > > > > the
> > > > > > > > > CUMULATE is
> > > > > > > > > a new kind of window.
> > > > > > > > > With the windowing TVFs, we can support richer
> > > > operations on
> > > > > > > > windows,
> > > > > > > > > including window join, window TopN and so on.
> > > > > > > > > This makes things simple: we only need to assign
> > > > windows at
> > > > > the
> > > > > > > > > beginning
> > > > > > > > > of the query, and then apply operations after that
> > > like
> > > > > > > > traditional
> > > > > > > > > batch
> > > > > > > > > SQL.
> > > > > > > > > We hope it can help to reduce the learning curve of
> > > > windows,
> > > > > > > > improve
> > > > > > > > > NRT
> > > > > > > > > for Flink, and attract more batch users.
> > > > > > > > >
> > > > > > > > > A simple code snippet for 10 minutes tumbling window
> > > > > aggregate:
> > > > > > > > >
> > > > > > > > > SELECT window_start, window_end, SUM(price)
> > > > > > > > > FROM TABLE(
> > > > > > > > > TUMBLE(TABLE Bid, DESCRIPTOR(bidtime), INTERVAL
> > > > '10'
> > > > > > > > MINUTES))
> > > > > > > > > GROUP BY window_start, window_end;
> > > > > > > > >
> > > > > > > > > I'm looking forward to your feedback.
> > > > > > > > >
> > > > > > > > > Best,
> > > > > > > > > Jark
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > >
> > > > Best,
> > > > Benchao Li
> > > >
> > >
> > >
> > > --
> > >
> > > Best,
> > > Benchao Li
> > >
> > >
> > > --
> > >
> > > Best,
> > > Benchao Li
> > >
> >
>
> --
>
> Best,
> Benchao Li

Re: [DISCUSS] FLIP-145: Support SQL windowing table-valued function

Reply via email to