Thanks Villiam for pointing this out! So the standard still solves duplicate names in "TUMBLE(TUMBLE())" by alias, which makes sense. Then we can evaluate whether Calcite supports alias for table function in a correct way in CALCITE-4274 <https://issues.apache.org/jira/browse/CALCITE-4274>.
-Rui On Thu, Sep 24, 2020 at 3:12 AM Viliam Durina <[email protected]> wrote: > I think we're discussing two issues. One is "TUMBLE join TUMBLE" and the > other is "TUMBLE(TUMBLE())". In both cases the `window_start` and > `window_end` columns are duplicated. > > In case of JOIN, I don't see any issue. It's equivalent to any join of two > tables, e.g.: > > SELECT * > FROM t1 > JOIN t2 on t1.id=t2.id > > You'll have two ID columns in the result, no issue here. > > In case of "TUMBLE(TUMBLE())", I'd point to the feature of polymorphic > table functions in the standard: > > CREATE TABLE input_table(window_start, window_end, timestamp); > > SELECT T.window_start, T.window_end, T.timestamp, F.window_start, > F.window_end > FROM TABLE(TUMBLE(TABLE(input_table) AS T, DESCRIPTOR(timestamp), > INTERVAL '5' MINUTES)) AS F > > As you can see, you can refer to the duplicate columns using a different > alias: `T` alias is for the column in the input_table, `F` alias is for the > columns added by the function. > > Viliam > > On Thu, 24 Sep 2020 at 00:21, Kenneth Knowles <[email protected]> wrote: > > > It happens in "raw Beam" somewhat often that there is one windowed > > aggregation followed by another. It is pretty unheard of to "assign" > > windows twice before an aggregation, because Beam cannot do anything > useful > > with this: Beam does not have the ability to have two different > > fields/columns that are both associated with a watermark. So TUMBLE > > followed by TUMBLE is more expressive. You get a number of columns all of > > which can be used for windowed aggregation. > > > > One reason this could happen is that you are authoring a table function > > that operates generically. It does not know that its input has already > been > > TUMBLED. In Beam you would reify the prior window information, do your > own > > TUMBLE. Likely you would do an aggregation and then restore the prior > > window information somehow. A compelling example does not come to mind, > but > > I think considering applying TUMBLE to a generic input helps clarify that > > it could happen in code worth allowing. > > > > Kenn > > > > On Wed, Sep 23, 2020 at 11:47 AM Julian Hyde <[email protected]> wrote: > > > > > I don't think it's a problem with table functions in general. And > > > besides, we can't change the semantics of table functions. A table > > > function must not produce duplicate column names. > > > > > > The problem is with the semantics of these particular table functions > > > - HOP, TUMBLE, SESSION - and what semantics are desirable depends on > > > how people will typically use them. Is it common to follow TUMBLE with > > > TUMBLE? What would a user expect to be the output columns? > > > > > > On Wed, Sep 23, 2020 at 11:21 AM Rui Wang <[email protected]> > wrote: > > > > > > > > >Is it reasonable to apply TUMBLE to TUMBLE? If so, would people > > > > > generally want two sets of window_start, window_end columns? > > > > > > > > I think it is reasonable to apply TUMBLE to TUMBLE or even TUMBLE to > > HOP > > > > join, as long as there is a real requirement there. The window > > > starts/ends > > > > are not duplicates. For example TUMBLE as L JOIN HOP as R, L offers a > > > > window start and a window end, same for R. This is no different from > a > > > > normal JOIN case where both JOIN sides have the same column names > (but > > > they > > > > are not considered duplicates). The SQL rule is still applying: > within > > a > > > > scope there shouldn't be ambiguous column names (e.g. duplicated > column > > > > name). For JOIN duplicate names from JOIN inputs are differentiated > by > > > > table alias. > > > > > > > > > > > > Regarding https://issues.apache.org/jira/browse/CALCITE-4274, this > is > > an > > > > interesting case that is different from the JOIN case, and I also > think > > > > this is a general case (not limited to TUMBLE). > > > > > > > > Think about that for any query that uses table function of the > pattern > > > > in CALCITE-4274. The first table function generates column A and then > > it > > > > becomes the input for the second table function, which also wants to > > > append > > > > a column named "A". How should Calcite handle this case? > > > > > > > > > > > > -Rui > > > > > > > > > > > > > > > > On Wed, Sep 23, 2020 at 9:13 AM Julian Hyde <[email protected]> > wrote: > > > > > > > > > I think we should also discuss > > > > > https://issues.apache.org/jira/browse/CALCITE-4274 here. > > > > > > > > > > We've never discussed what should happen if you apply TUMBLE to > > TUMBLE > > > > > (or TUMBLE to HOP, etc.). What happens now is that you get > duplicate > > > > > columns. > > > > > > > > > > Is it reasonable to apply TUMBLE to TUMBLE? If so, would people > > > > > generally want two sets of window_start, window_end columns? > > > > > > > > > > Julian > > > > > > > > > > On Wed, Sep 23, 2020 at 2:41 AM Danny Chan <[email protected]> > > > wrote: > > > > > > > > > > > > Thanks for the feedback, I agree we should keep the verbose part > > > > > > > > > > > > **L.window_start = R.window_start AND L.window_end > =R.window_end** > > > > > > > > > > > > Which would make the semantic more clear ~ > > > > > > > > > > > > Best, > > > > > > Danny Chan > > > > > > 在 2020年9月23日 +0800 PM3:24,Viliam Durina <[email protected] > >,写道: > > > > > > > You can also use > > > > > > > > > > > > > > SELECT L.f0, R.f2, L.window_start, L.window_end > > > > > > > FROM > > > > > > > Tumble(table T1, descriptor(T1.ts), INTERVAL ‘5’ MINUTE) L > > > > > > > JOIN > > > > > > > Tumble(table T2, descriptor(T2.ts), INTERVAL ‘5’ MINUTE) R > > > > > > > USING (f0, window_start) > > > > > > > > > > > > > > Viliam > > > > > > > > > > > > > > On Wed, 23 Sep 2020 at 08:02, Rui Wang <[email protected]> > > > wrote: > > > > > > > > > > > > > > > Regarding to **L.window_start = R.window_start AND > > L.window_end = > > > > > > > > R.window_end**: > > > > > > > > > > > > > > > > In general, the current table function windowing model is to > > > append > > > > > window > > > > > > > > metadata to table directly, thus window metadata becomes a > part > > > of > > > > > table > > > > > > > > (or call it data). So as a part of table, these two columns > > > should be > > > > > > > > treated as normal columns thus they should be in the join on > > > > > condition. > > > > > > > > > > > > > > > > If you want to make it optional, it makes window start/end > > > columns > > > > > special > > > > > > > > and has a semantic binding with special table functions > > (TUMBLE, > > > HOP, > > > > > > > > SESSION), which then becomes really not a SQL thing. For > > > example, we > > > > > can > > > > > > > > allow users to define their own windowing table function. In > > that > > > > > case, how > > > > > > > > will you utilize window start/end produced by a customized > > > windowing > > > > > table > > > > > > > > function? What if users produce wired windows that have > > > overlapped > > > > > window > > > > > > > > starts or window ends? > > > > > > > > > > > > > > > > Keeping windows start/end as a part of the table, treating > them > > > no > > > > > > > > different from other columns, could give a consistent > behavior > > > for > > > > > either > > > > > > > > built-in table function or user-defined table function. > > > > > > > > > > > > > > > > If you think it is too verbose, there are two options to > > > optimize: > > > > > > > > > > > > > > > > 1. for TUMBLE/HOP/SESSION, to identify a unique window, you > > will > > > > > only need > > > > > > > > either window start or end, so you can simplify it, for > > example, > > > to > > > > > > > > L.window_start = R.window_start only. > > > > > > > > 2. (not recommended), you can cut off **L.window_start = > > > > > R.window_start AND > > > > > > > > L.window_end = R.window_end**, but add window metadata > > > comparison to > > > > > join > > > > > > > > implicitly by execution engine. E.g. you can make up the join > > > > > condition in > > > > > > > > your JoinRel if two inputs are TUMBLE. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -Rui > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Tue, Sep 22, 2020 at 10:27 PM Danny Chan < > > > [email protected]> > > > > > wrote: > > > > > > > > > > > > > > > > > Yes, the red part is **L.window_start = R.window_start AND > > > > > L.window_end = > > > > > > > > > R.window_end** > > > > > > > > > > > > > > > > > > > Is this a limitation for "triggered by the watermark of > the > > > > > stream”? > > > > > > > > > > > > > > > > > > No, because in most of the cases, there is no need to > output > > > the > > > > > > > > > intermediate/partial join records then send retractions. > > > > > > > > > > > > > > > > > > > > > > > > > > > So, how do you think about the condition syntax > > > **L.window_start = > > > > > > > > > R.window_start AND L.window_end = R.window_end** ? > > > > > > > > > > > > > > > > > > Best, > > > > > > > > > Danny Chan > > > > > > > > > 在 2020年9月23日 +0800 PM12:47,[email protected],写道: > > > > > > > > > > > > > > > > > > > > L.window_start = R.window_start AND L.window_end = > > > R.window_end > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > Viliam Durina > > > > > > > Jet Developer > > > > > > > hazelcast® > > > > > > > > > > > > > > <https://www.hazelcast.com> 2 W 5th Ave, Ste 300 | San Mateo, > CA > > > > > 94402 | > > > > > > > USA > > > > > > > +1 (650) 521-5453 <(650)%20521-5453> <(650)%20521-5453> > <(650)%20521-5453> | > > > hazelcast.com < > > > > > https://www.hazelcast.com> > > > > > > > > > > > > > > -- > > > > > > > This message contains confidential information and is intended > > only > > > > > for the > > > > > > > individuals named. If you are not the named addressee you > should > > > not > > > > > > > disseminate, distribute or copy this e-mail. Please notify the > > > sender > > > > > > > immediately by e-mail if you have received this e-mail by > mistake > > > and > > > > > > > delete this e-mail from your system. E-mail transmission cannot > > be > > > > > > > guaranteed to be secure or error-free as information could be > > > > > intercepted, > > > > > > > corrupted, lost, destroyed, arrive late or incomplete, or > contain > > > > > viruses. > > > > > > > The sender therefore does not accept liability for any errors > or > > > > > omissions > > > > > > > in the contents of this message, which arise as a result of > > e-mail > > > > > > > transmission. If verification is required, please request a > > > hard-copy > > > > > > > version. -Hazelcast > > > > > > > > > > > > > -- > Viliam Durina > Jet Developer > hazelcast® > > <https://www.hazelcast.com> 2 W 5th Ave, Ste 300 | San Mateo, CA 94402 | > USA > +1 (650) 521-5453 <(650)%20521-5453> | hazelcast.com < > https://www.hazelcast.com> > > -- > This message contains confidential information and is intended only for > the > individuals named. If you are not the named addressee you should not > disseminate, distribute or copy this e-mail. Please notify the sender > immediately by e-mail if you have received this e-mail by mistake and > delete this e-mail from your system. E-mail transmission cannot be > guaranteed to be secure or error-free as information could be intercepted, > corrupted, lost, destroyed, arrive late or incomplete, or contain viruses. > The sender therefore does not accept liability for any errors or omissions > in the contents of this message, which arise as a result of e-mail > transmission. If verification is required, please request a hard-copy > version. -Hazelcast >
