Re: [DISCUSS] Some questions about the newly introduced window table functions (TUMBLE, HOP and SESSION)

Viliam Durina Mon, 08 Jun 2020 00:05:26 -0700

> 2. rename the hop_size to emit_frequency. Because that's what this
> argument tries to say: e.g. emit in every x time_unit.


It's not about how often you emit. By "emit" I mean the moment when the
rows are sent out. For example, when working with event time, if the input
stalls you might emit much later. Or you can run the query over historical
data, in which case the actual interval between emissions will be much
shorter. The argument is the time interval by which the window "hops": the
starts of two consecutive windows are this far apart.

Regarding the DESCRIPTOR, the PTF is supposed to be implemented by up to 4
member functions: describe, start, fulfill, finish. The `describe` function
is supposed to be called during validation. It can validate the arguments.
In the case of TUMBLE, it can check whether the column named by the
descriptor is present in the input table and whether it is of the correct
type. It can throw errors, which will become validation errors. It also
determines the actual output table type.
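
A rough sketch of what such a `describe` step might do for TUMBLE, in
Python rather than Calcite code. Everything here is hypothetical and for
illustration only: the names describe_tumble, Column and ValidationError,
the use of plain strings for types, and the choice to append window_start
and window_end columns (taken from the plan quoted further down in this
thread).

```python
from dataclasses import dataclass


class ValidationError(Exception):
    """Hypothetical error type; stands in for a validation error."""


@dataclass(frozen=True)
class Column:
    name: str
    type: str  # simplified: a SQL type name such as "TIMESTAMP"


def describe_tumble(input_columns, descriptor_column):
    """Validate the DESCRIPTOR argument and return the output row type."""
    matches = [c for c in input_columns if c.name == descriptor_column]
    # Check 1: the described column must exist in the input table.
    if not matches:
        raise ValidationError(
            f"column {descriptor_column!r} not found in input table")
    # Check 2: the described column must have a suitable type
    # (assumed here to be TIMESTAMP only, to keep the sketch short).
    if matches[0].type != "TIMESTAMP":
        raise ValidationError(
            f"column {descriptor_column!r} must be TIMESTAMP, "
            f"got {matches[0].type}")
    # The output table type: all input columns plus the window bounds.
    return list(input_columns) + [
        Column("window_start", "TIMESTAMP"),
        Column("window_end", "TIMESTAMP"),
    ]


shipments = [Column("ORDERID", "INTEGER"), Column("ROWTIME", "TIMESTAMP")]
output_type = describe_tumble(shipments, "ROWTIME")
print([c.name for c in output_type])
```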

I think it's worthwhile to read the PTF spec before implementing this; it's
freely available here:
https://standards.iso.org/ittf/PubliclyAvailableStandards/c069776_ISO_IEC_TR_19075-7_2017.zip

Viliam

On Sun, 7 Jun 2020 at 06:23, Rui Wang <amaliu...@apache.org> wrote:

> Thanks Danny for your summary!
>
> For question one, I prefer the ordering of parameters in option two.
> However, for argument names, I think it might be better to
> 1. name the window size explicitly by "window_size".
> 2. rename the hop_size to emit_frequency. Because that's what this
> argument tries to say: e.g. emit in every x time_unit.
> 3. don't use "table" as the first argument name, because "table" is a
> keyword. So we can replace it by "data"
>
> so it might be
> TUMBLE(data, DESCRIPTOR(time_column), window_size[, offset])
> HOP(data, DESCRIPTOR(time_column), window_size, emit_frequency[, offset])
>
> Meanwhile, one thing worth mentioning is that, as Viliam already discussed
> in CALCITE-4000[1], we can also finalize the data types for each signature:
> TUMBLE(
> data TABLE,
> time_column DESCRIPTOR,
> window_size INTERVAL DAY TO SECOND,
> offset INTERVAL DAY TO SECOND DEFAULT INTERVAL '0' SECONDS)
>
> HOP(
> data TABLE,
> time_column DESCRIPTOR,
> window_size INTERVAL DAY TO SECOND,
> emit_frequency INTERVAL DAY TO SECOND,
> offset INTERVAL DAY TO SECOND DEFAULT INTERVAL '0' SECONDS)
>
> So those intervals have to be INTERVAL DAY TO SECOND, not INTERVAL YEAR TO
> MONTH, because the length of a year-to-month interval in seconds varies.
>
>
> Regarding question two, I dug into the codebase and I think a
> RexCorrelVariable could replace that table_name, so it becomes
> table_function(RexCorrelVariable, DESCRIPTOR($1), interval:INTERVAL MINUTE)
>
> I know RexCorrelVariable might not be designed for this purpose, but I
> find RexCorrelVariable really useful because:
> 1. It can provide an id, to indicate which input of TableFunctionScan
> this table_function is applied on. Note that TableFunctionScan might not
> have exactly one input.
> 2. It can provide RelDataType, to save the referenced input's schema. The
> input schema is useful during operator argument type validation (for
> DESCRIPTOR) and operator return type inference.
>
>
> [1]: https://jira.apache.org/jira/browse/CALCITE-4000?focusedCommentId=17107665&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17107665
>
>
> -Rui
>
> On Thu, Jun 4, 2020 at 6:02 AM Danny Chan <yuzhao....@gmail.com> wrote:
>
> > Hi all, in the last 2 releases, Rui Wang has contributed the window table
> > functions to replace the legacy group window syntax; the idea comes from
> > the SQL-2016 polymorphic table functions. But the current implementation
> > has also received some feedback (or caused some confusion), and this
> > thread tries to have a discussion on these questions that are undecided.
> >
> > One is about the window function argument names:
> >
> > === option1
> > TUMBLE(table, DESCRIPTOR(datetime), interval, offset)
> > HOP(table, DESCRIPTOR(time_column), slide, size)
> >
> > === option2
> > TUMBLE(table, DESCRIPTOR(time_column), size, offset)
> > HOP(table, DESCRIPTOR(time_column), size, hop_size, offset)
> >
> > I would prefer option2 for the reasons pointed out by Viliam in
> > CALCITE-3737.
> >
> >
> > Another is about how to translate the query
> >
> > For the query: select * from TUMBLE(table, DESCRIPTOR(time_column), interval);
> >
> > Currently our output plan is:
> >
> > LogicalProject(ORDERID=[$0], ROWTIME=[$1], window_start=[$2], window_end=[$3])
> >   LogicalTableFunctionScan(invocation=[TUMBLE(DESCRIPTOR($1), interval:INTERVAL MINUTE)], rowType=[RecordType(INTEGER ORDERID, TIMESTAMP(0) time_column, TIMESTAMP(0) window_start, TIMESTAMP(0) window_end)])
> >     LogicalProject(ORDERID=[$0], ROWTIME=[$1])
> >       LogicalTableScan(table=[[CATALOG, SALES, SHIPMENTS]])
> >
> > But this does not seem right, because the "table table_name" argument is
> > not a relational expression. In CALCITE-3955, Julian pointed out that we
> > should translate the "table" argument as a CURSOR (or TABLE, according to
> > Viliam). Another question is how to translate the referenced name
> > "time_column": as a correlate variable? That is also confusing, because
> > there is actually no correlation in the plan.
> >
> > Any suggestions are appreciated !
> >
> >
> > [1] https://issues.apache.org/jira/browse/CALCITE-3955
> > [2] https://issues.apache.org/jira/browse/CALCITE-3737
> >
> > Best,
> > Danny Chan
> >
>


-- 
Viliam Durina
Jet Developer
      hazelcast®


