Hi everyone,
we are very happy to support TUMBLE/HOP/SESSION in our upcoming Flink
1.3 release. However, there are some problems regarding nested window
queries that we would like to discuss with the Calcite community.
Take the following query:
SELECT
rowtime, SUM(x)
FROM (
SELECT
TUMBLE_END(rowtime, INTERVAL '2' MINUTE) AS rowtime,
MIN(x) AS x
FROM MyTable
GROUP BY TUMBLE(rowtime, INTERVAL '2' MINUTE)
)
GROUP BY TUMBLE(rowtime, INTERVAL '1' HOUR)
Initially, we thought that we can use the xxx_END() group auxiliary
functions to define the rowtime for the upper query. However, according
to http://calcite.apache.org/docs/stream.html, TUMBLE_END should return
the timestamp of the exclusive window end, i.e., for a window of 1 hour
that contains all elements from 12:00:00.000 until 12:59:59.999
(inclusive), TUMBLE_END would return 13:00:00.000. The problem is that
Flink uses the inclusive window end as new timestamp. The reason for
that is that if you do preaggregation with a window, say 5 minute
windows which later will be aggregated into 1 hour windows, the last 5
minute window (from 12:55:00.000 until 12:59:59.999 incl) would have a
timestamp of 13:00:00.000 and fall into the next window starting at
13:00:00.000.
The question is how Calcite is planning to support nested windows. Right
now we see the following options:
- TUMBLE_END returns the inclusive window end
- we introduce an additional group auxiliary function for the inclusive
window end like: SELECT TUMBLE_TIME(rowtime, INTERVAL '2' MINUTE) AS
rowtime ...
- we allow references to the window in the select: SELECT
TUMBLE(rowtime, INTERVAL '1' HOUR) AS rowtime ...
What do you think?
Regards,
Timo