Stephan,

I agree that we have a long way to go to get to a standard, but I
think that means that we should start as soon as possible. By which I
mean, rather than going ahead and creating Flink-specific extensions,
let's have discussions about SQL extensions in a broad forum, pulling
in members of other projects. Streaming/CEP is not a new field, so the
use cases are well known.

It's true that Calcite's streaming SQL doesn't go far beyond standard
SQL. I don't want to diverge too far; and besides, one can accomplish
a lot with each feature added to the language. What is needed is a
small number of well-chosen deep features; then we can add liberal
syntactic sugar on top of these to make common use cases more concise.

For the record, HOP and TUMBLE described in [1] are not OLAP sliding
windows; they go in the GROUP BY clause. But unlike GROUP BY, they
allow each row to contribute to more than one sub-total. This is novel
in SQL (only GROUPING SETS allows this, and in a limited form) and
could be the basis for user-defined windows.
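
To make that concrete, here is a sketch in Calcite's proposed
streaming syntax (the Orders stream and its rowtime column are
illustrative, not from any real schema):

```sql
-- Tumbling one-hour window: each row lands in exactly one bucket,
-- just like ordinary GROUP BY.
SELECT STREAM
  TUMBLE_END(rowtime, INTERVAL '1' HOUR) AS rowtime,
  productId,
  COUNT(*) AS c
FROM Orders
GROUP BY TUMBLE(rowtime, INTERVAL '1' HOUR), productId;

-- Hopping window: one hour wide, sliding every 30 minutes, so each
-- row contributes to two sub-totals. This is the novelty: a GROUP BY
-- in which a row can appear in more than one group.
SELECT STREAM
  HOP_END(rowtime, INTERVAL '30' MINUTE, INTERVAL '1' HOUR) AS rowtime,
  productId,
  COUNT(*) AS c
FROM Orders
GROUP BY HOP(rowtime, INTERVAL '30' MINUTE, INTERVAL '1' HOUR), productId;
```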

Also, our sliding windows can be defined by row count as well as by
time. For example, suppose you want to calculate the length of a
sports team's streak of consecutive wins or losses. You can partition
by N, where N is the number of state changes of the win-loss variable,
and so each switch from win-to-lose or lose-to-win starts a new
window.
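
As a sketch, over archived data and using standard window functions,
N can be computed with LAG and a running sum, then used as the
partition key (the Games table and its columns are hypothetical):

```sql
-- Hypothetical Games(team, game_date, result); result is 'W' or 'L'.
-- N increments at each win/loss state change, so each streak becomes
-- its own partition, and COUNT(*) per partition is the streak length.
SELECT team, game_date, result,
       COUNT(*) OVER (PARTITION BY team, n) AS streak_length
FROM (
  SELECT team, game_date, result,
         SUM(CASE WHEN result <> prev_result THEN 1 ELSE 0 END)
           OVER (PARTITION BY team ORDER BY game_date) AS n
  FROM (
    SELECT team, game_date, result,
           LAG(result) OVER (PARTITION BY team ORDER BY game_date)
             AS prev_result
    FROM Games
  ) AS with_prev
) AS with_streak_id;
```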

As you define your extensions to SQL, I strongly suggest that you make
explicit, as columns, any data values that drive behavior. These might
include event times, arrival times, processing times, flush
directives, and any required orderings. If information is not
explicit, we cannot reason about the query as algebra, and the
planner cannot consider more radical plans, such as sorting (within a
window) or changing how the rows are partitioned across parallel
processors. A litmus test is whether a database can apply the same SQL
to the data archived from the stream and achieve the same results.
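
For instance, in Calcite's proposal the STREAM keyword is the only
difference between querying the live stream and its archived history
(Orders is again illustrative):

```sql
-- Over the live stream: emits a row as each hourly window closes.
SELECT STREAM TUMBLE_END(rowtime, INTERVAL '1' HOUR) AS rowtime,
       COUNT(*) AS c
FROM Orders
GROUP BY TUMBLE(rowtime, INTERVAL '1' HOUR);

-- Same query, minus STREAM, over the archived table. The litmus test
-- is that it produces the same rows for fully archived hours.
SELECT TUMBLE_END(rowtime, INTERVAL '1' HOUR) AS rowtime,
       COUNT(*) AS c
FROM Orders
GROUP BY TUMBLE(rowtime, INTERVAL '1' HOUR);
```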

I saw that Fabian blogged recently[2] about stream windows in Flink.
Could we start the process by trying to convert some use cases
(expressed in English, and with sample input and output data) into
SQL? Then we can iterate to make the SQL concise, understandable, and
well-defined.

Julian

[1] 
http://mail-archives.apache.org/mod_mbox/calcite-dev/201506.mbox/%3CCAPSgeETbowxM2TRX0RFxQ_tEAPk2uM=he0arywinbtovgwb...@mail.gmail.com%3E

[2] https://flink.apache.org/news/2015/12/04/Introducing-windows.html

On Thu, Feb 4, 2016 at 1:35 AM, Stephan Ewen <[email protected]> wrote:
> Hi!
>
> True, the Flink community is looking into stream SQL, and is currently
> building on top of Calcite. This is all going well, but we probably need
> some custom syntax around windowing.
>
> For Stream SQL Windowing, what I have seen so far in Calcite (correct me if
> I am wrong there), is pretty much a variant of the OLAP sliding window
> aggregates.
>
>   - Windows are basically calculated there by rounding timestamps
> down or up, thus bucketizing the events. That works for many cases,
> but the syntax is quite tricky.
>
>   - Flink supports various notions of time for windowing (processing time,
> ingestion time, event time), as well as triggers. To be able to extend the
> window specification with such additional parameters is pretty crucial and
> would probably go well with a dedicated window clause.
>
>   - Flink also has unaligned windows (sessions, timeouts, ...) which are
> very hard to map to grouping and window aggregations across ordered groups.
>
>
> Converging to a core standard around stream SQL is very desirable, I
> completely agree.
> For the basic constructs, I think this is quite feasible and Calcite has
> some good suggestions there.
>
> In the advanced constructs, the systems currently differ quite heavily, so
> converging there may be harder. Also, we are just learning what
> semantics people need concerning windowing/event time/etc. It may
> almost be a tad too early to try and define a standard there...
>
>
> Greetings,
> Stephan
>
>
> On Thu, Feb 4, 2016 at 9:35 AM, Julian Hyde <[email protected]> wrote:
>
>> I totally agree with you. (Sorry for the delayed response; this week has
>> been very busy.)
>>
>> There is a tendency of vendors (and projects) to think that their
>> technology is unique, and superior to everyone else’s, and want to showcase
>> it in their dialect of SQL. That is natural, and it’s OK, since it makes
>> them strive to make their technology better.
>>
>> However, they have to remember that the end users don’t want something
>> unique, they want something that solves their problem. They would like
>> something that is standards compliant so that it is easy to learn, easy to
>> hire developers for, and — if the worst comes to the worst — easy to
>> migrate to a compatible competing technology.
>>
>> I know the developers at Storm and Flink (and Samza too) and they
>> understand the importance of collaborating on a standard.
>>
>> I have been trying to play a dual role: supplying the parser and planner
>> for streaming SQL, and also facilitating the creation of a standard
>> language and semantics for streaming SQL. For the latter, see the Streaming
>> page on Calcite’s web site[1]. On that page, I intend to illustrate all of the
>> main patterns of streaming queries, give them names (e.g. “Tumbling
>> windows”), and show how those translate into streaming SQL.
>>
>> Also, it would be useful to create a reference implementation of streaming
>> SQL in Calcite so that you can validate and run queries. The performance,
>> scalability and reliability will not be the same as if you ran Storm, Flink
>> or Samza, but at least you can see what the semantics should be.
>>
>> I believe that most, if not all, of the examples that the projects are
>> coming up with can be translated into SQL. It will be challenging, because
>> we want to preserve the semantics of SQL, allow streaming SQL to
>> interoperate with traditional relations, and also retain the general look
>> and feel of SQL. (For example, I fought quite hard[2] recently for the
>> principle that GROUP BY defines a partition (in the set-theory sense)[3]
>> and therefore could not be used to represent a tumbling window, until I
>> remembered that GROUPING SETS already allows each input row to appear in
>> more than one output sub-total.)
>>
>> What can you, the users, do? Get involved in the discussion about what you
>> want in the language. Encourage the projects to bring their proposed SQL
>> features into this forum for discussion, and add to the list of patterns
>> and examples on the Streaming page. As in any standards process, the users
>> help to keep the vendors focused.
>>
>> I’ll be talking about streaming SQL, planning, and standardization at the
>> Samza meetup in 2 weeks[4], so if any of you are in the Bay Area, please
>> stop by.
>>
>> Julian
>>
>> [1] http://calcite.apache.org/docs/stream.html
>>
>> [2]
>> http://mail-archives.apache.org/mod_mbox/calcite-dev/201506.mbox/%3CCAPSgeETbowxM2TRX0RFxQ_tEAPk2uM=he0arywinbtovgwb...@mail.gmail.com%3E
>>
>> [3] https://en.wikipedia.org/wiki/Partition_of_a_set
>>
>> [4] http://www.meetup.com/Bay-Area-Samza-Meetup/events/228430492/
>>
>> > On Jan 29, 2016, at 10:29 PM, Wanglan (Lan) <[email protected]>
>> wrote:
>> >
>> > Hi to all,
>> >
>> > I am from Huawei and am focusing on data stream processing.
>> > Recently I noticed that both in the Storm community and the Flink community
>> there are endeavors to use Calcite as a SQL parser to enable Storm/Flink to
>> support SQL. They both want to supplement or clarify the streaming SQL of
>> Calcite, especially the definition of windows.
>> > I am concerned that if both communities design Stream SQL
>> syntax separately, two different syntaxes would emerge that
>> represent the same use case.
>> > Therefore, I am wondering if it is possible to unify such work, i.e.
>> design and complement the Calcite streaming SQL to enrich window definitions
>> so that both Storm and Flink can reuse Calcite (streaming SQL) as their
>> SQL parser for streaming cases with little change.
>> > What do you think about this idea?
>> >
>>
>>
