I’m not personally a fan of wikis. (If your goal is a web site that can be edited by users, then markdown + PRs works pretty well in the Apache world. And you’re less likely to end up with an unstructured mess.)
But I do strongly believe in sketching out a specification (and, if not obvious, a design) before starting work. It allows for feedback before the person doing the work is too invested. The Stream-SQL-TCK project is not under Apache, so it would use GitHub issues rather than Apache JIRA cases, but it would work the same way.

> On Feb 17, 2018, at 10:18 AM, Edmon Begoli <[email protected]> wrote:
>
> I made a comment, but it might be better to state it here.
>
> Since you are starting from scratch, could you maybe add to the wiki your
> requirements/design thoughts, so that we could understand the intent and
> perhaps help?
>
> I am new to Apache contributions, so I might not know that PRs are the
> preferred way, but I personally get a better understanding from a little
> bit of a big-picture write-up, even if it is a bunch of bullet points and
> references.
>
> On Friday, February 16, 2018, Julian Hyde <[email protected]> wrote:
>
>> I have kicked off development with the first PR to the Stream-SQL-TCK
>> repository, and I have logged an issue for what I intend to work on next.
>>
>> Please review the PR [1], comment on the issues [2], and "watch" the
>> GitHub repo so that you are notified of new issues and PRs.
>>
>> If you disagree with the approach, feel free to say so. The PR is
>> something of a straw man. I'd rather have a negative reaction than no
>> reaction.
>>
>> Julian
>>
>> [1] https://github.com/Stream-SQL-TCK/Stream-SQL-TCK/pulls
>>
>> [2] https://github.com/Stream-SQL-TCK/Stream-SQL-TCK/issues
>>
>>> On Feb 9, 2018, at 11:44 PM, Julian Hyde <[email protected]> wrote:
>>>
>>> As you know, I am a big believer that SQL is a great language not just
>>> for data at rest, but also for data in flight.
>>> Calcite has extensions to SQL for streaming queries, and a reference
>>> implementation, and I have spoken about streaming SQL at several
>>> conferences over the years. Several projects, including Apex, Beam,
>>> Flink and Storm, have leveraged Calcite to add streaming SQL support.
>>>
>>> But SQL becomes truly valuable when people can assume that its
>>> features exist in every product in the market. It makes their
>>> applications portable, and it makes it easier for them to apply their
>>> skills to new products. So it is important that streaming SQL becomes
>>> a standard.
>>>
>>> The official SQL standard is written by ANSI/ISO and is dominated by
>>> large vendors, and I don't even know how to engage with them. But the
>>> interesting work on streaming systems is happening in Apache, so it
>>> makes sense to start closer to home. After conversations with folks
>>> from a few projects - some of those mentioned above, plus Kafka and
>>> Spark - a group of us have concluded that the next step is to develop
>>> a standard using the Apache way: by open discussion, by making
>>> decisions by consensus, by iteratively developing and reviewing code,
>>> and by releasing that code periodically.
>>>
>>> How can you develop a standard by writing software? The idea is to
>>> develop a Test Compatibility Kit (TCK), a suite of tests that embodies
>>> the standard. If you are the author of a streaming engine, you can
>>> download the TCK and run it against your engine, and the tests tell
>>> you whether your engine is compliant.
>>>
>>> The TCK is developed by committers from the participating engines. If
>>> we want to add a new feature to streaming SQL, say stream-to-stream
>>> joins, then we add tests to the TCK and achieve consensus about the
>>> SQL syntax and the expected behavior - which rows will be emitted, at
>>> what times, and in what order, for given inputs to a query.
>>>
>>> Our plan is to use this list - dev@calcite - for discussions, and to
>>> use a GitHub project (under the Apache license but outside the ASF)
>>> for code and issues.
>>>
>>> Kenn Knowles has already created the project:
>>> https://github.com/Stream-SQL-TCK/Stream-SQL-TCK
>>>
>>> Next steps are to design a language for the tests, figure out which
>>> features we would like to test in our first release, and start
>>> writing the first few tests.
>>>
>>> Here are the basic features we might test in the first release:
>>> * SELECT ... FROM
>>> * WHERE
>>> * GROUP BY with Hop and Tumble windowing functions
>>> * UNION ALL
>>> * Query a table (no streams involved)
>>> * JOIN a stream to a stream
>>> * JOIN a stream to a static table
>>>
>>> Here are more advanced features we might test in later releases:
>>> * GROUP BY with Session windowing function
>>> * MATCH_RECOGNIZE
>>> * Arbitrary stateful processing
>>> * Injected UDFs
>>> * Windowed aggregate functions (OVER)
>>> * JOIN a stream to a time-varying table
>>> * A mechanism to emit early results (EMIT)
>>>
>>> All of the above are subject to discussion and change.
>>>
>>> Here is my sketch of a test:
>>>
>>> test "filter-equals" {
>>>   decls {
>>>     CREATE Orders (TIMESTAMP rowtime, INT orderId, VARCHAR product);
>>>   }
>>>   queries {
>>>     Q1: SELECT STREAM * FROM Orders WHERE product = 'soda'
>>>   }
>>>   input {
>>>     Orders ('00:01', 10, 'beer')
>>>     Orders ('00:03', 11, 'soda')
>>>   }
>>>   output {
>>>     Q1 ('00:03', 11, 'soda')
>>>   }
>>> }
>>>
>>> Again, this is subject to change. In particular, don't worry too much
>>> about the syntax; that will certainly change. But it shows what pieces
>>> of information are necessary to define a test without making any
>>> reference to the engine that will execute that test.
>>>
>>> If you're interested in participating in this project, you are most
>>> welcome. Please raise your hand by joining the discussion on this
>>> list.
>>> Also, start logging cases in the GitHub project, and start writing
>>> pull requests.
>>>
>>> Julian
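
To make the "filter-equals" sketch in the quoted email a little more concrete, here is a minimal Python sketch of how a TCK harness might represent that test and check an engine's output against it. This is purely illustrative and hedged: the `StreamTest` class, the `check` function, and the `toy_engine` stand-in are assumptions of mine, not the actual Stream-SQL-TCK API or its test language.

```python
# Hypothetical sketch: a harness-side representation of one TCK test.
# Only the data comes from the "filter-equals" sketch; everything else
# (class names, function names) is an illustrative assumption.
from dataclasses import dataclass
from typing import Callable, List, Tuple

# A row of the Orders stream: (rowtime, orderId, product).
Row = Tuple[str, int, str]

@dataclass
class StreamTest:
    name: str
    query: str             # SQL text; opaque to the harness
    input_rows: List[Row]  # rows fed to the Orders stream, in order
    expected: List[Row]    # rows the engine must emit, in order

# The test from the sketch, transcribed literally.
FILTER_EQUALS = StreamTest(
    name="filter-equals",
    query="SELECT STREAM * FROM Orders WHERE product = 'soda'",
    input_rows=[("00:01", 10, "beer"), ("00:03", 11, "soda")],
    expected=[("00:03", 11, "soda")],
)

def check(test: StreamTest,
          engine: Callable[[str, List[Row]], List[Row]]) -> bool:
    """Run an engine on the test's query and input, then compare the
    emitted rows (values, times, and order) against the expected output."""
    return engine(test.query, test.input_rows) == test.expected

# A toy "engine" that handles only this one query, just to show the
# contract: the harness never sees engine internals, only emitted rows.
def toy_engine(query: str, rows: List[Row]) -> List[Row]:
    return [r for r in rows if r[2] == "soda"]

print(check(FILTER_EQUALS, toy_engine))  # True
```

The point of the sketch is the shape of the contract, not the code: a test names its declarations, queries, timestamped inputs, and expected timestamped outputs, and any engine that maps the first three to the fourth passes, with no reference to how the engine executes the query.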
