Edmon, If you can make the committees aware of this effort, I would be very grateful. It's very clear to me that people are using streaming SQL in the real world, and there is a need to standardize. Since we're focusing on a TCK rather than specifications, our efforts could complement those of the standards committees.
Julian

> On Feb 9, 2018, at 11:53 PM, Edmon Begoli <[email protected]> wrote:
>
> Julian,
>
> I am certainly interested in participating in the discussion, and in the
> initiative -- time permitting.
> In my environment, streaming data from large environmental sensor networks
> is a common challenge.
>
> Riccardo Tomassini and I just this week discussed the research interests
> and work in stream reasoning.
>
> In terms of standards and influencing them - I am participating in some
> data standards committees, so this participation might be a way for us
> (Calcite and related projects) to have a voice in contributing to or
> influencing the streaming standards.
>
> You are right that it is mostly big vendors, but I think there is room
> for us to have a say.
>
> Thank you for the initiative,
> Edmon
>
> On Sat, Feb 10, 2018 at 2:44 AM, Julian Hyde <[email protected]> wrote:
>
>> As you know, I am a big believer that SQL is a great language not just
>> for data at rest, but also for data in flight. Calcite has extensions to
>> SQL for streaming queries, and a reference implementation, and I have
>> spoken about streaming SQL at several conferences over the years.
>> Several projects, including Apex, Beam, Flink and Storm, have
>> leveraged Calcite to add streaming SQL support.
>>
>> But SQL becomes truly valuable when people can assume that its
>> features exist in every product in the market. It makes their
>> applications portable, and it makes it easier for them to apply their
>> skills to new products. So it is important that streaming SQL becomes
>> standard.
>>
>> The official SQL standard is written by ANSI/ISO and is dominated by
>> large vendors, and I don't even know how to engage with them. But the
>> interesting work on streaming systems is happening in Apache, so it
>> makes sense to start closer to home. After conversations with folks
>> from a few projects - some of those mentioned above, plus Kafka and
>> Spark - a group of us have concluded that the next step is to develop
>> a standard the Apache way: by open discussion, making decisions
>> by consensus, by iteratively developing and reviewing code, and by
>> releasing that code periodically.
>>
>> How can you develop a standard by writing software? The idea is to
>> develop a Test Compatibility Kit (TCK), a suite of tests that embodies
>> the standard. If you are the author of a streaming engine, you can
>> download the TCK and run it against your engine, and the tests tell
>> you whether your engine is compliant.
>>
>> The TCK is developed by committers from the participating engines. If
>> we want to add a new feature to streaming SQL, say stream-to-stream
>> joins, we would add tests to the TCK and achieve consensus about
>> the SQL syntax and the expected behavior - which rows will be emitted,
>> at what times, and in what order, for given inputs to a query.
>>
>> Our plan is to use this list - dev@calcite - for discussions, and to
>> use a GitHub project (under the Apache license but outside the ASF)
>> for code and issues.
>>
>> Kenn Knowles has already created the project:
>> https://github.com/Stream-SQL-TCK/Stream-SQL-TCK
>>
>> Next steps are to design a language for the tests, figure out which
>> features we would like to test in our first release, and start writing
>> the first few tests.
>>
>> Here are the basic features we might test in the first release:
>> * SELECT ... FROM
>> * WHERE
>> * GROUP BY with Hop and Tumble windowing functions
>> * UNION ALL
>> * Query a table (no streams involved)
>> * JOIN a stream to a stream
>> * JOIN a stream to a static table
>>
>> Here are more advanced features we might test in later releases:
>> * GROUP BY with Session windowing function
>> * MATCH_RECOGNIZE
>> * Arbitrary stateful processing
>> * Injected UDFs
>> * Windowed aggregate functions (OVER)
>> * JOIN a stream to a time-varying table
>> * Mechanism to emit early results (EMIT)
>>
>> All of the above are subject to discussion & change.
>>
>> Here is my sketch of a test:
>>
>>   test "filter-equals" {
>>     decls {
>>       CREATE Orders (TIMESTAMP rowtime, INT orderId, VARCHAR product);
>>     }
>>     queries {
>>       Q1: SELECT STREAM * FROM Orders WHERE product = 'soda'
>>     }
>>     input {
>>       Orders ('00:01', 10, 'beer')
>>       Orders ('00:03', 11, 'soda')
>>     }
>>     output {
>>       Q1 ('00:03', 11, 'soda')
>>     }
>>   }
>>
>> Again, this is subject to change. In particular, don't worry too much
>> about the syntax; that will certainly change. But it shows what pieces
>> of information are necessary to define a test without making any
>> reference to the engine that will execute that test.
>>
>> If you're interested in participating in this project, you are most
>> welcome. Please raise your hand by joining the discussion on this
>> list. Also, start logging cases in the GitHub project, and start
>> writing pull requests.
>>
>> Julian
>>
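The "filter-equals" sketch above can be read as pure data: declared inputs, a query, and the expected output, with no reference to any particular engine. A minimal illustration of that idea in Python follows; the names (`ORDERS`, `run_q1`, `EXPECTED`) and the trivial in-memory "engine" are assumptions for illustration only, not part of the proposed TCK:

```python
# Sketch of the "filter-equals" test as engine-independent data.
# A real TCK would hand the SQL text to the engine under test; here a
# hypothetical in-memory stand-in applies the WHERE clause directly.

# Input rows for the Orders stream: (rowtime, orderId, product)
ORDERS = [
    ("00:01", 10, "beer"),
    ("00:03", 11, "soda"),
]

def run_q1(rows):
    """Stand-in for: SELECT STREAM * FROM Orders WHERE product = 'soda'."""
    return [r for r in rows if r[2] == "soda"]

# Expected output declared by the test, independent of any engine.
EXPECTED = [("00:03", 11, "soda")]

assert run_q1(ORDERS) == EXPECTED
print("filter-equals: PASS")
```

The point of the sketch is the separation: the test declares inputs and expected rows, and only the `run_q1` stand-in would be replaced by a call into the engine under test.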
