"Standardizing" streaming SQL

Julian Hyde Fri, 09 Feb 2018 23:45:13 -0800

As you know, I am a big believer that SQL is a great language not just
for data at rest, but also data in flight. Calcite has extensions to
SQL for streaming queries, and a reference implementation, and I have
spoken about streaming SQL at several conferences over the years.
Several projects, including Apex, Beam, Flink and Storm, have
leveraged Calcite to add streaming SQL support.


But SQL becomes truly valuable when people can assume that its
features exist in every product in the market. It makes their
applications portable, and it makes it easier for them to apply their
skills to new products. So, it is important that streaming SQL becomes
standard.

The official SQL standard is written by ANSI/ISO and is dominated by
large vendors, and I don't even know how to engage with them. But the
interesting work on streaming systems is happening in Apache, so it
makes sense to start closer to home. After conversations with folks
from a few projects - some of those mentioned above, plus Kafka and
Spark - a group of us have concluded that the next step is to develop
a standard using the Apache way - by open discussion, making decisions
by consensus, by iteratively developing and reviewing code, and by
releasing that code periodically.

How can you develop a standard by writing software? The idea is to
develop a Test Compatibility Kit (TCK), a suite of tests that embodies
the standard. If you are the author of a streaming engine, you can
download the TCK and run it against your engine, and the tests tell
you whether your engine is compliant.

The TCK is developed by committers from the participating engines. If
we want to add a new feature to streaming SQL, say stream-to-stream
joins, then we would add tests to the TCK, and achieve consensus about
the SQL syntax and the expected behavior - which rows will be emitted,
at what times, and in what order, for given inputs to a query.
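
To make "which rows, at what times, and in what order" concrete, here
is a rough Python sketch of the check a TCK might run on one query's
output. It is not part of any agreed design. The Emission shape and
the first_mismatch helper are names I made up for illustration, and I
am assuming each emitted row is tagged with the time it was emitted.

```python
from typing import List, Optional, Tuple

# Hypothetical shape of one emitted row: (emit_time, column_values).
Emission = Tuple[str, tuple]


def first_mismatch(expected: List[Emission],
                   actual: List[Emission]) -> Optional[str]:
    """Describe the first difference, or return None if the lists agree.

    The comparison is positional, so a row with the wrong values, the
    wrong time, or in the wrong position all count as a mismatch.
    """
    for i, (exp, act) in enumerate(zip(expected, actual)):
        if exp != act:
            return f"row {i}: expected {exp}, got {act}"
    if len(expected) != len(actual):
        return f"expected {len(expected)} rows, got {len(actual)}"
    return None


expected = [("00:01", (10, "beer")), ("00:03", (11, "soda"))]
swapped = [expected[1], expected[0]]

print(first_mismatch(expected, expected))  # None
print(first_mismatch(expected, swapped))   # reports a mismatch at row 0
```

A real TCK would probably need a looser comparison for queries where
the order of rows within the same timestamp is not defined; that is
one of the things we would have to reach consensus on.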

Our plan is to use this list - dev@calcite - for discussions, and use
a GitHub project (under the Apache license but outside the ASF) for code
and issues.

Kenn Knowles has already created the project:
https://github.com/Stream-SQL-TCK/Stream-SQL-TCK

Next steps are to design a language for the tests, figure out which
features we would like to test in our first release, and start writing
the first few tests.

Here are the basic features we might test in the first release:
* SELECT ... FROM
* WHERE
* GROUP BY with Hop and Tumble windowing functions
* UNION ALL
* Query a table (no streams involved)
* JOIN a stream to a stream
* JOIN a stream to a static table

Here are more advanced features we might test in later releases:
* GROUP BY with Session windowing function
* MATCH_RECOGNIZE
* Arbitrary stateful processing
* Injected UDFs
* Windowed aggregate functions (OVER)
* JOIN a stream to a time-varying table
* Mechanism to emit early results (EMIT)

All of the above are subject to discussion & change.

Here is my sketch of a test:

test "filter-equals" {
  decls {
    CREATE Orders (TIMESTAMP rowtime, INT orderId, VARCHAR product);
  }
  queries {
    Q1: SELECT STREAM * FROM Orders WHERE product = 'soda'
  }
  input {
    Orders ('00:01', 10, 'beer')
    Orders ('00:03', 11, 'soda')
  }
  output {
    Q1 ('00:03', 11, 'soda')
  }
}

Again, subject to change. Especially, don't worry too much about the
syntax; that will certainly change. But it shows what pieces of
information are necessary to define a test without making any
reference to the engine that will execute that test.
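
To illustrate the engine-independence point, here is a minimal Python
sketch of how a harness might hold that test as plain data and hand it
to an engine through a small adapter. Everything here is an
assumption for the sake of the example: StreamTest, EngineAdapter,
run_test and toy_engine are hypothetical names, and toy_engine is a
stand-in that recognizes only this one query rather than a real SQL
engine.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Row = Tuple[str, int, str]  # (rowtime, orderId, product)


@dataclass
class StreamTest:
    """The four pieces of information from the sketched test."""
    name: str
    decls: List[str]
    queries: Dict[str, str]        # label -> SQL text
    input: Dict[str, List[Row]]    # stream name -> rows, in arrival order
    output: Dict[str, List[Row]]   # label -> expected rows, in order


# Hypothetical adapter signature: (decls, sql, input) -> emitted rows.
# Each engine would supply its own implementation.
EngineAdapter = Callable[[List[str], str, Dict[str, List[Row]]], List[Row]]


def run_test(test: StreamTest, engine: EngineAdapter) -> Dict[str, bool]:
    """Run every query in the test; report pass/fail per query label."""
    return {
        label: engine(test.decls, sql, test.input) == test.output[label]
        for label, sql in test.queries.items()
    }


FILTER_EQUALS = StreamTest(
    name="filter-equals",
    decls=["CREATE Orders (TIMESTAMP rowtime, INT orderId, VARCHAR product)"],
    queries={"Q1": "SELECT STREAM * FROM Orders WHERE product = 'soda'"},
    input={"Orders": [("00:01", 10, "beer"), ("00:03", 11, "soda")]},
    output={"Q1": [("00:03", 11, "soda")]},
)


def toy_engine(decls: List[str], sql: str,
               inputs: Dict[str, List[Row]]) -> List[Row]:
    """Stand-in engine: hard-codes the one query instead of parsing SQL."""
    if sql != FILTER_EQUALS.queries["Q1"]:
        raise NotImplementedError(sql)
    return [row for row in inputs["Orders"] if row[2] == "soda"]


print(run_test(FILTER_EQUALS, toy_engine))  # {'Q1': True}
```

The test definition never mentions the engine; only the adapter does.
That separation is the property I care about, whatever syntax we end
up with.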

If you're interested in participating in this project, you are most
welcome. Please raise your hand by joining the discussion on this
list. Also, start logging cases in the GitHub project, and start
writing pull requests.

Julian