Edmon, If you can make the committees aware of this effort, I would be very grateful. It's very clear to me that people are using streaming SQL in the real world, and there is a need to standardize. Since we're focusing on a TCK rather than specifications, our efforts could complement those of the standards committees.
Julian

> On Feb 9, 2018, at 11:53 PM, Edmon Begoli <[email protected]> wrote:
>
> Julian,
>
> I am certainly interested in participating in the discussion, and in the
> initiative -- time permitting.
> In my environment, streaming data from large environmental sensor networks
> is a common challenge.
>
> Riccardo Tomassini and I just this week discussed the research interests
> and work in stream reasoning.
>
> In terms of standards and influencing them - I am participating in some
> data standards committees, so this participation might be a way for us
> (Calcite and related projects) to have a voice in contributing to or
> influencing the streaming standards.
>
> You are right that it is mostly big vendors, but I think there is room
> for us to have a say.
>
> Thank you for the initiative,
> Edmon
>
> On Sat, Feb 10, 2018 at 2:44 AM, Julian Hyde <[email protected]> wrote:
>
>> As you know, I am a big believer that SQL is a great language not just
>> for data at rest, but also for data in flight. Calcite has extensions to
>> SQL for streaming queries, and a reference implementation, and I have
>> spoken about streaming SQL at several conferences over the years.
>> Several projects, including Apex, Beam, Flink and Storm, have
>> leveraged Calcite to add streaming SQL support.
>>
>> But SQL becomes truly valuable when people can assume that its
>> features exist in every product in the market. It makes their
>> applications portable, and it makes it easier for them to apply their
>> skills to new products. So it is important that streaming SQL becomes
>> standard.
>>
>> The official SQL standard is written by ANSI/ISO and is dominated by
>> large vendors, and I don't even know how to engage with them. But the
>> interesting work on streaming systems is happening in Apache, so it
>> makes sense to start closer to home. After conversations with folks
>> from a few projects - some of those mentioned above, plus Kafka and
>> Spark - a group of us have concluded that the next step is to develop
>> a standard the Apache way: by open discussion, making decisions
>> by consensus, by iteratively developing and reviewing code, and by
>> releasing that code periodically.
>>
>> How can you develop a standard by writing software? The idea is to
>> develop a Test Compatibility Kit (TCK), a suite of tests that embodies
>> the standard. If you are the author of a streaming engine, you can
>> download the TCK and run it against your engine, and the tests tell
>> you whether your engine is compliant.
>>
>> The TCK is developed by committers from the participating engines. If
>> we want to add a new feature to streaming SQL, say stream-to-stream
>> joins, we would add tests to the TCK and achieve consensus about
>> the SQL syntax and the expected behavior - which rows will be emitted,
>> at what times, and in what order, for given inputs to a query.
>>
>> Our plan is to use this list - dev@calcite - for discussions, and to
>> use a GitHub project (under the Apache license but outside the ASF)
>> for code and issues.
>>
>> Kenn Knowles has already created the project:
>> https://github.com/Stream-SQL-TCK/Stream-SQL-TCK
>>
>> Next steps are to design a language for the tests, figure out which
>> features we would like to test in our first release, and start writing
>> the first few tests.
>>
>> Here are the basic features we might test in the first release:
>> * SELECT ... FROM
>> * WHERE
>> * GROUP BY with Hop and Tumble windowing functions
>> * UNION ALL
>> * Query a table (no streams involved)
>> * JOIN a stream to a stream
>> * JOIN a stream to a static table
>>
>> Here are more advanced features we might test in later releases:
>> * GROUP BY with Session windowing function
>> * MATCH_RECOGNIZE
>> * Arbitrary stateful processing
>> * Injected UDFs
>> * Windowed aggregate functions (OVER)
>> * JOIN a stream to a time-varying table
>> * Mechanism to emit early results (EMIT)
>>
>> All of the above are subject to discussion & change.
>>
>> Here is my sketch of a test:
>>
>>   test "filter-equals" {
>>     decls {
>>       CREATE Orders (TIMESTAMP rowtime, INT orderId, VARCHAR product);
>>     }
>>     queries {
>>       Q1: SELECT STREAM * FROM Orders WHERE product = 'soda'
>>     }
>>     input {
>>       Orders ('00:01', 10, 'beer')
>>       Orders ('00:03', 11, 'soda')
>>     }
>>     output {
>>       Q1 ('00:03', 11, 'soda')
>>     }
>>   }
>>
>> Again, this is subject to change. In particular, don't worry too much
>> about the syntax; that will certainly change. But it shows what pieces
>> of information are necessary to define a test without making any
>> reference to the engine that will execute that test.
>>
>> If you're interested in participating in this project, you are most
>> welcome. Please raise your hand by joining the discussion on this
>> list. Also, start logging cases in the GitHub project, and start
>> writing pull requests.
>>
>> Julian
>>
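The "filter-equals" sketch above can be read as pure data: declared inputs, a query, and the expected output, with no reference to any particular engine. A minimal illustration of that idea in Python follows; the names (`ORDERS`, `run_q1`, `EXPECTED`) and the trivial in-memory "engine" are assumptions for illustration only, not part of the proposed TCK:

```python
# Sketch of the "filter-equals" test as engine-independent data.
# A real TCK would hand the SQL text to the engine under test; here a
# hypothetical in-memory stand-in applies the WHERE clause directly.

# Input rows for the Orders stream: (rowtime, orderId, product)
ORDERS = [
    ("00:01", 10, "beer"),
    ("00:03", 11, "soda"),
]

def run_q1(rows):
    """Stand-in for: SELECT STREAM * FROM Orders WHERE product = 'soda'."""
    return [r for r in rows if r[2] == "soda"]

# Expected output declared by the test, independent of any engine.
EXPECTED = [("00:03", 11, "soda")]

assert run_q1(ORDERS) == EXPECTED
print("filter-equals: PASS")
```

The point of the sketch is the separation: the test declares inputs and expected rows, and only the `run_q1` stand-in would be replaced by a call into the engine under test.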
