I’ve updated the Streaming reference guide as Fabian requested: http://calcite.apache.org/docs/stream.html <http://calcite.apache.org/docs/stream.html>
Julian > On Feb 19, 2016, at 3:11 PM, Julian Hyde <[email protected]> wrote: > > I gave a talk about streaming SQL at a Samza meetup. A lot of it is about the > semantics of streaming SQL, and I cover some ground that I don’t cover in the > streams page[1]. > > The news item[2] gets you to both slides and video. > > In other news, I notice[3] that Spark 2.1 will contain “continuous SQL”. If > the examples[4] are accurate, all queries are heavily based on sliding > windows, and they use a syntax for those windows that is very different to > standard SQL. I think we can deal with their use cases, and in my opinion > our proposed syntax is more elegant and closer to the standard. But we should > discuss. I don’t want to diverge from other efforts because of > hubris/ignorance. > > At the Samza meetup some folks mentioned the use case of a stream that > summarizes, emitting periodic totals even if there were no data in a given > period. Can they re-state that use case here, so we can discuss? > > Julian > > [1] http://calcite.apache.org/docs/stream.html > > [2] http://calcite.apache.org/news/2016/02/17/streaming-sql-talk/ > > [3] > http://www.slideshare.net/databricks/the-future-of-realtime-in-spark-58433411 > slide 29 > > [4] > https://issues.apache.org/jira/secure/attachment/12775265/StreamingDataFrameProposal.pdf > > >> On Feb 17, 2016, at 12:09 PM, Milinda Pathirage <[email protected]> >> wrote: >> >> Hi Fabian, >> >> We did some work on stream joins [1]. I tested stream-to-relation joins >> with Samza. But not stream-to-stream joins. But never updated the streaming >> documentation. I'll send a pull request with some documentation on joins. >> >> Thanks >> Milinda >> >> [1] https://issues.apache.org/jira/browse/CALCITE-968 >> >> On Wed, Feb 17, 2016 at 5:09 AM, Fabian Hueske <[email protected]> wrote: >> >>> Hi, >>> >>> I agree, the Streaming page is a very good starting point for this >>> discussion. As suggested by Julian, I created CALCITE-1090 to update the >>> page such that it reflects the current state of the discussion (adding HOP >>> and TUMBLE functions, punctuations). I can also help with that, e.g., by >>> contributing figures, examples, or text, reviewing, or any other way. >>> >>> From my point of view, the semantics of the window types and the other >>> operators in the Streaming document are very good. What is missing are >>> joins (windowed stream-stream joins, stream-table joins, stream-table joins >>> with table updates) as already noted in the todo section. >>> >>> Regarding the handling of late-arriving events, I am not sure if this is a >>> purely QoS issue as the result of a query might depend on the chosen >>> strategy. Also users might want to pick different strategies for different >>> operations, so late-arriver strategies are not necessarily end-to-end but >>> can be operator specific. However, I think these details should be >>> discussed in a separate thread. >>> >>> I'd like to add a few words about the StreamSQL roadmap of the Flink >>> community. >>> We are currently preparing our codebase and will start to work on support >>> for structured queries on streams in the next weeks. Flink will support two >>> query interface, a SQL interface and a LINQ-style Table API [1]. Both will >>> be optimized and translated by Calcite. As a first step, we want to add >>> simple stream transformations such as selection and projection to both >>> interfaces. Next, we will add windowing support (starting with tumbling and >>> hopping windows) to the Table API (as is said before, our plans here are >>> well aligned with Julian's suggestions). Once this is done, we would extend >>> the SQL interface to support windows which is hopefully as simple as using >>> a parser that accepts window syntax. >>> >>> So from our point of view, fixing the semantics of windows and extending >>> the optimizer accordingly is more urgent than agreeing on a syntax >>> (although the Table API syntax could be inspired by Calcite's StreamSQL >>> syntax [2]). I can also help implementing the missing features in Calcite. >>> >>> Having a reference implementation with tests would be awesome and >>> definitely help. >>> >>> Best, Fabian >>> >>> [1] >>> >>> https://ci.apache.org/projects/flink/flink-docs-master/apis/batch/libs/table.html >>> [2] >>> >>> https://docs.google.com/document/d/19kSOAOINKCSWLBCKRq2WvNtmuaA9o3AyCh2ePqr3V5E >>> >>> 2016-02-14 21:23 GMT+01:00 Julian Hyde <[email protected]>: >>> >>>> Fabian, >>>> >>>> Apologies for the late reply. >>>> >>>> I would rather that the specification for streaming SQL was not too >>>> prescriptive for how late events were handled. Approaches 1, 2 and 3 are >>>> all viable, and engines can differentiate themselves by the strength of >>>> their support for this. But for the SQL to be considered valid, I think >>> the >>>> validator just needs to know that it can make progress. >>>> >>>> There is a large area of functionality I’d call “quality of service” >>>> (QoS). This includes latency, reliability, at-least-once, at-most-once or >>>> in-order guarantees, as well as the late-row-handling this thread is >>>> concerned with. What the QoS metrics have in common is that they are >>>> end-to-end. To deliver a high QoS to the consumer, the producer needs to >>>> conform to a high QoS. The QoS is beyond the control of the SQL >>> statement. >>>> (Although you can ask what a SQL statement is able to deliver, given the >>>> upstream QoS guarantees.) QoS is best managed by the whole system, and in >>>> my opinion this is the biggest reason to have a DSMS. >>>> >>>> For this reason, I would be inclined to put QoS constraints on the stream >>>> definition, not on the query. For example, taking latency as the QoS >>> metric >>>> of interest, you could flag the Orders stream as “at most 10 ms latency >>>> between the record’s timestamp and the wall-clock time of the server >>>> receiving the records, and any records arriving after that time are >>> logged >>>> and discarded”, and that QoS constraint applies to both producers and >>>> consumers. >>>> >>>> Given a query Q ‘select stream * from Orders’, it is valid to ask “what >>> is >>>> the expected latency of Q?” or tell the planner “produce an >>> implementation >>>> of Q with a latency of no more than 15 ms, and if you cannot achieve that >>>> latency, fail”. You could even register Q in the system and tell the >>> system >>>> to tighten up the latency of any upstream streams and the standing >>> queries >>>> that populate them. But it’s not valid to say “execute Q with a latency >>> of >>>> 15 ms”: the system may not be able to achieve it. >>>> >>>> In summary: I would allow latency and late-row-handling and other QoS >>>> annotations in the query but it’s not the most natural or powerful place >>> to >>>> put them. >>>> >>>> Julian >>>> >>>> >>>>> On Feb 6, 2016, at 1:28 AM, Fabian Hueske <[email protected]> wrote: >>>>> >>>>> Excellent! I missed the punctuations in the todo list. >>>>> >>>>> What kind of strategies do you have in mind to handle events that >>> arrive >>>>> too late? I see >>>>> 1. dropping of late events >>>>> 2. computing an updated window result for each late arriving >>>>> element (implies that the window state is stored for a certain period >>>>> before it is discarded) >>>>> 3. computing a delta to the previous window result for each late >>> arriving >>>>> element (requires window state as well, not applicable to all >>> aggregation >>>>> types) >>>>> >>>>> It would be nice if strategies to handle late-arrivers could be defined >>>> in >>>>> the query. >>>>> >>>>> I think the plans of the Flink community are quite well aligned with >>> your >>>>> ideas for SQL on Streams. >>>>> Should we start by updating / extending the Stream document on the >>>> Calcite >>>>> website to include the new window definitions (TUMBLE, HOP) and a >>>>> discussion of punctuations/watermarks/time bounds? >>>>> >>>>> Fabian >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> 2016-02-06 2:35 GMT+01:00 Julian Hyde <[email protected]>: >>>>> >>>>>> Let me rephrase: The *majority* of the literature, of which I cited >>>>>> just one example, calls them punctuation, and a couple of recent >>>>>> papers out of Mountain View doesn't change that. >>>>>> >>>>>> There are some fine distinctions between punctuation, heartbeats, >>>>>> watermarks and rowtime bounds, mostly in terms of how they are >>>>>> generated and propagated, that matter little when planning the query. >>>>>> >>>>>> On Fri, Feb 5, 2016 at 5:18 PM, Ted Dunning <[email protected]> >>>> wrote: >>>>>>> On Fri, Feb 5, 2016 at 5:10 PM, Julian Hyde <[email protected]> >>> wrote: >>>>>>> >>>>>>>> Yes, watermarks, absolutely. The "to do" list has "punctuation", >>> which >>>>>>>> is the same thing. (Actually, I prefer to call it "rowtime bound" >>>>>>>> because it is feels more like a dynamic constraint than a piece of >>>>>>>> data, but the literature[1] calls them punctuation.) >>>>>>>> >>>>>>> >>>>>>> Some of the literature calls them punctuation, other literature [1] >>>> calls >>>>>>> them watermarks. >>>>>>> >>>>>>> [1] http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf >>>>>> >>>> >>>> >>> >> >> >> >> -- >> Milinda Pathirage >> >> PhD Student | Research Assistant >> School of Informatics and Computing | Data to Insight Center >> Indiana University >> >> twitter: milindalakmal >> skype: milinda.pathirage >> blog: http://milinda.pathirage.org >
