Hi Fabian, We did some work on stream joins [1]. I tested stream-to-relation joins with Samza. But not stream-to-stream joins. But never updated the streaming documentation. I'll send a pull request with some documentation on joins.
Thanks Milinda [1] https://issues.apache.org/jira/browse/CALCITE-968 On Wed, Feb 17, 2016 at 5:09 AM, Fabian Hueske <[email protected]> wrote: > Hi, > > I agree, the Streaming page is a very good starting point for this > discussion. As suggested by Julian, I created CALCITE-1090 to update the > page such that it reflects the current state of the discussion (adding HOP > and TUMBLE functions, punctuations). I can also help with that, e.g., by > contributing figures, examples, or text, reviewing, or any other way. > > From my point of view, the semantics of the window types and the other > operators in the Streaming document are very good. What is missing are > joins (windowed stream-stream joins, stream-table joins, stream-table joins > with table updates) as already noted in the todo section. > > Regarding the handling of late-arriving events, I am not sure if this is a > purely QoS issue as the result of a query might depend on the chosen > strategy. Also users might want to pick different strategies for different > operations, so late-arriver strategies are not necessarily end-to-end but > can be operator specific. However, I think these details should be > discussed in a separate thread. > > I'd like to add a few words about the StreamSQL roadmap of the Flink > community. > We are currently preparing our codebase and will start to work on support > for structured queries on streams in the next weeks. Flink will support two > query interface, a SQL interface and a LINQ-style Table API [1]. Both will > be optimized and translated by Calcite. As a first step, we want to add > simple stream transformations such as selection and projection to both > interfaces. Next, we will add windowing support (starting with tumbling and > hopping windows) to the Table API (as is said before, our plans here are > well aligned with Julian's suggestions). Once this is done, we would extend > the SQL interface to support windows which is hopefully as simple as using > a parser that accepts window syntax. > > So from our point of view, fixing the semantics of windows and extending > the optimizer accordingly is more urgent than agreeing on a syntax > (although the Table API syntax could be inspired by Calcite's StreamSQL > syntax [2]). I can also help implementing the missing features in Calcite. > > Having a reference implementation with tests would be awesome and > definitely help. > > Best, Fabian > > [1] > > https://ci.apache.org/projects/flink/flink-docs-master/apis/batch/libs/table.html > [2] > > https://docs.google.com/document/d/19kSOAOINKCSWLBCKRq2WvNtmuaA9o3AyCh2ePqr3V5E > > 2016-02-14 21:23 GMT+01:00 Julian Hyde <[email protected]>: > > > Fabian, > > > > Apologies for the late reply. > > > > I would rather that the specification for streaming SQL was not too > > prescriptive for how late events were handled. Approaches 1, 2 and 3 are > > all viable, and engines can differentiate themselves by the strength of > > their support for this. But for the SQL to be considered valid, I think > the > > validator just needs to know that it can make progress. > > > > There is a large area of functionality I’d call “quality of service” > > (QoS). This includes latency, reliability, at-least-once, at-most-once or > > in-order guarantees, as well as the late-row-handling this thread is > > concerned with. What the QoS metrics have in common is that they are > > end-to-end. To deliver a high QoS to the consumer, the producer needs to > > conform to a high QoS. The QoS is beyond the control of the SQL > statement. > > (Although you can ask what a SQL statement is able to deliver, given the > > upstream QoS guarantees.) QoS is best managed by the whole system, and in > > my opinion this is the biggest reason to have a DSMS. > > > > For this reason, I would be inclined to put QoS constraints on the stream > > definition, not on the query. For example, taking latency as the QoS > metric > > of interest, you could flag the Orders stream as “at most 10 ms latency > > between the record’s timestamp and the wall-clock time of the server > > receiving the records, and any records arriving after that time are > logged > > and discarded”, and that QoS constraint applies to both producers and > > consumers. > > > > Given a query Q ‘select stream * from Orders’, it is valid to ask “what > is > > the expected latency of Q?” or tell the planner “produce an > implementation > > of Q with a latency of no more than 15 ms, and if you cannot achieve that > > latency, fail”. You could even register Q in the system and tell the > system > > to tighten up the latency of any upstream streams and the standing > queries > > that populate them. But it’s not valid to say “execute Q with a latency > of > > 15 ms”: the system may not be able to achieve it. > > > > In summary: I would allow latency and late-row-handling and other QoS > > annotations in the query but it’s not the most natural or powerful place > to > > put them. > > > > Julian > > > > > > > On Feb 6, 2016, at 1:28 AM, Fabian Hueske <[email protected]> wrote: > > > > > > Excellent! I missed the punctuations in the todo list. > > > > > > What kind of strategies do you have in mind to handle events that > arrive > > > too late? I see > > > 1. dropping of late events > > > 2. computing an updated window result for each late arriving > > > element (implies that the window state is stored for a certain period > > > before it is discarded) > > > 3. computing a delta to the previous window result for each late > arriving > > > element (requires window state as well, not applicable to all > aggregation > > > types) > > > > > > It would be nice if strategies to handle late-arrivers could be defined > > in > > > the query. > > > > > > I think the plans of the Flink community are quite well aligned with > your > > > ideas for SQL on Streams. > > > Should we start by updating / extending the Stream document on the > > Calcite > > > website to include the new window definitions (TUMBLE, HOP) and a > > > discussion of punctuations/watermarks/time bounds? > > > > > > Fabian > > > > > > > > > > > > > > > > > > > > > 2016-02-06 2:35 GMT+01:00 Julian Hyde <[email protected]>: > > > > > >> Let me rephrase: The *majority* of the literature, of which I cited > > >> just one example, calls them punctuation, and a couple of recent > > >> papers out of Mountain View doesn't change that. > > >> > > >> There are some fine distinctions between punctuation, heartbeats, > > >> watermarks and rowtime bounds, mostly in terms of how they are > > >> generated and propagated, that matter little when planning the query. > > >> > > >> On Fri, Feb 5, 2016 at 5:18 PM, Ted Dunning <[email protected]> > > wrote: > > >>> On Fri, Feb 5, 2016 at 5:10 PM, Julian Hyde <[email protected]> > wrote: > > >>> > > >>>> Yes, watermarks, absolutely. The "to do" list has "punctuation", > which > > >>>> is the same thing. (Actually, I prefer to call it "rowtime bound" > > >>>> because it is feels more like a dynamic constraint than a piece of > > >>>> data, but the literature[1] calls them punctuation.) > > >>>> > > >>> > > >>> Some of the literature calls them punctuation, other literature [1] > > calls > > >>> them watermarks. > > >>> > > >>> [1] http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf > > >> > > > > > -- Milinda Pathirage PhD Student | Research Assistant School of Informatics and Computing | Data to Insight Center Indiana University twitter: milindalakmal skype: milinda.pathirage blog: http://milinda.pathirage.org
