Re: About Stream SQL

Milinda Pathirage Wed, 17 Feb 2016 12:10:35 -0800

Hi Fabian,

We did some work on stream joins [1]. I tested stream-to-relation joins
with Samza. But not stream-to-stream joins. But never updated the streaming
documentation. I'll send a pull request with some documentation on joins.


Thanks
Milinda

[1] https://issues.apache.org/jira/browse/CALCITE-968

On Wed, Feb 17, 2016 at 5:09 AM, Fabian Hueske <[email protected]> wrote:

> Hi,
>
> I agree, the Streaming page is a very good starting point for this
> discussion. As suggested by Julian, I created CALCITE-1090 to update the
> page such that it reflects the current state of the discussion (adding HOP
> and TUMBLE functions, punctuations). I can also help with that, e.g., by
> contributing figures, examples, or text, reviewing, or any other way.
>
> From my point of view, the semantics of the window types and the other
> operators in the Streaming document are very good. What is missing are
> joins (windowed stream-stream joins, stream-table joins, stream-table joins
> with table updates) as already noted in the todo section.
>
> Regarding the handling of late-arriving events, I am not sure if this is a
> purely QoS issue as the result of a query might depend on the chosen
> strategy. Also users might want to pick different strategies for different
> operations, so late-arriver strategies are not necessarily end-to-end but
> can be operator specific. However, I think these details should be
> discussed in a separate thread.
>
> I'd like to add a few words about the StreamSQL roadmap of the Flink
> community.
> We are currently preparing our codebase and will start to work on support
> for structured queries on streams in the next weeks. Flink will support two
> query interface, a SQL interface and a LINQ-style Table API [1]. Both will
> be optimized and translated by Calcite. As a first step, we want to add
> simple stream transformations such as selection and projection to both
> interfaces. Next, we will add windowing support (starting with tumbling and
> hopping windows) to the Table API (as is said before, our plans here are
> well aligned with Julian's suggestions). Once this is done, we would extend
> the SQL interface to support windows which is hopefully as simple as using
> a parser that accepts window syntax.
>
> So from our point of view, fixing the semantics of windows and extending
> the optimizer accordingly is more urgent than agreeing on a syntax
> (although the Table API syntax could be inspired by Calcite's StreamSQL
> syntax [2]). I can also help implementing the missing features in Calcite.
>
> Having a reference implementation with tests would be awesome and
> definitely help.
>
> Best, Fabian
>
> [1]
>
> https://ci.apache.org/projects/flink/flink-docs-master/apis/batch/libs/table.html
> [2]
>
> https://docs.google.com/document/d/19kSOAOINKCSWLBCKRq2WvNtmuaA9o3AyCh2ePqr3V5E
>
> 2016-02-14 21:23 GMT+01:00 Julian Hyde <[email protected]>:
>
> > Fabian,
> >
> > Apologies for the late reply.
> >
> > I would rather that the specification for streaming SQL was not too
> > prescriptive for how late events were handled. Approaches 1, 2 and 3 are
> > all viable, and engines can differentiate themselves by the strength of
> > their support for this. But for the SQL to be considered valid, I think
> the
> > validator just needs to know that it can make progress.
> >
> > There is a large area of functionality I’d call “quality of service”
> > (QoS). This includes latency, reliability, at-least-once, at-most-once or
> > in-order guarantees, as well as the late-row-handling this thread is
> > concerned with. What the QoS metrics have in common is that they are
> > end-to-end. To deliver a high QoS to the consumer, the producer needs to
> > conform to a high QoS. The QoS is beyond the control of the SQL
> statement.
> > (Although you can ask what a SQL statement is able to deliver, given the
> > upstream QoS guarantees.) QoS is best managed by the whole system, and in
> > my opinion this is the biggest reason to have a DSMS.
> >
> > For this reason, I would be inclined to put QoS constraints on the stream
> > definition, not on the query. For example, taking latency as the QoS
> metric
> > of interest, you could flag the Orders stream as “at most 10 ms latency
> > between the record’s timestamp and the wall-clock time of the server
> > receiving the records, and any records arriving after that time are
> logged
> > and discarded”, and that QoS constraint applies to both producers and
> > consumers.
> >
> > Given a query Q ‘select stream * from Orders’, it is valid to ask “what
> is
> > the expected latency of Q?” or tell the planner “produce an
> implementation
> > of Q with a latency of no more than 15 ms, and if you cannot achieve that
> > latency, fail”. You could even register Q in the system and tell the
> system
> > to tighten up the latency of any upstream streams and the standing
> queries
> > that populate them. But it’s not valid to say “execute Q with a latency
> of
> > 15 ms”: the system may not be able to achieve it.
> >
> > In summary: I would allow latency and late-row-handling and other QoS
> > annotations in the query but it’s not the most natural or powerful place
> to
> > put them.
> >
> > Julian
> >
> >
> > > On Feb 6, 2016, at 1:28 AM, Fabian Hueske <[email protected]> wrote:
> > >
> > > Excellent! I missed the punctuations in the todo list.
> > >
> > > What kind of strategies do you have in mind to handle events that
> arrive
> > > too late? I see
> > > 1. dropping of late events
> > > 2. computing an updated window result for each late arriving
> > > element (implies that the window state is stored for a certain period
> > > before it is discarded)
> > > 3. computing a delta to the previous window result for each late
> arriving
> > > element (requires window state as well, not applicable to all
> aggregation
> > > types)
> > >
> > > It would be nice if strategies to handle late-arrivers could be defined
> > in
> > > the query.
> > >
> > > I think the plans of the Flink community are quite well aligned with
> your
> > > ideas for SQL on Streams.
> > > Should we start by updating / extending the Stream document on the
> > Calcite
> > > website to include the new window definitions (TUMBLE, HOP) and a
> > > discussion of punctuations/watermarks/time bounds?
> > >
> > > Fabian
> > >
> > >
> > >
> > >
> > >
> > >
> > > 2016-02-06 2:35 GMT+01:00 Julian Hyde <[email protected]>:
> > >
> > >> Let me rephrase: The *majority* of the literature, of which I cited
> > >> just one example, calls them punctuation, and a couple of recent
> > >> papers out of Mountain View doesn't change that.
> > >>
> > >> There are some fine distinctions between punctuation, heartbeats,
> > >> watermarks and rowtime bounds, mostly in terms of how they are
> > >> generated and propagated, that matter little when planning the query.
> > >>
> > >> On Fri, Feb 5, 2016 at 5:18 PM, Ted Dunning <[email protected]>
> > wrote:
> > >>> On Fri, Feb 5, 2016 at 5:10 PM, Julian Hyde <[email protected]>
> wrote:
> > >>>
> > >>>> Yes, watermarks, absolutely. The "to do" list has "punctuation",
> which
> > >>>> is the same thing. (Actually, I prefer to call it "rowtime bound"
> > >>>> because it is feels more like a dynamic constraint than a piece of
> > >>>> data, but the literature[1] calls them punctuation.)
> > >>>>
> > >>>
> > >>> Some of the literature calls them punctuation, other literature [1]
> > calls
> > >>> them watermarks.
> > >>>
> > >>> [1] http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf
> > >>
> >
> >
>



-- 
Milinda Pathirage

PhD Student | Research Assistant
School of Informatics and Computing | Data to Insight Center
Indiana University

twitter: milindalakmal
skype: milinda.pathirage
blog: http://milinda.pathirage.org

Re: About Stream SQL

Reply via email to