Hi Julian,

I agree with you. Calcite should stay away from physical properties of
stream as much as possible. I was just trying to clarify the confusion
regarding the punctuations and watermarks. My last question was not related
to Calcite, rather to Flink and other implementations. Sorry for the
confusion.

Milinda

On Tue, Feb 23, 2016 at 11:41 AM, Julian Hyde <[email protected]> wrote:

> As the author of the streaming SQL specification, I don't care at all
> how the system deduces that it is able to make progress. Just as the
> authors of the SQL standard don't care whether a vendor chooses to
> store records sorted and/or compressed.
>
> All the streaming SQL validator/optimizer needs to know is that, say,
> orderTime is monotonic, orderId is quasi-monotonic, and paymentMethod
> is non-monotonic, so it can allow streaming aggregations on orderTime
> and orderId, and disallow them on paymentMethod.
>
> This allows streaming engines to add novel mechanisms without having
> to change the definition of streaming SQL.
>
>
> On Tue, Feb 23, 2016 at 7:42 AM, Milinda Pathirage
> <[email protected]> wrote:
> > Thank you Julian for the document.
> >
> > [1] is also a good read on punctuation. What I understood from reading
> [1]
> > and MillWheel paper is that a low-watermark (or row-time bound) is a
> > property maintained by operators and operators derive low-watermark by
> > processing punctuations.
> >
> > One other thing mentioned in MillWheel is the fact that Google's input
> > streams contain punctuations to communicate stream progress. If
> > punctuations are not there in the input stream we will have to generate
> > them during ingest based on a slack or some similar technique. What do
> you
> > think about this?
> >
> > Thanks
> > Milinda
> >
> >
> > [1] http://www.vldb.org/pvldb/1/1453890.pdf
> >
> > On Mon, Feb 22, 2016 at 9:44 PM, Julian Hyde <[email protected]> wrote:
> >
> >> I’ve updated the Streaming reference guide as Fabian requested:
> >> http://calcite.apache.org/docs/stream.html <
> >> http://calcite.apache.org/docs/stream.html>
> >>
> >> Julian
> >>
> >> > On Feb 19, 2016, at 3:11 PM, Julian Hyde <[email protected]> wrote:
> >> >
> >> > I gave a talk about streaming SQL at a Samza meetup. A lot of it is
> >> about the semantics of streaming SQL, and I cover some ground that I
> don’t
> >> cover in the streams page[1].
> >> >
> >> > The news item[2] gets you to both slides and video.
> >> >
> >> > In other news, I notice[3] that Spark 2.1 will contain “continuous
> SQL”.
> >> If the examples[4] are accurate, all queries are heavily based on
> sliding
> >> windows, and they use a syntax for those windows that is very different
> to
> >> standard SQL.  I think we can deal with their use cases, and in my
> opinion
> >> our proposed syntax is more elegant and closer to the standard. But we
> >> should discuss. I don’t want to diverge from other efforts because of
> >> hubris/ignorance.
> >> >
> >> > At the Samza meetup some folks mentioned the use case of a stream that
> >> summarizes, emitting periodic totals even if there were no data in a
> given
> >> period. Can they re-state that use case here, so we can discuss?
> >> >
> >> > Julian
> >> >
> >> > [1] http://calcite.apache.org/docs/stream.html
> >> >
> >> > [2] http://calcite.apache.org/news/2016/02/17/streaming-sql-talk/
> >> >
> >> > [3]
> >>
> http://www.slideshare.net/databricks/the-future-of-realtime-in-spark-58433411
> >> slide 29
> >> >
> >> > [4]
> >>
> https://issues.apache.org/jira/secure/attachment/12775265/StreamingDataFrameProposal.pdf
> >> >
> >> >> On Feb 17, 2016, at 12:09 PM, Milinda Pathirage <
> [email protected]>
> >> wrote:
> >> >>
> >> >> Hi Fabian,
> >> >>
> >> >> We did some work on stream joins [1]. I tested stream-to-relation
> joins
> >> >> with Samza. But not stream-to-stream joins. But never updated the
> >> streaming
> >> >> documentation. I'll send a pull request with some documentation on
> >> joins.
> >> >>
> >> >> Thanks
> >> >> Milinda
> >> >>
> >> >> [1] https://issues.apache.org/jira/browse/CALCITE-968
> >> >>
> >> >> On Wed, Feb 17, 2016 at 5:09 AM, Fabian Hueske <[email protected]>
> >> wrote:
> >> >>
> >> >>> Hi,
> >> >>>
> >> >>> I agree, the Streaming page is a very good starting point for this
> >> >>> discussion. As suggested by Julian, I created CALCITE-1090 to update
> >> the
> >> >>> page such that it reflects the current state of the discussion
> (adding
> >> HOP
> >> >>> and TUMBLE functions, punctuations). I can also help with that,
> e.g.,
> >> by
> >> >>> contributing figures, examples, or text, reviewing, or any other
> way.
> >> >>>
> >> >>> From my point of view, the semantics of the window types and the
> other
> >> >>> operators in the Streaming document are very good. What is missing
> are
> >> >>> joins (windowed stream-stream joins, stream-table joins,
> stream-table
> >> joins
> >> >>> with table updates) as already noted in the todo section.
> >> >>>
> >> >>> Regarding the handling of late-arriving events, I am not sure if
> this
> >> is a
> >> >>> purely QoS issue as the result of a query might depend on the chosen
> >> >>> strategy. Also users might want to pick different strategies for
> >> different
> >> >>> operations, so late-arriver strategies are not necessarily
> end-to-end
> >> but
> >> >>> can be operator specific. However, I think these details should be
> >> >>> discussed in a separate thread.
> >> >>>
> >> >>> I'd like to add a few words about the StreamSQL roadmap of the Flink
> >> >>> community.
> >> >>> We are currently preparing our codebase and will start to work on
> >> support
> >> >>> for structured queries on streams in the next weeks. Flink will
> >> support two
> >> >>> query interface, a SQL interface and a LINQ-style Table API [1].
> Both
> >> will
> >> >>> be optimized and translated by Calcite. As a first step, we want to
> add
> >> >>> simple stream transformations such as selection and projection to
> both
> >> >>> interfaces. Next, we will add windowing support (starting with
> >> tumbling and
> >> >>> hopping windows) to the Table API (as is said before, our plans here
> >> are
> >> >>> well aligned with Julian's suggestions). Once this is done, we would
> >> extend
> >> >>> the SQL interface to support windows which is hopefully as simple as
> >> using
> >> >>> a parser that accepts window syntax.
> >> >>>
> >> >>> So from our point of view, fixing the semantics of windows and
> >> extending
> >> >>> the optimizer accordingly is more urgent than agreeing on a syntax
> >> >>> (although the Table API syntax could be inspired by Calcite's
> StreamSQL
> >> >>> syntax [2]). I can also help implementing the missing features in
> >> Calcite.
> >> >>>
> >> >>> Having a reference implementation with tests would be awesome and
> >> >>> definitely help.
> >> >>>
> >> >>> Best, Fabian
> >> >>>
> >> >>> [1]
> >> >>>
> >> >>>
> >>
> https://ci.apache.org/projects/flink/flink-docs-master/apis/batch/libs/table.html
> >> >>> [2]
> >> >>>
> >> >>>
> >>
> https://docs.google.com/document/d/19kSOAOINKCSWLBCKRq2WvNtmuaA9o3AyCh2ePqr3V5E
> >> >>>
> >> >>> 2016-02-14 21:23 GMT+01:00 Julian Hyde <[email protected]>:
> >> >>>
> >> >>>> Fabian,
> >> >>>>
> >> >>>> Apologies for the late reply.
> >> >>>>
> >> >>>> I would rather that the specification for streaming SQL was not too
> >> >>>> prescriptive for how late events were handled. Approaches 1, 2 and
> 3
> >> are
> >> >>>> all viable, and engines can differentiate themselves by the
> strength
> >> of
> >> >>>> their support for this. But for the SQL to be considered valid, I
> >> think
> >> >>> the
> >> >>>> validator just needs to know that it can make progress.
> >> >>>>
> >> >>>> There is a large area of functionality I’d call “quality of
> service”
> >> >>>> (QoS). This includes latency, reliability, at-least-once,
> >> at-most-once or
> >> >>>> in-order guarantees, as well as the late-row-handling this thread
> is
> >> >>>> concerned with. What the QoS metrics have in common is that they
> are
> >> >>>> end-to-end. To deliver a high QoS to the consumer, the producer
> needs
> >> to
> >> >>>> conform to a high QoS. The QoS is beyond the control of the SQL
> >> >>> statement.
> >> >>>> (Although you can ask what a SQL statement is able to deliver,
> given
> >> the
> >> >>>> upstream QoS guarantees.) QoS is best managed by the whole system,
> >> and in
> >> >>>> my opinion this is the biggest reason to have a DSMS.
> >> >>>>
> >> >>>> For this reason, I would be inclined to put QoS constraints on the
> >> stream
> >> >>>> definition, not on the query. For example, taking latency as the
> QoS
> >> >>> metric
> >> >>>> of interest, you could flag the Orders stream as “at most 10 ms
> >> latency
> >> >>>> between the record’s timestamp and the wall-clock time of the
> server
> >> >>>> receiving the records, and any records arriving after that time are
> >> >>> logged
> >> >>>> and discarded”, and that QoS constraint applies to both producers
> and
> >> >>>> consumers.
> >> >>>>
> >> >>>> Given a query Q ‘select stream * from Orders’, it is valid to ask
> >> “what
> >> >>> is
> >> >>>> the expected latency of Q?” or tell the planner “produce an
> >> >>> implementation
> >> >>>> of Q with a latency of no more than 15 ms, and if you cannot
> achieve
> >> that
> >> >>>> latency, fail”. You could even register Q in the system and tell
> the
> >> >>> system
> >> >>>> to tighten up the latency of any upstream streams and the standing
> >> >>> queries
> >> >>>> that populate them. But it’s not valid to say “execute Q with a
> >> latency
> >> >>> of
> >> >>>> 15 ms”: the system may not be able to achieve it.
> >> >>>>
> >> >>>> In summary: I would allow latency and late-row-handling and other
> QoS
> >> >>>> annotations in the query but it’s not the most natural or powerful
> >> place
> >> >>> to
> >> >>>> put them.
> >> >>>>
> >> >>>> Julian
> >> >>>>
> >> >>>>
> >> >>>>> On Feb 6, 2016, at 1:28 AM, Fabian Hueske <[email protected]>
> wrote:
> >> >>>>>
> >> >>>>> Excellent! I missed the punctuations in the todo list.
> >> >>>>>
> >> >>>>> What kind of strategies do you have in mind to handle events that
> >> >>> arrive
> >> >>>>> too late? I see
> >> >>>>> 1. dropping of late events
> >> >>>>> 2. computing an updated window result for each late arriving
> >> >>>>> element (implies that the window state is stored for a certain
> period
> >> >>>>> before it is discarded)
> >> >>>>> 3. computing a delta to the previous window result for each late
> >> >>> arriving
> >> >>>>> element (requires window state as well, not applicable to all
> >> >>> aggregation
> >> >>>>> types)
> >> >>>>>
> >> >>>>> It would be nice if strategies to handle late-arrivers could be
> >> defined
> >> >>>> in
> >> >>>>> the query.
> >> >>>>>
> >> >>>>> I think the plans of the Flink community are quite well aligned
> with
> >> >>> your
> >> >>>>> ideas for SQL on Streams.
> >> >>>>> Should we start by updating / extending the Stream document on the
> >> >>>> Calcite
> >> >>>>> website to include the new window definitions (TUMBLE, HOP) and a
> >> >>>>> discussion of punctuations/watermarks/time bounds?
> >> >>>>>
> >> >>>>> Fabian
> >> >>>>>
> >> >>>>>
> >> >>>>>
> >> >>>>>
> >> >>>>>
> >> >>>>>
> >> >>>>> 2016-02-06 2:35 GMT+01:00 Julian Hyde <[email protected]>:
> >> >>>>>
> >> >>>>>> Let me rephrase: The *majority* of the literature, of which I
> cited
> >> >>>>>> just one example, calls them punctuation, and a couple of recent
> >> >>>>>> papers out of Mountain View doesn't change that.
> >> >>>>>>
> >> >>>>>> There are some fine distinctions between punctuation, heartbeats,
> >> >>>>>> watermarks and rowtime bounds, mostly in terms of how they are
> >> >>>>>> generated and propagated, that matter little when planning the
> >> query.
> >> >>>>>>
> >> >>>>>> On Fri, Feb 5, 2016 at 5:18 PM, Ted Dunning <
> [email protected]>
> >> >>>> wrote:
> >> >>>>>>> On Fri, Feb 5, 2016 at 5:10 PM, Julian Hyde <[email protected]>
> >> >>> wrote:
> >> >>>>>>>
> >> >>>>>>>> Yes, watermarks, absolutely. The "to do" list has
> "punctuation",
> >> >>> which
> >> >>>>>>>> is the same thing. (Actually, I prefer to call it "rowtime
> bound"
> >> >>>>>>>> because it is feels more like a dynamic constraint than a
> piece of
> >> >>>>>>>> data, but the literature[1] calls them punctuation.)
> >> >>>>>>>>
> >> >>>>>>>
> >> >>>>>>> Some of the literature calls them punctuation, other literature
> [1]
> >> >>>> calls
> >> >>>>>>> them watermarks.
> >> >>>>>>>
> >> >>>>>>> [1] http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf
> >> >>>>>>
> >> >>>>
> >> >>>>
> >> >>>
> >> >>
> >> >>
> >> >>
> >> >> --
> >> >> Milinda Pathirage
> >> >>
> >> >> PhD Student | Research Assistant
> >> >> School of Informatics and Computing | Data to Insight Center
> >> >> Indiana University
> >> >>
> >> >> twitter: milindalakmal
> >> >> skype: milinda.pathirage
> >> >> blog: http://milinda.pathirage.org
> >> >
> >>
> >>
> >
> >
> > --
> > Milinda Pathirage
> >
> > PhD Student | Research Assistant
> > School of Informatics and Computing | Data to Insight Center
> > Indiana University
> >
> > twitter: milindalakmal
> > skype: milinda.pathirage
> > blog: http://milinda.pathirage.org
>



-- 
Milinda Pathirage

PhD Student | Research Assistant
School of Informatics and Computing | Data to Insight Center
Indiana University

twitter: milindalakmal
skype: milinda.pathirage
blog: http://milinda.pathirage.org

Reply via email to