Thanks.
This article is excellent. It completely explains everything.
I would add it as a reference to any and all explanations of structured
streaming (and in the case of watermarking, I simply didn’t understand the
definition before reading this).
Thanks,
Assaf.
From: kostas papageorgopoylos [via Apache Spark Developers List]
Sent: Thursday, October 27, 2016 10:17 AM
To: Mendelson, Assaf
Subject: Re: Watermarking in Structured Streaming to drop late data
Hi all
I would highly recommend that all users and devs interested in the design
suggestions and discussions on watermarking in the Structured Streaming Spark
API take a look at the following links along with the design document. They
help in understanding the notions of watermarks, out-of-order data, and
possible use cases.
https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
Kind Regards
2016-10-27 9:46 GMT+03:00 assaf.mendelson <[hidden email]>:
Hi,
Should comments come here or in the JIRA?
Anyway, I am a little confused about the need to expose this as an API to begin
with.
Let’s consider for a second the most basic behavior: We have some input stream
and we want to aggregate a sum over a time window.
This means that the window we should be looking at spans from the maximum event
time seen in our data back by the window interval. Everything older can be
dropped.
When new data arrives, the maximum time cannot move back, so we generally drop
everything too old.
This basically means we keep only the latest time window.
This simpler model would only break if we have a secondary aggregation which
needs the results of multiple windows.
Is this the use case we are trying to solve?
If so, wouldn’t just calculating the bigger time window across the entire
aggregation solve this?
Am I missing something here?
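The simple model described in this message can be sketched roughly as follows.
This is only an illustration in plain Python, not Spark code: the class name
LatestWindowSum and its add method are hypothetical, and the window is assumed
to be the half-open interval (max_time - window, max_time].

```python
class LatestWindowSum:
    """Sum over the single window ending at the maximum event time seen so far.

    Hypothetical helper for illustration only; not part of any Spark API.
    """

    def __init__(self, window):
        self.window = window            # window interval, e.g. in seconds
        self.max_time = float("-inf")   # maximum event time seen so far
        self.events = []                # retained (event_time, value) pairs

    def add(self, event_time, value):
        # The maximum event time never moves backwards.
        self.max_time = max(self.max_time, event_time)
        cutoff = self.max_time - self.window
        self.events.append((event_time, value))
        # Drop everything at or before the cutoff, including the new event
        # itself if it arrived too late.
        self.events = [(t, v) for t, v in self.events if t > cutoff]
        return sum(v for _, v in self.events)


agg = LatestWindowSum(window=10)
agg.add(100, 1)         # only one event in the window
agg.add(105, 2)         # both events are within 10 of time 105
agg.add(98, 5)          # out of order, but still newer than 105 - 10
total = agg.add(111, 1) # cutoff moves to 101: events at 100 and 98 are dropped
```

Note that the state here is bounded by a single window, which is the property
the message relies on; older windows can no longer be updated at all.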
From: Michael Armbrust [via Apache Spark Developers List] [mailto:[hidden
email]]
Sent: Thursday, October 27, 2016 3:04 AM
To: Mendelson, Assaf
Subject: Re: Watermarking in Structured Streaming to drop late data
And the JIRA: https://issues.apache.org/jira/browse/SPARK-18124
On Wed, Oct 26, 2016 at 4:56 PM, Tathagata Das <[hidden email]> wrote:
Hey all,
We are planning to implement watermarking in Structured Streaming that would
allow us to handle late, out-of-order data better. Specifically, when we are
aggregating over windows on event time, we currently can end up keeping an
unbounded amount of data as state. We want to define watermarks on the event
time in order to mark and drop data that are "too late" and accordingly age out
old aggregates that will not be updated any more.
To enable the user to specify details like lateness threshold, we are
considering adding a new method to Dataset. We would like to get more feedback
on this API. Here is the design doc
https://docs.google.com/document/d/1z-Pazs5v4rA31azvmYhu4I5xwqaNQl6ZLIS03xhkfCQ/
Please comment on the design and proposed APIs.
Thank you very much!
TD
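For readers new to the idea, the behaviour described in the proposal above can
be sketched in plain Python. This is a toy model under stated assumptions, not
the API from the design doc and not Spark's implementation: the class
WatermarkedWindowSums and all of its members are hypothetical, windows are
assumed to be tumbling, and the watermark is assumed to be the maximum event
time seen so far minus a lateness threshold.

```python
class WatermarkedWindowSums:
    """Per-window sums over event time with a watermark (illustrative only)."""

    def __init__(self, window, lateness):
        self.window = window        # tumbling window length
        self.lateness = lateness    # how far behind max event time data may lag
        self.max_time = None        # maximum event time seen so far
        self.state = {}             # window start -> running sum (may still change)
        self.finalized = {}         # window start -> final sum (aged out of state)

    def watermark(self):
        if self.max_time is None:
            return None
        return self.max_time - self.lateness

    def add(self, event_time, value):
        wm = self.watermark()
        if wm is not None and event_time < wm:
            return False            # "too late": dropped, state is untouched
        start = event_time - event_time % self.window
        self.state[start] = self.state.get(start, 0) + value
        if self.max_time is None or event_time > self.max_time:
            self.max_time = event_time
        # Age out windows that end at or before the watermark: no accepted
        # record can fall into them any more.
        wm = self.watermark()
        for s in [s for s in self.state if s + self.window <= wm]:
            self.finalized[s] = self.state.pop(s)
        return True


sums = WatermarkedWindowSums(window=10, lateness=5)
sums.add(12, 1)            # window [10, 20)
sums.add(27, 2)            # watermark becomes 22, window [10, 20) is aged out
accepted = sums.add(14, 9) # older than the watermark, so it is dropped
sums.add(23, 4)            # late but within the threshold, still counted
```

The point of the sketch is that state stays bounded: only windows that can
still receive data within the lateness threshold are kept.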
--
View this message in context:
http://apache-spark-developers-list.1001551.n3.nabble.com/Watermarking-in-Structured-Streaming-to-drop-late-data-tp19589p19600.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.