NicoK commented on a change in pull request #11843: URL: https://github.com/apache/flink/pull/11843#discussion_r412886924
########## File path: docs/tutorials/streaming_analytics.md ##########
@@ -0,0 +1,460 @@
+---
+title: Streaming Analytics
+nav-id: analytics
+nav-pos: 4
+nav-title: Streaming Analytics
+nav-parent_id: tutorials
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+* This will be replaced by the TOC
+{:toc}
+
+## Event Time and Watermarks
+
+### Introduction
+
+Flink explicitly supports three different notions of time:
+
+* _event time:_ the time when an event occurred, as recorded by the device producing (or storing) the event
+
+* _ingestion time:_ a timestamp recorded by Flink at the moment it ingests the event
+
+* _processing time:_ the time when a specific operator in your pipeline is processing the event
+
+For reproducible results, e.g., when computing the maximum price a stock reached during the first
+hour of trading on a given day, you should use event time. In this way the result won't depend on
+when the calculation is performed. This kind of real-time application is sometimes performed using
+processing time, but then the results are determined by the events that happen to be processed
+during that hour, rather than the events that occurred then.
+Computing analytics based on processing
+time causes inconsistencies, and makes it difficult to re-analyze historic data or test new
+implementations.
+
+### Working with Event Time
+
+By default, Flink will use processing time. To change this, you can set the Time Characteristic:
+
+{% highlight java %}
+final StreamExecutionEnvironment env =
+    StreamExecutionEnvironment.getExecutionEnvironment();
+env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
+{% endhighlight %}
+
+If you want to use event time, you will also need to supply a Timestamp Extractor and Watermark
+Generator that Flink will use to track the progress of event time. This will be covered in the
+section below on [Working with Watermarks]({{ site.baseurl }}{% link
+tutorials/streaming_analytics.md %}#working-with-watermarks), but first we should explain what
+watermarks are.
+
+### Watermarks
+
+Let's work through a simple example that will show why watermarks are needed, and how they work.
+
+In this example you have a stream of timestamped events that arrive somewhat out of order, as shown
+below. The numbers shown are timestamps that indicate when these events actually occurred. The first
+event to arrive happened at time 4, and it is followed by an event that happened earlier, at time 2,
+and so on:
+
+    ··· 23 19 22 24 21 14 17 13 12 15 9 11 7 2 4 →
+
+Now imagine that you are trying to create a stream sorter. This is meant to be an application that
+processes each event from a stream as it arrives, and emits a new stream containing the same events,
+but ordered by their timestamps.
+
+Some observations:
+
+(1) The first element your stream sorter sees is the 4, but you can't just immediately release it as
+the first element of the sorted stream. It may have arrived out of order, and an earlier event might
+yet arrive.
+In fact, you have the benefit of some god-like knowledge of this stream's future, and
+you can see that your stream sorter should wait at least until the 2 arrives before producing any
+results.
+
+*Some buffering, and some delay, is necessary.*
+
+(2) If you do this wrong, you could end up waiting forever. First the sorter saw an event from time
+4, and then an event from time 2. Will an event with a timestamp less than 2 ever arrive? Maybe.
+Maybe not. You could wait forever and never see a 1.
+
+*Eventually you have to be courageous and emit the 2 as the start of the sorted stream.*
+
+(3) What you need then is some sort of policy that defines when, for any given timestamped event, to
+stop waiting for the arrival of earlier events.
+
+*This is precisely what watermarks do* — they define when to stop waiting for earlier events.
+
+Event time processing in Flink depends on *watermark generators* that insert special timestamped
+elements into the stream, called *watermarks*. A watermark for time _t_ is an assertion that the
+stream is (probably) now complete up through time _t_.
+
+When should this stream sorter stop waiting, and push out the 2 to start the sorted stream? When a
+watermark arrives with a timestamp of 2, or greater.
+
+(4) You might imagine different policies for deciding how to generate watermarks.
+
+Each event arrives after some delay, and these delays vary, so some events are delayed more than
+others. One simple approach is to assume that these delays are bounded by some maximum delay. Flink
+refers to this strategy as *bounded-out-of-orderness* watermarking. It's easy to imagine more
+complex approaches to watermarking, but for most applications a fixed delay works well enough.
+
+### Latency vs. Completeness
+
+Another way to think about watermarks is that they give you, the developer of a streaming
+application, control over the tradeoff between latency and completeness.
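Conceptually, the bounded-out-of-orderness strategy described above is simple: the watermark trails the largest timestamp seen so far by the assumed maximum delay, and shrinking or growing that delay is exactly the latency-vs-completeness knob. A minimal plain-Java sketch of that idea (illustrative names only, not Flink's actual API):

```java
// Sketch of bounded-out-of-orderness watermarking: the watermark trails the
// largest event timestamp seen so far by a fixed maximum expected delay.
// This is a simulation of the concept, not Flink code.
public class BoundedOutOfOrdernessSketch {

    static final class WatermarkTracker {
        private final long maxOutOfOrderness;   // the assumed bound on delay
        private long maxTimestampSeen = Long.MIN_VALUE;

        WatermarkTracker(long maxOutOfOrderness) {
            this.maxOutOfOrderness = maxOutOfOrderness;
        }

        // Called for every event; remembers the largest timestamp seen so far.
        void onEvent(long eventTimestamp) {
            maxTimestampSeen = Math.max(maxTimestampSeen, eventTimestamp);
        }

        // Current watermark: an assertion that the stream is (probably)
        // complete up through this time.
        long currentWatermark() {
            return maxTimestampSeen - maxOutOfOrderness;
        }
    }

    public static void main(String[] args) {
        WatermarkTracker tracker = new WatermarkTracker(5);
        long[] stream = {4, 2, 7, 11, 9};   // out-of-order event timestamps
        for (long ts : stream) {
            tracker.onEvent(ts);
        }
        // Largest timestamp seen is 11, so the watermark is 11 - 5 = 6.
        System.out.println(tracker.currentWatermark());
    }
}
```

With a small `maxOutOfOrderness` the watermark advances quickly (low latency, more risk of declaring the stream complete too early); with a large one it lags further behind (more completeness, more delay).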
+Unlike in batch processing,
+where one has the luxury of being able to have complete knowledge of the input before producing any
+results, with streaming you must eventually stop waiting to see more of the input, and produce some
+sort of result.
+
+You can either configure your watermarking aggressively, with a short bounded delay, and thereby
+take the risk of producing results with rather incomplete knowledge of the input -- i.e., a possibly
+wrong result, produced quickly. Or you can wait longer, and produce results that take advantage of
+having more complete knowledge of the input stream(s).
+
+It is also possible to implement hybrid solutions that produce initial results quickly, and then
+supply updates to those results as additional (late) data is processed. This is a good approach for
+some applications.
+
+### Lateness
+
+Lateness is defined relative to the watermarks. A `Watermark(t)` asserts that the stream is complete
+up through time _t_; any event following this watermark whose timestamp is ≤ _t_ is late.

Review comment:
       oh, I guess, then the other "thru" that I saw in a previous PR may need changing... (I was wondering there, looked it up quickly, and then proposed it here as a result)

----------------------------------------------------------------

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
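As a footnote to the lateness definition in the diff above: the rule reduces to a single comparison against the most recent watermark. A tiny illustrative sketch (not Flink API):

```java
// Sketch of the lateness rule from the docs above: after Watermark(t) has
// been emitted, any event whose timestamp is <= t is considered late.
public class LatenessSketch {

    static boolean isLate(long eventTimestamp, long currentWatermark) {
        return eventTimestamp <= currentWatermark;
    }

    public static void main(String[] args) {
        long watermark = 10;                       // Watermark(10) was emitted
        System.out.println(isLate(9, watermark));  // event at time 9: late
        System.out.println(isLate(11, watermark)); // event at time 11: on time
    }
}
```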
