NicoK commented on a change in pull request #11843: URL: https://github.com/apache/flink/pull/11843#discussion_r412886924
########## File path: docs/tutorials/streaming_analytics.md ##########
@@ -0,0 +1,460 @@
+---
+title: Streaming Analytics
+nav-id: analytics
+nav-pos: 4
+nav-title: Streaming Analytics
+nav-parent_id: tutorials
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+* This will be replaced by the TOC
+{:toc}
+
+## Event Time and Watermarks
+
+### Introduction
+
+Flink explicitly supports three different notions of time:
+
+* _event time:_ the time when an event occurred, as recorded by the device producing (or storing) the event
+
+* _ingestion time:_ a timestamp recorded by Flink at the moment it ingests the event
+
+* _processing time:_ the time when a specific operator in your pipeline is processing the event
+
+For reproducible results, e.g., when computing the maximum price a stock reached during the first
+hour of trading on a given day, you should use event time. In this way the result won't depend on
+when the calculation is performed. This kind of real-time application is sometimes performed using
+processing time, but then the results are determined by the events that happen to be processed
+during that hour, rather than the events that occurred then.
+Computing analytics based on processing
+time causes inconsistencies, and makes it difficult to re-analyze historic data or test new
+implementations.
+
+### Working with Event Time
+
+By default, Flink will use processing time. To change this, you can set the Time Characteristic:
+
+{% highlight java %}
+final StreamExecutionEnvironment env =
+    StreamExecutionEnvironment.getExecutionEnvironment();
+env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
+{% endhighlight %}
+
+If you want to use event time, you will also need to supply a Timestamp Extractor and Watermark
+Generator that Flink will use to track the progress of event time. This will be covered in the
+section below on [Working with Watermarks]({{ site.baseurl }}{% link
+tutorials/streaming_analytics.md %}#working-with-watermarks), but first we should explain what
+watermarks are.
+
+### Watermarks
+
+Let's work through a simple example that will show why watermarks are needed, and how they work.
+
+In this example you have a stream of timestamped events that arrive somewhat out of order, as shown
+below. The numbers shown are timestamps that indicate when these events actually occurred. The first
+event to arrive happened at time 4, and it is followed by an event that happened earlier, at time 2,
+and so on:
+
+    ··· 23 19 22 24 21 14 17 13 12 15 9 11 7 2 4 →
+
+Now imagine that you are trying to create a stream sorter. This is meant to be an application that
+processes each event from a stream as it arrives, and emits a new stream containing the same events,
+but ordered by their timestamps.
+
+Some observations:
+
+(1) The first element your stream sorter sees is the 4, but you can't just immediately release it as
+the first element of the sorted stream. It may have arrived out of order, and an earlier event might
+yet arrive.
+In fact, you have the benefit of some god-like knowledge of this stream's future, and
+you can see that your stream sorter should wait at least until the 2 arrives before producing any
+results.
+
+*Some buffering, and some delay, is necessary.*
+
+(2) If you do this wrong, you could end up waiting forever. First the sorter saw an event from time
+4, and then an event from time 2. Will an event with a timestamp less than 2 ever arrive? Maybe.
+Maybe not. You could wait forever and never see a 1.
+
+*Eventually you have to be courageous and emit the 2 as the start of the sorted stream.*
+
+(3) What you need then is some sort of policy that defines when, for any given timestamped event, to
+stop waiting for the arrival of earlier events.
+
+*This is precisely what watermarks do* — they define when to stop waiting for earlier events.
+
+Event time processing in Flink depends on *watermark generators* that insert special timestamped
+elements into the stream, called *watermarks*. A watermark for time _t_ is an assertion that the
+stream is (probably) now complete up through time _t_.
+
+When should this stream sorter stop waiting, and push out the 2 to start the sorted stream? When a
+watermark arrives with a timestamp of 2, or greater.
+
+(4) You might imagine different policies for deciding how to generate watermarks.
+
+Each event arrives after some delay, and these delays vary, so some events are delayed more than
+others. One simple approach is to assume that these delays are bounded by some maximum delay. Flink
+refers to this strategy as *bounded-out-of-orderness* watermarking. It's easy to imagine more
+complex approaches to watermarking, but for most applications a fixed delay works well enough.
+
+### Latency vs. Completeness
+
+Another way to think about watermarks is that they give you, the developer of a streaming
+application, control over the tradeoff between latency and completeness.
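Conceptually, the bounded-out-of-orderness strategy described above is simple: the watermark trails the largest timestamp seen so far by the assumed maximum delay, and shrinking or growing that delay is exactly the latency-vs-completeness knob. A minimal plain-Java sketch of that idea (illustrative names only, not Flink's actual API):

```java
// Sketch of bounded-out-of-orderness watermarking: the watermark trails the
// largest event timestamp seen so far by a fixed maximum expected delay.
// This is a simulation of the concept, not Flink code.
public class BoundedOutOfOrdernessSketch {

    static final class WatermarkTracker {
        private final long maxOutOfOrderness;   // the assumed bound on delay
        private long maxTimestampSeen = Long.MIN_VALUE;

        WatermarkTracker(long maxOutOfOrderness) {
            this.maxOutOfOrderness = maxOutOfOrderness;
        }

        // Called for every event; remembers the largest timestamp seen so far.
        void onEvent(long eventTimestamp) {
            maxTimestampSeen = Math.max(maxTimestampSeen, eventTimestamp);
        }

        // Current watermark: an assertion that the stream is (probably)
        // complete up through this time.
        long currentWatermark() {
            return maxTimestampSeen - maxOutOfOrderness;
        }
    }

    public static void main(String[] args) {
        WatermarkTracker tracker = new WatermarkTracker(5);
        long[] stream = {4, 2, 7, 11, 9};   // out-of-order event timestamps
        for (long ts : stream) {
            tracker.onEvent(ts);
        }
        // Largest timestamp seen is 11, so the watermark is 11 - 5 = 6.
        System.out.println(tracker.currentWatermark());
    }
}
```

With a small `maxOutOfOrderness` the watermark advances quickly (low latency, more risk of declaring the stream complete too early); with a large one it lags further behind (more completeness, more delay).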
+Unlike in batch processing,
+where one has the luxury of being able to have complete knowledge of the input before producing any
+results, with streaming you must eventually stop waiting to see more of the input, and produce some
+sort of result.
+
+You can either configure your watermarking aggressively, with a short bounded delay, and thereby
+take the risk of producing results with rather incomplete knowledge of the input -- i.e., a possibly
+wrong result, produced quickly. Or you can wait longer, and produce results that take advantage of
+having more complete knowledge of the input stream(s).
+
+It is also possible to implement hybrid solutions that produce initial results quickly, and then
+supply updates to those results as additional (late) data is processed. This is a good approach for
+some applications.
+
+### Lateness
+
+Lateness is defined relative to the watermarks. A `Watermark(t)` asserts that the stream is complete
+up through time _t_; any event following this watermark whose timestamp is ≤ _t_ is late.

Review comment:
       oh, I guess, then the other "thru" that I saw in a previous PR may need changing... (I was wondering there, looked it up quickly, and then proposed it here as a result)

----------------------------------------------------------------

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
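As a footnote to the lateness definition in the diff above: the rule reduces to a single comparison against the most recent watermark. A tiny illustrative sketch (not Flink API):

```java
// Sketch of the lateness rule from the docs above: after Watermark(t) has
// been emitted, any event whose timestamp is <= t is considered late.
public class LatenessSketch {

    static boolean isLate(long eventTimestamp, long currentWatermark) {
        return eventTimestamp <= currentWatermark;
    }

    public static void main(String[] args) {
        long watermark = 10;                       // Watermark(10) was emitted
        System.out.println(isLate(9, watermark));  // event at time 9: late
        System.out.println(isLate(11, watermark)); // event at time 11: on time
    }
}
```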
