NicoK commented on a change in pull request #11843:
URL: https://github.com/apache/flink/pull/11843#discussion_r412897634
##########
File path: docs/tutorials/streaming_analytics.md
##########
@@ -0,0 +1,460 @@
+---
+title: Streaming Analytics
+nav-id: analytics
+nav-pos: 4
+nav-title: Streaming Analytics
+nav-parent_id: tutorials
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+* This will be replaced by the TOC
+{:toc}
+
+## Event Time and Watermarks
+
+### Introduction
+
+Flink explicitly supports three different notions of time:
+
+* _event time:_ the time when an event occurred, as recorded by the device producing (or storing) the event
+
+* _ingestion time:_ a timestamp recorded by Flink at the moment it ingests the event
+
+* _processing time:_ the time when a specific operator in your pipeline is processing the event
+
+For reproducible results, e.g., when computing the maximum price a stock reached during the first
+hour of trading on a given day, you should use event time. In this way the result won't depend on
+when the calculation is performed. This kind of real-time application is sometimes performed using
+processing time, but then the results are determined by the events that happen to be processed
+during that hour, rather than the events that occurred then. Computing analytics based on processing
+time causes inconsistencies, and makes it difficult to re-analyze historic data or test new
+implementations.
+
+### Working with Event Time
+
+By default, Flink will use processing time. To change this, you can set the Time Characteristic:
+
+{% highlight java %}
+final StreamExecutionEnvironment env =
+    StreamExecutionEnvironment.getExecutionEnvironment();
+env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
+{% endhighlight %}
+
+If you want to use event time, you will also need to supply a Timestamp Extractor and Watermark
+Generator that Flink will use to track the progress of event time. This will be covered in the
+section below on [Working with Watermarks]({{ site.baseurl }}{% link
+tutorials/streaming_analytics.md %}#working-with-watermarks), but first we should explain what
+watermarks are.
+
+### Watermarks
+
+Let's work through a simple example that will show why watermarks are needed, and how they work.
+
+In this example you have a stream of timestamped events that arrive somewhat out of order, as shown
+below. The numbers shown are timestamps that indicate when these events actually occurred. The first
+event to arrive happened at time 4, and it is followed by an event that happened earlier, at time 2,
+and so on:
+
+    ··· 23 19 22 24 21 14 17 13 12 15 9 11 7 2 4 →
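The tutorial text quoted above says a Timestamp Extractor and Watermark Generator must be supplied when using event time. As a rough illustrative sketch, not taken from this PR, assigning both on a `DataStream` with the pre-1.11 DataStream API could look something like the following; the `Event` type, its `timestamp` field, and the 10-second out-of-orderness bound are hypothetical placeholders.

```java
// Illustrative sketch only (not part of the quoted diff): a periodic watermark
// assigner with a bounded out-of-orderness of 10 seconds. `Event` and its
// `timestamp` field stand in for whatever event type the stream actually carries.
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.time.Time;

DataStream<Event> withTimestampsAndWatermarks = events
    .assignTimestampsAndWatermarks(
        new BoundedOutOfOrdernessTimestampExtractor<Event>(Time.seconds(10)) {
            @Override
            public long extractTimestamp(Event event) {
                // the event's own epoch-millisecond timestamp becomes its event-time timestamp
                return event.timestamp;
            }
        });
```

The extractor used by the actual training exercises may differ; this is only meant to make the "supply a Timestamp Extractor and Watermark Generator" step concrete.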
Review comment:
   this seems to look nice (I'll apply it to your PR unless you object):
   ```
   <div class="text-center" style="font-size: x-large; word-spacing: 0.5em; margin: 1em 0em;">
   ··· 23 19 22 24 21 14 17 13 12 15 9 11 7 2 4 →
   </div>
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
