[GitHub] flink pull request #3191: [FLINK-5529] [FLINK-4752] [docs] Improve / extends...

aljoscha Wed, 25 Jan 2017 09:32:37 -0800

Github user aljoscha commented on a diff in the pull request:

    https://github.com/apache/flink/pull/3191#discussion_r97824676
  
    --- Diff: docs/dev/windows.md ---
    @@ -23,133 +23,96 @@ specific language governing permissions and limitations
     under the License.
     -->
     
    -Flink uses a concept called *windows* to divide a (potentially) infinite 
`DataStream` into finite
    -slices based on the timestamps of elements or other criteria. This 
division is required when working
    -with infinite streams of data and performing transformations that 
aggregate elements.
    -
    -<span class="label label-info">Info</span> We will mostly talk about 
*keyed windowing* here, i.e.
    -windows that are applied on a `KeyedStream`. Keyed windows have the 
advantage that elements are
    -subdivided based on both window and key before being given to
    -a user function. The work can thus be distributed across the cluster
    -because the elements for different keys can be processed independently. If 
you absolutely have to,
    -you can check out [non-keyed windowing](#non-keyed-windowing) where we 
describe how non-keyed
    -windows work.
    +Windows are at the heart of processing infinite streams. Windows split the 
stream into "buckets" of finite size, 
    +over which we can apply computations. This document focuses on how 
windowing is performed in Flink and how the 
    +programmer can benefit to the maximum from its offered functionality. 
     
    -* This will be replaced by the TOC
    -{:toc}
    +The general structure of a windowed Flink program is presented below. This 
is also going to serve as a roadmap for 
    +the rest of the page.
     
    -## Basics
    +    stream
    +           .keyBy(...)          <-  keyed versus non-keyed windows
    +           .window(...)         <-  required: "assigner"
    +          [.trigger(...)]       <-  optional: "trigger" (else default 
trigger)
    +          [.evictor(...)]       <-  optional: "evictor" (else no evictor)
    +          [.allowedLateness()]  <-  optional, else zero
    +           .reduce/fold/apply() <-  required: "function"
     
    -For a windowed transformation you must at least specify a *key*
    -(see [specifying keys]({{ site.baseurl 
}}/dev/api_concepts.html#specifying-keys)),
    -a *window assigner* and a *window function*. The *key* divides the 
infinite, non-keyed, stream
    -into logical keyed streams while the *window assigner* assigns elements to 
finite per-key windows.
    -Finally, the *window function* is used to process the elements of each 
window.
    +In the above, the commands in square brackets ([...]) are optional. This 
reveals that Flink allows you to customize your 
    +windowing logic in many different ways so that it best fits your needs. 
     
    -The basic structure of a windowed transformation is thus as follows:
    +* This will be replaced by the TOC
    +{:toc}
     
    -<div class="codetabs" markdown="1">
    -<div data-lang="java" markdown="1">
    -{% highlight java %}
    -DataStream<T> input = ...;
    +## Window Lifecycle
     
    -input
    -    .keyBy(<key selector>)
    -    .window(<window assigner>)
    -    .<windowed transformation>(<window function>);
    -{% endhighlight %}
    -</div>
    +In a nutshell, a window is **created** as soon as the first element that 
should belong to this window arrives, and the  
    +window is **completely removed** when the time (event or processing time) 
passes its end timestamp plus the user-specified 
    +`allowed lateness` (see [Allowed Lateness](#allowed-lateness)). Flink 
guarantees removal only for time-based 
    +windows and not for other types, *e.g.* global windows (see [Window 
Assigners](#window-assigners)). For example, with an 
    +event-time-based windowing strategy that creates non-overlapping (or 
tumbling) windows every 5 minutes and has an allowed 
    +lateness of 1 min, Flink will create a new window for the interval between 
`12:00` and `12:05` when the first element with 
    +a timestamp that falls into this interval arrives, and it will remove it 
when the watermark passes the `12:06`
    +timestamp. 
     
    -<div data-lang="scala" markdown="1">
    -{% highlight scala %}
    -val input: DataStream[T] = ...
    +In addition, each window will have a `Trigger` (see [Triggers](#triggers)) 
and a function (`WindowFunction`, `ReduceFunction` or 
    +`FoldFunction`) (see [Window Functions](#window-functions)) attached to 
it. The function will contain the computation to 
    +be applied to the contents of the window, while the `Trigger` specifies 
the conditions under which the window is 
    +considered ready for the function to be applied. A triggering policy might 
be something like "when the number of elements 
    +in the window is more than 4", or "when the watermark passes the end of 
the window". A trigger can also decide to 
    +purge a window's contents any time between its creation and removal. 
Purging in this case only refers to the elements 
    +in the window, and *not* the window metadata. This means that new data can 
still be added to that window.
     
    -input
    -    .keyBy(<key selector>)
    -    .window(<window assigner>)
    -    .<windowed transformation>(<window function>)
    -{% endhighlight %}
    -</div>
    -</div>
    +Apart from the above, you can specify an `Evictor` (see 
[Evictors](#evictors)) which will be able to remove  
    +elements from the window after the trigger fires and before and/or after 
the function is applied.
     
    -We will cover [window assigners](#window-assigners) in a separate section 
below.
    +In the following we go into more detail for each of the components above. 
We start with the required parts in the above 
    +snippet (see [Keyed vs Non-Keyed Windows](#keyed-vs-non-keyed-windows), 
[Window Assigner](#window-assigner), and 
    +[Window Function](#window-function)) before moving to the optional ones.
     
    -The window transformation can be one of `reduce()`, `fold()` or `apply()`. 
Which respectively
    -takes a `ReduceFunction`, `FoldFunction` or `WindowFunction`. We describe 
each of these ways
    -of specifying a windowed transformation in detail below: [window 
functions](#window-functions).
    +## Keyed vs Non-Keyed Windows
     
    -For more advanced use cases you can also specify a `Trigger` that 
determines when exactly a window
    -is being considered as *ready for processing*. These will be covered in 
more detail in
    -[triggers](#triggers).
    +The first thing to specify is whether your stream should be keyed or not. 
This has to be done before defining the window. 
    +Using the `keyBy(...)` will split your infinite stream into logical keyed 
streams. If `keyBy(...)` is not called, your 
    +stream is not keyed.
     
    -## Window Assigners
    +In the case of keyed streams, any attribute of your incoming events can be 
used as a key 
    +(more details [here]({{ site.baseurl 
}}/dev/api_concepts.html#specifying-keys)). Having a keyed stream will 
    +allow your windowed computation to be performed in parallel by multiple 
tasks, as each logical keyed stream can be processed 
    +independently from the rest. All elements referring to the same key will 
be sent to the same parallel task. 
     
    -The window assigner specifies how elements of the stream are divided into 
finite slices. Flink comes
    -with pre-implemented window assigners for the most typical use cases, 
namely *tumbling windows*,
    -*sliding windows*, *session windows* and *global windows*, but you can 
implement your own by
    -extending the `WindowAssigner` class. All the built-in window assigners, 
except for the global
    -windows one, assign elements to windows based on time, which can either be 
processing time or event
    -time. Please take a look at our section on [event time]({{ site.baseurl 
}}/dev/event_time.html) for more
    -information about how Flink deals with time.
    +In case of non-keyed streams, your original stream will not be split into 
multiple logical streams and all the windowing logic 
    +will be performed by a single task, *i.e.* with parallelism of 1.
     
    -Let's first look at how each of these window assigners works before 
looking at how they can be used
    -in a Flink program. We will be using abstract figures to visualize the 
workings of each assigner:
    -in the following, the purple circles are elements of the stream, they are 
partitioned
    -by some key (in this case *user 1*, *user 2* and *user 3*) and the x-axis 
shows the progress
    -of time.
    +## Window Assigners
     
    -### Global Windows
    +After specifying whether your stream is keyed or not, the next step is to 
define a *windowing strategy*. 
    --- End diff --
    
    I think we should stick to `window assigner` here because that's what we're 
describing. In my mind, the ensemble of window assigner, trigger (and evictor) 
is actually the `windowing strategy` since only those together define what 
happens in the end.
    
    What do you think?



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request #3191: [FLINK-5529] [FLINK-4752] [docs] Improve / extends...

Reply via email to