[
https://issues.apache.org/jira/browse/FLINK-5529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15838208#comment-15838208
]
ASF GitHub Bot commented on FLINK-5529:
---------------------------------------
Github user aljoscha commented on a diff in the pull request:
https://github.com/apache/flink/pull/3191#discussion_r97833288
--- Diff: docs/dev/windows.md ---
@@ -758,8 +831,33 @@ input
val input: DataStream[T] = ...
input
- .windowAll(<window assigner>)
+ .keyBy(<key selector>)
+ .window(<window assigner>)
+ .allowedLateness(<time>)
.<windowed transformation>(<window function>)
{% endhighlight %}
</div>
</div>
+
+<span class="label label-info">Note</span> When using the `GlobalWindows`
window assigner no
+data is ever considered late because the end timestamp of the global
window is `Long.MAX_VALUE`.
+
+### Late elements considerations
+
+When specifying an allowed lateness greater than 0, the window along with
its content is kept after the watermark passes
+the end of the window. In these cases, when a late but not dropped element
arrives, it will trigger another firing for the
+window. These firings are called `late firings`, as they are triggered by
late events and in contrast to the `main firing`
+which is the first firing of the window. In case of session windows, late
firings can further lead to merging of windows,
+as they may "bridge" the gap between two pre-existing, unmerged windows.
+
+<span class="label label-info">Attention</span> You should be aware that
the elements emitted by a late firing should be treated as updated results of a
previous computation, i.e., your data stream will contain multiple results for
the same computation. Depending on your application, you need to take these
duplicated results into account or deduplicate them.
+
+## Useful state size considerations
+
+Windows can be defined over long periods of time (such as days, weeks, or
months) and therefore accumulate very large state. There are a couple of rules
to keep in mind when estimating the storage requirements of your windowing
computation:
+
+1. Flink creates one copy of each element per window to which it belongs.
Given this, tumbling windows keep one copy of each element (an element belongs
to exactly window unless it is dropped late). In contrast, sliding windows
create several of each element, as explained in the [Window
Assigners](#window-assigners) section. Hence, a sliding window of size 1 day
and slide 1 second might not be a good idea.
+
+2. `FoldFunction` and `ReduceFunction` can significantly reduce the
storage requirements, as they eagerly aggregate elements and store only one
value per window. In contrast a `WindowFunction` must accumulate all elements.
--- End diff --
In contrast, just using a `WindowFunction` requires accumulating all
elements.
> Improve / extends windowing documentation
> -----------------------------------------
>
> Key: FLINK-5529
> URL: https://issues.apache.org/jira/browse/FLINK-5529
> Project: Flink
> Issue Type: Sub-task
> Components: Documentation
> Reporter: Stephan Ewen
> Assignee: Kostas Kloudas
> Fix For: 1.2.0, 1.3.0
>
>
> Suggested Outline:
> {code}
> Windows
> (0) Outline: The anatomy of a window operation
> stream
> [.keyBy(...)] <- keyed versus non-keyed windows
> .window(...) <- required: "assigner"
> [.trigger(...)] <- optional: "trigger" (else default trigger)
> [.evictor(...)] <- optional: "evictor" (else no evictor)
> [.allowedLateness()] <- optional, else zero
> .reduce/fold/apply() <- required: "function"
> (1) Types of windows
> - tumble
> - slide
> - session
> - global
> (2) Pre-defined windows
> timeWindow() (tumble, slide)
> countWindow() (tumble, slide)
> - mention that count windows are inherently
> resource leaky unless limited key space
> (3) Window Functions
> - apply: most basic, iterates over elements in window
>
> - aggregating: reduce and fold, can be used with "apply()" which will get
> one element
>
> - forward reference to state size section
> (4) Advanced Windows
> - assigner
> - simple
> - merging
> - trigger
> - registering timers (processing time, event time)
> - state in triggers
> - life cycle of a window
> - create
> - state
> - cleanup
> - when is window contents purged
> - when is state dropped
> - when is metadata (like merging set) dropped
> (5) Late data
> - picture
> - fire vs fire_and_purge: late accumulates vs late resurrects (cf
> discarding mode)
>
> (6) Evictors
> - TDB
>
> (7) State size: How large will the state be?
> Basic rule: Each element has one copy per window it is assigned to
> --> num windows * num elements in window
> --> example: tumbline is one copy, sliding(n,m) is n/m copies
> --> per key
> Pre-aggregation:
> - if reduce or fold is set -> one element per window (rather than num
> elements in window)
> - evictor voids pre-aggregation from the perspective of state
> Special rules:
> - fold cannot pre-aggregate on session windows (and other merging windows)
> (8) Non-keyed windows
> - all elements through the same windows
> - currently not parallel
> - possible parallel in the future when having pre-aggregation functions
> - inherently (by definition) produce a result stream with parallelism one
> - state similar to one key of keyed windows
> {code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)