GitHub user kl0u opened a pull request:
https://github.com/apache/flink/pull/1819
Flink-3429: adds a histogram-based watermark emitter.
When processing a stream, and for the lateness to be as low as possible,
the watermark should be as close to processing time as possible. Unfortunately,
in streams with late data, i.e. data with timestamps smaller than the last
received watermark, this means that a number of elements will be discarded due
to lateness. In order to avoid this, this extractor periodically samples the
timestamps of the elements in a stream and keeps a histogram of the observed
lateness. Based on this histogram, it sets the watermark lateness to the lowest
possible value that, at the same time, guarantees that a user-specified
percentage of the elements in the stream are covered and not dropped due to
lateness.
More precisely, the user specifies i) the duration of the sampling period,
ii) that of the interval between the end of a sampling period and the start of
the next one, and iii) the percentage referring of elements in the stream (late
and non-late) that she wants to be covered, i.e. considered non-late. Given
this information, during the sampling period the extractor keeps a per-second
lateness histogram, i.e. a histogram showing how may elements were 0, 1, 2...
seconds late, and the maximum (event-time) timestamp seen so far. When the
sampling period ends, it computes the minimum lateness that covers the
user-specified percentage of data, and whenever a watermark is emitted, its
timestamp is the maximum (event-time) timestamp seen up to that point in the
stream, minus the previously computed value. This value is not updated till the
end of the next sampling period.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/kl0u/flink flink-3429
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/flink/pull/1819.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #1819
----
commit d2cf494776b6da2de32e136b31e0b6b7b981ff10
Author: kl0u <[email protected]>
Date: 2016-03-10T09:37:28Z
Adds a AssignerWithPunctuatedWatermarks which keeps a histogram with stats
about arrivals in processing time.
commit 30e01612b080df0fc80603aa1f8b19da7fc2b533
Author: kl0u <[email protected]>
Date: 2016-03-15T17:39:05Z
A first version of the assigner.
commit dbfcec1c29613c5d311bc574b1d945e92b38b19e
Author: kl0u <[email protected]>
Date: 2016-03-16T17:00:18Z
Made the histogram-based watermark emitter an
AssignerWithPeriodicWatermarks.
commit 1b7a0c6d959df84dc87dcaaca5119dd12aeb996b
Author: kl0u <[email protected]>
Date: 2016-03-16T18:04:54Z
Adding the tests.
commit 98f92ab8ca8df4c555079bfd34f5aca510739976
Author: kl0u <[email protected]>
Date: 2016-03-18T15:38:16Z
Made the exctractor an AbstractRichFunction.
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---