GitHub user kl0u opened a pull request:

    https://github.com/apache/flink/pull/1819

    Flink-3429: adds a histogram-based watermark emitter.

    To keep the watermark lateness as low as possible when processing a stream, 
the watermark should be as close to processing time as possible. Unfortunately, 
in streams with late data, i.e. data with timestamps smaller than the last 
received watermark, this means that a number of elements will be discarded due 
to lateness. To avoid this, this extractor periodically samples the timestamps 
of the elements in the stream and keeps a histogram of the observed lateness. 
Based on this histogram, it sets the watermark lateness to the lowest value 
that still guarantees that a user-specified percentage of the elements in the 
stream is covered, i.e. not dropped due to lateness.
    
    More precisely, the user specifies i) the duration of the sampling period, 
ii) the duration of the interval between the end of one sampling period and the 
start of the next, and iii) the percentage of elements in the stream (late and 
non-late) that she wants to be covered, i.e. considered non-late. Given this 
information, during the sampling period the extractor keeps a per-second 
lateness histogram, i.e. a histogram showing how many elements were 0, 1, 2... 
seconds late, and the maximum (event-time) timestamp seen so far. When the 
sampling period ends, it computes the minimum lateness that covers the 
user-specified percentage of data. Whenever a watermark is emitted, its 
timestamp is the maximum (event-time) timestamp seen up to that point in the 
stream, minus the previously computed lateness. This lateness value is not 
updated until the end of the next sampling period.
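
    As a rough illustration of the computation described above, here is a 
minimal, Flink-independent sketch. It is not the code in this pull request: the 
class name LatenessHistogramSketch and all of its methods are hypothetical, 
lateness is assumed to be measured against the maximum timestamp seen so far, 
and lateness is rounded up to whole seconds so that the chosen value really 
covers every element counted in its bucket. The scheduling of sampling periods 
is left to the caller.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch, not the class added by this pull request.
class LatenessHistogramSketch {

    // seconds of lateness (rounded up) -> number of sampled elements
    private final TreeMap<Long, Long> histogram = new TreeMap<>();
    private final int coveragePercent;       // user-specified, 1..100
    private long maxTimestamp = Long.MIN_VALUE;
    private long sampleCount = 0;
    private long allowedLatenessMillis = 0;  // fixed between sampling periods

    LatenessHistogramSketch(int coveragePercent) {
        if (coveragePercent < 1 || coveragePercent > 100) {
            throw new IllegalArgumentException("coveragePercent must be in [1, 100]");
        }
        this.coveragePercent = coveragePercent;
    }

    // Called for every element; 'sampling' says whether a sampling period is active.
    void onElement(long timestampMillis, boolean sampling) {
        if (timestampMillis > maxTimestamp) {
            maxTimestamp = timestampMillis;
        }
        if (sampling) {
            // Assumption: lateness is relative to the max timestamp seen so far.
            long latenessMillis = maxTimestamp - timestampMillis;
            long secondsLate = (latenessMillis + 999L) / 1000L; // round up
            histogram.merge(secondsLate, 1L, Long::sum);
            sampleCount++;
        }
    }

    // Called when a sampling period ends: pick the smallest lateness that
    // covers the requested percentage of sampled elements, then reset.
    void endSamplingPeriod() {
        if (sampleCount == 0) {
            return; // nothing sampled, keep the previous value
        }
        long needed = (coveragePercent * sampleCount + 99L) / 100L; // ceiling
        long covered = 0;
        for (Map.Entry<Long, Long> bucket : histogram.entrySet()) {
            covered += bucket.getValue();
            if (covered >= needed) {
                allowedLatenessMillis = bucket.getKey() * 1000L;
                break;
            }
        }
        histogram.clear();
        sampleCount = 0;
    }

    long getAllowedLatenessMillis() {
        return allowedLatenessMillis;
    }

    // Watermark = max event-time timestamp seen so far minus the computed lateness.
    long currentWatermark() {
        return maxTimestamp == Long.MIN_VALUE
                ? Long.MIN_VALUE
                : maxTimestamp - allowedLatenessMillis;
    }
}
```

    In this sketch, sampling ten elements of which eight are on time, one is 
1 second late and one is 5 seconds late, with a coverage of 90%, yields a 
lateness of 1 second; a coverage of 100% would yield 5 seconds.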

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/kl0u/flink flink-3429

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/1819.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1819
    
----
commit d2cf494776b6da2de32e136b31e0b6b7b981ff10
Author: kl0u <[email protected]>
Date:   2016-03-10T09:37:28Z

    Adds an AssignerWithPunctuatedWatermarks which keeps a histogram with stats 
about arrivals in processing time.

commit 30e01612b080df0fc80603aa1f8b19da7fc2b533
Author: kl0u <[email protected]>
Date:   2016-03-15T17:39:05Z

    A first version of the assigner.

commit dbfcec1c29613c5d311bc574b1d945e92b38b19e
Author: kl0u <[email protected]>
Date:   2016-03-16T17:00:18Z

    Made the histogram-based watermark emitter an 
AssignerWithPeriodicWatermarks.

commit 1b7a0c6d959df84dc87dcaaca5119dd12aeb996b
Author: kl0u <[email protected]>
Date:   2016-03-16T18:04:54Z

    Adding the tests.

commit 98f92ab8ca8df4c555079bfd34f5aca510739976
Author: kl0u <[email protected]>
Date:   2016-03-18T15:38:16Z

    Made the extractor an AbstractRichFunction.

----

