GitHub user tammymendt opened a pull request:
https://github.com/apache/flink/pull/605
[FLINK-1297] Added OperatorStatsAccumulator for tracking operator-related
stats
The accumulator tracks min and max values, and estimates for count distinct
and heavy hitters.
The count distinct algorithms are Linear Counting and HyperLogLog, both
provided by an imported library (clearspring).
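
For readers unfamiliar with Linear Counting, the idea can be sketched in a few lines. This is only an illustration under stated assumptions: the class name `LinearCountingSketch`, its methods, and the hash mixing step are hypothetical and are not the clearspring implementation used by this pull request. Each item is hashed to one bit of a fixed-size bitmap, and the number of distinct items is estimated from the fraction of bits that are still zero.

```java
import java.util.BitSet;

/** Illustrative Linear Counting sketch (hypothetical; not the clearspring class). */
final class LinearCountingSketch {
    private final int numBits;
    private final BitSet bitmap;

    LinearCountingSketch(int numBits) {
        if (numBits <= 0) {
            throw new IllegalArgumentException("numBits must be positive");
        }
        this.numBits = numBits;
        this.bitmap = new BitSet(numBits);
    }

    /** Hashes the item to a single bit position and sets that bit. */
    void offer(Object item) {
        int h = mix(item.hashCode());
        bitmap.set(Math.floorMod(h, numBits));
    }

    /** Estimates the distinct count as m * ln(m / zeros), where zeros is the number of unset bits. */
    double cardinality() {
        int zeros = numBits - bitmap.cardinality();
        if (zeros == 0) {
            // The bitmap is saturated, so the estimator is undefined.
            return Double.POSITIVE_INFINITY;
        }
        return numBits * Math.log((double) numBits / zeros);
    }

    /** Merges another sketch of the same size by OR-ing the bitmaps. */
    void merge(LinearCountingSketch other) {
        if (other.numBits != numBits) {
            throw new IllegalArgumentException("sketch sizes differ");
        }
        bitmap.or(other.bitmap);
    }

    /** Bit mixer (murmur3 finalizer) to spread weak hashCode values; an assumption of this sketch. */
    private static int mix(int h) {
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }
}
```

Because merging is a bitwise OR, sketches built independently on separate subtasks combine into the same bitmap that a single sketch over all the data would have produced.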
The heavy hitters algorithms are Lossy Counting (Manku et al. 2002) and
Count-Min Sketch (Cormode 2005).
The heavy hitters algorithms are implemented in the statistics package in
flink-core.
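
As a rough illustration of the Lossy Counting technique named above, here is a minimal sketch. It is not the implementation added to the statistics package in this pull request. The class name `LossyCountingSketch`, its method names, and the array-based entry layout are assumptions made for the example. The stream is divided into buckets of width ceil(1/epsilon), each tracked item stores a count and a maximum undercount, and low-frequency entries are pruned at every bucket boundary.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative Lossy Counting sketch (hypothetical; not the class from this pull request). */
final class LossyCountingSketch<T> {
    private final double epsilon;
    private final long bucketWidth;
    // Each value holds {observed count, maximum possible undercount}.
    private final Map<T, long[]> entries = new HashMap<>();
    private long processed;

    LossyCountingSketch(double epsilon) {
        if (!(epsilon > 0.0 && epsilon < 1.0)) {
            throw new IllegalArgumentException("epsilon must be in (0, 1)");
        }
        this.epsilon = epsilon;
        this.bucketWidth = (long) Math.ceil(1.0 / epsilon);
    }

    void offer(T item) {
        processed++;
        // Current bucket id is ceil(N / w).
        long currentBucket = (processed + bucketWidth - 1) / bucketWidth;
        long[] entry = entries.get(item);
        if (entry != null) {
            entry[0]++;
        } else {
            // A new item may have been missed in all earlier buckets.
            entries.put(item, new long[] {1L, currentBucket - 1});
        }
        if (processed % bucketWidth == 0) {
            // At a bucket boundary, drop entries that cannot be frequent.
            entries.values().removeIf(e -> e[0] + e[1] <= currentBucket);
        }
    }

    /** Returns items whose observed count is at least (support - epsilon) * N. */
    Map<T, Long> heavyHitters(double support) {
        double threshold = (support - epsilon) * processed;
        Map<T, Long> result = new HashMap<>();
        for (Map.Entry<T, long[]> e : entries.entrySet()) {
            if (e.getValue()[0] >= threshold) {
                result.put(e.getKey(), e.getValue()[0]);
            }
        }
        return result;
    }
}
```

The reported counts undercount the true frequency by at most epsilon * N, which is why the query threshold is lowered by epsilon.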
The accumulator currently uses Linear Counting by default for count
distinct and Lossy Counting by default for heavy hitters.
Unlike the other accumulators, this one does not track only the globally
merged value. It also tracks an array of local statistics, one per subtask of
a task. It does this through an additional class called
OperatorStatisticsResult, which holds both the local and the global
accumulated results. The goal is to make statistics about the data processed
in each subtask available, so that they can be used to reason about
partitioning strategies.
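
To make the local-plus-global idea concrete, here is a small sketch restricted to min and max. It is a hypothetical illustration only: `LocalAndGlobalStats`, `MinMax`, and `addSubtaskResult` are names invented for this example and do not reflect the actual OperatorStatisticsResult class or its API.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical holder of per-subtask and merged statistics (not OperatorStatisticsResult). */
final class LocalAndGlobalStats {

    /** Min/max summary of the values seen by one subtask. */
    static final class MinMax {
        long min = Long.MAX_VALUE;
        long max = Long.MIN_VALUE;

        void add(long value) {
            min = Math.min(min, value);
            max = Math.max(max, value);
        }

        void merge(MinMax other) {
            min = Math.min(min, other.min);
            max = Math.max(max, other.max);
        }
    }

    /** One entry per subtask, kept in the order the results arrive. */
    final List<MinMax> local = new ArrayList<>();

    /** The merged value across all subtasks. */
    final MinMax global = new MinMax();

    /** Records a copy of the subtask's statistics and folds them into the global value. */
    void addSubtaskResult(MinMax subtask) {
        MinMax copy = new MinMax();
        copy.merge(subtask);
        local.add(copy);
        global.merge(subtask);
    }
}
```

With the local entries preserved, one can see, for example, that one subtask processed values in a much wider range than another, which the merged global value alone would hide.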
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/tammymendt/flink FLINK-1297-v2
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/flink/pull/605.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #605
----
commit f365ccd92b513f10d0ba2d1a84b210d36060947c
Author: Tamara Mendt <[email protected]>
Date: 2015-04-16T09:25:16Z
[FLINK-1297] Added an accumulator called OperatorStatsAccumulator capable
of tracking min, max and estimates for count distinct and heavy hitters.
The count distinct algorithms are Linear Counting and HyperLogLog, both
from an imported library from clearspring.
The heavy hitters algorithms are Lossy Counting (Manku et al. 2002) and one
based on Count-Min Sketch (Cormode 2005).
The heavy hitters algorithms are implemented in the statistics package in
flink-core.
The accumulator does not only track the globally merged value, but tracks
an array of local statistics which have been collected at each subtask of a
task. It does this using an extra class called OperatorStatisticsResult
----