GitHub user tammymendt opened a pull request:

    https://github.com/apache/flink/pull/605

    [FLINK-1297] Added OperatorStatsAccumulator for tracking operator-related
    stats

    The accumulator tracks min and max values, as well as estimates for count
    distinct and heavy hitters.
    
    The count distinct algorithms are Linear Counting and HyperLogLog, both
    from an imported library (clearspring).
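
For readers unfamiliar with Linear Counting, the idea can be sketched roughly as follows. This is a minimal illustration, not the clearspring implementation used in the pull request: the class name `LinearCountingSketch`, its methods, and the hash mixing step are all assumptions made for this example.

```java
import java.util.BitSet;

/** Hypothetical, minimal Linear Counting sketch (not the clearspring class). */
public class LinearCountingSketch {
    private final int size;       // number of bits m in the bitmap
    private final BitSet bitmap;

    public LinearCountingSketch(int size) {
        if (size <= 0) {
            throw new IllegalArgumentException("size must be positive");
        }
        this.size = size;
        this.bitmap = new BitSet(size);
    }

    /** Hashes the element to a single bit position and sets that bit. */
    public void offer(Object element) {
        bitmap.set(Math.floorMod(mix(element.hashCode()), size));
    }

    /** Estimates n = -m * ln(V), where V is the fraction of bits still unset. */
    public long cardinality() {
        int zeros = size - bitmap.cardinality();
        if (zeros == 0) {
            throw new IllegalStateException("bitmap is saturated; use a larger size");
        }
        return Math.round(-size * Math.log((double) zeros / size));
    }

    /** Merging two sketches of equal size is a bitwise OR of their bitmaps. */
    public void merge(LinearCountingSketch other) {
        if (other.size != size) {
            throw new IllegalArgumentException("sketch sizes differ");
        }
        bitmap.or(other.bitmap);
    }

    /** Integer bit mixer (assumed here only to spread out weak hash codes). */
    private static int mix(int h) {
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }
}
```

Because merging is a plain OR, per-subtask sketches can be combined into a global one without revisiting the data, which is what makes this kind of estimator a natural fit for an accumulator.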
    
    The heavy hitters algorithms are Lossy Counting (Manku et al., 2002) and
    Count-Min Sketch (Cormode, 2005).
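
As a rough illustration of Lossy Counting, the sketch below follows the textbook description of the algorithm: the stream is cut into buckets of width ceil(1/error), and rarely seen entries are pruned at each bucket boundary. It is not the code from this pull request; the class name `LossyCountingSketch` and its method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical, minimal Lossy Counting sketch (not the class from the PR). */
public class LossyCountingSketch<T> {
    /** Observed count plus the maximum possible undercount (delta). */
    private static final class Counter {
        long count;
        final long delta;

        Counter(long count, long delta) {
            this.count = count;
            this.delta = delta;
        }
    }

    private final double error;
    private final long bucketWidth;
    private final Map<T, Counter> counters = new HashMap<>();
    private long processed = 0;

    public LossyCountingSketch(double error) {
        if (error <= 0 || error >= 1) {
            throw new IllegalArgumentException("error must be in (0, 1)");
        }
        this.error = error;
        this.bucketWidth = (long) Math.ceil(1.0 / error);
    }

    public void add(T element) {
        processed++;
        // Current bucket id is ceil(N / w).
        long currentBucket = (processed + bucketWidth - 1) / bucketWidth;
        Counter c = counters.get(element);
        if (c != null) {
            c.count++;
        } else {
            counters.put(element, new Counter(1, currentBucket - 1));
        }
        // At each bucket boundary, drop entries that cannot be frequent.
        if (processed % bucketWidth == 0) {
            counters.values().removeIf(x -> x.count + x.delta <= currentBucket);
        }
    }

    /** Returns elements whose stored count is at least (support - error) * N. */
    public Map<T, Long> heavyHitters(double support) {
        double threshold = (support - error) * processed;
        Map<T, Long> result = new HashMap<>();
        for (Map.Entry<T, Counter> e : counters.entrySet()) {
            if (e.getValue().count >= threshold) {
                result.put(e.getKey(), e.getValue().count);
            }
        }
        return result;
    }
}
```

Stored counts may undercount the true frequency by at most error * N, so the query can report some elements slightly below the requested support but should not miss any element above it.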
    
    The heavy hitters algorithms are implemented in the statistics package in
    flink-core.
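
The Count-Min Sketch named above is the underlying frequency estimator; how the pull request turns it into a heavy-hitters tracker is not shown here. The following is only a sketch of the basic structure under assumed names (`CountMinSketchExample`, `add`, `estimate`, `merge`) and an assumed seeded hash, not the flink-core implementation.

```java
import java.util.Arrays;
import java.util.Random;

/** Hypothetical, minimal Count-Min Sketch (not the flink-core class). */
public class CountMinSketchExample {
    private final int depth;      // number of hash rows
    private final int width;      // counters per row
    private final long[][] table;
    private final int[] seeds;    // one seed per row (assumed hashing scheme)

    public CountMinSketchExample(int depth, int width, long seed) {
        if (depth <= 0 || width <= 0) {
            throw new IllegalArgumentException("depth and width must be positive");
        }
        this.depth = depth;
        this.width = width;
        this.table = new long[depth][width];
        this.seeds = new int[depth];
        Random random = new Random(seed);
        for (int row = 0; row < depth; row++) {
            seeds[row] = random.nextInt();
        }
    }

    /** Maps an item to a counter index in the given row. */
    private int bucket(Object item, int row) {
        int h = item.hashCode() ^ seeds[row];
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return Math.floorMod(h, width);
    }

    /** Adds count to one counter in every row. */
    public void add(Object item, long count) {
        for (int row = 0; row < depth; row++) {
            table[row][bucket(item, row)] += count;
        }
    }

    /** The minimum over all rows never undercounts the true frequency. */
    public long estimate(Object item) {
        long min = Long.MAX_VALUE;
        for (int row = 0; row < depth; row++) {
            min = Math.min(min, table[row][bucket(item, row)]);
        }
        return min;
    }

    /** Sketches with identical shape and seeds merge by adding counters. */
    public void merge(CountMinSketchExample other) {
        if (depth != other.depth || width != other.width
                || !Arrays.equals(seeds, other.seeds)) {
            throw new IllegalArgumentException("incompatible sketches");
        }
        for (int row = 0; row < depth; row++) {
            for (int col = 0; col < width; col++) {
                table[row][col] += other.table[row][col];
            }
        }
    }
}
```

A heavy-hitters tracker built on top of this would typically also keep a small candidate set of items whose estimate exceeds a threshold; that part is omitted here.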
    
    The accumulator currently uses only the defaults: Linear Counting for
    count distinct and Lossy Counting for heavy hitters.
    
    Unlike the other accumulators, this one does not track only the globally
    merged value. It also tracks an array of local statistics, collected at
    each subtask of a task. It does this by wrapping an extra class called
    OperatorStatisticsResult, which holds both the local and the global
    accumulated results. The idea is to track statistics of the data
    processed in each subtask, so that they can be used to reason about
    partitioning strategies.
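
To make the local-plus-global idea concrete, here is a small sketch. The actual layout of OperatorStatisticsResult is not shown in this message, so everything below (`LocalGlobalStatsSketch`, `Stats`, `Result`, and the `skew` measure) is a hypothetical stand-in that tracks only min, max, and count.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of keeping per-subtask and merged statistics. */
public class LocalGlobalStatsSketch {

    /** Min, max, and count of the values seen by one subtask. */
    public static final class Stats {
        public long min = Long.MAX_VALUE;
        public long max = Long.MIN_VALUE;
        public long count = 0;

        public void process(long value) {
            min = Math.min(min, value);
            max = Math.max(max, value);
            count++;
        }

        public void merge(Stats other) {
            min = Math.min(min, other.min);
            max = Math.max(max, other.max);
            count += other.count;
        }

        public Stats copy() {
            Stats c = new Stats();
            c.merge(this);
            return c;
        }
    }

    /** Holds the globally merged value plus one entry per subtask. */
    public static final class Result {
        public final Stats global = new Stats();
        public final List<Stats> local = new ArrayList<>();

        public void addSubtask(Stats subtask) {
            local.add(subtask.copy());
            global.merge(subtask);
        }

        /** Largest subtask count divided by the mean subtask count. */
        public double skew() {
            if (local.isEmpty() || global.count == 0) {
                return 0.0;
            }
            long largest = 0;
            for (Stats s : local) {
                largest = Math.max(largest, s.count);
            }
            return largest / ((double) global.count / local.size());
        }
    }
}
```

A value of `skew()` well above 1 would suggest that one subtask received far more records than the average, which is the kind of signal that could feed into a partitioning decision.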

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tammymendt/flink FLINK-1297-v2

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/605.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #605
    
----
commit f365ccd92b513f10d0ba2d1a84b210d36060947c
Author: Tamara Mendt <tammyme...@gmail.com>
Date:   2015-04-16T09:25:16Z

    [FLINK-1297] Added an accumulator called OperatorStatsAccumulator capable
    of tracking min, max and estimates for count distinct and heavy hitters.

    The count distinct algorithms are Linear Counting and HyperLogLog, both
    from an imported library (clearspring).

    The heavy hitters algorithms are Lossy Counting (Manku et al., 2002) and
    one based on Count-Min Sketch (Cormode, 2005).

    The heavy hitters algorithms are implemented in the statistics package in
    flink-core.

    The accumulator does not only track the globally merged value, but also
    tracks an array of local statistics collected at each subtask of a task.
    It does this using an extra class called OperatorStatisticsResult.

----

