[ https://issues.apache.org/jira/browse/FLINK-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14736617#comment-14736617 ]
ASF GitHub Bot commented on FLINK-1297:
---------------------------------------
Github user mxm commented on the pull request:
https://github.com/apache/flink/pull/605#issuecomment-138866235
I tried again, this works:
```java
@Override
public OperatorStatistics clone() {
    OperatorStatistics clone = new OperatorStatistics(config);
    clone.min = min;
    clone.max = max;
    clone.cardinality = cardinality;

    // Copy the count-distinct sketch by merging it into a fresh, empty instance.
    try {
        ICardinality copy;
        if (countDistinct instanceof LinearCounting) {
            copy = new LinearCounting(config.getCountDbitmap());
        } else if (countDistinct instanceof HyperLogLog) {
            copy = new HyperLogLog(config.getCountDlog2m());
        } else {
            throw new IllegalStateException("Unsupported counter.");
        }
        clone.countDistinct = copy.merge(countDistinct);
    } catch (CardinalityMergeException e) {
        // Keep the cause so the original merge failure is not lost.
        throw new RuntimeException("Failed to clone OperatorStatistics!", e);
    }

    // Copy the heavy-hitter sketch the same way.
    try {
        HeavyHitter copy;
        if (heavyHitter instanceof LossyCounting) {
            copy = new LossyCounting(
                    config.getHeavyHitterFraction(),
                    config.getHeavyHitterError());
        } else if (heavyHitter instanceof CountMinHeavyHitter) {
            copy = new CountMinHeavyHitter(
                    config.getHeavyHitterFraction(),
                    config.getHeavyHitterError(),
                    config.getHeavyHitterConfidence(),
                    config.getHeavyHitterSeed());
        } else {
            throw new IllegalStateException("Unsupported counter.");
        }
        copy.merge(heavyHitter);
        clone.heavyHitter = copy;
    } catch (HeavyHitterMergeException e) {
        throw new RuntimeException("Failed to clone OperatorStatistics!", e);
    }
    return clone;
}
```
Do you think we could merge your pull request with this change?
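For reference, the clone-by-merge idea used above can be shown in isolation. This is only a minimal sketch: `BitmapSketch` is a hypothetical toy class made up for illustration (it is not the stream-lib or Flink API), and it assumes, as in the `countDistinct` branch above, that `merge` returns a new object and leaves its inputs untouched.

```java
import java.util.BitSet;

/** Hypothetical toy sketch, used only to illustrate cloning via merge. */
final class BitmapSketch {
    private final int bits;
    private final BitSet bitmap;

    BitmapSketch(int bits) {
        this.bits = bits;
        this.bitmap = new BitSet(bits);
    }

    /** Marks the bucket that the element hashes to. */
    void offer(Object element) {
        bitmap.set(Math.floorMod(element.hashCode(), bits));
    }

    /** Returns a new sketch holding the union of both bitmaps; neither input is modified. */
    BitmapSketch merge(BitmapSketch other) {
        if (other.bits != bits) {
            throw new IllegalArgumentException("Incompatible sketch sizes");
        }
        BitmapSketch result = new BitmapSketch(bits);
        result.bitmap.or(this.bitmap);
        result.bitmap.or(other.bitmap);
        return result;
    }

    /** Clones by merging this sketch into a fresh, empty one with the same configuration. */
    BitmapSketch copy() {
        return new BitmapSketch(bits).merge(this);
    }

    /** Number of buckets currently set. */
    int setBits() {
        return bitmap.cardinality();
    }
}
```

The point of the pattern is that the copy shares no mutable state with the original, so offering new elements to one does not affect the other.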
> Add support for tracking statistics of intermediate results
> -----------------------------------------------------------
>
> Key: FLINK-1297
> URL: https://issues.apache.org/jira/browse/FLINK-1297
> Project: Flink
> Issue Type: Improvement
> Components: Distributed Runtime
> Reporter: Alexander Alexandrov
> Assignee: Alexander Alexandrov
> Fix For: 0.10
>
> Original Estimate: 1,008h
> Remaining Estimate: 1,008h
>
> One of the major problems related to the optimizer at the moment is the lack
> of proper statistics.
> With the introduction of staged execution, it is possible to instrument the
> runtime code with a statistics facility that collects the required
> information for optimizing the next execution stage.
> I would therefore like to contribute code that can be used to gather basic
> statistics for the (intermediate) result of dataflows (e.g. min, max, count,
> count distinct) and make them available to the job manager.
> Before I start, I would like to hear some feedback from the other users.
> In particular, to handle skew (e.g. on grouping) it might be good to have
> some sort of detailed sketch about the key distribution of an intermediate
> result. I am not sure whether a simple histogram is the most effective way to
> go. Maybe somebody would propose another lightweight sketch that provides
> better accuracy.
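
To make the proposal above concrete, the basic statistics it lists (min, max, count, count distinct) could be collected per task and merged centrally roughly as follows. This is a rough sketch under stated assumptions, not the code from the pull request: `BasicStats` and its members are hypothetical names, values are assumed to be `long`, and an exact `HashSet` stands in for the approximate count-distinct sketch a real implementation would likely use.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical accumulator for the basic statistics named in the issue. */
final class BasicStats {
    long min = Long.MAX_VALUE;
    long max = Long.MIN_VALUE;
    long count = 0;
    // Exact distinct tracking; an assumption made for brevity instead of a sketch.
    private final Set<Long> distinct = new HashSet<>();

    /** Records one value of the intermediate result. */
    void add(long value) {
        min = Math.min(min, value);
        max = Math.max(max, value);
        count++;
        distinct.add(value);
    }

    /** Folds the statistics of another partition into this one. */
    void merge(BasicStats other) {
        min = Math.min(min, other.min);
        max = Math.max(max, other.max);
        count += other.count;
        distinct.addAll(other.distinct);
    }

    long countDistinct() {
        return distinct.size();
    }
}
```

Each parallel task would fill its own `BasicStats`, and the partial results would then be merged into a single instance before being reported.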