[
https://issues.apache.org/jira/browse/STORM-7?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Rick Kellogg updated STORM-7:
-----------------------------
Component/s: storm-core
> storm.trident.operation.Aggregator: include group information in init() method
> ------------------------------------------------------------------------------
>
> Key: STORM-7
> URL: https://issues.apache.org/jira/browse/STORM-7
> Project: Apache Storm
> Issue Type: Improvement
> Components: storm-core
> Reporter: James Xu
> Priority: Minor
>
> Reported by @lorenzfischer
> To be able to share resources between different groups in a grouped
> aggregator, it would be helpful to have information about the group available
> in the init() method of the aggregator interface.
> The concrete use case is the following:
> For our project we need to count the number of unique values in a field of a
> grouped stream. We have hundreds of millions of unique values and millions of
> grouped values. For this reason, we're currently deploying the HyperLogLog
> class that has generously been made available by the people at Clearspring
> (https://github.com/clearspring/stream-lib). Naturally, we end up with
> millions of counter objects.
> The DSI-Utils library (http://dsiutils.di.unimi.it) offers a
> HyperLogLogCounterArray class that reduces the overhead incurred by this
> many HLL objects. We're struggling with the implementation in Trident,
> though, as the init(Object batchId, TridentCollector collector) method of
> the aggregator interface does not provide any information about the
> current "group" the aggregator should be initialized for.
> (This was initially posted on Google Groups:
> https://groups.google.com/forum/#!topic/storm-user/dthUfkMRNhU)
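
For illustration only, here is a minimal sketch of how a group-aware init()
could let all groups share one counter structure. Everything in it is an
assumption made for the example: GroupAwareAggregator, SharedCounterArray and
SharedDistinctCount are hypothetical names, not Storm or DSI-Utils classes.
The collector and tuple parameters of the real Trident interface are left out
so the code compiles on its own, and exact HashSets stand in for HyperLogLog
registers.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical variant of an aggregator interface: init() also receives the
// grouping key. This is NOT the Trident API, only a sketch of the request.
interface GroupAwareAggregator<T> {
    T init(Object batchId, List<Object> group);
    void aggregate(T state, Object value);
    long complete(T state);
}

// Stand-in for a shared counter array: one object holds the counters of all
// groups, and each group only keeps a small integer slot index.
final class SharedCounterArray {
    private final Map<List<Object>, Integer> slotByGroup = new HashMap<>();
    private final List<Set<Object>> slots = new ArrayList<>();

    // Returns the slot for a group, allocating a new one on first sight.
    int slotFor(List<Object> group) {
        Integer slot = slotByGroup.get(group);
        if (slot == null) {
            slot = slots.size();
            slots.add(new HashSet<>());
            slotByGroup.put(group, slot);
        }
        return slot;
    }

    void add(int slot, Object value) {
        slots.get(slot).add(value);
    }

    long count(int slot) {
        return slots.get(slot).size();
    }

    int size() {
        return slots.size();
    }
}

// Distinct-value counter whose per-group state is just a slot index into the
// shared array. This is only possible because init() knows the group.
final class SharedDistinctCount implements GroupAwareAggregator<Integer> {
    private final SharedCounterArray counters;

    SharedDistinctCount(SharedCounterArray counters) {
        this.counters = counters;
    }

    @Override
    public Integer init(Object batchId, List<Object> group) {
        return counters.slotFor(group);
    }

    @Override
    public void aggregate(Integer slot, Object value) {
        counters.add(slot, value);
    }

    @Override
    public long complete(Integer slot) {
        return counters.count(slot);
    }
}
```

The sketch ignores batch boundaries and thread safety. With the current
init(Object batchId, TridentCollector collector) signature there is no key to
pass to slotFor(), which is the gap this issue describes.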
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)