[ https://issues.apache.org/jira/browse/SPARK-22969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16313160#comment-16313160 ]
Sean Owen commented on SPARK-22969:
-----------------------------------
Should this start as a discussion on the mailing list? It doesn't seem
clear whether there's a change here.
> aggregateByKey with aggregator compression
> ------------------------------------------
>
> Key: SPARK-22969
> URL: https://issues.apache.org/jira/browse/SPARK-22969
> Project: Spark
> Issue Type: Question
> Components: Spark Core
> Affects Versions: 2.4.0
> Reporter: zhengruifeng
> Priority: Minor
>
> I have encountered a special case in which the aggregator can be represented
> as two types:
> a) a high memory-footprint type with a fast {{update}}
> b) a compact type that must be converted to type a before calling {{update}}
> and {{merge}}.
> I wonder whether it is possible to compress the fat aggregators in
> {{aggregateByKey}} before the shuffle. How can I implement it? [~cloud_fan]
> One similar case may be:
> using {{aggregateByKey}}/{{reduceByKey}} to compute the nnz vector (number of
> non-zero values) for different keys on a large sparse dataset.
> We can use {{DenseVector}} as the aggregator to count the nnz, and then
> compress it by calling {{Vector#compressed}} before sending it over the network.
> Another similar case may be calling {{QuantileSummaries#compress}} before
> communication.
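
The nnz case described above can be modeled roughly as follows. This is a minimal pure-Python sketch of the data flow only, not Spark code: every function name here (`compress`, `decompress`, `combine_partition`, `merge_compressed`, `nnz_by_key`) is hypothetical, rows are assumed to arrive as `{index: value}` dicts, and the "shuffle" is just a dict that groups per-partition results by key. Each partition updates fat dense counters, and only the compact form crosses the partition boundary.

```python
from collections import defaultdict


def compress(dense):
    """Type a -> type b: keep only the non-zero counters as {index: count}."""
    return {i: c for i, c in enumerate(dense) if c != 0}


def decompress(sparse, dim):
    """Type b -> type a: expand {index: count} back into a dense list."""
    dense = [0] * dim
    for i, c in sparse.items():
        dense[i] = c
    return dense


def combine_partition(records, dim):
    """Map side: update dense per-key counters, then compress them."""
    acc = defaultdict(lambda: [0] * dim)
    for key, row in records:  # row is a sparse record: {index: value}
        dense = acc[key]
        for i, v in row.items():
            if v != 0:
                dense[i] += 1  # fast update on the fat representation
    # Only the compact form leaves the partition.
    return [(key, compress(dense)) for key, dense in acc.items()]


def merge_compressed(a, b, dim):
    """Reduce side: expand both inputs, merge, and compress the result."""
    merged = [x + y for x, y in zip(decompress(a, dim), decompress(b, dim))]
    return compress(merged)


def nnz_by_key(partitions, dim):
    """Simulate the whole job: per-partition combine, shuffle, final merge."""
    shuffled = defaultdict(list)
    for part in partitions:
        for key, compact in combine_partition(part, dim):
            shuffled[key].append(compact)
    result = {}
    for key, compacts in shuffled.items():
        merged = compacts[0]
        for c in compacts[1:]:
            merged = merge_compressed(merged, c, dim)
        result[key] = decompress(merged, dim)
    return result
```

In Spark itself, one possible arrangement (an assumption, not something confirmed in this thread) would be to do the map-side step inside `mapPartitions` and then merge the compact values with `reduceByKey`, since `aggregateByKey` keeps a single aggregator type on both sides of the shuffle.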