zhengruifeng created SPARK-22969:
------------------------------------
Summary: aggregateByKey with aggregator compression
Key: SPARK-22969
URL: https://issues.apache.org/jira/browse/SPARK-22969
Project: Spark
Issue Type: Question
Components: Spark Core
Affects Versions: 2.4.0
Reporter: zhengruifeng
Priority: Minor
I have encountered a case in which the aggregator can be represented in two forms:
a) a high memory-footprint form with a fast {{update}};
b) a compact form, which must be converted back to form a) before calling {{update}} and {{merge}}.
I wonder whether it is possible to compress the fat aggregators in {{aggregateByKey}} before the shuffle. How could I implement this? [~cloud_fan]
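For concreteness, here is a minimal sketch of the two representations. All names here ({{FastAgg}}, {{CompactAgg}}) are hypothetical, not an existing Spark API; the underlying issue is that {{aggregateByKey}} works with a single aggregator type {{U}} ({{seqOp: (U, V) => U}}, {{combOp: (U, U) => U}}), so there is no hook to switch to the compact form at the end of a map task:
{code:scala}
// Hypothetical sketch: one logical aggregator, two physical representations.
final class FastAgg(val counts: Array[Double]) {
  // Representation a): fat, but supports fast in-place update.
  def update(i: Int): FastAgg = { counts(i) += 1.0; this }
  def merge(other: FastAgg): FastAgg = {
    var i = 0
    while (i < counts.length) { counts(i) += other.counts(i); i += 1 }
    this
  }
  // Drop the zero slots before the aggregator goes over the network.
  def compress: CompactAgg = {
    val idx = counts.indices.filter(i => counts(i) != 0.0).toArray
    new CompactAgg(counts.length, idx, idx.map(i => counts(i)))
  }
}
// Representation b): compact, but not directly updatable.
final class CompactAgg(val size: Int, val indices: Array[Int], val values: Array[Double]) {
  def toFast: FastAgg = {
    val arr = new Array[Double](size)
    var k = 0
    while (k < indices.length) { arr(indices(k)) = values(k); k += 1 }
    new FastAgg(arr)
  }
}
{code}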
One similar case: using {{aggregateByKey}}/{{reduceByKey}} to compute the nnz vector (the number of non-zero values per column) for each key on a large sparse dataset.
We can use a {{DenseVector}} as the aggregator to count the nnz, and then compress it by calling {{Vector#compressed}} before sending it over the network; see the sketch below.
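A minimal sketch, assuming an input of {{RDD[(String, SparseVector)]}} with dimension {{dim}} (both placeholders), using {{org.apache.spark.ml.linalg}}:
{code:scala}
import org.apache.spark.ml.linalg.{SparseVector, Vectors}
import org.apache.spark.rdd.RDD

def nnzPerKey(data: RDD[(String, SparseVector)], dim: Int) = {
  data.aggregateByKey(Vectors.zeros(dim).toDense)(
    // seqOp: bump the counter of every explicitly non-zero slot.
    (agg, sv) => {
      var k = 0
      while (k < sv.indices.length) {
        if (sv.values(k) != 0.0) agg.values(sv.indices(k)) += 1.0
        k += 1
      }
      agg
    },
    // combOp: element-wise sum of two dense counters.
    (a, b) => {
      var i = 0
      while (i < a.size) { a.values(i) += b.values(i); i += 1 }
      a
    }
  ).mapValues(_.compressed) // compression happens only after the shuffle
}
{code}
The final {{mapValues(_.compressed)}} only shrinks the result; the per-key {{DenseVector}} aggregators still cross the network in their fat form, which is exactly what this ticket asks to avoid.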
Another similar case is calling {{QuantileSummaries#compress}} before communication. A manual workaround along these lines is sketched below.
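One possible workaround, under the same hypothetical {{RDD[(String, SparseVector)]}} setup as above (this gives up {{aggregateByKey}}'s built-in map-side combine and is not an existing Spark facility): aggregate each partition by hand in the fast dense form, emit compressed aggregators into the shuffle, and merge the compact forms with {{reduceByKey}}:
{code:scala}
import scala.collection.mutable
import org.apache.spark.ml.linalg.{DenseVector, SparseVector, Vector}
import org.apache.spark.rdd.RDD

def nnzCompressed(data: RDD[(String, SparseVector)], dim: Int): RDD[(String, Vector)] = {
  data.mapPartitions { iter =>
    // Per-partition map-side aggregation in the fast dense form.
    val aggs = mutable.HashMap.empty[String, Array[Double]]
    iter.foreach { case (key, sv) =>
      val counts = aggs.getOrElseUpdate(key, new Array[Double](dim))
      var k = 0
      while (k < sv.indices.length) {
        if (sv.values(k) != 0.0) counts(sv.indices(k)) += 1.0
        k += 1
      }
    }
    // The compact representation is what goes over the wire.
    aggs.iterator.map { case (key, counts) => (key, new DenseVector(counts).compressed) }
  }.reduceByKey { (a, b) =>
    // Expand one side back to the fast dense form, merge, re-compress.
    val dense = a.toDense
    b match {
      case sv: SparseVector =>
        var k = 0
        while (k < sv.indices.length) { dense.values(sv.indices(k)) += sv.values(k); k += 1 }
      case dv: DenseVector =>
        var i = 0
        while (i < dv.size) { dense.values(i) += dv.values(i); i += 1 }
    }
    dense.compressed
  }
}
{code}
The trade-off is that the hand-rolled {{HashMap}} is not spillable, unlike Spark's own {{ExternalAppendOnlyMap}} used inside {{aggregateByKey}}, so this sketch only works when the per-partition aggregation state fits in memory.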