zhengruifeng created SPARK-22969:
------------------------------------

             Summary: aggregateByKey with aggregator compression
                 Key: SPARK-22969
                 URL: https://issues.apache.org/jira/browse/SPARK-22969
             Project: Spark
          Issue Type: Question
          Components: Spark Core
    Affects Versions: 2.4.0
            Reporter: zhengruifeng
            Priority: Minor

I have encountered a special case in which the aggregator can be represented as two types:
a) a type with a high memory footprint but a fast {{update}};
b) a compact type that must be converted to type a) before calling {{update}} or {{merge}}.

Is it possible to compress the fat aggregators in {{aggregateByKey}} before the shuffle, and if so, how can I implement it? [~cloud_fan]

One similar case may be using {{aggregateByKey}}/{{reduceByKey}} to compute the nnz vector (number of non-zero values) for different keys on a large sparse dataset. We can use a {{DenseVector}} as the aggregator to count the nnz, and then compress it by calling {{Vector#compressed}} before sending it over the network.

Another similar case may be calling {{QuantileSummaries#compress}} before communication.
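
For illustration, the nnz case could be approximated today without changing {{aggregateByKey}} itself. The sketch below is only one possible workaround, not an existing Spark feature: it swaps {{aggregateByKey}} for a manual map-side pass in {{mapPartitions}} (fat {{Array[Double]}} counters, fast update) followed by {{reduceByKey}} on the compressed vectors. The names {{CompressedNnz}}, {{update}}, {{compress}}, {{merge}}, {{nnzByKey}} and the {{dim}} parameter are hypothetical, and it assumes the {{org.apache.spark.ml.linalg}} vector classes and that one dense counter per distinct key fits in memory within a partition.

```scala
import scala.collection.mutable
import scala.reflect.ClassTag

import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

// Hypothetical helper object; none of these names exist in Spark.
object CompressedNnz {

  /** Fast path on the fat aggregator: bump the counter of every non-zero entry of `v`. */
  def update(counts: Array[Double], v: Vector): Unit =
    v.foreachActive { (i, x) => if (x != 0.0) counts(i) += 1.0 }

  /** Compact form sent over the network; `compressed` picks dense or sparse storage. */
  def compress(counts: Array[Double]): Vector = Vectors.dense(counts).compressed

  /** Merge two compact aggregators: expand one side, add the other, compress again. */
  def merge(a: Vector, b: Vector): Vector = {
    // Clone because toArray may expose a DenseVector's backing array.
    val sum = a.toArray.clone()
    b.foreachActive { (i, x) => sum(i) += x }
    compress(sum)
  }

  /** Per-key nnz counts; `dim` is the (assumed known) vector size. */
  def nnzByKey[K: ClassTag](data: RDD[(K, Vector)], dim: Int): RDD[(K, Vector)] =
    data.mapPartitions { rows =>
      // Map-side aggregation with fat aggregators, one dense array per key.
      val local = mutable.HashMap.empty[K, Array[Double]]
      rows.foreach { case (k, v) =>
        update(local.getOrElseUpdate(k, new Array[Double](dim)), v)
      }
      // Compress once per key before anything is written to the shuffle.
      local.iterator.map { case (k, counts) => (k, compress(counts)) }
    }.reduceByKey(merge _)
}
```

The trade-off in this sketch is that every {{merge}} call on the reduce side expands and compresses again, so it only pays off when the compact form is much smaller than the dense one.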