[ https://issues.apache.org/jira/browse/SPARK-22969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16313160#comment-16313160 ]
Sean Owen commented on SPARK-22969:
-----------------------------------

Should this start as a discussion on the mailing list? It doesn't seem clear whether a change is being proposed here.

> aggregateByKey with aggregator compression
> ------------------------------------------
>
>                 Key: SPARK-22969
>                 URL: https://issues.apache.org/jira/browse/SPARK-22969
>             Project: Spark
>          Issue Type: Question
>          Components: Spark Core
>    Affects Versions: 2.4.0
>            Reporter: zhengruifeng
>            Priority: Minor
>
> I have encountered a special case in which the aggregator can be represented as two types:
> a) high memory footprint, but fast {{update}}
> b) compact, but must be converted to type a) before calling {{update}} and {{merge}}.
> I wonder whether it is possible to compress the fat aggregators in {{aggregateByKey}} before the shuffle. How can I implement it? [~cloud_fan]
> One similar case may be:
> using {{aggregateByKey}}/{{reduceByKey}} to compute the nnz vector (number of non-zero values) for different keys on a large sparse dataset.
> We can use {{DenseVector}} as the aggregator to count the nnz, and then compress it by calling {{Vector#compressed}} before sending it over the network.
> Another similar case may be calling {{QuantileSummaries#compress}} before communication.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
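
The compress-before-shuffle idea in the quoted issue can be illustrated with a minimal sketch. This is a plain-Python simulation, not Spark code: the partitions are ordinary lists, the "shuffle" is a list concatenation, and every function name here (`compress`, `decompress`, `update`, `merge`, `aggregate_partition`, `nnz_by_key`) is hypothetical. It only shows the shape of the technique: aggregate with the dense form (type a), convert to the compact form (type b) before data would leave the partition, and expand again before merging.

```python
def compress(dense):
    """Type b: compact form, keeping only non-zero counts as {index: count}."""
    return {i: c for i, c in enumerate(dense) if c != 0}


def decompress(sparse, size):
    """Convert the compact form back to the dense form (type a)."""
    dense = [0] * size
    for i, c in sparse.items():
        dense[i] = c
    return dense


def update(dense, row):
    """Fast in-place update on the dense form: count non-zero entries of a sparse row."""
    for i, v in row.items():
        if v != 0:
            dense[i] += 1
    return dense


def merge(a, b):
    """Merge two dense aggregators element-wise into the first one."""
    for i, c in enumerate(b):
        a[i] += c
    return a


def aggregate_partition(records, size):
    """Map side: aggregate per key with dense vectors, then compress the results."""
    acc = {}
    for key, row in records:
        update(acc.setdefault(key, [0] * size), row)
    # Only the compact form would be sent over the network.
    return [(key, compress(dense)) for key, dense in acc.items()]


def nnz_by_key(partitions, size):
    """Simulated shuffle plus reduce side: expand, merge, and compress again."""
    shuffled = [pair for part in partitions
                for pair in aggregate_partition(part, size)]
    merged = {}
    for key, sparse in shuffled:
        dense = merged.setdefault(key, [0] * size)
        merge(dense, decompress(sparse, size))
    return {key: compress(dense) for key, dense in merged.items()}


if __name__ == "__main__":
    partitions = [
        [("a", {0: 1.0, 3: 2.0}), ("b", {1: 5.0})],
        [("a", {3: 4.0}), ("b", {1: 0.0, 2: 1.0})],
    ]
    print(nnz_by_key(partitions, 5))
```

In Spark itself, one possible mapping (an assumption, not something confirmed in this thread) would be to do the per-partition aggregation manually, for example inside {{mapPartitions}}, emit the compressed values, and then combine them with {{reduceByKey}}.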