[
https://issues.apache.org/jira/browse/SPARK-4743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Josh Rosen resolved SPARK-4743.
-------------------------------
Resolution: Fixed
Fix Version/s: 1.3.0
Issue resolved by pull request 3605
[https://github.com/apache/spark/pull/3605]
> Use SparkEnv.serializer instead of closureSerializer in aggregateByKey and foldByKey
> ------------------------------------------------------------------------------------
>
> Key: SPARK-4743
> URL: https://issues.apache.org/jira/browse/SPARK-4743
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Reporter: Ivan Vergiliev
> Labels: performance
> Fix For: 1.3.0
>
>
> aggregateByKey and foldByKey in PairRDDFunctions both use the closure
> serializer to serialize and deserialize the initial value. Since the closure
> serializer is always the Java serializer, regardless of the spark.serializer
> setting, this can be very expensive when there is a large number of groups.
> Calling combineByKey manually and using the normal serializer instead of the
> closure one improved performance by about 30-35% on the dataset I'm testing
> with.
> I'm not familiar enough with the codebase to be certain that replacing the
> serializer here is safe, but it works correctly in my tests, and it only
> serializes a single value of type U, which the default serializer should be
> able to handle since U can also be the output of a job. Let me know if I'm
> missing anything.
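> For reference, here's a minimal sketch of the change, modeled on the
> zero-value handling in PairRDDFunctions as of 1.2 (zeroValue and the type
> parameter U come from the enclosing aggregateByKey signature, which carries a
> ClassTag for U; foldByKey is analogous):
>
> {code:scala}
> import java.nio.ByteBuffer
> import org.apache.spark.SparkEnv
>
> // Before: SparkEnv.get.closureSerializer, i.e. always the JavaSerializer.
> // After: the regular serializer configured via spark.serializer (e.g. Kryo).
> val zeroBuffer = SparkEnv.get.serializer.newInstance().serialize(zeroValue)
> val zeroArray = new Array[Byte](zeroBuffer.limit)
> zeroBuffer.get(zeroArray)
>
> // One serializer instance per task; each key gets a fresh copy of the zero
> // value deserialized from the byte array.
> lazy val cachedSerializer = SparkEnv.get.serializer.newInstance()
> val createZero = () => cachedSerializer.deserialize[U](ByteBuffer.wrap(zeroArray))
> {code}
>
> This only touches how the single zero value is shipped and cloned; the
> shuffled records themselves already go through the regular serializer.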