[ https://issues.apache.org/jira/browse/SPARK-4743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josh Rosen resolved SPARK-4743.
-------------------------------
       Resolution: Fixed
    Fix Version/s: 1.3.0

Issue resolved by pull request 3605
[https://github.com/apache/spark/pull/3605]

> Use SparkEnv.serializer instead of closureSerializer in aggregateByKey and 
> foldByKey
> ------------------------------------------------------------------------------------
>
>                 Key: SPARK-4743
>                 URL: https://issues.apache.org/jira/browse/SPARK-4743
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>            Reporter: Ivan Vergiliev
>              Labels: performance
>             Fix For: 1.3.0
>
>
> Both aggregateByKey and foldByKey in PairRDDFunctions use the closure 
> serializer to serialize and deserialize the initial (zero) value. This means 
> the Java serializer is always used for it, even when a faster serializer such 
> as Kryo is configured, which can be very expensive when there is a large 
> number of groups. Calling combineByKey manually and using the normal 
> serializer instead of the closure one improved performance by about 30-35% 
> on the dataset I'm testing with.
> I'm not familiar enough with the codebase to be certain that replacing the 
> serializer here is safe, but it works correctly in my tests, and it only 
> serializes a single value of type U, which the default serializer should be 
> able to handle since U can also be the output of a job. Let me know if I'm 
> missing anything.
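>
> As an illustration of the manual workaround, a sketch built on the public 
> combineByKey API (foldByKeyViaCombine is a hypothetical helper name, not 
> part of Spark):
>
>   import java.nio.ByteBuffer
>   import scala.reflect.ClassTag
>   import org.apache.spark.SparkEnv
>   import org.apache.spark.rdd.RDD
>
>   // Hypothetical helper: foldByKey implemented on combineByKey, cloning
>   // the zero value with the configured SparkEnv serializer (e.g. Kryo)
>   // rather than the Java-based closure serializer.
>   def foldByKeyViaCombine[K: ClassTag, V: ClassTag](
>       rdd: RDD[(K, V)], zeroValue: V)(func: (V, V) => V): RDD[(K, V)] = {
>     // Serialize the zero value once on the driver.
>     val zeroBuffer = SparkEnv.get.serializer.newInstance().serialize(zeroValue)
>     val zeroArray = new Array[Byte](zeroBuffer.limit)
>     zeroBuffer.get(zeroArray)
>     // Deserialize a fresh copy per key on the executors, so a mutable
>     // zero value is never shared between keys.
>     val createZero = () =>
>       SparkEnv.get.serializer.newInstance().deserialize[V](ByteBuffer.wrap(zeroArray))
>     rdd.combineByKey(
>       (v: V) => func(createZero(), v), // createCombiner
>       func,                            // mergeValue
>       func)                            // mergeCombiners
>   }
>
> The zero value is shipped to the executors once as a byte array and cloned 
> per key, which matches what foldByKey does internally; only the serializer 
> used for the cloning differs.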



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
