[ 
https://issues.apache.org/jira/browse/SPARK-2114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14038558#comment-14038558
 ] 

Patrick Wendell commented on SPARK-2114:
----------------------------------------

Hey Sandy,

One thing that would be really helpful is if you could produce a micro 
benchmark that reproduces the problems you are seeing. That way we could 
profile this and look at the relative benefits of various optimizations. For 
instance, it's possible that simpler things like keeping only the values 
serialized would yield substantial benefit with less complexity than what is 
proposed here.

A second general note is that using a sort-based shuffle would alleviate a lot 
of these issues. At least my guess is that in the MR shuffle, objects are 
sorted in serialized form and only lazily deserialized at the last possible 
second.
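To make that guess concrete, here is a rough sketch of sorting records entirely in serialized form and deserializing only when handing objects to a user function. Everything here (the UTF-8 "serialization", the `compareRaw` helper, the class name) is made up for illustration and assumes a byte-comparable format; it is not Spark or MR code.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class RawSortSketch {
    // Hypothetical serialization: just the UTF-8 bytes of the key.
    static byte[] serialize(String key) {
        return key.getBytes(StandardCharsets.UTF_8);
    }

    static String deserialize(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    // Byte-wise lexicographic comparator, in the spirit of Hadoop's
    // RawComparator: records are ordered without ever being deserialized.
    static int compareRaw(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        List<byte[]> records = new ArrayList<>();
        for (String k : new String[] {"cherry", "apple", "banana"}) {
            records.add(serialize(k));
        }
        // Sort entirely in serialized form; no objects are created here.
        records.sort(RawSortSketch::compareRaw);
        // Deserialize lazily, only when the user function finally needs objects.
        for (byte[] raw : records) {
            System.out.println(deserialize(raw));
        }
    }
}
```

The point of the sketch is that the sort itself allocates no per-record Java objects, so nothing can escape the young gen during the shuffle.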

> groupByKey and joins on raw data
> --------------------------------
>
>                 Key: SPARK-2114
>                 URL: https://issues.apache.org/jira/browse/SPARK-2114
>             Project: Spark
>          Issue Type: New Feature
>          Components: Shuffle, Spark Core
>    Affects Versions: 1.0.0
>            Reporter: Sandy Ryza
>
> For groupByKey and join transformations, Spark tasks on the reduce side 
> deserialize every record into a Java object before calling any user function. 
>  
> This causes all kinds of problems for garbage collection - when aggregating 
> enough data, objects can escape the young gen and trigger full GCs down the 
> line.  Additionally, when records are spilled, they must be serialized and 
> deserialized multiple times.
> 
> It would be helpful to allow aggregations on serialized data - using some 
> sort of RawHasher interface that could implement hashCode and equals for 
> serialized records.  This would also require encoding record boundaries in 
> the serialized format, which I'm not sure we currently do.
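The RawHasher idea in the quoted description might look roughly like this; a minimal sketch assuming a length-prefixed encoding to mark record boundaries. All names here (`encode`, `rawHash`, `rawEquals`) are hypothetical, invented for illustration only.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class RawHasherSketch {
    // Length-prefix each record so boundaries are recoverable from the stream.
    static byte[] encode(String... records) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        for (String r : records) {
            byte[] raw = r.getBytes(StandardCharsets.UTF_8);
            out.writeInt(raw.length);  // record boundary marker
            out.write(raw);
        }
        return bos.toByteArray();
    }

    // Hash a serialized record in place, without deserializing it.
    static int rawHash(byte[] buf, int off, int len) {
        int h = 1;
        for (int i = off; i < off + len; i++) h = 31 * h + buf[i];
        return h;
    }

    // Compare two serialized records byte-for-byte within the same buffer.
    static boolean rawEquals(byte[] buf, int aOff, int aLen, int bOff, int bLen) {
        if (aLen != bLen) return false;
        for (int i = 0; i < aLen; i++)
            if (buf[aOff + i] != buf[bOff + i]) return false;
        return true;
    }

    public static void main(String[] args) throws IOException {
        byte[] stream = encode("apple", "banana", "apple");
        ByteBuffer buf = ByteBuffer.wrap(stream);
        // Walk record boundaries and hash each record in serialized form,
        // which is what a hash-based aggregation would do for bucketing.
        while (buf.hasRemaining()) {
            int len = buf.getInt();
            int off = buf.position();
            System.out.println(rawHash(stream, off, len));
            buf.position(off + len);
        }
    }
}
```

Grouping could then bucket records by `rawHash` and confirm matches with `rawEquals`, never materializing keys as Java objects until the user function runs.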



--
This message was sent by Atlassian JIRA
(v6.2#6252)
