[ 
https://issues.apache.org/jira/browse/SPARK-2114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza updated SPARK-2114:
------------------------------

    Summary: groupByKey and joins on raw data  (was: Aggregations on raw data)

> groupByKey and joins on raw data
> --------------------------------
>
>                 Key: SPARK-2114
>                 URL: https://issues.apache.org/jira/browse/SPARK-2114
>             Project: Spark
>          Issue Type: New Feature
>          Components: Shuffle, Spark Core
>    Affects Versions: 1.0.0
>            Reporter: Sandy Ryza
>
> For groupByKey and join transformations, Spark tasks on the reduce side 
> deserialize every record into a Java object before calling any user function. 
>  
> This causes all kinds of problems for garbage collection - when aggregating 
> enough data, objects can escape the young gen and trigger full GCs down the 
> line.  Additionally, when records are spilled, they must be serialized and 
> deserialized multiple times.
> It would be helpful to allow aggregations on serialized data - using some 
> sort of RawHasher interface that could implement hashCode and equals for 
> serialized records.  This would also require encoding record boundaries in 
> the serialized format, which I'm not sure we currently do.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

Reply via email to