[
https://issues.apache.org/jira/browse/SPARK-2114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sandy Ryza updated SPARK-2114:
------------------------------
Summary: groupByKey and joins on raw data (was: Aggregations on raw data)
> groupByKey and joins on raw data
> --------------------------------
>
> Key: SPARK-2114
> URL: https://issues.apache.org/jira/browse/SPARK-2114
> Project: Spark
> Issue Type: New Feature
> Components: Shuffle, Spark Core
> Affects Versions: 1.0.0
> Reporter: Sandy Ryza
>
> For groupByKey and join transformations, Spark tasks on the reduce side
> deserialize every record into a Java object before calling any user function.
>
> This causes all kinds of problems for garbage collection - when aggregating
> enough data, objects can escape the young gen and trigger full GCs down the
> line. Additionally, when records are spilled, they must be serialized and
> deserialized multiple times.
> It would be helpful to allow aggregations on serialized data - using some
> sort of RawHasher interface that could implement hashCode and equals for
> serialized records. This would also require encoding record boundaries in
> the serialized format, which I'm not sure we currently do.
--
This message was sent by Atlassian JIRA
(v6.2#6252)