Sandy Ryza created SPARK-2114:
---------------------------------
Summary: Aggregations on raw data
Key: SPARK-2114
URL: https://issues.apache.org/jira/browse/SPARK-2114
Project: Spark
Issue Type: New Feature
Components: Shuffle, Spark Core
Affects Versions: 1.0.0
Reporter: Sandy Ryza
For all the ByKey transformations, Spark tasks on the reduce side deserialize
every record into a Java object before calling any user function.
This causes all kinds of problems for garbage collection - when aggregating
enough data, objects can escape the young gen and trigger full GCs down the
line. Additionally, when records are spilled, they must be serialized and
deserialized multiple times.
It would be helpful to allow aggregations on serialized data - using some sort
of RawHasher interface that could implement for hashCode and equals for
serialized records. This would also require encoding record boundaries in the
serialized format, which I'm not sure we currently do.
--
This message was sent by Atlassian JIRA
(v6.2#6252)