Matei Zaharia created SPARK-2048:
------------------------------------
Summary: Optimizations to CPU usage of external spilling code
Key: SPARK-2048
URL: https://issues.apache.org/jira/browse/SPARK-2048
Project: Spark
Issue Type: Improvement
Components: Spark Core
Reporter: Matei Zaharia
In the external spilling code in ExternalAppendOnlyMap and CoGroupedRDD, there
are a few opportunities for optimization:
- There are lots of uses of pattern-matching on Tuple2 (e.g. val (k, v) =
pair), which we found to be much slower than accessing the fields directly
(pair._1, pair._2)
- Hash codes for each element are computed many times in
StreamBuffer.minKeyHash, which will be expensive for some data types
- Uses of buffer.remove() may be expensive if there are lots of hash collisions
(better to swap the last element into that position and drop the tail)
- More objects are allocated than is probably necessary, e.g. ArrayBuffers and
pairs
These changes should help because situations where we're spilling are also ones
where there is presumably a lot of GC pressure in the new generation.
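
As a rough illustration of two of the points above (direct field access and
swap-based removal), here is a minimal sketch. It is not the actual Spark code:
SpillSketch, removeFromBuffer and mergeMatching are hypothetical names, and the
(String, Int) element type is assumed only to keep the example small.

```scala
import scala.collection.mutable.ArrayBuffer

object SpillSketch {
  // Hypothetical helper: remove the element at `index` in constant time by
  // overwriting it with the last element and then dropping the tail.
  // Element order is NOT preserved, which is fine when the buffer is only
  // used as a bag of colliding entries.
  def removeFromBuffer[T](buffer: ArrayBuffer[T], index: Int): T = {
    val elem = buffer(index)
    buffer(index) = buffer(buffer.size - 1) // move the last element into the hole
    buffer.remove(buffer.size - 1)          // removing the tail shifts nothing
    elem
  }

  // Hypothetical merge step: sum and remove every pair whose key equals `key`.
  def mergeMatching(key: String, pairs: ArrayBuffer[(String, Int)]): Int = {
    var sum = 0
    var i = 0
    while (i < pairs.length) {
      val pair = pairs(i)
      if (pair._1 == key) {         // direct field access, no val (k, v) = pair
        sum += pair._2
        removeFromBuffer(pairs, i)  // do not advance i: a new element sits at i
      } else {
        i += 1
      }
    }
    sum
  }
}
```

The swap trades ordering for speed: buffer.remove(i) on an ArrayBuffer has to
shift every later element down by one, while the version above touches only two
slots regardless of how many colliding entries the buffer holds.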