[jira] [Created] (SPARK-2048) Optimizations to CPU usage of external spilling code

Matei Zaharia (JIRA) Thu, 05 Jun 2014 18:12:20 -0700

Matei Zaharia created SPARK-2048:
------------------------------------

             Summary: Optimizations to CPU usage of external spilling code
                 Key: SPARK-2048
                 URL: https://issues.apache.org/jira/browse/SPARK-2048
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
            Reporter: Matei Zaharia



In the external spilling code in ExternalAppendOnlyMap and CoGroupedRDD, there 
are a few opportunities for optimization:
- There are lots of uses of pattern-matching on Tuple2 (e.g. val (k, v) = 
pair), which we found to be much slower than accessing the fields directly 
(pair._1 and pair._2)
- Hash codes for each element are computed many times in 
StreamBuffer.minKeyHash, which will be expensive for some data types
- Uses of buffer.remove() may be expensive if there are lots of hash collisions 
(better to swap the last element into that position and then drop the last 
slot, so that no later elements need to be shifted)
- More objects are allocated than is probably necessary, e.g. ArrayBuffers and 
pairs
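
As a rough illustration of the first and third points, here is a minimal sketch 
of swap-based removal combined with direct Tuple2 field access. The object and 
method names (SpillSketch, swapRemove, removeAllWithKey) are hypothetical and 
are not taken from the Spark code base; the sketch assumes the buffer is a 
scala.collection.mutable.ArrayBuffer whose element order does not matter.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical helpers for illustration only; not the actual Spark implementation.
object SpillSketch {

  // Remove the element at `index` without shifting the tail: overwrite it with
  // the last element, then drop the last slot. Element order is NOT preserved.
  def swapRemove[T](buffer: ArrayBuffer[T], index: Int): T = {
    val removed = buffer(index)
    val last = buffer.length - 1
    buffer(index) = buffer(last)
    buffer.remove(last) // removing the final element shifts nothing
    removed
  }

  // Remove every pair whose key equals `key` and return the removed values.
  // Fields are read with pair._1 / pair._2 instead of `val (k, v) = pair`.
  def removeAllWithKey[K, V](buffer: ArrayBuffer[(K, V)], key: K): ArrayBuffer[V] = {
    val values = new ArrayBuffer[V]
    var i = 0
    while (i < buffer.length) {
      val pair = buffer(i)
      if (pair._1 == key) {
        values += pair._2
        // Do not advance i: the element swapped into slot i is still unchecked.
        swapRemove(buffer, i)
      } else {
        i += 1
      }
    }
    values
  }
}
```

Because the swap reorders the buffer, this only applies where the buffer is 
treated as an unordered collection of pairs, which is assumed here.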

These should help because situations where we're spilling are also ones where 
there is presumably a lot of GC pressure in the new generation.



