Matei Zaharia created SPARK-2048:
------------------------------------

             Summary: Optimizations to CPU usage of external spilling code
                 Key: SPARK-2048
                 URL: https://issues.apache.org/jira/browse/SPARK-2048
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
            Reporter: Matei Zaharia


In the external spilling code in ExternalAppendOnlyMap and CoGroupedRDD, there 
are a few opportunities for optimization:
- There are lots of uses of pattern-matching on Tuple2 (e.g. val (k, v) = 
pair), which we found to be much slower than accessing fields directly
- Hash codes for each element are computed many times in 
StreamBuffer.minKeyHash, which will be expensive for some data types
- Uses of buffer.remove() may be expensive if there are lots of hash collisions, 
since removing from the middle of an ArrayBuffer shifts every later element 
(better to swap the last element into that position)
- More objects are allocated than is probably necessary, e.g. ArrayBuffers and 
pairs
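
As a rough illustration of the first three points, here is a minimal Scala sketch. It is not the actual Spark code: the names SpillSketch, removeSwap, sumValues and HashedPair are hypothetical and chosen only for this example. It assumes that element order in the buffer does not matter and that keys are non-null.

```scala
import scala.collection.mutable.ArrayBuffer

object SpillSketch {
  // Hypothetical helper: remove the element at `index` in constant time by
  // moving the last element into its slot. Order is NOT preserved.
  def removeSwap[T](buffer: ArrayBuffer[T], index: Int): T = {
    val elem = buffer(index)
    val last = buffer.length - 1
    buffer(index) = buffer(last) // fill the hole with the last element
    buffer.remove(last)          // dropping the tail shifts nothing
    elem
  }

  // Direct field access (pair._2) in place of `val (k, v) = pair`,
  // which goes through a pattern match on every iteration.
  def sumValues(pairs: ArrayBuffer[(String, Int)]): Int = {
    var total = 0
    var i = 0
    while (i < pairs.length) {
      total += pairs(i)._2
      i += 1
    }
    total
  }

  // Hypothetical wrapper that computes the key's hash code once at
  // construction, so repeated comparisons do not recompute it.
  final class HashedPair[K, V](val key: K, val value: V) {
    val keyHash: Int = key.hashCode()
  }
}
```

Whether each change pays off in practice would need to be confirmed by profiling the spilling path; the sketch only shows the shape of the rewrites.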

These changes should help because the situations where we're spilling are also 
ones where there is presumably a lot of GC pressure in the new generation.



