[jira] [Created] (SPARK-15671) performance regression CoalesceRDD large # partitions

Thomas Graves (JIRA) Tue, 31 May 2016 08:40:57 -0700

Thomas Graves created SPARK-15671:
-------------------------------------

             Summary: performance regression CoalesceRDD large # partitions
                 Key: SPARK-15671
                 URL: https://issues.apache.org/jira/browse/SPARK-15671
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.0.0
            Reporter: Thomas Graves
            Priority: Critical



I was running a 15TB join job with 202000 partitions. It looks like the changes 
I made to CoalescedRDD in pickBin() are really slow with that many 
partitions. The array filter over that many elements just takes too long.

It took about an hour for pickBin to run for all the partitions.
Original change:
https://github.com/apache/spark/commit/83ee92f60345f016a390d61a82f1d924f64ddf90

Just reverting the pickBin code back to use currpreflocs fixes the issue.
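
To illustrate why a per-partition array filter can get this slow, here is a
minimal, language-neutral sketch in Python. It is not the Spark code. The
function names and the (location, partition) pair layout are hypothetical, and
it assumes the filtered array holds roughly one entry per partition. Under that
assumption, filtering once per partition costs about 202000 x 202000, or
roughly 4.1e10, element checks in total, while building a lookup table once
costs a single pass over the array.

```python
from collections import defaultdict


def locs_by_filter(pairs, part):
    """Scan every (location, partition) pair on each call: O(n) per lookup."""
    return [loc for loc, p in pairs if p == part]


def build_loc_index(pairs):
    """Group locations by partition in one pass: O(n) total."""
    index = defaultdict(list)
    for loc, p in pairs:
        index[p].append(loc)
    return index


def locs_by_index(index, part):
    """Constant-time lookup against the prebuilt index."""
    return index.get(part, [])


# Hypothetical data: 1000 partitions spread over 10 hosts.
pairs = [("host%d" % (i % 10), i) for i in range(1000)]
index = build_loc_index(pairs)

# Both approaches return the same locations; only the cost differs.
same = all(locs_by_filter(pairs, i) == locs_by_index(index, i)
           for i in range(1000))
```

Calling locs_by_filter once per partition is quadratic in the partition count,
which is the pattern the report describes. Whether a prebuilt index or the
revert mentioned above is the better fix for Spark itself is left open here.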



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

