Thomas Graves created SPARK-15671:
-------------------------------------
Summary: performance regression CoalesceRDD large # partitions
Key: SPARK-15671
URL: https://issues.apache.org/jira/browse/SPARK-15671
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 2.0.0
Reporter: Thomas Graves
Priority: Critical
I was running a 15TB join job with 202000 partitions. It looks like the changes
I made to CoalesceRDD in pickBin() are really slow with that many
partitions. The array filter over that many elements, repeated for every
partition, just takes too long. It took about an hour to run pickBin for all
the partitions.
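To illustrate why a per-partition array filter scales badly, here is a minimal sketch in Python. It is not the Spark code: the function names, the `(location, partition)` pair layout, and the host names are all hypothetical. Assuming one array entry per partition, filtering the whole array once per partition costs about 202000 * 202000, roughly 4.08e10, comparisons, while grouping the array once costs a single pass. The one-pass index is shown only for contrast and is not the fix described in this issue.

```python
from collections import defaultdict


def pref_locs_by_filter(parts_with_locs, partition):
    """Scan the whole (location, partition) array for one partition.

    Each call is O(n), so calling it for all n partitions is O(n^2).
    """
    return [loc for loc, part in parts_with_locs if part == partition]


def build_pref_locs_index(parts_with_locs):
    """Group locations by partition in a single O(n) pass."""
    index = defaultdict(list)
    for loc, part in parts_with_locs:
        index[part].append(loc)
    return index


if __name__ == "__main__":
    import time

    n = 5000  # hypothetical size, far smaller than the 202000 in the report
    parts_with_locs = [(f"host{i % 50}", i) for i in range(n)]

    start = time.perf_counter()
    slow = [pref_locs_by_filter(parts_with_locs, p) for p in range(n)]
    print(f"filter per partition: {time.perf_counter() - start:.3f}s")

    start = time.perf_counter()
    index = build_pref_locs_index(parts_with_locs)
    fast = [index[p] for p in range(n)]
    print(f"single-pass index:    {time.perf_counter() - start:.3f}s")

    # Both approaches return the same locations for every partition.
    assert slow == fast
```

Both functions return the same locations per partition; only the amount of scanning differs, and the gap grows quadratically with the partition count.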
original change:
https://github.com/apache/spark/commit/83ee92f60345f016a390d61a82f1d924f64ddf90
Just reverting the pickBin code back to using currPrefLocs fixes the issue.