GitHub user JoshRosen opened a pull request:

    https://github.com/apache/spark/pull/8178

    [SPARK-9952] Fix N^2 loop when DAGScheduler.getPreferredLocsInternal 
accesses cacheLocs

    In Scala, `Seq.fill` always seems to return a List. Accessing a list by 
index is an O(N) operation. Thus, the following code will be really slow (~10 
seconds on my machine):
    
    ```scala
    val numItems = 100000
    val s = Seq.fill(numItems)(1)
    for (i <- 0 until numItems) s(i)
    ```
    
    It turns out that we had a loop like this in DAGScheduler code, although 
it's a little tricky to spot. In `getPreferredLocsInternal`, there's a call to 
`getCacheLocs(rdd)(partition)`.  The `getCacheLocs` call returns a Seq. If this 
Seq is a List and the RDD contains many partitions, then indexing into this 
list will cost O(partitions). Thus, when we loop over our tasks to compute 
their individual preferred locations we implicitly perform an N^2 loop, 
reducing scheduling throughput.
    
    This patch fixes this by replacing `Seq` with `Array`.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/JoshRosen/spark dagscheduler-perf

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/8178.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #8178
    
----
commit f2fd11e386da810d4b97f30dc1f28feb3c53d1e1
Author: Josh Rosen <[email protected]>
Date:   2015-08-13T18:49:36Z

    Replace .size with .length

commit b8e6552dc29505c8a27855b54eb630700d9f2fd9
Author: Josh Rosen <[email protected]>
Date:   2015-08-13T18:54:07Z

    Remove redundant .toArray conversions

commit fe918a9298f50a6f4179f7a69ec8450794f3abfc
Author: Josh Rosen <[email protected]>
Date:   2015-08-13T21:42:18Z

    Replace outer Seq with Array.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to