GitHub user JoshRosen opened a pull request:
https://github.com/apache/spark/pull/8178
[SPARK-9952] Fix N^2 loop when DAGScheduler.getPreferredLocsInternal
accesses cacheLocs
In Scala, `Seq.fill` always seems to return a List. Accessing a list by
index is an O(N) operation. Thus, the following code will be really slow (~10
seconds on my machine):
```scala
val numItems = 100000
val s = Seq.fill(numItems)(1)
for (i <- 0 until numItems) s(i)
```
It turns out that we had a loop like this in DAGScheduler code, although
it's a little tricky to spot. In `getPreferredLocsInternal`, there's a call to
`getCacheLocs(rdd)(partition)`. The `getCacheLocs` call returns a Seq. If this
Seq is a List and the RDD contains many partitions, then indexing into this
list will cost O(partitions). Thus, when we loop over our tasks to compute
their individual preferred locations we implicitly perform an N^2 loop,
reducing scheduling throughput.
This patch fixes this by replacing `Seq` with `Array`.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/JoshRosen/spark dagscheduler-perf
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/8178.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #8178
----
commit f2fd11e386da810d4b97f30dc1f28feb3c53d1e1
Author: Josh Rosen <[email protected]>
Date: 2015-08-13T18:49:36Z
Replace .size with .length
commit b8e6552dc29505c8a27855b54eb630700d9f2fd9
Author: Josh Rosen <[email protected]>
Date: 2015-08-13T18:54:07Z
Remove redundant .toArray conversions
commit fe918a9298f50a6f4179f7a69ec8450794f3abfc
Author: Josh Rosen <[email protected]>
Date: 2015-08-13T21:42:18Z
Replace outer Seq with Array.
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]