GitHub user tgravescs opened a pull request:
https://github.com/apache/spark/pull/11327
[SPARK-11316] coalesce setupGroups doesn't handle UnionRDD with partial
locality properly
## What changes were proposed in this pull request?
coalesce doesn't handle UnionRDD with partial locality properly. I had a
user who had a UnionRDD that was made up of mapPartitionRDD without preferred
locations and a checkpointedRDD with preferred locations (getting from hdfs).
It took the driver over 20 minutes to setup the groups and put the partitions
into those groups before it even started any tasks. Even perhaps worse is it
didn't end up with the number of partitions he was asking for because it didn't
put a partition in each of the groups properly.
The changes in this patch get rid of a n^2 while loop that was causing the
20 minutes and it also properly distributes the partitions to have at least one
per group.
Note that the n^2 while loop that I removed in setupGroups took so long
because all of the partitions with preferred locations were already assigned to
group, so it basically looped through every single one and wasn't ever able to
assign it. At the time I had 960 partitions with preferred locations and 1020
without and did the outer while loop 319 times because that is the # of groups
left to create. Note that each of those times through the inner while loop is
going off to hdfs to get the block locations, so this is extremely inefficient.
## How was the this patch tested?
Added unit tests for this case and ran existing ones that applied to make
sure no regressions.
Also manually tested on the users production job to make sure it fixed
their issue. It created the proper number of partitions and now it takes under
a minute rather then 20 minutes.
I did also run some basic manual tests with spark-shell doing coalesced to
smaller number, same number, and then greater with shuffle.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/tgravescs/spark SPARK-11316
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/11327.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #11327
----
commit afe14dce508b1e51820f16e33f09c9aa402bca3e
Author: Thomas Graves <[email protected]>
Date: 2016-02-23T16:54:30Z
[SPARK-11316] coalesce setupGroups doesn't handle UnionRDD with partial
localtiy properly
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]