GitHub user tgravescs opened a pull request:

    https://github.com/apache/spark/pull/11327

    [SPARK-11316] coalesce setupGroups doesn't handle UnionRDD with partial 
locality properly

    ## What changes were proposed in this pull request?
    
    coalesce doesn't handle UnionRDD with partial locality properly.  I had a 
user who had a UnionRDD that was made up of mapPartitionRDD without preferred 
locations and a checkpointedRDD with preferred locations (getting from hdfs).  
It took the driver over 20 minutes to setup the groups and put the partitions 
into those groups before it even started any tasks.  Even perhaps worse is it 
didn't end up with the number of partitions he was asking for because it didn't 
put a partition in each of the groups properly.
    
    The changes in this patch get rid of a n^2 while loop that was causing the 
20 minutes and it also properly distributes the partitions to have at least one 
per group.  
    
    Note that the n^2 while loop that I removed in setupGroups took so long 
because all of the partitions with preferred locations were already assigned to 
group, so it basically looped through every single one and wasn't ever able to 
assign it.  At the time I had 960 partitions with preferred locations and 1020 
without and did the outer while loop 319 times because that is the # of groups 
left to create.  Note that each of those times through the inner while loop is 
going off to hdfs to get the block locations, so this is extremely inefficient.
    
    ## How was the this patch tested?
    
    Added unit tests for this case and ran existing ones that applied to make 
sure no regressions. 
    Also manually tested on the users production job to make sure it fixed 
their issue.  It created the proper number of partitions and now it takes under 
a minute rather then 20 minutes.
     I did also run some basic manual tests with spark-shell doing coalesced to 
smaller number, same number, and then greater with shuffle.
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tgravescs/spark SPARK-11316

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/11327.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #11327
    
----
commit afe14dce508b1e51820f16e33f09c9aa402bca3e
Author: Thomas Graves <[email protected]>
Date:   2016-02-23T16:54:30Z

    [SPARK-11316] coalesce setupGroups doesn't handle UnionRDD with partial 
localtiy properly

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to