Ala Luszczak created SPARK-23496:

             Summary: Locality of coalesced partitions can be severely skewed 
by the order of input partitions
                 Key: SPARK-23496
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 3.0.0
            Reporter: Ala Luszczak


Consider RDD "R" with 100 partitions, half of which have locality preference 
"hostA" and half have "hostB".
 * Assume odd-numbered input partitions of R prefer "hostA" and even-numbered 
prefer "hostB". Then R.coalesce(50) will have 25 partitions with preference 
"hostA" and 25 with "hostB" (even distribution).
 * Assume partitions with index 0-49 of R prefer "hostA" and partitions with 
index 50-99 prefer "hostB". Then R.coalesce(50) will have 49 partitions with 
"hostA" and 1 with "hostB" (extremely skewed distribution).


The algorithm in {{DefaultPartitionCoalescer.setupGroups}} is responsible for 
picking preferred locations for coalesced partitions. It analyzes the preferred 
locations of input partitions. It starts by trying to create one partition for 
each unique location in the input. However, if the the requested number of 
coalesced partitions is higher that the number of unique locations, it has to 
pick duplicate locations.

Currently, the duplicate locations are picked by iterating over the input 
partitions in order, and copying their preferred locations to coalesced 
partitions. If the input partitions are clustered by location, this can result 
in severe skew.

Instead of iterating over the list of input partitions in order, we should pick 
them at random.

This message was sent by Atlassian JIRA

To unsubscribe, e-mail:
For additional commands, e-mail:

Reply via email to