Gabriel Reid created CRUNCH-673:
-----------------------------------

             Summary: Sort fails when using more reducers than records
                 Key: CRUNCH-673
                 URL: https://issues.apache.org/jira/browse/CRUNCH-673
             Project: Crunch
          Issue Type: Bug
            Reporter: Gabriel Reid


We've run into an issue where Sort fails when the number of reducers is higher 
than the number of records to be sorted.

This happens when a large PCollection is filtered down to almost nothing (say 
10 records) and the filtered PCollection is passed to Sort. Sort doesn't realize 
that the input has been filtered so aggressively, so it still configures n 
reducers for the small PCollection (for example, 20). Reservoir sampling is used 
to build the partition definitions for the TotalOrderPartitioner, but because 
the filtered PCollection contains only 10 records, only 10 partitions are 
defined. A precondition in TotalOrderPartitioner then fails, because the number 
of partitions in the partitions file doesn't match the number of configured 
reducers.
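
The mismatch can be reproduced outside of Crunch with a small standalone 
sketch. This is not Crunch's actual code: the class and method names below are 
hypothetical, the sample capacity of (reducers - 1) is an assumption about how 
the split points are sized, and the check only mirrors the kind of precondition 
described above (Hadoop's TotalOrderPartitioner expects exactly one fewer split 
point than reduce tasks). The clampReducers helper shows one possible 
mitigation, not necessarily the fix that should be adopted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical illustration only; none of these names exist in Crunch.
final class PartitionMismatchSketch {

    // Reservoir sampling (Algorithm R): returns at most `capacity` records.
    // If the input holds fewer records than `capacity`, all of them are returned.
    static <T> List<T> reservoirSample(Iterable<T> records, int capacity, Random rng) {
        List<T> reservoir = new ArrayList<>(capacity);
        int seen = 0;
        for (T record : records) {
            seen++;
            if (reservoir.size() < capacity) {
                reservoir.add(record);
            } else {
                int j = rng.nextInt(seen);
                if (j < capacity) {
                    reservoir.set(j, record);
                }
            }
        }
        return reservoir;
    }

    // Stand-in for the partitioner precondition (assumed shape):
    // the number of split points must equal the number of reducers minus one.
    static void checkSplitPoints(int splitPointCount, int numReducers) {
        if (splitPointCount != numReducers - 1) {
            throw new IllegalStateException(
                "Wrong number of partitions: " + splitPointCount
                    + " split points for " + numReducers + " reducers");
        }
    }

    // One possible mitigation (an assumption, not the project's fix):
    // never use more reducers than the sampled split points can support.
    static int clampReducers(int requestedReducers, int splitPointCount) {
        return Math.min(requestedReducers, splitPointCount + 1);
    }

    public static void main(String[] args) {
        List<Integer> filtered = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            filtered.add(i); // only 10 records survive the filter
        }
        int requestedReducers = 20;

        List<Integer> splitPoints =
            reservoirSample(filtered, requestedReducers - 1, new Random());
        System.out.println("split points sampled: " + splitPoints.size());

        try {
            checkSplitPoints(splitPoints.size(), requestedReducers);
        } catch (IllegalStateException e) {
            System.out.println("precondition failed: " + e.getMessage());
        }

        int reducers = clampReducers(requestedReducers, splitPoints.size());
        checkSplitPoints(splitPoints.size(), reducers); // passes after clamping
        System.out.println("reducers after clamping: " + reducers);
    }
}
```

With 10 records and 20 requested reducers, the sampler can return at most 10 
split points, so the check fails until the reducer count is reduced to match.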



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
