[ https://issues.apache.org/jira/browse/CRUNCH-51?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josh Wills updated CRUNCH-51:
-----------------------------

    Attachment: CRUNCH-51.patch

Here's my (still incomplete and ugly) take on this. It builds on the reservoir 
sampling support that was just added and on the cross-job dependencies that we 
introduced for map-side joins. I don't think it's ready for review yet, but I 
wanted to get it posted in case I get hit by a bus.
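
For anyone following along, the general idea (sample the keys, derive split 
points, then route each key to a reducer by range) can be sketched in plain Java 
as below. This is only an illustration of the technique, not the attached patch 
and not Crunch's API: the class TotalOrderSketch and its methods 
reservoirSample, splitPoints and partitionFor are hypothetical names made up 
for this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper for illustration only; not part of Crunch or the patch.
final class TotalOrderSketch {

    // Reservoir sampling (Algorithm R): one pass, uniform sample of at most k items.
    static <T> List<T> reservoirSample(Iterable<T> input, int k, Random rng) {
        List<T> reservoir = new ArrayList<>();
        long seen = 0;
        for (T item : input) {
            seen++;
            if (reservoir.size() < k) {
                reservoir.add(item);
            } else {
                // Keep the new item with probability k / seen.
                long j = (long) (rng.nextDouble() * seen);
                if (j < k) {
                    reservoir.set((int) j, item);
                }
            }
        }
        return reservoir;
    }

    // Pick numPartitions - 1 evenly spaced keys from the sorted sample.
    static <T extends Comparable<? super T>> List<T> splitPoints(
            List<T> sample, int numPartitions) {
        List<T> sorted = new ArrayList<>(sample);
        Collections.sort(sorted);
        List<T> splits = new ArrayList<>();
        if (sorted.isEmpty()) {
            return splits; // no sample: everything falls into partition 0
        }
        for (int i = 1; i < numPartitions; i++) {
            int idx = (int) ((long) i * sorted.size() / numPartitions);
            splits.add(sorted.get(idx));
        }
        return splits;
    }

    // Route a key by range: keys below splits[0] go to partition 0, and so on.
    // A key equal to a split point goes to the partition above it.
    static <T extends Comparable<? super T>> int partitionFor(T key, List<T> splits) {
        int pos = Collections.binarySearch(splits, key);
        return pos >= 0 ? pos + 1 : -(pos + 1);
    }
}
```

Because partitionFor is monotone in the key, each reducer can sort its own range 
independently and the outputs, read in partition order, form a total order. How 
evenly the ranges are balanced depends on how well the sample reflects the key 
distribution.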
                
> PCollection#sort relies on using a single reducer for total order sorting
> -------------------------------------------------------------------------
>
>                 Key: CRUNCH-51
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-51
>             Project: Crunch
>          Issue Type: Improvement
>    Affects Versions: 0.3.0
>            Reporter: Gabriel Reid
>         Attachments: 0001-CRUNCH-51-Total-Order-Sort.patch, CRUNCH-51.patch, 
> CRUNCH-51.patch, CRUNCH-51.patch, SortTest.java
>
>
> The total-order sorting provided by the Sort class (and therefore 
> PCollection#sort) relies on a single reducer. This is very inefficient for 
> large datasets, and should be replaced with a total order partitioner so 
> that the sort can be spread across multiple reducers.
> For more information, see CRUNCH-23 (and possibly also MAPREDUCE-4574).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira