[jira] [Commented] (CRUNCH-51) PCollection#sort relies on using a single reducer for total order sorting

Rahul Sharma (JIRA) Thu, 20 Sep 2012 01:39:12 -0700

    [ https://issues.apache.org/jira/browse/CRUNCH-51?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13459454#comment-13459454 ]


Rahul Sharma commented on CRUNCH-51:
------------------------------------

I had to develop the Reservoir stuff because CrunchTotalOrderPartitioner 
requires a sequence file (containing the keys) to work with. The approach you 
are advocating is definitely simpler and more efficient, but you would have to 
hack your way through the partitioner to make it work. The distributed cache 
would hold data of type T, but the partitioner receives the corresponding 
mapped type, so the binary tree it uses would have to be built from that 
mapped type instead. 
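
To make the idea concrete, here is a rough sketch in plain Java. It is not Crunch or Hadoop code: the class SplitPointSketch and all of its methods are hypothetical names made up for illustration, and a binary search over sorted split points stands in for the partitioner's binary tree. It samples values with a reservoir, converts them to the mapped key type, picks split points, and looks up the partition for a key.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.function.Function;

// Hypothetical illustration only: none of these names exist in Crunch or Hadoop.
final class SplitPointSketch {

    // Reservoir sampling (Algorithm R): one pass, uniform sample of at most k items.
    static <T> List<T> reservoirSample(Iterable<T> items, int k, Random rng) {
        List<T> reservoir = new ArrayList<>();
        long seen = 0;
        for (T item : items) {
            seen++;
            if (reservoir.size() < k) {
                reservoir.add(item);
            } else {
                // Replace a random slot with probability k / seen.
                long j = (long) (rng.nextDouble() * seen);
                if (j < k) {
                    reservoir.set((int) j, item);
                }
            }
        }
        return reservoir;
    }

    // Convert sampled values of type T into the key type K that the partitioner sees.
    // The mapping function is an assumption standing in for the real type mapping.
    static <T, K> List<K> toMappedKeys(List<T> sampled, Function<? super T, ? extends K> mapFn) {
        List<K> keys = new ArrayList<>();
        for (T value : sampled) {
            keys.add(mapFn.apply(value));
        }
        return keys;
    }

    // Sort the sampled keys and pick numPartitions - 1 evenly spaced split points.
    static <K extends Comparable<K>> List<K> splitPoints(List<K> sample, int numPartitions) {
        List<K> sorted = new ArrayList<>(sample);
        Collections.sort(sorted);
        List<K> splits = new ArrayList<>();
        if (sorted.isEmpty()) {
            return splits; // nothing sampled: everything falls into partition 0
        }
        for (int i = 1; i < numPartitions; i++) {
            int idx = (int) ((long) i * sorted.size() / numPartitions);
            splits.add(sorted.get(idx));
        }
        return splits;
    }

    // Keys below splits[0] go to partition 0; keys >= splits[i - 1] and < splits[i] go to i.
    static <K extends Comparable<K>> int partitionFor(K key, List<K> splits) {
        int pos = Collections.binarySearch(splits, key);
        return pos >= 0 ? pos + 1 : -pos - 1;
    }
}
```

The point of toMappedKeys is the type mismatch described above: the split points have to be expressed in the mapped key type before the lookup structure is built, otherwise the comparison in partitionFor is made against the wrong type.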

As for CrunchTotalOrderPartitionerTest, I wrote it as a unit test for 
CrunchTotalOrderPartitioner to understand how it works. I feel we should still 
keep it and update it to match the changes we are making to the partitioner.
                
> PCollection#sort relies on using a single reducer for total order sorting
> -------------------------------------------------------------------------
>
>                 Key: CRUNCH-51
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-51
>             Project: Crunch
>          Issue Type: Improvement
>    Affects Versions: 0.3.0
>            Reporter: Gabriel Reid
>         Attachments: 0001-CRUNCH-51-Total-Order-Sort.patch, CRUNCH-51.patch, 
> CRUNCH-51.patch, SortTest.java
>
>
> The total-order sorting provided by the Sort class (and therefore 
> PCollection#sort) relies on using a single reducer in order to provide 
> total-order sorting. This is very inefficient for large datasets, and should 
> be replaced with a total order partitioner instead.
> For more information, see CRUNCH-23 (and possibly also MAPREDUCE-4574).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
