[
https://issues.apache.org/jira/browse/CRUNCH-51?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13459454#comment-13459454
]
Rahul Sharma commented on CRUNCH-51:
------------------------------------
I had to develop the Reservoir code because CrunchTotalOrderPartitioner
requires a sequence file (containing the keys) to work with. The approach you
are advocating is definitely simpler and more efficient, but you would have to
hack your way through the partitioner to make it work. The distributed cache
would hold data of type T, whereas the partitioner receives the corresponding
mapped type, so the binary tree it uses would have to be built from that
mapped type.
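To make the type mismatch concrete, here is a minimal sketch in plain Java, not
Crunch code. It assumes a reservoir sample of T values, converts the sample to
the mapped key type before picking split points, and then looks up a partition
by binary search over those split points. All names (SplitPointSketch,
reservoirSample, splitPoints, partitionFor) are hypothetical, in-memory lists
stand in for the sequence file and the distributed cache, and Integer stands in
for the mapped type.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.function.Function;

// Hypothetical illustration only; none of these names exist in Crunch.
public class SplitPointSketch {

  // Reservoir sampling (Algorithm R): keeps a uniform sample of at most k items
  // from an input whose size is not known in advance.
  static <T> List<T> reservoirSample(Iterable<T> input, int k, Random rnd) {
    List<T> reservoir = new ArrayList<>();
    long seen = 0;
    for (T item : input) {
      seen++;
      if (reservoir.size() < k) {
        reservoir.add(item);
      } else {
        long j = (long) (rnd.nextDouble() * seen); // uniform in [0, seen)
        if (j < k) {
          reservoir.set((int) j, item);
        }
      }
    }
    return reservoir;
  }

  // Converts the sampled T values to the mapped key type K first, then sorts
  // and picks numPartitions - 1 split points. The split points must be of
  // type K because K is what the partitioner compares at runtime.
  static <T, K extends Comparable<K>> List<K> splitPoints(
      List<T> sample, Function<? super T, ? extends K> toKey, int numPartitions) {
    List<K> keys = new ArrayList<>();
    for (T t : sample) {
      keys.add(toKey.apply(t));
    }
    Collections.sort(keys);
    List<K> splits = new ArrayList<>();
    if (keys.isEmpty()) {
      return splits; // nothing sampled: everything goes to partition 0
    }
    for (int i = 1; i < numPartitions; i++) {
      int idx = (int) ((long) i * keys.size() / numPartitions);
      splits.add(keys.get(idx));
    }
    return splits;
  }

  // Binary search over the sorted split points. A key equal to a split point
  // goes to the partition on its right.
  static <K extends Comparable<K>> int partitionFor(K key, List<K> splits) {
    int pos = Collections.binarySearch(splits, key);
    return pos >= 0 ? pos + 1 : -(pos + 1);
  }

  public static void main(String[] args) {
    // T is String here; the mapped type K is Integer. Sorting the raw strings
    // would give a different order ("100" < "25"), which is why the split
    // points are built from the mapped type.
    List<String> records = new ArrayList<>();
    for (int i = 0; i < 1000; i++) {
      records.add(Integer.toString(i));
    }
    List<String> sample = reservoirSample(records, 50, new Random(42));
    List<Integer> splits = splitPoints(sample, Integer::parseInt, 4);
    System.out.println("split points: " + splits);
    System.out.println("partition of 500: " + partitionFor(500, splits));
  }
}
```

In this sketch the conversion from T to the mapped type happens before the
split points are chosen; doing it inside the partitioner instead would be the
"hack" mentioned above.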
As for CrunchTotalOrderPartitionerTest, I wrote it as a unit test for
CrunchTotalOrderPartitioner in order to understand how it works. I think we
should keep it and update it to match the changes we are making to the
partitioner.
> PCollection#sort relies on using a single reducer for total order sorting
> -------------------------------------------------------------------------
>
> Key: CRUNCH-51
> URL: https://issues.apache.org/jira/browse/CRUNCH-51
> Project: Crunch
> Issue Type: Improvement
> Affects Versions: 0.3.0
> Reporter: Gabriel Reid
> Attachments: 0001-CRUNCH-51-Total-Order-Sort.patch, CRUNCH-51.patch,
> CRUNCH-51.patch, SortTest.java
>
>
> The total-order sorting provided by the Sort class (and therefore
> PCollection#sort) relies on using a single reducer in order to provide
> total-order sorting. This is very inefficient for large datasets, and should
> be replaced with a total order partitioner instead.
> For more information, see CRUNCH-23 (and possibly also MAPREDUCE-4574).