[ 
https://issues.apache.org/jira/browse/CRUNCH-23?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13424718#comment-13424718
 ] 

Rahul Sharma edited comment on CRUNCH-23 at 7/30/12 6:38 AM:
-------------------------------------------------------------

This is a first-cut solution to this issue, but it has a drawback: the keys 
in the partition file are not evenly distributed. In the worst case, i.e. when 
the input file is already sorted (as words.txt is), most of the work is done 
by the last reducer. Is there a way to improve this?
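
One possible direction, as a rough sketch rather than what the attached patch 
does: if the skew comes from split points being taken from the head of the 
input, sampling keys uniformly across the whole input and using evenly spaced 
keys from the sorted sample as split points should balance the reducers even 
when the input is already sorted. The class and method names below are 
hypothetical and the code is plain Java with no Hadoop or Crunch dependency. 
Hadoop's InputSampler provides samplers in a similar spirit, but whether they 
fit the Crunch planner is an open question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical illustration only; not part of Crunch or of the attached patch.
class SplitPointSketch {

  // Reservoir-sample up to sampleSize keys so that every input position,
  // not just the head of the file, has the same chance of being picked.
  static List<String> sample(List<String> keys, int sampleSize, long seed) {
    Random rnd = new Random(seed);
    List<String> reservoir = new ArrayList<>();
    for (int i = 0; i < keys.size(); i++) {
      if (i < sampleSize) {
        reservoir.add(keys.get(i));
      } else {
        int j = rnd.nextInt(i + 1);
        if (j < sampleSize) {
          reservoir.set(j, keys.get(i));
        }
      }
    }
    return reservoir;
  }

  // Sort the sample and take numPartitions - 1 evenly spaced keys as split points.
  static String[] splitPoints(List<String> sample, int numPartitions) {
    List<String> sorted = new ArrayList<>(sample);
    Collections.sort(sorted);
    String[] splits = new String[numPartitions - 1];
    for (int i = 1; i < numPartitions; i++) {
      splits[i - 1] = sorted.get(i * sorted.size() / numPartitions);
    }
    return splits;
  }

  // Range partitioning: partition p receives keys in [splits[p-1], splits[p]).
  static int partitionFor(String key, String[] splits) {
    int pos = Arrays.binarySearch(splits, key);
    return pos >= 0 ? pos + 1 : -(pos + 1);
  }

  public static void main(String[] args) {
    // Already-sorted input, the worst case described above.
    List<String> keys = new ArrayList<>();
    for (int i = 0; i < 100000; i++) {
      keys.add(String.format("%06d", i));
    }
    String[] splits = splitPoints(sample(keys, 1000, 42L), 4);
    int[] counts = new int[4];
    for (String k : keys) {
      counts[partitionFor(k, splits)]++;
    }
    System.out.println(Arrays.toString(counts));
  }
}
```

Because the split points come from a random sample, the partition sizes are 
only approximately equal; a larger sample trades sampling cost for balance.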

Also, I don't know whether the same problem exists in the other sort methods 
(PTable, Pairs, etc.). I could not create a test case for them: all the tests 
eventually ran through the PCollection sort API. 
                
                  
> PCollection#sort doesn't do a full sort on values
> -------------------------------------------------
>
>                 Key: CRUNCH-23
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-23
>             Project: Crunch
>          Issue Type: Bug
>            Reporter: Gabriel Reid
>            Assignee: Rahul Sharma
>         Attachments: 
> CRUNCH-23-used-TotalOrderpartioner-for-sorting-keys.patch, SortTest.java
>
>
> When a PCollection is sorted (using PCollection#sort), the sorting that is 
> performed is only per reducer, and not an absolute sort over all values. This 
> means that the values are not in sorted order when they are iterated over on 
> a materialized collection. It also means that the sorted files that are 
> output from a sort operation cannot simply be concatenated to produce a 
> single sorted file.
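
The behaviour described in the issue can be reproduced in miniature. The sketch 
below assumes hash partitioning of values across reducers, as with Hadoop's 
default partitioner; the class and method names are hypothetical and nothing 
here calls Crunch itself. Each simulated reducer sorts its own values, yet the 
concatenated output is not globally sorted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration only; simulates reducers with in-memory lists.
class PerReducerSortSketch {

  // Hash-partition the values, sort each partition on its own,
  // then concatenate the partitions in reducer order.
  static List<String> hashPartitionedSort(List<String> values, int numReducers) {
    List<List<String>> parts = new ArrayList<>();
    for (int i = 0; i < numReducers; i++) {
      parts.add(new ArrayList<>());
    }
    for (String v : values) {
      parts.get((v.hashCode() & Integer.MAX_VALUE) % numReducers).add(v);
    }
    List<String> out = new ArrayList<>();
    for (List<String> p : parts) {
      Collections.sort(p);
      out.addAll(p);
    }
    return out;
  }

  // True if every element is less than or equal to its successor.
  static boolean isSorted(List<String> xs) {
    for (int i = 1; i < xs.size(); i++) {
      if (xs.get(i - 1).compareTo(xs.get(i)) > 0) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    List<String> out = hashPartitionedSort(Arrays.asList("a", "b", "c", "d"), 2);
    System.out.println(out + " sorted=" + isSorted(out));
  }
}
```

A range-based partitioner avoids this because every key in one reducer's 
output is ordered before every key in the next reducer's output.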
