[ https://issues.apache.org/jira/browse/HADOOP-485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12490078 ]

Tahir Hashmi commented on HADOOP-485:
-------------------------------------

Got a bit of enlightenment from Sameer on this and it turns out that the whole 
thing really is quite simple. As I understand, here's how things go:

 * The intermediate K,V pairs are partitioned into buckets using a partitioning 
function that by default hashes the key into a bucket. User-defined partitioning 
functions are supported, and they can hash on just one part of a 
composite key. So we're cool here.

 * The K,V pairs are sorted within individual map outputs and across merges from 
different maps. This is done through a user-defined comparator for keys. That 
comparator can sort on multiple parts of a composite key, so we're cool here too.

 * Before being passed to the user-defined reducer, the K,V pairs are collated, 
but this collation uses the same comparator as the sort. This is where 
we're not cool. The sort comparator needs to look at more than one field of 
the composite key, while the collation/grouping comparator should only look at 
the primary key part. We just need to be able to load a different comparator 
for the grouping step. This doesn't look like an overly complex change.

So for now, the filter chains shebang can be put in cold storage :)
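To make the sort-vs-grouping distinction concrete, here's a minimal plain-Java sketch (class and field names are made up for illustration; this is not the Hadoop API): the sort comparator orders on the full composite key, while a second grouping comparator compares only the primary part, so the collation step starts a new reduce call only when the primary part changes.

```java
import java.util.*;

// Hypothetical composite key: a "primary" grouping part plus a
// "secondary" sort-only part.
class CompositeKey {
    final String primary;
    final int secondary;
    CompositeKey(String primary, int secondary) {
        this.primary = primary;
        this.secondary = secondary;
    }
}

public class GroupingSketch {
    // Sort comparator: orders on the full composite key.
    static final Comparator<CompositeKey> SORT =
        Comparator.<CompositeKey, String>comparing(k -> k.primary)
                  .thenComparingInt(k -> k.secondary);

    // Grouping comparator: looks only at the primary part, so consecutive
    // keys with the same primary are collated into one reduce call.
    static final Comparator<CompositeKey> GROUP =
        Comparator.comparing(k -> k.primary);

    // Simulate the reduce-side collation: sort with SORT, then walk the
    // sorted run and start a new group whenever GROUP sees a change.
    static List<List<String>> collate(Map<CompositeKey, String> records) {
        List<CompositeKey> keys = new ArrayList<>(records.keySet());
        keys.sort(SORT);
        List<List<String>> groups = new ArrayList<>();
        CompositeKey prev = null;
        for (CompositeKey k : keys) {
            if (prev == null || GROUP.compare(prev, k) != 0) {
                groups.add(new ArrayList<>());
            }
            groups.get(groups.size() - 1).add(records.get(k));
            prev = k;
        }
        return groups;
    }

    public static void main(String[] args) {
        Map<CompositeKey, String> records = new LinkedHashMap<>();
        records.put(new CompositeKey("A", 3), "V3");
        records.put(new CompositeKey("A", 1), "V1");
        records.put(new CompositeKey("B", 2), "V5");
        records.put(new CompositeKey("A", 2), "V2");
        records.put(new CompositeKey("B", 1), "V4");
        // All "A" values land in one group (sorted by the secondary
        // part), all "B" values in another.
        System.out.println(collate(records));  // [[V1, V2, V3], [V4, V5]]
    }
}
```

The point of the sketch: swapping in GROUP for the collation step, while SORT still drives the ordering, is exactly the "load a different comparator" change described above.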

> allow a different comparator for grouping keys in calls to reduce
> -----------------------------------------------------------------
>
>                 Key: HADOOP-485
>                 URL: https://issues.apache.org/jira/browse/HADOOP-485
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.5.0
>            Reporter: Owen O'Malley
>         Assigned To: Tahir Hashmi
>         Attachments: Hadoop-485-pre.patch, TestUserValueGrouping.java.patch
>
>
> Some algorithms require that the values passed to reduce be sorted in a 
> particular order, but extending the key with the additional fields causes 
> them to be handled by different calls to reduce. (The user then collects the 
> values until they detect a "real" key change and then processes them.)
> It would be much easier if the framework let you define a second comparator 
> that did the grouping of values for reduces. So if your reduce inputs look like:
> A1, V1
> A2, V2
> A3, V3
> B1, V4
> B2, V5
> instead of getting calls to reduce that look like:
> reduce(A1, {V1}); reduce(A2, {V2}); reduce(A3, {V3}); reduce(B1, {V4}); 
> reduce(B2, {V5});
> you could define the grouping comparator to just compare the letters and end 
> up with:
> reduce(A1, {V1,V2,V3}); reduce(B1, {V4,V5});
> which is the desired outcome. Note that this assumes that the "extra" part of 
> the key is just for sorting because the reduce will only see the first 
> representative of each equivalence class.
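The quoted letter-grouping example can also be sketched directly in plain Java (a hypothetical illustration, not actual framework code): a grouping comparator that compares only the letter part of keys like "A1" collapses the five reduce calls into two.

```java
import java.util.*;

public class LetterGrouping {
    // Grouping comparator from the example: compare only the letter part
    // of keys like "A1", ignoring the trailing digit used for sorting.
    static final Comparator<String> GROUP_BY_LETTER =
        Comparator.comparing(k -> k.substring(0, 1));

    // Walk the already-sorted (key, value) pairs and start a new reduce
    // "call" whenever the grouping comparator sees a different letter.
    static List<List<String>> reduceGroups(List<String[]> sortedPairs) {
        List<List<String>> groups = new ArrayList<>();
        String prevKey = null;
        for (String[] pair : sortedPairs) {
            if (prevKey == null
                    || GROUP_BY_LETTER.compare(prevKey, pair[0]) != 0) {
                groups.add(new ArrayList<>());
            }
            groups.get(groups.size() - 1).add(pair[1]);
            prevKey = pair[0];
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String[]> pairs = Arrays.asList(
            new String[]{"A1", "V1"}, new String[]{"A2", "V2"},
            new String[]{"A3", "V3"}, new String[]{"B1", "V4"},
            new String[]{"B2", "V5"});
        // Two calls instead of five:
        // reduce(A1, {V1,V2,V3}); reduce(B1, {V4,V5})
        System.out.println(reduceGroups(pairs));  // [[V1, V2, V3], [V4, V5]]
    }
}
```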

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
