[jira] Commented: (HADOOP-485) allow a different comparator for grouping keys in calls to reduce

Doug Cutting (JIRA) Wed, 18 Apr 2007 10:24:35 -0700

    [ https://issues.apache.org/jira/browse/HADOOP-485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12489822 ]


Doug Cutting commented on HADOOP-485:
-------------------------------------

Another simple way to implement this: when a value comparator is 
configured, write map output as compound keys that include both the key and 
the value.  The normal sort would then take values into account.  Finally, 
the reducer would unpack the compound keys back into separate keys and values 
before calling reduce().  This could probably even be implemented entirely as 
user code:

// sets mapOutputKeyClass to compound, and sets the mapper & reducer
ValueSorting.configure(job);
ValueSorting.setMapper(MyMapper.class);
ValueSorting.setReducer(MyReducer.class);

The ValueSorting mapper would wrap the original OutputCollector in a version 
that creates compound keys.  The ValueSorting reducer would unwrap compound 
keys.
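
To make the wrap/sort/unwrap idea concrete, here is a minimal sketch in plain 
Java that does not use the Hadoop API at all.  Everything in it is 
hypothetical (ValueSortingSketch, CompoundKey, wrap, reduceCalls); it only 
simulates what the proposed ValueSorting mapper and reducer might do, 
assuming String keys and int values:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class ValueSortingSketch {

    /** Hypothetical compound key: the original key plus the value to sort on. */
    public static final class CompoundKey {
        final String key;
        final int value;

        CompoundKey(String key, int value) {
            this.key = key;
            this.value = value;
        }
    }

    /** Map side: wrap an original (key, value) pair into a compound key. */
    public static CompoundKey wrap(String key, int value) {
        return new CompoundKey(key, value);
    }

    /** Ordinary sort order on compound keys: original key first, then value. */
    static final Comparator<CompoundKey> SORT_ORDER = new Comparator<CompoundKey>() {
        @Override
        public int compare(CompoundKey a, CompoundKey b) {
            int byKey = a.key.compareTo(b.key);
            return byKey != 0 ? byKey : Integer.compare(a.value, b.value);
        }
    };

    /**
     * Reduce side: sort the compound keys, unwrap them, and group on the
     * original key only. Returns one string per simulated reduce() call.
     */
    public static List<String> reduceCalls(List<CompoundKey> mapOutput) {
        List<CompoundKey> sorted = new ArrayList<CompoundKey>(mapOutput);
        sorted.sort(SORT_ORDER);

        List<String> calls = new ArrayList<String>();
        String currentKey = null;
        List<Integer> values = new ArrayList<Integer>();
        for (CompoundKey ck : sorted) {
            // A change in the original key closes the current group.
            if (currentKey != null && !currentKey.equals(ck.key)) {
                calls.add("reduce(" + currentKey + ", " + values + ")");
                values = new ArrayList<Integer>();
            }
            currentKey = ck.key;
            values.add(ck.value);
        }
        if (currentKey != null) {
            calls.add("reduce(" + currentKey + ", " + values + ")");
        }
        return calls;
    }

    public static void main(String[] args) {
        List<CompoundKey> mapOutput = Arrays.asList(
            wrap("B", 5), wrap("A", 3), wrap("A", 1), wrap("B", 4), wrap("A", 2));
        for (String call : reduceCalls(mapOutput)) {
            System.out.println(call);
        }
        // prints:
        // reduce(A, [1, 2, 3])
        // reduce(B, [4, 5])
    }
}
```

In a real job the grouping step would have to happen inside the wrapping 
reducer, since the framework itself would still see each distinct compound 
key as a separate key.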

> allow a different comparator for grouping keys in calls to reduce
> -----------------------------------------------------------------
>
>                 Key: HADOOP-485
>                 URL: https://issues.apache.org/jira/browse/HADOOP-485
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.5.0
>            Reporter: Owen O'Malley
>         Assigned To: Tahir Hashmi
>         Attachments: Hadoop-485-pre.patch, TestUserValueGrouping.java.patch
>
>
> Some algorithms require that the values to the reduce be sorted in a 
> particular order, but extending the key with the additional fields causes  
> them to be handled by different calls to reduce. (The user then collects the 
> values until they detect a "real" key change and then processes them.)
> It would be much easier if the framework let you define a second comparator 
> that did the grouping of values for reduces. So your reduce inputs look like:
> A1, V1
> A2, V2
> A3, V3
> B1, V4
> B2, V5
> instead of getting calls to reduce that look like:
> reduce(A1, {V1}); reduce(A2, {V2}); reduce(A3, {V3}); reduce(B1, {V4}); 
> reduce(B2, {V5});
> you could define the grouping comparator to just compare the letters and end 
> up with:
> reduce(A1, {V1,V2,V3}); reduce(B1, {V4,V5});
> which is the desired outcome. Note that this assumes that the "extra" part of 
> the key is just for sorting because the reduce will only see the first 
> representative of each equivalence class.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
