[ 
https://issues.apache.org/jira/browse/HADOOP-485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12489986
 ] 

Tahir Hashmi commented on HADOOP-485:
-------------------------------------

Sameer, Devaraj, Arun and I had some involved discussions regarding this issue. 
The conclusions seem to be:

1. There's some degree of consensus that this problem can be solved entirely at 
the application layer but legacy applications might not have the luxury of 
modifiability

2. From the examples cited above, it seems like the end goal is really to sort 
the values. One of the approaches is to make the keys composite, where one part 
is used for sorting/partitioning the actual (primary) keys and the other part 
is used to order the values. This approach requires the framework to handle 
composite keys and group values by the primary subkey, sort them within a group 
by the secondary subkey and finally strip off the secondary subkey before 
calling reduce.

IMO, a better way to do this might be to support filter chains and map and 
reduce stages. The first step is that the user defined mapper emits 
intermediate K,V pairs and each of these pairs is passed through the map filter 
chain before being considered for sorting/spilling. Likewise, at the reduce end 
the reduce filters operate on K,[V] pairs and carry out necessary 
transformations before passing the output to user defined reducers. 
Additionally, we also support sorting of values through user defined 
comparators.

In this scheme of things, the application (as I understand it) would work as 
such:
   * the mapper emits PK, V pairs
   * a map filter transforms V to SK-V composite
   * the pairs are sorted and partitioned on PK to generate PK, [SK-V] groups 
and within the [SK-V] sequence, the elements are sorted by the user defined 
value comparator.
   * the reduce filter strips secondary keys from values and transforms each 
PK,[SK-V] to PK,[V]
   * the user defined reducer finally processes the PK,[V] pairs.

An alternate solution might be to attach secondary key to the primary key 
instead of values, as in the original proposal, except that in this case the 
framework would have to run the equivalent of an intermediate reduce that takes 
PK-SK, V pairs, strips off SK from each composite key and groups the values 
together before calling the user defined reducer. To me this doesn't seem to be 
a very neat approach.

Before discussing this further, I'd like to know the specific real world 
application that triggered this requirement, along with the relevant operating 
constraints it has.

> allow a different comparator for grouping keys in calls to reduce
> -----------------------------------------------------------------
>
>                 Key: HADOOP-485
>                 URL: https://issues.apache.org/jira/browse/HADOOP-485
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.5.0
>            Reporter: Owen O'Malley
>         Assigned To: Tahir Hashmi
>         Attachments: Hadoop-485-pre.patch, TestUserValueGrouping.java.patch
>
>
> Some algorithms require that the values to the reduce be sorted in a 
> particular order, but extending the key with the additional fields causes  
> them to be handled by different calls to reduce. (The user then collects the 
> values until they detect a "real" key change and then processes them.)
> It would be much easier if the framework let you define a second comparator 
> that did the grouping of values for reduces. So your reduce inputs look like:
> A1, V1
> A2, V2
> A3, V3
> B1, V4
> B2, V5
> instead of getting calls to reduce that look like:
> reduce(A1, {V1}); reduce(A2, {V2}); reduce(A3, {V3}); reduce(B1, {V4}); 
> reduce(B2, {V5});
> you could define the grouping comparator to just compare the letters and end 
> up with:
> reduce(A1, {V1,V2,V3}); reduce(B1, {V4,V5});
> which is the desired outcome. Note that this assumes that the "extra" part of 
> the key is just for sorting because the reduce will only see the first 
> representative of each equivalence class.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to