[
https://issues.apache.org/jira/browse/HADOOP-485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12489986
]
Tahir Hashmi commented on HADOOP-485:
-------------------------------------
Sameer, Devaraj, Arun and I had some involved discussions regarding this issue.
The conclusions seem to be:
1. There's some degree of consensus that this problem can be solved entirely at
the application layer, but legacy applications might not have the luxury of
being modified.
2. From the examples cited above, it seems the end goal is really to sort the
values. One approach is to make the keys composite, where one part is used for
sorting/partitioning on the actual (primary) key and the other part is used to
order the values. This approach requires the framework to handle composite
keys: group values by the primary subkey, sort them within a group by the
secondary subkey, and finally strip off the secondary subkey before calling
reduce (see the sketch just after this list).
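For concreteness, here's a minimal Java sketch of that composite-key approach.
All class and field names are mine, purely illustrative; the one framework-side
hook it relies on is the ability to register a grouping comparator distinct
from the sort comparator, which is exactly what this issue requests.

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;

    // Composite key: PK drives partitioning/grouping, SK only orders values.
    public class CompositeKey implements WritableComparable {
      Text primary = new Text();    // PK
      Text secondary = new Text();  // SK

      public void write(DataOutput out) throws IOException {
        primary.write(out);
        secondary.write(out);
      }

      public void readFields(DataInput in) throws IOException {
        primary.readFields(in);
        secondary.readFields(in);
      }

      // Full sort order: primary first, then secondary.
      public int compareTo(Object o) {
        CompositeKey that = (CompositeKey) o;
        int cmp = primary.compareTo(that.primary);
        return cmp != 0 ? cmp : secondary.compareTo(that.secondary);
      }

      // Grouping comparator that ignores the SK, so all values sharing a PK
      // land in one reduce() call regardless of their secondary keys.
      public static class GroupComparator extends WritableComparator {
        public GroupComparator() {
          super(CompositeKey.class, true);  // create key instances for raw compares
        }
        public int compare(WritableComparable a, WritableComparable b) {
          return ((CompositeKey) a).primary.compareTo(((CompositeKey) b).primary);
        }
      }
    }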
IMO, a better way to do this might be to support filter chains at the map and
reduce stages. The first step is that the user-defined mapper emits
intermediate K,V pairs, and each of these pairs is passed through the map
filter chain before being considered for sorting/spilling. Likewise, at the
reduce end the reduce filters operate on K,[V] pairs and carry out the
necessary transformations before passing the output to the user-defined
reducers. Additionally, we would also support sorting of values through
user-defined comparators.
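To pin down what such filters would see, the hooks might look like the
following. These interfaces exist nowhere in the codebase; the names and
signatures are purely hypothetical.

    import java.util.Iterator;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.io.WritableComparable;

    // Hypothetical hook, applied to each intermediate pair before it is
    // considered for sorting/spilling; several of these compose a chain.
    interface MapFilter {
      Writable filter(WritableComparable key, Writable value);
    }

    // Hypothetical hook, applied to each grouped K,[V] run before the
    // user-defined reducer sees it.
    interface ReduceFilter {
      Iterator filter(WritableComparable key, Iterator values);
    }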
In this scheme of things, the application (as I understand it) would work as
follows (a sketch of the two filters appears after this list):
* the mapper emits PK, V pairs
* a map filter transforms each V into an SK-V composite
* the pairs are sorted and partitioned on PK to generate PK, [SK-V] groups,
and within each [SK-V] sequence the elements are sorted by the user-defined
value comparator
* the reduce filter strips the secondary keys from the values, transforming
each PK, [SK-V] into PK, [V]
* the user-defined reducer finally processes the PK, [V] pairs
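Using the hypothetical interfaces above, the two filters for this application
could be sketched as follows (again, every type and helper here, SkValue and
deriveSk included, is invented for illustration):

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.io.WritableComparable;

    // SK-V composite value; a real job would pick concrete Writable
    // types for the secondary key and the value.
    class SkValue implements Writable {
      Text sk = new Text();     // secondary key, used only for ordering
      Text value = new Text();  // the actual value

      public void write(DataOutput out) throws IOException {
        sk.write(out);
        value.write(out);
      }

      public void readFields(DataInput in) throws IOException {
        sk.readFields(in);
        value.readFields(in);
      }
    }

    // Map filter: wrap each V into an SK-V composite. Here the SK is
    // assumed to be derivable from the value itself.
    class AttachSkFilter implements MapFilter {
      public Writable filter(WritableComparable key, Writable value) {
        SkValue composite = new SkValue();
        composite.sk.set(deriveSk((Text) value));  // hypothetical helper
        composite.value.set((Text) value);
        return composite;
      }
      private String deriveSk(Text v) { return v.toString(); }  // placeholder
    }

    // Reduce filter: strip the SKs so the reducer sees plain values.
    // (A real implementation would wrap the iterator rather than buffer.)
    class StripSkFilter implements ReduceFilter {
      public Iterator filter(WritableComparable key, Iterator values) {
        List stripped = new ArrayList();
        while (values.hasNext()) {
          stripped.add(((SkValue) values.next()).value);
        }
        return stripped.iterator();
      }
    }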
An alternate solution might be to attach the secondary key to the primary key
instead of to the values, as in the original proposal, except that in this
case the framework would have to run the equivalent of an intermediate reduce
that takes PK-SK, V pairs, strips the SK off each composite key and groups the
values together before calling the user-defined reducer. To me this doesn't
seem a very neat approach; a sketch of that regrouping pass follows.
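Roughly, that extra pass would amount to the following (illustrative only; it
reuses the CompositeKey sketch above and assumes the pairs arrive sorted on
the full PK-SK composite):

    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;
    import org.apache.hadoop.io.Text;

    // Stand-in for the user-defined reducer's entry point.
    interface UserReducer {
      void reduce(Text primaryKey, Iterator<Text> values);
    }

    // The "intermediate reduce": collapse a sorted run of (PK-SK, V)
    // pairs into (PK, [V]) groups before invoking the user's reducer.
    class Regrouper {
      void regroup(List<CompositeKey> keys, List<Text> values, UserReducer r) {
        Text currentPk = null;
        List<Text> group = new ArrayList<Text>();
        for (int i = 0; i < keys.size(); i++) {
          Text pk = keys.get(i).primary;
          if (currentPk != null && !pk.equals(currentPk)) {
            r.reduce(currentPk, group.iterator());  // flush finished group
            group = new ArrayList<Text>();
          }
          currentPk = pk;
          group.add(values.get(i));
        }
        if (currentPk != null) {
          r.reduce(currentPk, group.iterator());  // flush the last group
        }
      }
    }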
Before discussing this further, I'd like to know the specific real-world
application that triggered this requirement, along with its relevant operating
constraints.
> allow a different comparator for grouping keys in calls to reduce
> -----------------------------------------------------------------
>
> Key: HADOOP-485
> URL: https://issues.apache.org/jira/browse/HADOOP-485
> Project: Hadoop
> Issue Type: New Feature
> Components: mapred
> Affects Versions: 0.5.0
> Reporter: Owen O'Malley
> Assigned To: Tahir Hashmi
> Attachments: Hadoop-485-pre.patch, TestUserValueGrouping.java.patch
>
>
> Some algorithms require that the values to the reduce be sorted in a
> particular order, but extending the key with the additional fields causes
> them to be handled by different calls to reduce. (The user then collects the
> values until they detect a "real" key change and then processes them.)
> It would be much easier if the framework let you define a second comparator
> that did the grouping of values for reduces. So your reduce inputs look like:
> A1, V1
> A2, V2
> A3, V3
> B1, V4
> B2, V5
> instead of getting calls to reduce that look like:
> reduce(A1, {V1}); reduce(A2, {V2}); reduce(A3, {V3}); reduce(B1, {V4});
> reduce(B2, {V5});
> you could define the grouping comparator to just compare the letters and end
> up with:
> reduce(A1, {V1,V2,V3}); reduce(B1, {V4,V5});
> which is the desired outcome. Note that this assumes that the "extra" part of
> the key is just for sorting because the reduce will only see the first
> representative of each equivalence class.
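For the quoted example, assuming the keys are Texts like "A1", the requested
grouping comparator might be as simple as the sketch below (the class name is
mine; where it gets registered depends on the hook the eventual patch exposes):

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;

    // Groups keys by their leading letter only, so A1/A2/A3 reach one
    // reduce() call and B1/B2 another, while the full keys still sort
    // A1 < A2 < A3 < B1 < B2.
    class LetterGroupingComparator extends WritableComparator {
      LetterGroupingComparator() {
        super(Text.class, true);  // create key instances for raw compares
      }
      public int compare(WritableComparable a, WritableComparable b) {
        char la = ((Text) a).toString().charAt(0);
        char lb = ((Text) b).toString().charAt(0);
        return la - lb;
      }
    }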