[
https://issues.apache.org/jira/browse/HADOOP-475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508528
]
Vivek Ratan commented on HADOOP-475:
------------------------------------
I wanted to expand on my previous comment. When I said "this can probably be
done just as well in user code", I didn't mean that each user should write
his/her own code to do this. What I meant was that either we build a set of
user-level classes (i.e., perhaps not part of the core platform conceptually,
but still written by us), or we develop sample code and let users copy from
it. It seems to me that every time you want to define a new iterator over a
set of values, you need to clone the set of values, sort the copy using a
different comparator, and then provide an iterator over it. This can be a bit
tricky if the values don't all fit in memory: we'll need disk support for it.
As Doug points out in one of his comments on HADOOP-485, we could maybe use
SequenceFile for that.
But I think we first need to figure out how to present this to the user - what
classes should they have, how will the functionality appear to them, etc.
Runping, you should probably have some good insight into this.
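
To make the "clone the values, sort the copy, provide an iterator" idea above concrete, here is a minimal sketch of what such a user-level helper might look like. The class name BufferedValues and its methods are hypothetical and are not part of Hadoop. The sketch assumes that all values for a key fit in memory; it does not cover the disk-backed case mentioned above. If the framework reuses value objects between calls to next(), each value would also need to be copied before being buffered.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;

/**
 * Hypothetical user-level helper (not a Hadoop class).
 * Assumption: all values for one key fit in memory.
 */
class BufferedValues<T> {
    private final List<T> buffer = new ArrayList<T>();

    /** Drains the single-pass value iterator into an in-memory copy. */
    BufferedValues(Iterator<T> values) {
        while (values.hasNext()) {
            buffer.add(values.next());
        }
    }

    /** Returns a fresh iterator over the values in their original order. */
    Iterator<T> iterator() {
        return new ArrayList<T>(buffer).iterator();
    }

    /** Returns a fresh iterator over a copy sorted by the given comparator. */
    Iterator<T> sortedIterator(Comparator<? super T> order) {
        List<T> copy = new ArrayList<T>(buffer);
        Collections.sort(copy, order);
        return copy.iterator();
    }
}
```

Because every call hands back a new iterator over its own copy, a reduce method could run nested loops over the same values, or walk them in several different orders, without needing the original iterator to be clonable.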
> The value iterator to reduce function should be clonable
> --------------------------------------------------------
>
> Key: HADOOP-475
> URL: https://issues.apache.org/jira/browse/HADOOP-475
> Project: Hadoop
> Issue Type: New Feature
> Components: mapred
> Reporter: Runping Qi
> Assignee: Owen O'Malley
>
> In the current framework, when the user implements the reduce method of
> the Reducer class, the user can only iterate through the value iterator once.
> This makes it hard for the user to perform join-like operations within the
> reduce method.
> To address this problem, one approach is to make the input value iterator
> clonable. Then the user can iterate over the values in different ways.
> If the iterator can also be reset, then the user can perform nested iterations
> over the data, and thus carry out join-like operations.
> The user code in the reduce method would be something like:
> iterator1 = values.clone();
> iterator2 = values.clone();
> while (iterator1.hasNext()) {
>     val1 = iterator1.next();
>     iterator2.reset();  // restart the inner pass for each outer value
>     while (iterator2.hasNext()) {
>         val2 = iterator2.next();
>         // do something based on val1 and val2
>         .......................
>     }
> }
> One possible optimization is that if the values are sorted based on a
> secondary key, the reset function can take a secondary key as an argument and
> reset the iterator to the beginning position of that secondary key. It would
> be very helpful to have a utility that returns a list of iterators, one per
> secondary key value, from the given iterator:
> TreeMap getIteratorsBasedOnSecondaryKey(iterator);
> Each entry in the returned map object is a pair of <secondary key, iterator
> for the values with the same secondary key>.
>
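
As an illustration of the getIteratorsBasedOnSecondaryKey utility proposed in the description, the following is a rough in-memory sketch. The issue only gives the method name and the TreeMap return type; the extra secondaryKeyOf parameter, the generic types, and the enclosing class name SecondaryKeyGrouping are assumptions made here so that the example is self-contained. It also assumes the values fit in memory and are not reused by the caller between calls to next().

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

/** Hypothetical utility class (not a Hadoop class). */
final class SecondaryKeyGrouping {

    /**
     * Groups the values by secondary key and returns one iterator per key.
     * secondaryKeyOf is an assumed extra parameter that extracts the
     * secondary key from a value; the issue does not specify how keys
     * are obtained.
     */
    static <K extends Comparable<K>, V> TreeMap<K, Iterator<V>> getIteratorsBasedOnSecondaryKey(
            Iterator<V> values, Function<V, K> secondaryKeyOf) {
        // Buffer the values per secondary key; TreeMap keeps the keys sorted.
        TreeMap<K, List<V>> groups = new TreeMap<K, List<V>>();
        while (values.hasNext()) {
            V value = values.next();
            K key = secondaryKeyOf.apply(value);
            List<V> group = groups.get(key);
            if (group == null) {
                group = new ArrayList<V>();
                groups.put(key, group);
            }
            group.add(value);
        }
        // Expose each buffered group as an iterator.
        TreeMap<K, Iterator<V>> result = new TreeMap<K, Iterator<V>>();
        for (Map.Entry<K, List<V>> entry : groups.entrySet()) {
            result.put(entry.getKey(), entry.getValue().iterator());
        }
        return result;
    }
}
```

Unlike the reset-to-secondary-key optimization, this version does not require the incoming values to be sorted already, at the cost of holding every value in memory at once.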