[jira] Commented: (HADOOP-475) The value iterator to reduce function should be clonable

Vivek Ratan (JIRA) Wed, 27 Jun 2007 05:54:46 -0700

    [ 
https://issues.apache.org/jira/browse/HADOOP-475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508528
 ]


Vivek Ratan commented on HADOOP-475:
------------------------------------

I wanted to expand on my previous comment. When I said "this can probably be 
done just as well in user code", I didn't mean that each user should have to 
write this code themselves. I meant that either we build a set of user-level 
classes (i.e., perhaps not part of the core platform conceptually, but still 
written by us) or we develop sample code that users can copy from. It seems to 
me that every time you want to define a new iterator over a set of values, you 
need to clone the set of values, sort the copy using a different comparator, 
and then provide an iterator over it. This can be a bit tricky if the values 
don't all fit in memory; we'd need disk support for that. As Doug points out 
in one of his comments for HADOOP-485, we could maybe use SequenceFile for that. 
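
To make the clone-and-re-sort idea concrete, here is a minimal sketch of what 
such a user-level class might look like. The class name ReplayableValues and 
its methods are hypothetical and not part of Hadoop. It assumes all values for 
a key fit in memory, so the disk-backed case is not covered. If the framework 
reuses the value object between calls to next(), each value would also have to 
be copied before it is buffered.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;

/**
 * Hypothetical user-level helper (not a Hadoop class). Buffers the values
 * handed to reduce() so they can be iterated more than once.
 * Assumption: all values for the key fit in memory.
 */
final class ReplayableValues<T> {
    private final List<T> buffered = new ArrayList<>();

    /** Drains the single-pass iterator into an in-memory buffer. */
    ReplayableValues(Iterator<T> values) {
        while (values.hasNext()) {
            buffered.add(values.next());
        }
    }

    /** Returns a fresh iterator over the values in their original order. */
    Iterator<T> iterator() {
        return new ArrayList<>(buffered).iterator();
    }

    /** Returns a fresh iterator over a copy sorted with the given comparator. */
    Iterator<T> iterator(Comparator<? super T> order) {
        List<T> copy = new ArrayList<>(buffered);
        copy.sort(order);
        return copy.iterator();
    }
}
```

Each call returns an independent iterator, so nested loops over the same 
values work without any reset() method on the iterator itself.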

But I think we first need to figure out how to present this to the user: what 
classes they should have, how the functionality will appear to them, etc. 
Runping, you probably have some good insight into this. 

> The value iterator to reduce function should be clonable
> --------------------------------------------------------
>
>                 Key: HADOOP-475
>                 URL: https://issues.apache.org/jira/browse/HADOOP-475
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: mapred
>            Reporter: Runping Qi
>            Assignee: Owen O'Malley
>
> In the current framework, when the user implements the reduce method of the 
> Reducer class, the user can iterate through the value iterator only once. 
> This makes it hard for the user to perform join-like operations within the 
> reduce method. 
> To address this problem, one approach is to make the input value iterator 
> clonable. Then the user can iterate over the values in different ways.
> If the iterator can also be reset, then the user can perform nested iterations 
> over the data and thus carry out join-like operations.
> The user code in the reduce method would be something like:
>                  iterator1 = values.clone();
>                  iterator2 = values.clone();
>                  while (iterator1.hasNext()) {
>                       val1 = iterator1.next();
>                       // restart the inner pass for each outer value
>                       iterator2.reset();
>                       while (iterator2.hasNext()) {
>                            val2 = iterator2.next();
>                            // do something based on val1 and val2
>                            // ...
>                       }
>                  }
> One possible optimization is that if the values are sorted based on a 
> secondary key, the reset function can take a secondary key as an argument 
> and reset the iterator to the beginning position of that secondary key. 
> It would be very helpful to have a utility that returns a map of iterators, 
> one per secondary key value, from the given iterator:
>                           TreeMap getIteratorsBasedOnSecondaryKey(iterator);
> Each entry in the returned map object is a pair of <secondary key, iterator 
> for the values with the same secondary key>.
>   
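
For reference, the getIteratorsBasedOnSecondaryKey utility described above 
could be sketched roughly as follows. This is only an illustration under 
stated assumptions: the enclosing class SecondaryKeyGrouping and the extra 
secondaryKeyOf parameter (a function that extracts the secondary key from a 
value) are hypothetical additions, since the description does not say how the 
secondary key is obtained. The sketch also assumes all values fit in memory.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

/** Hypothetical user-level utility (not a Hadoop class). */
final class SecondaryKeyGrouping {
    private SecondaryKeyGrouping() {}

    /**
     * Groups values by secondary key and returns one iterator per key,
     * ordered by the key's natural ordering.
     * Assumption: secondaryKeyOf is supplied by the caller, and all values
     * fit in memory.
     */
    static <K extends Comparable<K>, V> TreeMap<K, Iterator<V>> getIteratorsBasedOnSecondaryKey(
            Iterator<V> values, Function<? super V, ? extends K> secondaryKeyOf) {
        // First pass: bucket every value under its secondary key.
        TreeMap<K, List<V>> groups = new TreeMap<>();
        while (values.hasNext()) {
            V value = values.next();
            groups.computeIfAbsent(secondaryKeyOf.apply(value), k -> new ArrayList<>())
                  .add(value);
        }
        // Second pass: expose each bucket as an iterator.
        TreeMap<K, Iterator<V>> iterators = new TreeMap<>();
        for (Map.Entry<K, List<V>> entry : groups.entrySet()) {
            iterators.put(entry.getKey(), entry.getValue().iterator());
        }
        return iterators;
    }
}
```

Because the values are bucketed rather than sorted in place, the input does 
not need to arrive ordered by secondary key; values within a bucket keep their 
arrival order.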

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
