[jira] Updated: (HADOOP-475) The value iterator to reduce function should be clonable

Sameer Paranjpye (JIRA) Thu, 28 Sep 2006 13:25:35 -0700

     [ http://issues.apache.org/jira/browse/HADOOP-475?page=all ]


Sameer Paranjpye updated HADOOP-475:
------------------------------------

    Component/s: mapred
    Description: 
In the current framework, when the user implements the reduce method of the 
Reducer class, the user can iterate through the value iterator only once. 
This makes it hard to perform join-like operations within the reduce method. 
To address this problem, one approach is to make the input value iterator 
clonable. Then the user can iterate over the values in different ways.
If the iterator can also be reset, the user can perform nested iterations over 
the data and thus carry out join-like operations.

The user code in the reduce method would look something like:

                 iterator1 = values.clone();
                 iterator2 = values.clone();
                 while (iterator1.hasNext()) {
                      val1 = iterator1.next();
                      iterator2.reset();   // rewind the inner iterator for each outer value
                      while (iterator2.hasNext()) {
                           val2 = iterator2.next();
                           // do something based on val1 and val2
                           .......................
                      }
                 }
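
As an illustration only, the nested iteration above could be sketched in Java 
as follows. The names ResettableIterator, BufferedValues, copy() and 
crossPairs are hypothetical and are not part of Hadoop; the sketch also 
assumes that all values for one key fit in memory, which the proposal itself 
does not state.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical interface (not a Hadoop API): an iterator that can be
// rewound and copied, as the proposal asks for.
interface ResettableIterator<T> extends Iterator<T> {
    void reset();                    // rewind to the first value
    ResettableIterator<T> copy();    // independent cursor over the same values
}

// Assumption: the values are buffered in memory so they can be replayed.
final class BufferedValues<T> implements ResettableIterator<T> {
    private final List<T> buffer;
    private int position = 0;

    BufferedValues(Iterator<T> source) {
        buffer = new ArrayList<>();
        while (source.hasNext()) {
            buffer.add(source.next());
        }
    }

    // Used by copy(): shares the buffer but keeps its own position.
    private BufferedValues(List<T> shared) {
        buffer = shared;
    }

    @Override public boolean hasNext() { return position < buffer.size(); }

    @Override public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return buffer.get(position++);
    }

    @Override public void reset() { position = 0; }

    @Override public ResettableIterator<T> copy() { return new BufferedValues<>(buffer); }
}

final class NestedIterationDemo {
    // Pairs every value with every value, mirroring the nested loops above.
    static List<String> crossPairs(Iterator<String> values) {
        ResettableIterator<String> iterator1 = new BufferedValues<>(values);
        ResettableIterator<String> iterator2 = iterator1.copy();
        List<String> pairs = new ArrayList<>();
        while (iterator1.hasNext()) {
            String val1 = iterator1.next();
            iterator2.reset();
            while (iterator2.hasNext()) {
                String val2 = iterator2.next();
                pairs.add(val1 + "," + val2);
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        System.out.println(crossPairs(Arrays.asList("a", "b").iterator()));
    }
}
```

Buffering is the simplest way to make the values replayable; a real 
implementation might instead re-read the sorted input, which this sketch does 
not attempt.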

One possible optimization: if the values are sorted on a secondary key, 
the reset function can take a secondary key as an argument and reset the 
iterator to the beginning position of that secondary key. It would also be 
very helpful to have a utility that returns a map of iterators, 
one per secondary key value, from the given iterator:

                          TreeMap getIteratorsBasedOnSecondaryKey(iterator);

Each entry in the returned map object is a pair of <secondary key, iterator 
over the values with that secondary key>.
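
A minimal sketch of such a utility is shown below, for illustration only. The 
extra secondaryKeyOf parameter is an assumption (the proposal does not say how 
the secondary key is obtained from a value), and the sketch groups the values 
in memory rather than relying on them being sorted already.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

final class SecondaryKeyGrouping {
    // Hypothetical helper, not a Hadoop API. secondaryKeyOf is an assumed
    // callback that extracts the secondary key from a value.
    static <K extends Comparable<K>, V> TreeMap<K, Iterator<V>> getIteratorsBasedOnSecondaryKey(
            Iterator<V> values, Function<V, K> secondaryKeyOf) {
        // Collect the values per secondary key; TreeMap keeps keys sorted.
        TreeMap<K, List<V>> groups = new TreeMap<>();
        while (values.hasNext()) {
            V value = values.next();
            groups.computeIfAbsent(secondaryKeyOf.apply(value), k -> new ArrayList<>())
                  .add(value);
        }
        // Expose each group as an iterator, one entry per secondary key.
        TreeMap<K, Iterator<V>> iterators = new TreeMap<>();
        for (Map.Entry<K, List<V>> entry : groups.entrySet()) {
            iterators.put(entry.getKey(), entry.getValue().iterator());
        }
        return iterators;
    }
}
```

With values tagged by source, for example "L:1", "R:x", "L:2" keyed on the 
prefix before the colon, the returned map would hold one iterator for "L" and 
one for "R", which a join could then combine.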

  



> The value iterator to reduce function should be clonable
> --------------------------------------------------------
>
>                 Key: HADOOP-475
>                 URL: http://issues.apache.org/jira/browse/HADOOP-475
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: mapred
>            Reporter: Runping Qi
>
> In the current framework, when the user implements the reduce method of the 
> Reducer class, the user can iterate through the value iterator only once. 
> This makes it hard to perform join-like operations within the reduce method. 
> To address this problem, one approach is to make the input value iterator 
> clonable. Then the user can iterate over the values in different ways.
> If the iterator can also be reset, the user can perform nested iterations 
> over the data and thus carry out join-like operations.
> The user code in the reduce method would look something like:
>                  iterator1 = values.clone();
>                  iterator2 = values.clone();
>                  while (iterator1.hasNext()) {
>                       val1 = iterator1.next();
>                       iterator2.reset();
>                       while (iterator2.hasNext()) {
>                            val2 = iterator2.next();
>                            // do something based on val1 and val2
>                            .......................
>                       }
>                  }
> One possible optimization: if the values are sorted on a secondary key, 
> the reset function can take a secondary key as an argument and reset the 
> iterator to the beginning position of that secondary key. It would also be 
> very helpful to have a utility that returns a map of iterators, 
> one per secondary key value, from the given iterator:
>                           TreeMap getIteratorsBasedOnSecondaryKey(iterator);
> Each entry in the returned map object is a pair of <secondary key, iterator 
> over the values with that secondary key>.
>   

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
