Set the Order of the Keys in Reduce

2009-01-22 Thread Brian MacKay
Hello,

Any tips would be greatly appreciated.

Is there a way to set the order of the keys in reduce as shown below, no
matter what order the collection in MAP occurs in?

Thanks, Brian

public void map(WritableComparable key, Text values,
    OutputCollector<Text, Text> output, Reporter reporter)
    throws IOException {

  // collect many CAT_A and CAT_B in random order
  output.collect(CAT_A, details);
  output.collect(CAT_B, details);
}

public void reduce(Text key, Iterator<Text> values,
    OutputCollector<Text, Text> output, Reporter reporter)
    throws IOException {

  // always reduce CAT_A first, then reduce CAT_B
}

_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

The information transmitted is intended only for the person or entity to 
which it is addressed and may contain confidential and/or privileged 
material. Any review, retransmission, dissemination or other use of, or 
taking of any action in reliance upon, this information by persons or 
entities other than the intended recipient is prohibited. If you received 
this message in error, please contact the sender and delete the material 
from any computer.



Re: Set the Order of the Keys in Reduce

2009-01-22 Thread Tom White
Hi Brian,

The CAT_A and CAT_B keys will be processed by different reducer
instances, so they run independently and may run in any order. What's
the output that you're trying to get?

Cheers,
Tom





RE: Set the Order of the Keys in Reduce

2009-01-22 Thread Brian MacKay
Hello Tom,

I would like to apply some rules to CAT_A, then use the output of CAT_A to
reduce CAT_B. I'd rather not run two jobs, so perhaps I need two
reducers?

The first reducer would process CAT_A; then, when it completes, the second
reducer would process CAT_B.

I suppose this would accomplish the same thing?







Re: Set the Order of the Keys in Reduce

2009-01-22 Thread Tom White
Reducers run independently and without knowledge of one another, so
you can't get one reducer to depend on the output of another. I think
having two jobs is the simplest way to achieve what you're trying to
do.

Tom
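
To make the two-job suggestion concrete, here is a minimal plain-Java sketch of the dependency it resolves. It does not use Hadoop at all: each method only stands in for the reduce step of one job, and the class name, the method names, and the "ok:" rule are hypothetical, chosen purely for illustration. The point is that the second pass can start only once the first pass has produced its complete output.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Plain-Java stand-in for two chained jobs; no Hadoop classes are used.
final class TwoPassSketch {

    // Pass 1 (stands in for the first job): reduce the CAT_A values.
    // Hypothetical rule: accept values tagged "ok:" and strip the tag.
    static Set<String> reduceCatA(List<String> catAValues) {
        Set<String> accepted = new LinkedHashSet<>();
        for (String value : catAValues) {
            if (value.startsWith("ok:")) {
                accepted.add(value.substring(3));
            }
        }
        return accepted;
    }

    // Pass 2 (stands in for the second job): reduce the CAT_B values,
    // keeping only those that the first pass accepted.
    static List<String> reduceCatB(List<String> catBValues, Set<String> catAOutput) {
        List<String> kept = new ArrayList<>();
        for (String value : catBValues) {
            if (catAOutput.contains(value)) {
                kept.add(value);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // The first pass must finish before the second one can use its output.
        Set<String> catAOutput = reduceCatA(Arrays.asList("ok:x", "bad:y", "ok:z"));
        System.out.println(reduceCatB(Arrays.asList("x", "y", "z", "w"), catAOutput));
    }
}
```

In a real cluster the output of the first job would be written to HDFS and read by the second job; the sketch only passes it as a method argument.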






Re: Set the Order of the Keys in Reduce

2009-01-22 Thread Owen O'Malley


On Jan 22, 2009, at 7:25 AM, Brian MacKay wrote:

> Is there a way to set the order of the keys in reduce as shown below, no
> matter what order the collection in MAP occurs in.


The keys to reduce are *always* sorted. If the default order is not  
correct, you can change the compare function.


As Tom points out, the critical thing is making sure that all of the  
keys that you need to group together go to the same reduce. So let's  
make it a little more concrete and say that you have:


public class TextPair implements Writable {
  public TextPair() {}
  public void set(String left, String right);
  public String getLeft();
  ...
}

And your map 0 does:
  key.set(CAT, B);
  output.collect(key, value);
  key.set(DOG, A);
  output.collect(key, value);

While map 1 does:
  key.set(CAT, A);
  output.collect(key, value);
  key.set(DOG, B);
  output.collect(key, value);

If you want to make sure that all of the cats go to the same reduce
and that all of the dogs go to the same reduce, you need to set the
partitioner. It would look like:


public class MyPartitioner<V> implements Partitioner<TextPair, V> {

  public void configure(JobConf job) {}

  public int getPartition(TextPair key, V value,
                          int numReduceTasks) {
    return (key.getLeft().hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}

Then define a raw comparator that sorts based on both the left and  
right part of the TextPair, and you are set.


-- Owen
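
The effect of combining the two pieces can be tried without a cluster. The following is a minimal plain-Java simulation, not Hadoop code: `Pair` is a hypothetical stand-in for TextPair, `partition` repeats the arithmetic of the partitioner above, and `LEFT_THEN_RIGHT` plays the role of the raw comparator. In an actual job with the old mapred API, the partitioner and comparator would be registered with `JobConf.setPartitionerClass` and `JobConf.setOutputKeyComparatorClass`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Plain-Java simulation of the shuffle; no Hadoop classes are used.
final class KeyOrderSketch {

    // Hypothetical stand-in for TextPair: left = group, right = category.
    static final class Pair {
        final String left;
        final String right;

        Pair(String left, String right) {
            this.left = left;
            this.right = right;
        }

        @Override
        public String toString() {
            return "(" + left + ", " + right + ")";
        }
    }

    // Sort order the raw comparator would provide: left part, then right part.
    static final Comparator<Pair> LEFT_THEN_RIGHT =
            Comparator.comparing((Pair p) -> p.left).thenComparing(p -> p.right);

    // Same arithmetic as the partitioner: only the left part picks the reduce.
    static int partition(Pair key, int numReduceTasks) {
        return (key.left.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Returns, for each reduce, the keys in the order that reduce would see them.
    static List<List<Pair>> shuffle(List<Pair> mapOutput, int numReduceTasks) {
        List<List<Pair>> reduces = new ArrayList<>();
        for (int i = 0; i < numReduceTasks; i++) {
            reduces.add(new ArrayList<>());
        }
        for (Pair key : mapOutput) {
            reduces.get(partition(key, numReduceTasks)).add(key);
        }
        for (List<Pair> keys : reduces) {
            keys.sort(LEFT_THEN_RIGHT);
        }
        return reduces;
    }

    public static void main(String[] args) {
        // Output of map 0 followed by output of map 1, as in the example above.
        List<Pair> mapOutput = Arrays.asList(
                new Pair("CAT", "B"), new Pair("DOG", "A"),
                new Pair("CAT", "A"), new Pair("DOG", "B"));
        List<List<Pair>> reduces = shuffle(mapOutput, 3);
        for (int i = 0; i < reduces.size(); i++) {
            System.out.println("reduce " + i + ": " + reduces.get(i));
        }
    }
}
```

Because every key with the same left part lands in the same reduce and the keys are sorted on both parts, that reduce always sees the A key of a group before its B key, whatever order the maps emitted them in.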


RE: Set the Order of the Keys in Reduce

2009-01-22 Thread Brian MacKay
Owen, thanks for joining in.

I suppose what is needed is a new config setting called
SequenceReducer. In it you would specify multiple reducer classes in
the order you would like the JobTracker to execute them. When the map
phase completes, MyReducerA.class would run, and it would specify the
keys it should reduce rather than all existing keys. In Owen's example,
this could be CAT. When all instances of MyReducerA finish reducing CAT,
the JobTracker would move on to the next reducer in the list. MyReducerB
could then retrieve the values reduced from CAT in HDFS and use them as
a filter to reduce DOG.

List list = new ArrayList();

list.add(MyReducerA.class); // Reduces CAT
list.add(MyReducerB.class); // Reduces DOG

conf.setSequenceReducer(list);


I agree with the previous posts and appreciate everyone's insights and
participation. What I proposed above is not simple. But when one
considers the size of the job, running it twice doesn't make a lot of
sense. Should one rerun a 40 GB job file because the values reduced in
CAT are needed to filter the reduce of DOG? A better way must exist!

Owen, maybe I misunderstood your message, but it seems that even with
the addition of a partitioner and raw comparator, the point in Tom's
post would still prevent what I'm trying to do unless something like
the suggestion above exists:

> you can't get one reducer to depend on the output of another.


Thanks, Brian


