Set the Order of the Keys in Reduce
Hello,

Any tips would be greatly appreciated. Is there a way to set the order of the keys in reduce as shown below, no matter what order the collection in MAP occurs in?

Thanks,
Brian

    public void map(WritableComparable key, Text values,
        OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
      // collect many CAT_A and CAT_B in random order
      output.collect(CAT_A, details);
      output.collect(CAT_B, details);
    }

    public void reduce(Text key, Iterator<Text> values,
        OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
      // always reduce CAT_A first, then reduce CAT_B
    }

_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
The information transmitted is intended only for the person or entity to which it is addressed and may contain confidential and/or privileged material. Any review, retransmission, dissemination or other use of, or taking of any action in reliance upon, this information by persons or entities other than the intended recipient is prohibited. If you received this message in error, please contact the sender and delete the material from any computer.
Re: Set the Order of the Keys in Reduce
Hi Brian,

The CAT_A and CAT_B keys will be processed by different reducer instances, so they run independently and may run in any order. What's the output that you're trying to get?

Cheers,
Tom

On Thu, Jan 22, 2009 at 3:25 PM, Brian MacKay brian.mac...@medecision.com wrote:
> Is there a way to set the order of the keys in reduce as shown below, no matter what order the collection in MAP occurs in.
RE: Set the Order of the Keys in Reduce
Hello Tom,

I would like to apply some rules to CAT_A, then use the output of CAT_A to reduce CAT_B. I'd rather not run two jobs, so perhaps I need two reducers? The first reducer processes CAT_A, then, when it completes, the second reducer does CAT_B? I suppose this would accomplish the same thing?

-Original Message-
From: Tom White [mailto:t...@cloudera.com]
Sent: Thursday, January 22, 2009 10:41 AM
To: core-user@hadoop.apache.org
Subject: Re: Set the Order of the Keys in Reduce

> The CAT_A and CAT_B keys will be processed by different reducer instances, so they run independently and may run in any order. What's the output that you're trying to get?
Re: Set the Order of the Keys in Reduce
Reducers run independently and without knowledge of one another, so you can't get one reducer to depend on the output of another. I think having two jobs is the simplest way to achieve what you're trying to do.

Tom

On Thu, Jan 22, 2009 at 3:48 PM, Brian MacKay brian.mac...@medecision.com wrote:
> Would like to apply some rules To CAT_A, then use the output of CAT_A to reduce CAT_B. I'd rather not run two JOBS, so perhaps I need two reducers?
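The two-job suggestion can be pictured as two passes, where the second pass starts only after the first has finished and reads its output. The sketch below simulates that dataflow in plain Java rather than with Hadoop's job API. The class and method names are hypothetical, and the "rule" applied to CAT_A (keep the distinct values, then use them as a filter for CAT_B) is only an assumption for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

final class TwoJobSketch {

    // Stand-in for job 1: reduce only the CAT_A values.
    // The "rule" here (keep the distinct values) is a placeholder assumption.
    static Set<String> reduceCatA(List<String> catAValues) {
        return new TreeSet<>(catAValues);
    }

    // Stand-in for job 2: reduce CAT_B, using job 1's finished output as a filter.
    static List<String> reduceCatB(List<String> catBValues, Set<String> catAResult) {
        List<String> kept = new ArrayList<>();
        for (String value : catBValues) {
            if (catAResult.contains(value)) {
                kept.add(value);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Pass 1 must complete before pass 2 starts, which is what two jobs guarantee.
        Set<String> catAResult = reduceCatA(Arrays.asList("x", "y", "x"));
        List<String> catBResult = reduceCatB(Arrays.asList("y", "z", "x"), catAResult);
        System.out.println(catBResult);
    }
}
```

In a real cluster the first pass would write its result to HDFS and the second job would read it back. The ordering guarantee comes from the second job not being submitted until the first has completed.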
Re: Set the Order of the Keys in Reduce
On Jan 22, 2009, at 7:25 AM, Brian MacKay wrote:
> Is there a way to set the order of the keys in reduce as shown below, no matter what order the collection in MAP occurs in.

The keys to reduce are *always* sorted. If the default order is not correct, you can change the compare function.

As Tom points out, the critical thing is making sure that all of the keys that you need to group together go to the same reduce.

So let's make it a little more concrete and say that you have:

    public class TextPair implements Writable {
      public TextPair() {}
      public void set(String left, String right);
      public String getLeft();
      ...
    }

And your map 0 does:

    key.set("CAT", "B");
    output.collect(key, value);
    key.set("DOG", "A");
    output.collect(key, value);

While map 1 does:

    key.set("CAT", "A");
    output.collect(key, value);
    key.set("DOG", "B");
    output.collect(key, value);

If you want to make sure that all of the cats go to the same reduce and that all of the dogs go to the same reduce, you need to set the partitioner. It would look like:

    public class MyPartitioner<V> implements Partitioner<TextPair, V> {
      public void configure(JobConf job) {}
      public int getPartition(TextPair key, V value, int numReduceTasks) {
        // partition on the left part only, masking off the sign bit
        return (key.getLeft().hashCode() & Integer.MAX_VALUE) % numReduceTasks;
      }
    }

Then define a raw comparator that sorts based on both the left and right part of the TextPair, and you are set.

-- Owen
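To see what the partitioner and comparator buy you together, here is a small plain-Java simulation of the shuffle. It does not use Hadoop. The TextPair class below is a simplified stand-in for the Writable above, and SecondarySortSketch, partition, ORDER and shuffle are hypothetical names chosen for this sketch. The partition arithmetic is the same as in getPartition.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class SecondarySortSketch {

    // Simplified stand-in for the TextPair key (not a Hadoop Writable).
    static final class TextPair {
        final String left;
        final String right;

        TextPair(String left, String right) {
            this.left = left;
            this.right = right;
        }

        @Override
        public String toString() {
            return left + "/" + right;
        }
    }

    // Same arithmetic as getPartition: only the left part picks the reduce.
    static int partition(TextPair key, int numReduceTasks) {
        return (key.left.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // The order the raw comparator would impose: left first, then right.
    static final Comparator<TextPair> ORDER =
        Comparator.comparing((TextPair p) -> p.left).thenComparing(p -> p.right);

    // Simulated shuffle: bucket keys by partition, then sort each bucket.
    static Map<Integer, List<TextPair>> shuffle(List<TextPair> mapOutput, int numReduceTasks) {
        Map<Integer, List<TextPair>> byReduce = new TreeMap<>();
        for (TextPair key : mapOutput) {
            int p = partition(key, numReduceTasks);
            byReduce.computeIfAbsent(p, k -> new ArrayList<>()).add(key);
        }
        for (List<TextPair> keys : byReduce.values()) {
            keys.sort(ORDER);
        }
        return byReduce;
    }

    public static void main(String[] args) {
        // Output of map 0 followed by map 1, in the order they were collected.
        List<TextPair> mapOutput = Arrays.asList(
            new TextPair("CAT", "B"), new TextPair("DOG", "A"),
            new TextPair("CAT", "A"), new TextPair("DOG", "B"));
        System.out.println(shuffle(mapOutput, 3));
    }
}
```

Whatever order the maps emit them in, CAT/A and CAT/B land in the same bucket and CAT/A comes first. Since a reduce task works through its keys sequentially in sorted order, it sees CAT/A before CAT/B.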
RE: Set the Order of the Keys in Reduce
Owen,

Thanks for joining in.

I suppose what is needed is a new config setting called SequenceReducer. In it you would specify multiple reducer classes in the order you would like them executed by the JobTracker. When map completes, MyReducerA.class would run, and in it would be specified the keys it should reduce, rather than all existing keys. In Owen's example, this could be CAT. When all instances of MyReducerA complete reducing CAT, the JobTracker would move on to the next reducer in the list. MyReducerB could then retrieve the values reduced down from CAT in HDFS as a filter to reduce DOG.

    List list = new ArrayList();
    list.add(MyReducerA.class); // Reduces CAT
    list.add(MyReducerB.class); // Reduces DOG
    conf.setSequenceReducer(list);

I agree with the previous posts and appreciate everyone's insights and participation. What I proposed above is not simple. But when one considers the size of the job, running it twice doesn't make a lot of sense. Should one rerun a 40 GB job file because the values reduced in CAT are needed to filter the reduce of DOG? A better way must exist!

Owen, maybe I misunderstood your message, but it seems that even with the addition of a partitioner and raw comparator, Tom's point would still prevent what I'm trying to do unless something like what is suggested above exists in some fashion:

> you can't get one reducer to depend on the output of another.

Thanks,
Brian

-Original Message-
From: Tom White [mailto:t...@cloudera.com]
Sent: Thursday, January 22, 2009 11:04 AM
To: core-user@hadoop.apache.org
Subject: Re: Set the Order of the Keys in Reduce

> Reducers run independently and without knowledge of one another, so you can't get one reducer to depend on the output of another. I think having two jobs is the simplest way to achieve what you're trying to do.