Re: correct pattern for using setOutputValueGroupingComparator?

2009-01-06 Thread Devaraj Das



On 1/6/09 9:47 AM, Meng Mao meng...@gmail.com wrote:

 Unfortunately, my team is on 0.15 :(. We are looking to upgrade to 0.18 as
 soon as we upgrade our hardware (long story).
 From comparing the 0.15 and 0.19 mapreduce tutorials, and looking at the
 4545 patch, I don't see anything that seems majorly different about the
 MapReduce API?
 - There's a Partitioner that's used, but that seems optional?
 - I see that 0.19 still provides setOutputValueGroupingComparator; is the
 setGroupingComparatorClass in the patch from the 0.20 API?
 
Yes, setGroupingComparatorClass was defined in the new MapReduce API and does
the same thing.

 I have an associated question -- is it possible to use this
 GroupingComparator technique to perform essentially a one-to-many mapping?
 Let's say I have records like so:
 id_1  -   metadata
 id_2  -   metadata
 id_1  A  numbers
 id_2  B  numbers
 id_1  C  numbers
 
 Would it be possible for a key,value pair for id_1, -, metadata to map
 to both the groups for the keys id_1, A and id_1, C ?  The comparator
 seems easy to achieve; but I don't see multiple copies of a record being
 sent to multiple groups.  I know it's a bit unusual, but it would be useful
 for us to have this kind of wildcard behavior.
 
No, that's not possible without changing your app to generate that many
records. For example, in your map, you could output multiple records
corresponding to each wild-card record.
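
A minimal sketch of that fan-out idea, in plain Java rather than against the
Hadoop Mapper interface so that it runs on its own. It assumes the map side
already knows which second-column tags exist (the knownTags list); that
assumption, and the names WildcardFanOut, WILDCARD, and knownTags, are
hypothetical and not from the thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class WildcardFanOut {
    // Hypothetical marker for a wildcard record, matching the "-" in the example.
    static final String WILDCARD = "-";

    /**
     * Stands in for a map() call: returns the records to emit for one input.
     * A wildcard record is copied once per known tag; any other record is
     * passed through unchanged.
     */
    static List<String> map(String id, String tag, String payload, List<String> knownTags) {
        List<String> out = new ArrayList<>();
        if (WILDCARD.equals(tag)) {
            for (String t : knownTags) {
                out.add(id + "\t" + t + "\t" + payload);
            }
        } else {
            out.add(id + "\t" + tag + "\t" + payload);
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumed to be known up front; in a real job this would come from config.
        List<String> knownTags = Arrays.asList("A", "C");
        for (String r : map("id_1", "-", "metadata", knownTags)) {
            System.out.println(r);
        }
        for (String r : map("id_1", "A", "numbers", knownTags)) {
            System.out.println(r);
        }
    }
}
```

With the copies emitted under the keys (id_1, A) and (id_1, C), each group
receives its own copy of the metadata through the normal shuffle.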
 
 Meng
 
 
 
 On Mon, Jan 5, 2009 at 6:58 PM, Owen O'Malley omal...@apache.org wrote:
 
 This is exactly what the setOutputValueGroupingComparator is for. Take a
 look at HADOOP-4545, for an example using the secondary sort. If you are
 using trunk or 0.20, look at
 src/examples/org/apache/hadoop/examples/SecondarySort.java. The checked in
 example uses the new map/reduce api that was introduced in 0.20.
 
 -- Owen
 




correct pattern for using setOutputValueGroupingComparator?

2009-01-05 Thread Meng Mao
I'm trying to use map reduce to merge two classes of files, each class
using the same keys for grouping. An example:
class 1 input file:
id_1 A metadatum
id_2 A metadatum
id_1 A metadatum

class 2 input file:
id_1 B some numbers
id_1 B some numbers
id_2 B some numbers

I map using the first token, an id string, as the key. Ideally, the
intermediate input to the reducer class would be this (for the key id_1):
id_1 A metadatum
id_1 A metadatum
id_1 B some numbers
id_1 B some numbers

But because there's no guarantee on sorting for the values, we can see:
id_1 B some numbers
id_1 A metadatum
id_1 B some numbers
id_1 A metadatum


I was wondering if I could use setOutputValueGroupingComparator to force
records of the first class to sort to the top. I'm having a hard time
interpreting the documentation though:
If equivalence rules for grouping the intermediate keys are required to be
different from those for grouping keys before reduction, then one may
specify a Comparator via
JobConf.setOutputValueGroupingComparator(Class)
<http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/JobConf.html#setOutputValueGroupingComparator%28java.lang.Class%29>.
Since JobConf.setOutputKeyComparatorClass(Class)
<http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/JobConf.html#setOutputKeyComparatorClass%28java.lang.Class%29>
can be used to control how intermediate keys are grouped, these can be used
in conjunction to simulate *secondary sort on values*.

My interpretation is as follows:
--
class 1 input file:
id_1 A metadatum
id_1 A metadatum

class 2 input file:
id_1 B some numbers
id_2 B some numbers

Map with key = first column + delimiter + second column. Supply
setOutputKeyComparatorClass such that it only compares based on the first
half of the key. Supply setOutputValueGroupingComparator such that it only
compares based on the second half of the key. Thus, all keys like id_1* go
to the same group, and then it is sorted within that group with As first,
and then Bs (or reverse if needed).
--

Am I vastly overthinking how setOutputValueGroupingComparator works? I can't
tell from the docs if it is possible to peek at the values associated with
the pair of keys in each comparison. If it is, I probably wouldn't have to
use a different key as done in my interpretation.


Re: correct pattern for using setOutputValueGroupingComparator?

2009-01-05 Thread Owen O'Malley
This is exactly what the setOutputValueGroupingComparator is for. Take  
a look at HADOOP-4545, for an example using the secondary sort. If you  
are using trunk or 0.20, look at src/examples/org/apache/hadoop/ 
examples/SecondarySort.java. The checked in example uses the new map/ 
reduce api that was introduced in 0.20.


-- Owen
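
For readers without the HADOOP-4545 patch at hand, the following is a rough,
Hadoop-free simulation of the secondary sort idea, not the SecondarySort.java
example itself. It assumes a composite key of (id, class tag): the sort
comparator orders by the whole key, while the grouping comparator compares
only the id, so one reduce call sees every record for an id with the A records
ahead of the B records. The names Rec, SORT, GROUP, and shuffle are
hypothetical. In a real job the two comparators would be registered through
setOutputKeyComparatorClass and setOutputValueGroupingComparator, and a
Partitioner on the id alone would be needed so that all records for an id
reach the same reducer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class SecondarySortSketch {
    /** Composite key (id, tag) with the value carried alongside. */
    static final class Rec {
        final String id;
        final String tag;
        final String value;

        Rec(String id, String tag, String value) {
            this.id = id;
            this.tag = tag;
            this.value = value;
        }
    }

    // Plays the role of the output key comparator: orders by id, then by tag.
    static final Comparator<Rec> SORT =
            Comparator.comparing((Rec r) -> r.id).thenComparing((Rec r) -> r.tag);

    // Plays the role of the grouping comparator: looks at the id only.
    static final Comparator<Rec> GROUP = Comparator.comparing((Rec r) -> r.id);

    /** Sorts the map output, then cuts it into one list per reduce call. */
    static List<List<String>> shuffle(List<Rec> mapOutput) {
        List<Rec> sorted = new ArrayList<>(mapOutput);
        sorted.sort(SORT);
        List<List<String>> groups = new ArrayList<>();
        Rec prev = null;
        for (Rec r : sorted) {
            // A new group starts whenever the grouping comparator sees a change.
            if (prev == null || GROUP.compare(prev, r) != 0) {
                groups.add(new ArrayList<>());
            }
            groups.get(groups.size() - 1).add(r.id + " " + r.tag + " " + r.value);
            prev = r;
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Rec> mapOutput = Arrays.asList(
                new Rec("id_1", "B", "some numbers"),
                new Rec("id_1", "A", "metadatum"),
                new Rec("id_2", "B", "some numbers"),
                new Rec("id_1", "B", "some numbers"),
                new Rec("id_2", "A", "metadatum"),
                new Rec("id_1", "A", "metadatum"));
        for (List<String> group : shuffle(mapOutput)) {
            System.out.println(group);
        }
    }
}
```

Both comparators see only keys, never values, which is why the class tag has
to be moved into the key for this to work.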