Re: correct pattern for using setOutputValueGroupingComparator?
On 1/6/09 9:47 AM, Meng Mao meng...@gmail.com wrote:

Unfortunately, my team is on 0.15 :(. We are looking to upgrade to 0.18 as soon as we upgrade our hardware (long story). From comparing the 0.15 and 0.19 mapreduce tutorials, and looking at the 4545 patch, I don't see anything that seems majorly different about the MapReduce API?
- There's a Partitioner that's used, but that seems optional?
- I see that 0.19 still provides setOutputValueGroupingComparator; is the setGroupingComparatorClass in the patch from the 0.20 API?

Yes, setGroupingComparator got defined in the new MapReduce API and is doing the same thing.

I have an associated question -- is it possible to use this GroupingComparator technique to perform essentially a one-to-many mapping? Let's say I have records like so:

id_1 - metadata
id_2 - metadata
id_1 A numbers
id_2 B numbers
id_1 C numbers

Would it be possible for a key,value pair for id_1, -, metadata to map to both the groups for the keys id_1, A and id_1, C? The comparator seems easy to achieve, but I don't see multiple copies of a record being sent to multiple groups. I know it's a bit unusual, but it would be useful for us to have this kind of wildcard behavior.

No, that's not possible without changing your app to generate that many records. So for example, in your map, you could output multiple records corresponding to the wild-card records.

Meng

On Mon, Jan 5, 2009 at 6:58 PM, Owen O'Malley omal...@apache.org wrote:

This is exactly what the setOutputValueGroupingComparator is for. Take a look at HADOOP-4545, for an example using the secondary sort. If you are using trunk or 0.20, look at src/examples/org/apache/hadoop/examples/SecondarySort.java. The checked in example uses the new map/reduce api that was introduced in 0.20.

-- Owen
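The suggestion of emitting several records per wild-card record in the map could look roughly like the sketch below. It is plain Java with no Hadoop classes, and it rests on assumptions that are not in the thread: the class name WildcardFanOutSketch, the KNOWN_TAGS list (the map has to know the concrete tags up front), the "id,tag" key layout and the "meta:" value prefix are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration of fanning out wild-card records in the map step.
final class WildcardFanOutSketch {

    // Assumption: the concrete tags are known when the map runs.
    static final List<String> KNOWN_TAGS = Arrays.asList("A", "B", "C");

    // Map step for one input line of the form "id tag payload".
    // Returns the records to emit, each written as "key<TAB>value".
    static List<String> map(String line) {
        String[] f = line.split(" ", 3);
        String id = f[0];
        String tag = f[1];
        String payload = f[2];
        List<String> out = new ArrayList<>();
        if (tag.equals("-")) {
            // Wild-card record: emit one copy per concrete tag.
            for (String t : KNOWN_TAGS) {
                out.add(id + "," + t + "\t" + "meta:" + payload);
            }
        } else {
            // Ordinary record: emit once under its own tag.
            out.add(id + "," + tag + "\t" + payload);
        }
        return out;
    }

    public static void main(String[] args) {
        for (String line : Arrays.asList("id_1 - metadata", "id_1 A numbers", "id_1 C numbers")) {
            for (String record : map(line)) {
                System.out.println(record);
            }
        }
    }
}
```

With a fixed tag list, some groups (here id_1,B) would receive only the metadata copy; the reducer would have to skip those.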
correct pattern for using setOutputValueGroupingComparator?
I'm trying to use map reduce to merge two classes of files, each class using the same keys for grouping. An example:

class 1 input file:
id_1 A metadatum
id_2 A metadatum
id_1 A metadatum

class 2 input file:
id_1 B some numbers
id_1 B some numbers
id_2 B some numbers

I map using the first token, an id string, as the key. Ideally, the intermediate input to the reducer class would be this (for the key id_1):

id_1 A metadatum
id_1 A metadatum
id_1 B some numbers
id_1 B some numbers

But because there's no guarantee on sorting for the values, we can see:

id_1 B some numbers
id_1 A metadatum
id_1 B some numbers
id_1 A metadatum

I was wondering if I could use setOutputValueGroupingComparator to force records of the first class to sort to the top. I'm having a hard time interpreting the documentation though:

If equivalence rules for grouping the intermediate keys are required to be different from those for grouping keys before reduction, then one may specify a Comparator via JobConf.setOutputValueGroupingComparator(Class) (http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/JobConf.html#setOutputValueGroupingComparator%28java.lang.Class%29). Since JobConf.setOutputKeyComparatorClass(Class) (http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/JobConf.html#setOutputKeyComparatorClass%28java.lang.Class%29) can be used to control how intermediate keys are grouped, these can be used in conjunction to simulate *secondary sort on values*.

My interpretation is as follows:
--
class 1 input file:
id_1 A metadatum
id_1 A metadatum

class 2 input file:
id_1 B some numbers
id_2 B some numbers

Map with key = first column + delimiter + second column.
Supply setOutputKeyComparatorClass such that it only compares based on the first half of the key.
Supply setOutputValueGroupingComparator such that it only compares based on the second half of the key.
Thus, all keys like id_1* go to the same group, and then it is sorted within that group with As first, and then Bs (or reverse if needed).
--
Am I vastly overthinking how setOutputValueGroupingComparator works? I can't tell from the docs if it is possible to peek at the values associated with the pair of keys in each comparison. If it is, I probably wouldn't have to use a different key as done in my interpretation.
Re: correct pattern for using setOutputValueGroupingComparator?
This is exactly what the setOutputValueGroupingComparator is for. Take a look at HADOOP-4545, for an example using the secondary sort. If you are using trunk or 0.20, look at src/examples/org/apache/hadoop/examples/SecondarySort.java. The checked in example uses the new map/reduce api that was introduced in 0.20.

-- Owen
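For readers who cannot open the SecondarySort example, here is a small plain-Java simulation of the secondary sort idea, applied to the id/A/B records from the question. It is only a sketch and does not call Hadoop: the names SecondarySortSketch, CompositeKey, Pair, SORT and GROUP are invented for illustration. In the usual arrangement, the sort comparator (the role of setOutputKeyComparatorClass) orders by the whole composite key, and the grouping comparator (the role of setOutputValueGroupingComparator) looks at the id part only. In a real job a partitioner on the id alone is normally needed as well, so that all records for one id reach the same reducer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Plain-Java simulation of secondary sort; no Hadoop classes are involved.
final class SecondarySortSketch {

    // Hypothetical composite key: the id plus the class tag ("A" or "B").
    static final class CompositeKey {
        final String id;
        final String tag;
        CompositeKey(String id, String tag) { this.id = id; this.tag = tag; }
    }

    // One map output record: composite key plus the value handed to reduce.
    static final class Pair {
        final CompositeKey key;
        final String value;
        Pair(CompositeKey key, String value) { this.key = key; this.value = value; }
    }

    // Sort comparator: orders by id first, then by tag, so As precede Bs.
    static final Comparator<CompositeKey> SORT = (a, b) -> {
        int byId = a.id.compareTo(b.id);
        return byId != 0 ? byId : a.tag.compareTo(b.tag);
    };

    // Grouping comparator: two keys belong to one reduce call if the ids match.
    static final Comparator<CompositeKey> GROUP = (a, b) -> a.id.compareTo(b.id);

    // Map step: "id tag payload" becomes key (id, tag) and value "tag payload".
    static Pair map(String line) {
        String[] f = line.split(" ", 3);
        return new Pair(new CompositeKey(f[0], f[1]), f[1] + " " + f[2]);
    }

    // Mimics the reduce side: sort all pairs with SORT, then start a new
    // group whenever GROUP says the key differs from the previous one.
    static List<List<String>> reduceGroups(List<String> lines) {
        List<Pair> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.add(map(line));
        }
        pairs.sort((a, b) -> SORT.compare(a.key, b.key));
        List<List<String>> groups = new ArrayList<>();
        CompositeKey previous = null;
        for (Pair p : pairs) {
            if (previous == null || GROUP.compare(previous, p.key) != 0) {
                groups.add(new ArrayList<>());
            }
            groups.get(groups.size() - 1).add(p.value);
            previous = p.key;
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
            "id_1 B some numbers", "id_1 A metadatum",
            "id_2 B some numbers", "id_1 B some numbers", "id_2 A metadatum");
        for (List<String> group : reduceGroups(input)) {
            System.out.println(group);
        }
    }
}
```

Each inner list stands for the values one reduce call would iterate over; within it the A records come before the B records because the tag is part of the sorted key, not because the comparators ever look at the values.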