I'm trying to use MapReduce to merge two classes of files, where both classes use the same keys for grouping. An example:

class 1 input file:

    id_1 A metadatum
    id_2 A metadatum
    id_1 A metadatum

class 2 input file:

    id_1 B some numbers
    id_1 B some numbers
    id_2 B some numbers

I map using the first token, an id string, as the key. Ideally, the intermediate input to the reducer class would be this (for the key id_1):

    id_1 A metadatum
    id_1 A metadatum
    id_1 B some numbers
    id_1 B some numbers

But because there's no guarantee on the sort order of the values, we can see:

    id_1 B some numbers
    id_1 A metadatum
    id_1 B some numbers
    id_1 A metadatum

I was wondering if I could use setOutputValueGroupingComparator to force records of the first class to sort to the top. I'm having a hard time interpreting the documentation, though:

    If equivalence rules for grouping the intermediate keys are required to be different from those for grouping keys before reduction, then one may specify a Comparator via JobConf.setOutputValueGroupingComparator(Class)<http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/JobConf.html#setOutputValueGroupingComparator%28java.lang.Class%29>. Since JobConf.setOutputKeyComparatorClass(Class)<http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/JobConf.html#setOutputKeyComparatorClass%28java.lang.Class%29> can be used to control how intermediate keys are grouped, these can be used in conjunction to simulate *secondary sort on values*.

My interpretation is as follows:

----------

class 1 input file:

    id_1 A metadatum
    id_1 A metadatum

class 2 input file:

    id_1 B some numbers
    id_2 B some numbers

Map with key = first column + delimiter + second column.

Supply setOutputKeyComparatorClass such that it only compares based on the first half of the key.

Supply setOutputValueGroupingComparator such that it only compares based on the second half of the key.

Thus, all keys like id_1* go to the same group, and the records within that group are sorted with the As first and then the Bs (or the reverse if needed).

----------

Am I vastly overthinking how setOutputValueGroupingComparator works?
I can't tell from the docs if it is possible to peek at the values associated with the pair of keys in each comparison. If it is, I probably wouldn't have to use a different key as done in my interpretation.
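
To make the goal concrete, here is a minimal sketch in plain Java of the ordering I'm after. It uses no Hadoop classes at all, and every name in it (SecondarySortSketch, SORT_ORDER, sortAndGroup, the tab delimiter) is made up for illustration. It assumes a composite key of the form id + delimiter + class, sorts on the whole key, and then groups on the id half only. It does not settle which JobConf setter should carry which comparator, which is the part I'm unsure about.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration only: no Hadoop classes, all names are hypothetical.
class SecondarySortSketch {

    // A record is {compositeKey, value}, where compositeKey is "id<TAB>class", e.g. "id_1\tA".
    static String idOf(String compositeKey) {
        return compositeKey.split("\t", 2)[0];
    }

    static String tagOf(String compositeKey) {
        return compositeKey.split("\t", 2)[1];
    }

    // Sort order over the full composite key: id first, then class tag (A before B).
    static final Comparator<String[]> SORT_ORDER =
        Comparator.comparing((String[] rec) -> idOf(rec[0]))
                  .thenComparing((String[] rec) -> tagOf(rec[0]));

    // Sort all records, then collect records sharing the same id into one group,
    // preserving the sorted order inside each group.
    static Map<String, List<String>> sortAndGroup(List<String[]> records) {
        List<String[]> sorted = new ArrayList<>(records);
        sorted.sort(SORT_ORDER); // List.sort is stable, so ties keep input order
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (String[] rec : sorted) {
            groups.computeIfAbsent(idOf(rec[0]), k -> new ArrayList<>())
                  .add(tagOf(rec[0]) + " " + rec[1]);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Records in the "bad" interleaved order from the example above.
        List<String[]> records = Arrays.asList(
            new String[] {"id_1\tB", "some numbers"},
            new String[] {"id_1\tA", "metadatum"},
            new String[] {"id_1\tB", "some numbers"},
            new String[] {"id_1\tA", "metadatum"},
            new String[] {"id_2\tB", "some numbers"},
            new String[] {"id_2\tA", "metadatum"});
        sortAndGroup(records).forEach((id, values) -> System.out.println(id + " -> " + values));
    }
}
```

Each group here plays the role of one reduce call: all id_1 records arrive together, with the A records ahead of the B records. Note that the comparators only ever look at the key, which is why the class tag has to be part of the key in this sketch.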
