On Wed, Dec 23, 2009 at 12:46 PM, Saptarshi Guha <[email protected]>wrote:
> Hello, > > I re-wrote MapFileOutputFormat for use with Hadoop 0.20.1 and have a > question. > Suppose my Map sends key-value pairs to the reducers. > In my reducer, for a given key value, i emit key1,value1, key2,value2, ... > , > keyn,valuen > > e.g the key (sent to reduce) is e780f987932c84d41e4f14d7607fcb69c6889 > (stored as bytes writable > variation) and value is several lines > > In the reduce, i emit > > key=(e780f987932c84d41e4f14d7607fcb69c6889, 1), value= subset of values > key=(e780f987932c84d41e4f14d7607fcb69c6889, 2), value= subset of values and > so > on > > (The key, values stored in a binary form, the comparator is a binary > comparator). > > So the reduce will be emitting keys in a not necessarily sorted order and > MapOutputFormat throws the following exception: > > Reduce:java.io.IOException: key out of order: > "e780f987932c84d41e4f14d7607fcb69c6889" "1" > after "e72e96c506c4e5cefbc2889e124228f67d121" "10" > at org.apache.hadoop.io.MapFile$Writer.checkKey(MapFile.java:206) > at org.apache.hadoop.io.MapFile$Writer.append(MapFile.java:192) > at > org.godhuli.rhipe.RHMapFileOutputFormat$1.write(RHMapFileOutputFormat.java:79) > (out of order using binary comparator) > > I know the reduce receives keys in sorted order, but the keys it emits may > not > be, so I'm not totally surprised. > > Q1: Is this expected with MapFileOutputFormat? > Yes. MapFiles are stored in sorted order so that lookup can be done with binary search. Q2: Is the work around to emit as SequenceFileOutputFormat, then run an > identity > map (with a reduce) and output as MapFileOutputFormat? If so, doesn't this > force > the user to use double the space(at least)? > > Yes, but you can remove the intermediate output when you're done. On most clusters, the size of data that you can run jobs on in a reasonably short timeframe is fairly small compared to the total capacity. That is to say, on a 10 node cluster, each with 4x1TB disks, it will take many hours to sort even 5-10TB. -Todd
