I'm trying to run a mapreduce job that maps IntWritable/{URL,CrawlDatum} ->
URL/CrawlDatum, but I want the output to be sorted by the initial
IntWritable and partitioned by host.  I wrote a job with an identity
mapper, a partitioner that pulls the host out of the URL, and a reducer
that outputs just the URL and CrawlDatum.  However, every time I run it,
as soon as the reduce phase begins (status "reduce > reduce"), it gives
me this error:

java.io.IOException: key out of order: http://web1.incl.ne.jp/ after http://who2.com/
        at org.apache.hadoop.io.MapFile$Writer.checkKey(MapFile.java:169)
        at org.apache.hadoop.io.MapFile$Writer.append(MapFile.java:155)
        at org.apache.hadoop.mapred.MapFileOutputFormat$1.write(MapFileOutputFormat.java:56)
        at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:340)
        at org.apache.nutch.crawl.TimeSorter$FinalTimeSortMR.reduce(TimeSorter.java:96)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:355)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1707)
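
For context, here is a minimal sketch of the kind of host-based partitioning described above. It is not the actual TimeSorter code and deliberately avoids the Hadoop Partitioner interface so that it stands alone; the class name HostPartitionSketch and the method partitionFor are hypothetical, and falling back to the whole string when no host can be parsed is an assumption.

```java
import java.net.URI;

// Hypothetical illustration only: chooses a partition from the host of a URL,
// so that all URLs on the same host land in the same partition.
public class HostPartitionSketch {

    static int partitionFor(String url, int numPartitions) {
        String host;
        try {
            host = URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            host = null; // not a parseable URI
        }
        if (host == null) {
            host = url; // assumption: fall back to the whole string
        }
        // Mask the sign bit so the result is never negative.
        return (host.toLowerCase().hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int a = partitionFor("http://who2.com/", 4);
        int b = partitionFor("http://who2.com/some/page.html", 4);
        // Both URLs share a host, so they map to the same partition.
        System.out.println(a == b);
    }
}
```

Partitioning this way only decides which reduce task receives a record; it says nothing about the order in which keys are written within that task.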


When I looked at the MapFile.Writer.append() method that
MapFileOutputFormat calls, it says the keys must be sorted, so I figured
a quick change to job.setOutputFormat(SequenceFileOutputFormat.class)
would fix it, but I still see the exact same error message.  Is this
something others have seen, or would this be a better fit for the
hadoop-user mailing list?
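
To illustrate the constraint, here is a minimal sketch of a writer that rejects keys arriving out of ascending order. It is not the Hadoop source: the class SortedAppendSketch is hypothetical, it uses plain String comparison in place of the key's comparator, and accepting equal consecutive keys is an assumption. If the reduce receives records in IntWritable order but emits URLs as output keys, the URLs will generally not be ascending, which would trip a check like this one.

```java
import java.io.IOException;

// Hypothetical illustration only: a writer that requires keys to be
// appended in non-decreasing order.
public class SortedAppendSketch {

    private String lastKey = null;

    void append(String key) throws IOException {
        // Reject a key that sorts before the previously appended key.
        if (lastKey != null && lastKey.compareTo(key) > 0) {
            throw new IOException("key out of order: " + key + " after " + lastKey);
        }
        lastKey = key;
    }

    public static void main(String[] args) {
        SortedAppendSketch writer = new SortedAppendSketch();
        try {
            writer.append("http://who2.com/");
            // "web1" sorts before "who2", so this append is rejected.
            writer.append("http://web1.incl.ne.jp/");
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}
```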

Thanks,
Ned
