On Jan 14, 2009, at 12:46 AM, Rasit OZDAS wrote:
> Jim,
> As far as I know, there is no operation done after Reducer.

Correct, other than output promotion, which moves the output file to
the final filename.

> But if you are a little experienced, you already know these.
> Ordered list means one final file, or am I missing something?

There is no value and a lot of cost associated with creating a single
file for the output. The question is how you want the keys divided
between the reduces (and therefore output files). The default
partitioner hashes the key and mods by the number of reduces, which
"stripes" the keys across the output files. You can use the
mapred.lib.InputSampler to generate good partition keys and
mapred.lib.TotalOrderPartitioner to get completely sorted output based
on the partition keys. With the total order partitioner, each reduce
gets an increasing range of keys and thus has all of the nice
properties of a single file without the costs.
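
To make the difference concrete, here is a minimal sketch in plain Java
(not Hadoop code) of the two ways of assigning keys to reduces. The
class name, method names, and the split points "g", "n", "t" are made up
for illustration; in a real job the split points would come from
sampling the input. The hash version uses the same arithmetic as the
default partitioner, and the range version does a binary search over
sorted split points, which is roughly what a total order partitioner
has to do.

```java
import java.util.Arrays;

class PartitionSketch {
    // Hash partitioning: hash the key, clear the sign bit, and take the
    // result modulo the number of reduces. Neighbouring keys end up
    // scattered ("striped") across the output files.
    static int hashPartition(String key, int numReduces) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduces;
    }

    // Range partitioning: splitPoints must be sorted and hold
    // (numReduces - 1) keys. Reduce i receives the keys k with
    // splitPoints[i-1] <= k < splitPoints[i], so reduce 0 holds the
    // smallest keys and the last reduce holds the largest.
    static int rangePartition(String key, String[] splitPoints) {
        int pos = Arrays.binarySearch(splitPoints, key);
        // When the key is absent, binarySearch returns
        // -(insertionPoint) - 1; the insertion point is the partition.
        return pos >= 0 ? pos + 1 : -pos - 1;
    }

    public static void main(String[] args) {
        // Hypothetical split points for 4 reduces.
        String[] splitPoints = {"g", "n", "t"};
        String[] keys = {"apple", "grape", "mango", "peach", "zebra"};
        for (String key : keys) {
            System.out.println(key
                + " hash=" + hashPartition(key, 4)
                + " range=" + rangePartition(key, splitPoints));
        }
    }
}
```

With range partitioning, the partition number never decreases as the
keys increase, so reading the sorted output files in order is
equivalent to reading one sorted file.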
-- Owen