Vishal Shah wrote:
Hi,
While running a fetch on 3M URLs with the -noParsing option set, I noticed
that the reduce phase is taking too long. Since the reducer class is the
IdentityReducer class in this case, couldn't Hadoop handle it internally
by setting the output path of the map directly to the final output path?
Or do a simple rename of the temp output directory to the final output
directory?
For the reduce phase, it seems that the copy is unnecessary in this
case. I am unfamiliar with the details of Hadoop, so maybe there is a
strong reason to do things the way they are done right now, or maybe I
am mistaken about how they are done. Can the experts please shed some
light on this?
Unfortunately, it's not possible - partial outputs from map tasks are
basically in random key order, and need to be sorted in order to produce
output MapFile-s. IdentityReducer indeed does nothing, but the
ReduceTask framework is doing the sorting ...
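
To illustrate the point, here is a minimal Python sketch (not Hadoop code). SortedFileWriter is a hypothetical toy stand-in for a MapFile-style writer that refuses keys arriving out of order; the URLs and values are made-up sample data. Writing the map outputs straight through fails, while sorting each partial output and merging them, roughly what the reduce side does, succeeds.

```python
import heapq


class SortedFileWriter:
    """Toy stand-in for a MapFile-style writer: rejects out-of-order keys."""

    def __init__(self):
        self.entries = []

    def append(self, key, value):
        # A sorted, indexed file can only be built if keys never go backwards.
        if self.entries and key < self.entries[-1][0]:
            raise ValueError(
                f"key out of order: {key!r} after {self.entries[-1][0]!r}"
            )
        self.entries.append((key, value))


def write_directly(map_outputs):
    """Append map outputs as-is, without any sort step."""
    writer = SortedFileWriter()
    for part in map_outputs:
        for key, value in part:
            writer.append(key, value)
    return writer.entries


def sort_then_write(map_outputs):
    """Sort each partial output, merge the sorted runs, then write."""
    writer = SortedFileWriter()
    sorted_parts = [sorted(part) for part in map_outputs]
    for key, value in heapq.merge(*sorted_parts):
        writer.append(key, value)
    return writer.entries


# Hypothetical partial outputs from two map tasks, in arbitrary key order.
map_outputs = [
    [("http://c.example/", "page-c"), ("http://a.example/", "page-a")],
    [("http://d.example/", "page-d"), ("http://b.example/", "page-b")],
]
```

Calling write_directly(map_outputs) raises ValueError on the second record, whereas sort_then_write(map_outputs) returns the four entries in key order. The identity reduce function contributes nothing here; the work is all in the sort and merge.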
Do you think the copy/sort phase is a performance bottleneck in your
case? You could try changing the number of reduce tasks, or perhaps the
number of tasks per tasktracker - please see
http://wiki.apache.org/lucene-hadoop/HowManyMapsAndReduces for more details.
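
As a rough sketch of why the number of reduce tasks matters: each key is assigned to one reduce task, so more reduces means each one has fewer records to copy and sort. The Python below is only an illustration, not Hadoop code; the crc32-based partition function and the sample URLs are assumptions chosen to keep the example deterministic.

```python
import zlib
from collections import Counter


def partition(key: str, num_reduces: int) -> int:
    """Assign a key to a reduce task with a deterministic hash (assumed scheme)."""
    return zlib.crc32(key.encode("utf-8")) % num_reduces


def records_per_reduce(keys, num_reduces):
    """Count how many records each reduce task would have to copy and sort."""
    counts = Counter(partition(k, num_reduces) for k in keys)
    return [counts.get(i, 0) for i in range(num_reduces)]


# Hypothetical sample of fetched URLs.
keys = [f"http://example.com/page/{i}" for i in range(30000)]

if __name__ == "__main__":
    for n in (1, 4, 16):
        sizes = records_per_reduce(keys, n)
        print(f"{n} reduce task(s): largest share = {max(sizes)} records")
```

With a single reduce task all records land in one sort; with more tasks the largest share shrinks, at the cost of more output parts and more task overhead.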
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com