Re: IdentityReducer while fetching

Andrzej Bialecki Thu, 12 Oct 2006 05:44:08 -0700

Vishal Shah wrote:

Hi,

While running a fetch on 3M urls with -noParsing option set, I noticed
that the reduce is taking too long. Since the reducer class is the
IdentityReducer class in this case, couldn't hadoop handle it internally
by setting the output path of map directly to the final output path? Or,
do a simple rename of the temp output directory to the final output
directory?

For the reduce phase, it seems that the copy is unnecessary in this
case. I am unfamiliar with the details of Hadoop, so maybe there is a
strong reason to do things the way they are done right now, or maybe I
am mistaken about how they are done. Can the experts please throw some
light on this?

Unfortunately, it's not possible - partial outputs from map tasks are basically in random key order, and need to be sorted in order to produce output MapFile-s. IdentityReducer indeed does nothing, but the ReduceTask framework is doing the sorting ...
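
To illustrate the point above, here is a minimal Python sketch (not Hadoop code) of why the reduce side still has work to do even when the reducer itself is an identity function. Everything in it is a hypothetical stand-in: SortedFileWriter only mimics the constraint that a MapFile-style output must receive keys in sorted order, and run_reduce is a simplified model of the copy, sort and reduce steps, not the actual ReduceTask implementation.

```python
from itertools import groupby
from operator import itemgetter


class SortedFileWriter:
    """Hypothetical stand-in for a MapFile-style writer.

    It only models one constraint: keys must be appended in sorted order.
    """

    def __init__(self):
        self.entries = []

    def append(self, key, value):
        if self.entries and key < self.entries[-1][0]:
            raise ValueError(
                f"key out of order: {key!r} after {self.entries[-1][0]!r}"
            )
        self.entries.append((key, value))


def identity_reduce(key, values):
    """Emit every value unchanged, as an identity reducer would."""
    for value in values:
        yield key, value


def run_reduce(map_outputs, writer):
    """Simplified model of the reduce side: copy, sort, then reduce."""
    # Copy phase: gather the partial outputs of all map tasks.
    records = [record for output in map_outputs for record in output]
    # Sort phase: this is the real work, even with an identity reducer.
    records.sort(key=itemgetter(0))
    # Reduce phase: group by key and pass the values through unchanged.
    for key, group in groupby(records, key=itemgetter(0)):
        values = [value for _, value in group]
        for out_key, out_value in identity_reduce(key, values):
            writer.append(out_key, out_value)
    return writer.entries
```

Appending the map outputs straight to the writer, in the order the map tasks produced them, fails as soon as a key arrives out of order; going through run_reduce succeeds because of the sort step, not because of anything the reducer does.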

Do you think the copy/sort phase is a performance bottleneck in your case? You could try changing the number of reduce tasks, or perhaps the number of tasks per tasktracker - please see http://wiki.apache.org/lucene-hadoop/HowManyMapsAndReduces for more details.
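
As a rough illustration of why the number of reduce tasks matters for the copy/sort phase, the following Python sketch splits keys across reducers by hash, so each reducer copies and sorts only its own share. It is a simplified model under stated assumptions: partition and split_by_reducer are hypothetical helpers, and zlib.crc32 merely stands in for whatever hash the framework applies to keys.

```python
import zlib


def partition(key, num_reduce_tasks):
    """Pick a reducer for a key by hashing it (crc32 is an assumed stand-in)."""
    return zlib.crc32(key.encode("utf-8")) % num_reduce_tasks


def split_by_reducer(keys, num_reduce_tasks):
    """Return one sorted key list per reducer."""
    buckets = [[] for _ in range(num_reduce_tasks)]
    for key in keys:
        buckets[partition(key, num_reduce_tasks)].append(key)
    # Each reducer sorts only the keys assigned to it.
    return [sorted(bucket) for bucket in buckets]
```

With more reduce tasks, each bucket tends to be smaller, so each individual sort has less data to handle; the cost is more output parts and more tasks to schedule.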



--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
