Hi,
While running a fetch on 3M URLs with the -noParsing option set, I noticed
that the reduce phase takes a very long time. Since the reducer class in
this case is the IdentityReducer, couldn't Hadoop handle it internally by
pointing the map output directly at the final output path? Or simply
rename the temporary output directory to the final output directory?
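
To make the question concrete, below is a rough sketch of the behaviour I
have in mind, written against the plain Hadoop mapred API. The class name,
the "input"/"output" paths, and the use of IdentityMapper are placeholders
for illustration only, not the actual Nutch fetch job: with zero reduce
tasks, the map output would be written straight to the final output
directory and the copy/sort/reduce steps would never run.

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;

public class MapOnlyExample {
  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(MapOnlyExample.class);
    conf.setJobName("map-only-example");

    // IdentityMapper simply passes each record through unchanged.
    conf.setMapperClass(IdentityMapper.class);

    // Zero reducers: each map task writes its output directly into the
    // final output directory, so there is no copy/sort/reduce phase.
    conf.setNumReduceTasks(0);

    // With the default TextInputFormat, the identity mapper emits
    // LongWritable keys (byte offsets) and Text values (the lines).
    conf.setOutputKeyClass(LongWritable.class);
    conf.setOutputValueClass(Text.class);

    // Placeholder paths, for illustration only.
    FileInputFormat.setInputPaths(conf, new Path("input"));
    FileOutputFormat.setOutputPath(conf, new Path("output"));

    JobClient.runJob(conf);
  }
}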
For the reduce phase, the copy step seems unnecessary in this case. I am
not familiar with Hadoop's internals, so there may be a good reason for
doing things the way they are done now, or I may be mistaken about how
they are done. Could the experts please shed some light on this?
Thank you,
-vishal.