Re: Fetcher2 Reduce Phase Question

Andrzej Bialecki Fri, 11 Apr 2008 15:33:27 -0700

Sandeep Tata wrote:

Hi Folks,


I was just wondering what computation really happens in the reduce
phase for Fetcher2 ?

If Fetcher was running in the parsing mode, then in the reduce phase Outlinks are separated from Parse output and stored in crawl_parse, and other data in parse_text and parse_data. This actually happens in FetcherOutputFormat / ParseOutputFormat, so there is no need for any Reduce apart from the IdentityReducer (default).
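
To illustrate the split described above, here is a minimal, dependency-free sketch. It is not the Nutch code: ParsedPage, split() and the line formats are hypothetical stand-ins, and only the three output names (crawl_parse, parse_text, parse_data) come from the description above. It only shows the idea that one incoming record is fanned out into several outputs at write time, so the reduce function itself has nothing to compute.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class OutputSplitSketch {

    // Hypothetical stand-in for one parsed page; NOT a Nutch class.
    static final class ParsedPage {
        final String url;
        final String text;
        final String metadata;
        final List<String> outlinks;

        ParsedPage(String url, String text, String metadata, List<String> outlinks) {
            this.url = url;
            this.text = text;
            this.metadata = metadata;
            this.outlinks = outlinks;
        }
    }

    // Fans each record out into three named outputs. The line formats are
    // illustrative assumptions, not the real on-disk formats.
    static Map<String, List<String>> split(List<ParsedPage> pages) {
        Map<String, List<String>> out = new LinkedHashMap<>();
        out.put("crawl_parse", new ArrayList<>());
        out.put("parse_text", new ArrayList<>());
        out.put("parse_data", new ArrayList<>());
        for (ParsedPage p : pages) {
            // Outlinks are separated from the rest of the parse output.
            for (String link : p.outlinks) {
                out.get("crawl_parse").add(p.url + " -> " + link);
            }
            // The remaining parse data goes to the other two outputs.
            out.get("parse_text").add(p.url + "\t" + p.text);
            out.get("parse_data").add(p.url + "\t" + p.metadata);
        }
        return out;
    }

    public static void main(String[] args) {
        List<ParsedPage> pages = Arrays.asList(
            new ParsedPage("http://a/", "hello", "title=A",
                Arrays.asList("http://b/", "http://c/")));
        for (Map.Entry<String, List<String>> e : split(pages).entrySet()) {
            System.out.println(e.getKey() + ": " + e.getValue());
        }
    }
}
```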


I know that it is implemented as a MapRunnable -- but I see no
explicit reducer being set for the job. Is the identity reducer being
used ? Why can't we simply use job.setNumReduceTasks(0) ?
Wouldn't this be faster?

First, when Fetcher / Fetcher2 were written there was no such option in Hadoop. Second, the meaning of this setting is that the output from maps becomes the final output - but this won't cut it, because map outputs are always simple SequenceFiles, whereas we need to split the FetcherOutput into a bunch of SequenceFiles and MapFiles (which have to be sorted) ...
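
The sorting requirement can be sketched without Hadoop on the classpath. In the snippet below, SortedWriter is a hypothetical stand-in that mimics one property of a MapFile-style writer, namely that it rejects keys appended out of order. The URLs and helper names are made up for illustration. Raw map output arrives in fetch order, so writing it directly fails, while the same keys written after a sort (which is what the shuffle before even an identity reduce provides) succeed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class SortedOutputSketch {

    // Hypothetical stand-in for a MapFile-style writer: keys must be
    // appended in non-decreasing order, otherwise the append is rejected.
    static final class SortedWriter {
        private final List<String> keys = new ArrayList<>();
        private String last = null;

        void append(String key) {
            if (last != null && key.compareTo(last) < 0) {
                throw new IllegalStateException(
                    "key out of order: " + key + " after " + last);
            }
            keys.add(key);
            last = key;
        }
    }

    // Returns true if every key could be appended, false if the writer
    // rejected one of them for being out of order.
    static boolean writeAll(List<String> keys) {
        SortedWriter writer = new SortedWriter();
        try {
            for (String key : keys) {
                writer.append(key);
            }
            return true;
        } catch (IllegalStateException e) {
            return false;
        }
    }

    // Simulates the sort that happens between the map and reduce phases.
    static List<String> sortedCopy(List<String> keys) {
        List<String> copy = new ArrayList<>(keys);
        Collections.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        // Map output in the order pages happened to be fetched.
        List<String> mapOutput = Arrays.asList("http://c/", "http://a/", "http://b/");
        System.out.println(writeAll(mapOutput));             // prints "false"
        System.out.println(writeAll(sortedCopy(mapOutput))); // prints "true"
    }
}
```

In this sketch, skipping the sort step corresponds to job.setNumReduceTasks(0): the writer would see keys in fetch order and reject them.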



--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
