[EMAIL PROTECTED] wrote:
> Siddhartha,
>
> I think decreasing generate.max.per.host will limit the 'wait time'
> for each fetch run, but I have a feeling that the overall time will
> be roughly the same. As a matter of fact, it may be even higher,
> because you'll have to run generate more times, and if your fetch
> jobs are too short, you will be spending more time waiting on
> MapReduce jobs (JVM instantiation, job initialization, ...).

That's correct in the case of very short jobs. For longer jobs, though, with fetchlists consisting of many URLs from the same host, the fetch time will be dominated by the 'wait time', i.e. the politeness delay between successive requests to the same host.
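
To see why, here's a quick back-of-the-envelope sketch (plain Python, not Nutch code - the 5 second delay and the per-host counts below are assumed numbers, not your configuration): since requests to a single host are spaced out by the politeness delay, the host with the most URLs in the fetchlist sets a lower bound on the whole fetch run, no matter how many fetcher threads you use.

def min_fetch_seconds(urls_per_host, delay_s=5.0):
    """Lower bound on the fetch time for one fetchlist.

    urls_per_host -- dict mapping host -> number of its URLs in the fetchlist
    delay_s       -- assumed politeness delay between requests to one host
    """
    # URLs from one host are fetched serially, delay_s apart, so the
    # busiest host determines the minimum duration of the whole run.
    return max(urls_per_host.values()) * delay_s

hours = min_fetch_seconds({"big-host.com": 50_000, "small-host.org": 200}) / 3600
print(f"at least {hours:.1f} hours")   # ~69.4 hours, regardless of thread count

Decreasing generate.max.per.host shortens each such run, at the cost of the extra generate/fetch cycles (and per-job overhead) mentioned above.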

A different way to look at the effect of generate.max.per.host is that it gives smaller hosts a better chance of being included in a fetchlist - otherwise fetchlists would be dominated by URLs from large hosts. So, in a sense, it helps to diversify your crawling frontier, with the implicit assumption that N pages from X hosts are more interesting than the same N pages from a single host.
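
As a toy illustration of that effect (again plain Python, not the actual Generator code - the fetchlist size of 1000, the cap of 100 and the host names are made-up): capping how many URLs a single host may contribute leaves room in a fixed-size fetchlist for URLs from many smaller hosts.

from collections import defaultdict
from urllib.parse import urlparse

def generate(urls, fetchlist_size, max_per_host=None):
    # Take URLs in order, skipping any host that has already hit the cap,
    # until the fetchlist is full.
    taken, per_host = [], defaultdict(int)
    for url in urls:
        host = urlparse(url).netloc
        if max_per_host is not None and per_host[host] >= max_per_host:
            continue
        per_host[host] += 1
        taken.append(url)
        if len(taken) == fetchlist_size:
            break
    return taken

# A frontier dominated by one large host, plus many small ones:
frontier = [f"http://big-host.com/page{i}" for i in range(10_000)] \
         + [f"http://small-{i}.org/" for i in range(2_000)]

hosts = lambda urls: len({urlparse(u).netloc for u in urls})
print(hosts(generate(frontier, 1000)))       # 1   - all URLs from big-host.com
print(hosts(generate(frontier, 1000, 100)))  # 901 - 100 from big-host.com + 900 small hosts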

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
