[EMAIL PROTECTED] wrote:
> Siddhartha,
>
> I think decreasing generate.max.per.host will limit the 'wait time'
> for each fetch run, but I have a feeling that the overall time will
> be roughly the same. As a matter of fact, it may be even higher,
> because you'll have to run generate more times, and if your fetch
> jobs are too short, you will be spending more time waiting on
> MapReduce jobs (JVM instantiation, job initialization, ...).

That's correct in the case of very short jobs. For longer jobs, though, with fetchlists consisting of many URLs from the same host, the fetch time will be dominated by the 'wait time', i.e. the politeness delay between successive requests to the same host.
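
To see why, here's a quick back-of-the-envelope sketch (plain Python, not Nutch code - the 5 second delay and the per-host counts below are assumed numbers, not your configuration): since requests to a single host are spaced out by the politeness delay, the host with the most URLs in the fetchlist sets a lower bound on the whole fetch run, no matter how many fetcher threads you use.

def min_fetch_seconds(urls_per_host, delay_s=5.0):
    """Lower bound on the fetch time for one fetchlist.

    urls_per_host -- dict mapping host -> number of its URLs in the fetchlist
    delay_s       -- assumed politeness delay between requests to one host
    """
    # URLs from one host are fetched serially, delay_s apart, so the
    # busiest host determines the minimum duration of the whole run.
    return max(urls_per_host.values()) * delay_s

hours = min_fetch_seconds({"big-host.com": 50_000, "small-host.org": 200}) / 3600
print(f"at least {hours:.1f} hours")   # ~69.4 hours, regardless of thread count

Decreasing generate.max.per.host shortens each such run, at the cost of the extra generate/fetch cycles (and per-job overhead) mentioned above.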

A different way to look at the effect of generate.max.per.host is that it gives smaller hosts a better chance of being included in a fetchlist - otherwise fetchlists would be dominated by URLs from large hosts. So, in a sense, it helps to diversify your crawling frontier, with the implicit assumption that N pages from X hosts are more interesting than the same N pages from a single host.
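
As a toy illustration of that effect (again plain Python, not the actual Generator code - the fetchlist size of 1000, the cap of 100 and the host names are made-up): capping how many URLs a single host may contribute leaves room in a fixed-size fetchlist for URLs from many smaller hosts.

from collections import defaultdict
from urllib.parse import urlparse

def generate(urls, fetchlist_size, max_per_host=None):
    # Take URLs in order, skipping any host that has already hit the cap,
    # until the fetchlist is full.
    taken, per_host = [], defaultdict(int)
    for url in urls:
        host = urlparse(url).netloc
        if max_per_host is not None and per_host[host] >= max_per_host:
            continue
        per_host[host] += 1
        taken.append(url)
        if len(taken) == fetchlist_size:
            break
    return taken

# A frontier dominated by one large host, plus many small ones:
frontier = [f"http://big-host.com/page{i}" for i in range(10_000)] \
         + [f"http://small-{i}.org/" for i in range(2_000)]

hosts = lambda urls: len({urlparse(u).netloc for u in urls})
print(hosts(generate(frontier, 1000)))       # 1   - all URLs from big-host.com
print(hosts(generate(frontier, 1000, 100)))  # 901 - 100 from big-host.com + 900 small hosts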

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
