Vince Filby wrote:
I am also evaluating performance, but on a single machine.  I am finding that
it crawls about two URLs per second.  The fetch list is mostly unique, so I
am looking for other performance bottlenecks.  The machine is an old PIII
with 512MB of RAM running at a load average of 3-4, so I am going
to try a faster machine next week.

What details about the network or the DNS setup should I find out to
determine bottlenecks in that area?

If you see a lot of SocketTimeoutExceptions in your log, then most likely it's a bandwidth problem on your side.
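
A quick way to see which failure mode dominates is to tally exception names in the fetcher log. A minimal sketch in plain Java (the log file path is passed as an argument - adjust it to wherever your logs end up):

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ExceptionTally {
  public static void main(String[] args) throws Exception {
    // Matches fully-qualified exception names, e.g. java.net.SocketTimeoutException
    Pattern p = Pattern.compile("([A-Za-z0-9_.]*(?:Exception|Error))");
    Map<String, Integer> counts = new HashMap<String, Integer>();
    BufferedReader in = new BufferedReader(new FileReader(args[0]));
    String line;
    while ((line = in.readLine()) != null) {
      Matcher m = p.matcher(line);
      while (m.find()) {
        Integer c = counts.get(m.group(1));
        counts.put(m.group(1), c == null ? 1 : c + 1);
      }
    }
    in.close();
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      System.out.println(e.getValue() + "\t" + e.getKey());
    }
  }
}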

If you see a lot of UnknownHostExceptions, then it's a problem of slow DNS resolution - you should run a local caching DNS server.
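
Note that the JVM does its own DNS caching as well, independent of any caching DNS server on the network. A minimal sketch of tuning it (the TTL values below are just examples, not recommendations - these properties must be set before the JVM performs its first lookup):

import java.net.InetAddress;
import java.security.Security;

public class DnsCacheConfig {
  public static void main(String[] args) throws Exception {
    // Cache successful lookups for one hour (value is in seconds).
    Security.setProperty("networkaddress.cache.ttl", "3600");
    // Cache failed lookups only briefly, so transient failures are retried.
    Security.setProperty("networkaddress.cache.negative.ttl", "10");
    // The first lookup after this point populates the cache.
    System.out.println(InetAddress.getByName("www.example.com"));
  }
}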

If you see a lot of idle time, with nothing happening, check the ratio number_of_urls / number_of_hosts. In the general case this ratio should be much higher than the number of fetcher threads.
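
If you want to measure that ratio directly, you can count distinct hosts in a plain-text dump of the fetch list. A minimal sketch, assuming one URL per line in the input file (for a real segment you would have to dump it to text first):

import java.io.BufferedReader;
import java.io.FileReader;
import java.net.URL;
import java.util.HashSet;
import java.util.Set;

public class HostRatio {
  public static void main(String[] args) throws Exception {
    BufferedReader in = new BufferedReader(new FileReader(args[0]));
    Set<String> hosts = new HashSet<String>();
    long urls = 0;
    String line;
    while ((line = in.readLine()) != null) {
      line = line.trim();
      if (line.length() == 0) continue;
      try {
        // Count the URL only if its host can be parsed out.
        hosts.add(new URL(line).getHost().toLowerCase());
        urls++;
      } catch (Exception e) {
        // Skip malformed lines.
      }
    }
    in.close();
    if (hosts.isEmpty()) {
      System.out.println("no parsable URLs found");
      return;
    }
    System.out.println("urls=" + urls + " hosts=" + hosts.size()
        + " ratio=" + (double) urls / hosts.size());
  }
}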

Use Fetcher2 instead of Fetcher - it is supposed to cope better in situations of crawl-delay contention.

Check for sites with very high Crawl-delay values in robots.txt - these can slow down the crawl considerably.
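
To spot the worst offenders up front, you can fetch robots.txt for a host and look for the Crawl-delay directive. A minimal sketch (deliberately simplistic - it ignores which User-agent section the directive belongs to, so treat it as a diagnostic only):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class CrawlDelayCheck {
  public static void main(String[] args) throws Exception {
    // args[0] is a host name, e.g. www.example.com
    URL robots = new URL("http://" + args[0] + "/robots.txt");
    BufferedReader in = new BufferedReader(
        new InputStreamReader(robots.openStream()));
    String line;
    while ((line = in.readLine()) != null) {
      if (line.toLowerCase().startsWith("crawl-delay:")) {
        System.out.println(args[0] + " -> " + line.trim());
      }
    }
    in.close();
  }
}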


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
