Vince Filby wrote:
I am also evaluating performance but on a single machine. I am finding that
it crawls about two URLs per second. The fetch list is mostly unique, so I
am looking for other performance bottlenecks. The machine is an old PIII
with 512MB of RAM that is running with a load average of 3-4, so I am going
to try a faster machine next week.
What details about the network or the DNS setup should I look at to
determine bottlenecks in that area?
If you see a lot of SocketTimeoutExceptions in your log, then most
likely it's a bandwidth problem on your side.
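
A quick way to check this (and the DNS point below) is to tally
exception class names in the fetcher log. This is not part of Nutch,
just a quick-and-dirty illustration - point it at logs/hadoop.log, or
wherever your fetcher output ends up:

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Counts exception class names appearing in the log file given as args[0].
public class LogExceptionTally {

  // Matches fully-qualified names ending in Exception or Error,
  // e.g. java.net.SocketTimeoutException.
  private static final Pattern EXC =
      Pattern.compile("([\\w.]+(?:Exception|Error))");

  public static void main(String[] args) throws Exception {
    Map<String, Integer> counts = new HashMap<String, Integer>();
    BufferedReader in = new BufferedReader(new FileReader(args[0]));
    String line;
    while ((line = in.readLine()) != null) {
      Matcher m = EXC.matcher(line);
      while (m.find()) {
        Integer c = counts.get(m.group(1));
        counts.put(m.group(1), c == null ? 1 : c + 1);
      }
    }
    in.close();
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      System.out.println(e.getValue() + "\t" + e.getKey());
    }
  }
}

If SocketTimeoutException dominates the output, bandwidth is the first
suspect; if UnknownHostException does, see the next point.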
If you see a lot of UnknownHostExceptions, then the problem is slow
DNS resolution - you should run a local caching DNS server.
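
To see whether DNS is actually slow, you can time a few lookups by
hand. Again just a sketch, not Nutch code - note that the JVM caches
successful lookups, so each host gives a meaningful number only on the
first run:

import java.net.InetAddress;
import java.net.UnknownHostException;

// Times DNS resolution for each host name given on the command line.
public class DnsTimer {
  public static void main(String[] args) {
    for (String host : args) {
      long start = System.currentTimeMillis();
      try {
        // The JVM caches results (networkaddress.cache.ttl), so only
        // the first lookup per host reflects real resolver latency.
        InetAddress addr = InetAddress.getByName(host);
        System.out.println(host + " -> " + addr.getHostAddress()
            + " in " + (System.currentTimeMillis() - start) + " ms");
      } catch (UnknownHostException e) {
        System.out.println(host + " FAILED after "
            + (System.currentTimeMillis() - start) + " ms");
      }
    }
  }
}

Lookups consistently in the hundreds of milliseconds are a strong hint
that a local caching DNS server would help.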
If you see a lot of idle time, with nothing happening, check the
following ratio for your fetch list: number_of_urls / number_of_hosts.
In general, this ratio should be much higher than the number of fetcher
threads - with only a few distinct hosts, per-host politeness limits
leave most threads waiting.
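
You can compute this ratio from a plain list of URLs, e.g. one grepped
out of a segment or crawldb dump (the class below is just an
illustration, not part of Nutch):

import java.io.BufferedReader;
import java.io.FileReader;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.HashSet;
import java.util.Set;

// Computes number_of_urls / number_of_hosts for a file with one URL per line.
public class HostRatio {
  public static void main(String[] args) throws Exception {
    Set<String> hosts = new HashSet<String>();
    long urls = 0;
    BufferedReader in = new BufferedReader(new FileReader(args[0]));
    String line;
    while ((line = in.readLine()) != null) {
      line = line.trim();
      if (line.length() == 0) continue;
      try {
        hosts.add(new URL(line).getHost().toLowerCase());
        urls++;
      } catch (MalformedURLException e) {
        // skip lines that are not parseable URLs
      }
    }
    in.close();
    if (hosts.size() > 0) {
      System.out.println(urls + " URLs / " + hosts.size() + " hosts = "
          + ((double) urls / hosts.size()));
    }
  }
}

With, say, 10 fetcher threads and a ratio near 1, nearly every thread
is blocked waiting on the same few hosts' politeness intervals.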
Use Fetcher2 instead of Fetcher - it's supposed to cope better with
crawl-delay contention.
Check for sites with very high Crawl-delay values in robots.txt - these
can considerably slow down the crawl.
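
You can spot-check suspect hosts with something like this - a
standalone check, outside Nutch (which handles robots.txt itself
during fetching):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;

// Prints any Crawl-delay lines from each host's robots.txt.
public class CrawlDelayCheck {
  public static void main(String[] args) throws Exception {
    for (String host : args) {
      URL robots = new URL("http://" + host + "/robots.txt");
      try {
        BufferedReader in = new BufferedReader(
            new InputStreamReader(robots.openStream()));
        String line;
        while ((line = in.readLine()) != null) {
          if (line.toLowerCase().startsWith("crawl-delay")) {
            System.out.println(host + ": " + line.trim());
          }
        }
        in.close();
      } catch (IOException e) {
        System.out.println(host + ": could not fetch robots.txt");
      }
    }
  }
}

A value like "Crawl-delay: 60" means at most one fetch per minute from
that host, which on a small fetch list can dominate the total time.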
--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com