I don't think 10% is bad, but also look at URLs that come from the same host.
The Nutch fetcher does rate control when it hits the same host multiple times
in sequence.  Look at the "Retry Later..." errors and see how much of the 10%
is retries.
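
The tally Ledio suggests can be sketched like this — a hypothetical snippet, since the exact wording of the fetcher's error lines depends on your Nutch version and log setup; the sample lines below are made-up stand-ins for whatever your task logs actually contain:

```python
# Count what share of fetch failures are "Retry Later" rate-control
# errors vs. other failures (e.g. real timeouts). In practice you would
# read these lines from the fetcher task logs instead of a list.
log_lines = [
    "fetch of http://a.example/1 failed with: java.net.SocketTimeoutException",
    "fetch of http://b.example/2 failed with: Retry Later",
    "fetch of http://b.example/3 failed with: Retry Later",
]
failures = [line for line in log_lines if "failed with" in line]
retries = [line for line in failures if "Retry Later" in line]
print(f"{len(retries)} of {len(failures)} failures are retries")
```

If most of the 10% turns out to be retries rather than timeouts, the problem is host concentration in the fetch list, not dead sites or bandwidth.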

-Ledio

-----Original Message-----
From: Dennis Kubes [mailto:[EMAIL PROTECTED]
Sent: Thursday, May 18, 2006 2:55 PM
To: [email protected]
Subject: Timeout Errors Percentages on Large Fetches


What is anybody else seeing for timeout percentages on large fetches? 

We are running a 1M page crawl and seeing about a 10% timeout rate on a 
2Mbps line running about 165 fetchers, I think.  We have 
fetcher.threads.fetch set to 3 but have 55 map tasks as a default on an 
11-node cluster; if I am not mistaken, this works out to 165 fetchers.  
It is running about 16 pages/second with about a 10% timeout rate, and I 
don't know whether that is due to my settings pretty much pegging the 
available bandwidth or to the sites I am crawling being down or 
non-responsive.  10% seemed a little high for down or non-responding 
sites.
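
A back-of-the-envelope check of those numbers (assuming "2Mbps" means 2,000,000 bits/s; real throughput after HTTP overhead will be somewhat lower):

```python
# Fetcher count: each map task runs fetcher.threads.fetch threads.
map_tasks = 55
threads_per_task = 3           # fetcher.threads.fetch
fetchers = map_tasks * threads_per_task
print(fetchers)                # 165 concurrent fetchers

# Bandwidth per page at the observed fetch rate.
line_bytes_per_sec = 2_000_000 / 8   # 2 Mbps -> 250,000 bytes/s
pages_per_sec = 16
bytes_per_page = line_bytes_per_sec / pages_per_sec
print(int(bytes_per_page))     # ~15,625 bytes/page
```

Roughly 15 KB per page is on the small side for typical HTML, which would be consistent with the line being the bottleneck rather than the remote sites.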

Dennis


_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
