unavailable robots.txt kills fetch
----------------------------------

                 Key: NUTCH-419
                 URL: http://issues.apache.org/jira/browse/NUTCH-419
             Project: Nutch
          Issue Type: Bug
          Components: fetcher
    Affects Versions: 0.8.1
         Environment: Fetcher is behind a squid proxy, but I am pretty sure this is irrelevant. Nutch in local mode, running on a linux machine with 2GB RAM.
            Reporter: Carsten Lehmann

I think there is another robots.txt-related problem which is not addressed by NUTCH-344, but which also results in an aborted fetch.

I am sure that in my last fetch all 17 fetcher threads died while they were waiting for a robots.txt file to be delivered by a web server that was not responding properly.

I looked at the squid access log, which is used by all fetch threads. It ends with many HTTP 504 errors ("gateway timeout") caused by a certain robots.txt URL:

<....>
1166652253.332 899427 127.0.0.1 TCP_MISS/504 1450 GET http://gso.gbv.de/robots.txt - DIRECT/193.174.240.8 text/html
1166652343.350 899664 127.0.0.1 TCP_MISS/504 1450 GET http://gso.gbv.de/robots.txt - DIRECT/193.174.240.8 text/html
1166652353.560 899871 127.0.0.1 TCP_MISS/504 1450 GET http://gso.gbv.de/robots.txt - DIRECT/193.174.240.8 text/html

These entries mean that it takes 15 minutes before the request ends with a timeout. This can be calculated from the squid log: the first column is the request time (in UTC seconds) and the second column is the duration of the request (in ms). Each duration is roughly 900000 ms, so 900000/1000/60 = 15 minutes.

As far as I understand it, every time a fetch thread tries to get this robots.txt file, the thread busy-waits for the duration of the request (15 minutes). If this is right, then all 17 fetcher threads were caught in this trap at the time when fetching was aborted, as there are 17 requests in the squid log which had not yet timed out when the message "aborting with 17 threads" was written to the nutch logfile.

Setting fetcher.max.crawl.delay cannot help here. I see 296 access attempts in total for this robots.txt URL in the squid log of this crawl, but fetcher.max.crawl.delay is set to 30.

--
This message is automatically generated by JIRA.
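
The duration arithmetic in the report can be checked with a short script. The following is only a minimal sketch in Python: it assumes the log lines use Squid's native access.log layout as quoted in the report (timestamp, elapsed milliseconds, client, result/status, bytes, method, URL, ...). The names `parse_line`, `slow_timeouts`, and `SAMPLE_LOG` are hypothetical helpers introduced for illustration and are not part of Nutch or Squid.

```python
from collections import Counter

# The three access.log entries quoted in the report.
SAMPLE_LOG = """\
1166652253.332 899427 127.0.0.1 TCP_MISS/504 1450 GET http://gso.gbv.de/robots.txt - DIRECT/193.174.240.8 text/html
1166652343.350 899664 127.0.0.1 TCP_MISS/504 1450 GET http://gso.gbv.de/robots.txt - DIRECT/193.174.240.8 text/html
1166652353.560 899871 127.0.0.1 TCP_MISS/504 1450 GET http://gso.gbv.de/robots.txt - DIRECT/193.174.240.8 text/html
"""


def parse_line(line):
    """Split one log line into (timestamp, elapsed_ms, http_status, url).

    Assumes whitespace-separated fields in the order shown in the report.
    """
    fields = line.split()
    timestamp = float(fields[0])             # seconds since the epoch
    elapsed_ms = int(fields[1])              # request duration in milliseconds
    status = int(fields[3].split("/")[1])    # "TCP_MISS/504" -> 504
    url = fields[6]
    return timestamp, elapsed_ms, status, url


def slow_timeouts(log_text, min_minutes=10.0):
    """Count 504 responses per URL that took at least min_minutes."""
    counts = Counter()
    for line in log_text.splitlines():
        if not line.strip():
            continue
        _, elapsed_ms, status, url = parse_line(line)
        minutes = elapsed_ms / 1000 / 60      # ms -> s -> min
        if status == 504 and minutes >= min_minutes:
            counts[url] += 1
    return counts


if __name__ == "__main__":
    for line in SAMPLE_LOG.splitlines():
        _, elapsed_ms, status, url = parse_line(line)
        print(f"{url} status={status} took {elapsed_ms / 1000 / 60:.2f} min")
    print(slow_timeouts(SAMPLE_LOG))
```

Run against a full access log instead of `SAMPLE_LOG`, the same counting would show how many requests were stuck on a single robots.txt URL; the 10-minute threshold is an arbitrary choice for this sketch.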
_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers