    [ http://issues.apache.org/jira/browse/NUTCH-419?page=comments#action_12460696 ]

Carsten Lehmann commented on NUTCH-419:
---------------------------------------
Some more explanations: above I meant http://gso.gbv.de/XYZ, not http://XYZ.gso.gbv.de, of course.

I have attached two other log extracts:

a) squid_access_log_tail1000.txt
This file contains the last 1000 lines of the squid access log. It shows what the fetcher was actually doing before the fetch got aborted. It ends with a number of requests to that one robots.txt URL.

b) last_robots.txt_requests_squidlog.txt
This file shows the last requests to that robots.txt URL. It might be of concern that near the end of this file the line

1166652145.652 1042451 127.0.0.1 TCP_MISS/504 1450 GET http://gso.gbv.de/robots.txt - DIRECT/193.174.240.8 text/html

repeats 14 times. This means that there were 14 simultaneous requests to this URL, right? Are requests to the robots.txt file not covered by "fetcher.server.delay", which is set to "2.0" in my configuration? In any case, this looks like misbehaviour. (See the two sketches below, after the quoted issue.)

> unavailable robots.txt kills fetch
> ----------------------------------
>
>                 Key: NUTCH-419
>                 URL: http://issues.apache.org/jira/browse/NUTCH-419
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 0.8.1
>         Environment: Fetcher is behind a squid proxy, but I am pretty sure this is irrelevant.
>                      Nutch in local mode, running on a Linux machine with 2GB RAM.
>            Reporter: Carsten Lehmann
>         Attachments: last_robots.txt_requests_squidlog.txt, nutch-log.txt, squid_access_log_tail1000.txt
>
>
> I think there is another robots.txt-related problem which is not addressed by NUTCH-344, but which also results in an aborted fetch.
> I am sure that in my last fetch all 17 fetcher threads died while they were waiting for a robots.txt file to be delivered by a web server that was not responding properly.
> I looked at the squid access log, which is shared by all fetch threads. It ends with many HTTP 504 errors ("gateway timeout") caused by one particular robots.txt URL:
> <....>
> 1166652253.332 899427 127.0.0.1 TCP_MISS/504 1450 GET http://gso.gbv.de/robots.txt - DIRECT/193.174.240.8 text/html
> 1166652343.350 899664 127.0.0.1 TCP_MISS/504 1450 GET http://gso.gbv.de/robots.txt - DIRECT/193.174.240.8 text/html
> 1166652353.560 899871 127.0.0.1 TCP_MISS/504 1450 GET http://gso.gbv.de/robots.txt - DIRECT/193.174.240.8 text/html
> These entries mean that it takes 15 minutes before such a request ends with a timeout. This can be calculated from the squid log: the first column is the request time (in UTC seconds), the second column is the duration of the request (in ms), and 900000 / 1000 / 60 = 15 minutes.
> As far as I understand it, every time a fetch thread tries to get this robots.txt file, the thread blocks for the duration of the request (15 minutes).
> If this is right, then all 17 fetcher threads were caught in this trap at the time fetching was aborted, as there are 17 requests in the squid log which had not yet timed out when the message "aborting with 17 threads" was written to the Nutch log file.
> Setting fetcher.max.crawl.delay cannot help here: I see 296 access attempts for this robots.txt URL in the squid log of this crawl, although fetcher.max.crawl.delay is set to 30.
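For reference, these are the configuration knobs involved, as they would appear in nutch-site.xml. This is only a minimal sketch: the property names come from the stock nutch-default.xml, the values are the ones from my configuration mentioned above, and the descriptions are paraphrased.

<configuration>

  <property>
    <name>http.timeout</name>
    <value>10000</value>
    <description>Network timeout for a single request, in milliseconds.
    One would expect this to cap the robots.txt requests shown above
    long before squid's 15-minute gateway timeout.</description>
  </property>

  <property>
    <name>fetcher.server.delay</name>
    <value>2.0</value>
    <description>Seconds to wait between successive requests to the
    same server; apparently not applied to robots.txt requests.</description>
  </property>

  <property>
    <name>fetcher.max.crawl.delay</name>
    <value>30</value>
    <description>Skip pages whose robots.txt requests a Crawl-Delay
    longer than this many seconds.</description>
  </property>

</configuration>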
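The general fix for threads stuck like this is to put a hard deadline on the blocking robots.txt request. The following is only an illustrative sketch of that technique using java.util.concurrent, not Nutch's actual fetcher code; all class and method names are hypothetical:

import java.util.concurrent.*;

public class BoundedRobotsFetch {

    private static final ExecutorService POOL = Executors.newCachedThreadPool();

    // Run a blocking fetch, but give up after timeoutMs so a dead server
    // cannot pin the calling fetcher thread for squid's full 15 minutes.
    static String fetchWithDeadline(Callable<String> fetch, long timeoutMs)
            throws InterruptedException, ExecutionException {
        Future<String> f = POOL.submit(fetch);
        try {
            return f.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            f.cancel(true);   // interrupt the stuck request
            return null;      // caller treats this like "no robots.txt"
        }
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical stand-in for the real robots.txt download:
        // sleeps as long as the 504s above took to come back.
        Callable<String> slowServer = new Callable<String>() {
            public String call() throws Exception {
                Thread.sleep(900000L);   // ~15 minutes
                return "User-agent: *";
            }
        };
        String robots = fetchWithDeadline(slowServer, 10000L);
        System.out.println(robots == null
                ? "timed out -> proceed as if no robots.txt"
                : robots);
        POOL.shutdownNow();
    }
}

With a deadline like this, a non-responding server costs each fetcher thread at most timeoutMs instead of the full gateway timeout, so a fetch could no longer be aborted with all 17 threads waiting on a single host.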