unavailable robots.txt kills fetch
----------------------------------

                 Key: NUTCH-419
                 URL: http://issues.apache.org/jira/browse/NUTCH-419
             Project: Nutch
          Issue Type: Bug
          Components: fetcher
    Affects Versions: 0.8.1
         Environment: Fetcher is behind a Squid proxy, but I am fairly sure
this is irrelevant.
Nutch in local mode, running on a Linux machine with 2 GB RAM.
            Reporter: Carsten Lehmann


I think there is another robots.txt-related problem, not addressed by
NUTCH-344, which also results in an aborted fetch.

I am sure that in my last fetch all 17 fetcher threads died while they
were waiting for a robots.txt file to be delivered by a web server that
was not responding properly.

I looked at the Squid access log, which is shared by all fetcher threads.
It ends with many HTTP 504 errors ("Gateway Timeout") caused by one
particular robots.txt URL:

<....>
1166652253.332 899427 127.0.0.1 TCP_MISS/504 1450 GET
http://gso.gbv.de/robots.txt - DIRECT/193.174.240.8 text/html
1166652343.350 899664 127.0.0.1 TCP_MISS/504 1450 GET
http://gso.gbv.de/robots.txt - DIRECT/193.174.240.8 text/html
1166652353.560 899871 127.0.0.1 TCP_MISS/504 1450 GET
http://gso.gbv.de/robots.txt - DIRECT/193.174.240.8 text/html

These entries show that each request took roughly 15 minutes before it
ended with a timeout.
This can be read from the Squid log: the first column is the request
time (in UTC seconds) and the second column is the duration of the
request (in milliseconds), so about
900000 ms / 1000 / 60 = 15 minutes.
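
For clarity, here is a minimal sketch (not part of Nutch) of how that
duration can be read from one of the log lines above; the field order is
Squid's default native log format, where the second field is the elapsed
time of the request in milliseconds:

// Illustration only: pull the elapsed time (second field, in ms) out of a
// Squid access log line and convert it to minutes.
public class SquidLogDuration {
    public static void main(String[] args) {
        String line = "1166652253.332 899427 127.0.0.1 TCP_MISS/504 1450 GET "
                    + "http://gso.gbv.de/robots.txt - DIRECT/193.174.240.8 text/html";
        String[] fields = line.trim().split("\\s+");
        long elapsedMs = Long.parseLong(fields[1]);  // 899427 ms
        double minutes = elapsedMs / 1000.0 / 60.0;  // ~15 minutes
        System.out.printf("request took %.1f minutes%n", minutes);
    }
}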

As far as I understand it, every time a fetcher thread tries to get this
robots.txt file, the thread waits for the full duration of the request
(15 minutes).
If this is right, then all 17 fetcher threads were caught in this trap
at the time fetching was aborted, as there are 17 requests in the Squid
log which had not timed out before the message "aborting with
17 threads" was written to the Nutch log file.

Setting fetcher.max.crawl.delay cannot help here.
I see 296 access attempts in total for this robots.txt URL in the
Squid log of this crawl, even though fetcher.max.crawl.delay is set to 30.
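
For reference, this is the property in question as one would set it in
conf/nutch-site.xml (value as in my crawl). As far as I understand it,
it only caps the Crawl-Delay a robots.txt may request, in seconds; it
says nothing about how long the robots.txt request itself may take,
which is why it cannot help with this problem.

<!-- conf/nutch-site.xml (sketch): caps the Crawl-Delay a site may request,
     but does not bound the duration of the robots.txt request itself -->
<property>
  <name>fetcher.max.crawl.delay</name>
  <value>30</value>
</property>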

