sebastian-nagel opened a new pull request #724:
URL: https://github.com/apache/nutch/pull/724


   - add properties (a configuration-reading sketch follows after this list):
     - `http.robots.503.defer.visits` :
       enable/disable the feature (default: enabled)
     - `http.robots.503.defer.visits.delay` :
       delay to wait before the next attempt to fetch the robots.txt
       (default: wait 5 minutes)
     - `http.robots.503.defer.visits.retries` :
       max. number of retries before giving up and dropping all URLs from the
       given host / queue
       (default: give up after the 3rd retry, i.e. after 4 attempts)
   - handle HTTP 5xx in robots.txt parser
   - handle delay, retries and dropping queues in Fetcher
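   
   Just to make the configuration concrete, below is a minimal sketch of how
   these properties could be read via the usual Hadoop `Configuration`
   accessors. The class and field names (and the millisecond unit assumed for
   the delay) are only illustrative and not necessarily those used in the patch:
   
   ```java
   import org.apache.hadoop.conf.Configuration;
   
   /** Illustrative holder for the defer-visits settings (not the patch code). */
   public class RobotsDeferVisitsConfig {
   
     final boolean deferVisits;   // feature switch
     final long deferDelayMs;     // delay before the robots.txt is tried again
     final int maxDeferRetries;   // retries before the host's queue is dropped
   
     public RobotsDeferVisitsConfig(Configuration conf) {
       deferVisits = conf.getBoolean("http.robots.503.defer.visits", true);
       deferDelayMs = conf.getLong("http.robots.503.defer.visits.delay",
           5 * 60 * 1000L); // 5 minutes, assuming milliseconds
       maxDeferRetries = conf.getInt("http.robots.503.defer.visits.retries", 3);
     }
   }
   ```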
   
   Stop queuing fetch items once the timelimit is reached (see the sketch
   after this list), whether they are
   - items re-queued because the robots.txt request returned a 5xx,
   - redirects (`http.redirect.max` > 0), or
   - outlinks (`fetcher.follow.outlinks.depth` > 0).
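   
   A minimal sketch of the kind of guard this boils down to; the class and
   method names are made up for illustration and differ from the actual
   Fetcher code:
   
   ```java
   /** Illustrative timelimit guard, not the actual Fetcher code. */
   public class QueueGuard {
   
     /** Absolute end time in milliseconds, or -1 if no timelimit is set. */
     private final long timelimit;
   
     public QueueGuard(long timelimit) {
       this.timelimit = timelimit;
     }
   
     /**
      * Returns false once the timelimit has passed: re-queued items
      * (robots.txt 5xx), redirects and outlinks are then dropped instead of
      * being added to the already flushed queues.
      */
     public boolean mayQueue() {
       return timelimit <= 0 || System.currentTimeMillis() < timelimit;
     }
   }
   ```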
   
   In a first version, I forgot to verify whether the Fetcher timelimit
   (`fetcher.timelimit.mins`) had already been reached before re-queuing the
   fetch item. This caused a small number of fetcher tasks to end up in an
   infinite loop. In detail, this happened:
   1. a fetcher thread starts fetching an item and requests the corresponding
      robots.txt. Possibly, the server responds slowly.
   2. the fetcher timelimit is reached and all fetcher queues are flushed
   3. the robots.txt response "arrives". Because it is a 5xx, the fetch item is
      re-queued and the fetch is delayed for 30 min. (custom configuration).
   
   Steps 1 and 3 are then repeated until the max. number of retries is
   reached. This is now fixed, and I've also made sure that redirects and
   outlinks are not queued once the timelimit is reached.
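   
   To illustrate the fixed flow, here is a self-contained sketch of handling a
   robots.txt 5xx response with the timelimit check in place. All names
   (`HostQueue`, `onRobotsUnavailable`, ...) are made up for illustration and
   do not correspond to the actual Fetcher classes:
   
   ```java
   import java.util.ArrayDeque;
   import java.util.Deque;
   
   /** Illustrative model of a per-host fetch queue (not Nutch's FetchItemQueue). */
   public class HostQueue {
   
     final Deque<String> urls = new ArrayDeque<>();
     long nextFetchTime = 0;     // the queue is not polled before this time
     int robotsDeferRetries = 0; // how often the robots.txt fetch was deferred
   
     /**
      * Called when the robots.txt request for this host returned a 5xx.
      * Returns true if the URL was re-queued for a later attempt.
      */
     boolean onRobotsUnavailable(String url, long now, long timelimit,
         long deferDelayMs, int maxDeferRetries) {
       if (timelimit > 0 && now >= timelimit) {
         // Timelimit passed and the queues were already flushed: do not
         // re-queue, otherwise the task keeps looping on deferred fetches.
         return false;
       }
       if (++robotsDeferRetries > maxDeferRetries) {
         // Too many retries: give up and drop all URLs of this host.
         urls.clear();
         return false;
       }
       // Suspend the whole queue and retry the URL after the configured delay.
       nextFetchTime = now + deferDelayMs;
       urls.addFirst(url);
       return true;
     }
   }
   ```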
   
   

