Sebastian Nagel created NUTCH-2767:
--------------------------------------

             Summary: Fetcher to stop filling queues skipped due to repeated exceptions
                 Key: NUTCH-2767
                 URL: https://issues.apache.org/jira/browse/NUTCH-2767
             Project: Nutch
          Issue Type: Improvement
          Components: fetcher
    Affects Versions: 1.16
            Reporter: Sebastian Nagel
             Fix For: 1.17


Since NUTCH-769 the fetcher skips URLs from queues which have already hit more 
exceptions than configured by "fetcher.max.exceptions.per.queue". Such queues 
are emptied when the threshold is reached. However, the QueueFeeder may still 
be feeding queues and may again add URLs to queues which are already over the 
exception threshold. The first URL in such a queue is then fetched, and the 
following ones are removed once the next exception is observed.

Here is one example:
{noformat}
2020-02-19 06:26:48,877 INFO [FetcherThread] o.a.n.fetcher.FetchItemQueues: * 
queue: www.example.com >> removed 61 URLs from queue because 40 exceptions 
occurred
2020-02-19 06:26:53,551 INFO [FetcherThread] o.a.n.fetcher.FetcherThread: 
FetcherThread 172 fetching https://www.example.com/... (queue crawl 
delay=5000ms)
2020-02-19 06:26:54,073 INFO [FetcherThread] o.a.n.fetcher.FetcherThread: 
FetcherThread 172 fetch of https://www.example.com/... failed with: ...
2020-02-19 06:26:58,766 INFO [FetcherThread] o.a.n.fetcher.FetcherThread: 
FetcherThread 111 fetching https://www.example.com/... (queue crawl 
delay=5000ms)
2020-02-19 06:26:59,290 INFO [FetcherThread] o.a.n.fetcher.FetcherThread: 
FetcherThread 111 fetch of https://www.example.com/... failed with: ...
2020-02-19 06:27:03,960 INFO [FetcherThread] o.a.n.fetcher.FetcherThread: 
FetcherThread 103 fetching https://www.example.com/... (queue crawl 
delay=5000ms)
2020-02-19 06:27:04,482 INFO [FetcherThread] o.a.n.fetcher.FetcherThread: 
FetcherThread 103 fetch of https://www.example.com/... failed with: ...
2020-02-19 06:27:04,484 INFO [FetcherThread] o.a.n.fetcher.FetchItemQueues: * 
queue: www.example.com >> removed 1 URLs from queue because 41 exceptions 
occurred
... (fetching again 30 URLs, all failed)
2020-02-19 06:28:23,578 INFO [FetcherThread] 
org.apache.nutch.fetcher.FetchItemQueues: * queue: www.example.com >> removed 1 
URLs from queue because 42 exceptions occurred
{noformat}

QueueFeeder should check whether the exception threshold has already been 
reached for a queue and, if so, not add further URLs to that queue.
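
To illustrate the proposed check, here is a minimal sketch in Java. It is not 
the actual Nutch code: QueueFeederSketch, HostQueue, offer() and 
reportException() are hypothetical names made up for this example, and the 
treatment of a negative threshold as "no limit" is an assumption of the 
sketch. The only point it shows is that the feeder side consults the 
per-queue exception counter before adding a URL.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

/** Hypothetical, simplified model of per-host fetch queues (not Nutch classes). */
public class QueueFeederSketch {

  /** One queue per host, with a counter of exceptions seen so far. */
  static class HostQueue {
    final Queue<String> urls = new ArrayDeque<>();
    int exceptions = 0;
  }

  private final Map<String, HostQueue> queues = new HashMap<>();
  // Stands in for "fetcher.max.exceptions.per.queue"; negative means no limit
  // (assumption of this sketch).
  private final int maxExceptionsPerQueue;

  public QueueFeederSketch(int maxExceptionsPerQueue) {
    this.maxExceptionsPerQueue = maxExceptionsPerQueue;
  }

  private HostQueue queue(String host) {
    return queues.computeIfAbsent(host, h -> new HostQueue());
  }

  private boolean overLimit(HostQueue q) {
    return maxExceptionsPerQueue >= 0 && q.exceptions >= maxExceptionsPerQueue;
  }

  /** Fetcher side: count an exception and purge the queue once over the limit. */
  public synchronized int reportException(String host) {
    HostQueue q = queue(host);
    q.exceptions++;
    if (!overLimit(q)) {
      return 0;
    }
    int removed = q.urls.size();
    q.urls.clear();
    return removed; // number of URLs dropped from the queue
  }

  /** Feeder side: refuse URLs for queues which are already over the limit. */
  public synchronized boolean offer(String host, String url) {
    HostQueue q = queue(host);
    if (overLimit(q)) {
      return false; // the proposed check: do not refill a skipped queue
    }
    q.urls.add(url);
    return true;
  }

  public synchronized int size(String host) {
    return queue(host).urls.size();
  }

  public static void main(String[] args) {
    QueueFeederSketch queues = new QueueFeederSketch(2);
    String host = "www.example.com";
    queues.offer(host, "https://www.example.com/a");
    queues.offer(host, "https://www.example.com/b");
    queues.reportException(host);               // first exception, below limit
    int removed = queues.reportException(host); // second exception, queue purged
    boolean accepted = queues.offer(host, "https://www.example.com/c");
    System.out.println("removed=" + removed + " accepted=" + accepted
        + " size=" + queues.size(host));
  }
}
```

With a check like this the feeder could also count the rejected URLs and 
report them the same way as the URLs purged by the fetcher threads, but that 
is left out of the sketch.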



--
This message was sent by Atlassian Jira
(v8.3.4#803005)