Sebastian Nagel created NUTCH-2767:
--------------------------------------
Summary: Fetcher to stop filling queues skipped due to repeated
exceptions
Key: NUTCH-2767
URL: https://issues.apache.org/jira/browse/NUTCH-2767
Project: Nutch
Issue Type: Improvement
Components: fetcher
Affects Versions: 1.16
Reporter: Sebastian Nagel
Fix For: 1.17
Since NUTCH-769 the fetcher skips URLs from queues which already got more
exceptions than configured by "fetcher.max.exceptions.per.queue". Such queues
are emptied when the threshold is reached. However, the QueueFeeder may still
be feeding queues and can again add URLs to queues which are already over the
exception threshold. The first URL in such a queue is then fetched, and the
remaining ones are removed only once the next exception is observed.
Here is one example:
{noformat}
2020-02-19 06:26:48,877 INFO [FetcherThread] o.a.n.fetcher.FetchItemQueues: *
queue: www.example.com >> removed 61 URLs from queue because 40 exceptions
occurred
2020-02-19 06:26:53,551 INFO [FetcherThread] o.a.n.fetcher.FetcherThread:
FetcherThread 172 fetching https://www.example.com/... (queue crawl
delay=5000ms)
2020-02-19 06:26:54,073 INFO [FetcherThread] o.a.n.fetcher.FetcherThread:
FetcherThread 172 fetch of https://www.example.com/... failed with: ...
2020-02-19 06:26:58,766 INFO [FetcherThread] o.a.n.fetcher.FetcherThread:
FetcherThread 111 fetching https://www.example.com/... (queue crawl
delay=5000ms)
2020-02-19 06:26:59,290 INFO [FetcherThread] o.a.n.fetcher.FetcherThread:
FetcherThread 111 fetch of https://www.example.com/... failed with: ...
2020-02-19 06:27:03,960 INFO [FetcherThread] o.a.n.fetcher.FetcherThread:
FetcherThread 103 fetching https://www.example.com/... (queue crawl
delay=5000ms)
2020-02-19 06:27:04,482 INFO [FetcherThread] o.a.n.fetcher.FetcherThread:
FetcherThread 103 fetch of https://www.example.com/... failed with: ...
2020-02-19 06:27:04,484 INFO [FetcherThread] o.a.n.fetcher.FetchItemQueues: *
queue: www.example.com >> removed 1 URLs from queue because 41 exceptions
occurred
... (fetching again 30 URLs, all failed)
2020-02-19 06:28:23,578 INFO [FetcherThread]
org.apache.nutch.fetcher.FetchItemQueues: * queue: www.example.com >> removed 1
URLs from queue because 42 exceptions occurred
{noformat}
QueueFeeder should check whether the exception threshold has already been
reached for a queue and, if so, not add further URLs to that queue.
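A minimal sketch of the proposed check, not the actual Nutch code. The class
FeederThresholdSketch and its methods offer(), recordException() and
isOverThreshold() are hypothetical stand-ins for FetchItemQueues and the
QueueFeeder. Treating a negative limit as "disabled" is also an assumption made
for this sketch.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the per-host fetch queues; not Nutch's real classes. */
class FeederThresholdSketch {
  // Plays the role of fetcher.max.exceptions.per.queue (assumed: negative = no limit).
  private final int maxExceptionsPerQueue;
  private final Map<String, Deque<String>> queues = new HashMap<>();
  private final Map<String, Integer> exceptionCounts = new HashMap<>();

  FeederThresholdSketch(int maxExceptionsPerQueue) {
    this.maxExceptionsPerQueue = maxExceptionsPerQueue;
  }

  /** True once the queue has reached the configured number of exceptions. */
  synchronized boolean isOverThreshold(String queueId) {
    return maxExceptionsPerQueue >= 0
        && exceptionCounts.getOrDefault(queueId, 0) >= maxExceptionsPerQueue;
  }

  /** Feeder side: refuse to add URLs to a queue that is already over the threshold. */
  synchronized boolean offer(String queueId, String url) {
    if (isOverThreshold(queueId)) {
      return false; // skipped instead of being queued and fetched again
    }
    queues.computeIfAbsent(queueId, id -> new ArrayDeque<>()).add(url);
    return true;
  }

  /** Fetcher side: count one exception and purge the queue once the limit is reached. */
  synchronized int recordException(String queueId) {
    exceptionCounts.merge(queueId, 1, Integer::sum);
    if (!isOverThreshold(queueId)) {
      return 0;
    }
    Deque<String> queue = queues.get(queueId);
    if (queue == null) {
      return 0;
    }
    int purged = queue.size();
    queue.clear();
    return purged;
  }

  /** Number of URLs currently waiting in the given queue. */
  synchronized int size(String queueId) {
    Deque<String> queue = queues.get(queueId);
    return queue == null ? 0 : queue.size();
  }

  public static void main(String[] args) {
    FeederThresholdSketch queues = new FeederThresholdSketch(2);
    queues.offer("www.example.com", "https://www.example.com/a");
    queues.offer("www.example.com", "https://www.example.com/b");
    queues.recordException("www.example.com"); // first exception, below the limit
    int purged = queues.recordException("www.example.com"); // limit reached, queue purged
    boolean accepted = queues.offer("www.example.com", "https://www.example.com/c");
    System.out.println("purged=" + purged + " accepted=" + accepted);
  }
}
```

With the check in offer(), a URL fed after the purge is rejected right away
rather than being fetched and only removed after one more exception. Queues
for other hosts are not affected.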
--
This message was sent by Atlassian Jira
(v8.3.4#803005)