[ https://issues.apache.org/jira/browse/NUTCH-2767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sebastian Nagel reassigned NUTCH-2767:
--------------------------------------
Assignee: Sebastian Nagel
> Fetcher to stop filling queues skipped due to repeated exceptions
> -----------------------------------------------------------------
>
> Key: NUTCH-2767
> URL: https://issues.apache.org/jira/browse/NUTCH-2767
> Project: Nutch
> Issue Type: Improvement
> Components: fetcher
> Affects Versions: 1.16
> Reporter: Sebastian Nagel
> Assignee: Sebastian Nagel
> Priority: Minor
> Fix For: 1.17
>
>
> Since NUTCH-769, the fetcher skips URLs from queues that have already seen more
> exceptions than configured by "fetcher.max.exceptions.per.queue". Such queues
> are emptied when the threshold is reached. However, the QueueFeeder may still
> be feeding queues and may again add URLs to queues that are already over the
> exception threshold. The first URL in such a queue is then fetched, and the
> following ones are removed only once the next exception is observed.
> Here is one example:
> {noformat}
> 2020-02-19 06:26:48,877 INFO [FetcherThread] o.a.n.fetcher.FetchItemQueues: *
> queue: www.example.com >> removed 61 URLs from queue because 40 exceptions
> occurred
> 2020-02-19 06:26:53,551 INFO [FetcherThread] o.a.n.fetcher.FetcherThread:
> FetcherThread 172 fetching https://www.example.com/... (queue crawl
> delay=5000ms)
> 2020-02-19 06:26:54,073 INFO [FetcherThread] o.a.n.fetcher.FetcherThread:
> FetcherThread 172 fetch of https://www.example.com/... failed with: ...
> 2020-02-19 06:26:58,766 INFO [FetcherThread] o.a.n.fetcher.FetcherThread:
> FetcherThread 111 fetching https://www.example.com/... (queue crawl
> delay=5000ms)
> 2020-02-19 06:26:59,290 INFO [FetcherThread] o.a.n.fetcher.FetcherThread:
> FetcherThread 111 fetch of https://www.example.com/... failed with: ...
> 2020-02-19 06:27:03,960 INFO [FetcherThread] o.a.n.fetcher.FetcherThread:
> FetcherThread 103 fetching https://www.example.com/... (queue crawl
> delay=5000ms)
> 2020-02-19 06:27:04,482 INFO [FetcherThread] o.a.n.fetcher.FetcherThread:
> FetcherThread 103 fetch of https://www.example.com/... failed with: ...
> 2020-02-19 06:27:04,484 INFO [FetcherThread] o.a.n.fetcher.FetchItemQueues: *
> queue: www.example.com >> removed 1 URLs from queue because 41 exceptions
> occurred
> ... (fetching again 30 URLs, all failed)
> 2020-02-19 06:28:23,578 INFO [FetcherThread]
> org.apache.nutch.fetcher.FetchItemQueues: * queue: www.example.com >> removed
> 1 URLs from queue because 42 exceptions occurred
> {noformat}
> QueueFeeder should check whether the exception threshold has already been
> reached and, if so, not add further URLs to the queue.
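
The proposed check can be illustrated with a minimal sketch. It is not the Nutch implementation: the class `SketchQueues` and its methods `isOverThreshold`, `recordException`, `addFetchItem` and `size` are hypothetical names chosen for illustration, and the comparison (`>=`) and the treatment of a negative limit as "disabled" are assumptions. The sketch is also not thread-safe, whereas the real fetcher runs many threads.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

/** Hypothetical stand-in for the fetcher's per-host queues; not the Nutch API. */
class SketchQueues {
  // Plays the role of "fetcher.max.exceptions.per.queue".
  // Assumption: a negative value disables the limit.
  private final int maxExceptionsPerQueue;
  private final Map<String, Queue<String>> queues = new HashMap<>();
  private final Map<String, Integer> exceptionCounts = new HashMap<>();

  SketchQueues(int maxExceptionsPerQueue) {
    this.maxExceptionsPerQueue = maxExceptionsPerQueue;
  }

  /** True once the queue has recorded at least the configured number of exceptions. */
  boolean isOverThreshold(String queueId) {
    return maxExceptionsPerQueue >= 0
        && exceptionCounts.getOrDefault(queueId, 0) >= maxExceptionsPerQueue;
  }

  /** Counts one failed fetch; empties the queue if the threshold is reached. */
  int recordException(String queueId) {
    exceptionCounts.merge(queueId, 1, Integer::sum);
    if (!isOverThreshold(queueId)) {
      return 0;
    }
    Queue<String> queue = queues.get(queueId);
    if (queue == null) {
      return 0;
    }
    int removed = queue.size();
    queue.clear();
    return removed; // number of URLs dropped from the queue
  }

  /** Feeder entry point: refuses the URL if the queue is already over the threshold. */
  boolean addFetchItem(String queueId, String url) {
    if (isOverThreshold(queueId)) {
      return false; // the check this issue asks for: do not refill the queue
    }
    queues.computeIfAbsent(queueId, k -> new ArrayDeque<>()).add(url);
    return true;
  }

  /** Number of URLs currently waiting in the given queue. */
  int size(String queueId) {
    Queue<String> queue = queues.get(queueId);
    return queue == null ? 0 : queue.size();
  }
}
```

With this check in place, the feeder's attempts to add URLs to a purged queue are rejected up front, so no further fetches of the failing host are started. Without it, each refill costs one more fetch attempt plus the crawl delay, as in the log above.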