[ 
https://issues.apache.org/jira/browse/NUTCH-2767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17046465#comment-17046465
 ] 

ASF GitHub Bot commented on NUTCH-2767:
---------------------------------------

sebastian-nagel commented on pull request #497: NUTCH-2767 Fetcher to stop 
filling queues skipped due to repeated exception
URL: https://github.com/apache/nutch/pull/497
 
 
   
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


> Fetcher to stop filling queues skipped due to repeated exceptions
> -----------------------------------------------------------------
>
>                 Key: NUTCH-2767
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2767
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 1.16
>            Reporter: Sebastian Nagel
>            Assignee: Sebastian Nagel
>            Priority: Minor
>             Fix For: 1.17
>
>
> Since NUTCH-769 the fetcher skips URLs from queues which have already 
> accumulated more exceptions than configured by 
> "fetcher.max.exceptions.per.queue". Such queues are emptied when the 
> threshold is reached. However, the QueueFeeder may still be feeding these 
> queues and add URLs again to queues which are already over the exception 
> threshold. The first URL in the queue is then fetched; consecutive ones are 
> eventually removed when the next exception is observed.
> Here is one example:
> {noformat}
> 2020-02-19 06:26:48,877 INFO [FetcherThread] o.a.n.fetcher.FetchItemQueues: * 
> queue: www.example.com >> removed 61 URLs from queue because 40 exceptions 
> occurred
> 2020-02-19 06:26:53,551 INFO [FetcherThread] o.a.n.fetcher.FetcherThread: 
> FetcherThread 172 fetching https://www.example.com/... (queue crawl 
> delay=5000ms)
> 2020-02-19 06:26:54,073 INFO [FetcherThread] o.a.n.fetcher.FetcherThread: 
> FetcherThread 172 fetch of https://www.example.com/... failed with: ...
> 2020-02-19 06:26:58,766 INFO [FetcherThread] o.a.n.fetcher.FetcherThread: 
> FetcherThread 111 fetching https://www.example.com/... (queue crawl 
> delay=5000ms)
> 2020-02-19 06:26:59,290 INFO [FetcherThread] o.a.n.fetcher.FetcherThread: 
> FetcherThread 111 fetch of https://www.example.com/... failed with: ...
> 2020-02-19 06:27:03,960 INFO [FetcherThread] o.a.n.fetcher.FetcherThread: 
> FetcherThread 103 fetching https://www.example.com/... (queue crawl 
> delay=5000ms)
> 2020-02-19 06:27:04,482 INFO [FetcherThread] o.a.n.fetcher.FetcherThread: 
> FetcherThread 103 fetch of https://www.example.com/... failed with: ...
> 2020-02-19 06:27:04,484 INFO [FetcherThread] o.a.n.fetcher.FetchItemQueues: * 
> queue: www.example.com >> removed 1 URLs from queue because 41 exceptions 
> occurred
> ... (fetching again 30 URLs, all failed)
> 2020-02-19 06:28:23,578 INFO [FetcherThread] 
> org.apache.nutch.fetcher.FetchItemQueues: * queue: www.example.com >> removed 
> 1 URLs from queue because 42 exceptions occurred
> {noformat}
> QueueFeeder should check whether the exception threshold has already been 
> reached and, if so, not add further URLs to the queue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
