[
https://issues.apache.org/jira/browse/NUTCH-2767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17040995#comment-17040995
]
Sebastian Nagel commented on NUTCH-2767:
----------------------------------------
With NUTCH-2171 we switched to Java 8 three years ago. It should be reasonably safe
to use Java 8 features now. But after a second look: we could also use an int[]
array to count the occurrences of the enum QueuingStatus (indexing by .ordinal()),
which should also be faster than a map. I'll update the patch/PR.
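The int[]-indexed-by-ordinal() idea could be sketched as below. The enum constants here are illustrative placeholders, not the actual QueuingStatus values from the patch:

```java
// Sketch: counting occurrences of an enum with an int[] indexed by
// ordinal() instead of a Map. The enum values are illustrative only,
// not the actual Nutch QueuingStatus definition.
public class QueuingStatusCounter {

  enum QueuingStatus {
    SUCCESSFULLY_QUEUED, ERROR_CREATE_FETCH_ITEM, ABOVE_EXCEPTION_THRESHOLD
  }

  // one counter slot per enum constant
  private final int[] counts = new int[QueuingStatus.values().length];

  public void increment(QueuingStatus status) {
    counts[status.ordinal()]++; // O(1), no boxing, no hashing
  }

  public int get(QueuingStatus status) {
    return counts[status.ordinal()];
  }
}
```

Compared to a Map<QueuingStatus, Integer>, this avoids both hashing and Integer boxing on every increment.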
In short, how this issue was detected:
- I've seen a very few fetcher tasks which were significantly slower
- a profiler showed that a significant amount of the CPU time in the slow tasks
is spent filling the stack traces in the constructors of SocketException and
NoRouteToHostException. Note: protocol-okhttp is used, which has deeper stacks
(20-30 levels) due to the interceptor pattern used
- the fetcher tasks run for 3 hours, the QueueFeeder is feeding new URLs into
the queues for about 80% of the time
The patch is currently tested in production. Looks like the situation has
improved.
However, there is still one problem: empty queues are reaped in
FetchItemQueues.getFetchItem(), created anew later, and the exception counter
for the same host/domain is then reset to zero. I'll try to address this point
as well.
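One way to address this would be to keep the per-queue exception counts outside the queue objects, so they survive a reap/re-create cycle. This is a hypothetical sketch; the class and method names are illustrative, not the actual FetchItemQueues API:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: track exception counts in a map keyed by queue id (host/domain)
// that lives independently of the queue objects, so a queue recreated after
// being reaped does not restart its count at zero. Names are illustrative.
public class ExceptionCounts {

  private final Map<String, Integer> exceptionsPerQueue = new ConcurrentHashMap<>();

  /** Record one more exception for the given queue id; return the new total. */
  public int incrementExceptions(String queueId) {
    return exceptionsPerQueue.merge(queueId, 1, Integer::sum);
  }

  /** Current count; unchanged by queue removal or re-creation. */
  public int getExceptions(String queueId) {
    return exceptionsPerQueue.getOrDefault(queueId, 0);
  }
}
```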
> Fetcher to stop filling queues skipped due to repeated exceptions
> -----------------------------------------------------------------
>
> Key: NUTCH-2767
> URL: https://issues.apache.org/jira/browse/NUTCH-2767
> Project: Nutch
> Issue Type: Improvement
> Components: fetcher
> Affects Versions: 1.16
> Reporter: Sebastian Nagel
> Assignee: Sebastian Nagel
> Priority: Minor
> Fix For: 1.17
>
>
> Since NUTCH-769 the fetcher skips URLs from queues which already got more
> exceptions than configured by "fetcher.max.exceptions.per.queue". Such queues
> are emptied when the threshold is reached. However, the QueueFeeder may still
> be feeding the queues and again add URLs to queues which are already over the
> exception threshold. The first URL in the queue is then fetched; consecutive
> ones are eventually removed when the next exception is observed.
> Here is one example:
> {noformat}
> 2020-02-19 06:26:48,877 INFO [FetcherThread] o.a.n.fetcher.FetchItemQueues: *
> queue: www.example.com >> removed 61 URLs from queue because 40 exceptions
> occurred
> 2020-02-19 06:26:53,551 INFO [FetcherThread] o.a.n.fetcher.FetcherThread:
> FetcherThread 172 fetching https://www.example.com/... (queue crawl
> delay=5000ms)
> 2020-02-19 06:26:54,073 INFO [FetcherThread] o.a.n.fetcher.FetcherThread:
> FetcherThread 172 fetch of https://www.example.com/... failed with: ...
> 2020-02-19 06:26:58,766 INFO [FetcherThread] o.a.n.fetcher.FetcherThread:
> FetcherThread 111 fetching https://www.example.com/... (queue crawl
> delay=5000ms)
> 2020-02-19 06:26:59,290 INFO [FetcherThread] o.a.n.fetcher.FetcherThread:
> FetcherThread 111 fetch of https://www.example.com/... failed with: ...
> 2020-02-19 06:27:03,960 INFO [FetcherThread] o.a.n.fetcher.FetcherThread:
> FetcherThread 103 fetching https://www.example.com/... (queue crawl
> delay=5000ms)
> 2020-02-19 06:27:04,482 INFO [FetcherThread] o.a.n.fetcher.FetcherThread:
> FetcherThread 103 fetch of https://www.example.com/... failed with: ...
> 2020-02-19 06:27:04,484 INFO [FetcherThread] o.a.n.fetcher.FetchItemQueues: *
> queue: www.example.com >> removed 1 URLs from queue because 41 exceptions
> occurred
> ... (fetching again 30 URLs, all failed)
> 2020-02-19 06:28:23,578 INFO [FetcherThread]
> org.apache.nutch.fetcher.FetchItemQueues: * queue: www.example.com >> removed
> 1 URLs from queue because 42 exceptions occurred
> {noformat}
> QueueFeeder should check whether the exception threshold is already reached
> and, if so, not add further URLs to the queue.
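The feeder-side check proposed above could be sketched as follows. This is a simplified stand-alone illustration, not the actual QueueFeeder code; only the property name "fetcher.max.exceptions.per.queue" comes from the issue:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: before enqueuing a URL, consult the per-queue exception count and
// skip queues already over the threshold, counting the skipped URLs. Names
// are illustrative, not the actual Nutch API.
public class FeederSketch {

  private final int maxExceptionsPerQueue; // fetcher.max.exceptions.per.queue (-1 = unlimited)
  private final Map<String, Integer> exceptionCounts = new HashMap<>();
  private final Map<String, List<String>> queues = new HashMap<>();
  private int skipped = 0;

  public FeederSketch(int maxExceptionsPerQueue) {
    this.maxExceptionsPerQueue = maxExceptionsPerQueue;
  }

  public void noteException(String queueId) {
    exceptionCounts.merge(queueId, 1, Integer::sum);
  }

  /** Enqueue unless the queue already reached the exception threshold. */
  public boolean feed(String queueId, String url) {
    if (maxExceptionsPerQueue >= 0
        && exceptionCounts.getOrDefault(queueId, 0) >= maxExceptionsPerQueue) {
      skipped++; // dropped instead of queued
      return false;
    }
    queues.computeIfAbsent(queueId, k -> new ArrayList<>()).add(url);
    return true;
  }

  public int getSkipped() {
    return skipped;
  }
}
```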
--
This message was sent by Atlassian Jira
(v8.3.4#803005)