[
https://issues.apache.org/jira/browse/NUTCH-3177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18086896#comment-18086896
]
Hudson commented on NUTCH-3177:
-------------------------------
SUCCESS: Integrated in Jenkins build Nutch » Nutch-trunk #240 (See
[https://ci-builds.apache.org/job/Nutch/job/Nutch-trunk/240/])
NUTCH-3177 Fetcher to report idle threads not as hung threads (snagel:
[https://github.com/apache/nutch/commit/a0814a7b116dbf5b0053f7b2fa2428f3d868c95f])
* (edit) src/java/org/apache/nutch/fetcher/Fetcher.java
* (edit) src/java/org/apache/nutch/fetcher/FetcherThread.java
> Fetcher to report idle threads not as hung threads
> --------------------------------------------------
>
> Key: NUTCH-3177
> URL: https://issues.apache.org/jira/browse/NUTCH-3177
> Project: Nutch
> Issue Type: Bug
> Components: fetcher
> Affects Versions: 1.22
> Reporter: Sebastian Nagel
> Assignee: Sebastian Nagel
> Priority: Minor
> Fix For: 1.23
>
>
> If no URL is fetched during half of the MapReduce task timeout, Fetcher
> shuts down so that the fetcher map task does not fail because of missing
> progress. Before the shutdown, Fetcher reports the remaining FetcherThreads
> as "hung threads" together with the URL being fetched. This is meant to help
> debug the URLs / pages causing timeouts. The report uses the field {{reprUrl}}
> of FetcherThread. However, the field is not reset after a fetch is done.
> As a consequence, the reported URL is not necessarily one whose fetch
> is in progress. It might be the URL that was fetched last, while the thread is
> now idle and waiting for the next fetch item to be ready. This happens if
> there are still fetch queues, but with long delays because of a robots.txt
> Crawl-delay or a longer delay because of the exponential back-off.
> FetcherThread should reset {{reprUrl}} once a fetch is finished. Idle
> FetcherThreads shouldn't be reported as hanging.
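
The reset described above can be illustrated with a minimal sketch. It is not the code from the commit: the class IdleAwareWorker and its methods fetch, isIdle and describe are hypothetical names chosen for illustration, and only the field name reprUrl is taken from the issue text. The idea is to clear the field in a finally block so that a reporter can tell an idle thread from one that is still fetching.

```java
import java.util.function.Function;

/** Hypothetical worker for illustration; not the actual Nutch FetcherThread. */
class IdleAwareWorker {
    // URL of the fetch in progress; null while the worker is idle.
    // volatile so that a separate reporting thread sees the latest value.
    private volatile String reprUrl;

    /** Runs one fetch and clears reprUrl afterwards, even if the fetch throws. */
    <T> T fetch(String url, Function<String, T> fetcher) {
        reprUrl = url;
        try {
            return fetcher.apply(url);
        } finally {
            reprUrl = null; // fetch finished: the worker is idle again
        }
    }

    /** True if no fetch is currently in progress. */
    boolean isIdle() {
        return reprUrl == null;
    }

    /** Status line that a reporter could log before shutting down. */
    String describe() {
        String url = reprUrl; // read once to avoid a race with the reset
        return url == null ? "idle" : "fetching " + url;
    }
}
```

With a reset like this, a reporter could skip workers for which isIdle() returns true and list only those still fetching as potentially hung.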
--
This message was sent by Atlassian Jira
(v8.20.10#820010)