[ https://issues.apache.org/jira/browse/NUTCH-3177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sebastian Nagel updated NUTCH-3177:
-----------------------------------
Description:
If no URL is fetched during half of the MapReduce task timeout, Fetcher shuts
down to prevent the fetcher map task from failing because of missing progress.
Before the shutdown, Fetcher reports the remaining FetcherThreads as "hung
threads" together with the fetched URL. This should make it possible to debug
the URLs / pages causing timeouts. For the reporting, the field {{reprUrl}} of
FetcherThread is used. However, the field is not reset after a fetch is done.
As a consequence, the reported URL is not necessarily the one where the fetch
is in progress. It might be the URL that was fetched last, while the thread is
now idle and waiting for the next fetch item to be ready. This happens if there
are still fetch queues, but with long delays because of a robots.txt
Crawl-delay or a longer delay because of the exponential back-off.
FetcherThread should reset the {{reprUrl}} once a fetch is finished. Idle
FetcherThreads shouldn't be reported as hanging.
was:
If no URL is fetched during half of the MapReduce task timeout, Fetcher shuts
down to prevent the fetcher map task from failing because of missing progress.
Before the shutdown, Fetcher reports the remaining FetcherThreads as "hung
threads" together with the fetched URL. This should make it possible to debug
the URLs / pages causing timeouts. For the reporting, the field {{reprUrl}} of
FetcherThread is used. However, the field is not reset after a fetch is done.
As a consequence, the reported URL is not necessarily one where the fetch is
in progress. It might be the URL that was fetched last, while the thread is
now idle and waiting for the next fetch item to be ready. This happens if there
are still fetch queues, but with long delays because of a robots.txt
Crawl-delay or a longer delay because of the exponential back-off.
FetcherThread should reset the {{reprUrl}} once a fetch is finished. Idle
FetcherThreads shouldn't be reported as hanging.
> Fetcher to report idle threads not as hung threads
> --------------------------------------------------
>
> Key: NUTCH-3177
> URL: https://issues.apache.org/jira/browse/NUTCH-3177
> Project: Nutch
> Issue Type: Bug
> Components: fetcher
> Affects Versions: 1.22
> Reporter: Sebastian Nagel
> Priority: Minor
> Fix For: 1.23
>
>
> If no URL is fetched during half of the MapReduce task timeout, Fetcher shuts
> down to prevent the fetcher map task from failing because of missing
> progress. Before the shutdown, Fetcher reports the remaining FetcherThreads
> as "hung threads" together with the fetched URL. This should make it possible
> to debug the URLs / pages causing timeouts. For the reporting, the field
> {{reprUrl}} of FetcherThread is used. However, the field is not reset after a
> fetch is done. As a consequence, the reported URL is not necessarily the one
> where the fetch is in progress. It might be the URL that was fetched last,
> while the thread is now idle and waiting for the next fetch item to be ready.
> This happens if there are still fetch queues, but with long delays because of
> a robots.txt Crawl-delay or a longer delay because of the exponential
> back-off.
> FetcherThread should reset the {{reprUrl}} once a fetch is finished. Idle
> FetcherThreads shouldn't be reported as hanging.
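
The proposed behaviour can be illustrated with a minimal, self-contained sketch. It is not the actual Nutch code: the class HungThreadReportSketch, the nested WorkerSketch, and the method reportHung are hypothetical names chosen for illustration, and only the {{reprUrl}} field name is taken from the description. The sketch assumes that a null {{reprUrl}} can serve as the "idle" marker, so that the report lists only workers with a fetch still in progress.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the proposed fix; not the actual Nutch Fetcher code. */
public class HungThreadReportSketch {

  /** Stand-in for a fetcher thread; only the URL bookkeeping is modelled. */
  static class WorkerSketch {
    // URL of the fetch currently in progress, or null while the worker is idle.
    private volatile String reprUrl;

    /** Runs one fetch and clears reprUrl afterwards, even if the fetch fails. */
    void fetch(String url, Runnable fetchAction) {
      reprUrl = url;
      try {
        fetchAction.run();
      } finally {
        reprUrl = null; // reset so that an idle worker carries no stale URL
      }
    }

    String getReprUrl() {
      return reprUrl;
    }
  }

  /** Returns the URLs of workers that are mid-fetch; idle workers are skipped. */
  static List<String> reportHung(List<WorkerSketch> workers) {
    List<String> hung = new ArrayList<>();
    for (WorkerSketch worker : workers) {
      String url = worker.getReprUrl();
      if (url != null) {
        hung.add(url);
      }
    }
    return hung;
  }

  public static void main(String[] args) {
    // This worker finished its fetch and is now idle.
    WorkerSketch idle = new WorkerSketch();
    idle.fetch("https://example.org/done", () -> { });

    // This worker is inspected while its fetch is still running.
    WorkerSketch busy = new WorkerSketch();
    List<WorkerSketch> workers = List.of(idle, busy);
    List<String> duringFetch = new ArrayList<>();
    busy.fetch("https://example.org/slow",
        () -> duringFetch.addAll(reportHung(workers)));

    System.out.println(duringFetch);         // [https://example.org/slow]
    System.out.println(reportHung(workers)); // []
  }
}
```

Resetting the field in a finally block is one possible design choice: it clears the URL even if the fetch throws, so a failed fetch does not leave a stale URL behind either.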
--
This message was sent by Atlassian Jira
(v8.20.10#820010)