[ 
https://issues.apache.org/jira/browse/NUTCH-3177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-3177:
-----------------------------------
    Description: 
If no URL is fetched during half of the MapReduce task timeout, Fetcher shuts 
down to prevent the fetcher map task from failing for lack of progress. Before 
shutting down, Fetcher reports the remaining FetcherThreads as "hung threads" 
together with the URL being fetched. This should make it possible to debug the 
URLs / pages causing timeouts. The report uses the field {{reprUrl}} of 
FetcherThread. However, the field is not reset after a fetch is done. As a 
consequence, the reported URL is not necessarily the one whose fetch is in 
progress. It might be the URL that was fetched last, while the thread is now 
idle and waiting for the next fetch item to be ready. This happens if there are 
still fetch queues, but with long delays because of a robots.txt Crawl-delay or 
a longer delay because of the exponential back-off.

FetcherThread should reset the {{reprUrl}} once a fetch is finished. Idle 
FetcherThreads shouldn't be reported as hanging.

  was:
If there is no URL fetched during half of the MapReduce task timeout, Fetcher 
is shutting down to avoid that the fetcher map task fails because of missing 
progress. Before the shut-down Fetcher reports the remaining FetcherThreads as 
"hung threads" together with the fetched URL. This should allow to debug the 
URLs / pages causing timeouts. For the reporting the field {{reprUrl}} of 
FetcherThread is used. However, the field is not reset after a fetch is done. 
In consequence, the reported URL is not necessarily one where the fetch is in 
process. It might be the URL that was fetched last, but the thread is now idle 
and waiting for the next fetch item to be ready. This happens if there are 
still fetch queues, but with long delays because of a robots.txt Crawl-delay or 
a longer delay because of the exponential back-off.

FetcherThread should reset the {{reprUrl}} once a fetch is finished. Idle 
FetcherThread shouldn't be reported as hanging.


> Fetcher to report idle threads not as hung threads
> --------------------------------------------------
>
>                 Key: NUTCH-3177
>                 URL: https://issues.apache.org/jira/browse/NUTCH-3177
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 1.22
>            Reporter: Sebastian Nagel
>            Priority: Minor
>             Fix For: 1.23
>
>
> If no URL is fetched during half of the MapReduce task timeout, Fetcher 
> shuts down to prevent the fetcher map task from failing for lack of 
> progress. Before shutting down, Fetcher reports the remaining FetcherThreads 
> as "hung threads" together with the URL being fetched. This should make it 
> possible to debug the URLs / pages causing timeouts. The report uses the 
> field {{reprUrl}} of FetcherThread. However, the field is not reset after a 
> fetch is done. As a consequence, the reported URL is not necessarily the one 
> whose fetch is in progress. It might be the URL that was fetched last, while 
> the thread is now idle and waiting for the next fetch item to be ready. This 
> happens if there are still fetch queues, but with long delays because of a 
> robots.txt Crawl-delay or a longer delay because of the exponential back-off.
> FetcherThread should reset the {{reprUrl}} once a fetch is finished. Idle 
> FetcherThreads shouldn't be reported as hanging.
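
The proposed behaviour can be illustrated with a minimal, self-contained Java 
sketch. It is not the Nutch code: the class HungThreadReportSketch, the nested 
WorkerSketch, and the methods beginFetch, fetchOne and reportHung are 
hypothetical names chosen for illustration, and only the field name reprUrl is 
taken from the issue. The sketch assumes that a null reprUrl marks a worker as 
idle, so that only workers with a fetch in progress are reported.

```java
import java.util.ArrayList;
import java.util.List;

public class HungThreadReportSketch {

  /** Simplified stand-in for a fetcher thread (hypothetical, not the Nutch class). */
  public static class WorkerSketch {
    // URL of the fetch currently in progress; null means the worker is idle.
    private volatile String reprUrl;

    /** Marks the start of a fetch without finishing it (simulates a hanging fetch). */
    public void beginFetch(String url) {
      reprUrl = url;
    }

    /** Runs one fetch and always clears reprUrl afterwards, even on failure. */
    public void fetchOne(String url, Runnable fetchAction) {
      reprUrl = url;
      try {
        fetchAction.run();
      } finally {
        reprUrl = null; // reset once the fetch is finished
      }
    }

    public String getReprUrl() {
      return reprUrl;
    }
  }

  /** Returns the URLs of workers that still have a fetch in progress. */
  public static List<String> reportHung(List<WorkerSketch> workers) {
    List<String> hung = new ArrayList<>();
    for (WorkerSketch worker : workers) {
      String url = worker.getReprUrl();
      if (url != null) { // idle workers are skipped
        hung.add(url);
      }
    }
    return hung;
  }

  public static void main(String[] args) {
    WorkerSketch busy = new WorkerSketch();
    WorkerSketch idle = new WorkerSketch();
    busy.beginFetch("https://example.org/slow");         // fetch never completes
    idle.fetchOne("https://example.org/done", () -> { }); // fetch completes
    for (String url : reportHung(List.of(busy, idle))) {
      System.out.println("hung thread, url=" + url);
    }
  }
}
```

Clearing the field in a finally block is one possible design choice: it keeps 
the reset in place when the fetch action throws, so a failed fetch does not 
leave a stale URL behind.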



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
