[ 
https://issues.apache.org/jira/browse/NUTCH-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067686#comment-13067686
 ] 

Markus Jelsma commented on NUTCH-1057:
--------------------------------------

No. This is a tuning option for users who experience very long pauses in the 
merge phase after a map finishes. The merge takes long because there are many 
GBs of map output, slow IO, or both.

To prevent the task tracker from killing the merge (default 600s time out), 
users need to raise the mapred.task.timeout value above the actual duration 
of the merge phase. 

Fetcher threads have a time out that is configured to be half the tasktracker 
time out value. This means that with a high (e.g. 20m) task timeout, the 
fetcher will wait 10m before killing hanging threads. This is a waste of time. 
In a large crawl there are always a few threads unable to finish properly. 
Killing them sooner makes the merge begin earlier.
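
As a rough sketch of the arithmetic only (this is not the actual Fetcher 
code; the class and method names below are made up for illustration, and the 
divisor of 8 is just an example value):

```java
public class FetcherTimeoutSketch {

    // Hadoop's default task timeout of 600 s, expressed in milliseconds.
    static final long DEFAULT_TASK_TIMEOUT_MS = 600_000L;

    /**
     * Hypothetical helper: the fetcher thread time out is the task
     * time out divided by a configurable divisor (2 gives "half").
     */
    static long threadTimeoutMs(long taskTimeoutMs, int divisor) {
        if (divisor <= 0) {
            throw new IllegalArgumentException("divisor must be positive");
        }
        return taskTimeoutMs / divisor;
    }

    public static void main(String[] args) {
        // Task timeout raised to 20 minutes to survive a long merge.
        long taskTimeoutMs = 20L * 60L * 1000L;

        // Divisor 2: hanging threads are killed after 10 minutes.
        System.out.println(threadTimeoutMs(taskTimeoutMs, 2)); // 600000

        // Divisor 8 (example value): hanging threads are killed after 2.5 minutes.
        System.out.println(threadTimeoutMs(taskTimeoutMs, 8)); // 150000
    }
}
```

With a larger divisor the task timeout can stay high for the merge while 
hanging threads are still killed quickly.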

Sorry if I was unclear before. 

> Make fetcher thread time out configurable
> -----------------------------------------
>
>                 Key: NUTCH-1057
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1057
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.4, 2.0
>
>         Attachments: NUTCH-1057-1.4-1.patch
>
>
> The fetcher sets a time out value based on half the mapred.task.timeout 
> value. This is not a proper value for all cases. Add an option 
> (fetcher.thread.timeout.divisor) to configure the divisor used and default it 
> to two.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
