[
https://issues.apache.org/jira/browse/NUTCH-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067686#comment-13067686
]
Markus Jelsma commented on NUTCH-1057:
--------------------------------------
No. This is a tuning option for users who experience very long pauses in the
merge phase after a map finishes. The merge takes long because there are many
GBs of map output and/or slow IO.
To prevent the task tracker from killing the merge (default time out: 600s),
users need to raise mapred.task.timeout to a value higher than the actual
duration of the merge phase.
Fetcher threads have a time out that is set to half the task tracker
time out value. This means that with a high (e.g. 20m) task time out, the
fetcher will wait 10m before killing hanging threads. This is a waste of time.
In a large crawl there are always a few threads unable to finish properly.
Killing them sooner makes the merge begin earlier.
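
The arithmetic can be sketched roughly as follows. This is only an
illustration, not the code from the attached patch: the class name, the
method name and the plain Map standing in for the job configuration are all
made up for the example. The property names are the ones used in this issue,
and the defaults (600000 ms task time out, divisor of two) follow the values
mentioned here.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-alone sketch; not the actual Nutch Fetcher code.
class FetcherTimeoutSketch {
    // Property names as written in this issue.
    static final String TASK_TIMEOUT_KEY = "mapred.task.timeout";
    static final String DIVISOR_KEY = "fetcher.thread.timeout.divisor";

    // Returns the fetcher thread time out in milliseconds:
    // task time out divided by the configured divisor.
    static long threadTimeoutMillis(Map<String, String> conf) {
        // Assumed defaults: 600s task time out, divisor of two.
        long taskTimeout =
            Long.parseLong(conf.getOrDefault(TASK_TIMEOUT_KEY, "600000"));
        int divisor = Integer.parseInt(conf.getOrDefault(DIVISOR_KEY, "2"));
        if (divisor < 1) {
            throw new IllegalArgumentException("divisor must be >= 1");
        }
        return taskTimeout / divisor;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();

        // 20m task time out, default divisor: threads are killed after 10m.
        conf.put(TASK_TIMEOUT_KEY, "1200000");
        System.out.println(threadTimeoutMillis(conf));

        // Same task time out, divisor of ten: threads are killed after 2m.
        conf.put(DIVISOR_KEY, "10");
        System.out.println(threadTimeoutMillis(conf));
    }
}
```

With a larger divisor the task time out can stay high enough for the merge
while hanging fetcher threads are still cleaned up quickly.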
Sorry if I was unclear before.
> Make fetcher thread time out configurable
> -----------------------------------------
>
> Key: NUTCH-1057
> URL: https://issues.apache.org/jira/browse/NUTCH-1057
> Project: Nutch
> Issue Type: Improvement
> Components: fetcher
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Fix For: 1.4, 2.0
>
> Attachments: NUTCH-1057-1.4-1.patch
>
>
> The fetcher sets a time out value based on half the mapred.task.timeout
> value. This is not a proper value for all cases. Add an option
> (fetcher.thread.timeout.divisor) to configure the divisor used and default it
> to two.
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira