[jira] Commented: (HADOOP-1431) Map tasks can't timeout for failing to call progress

Devaraj Das (JIRA) Fri, 25 May 2007 09:42:46 -0700

    [ https://issues.apache.org/jira/browse/HADOOP-1431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12499152 ]


Devaraj Das commented on HADOOP-1431:
-------------------------------------

The main requirement in this issue is that sort must be able to report 
progress. From the architecture point of view, I think it makes sense for at 
least the MapReduce kernel part of sort, i.e., the generic BufferSorter, to be 
aware of that. 
My major objection to this patch is that it short-circuits the layering, which 
makes it look hacky IMO. I would much rather do it the following way:
1) Add a method to the BufferSorter interface called setReporter(Reporter).
2) Implementors of the interface, here the BasicTypeSorterBase, would 
implement the method by simply storing the reporter object. This is similar to 
the BufferSorter.setInputBuffer method.
3) The BasicTypeSorterBase would periodically invoke reporter.progress() to 
report progress. The compare method in the BasicTypeSorterBase class is a 
potential place where reporter.progress() can be called.
This way, the sort library (currently the MergeSorter and MergeSort classes) 
stays unaware of the Reporter object, and everything that deals with it lives 
in the MapReduce kernel. This preserves the boundaries that I originally 
intended to have between the various layers (HADOOP-331).
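
To make the three steps concrete, here is a minimal sketch of the idea, not the 
actual patch. Everything in it is a simplified, hypothetical stand-in: the 
Reporter and BufferSorter interfaces are cut down to the one or two methods 
that matter, ReportingSorter plays the role of BasicTypeSorterBase, 
java.util.Arrays.sort plays the role of the sort library, and the interval of 
1000 comparisons is an arbitrary assumption.

```java
import java.util.Arrays;
import java.util.Comparator;

class SortSketch {
    /** Hypothetical stand-in for the MapReduce Reporter; only progress() matters here. */
    interface Reporter {
        void progress();
    }

    /** Hypothetical, trimmed-down BufferSorter with the proposed setReporter method. */
    interface BufferSorter {
        void setReporter(Reporter reporter); // step 1: new method on the interface
        int[] sort(int[] keys);
    }

    /** Plays the role of BasicTypeSorterBase in this sketch. */
    static class ReportingSorter implements BufferSorter, Comparator<Integer> {
        // Assumed interval; the comment above does not specify one.
        private static final int PROGRESS_INTERVAL = 1000;
        private Reporter reporter;
        private long comparisons;

        @Override
        public void setReporter(Reporter reporter) {
            this.reporter = reporter; // step 2: just store the reporter
        }

        @Override
        public int compare(Integer a, Integer b) {
            // step 3: report progress periodically from the compare method
            if (reporter != null && ++comparisons % PROGRESS_INTERVAL == 0) {
                reporter.progress();
            }
            return Integer.compare(a, b);
        }

        @Override
        public int[] sort(int[] keys) {
            Integer[] boxed = Arrays.stream(keys).boxed().toArray(Integer[]::new);
            // The sort library only sees a Comparator and never learns about Reporter.
            Arrays.sort(boxed, this);
            return Arrays.stream(boxed).mapToInt(Integer::intValue).toArray();
        }
    }
}
```

Because progress is reported only from inside compare, a sort that stops 
comparing also stops reporting, so a stuck task can still time out.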

For the ReduceTask, we have threads for reporting progress for two phases:
1) during the shuffle (and here we implicitly do the progress reporting for the 
ramfs merges too)
2) during the merge of the on-disk files in the reduce phase
The thread for the first case is still there in the current patch. If we really 
want to eliminate the issue, we should ideally remove the thread for the 
shuffle as well, since the ramfs merge can also get stuck (user code is 
involved there). 

Similarly to BufferSorter, we could have an API for merge that takes a Reporter 
object and calls reporter.progress() periodically. ReduceTask as well as the 
final merge on the MapTask could use that for the merges. Again, the argument 
here is that we do expect merge to report progress to us, and hence we enable 
it to do so.
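
A merge API of that shape might look roughly like the sketch below. It is 
illustrative only: the MergeSketch class, its merge signature, the Segment 
helper, and the interval parameter are all hypothetical and do not come from 
the patch, and in-memory lists of integers stand in for the real on-disk or 
ramfs segments.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

class MergeSketch {
    /** Hypothetical stand-in for the MapReduce Reporter. */
    interface Reporter {
        void progress();
    }

    /** One sorted input run together with its current head record. */
    private static final class Segment {
        final Iterator<Integer> rest;
        int head;

        Segment(Iterator<Integer> it) {
            this.rest = it;
            this.head = it.next();
        }
    }

    /**
     * Merges already-sorted runs into one sorted list and calls
     * reporter.progress() after every `interval` records written.
     */
    static List<Integer> merge(List<List<Integer>> sortedRuns, Reporter reporter, int interval) {
        PriorityQueue<Segment> queue =
            new PriorityQueue<>((a, b) -> Integer.compare(a.head, b.head));
        for (List<Integer> run : sortedRuns) {
            if (!run.isEmpty()) {
                queue.add(new Segment(run.iterator()));
            }
        }
        List<Integer> out = new ArrayList<>();
        while (!queue.isEmpty()) {
            Segment smallest = queue.poll();
            out.add(smallest.head);
            // Progress is tied to records actually merged, not to wall-clock time.
            if (reporter != null && out.size() % interval == 0) {
                reporter.progress();
            }
            if (smallest.rest.hasNext()) {
                smallest.head = smallest.rest.next();
                queue.add(smallest);
            }
        }
        return out;
    }
}
```

Both the ReduceTask merges and the final merge on the MapTask could pass their 
own Reporter to such a method, which would make the separate progress threads 
unnecessary for the merge phases.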

> Map tasks can't timeout for failing to call progress
> ----------------------------------------------------
>
>                 Key: HADOOP-1431
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1431
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.13.0
>            Reporter: Owen O'Malley
>         Assigned To: Arun C Murthy
>             Fix For: 0.13.0
>
>         Attachments: HADOOP-1431_1_20070525.patch
>
>
> Currently the map task runner creates a thread that calls progress every 
> second to keep the system from killing the map if the sort takes too long. 
> This is the wrong approach, because it will cause stuck tasks to not be 
> killed. The right solution is to have the sort call progress as it actually 
> makes progress. This is part of what is going on in HADOOP-1374. A map gets 
> stuck at 100% progress, but not done.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
