[
https://issues.apache.org/jira/browse/HADOOP-1431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12499152
]
Devaraj Das commented on HADOOP-1431:
-------------------------------------
The main requirement in this issue is to let the sort report progress. From an
architecture point of view, I think it makes sense for at least the MapReduce
kernel part of the sort, i.e. the generic BufferSorter, to be aware of that.
My major objection to this patch is that it short-circuits the layering, which
makes it look hacky IMO. I would much rather do it the following way:
1) Add a method to the BufferSorter interface called setReporter(Reporter).
2) Implementors of the interface, in this case the BasicTypeSorterBase, would
implement the method and just store the reporter object. This is similar to
the BufferSorter.setInputBuffer method.
3) The BasicTypeSorterBase would periodically invoke reporter.progress() to
report progress. The compare method in the BasicTypeSorterBase class is a
potential place to call reporter.progress().
This way, we don't make the sort library (currently the MergeSorter and
MergeSort classes) aware of the Reporter object, and all progress reporting
stays in the MapReduce kernel. This preserves the boundaries that I originally
intended to have between the various layers (HADOOP-331).
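As a rough illustration of steps 1-3, here is a minimal, self-contained sketch.
It is not the actual Hadoop code: Reporter and BufferSorter below are cut-down
stand-ins, and ReportingSorterBase, IndexSorter and COMPARES_PER_REPORT are
hypothetical names and values chosen for the example. java.util.Arrays.sort
plays the role of the sort library, which never sees the Reporter.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.Random;

// Stand-in for the MapReduce Reporter; only progress() is modeled here.
interface Reporter {
    void progress();
}

// Cut-down, hypothetical view of the sorter interface with the proposed setter.
interface BufferSorter {
    void setReporter(Reporter reporter);
    int[] sort(); // returns indices into the key array, in sorted order
}

// Hypothetical base class: stores the reporter and pings it from compare().
abstract class ReportingSorterBase implements BufferSorter, Comparator<Integer> {
    // Assumed interval: report once every N comparisons rather than on each one.
    static final int COMPARES_PER_REPORT = 1000;

    protected final int[] keys;
    private Reporter reporter;
    private long compares;

    ReportingSorterBase(int[] keys) {
        this.keys = keys;
    }

    @Override
    public void setReporter(Reporter reporter) {
        this.reporter = reporter;
    }

    @Override
    public int compare(Integer i, Integer j) {
        // Progress is only reported while comparisons are actually happening,
        // so a sort that is stuck stops reporting.
        if (reporter != null && ++compares % COMPARES_PER_REPORT == 0) {
            reporter.progress();
        }
        return Integer.compare(keys[i.intValue()], keys[j.intValue()]);
    }
}

// Concrete sorter that hands an index array to a Reporter-unaware library sort.
final class IndexSorter extends ReportingSorterBase {
    IndexSorter(int[] keys) {
        super(keys);
    }

    @Override
    public int[] sort() {
        Integer[] pointers = new Integer[keys.length];
        for (int i = 0; i < pointers.length; i++) {
            pointers[i] = i;
        }
        Arrays.sort(pointers, this); // the library only sees a Comparator
        int[] result = new int[pointers.length];
        for (int i = 0; i < result.length; i++) {
            result[i] = pointers[i];
        }
        return result;
    }
}

final class SorterProgressDemo {
    public static void main(String[] args) {
        int[] keys = new Random(42).ints(100_000, 0, 1_000_000).toArray();
        long[] calls = {0};
        BufferSorter sorter = new IndexSorter(keys);
        sorter.setReporter(() -> calls[0]++);
        sorter.sort();
        System.out.println("progress() calls during sort: " + calls[0]);
    }
}
```

The point of the sketch is the layering: only the sorter base class knows about
the reporter, and the library sort is reached purely through the comparator.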
For the ReduceTask, we have threads for reporting progress for two phases:
1) during the shuffle (and here we implicitly do the progress reporting for the
ramfs merges too)
2) during the merge of the on-disk files in the reduce phase
The thread for the first case is still there in the current patch. If we
really want to fix the issue, we should ideally remove the shuffle thread as
well, since the ramfs merge can also get stuck (user code is involved there).
As with BufferSorter, we could have an API for merge that takes a Reporter
object and calls reporter.progress() periodically. ReduceTask, as well as the
final merge in the MapTask, could use that for the merges. Again, the argument
here is that we do expect merge to report progress to us, and hence we enable
it to do so.
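A merge API of that shape might look roughly like the sketch below. This is
only an illustration under stated assumptions, not an existing Hadoop API:
Reporter is a stand-in interface, and ReportingMerger, Segment and
RECORDS_PER_REPORT are hypothetical names and values. It merges in-memory
sorted lists for brevity, whereas the real merges work on ramfs or on-disk
files.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Stand-in for the MapReduce Reporter; only progress() is modeled here.
interface Reporter {
    void progress();
}

// Hypothetical k-way merge that reports progress as records are emitted.
final class ReportingMerger {
    // Assumed interval: report once every N merged records.
    static final int RECORDS_PER_REPORT = 1000;

    // One sorted input together with its current head record.
    private static final class Segment implements Comparable<Segment> {
        private final Iterator<Integer> rest;
        private int head;

        Segment(Iterator<Integer> records) {
            this.rest = records;
            this.head = records.next();
        }

        // Moves to the next record; returns false when the input is exhausted.
        boolean advance() {
            if (!rest.hasNext()) {
                return false;
            }
            head = rest.next();
            return true;
        }

        @Override
        public int compareTo(Segment other) {
            return Integer.compare(head, other.head);
        }
    }

    static List<Integer> merge(List<List<Integer>> sortedInputs, Reporter reporter) {
        PriorityQueue<Segment> queue = new PriorityQueue<>();
        for (List<Integer> input : sortedInputs) {
            if (!input.isEmpty()) {
                queue.add(new Segment(input.iterator()));
            }
        }
        List<Integer> merged = new ArrayList<>();
        while (!queue.isEmpty()) {
            Segment smallest = queue.poll();
            merged.add(smallest.head);
            // Progress is tied to records actually merged, so a stuck merge
            // stops reporting and the task can be timed out.
            if (merged.size() % RECORDS_PER_REPORT == 0) {
                reporter.progress();
            }
            if (smallest.advance()) {
                queue.add(smallest);
            }
        }
        return merged;
    }
}

final class MergeProgressDemo {
    public static void main(String[] args) {
        List<List<Integer>> inputs = new ArrayList<>();
        inputs.add(Arrays.asList(1, 4, 7));
        inputs.add(Arrays.asList(2, 5, 8));
        inputs.add(Arrays.asList(3, 6, 9));
        List<Integer> merged = ReportingMerger.merge(inputs, () -> { });
        System.out.println(merged);
    }
}
```

Both the ReduceTask merges and the final MapTask merge could call one such
entry point, which would remove the need for a separate progress thread.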
> Map tasks can't timeout for failing to call progress
> ----------------------------------------------------
>
> Key: HADOOP-1431
> URL: https://issues.apache.org/jira/browse/HADOOP-1431
> Project: Hadoop
> Issue Type: Bug
> Components: mapred
> Affects Versions: 0.13.0
> Reporter: Owen O'Malley
> Assigned To: Arun C Murthy
> Fix For: 0.13.0
>
> Attachments: HADOOP-1431_1_20070525.patch
>
>
> Currently the map task runner creates a thread that calls progress every
> second to keep the system from killing the map if the sort takes too long.
> This is the wrong approach, because it will cause stuck tasks to not be
> killed. The right solution is to have the sort call progress as it actually
> makes progress. This is part of what is going on in HADOOP-1374. A map gets
> stuck at 100% progress, but not done.