[ 
https://issues.apache.org/jira/browse/HADOOP-1158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12501870
 ] 

Arun C Murthy edited comment on HADOOP-1158 at 6/6/07 8:27 PM:
---------------------------------------------------------------

bq.b) Given sufficient fetch-failures per-map (say 3 or 4), the reducer then 
complains to the JobTracker via a new rpc:

I take that back: I propose we augment {{TaskStatus}} itself to let the 
JobTracker know about the failed fetches, i.e. the map task-ids. 

Alternatively, we could just add a new RPC to {{TaskUmbilicalProtocol}} for the 
reduce-task to let the TaskTracker know about the failed fetch: 
{code:title=TaskUmbilical.java}
void fetchError(String taskId, String failedFetchMapTaskId);
{code}

Even better, though a tad more involved, is to rework 
{code:title=TaskUmbilical.java}
  void progress(String taskid, float progress, String state, 
                            TaskStatus.Phase phase, Counters counters)
   throws IOException, InterruptedException;
{code}
as
{code:title=TaskUmbilical.java}
  void progress(String taskid, TaskStatus taskStatus)
   throws IOException, InterruptedException;
{code}

This simplifies the flow: the child-vm itself computes its {{TaskStatus}} 
(which will be augmented to contain the failed-fetch map-ids) and sends it 
along to the {{TaskTracker}}, which just forwards it to the {{JobTracker}}, 
thereby relieving the {{TaskTracker}} of some of the responsibilities 
vis-a-vis computing the {{TaskStatus}}. Clearly this could be linked to the 
reporting re-design at HADOOP-1462 ...
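To make the idea concrete, an augmented {{TaskStatus}} might carry the failed-fetch map task-ids roughly as sketched below. This is only an illustration: the field and method names here are my assumptions, not the actual Hadoop classes.

{code:title=TaskStatus.java}
// Sketch only -- names are assumptions for illustration, not the real API.
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class TaskStatus {
  private final String taskId;
  private float progress;
  // Map task-ids whose output this reduce-task failed to fetch.
  private final List<String> failedFetchMapIds = new ArrayList<String>();

  TaskStatus(String taskId) {
    this.taskId = taskId;
  }

  void setProgress(float progress) {
    this.progress = progress;
  }

  // The reduce child-vm records each failed fetch here; the TaskTracker
  // forwards the whole status, unmodified, to the JobTracker.
  void addFailedFetch(String mapTaskId) {
    failedFetchMapIds.add(mapTaskId);
  }

  List<String> getFailedFetchMapIds() {
    return Collections.unmodifiableList(failedFetchMapIds);
  }
}
{code}

The JobTracker could then count failed-fetch reports per map task-id and re-execute a map once a threshold (say 3 or 4) is crossed, with no extra RPC beyond the existing status updates.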

Thoughts?


 was:
bq.b) Given sufficient fetch-failures per-map (say 3 or 4), the reducer then 
complains to the JobTracker via a new rpc:

I take that back: I propose we augment TaskStatus itself to let the 
JobTracker know about the failed fetches, i.e. the map task-ids. We could just 
add a new RPC to TaskUmbilicalProtocol for the reduce-task to let the 
TaskTracker know about the failed fetch.

> JobTracker should collect statistics of failed map output fetches, and take 
> decisions to reexecute map tasks and/or restart the (possibly faulty) Jetty 
> server on the TaskTracker
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1158
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1158
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.12.2
>            Reporter: Devaraj Das
>            Assignee: Arun C Murthy
>
> The JobTracker should keep a track (with feedback from Reducers) of how many 
> times a fetch for a particular map output failed. If this exceeds a certain 
> threshold, then that map should be declared as lost, and should be reexecuted 
> elsewhere. Based on the number of such complaints from Reducers, the 
> JobTracker can blacklist the TaskTracker. This will make the framework 
> reliable - it will take care of (faulty) TaskTrackers that sometimes always 
> fail to serve up map outputs (for which exceptions are not properly 
> raised/handled, for e.g., if the exception/problem happens in the Jetty 
> server).
