[ 
https://issues.apache.org/jira/browse/HADOOP-1462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12504806
 ] 

Devaraj Das commented on HADOOP-1462:
-------------------------------------

+1 (though I am not sure whether we have to eliminate the findbugs warnings)

> Better progress reporting from a Task
> -------------------------------------
>
>                 Key: HADOOP-1462
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1462
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Vivek Ratan
>            Assignee: Vivek Ratan
>             Fix For: 0.14.0
>
>         Attachments: 1462.patch, 1462_02.patch, 1462_03.patch
>
>
> The Task code that reports progress updates has the following problems:
> 1. Some RPC calls are blocking. For example, in MapRunner::run(), the call to 
> RecordReader::next() can result in a blocking RPC call to the Task Tracker 
> (TT) to report progress. 
> 2. Some RPC calls are unnecessary. The Ping thread pings the TT once every 
> second, while we also independently send progress updates every second. We 
> don't, for example, need to ping the TT right after we send the progress 
> update. 
> 3. In some places, we spawn a thread to send progress updates (in 
> MapOutputBuffer::collect(), for example). If our code gets stuck, the thread 
> will continue sending updates to the TT and we will never be shut down. 
> These issues, in some form or another, have been reported in HADOOP-1201 and 
> HADOOP-1431. 
> I propose we make the following changes: 
> 1. In order to make the RPC calls non-blocking, we need a thread that calls 
> TT. This thread, to be created early on, will make sure we make the most 
> appropriate RPCs. It will have access to two flags: a progress flag that 
> indicates that the Task has made progress since the last RPC, and a 
> keep_alive flag that indicates that we need to let the TT know that we're 
> alive. This thread will also handle pings. It's logic will be something like 
> this: 
> while (1) {
>       if (progress_flag is set) {
>               // report progress update
>               umbilical::progress(...);
>               if (failure), kill task;
>               reset progress_flag; 
>               reset keep_alive_flag; // calling progress() also indicates 
> that we're alive
>       }
>       else if (keep_alive_flag is set) {
>               // let TT know we're alive
>               umbilical::progress(same params as last time);
>               if (failure), kill task;
>               reset keep_alive_flag;
>               break;
>       }
>       else {
>               // see if TT is alive
>               umbilical::ping();
>               if (failure), kill task;
>       }
>       sleep (1 sec);
> }
> 2. progress_flag and keep_alive_flag are set by the MapReduce code. 
> Reporter::progress() (in Task.java) sets keep_alive_flag while progress_flag 
> is set whenever Task's taskProgress object has any of its fields changed. 
> 3. We do away with Task::reportProgress() as this code is now handled in the 
> Progress thread. Wherever this method is called in our MapReduce kernel code, 
> we should replace it either with Reporter::progress() (if the intent was to 
> let TT know that we're alive) or we simply remove that call (if the intent 
> was to transmit progress changes to the TT). 
> 4. TaskUmbilicalProtocol::progress() should return a boolean, and should 
> return the same values that TaskUmbilicalProtocol::ping() does. This will let 
> the Task know whether its ID is known to the TT. 
> 5. We no longer need to create a ping thread in TaskTracker::Child. However, 
> we can perhaps create the Progress thread in the same place the Ping thread 
> was created. 
> 6. We will need to remove code that creates progress threads. This is in 
> MapTask::MapOutputBuffer::collect(), MapTask::MapOutputBuffer::flush(), and 
> ReduceTask::ReduceCopier(), at the least. Instead, we will need to add code 
> that updates the progress or calls Reporter::progress(). Any of these calls 
> simply update flags. so there's not a lot of performance penalty (at worst, 
> updating progress_flag or keep_alive_flag may need to be done within a 
> synchronized block, but even that may not be necessary since the flags are 
> just boolean values). As per HADOOP-1431, these calls can be made through a 
> ReportingComparator, or from within the generic BuferSorter, or perhaps from 
> some place better. 
> I may have missed out on some details, but hopefully the overall idea is 
> clear. Comments welcome. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to