[ https://issues.apache.org/jira/browse/MAPREDUCE-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16247743#comment-16247743 ]
Jason Lowe commented on MAPREDUCE-5124:
---------------------------------------

bq. Are you sure we can't just replace the status updates?

Yes. For the counters, I was thinking of Tez, which only sends the counters every other status update or so. For MapReduce I think we're OK on the counters since they're sent on every heartbeat. However we're not OK when it comes to the failed fetch tasks. These are sent only once, and once the status report has been sent successfully they are cleared:
{code}
amFeedback = umbilical.statusUpdate(taskId, taskStatus);
taskFound = amFeedback.getTaskFound();
taskStatus.clearStatus();
{code}
and ReduceTaskStatus wipes out the fetchFailedTasks:
{code}
synchronized void clearStatus() {
  super.clearStatus();
  failedFetchTasks.clear();
}
{code}

bq. If you think about it, we send updates every 3 seconds anyway - so if it's a problem, then it would appear on the client side, too (that is, losing data).

As I mentioned above, Tez only sends them every so often and the AM tracks the last one received. If that were to happen here and we were to clobber the counters on a previous pending status with null counters from the current status, then we would drop that update. The listener should eventually receive a subsequent status update with counters that corrects the problem, but in the interim the counters would be inaccurate, primarily due to mishandling on the listener side that can be corrected.

As for the fetch failures, these are one-time trigger events that will never be resent. Looking at how TaskAttemptImpl and JobImpl interpret them, they do not expect these to be a cumulative list, since otherwise they would end up repeatedly blaming maps for fetch failures on every status update and would be unable to distinguish a reducer making a new complaint about a map from a repeat of an old complaint. So we need to coalesce the fetch failures between pending status updates or they could be dropped.
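The coalescing rule above can be sketched as follows. This is a minimal, hypothetical model of the status the listener would buffer — the class and field names are illustrative, not the actual MR types — showing that null counters must not clobber earlier counters and that one-shot fetch failures must be unioned, never replaced:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical, simplified stand-in for a buffered pending status;
// only the fields relevant to coalescing are modeled here.
class PendingStatus {
    List<String> counters;                                // null if this update carried no counters
    Set<String> failedFetchTasks = new LinkedHashSet<>(); // one-shot events: must be unioned

    // Merge a newer status into this buffered one. Null counters do not
    // overwrite counters from an earlier update, and fetch failures are
    // accumulated because the task will never resend them.
    void coalesce(PendingStatus newer) {
        if (newer.counters != null) {
            counters = newer.counters;
        }
        failedFetchTasks.addAll(newer.failedFetchTasks);
    }
}

public class CoalesceDemo {
    public static void main(String[] args) {
        PendingStatus pending = new PendingStatus();
        pending.failedFetchTasks.add("attempt_m_000001");
        pending.counters = new ArrayList<>(List.of("GC_TIME_MILLIS=12"));

        PendingStatus next = new PendingStatus();
        next.failedFetchTasks.add("attempt_m_000002");
        next.counters = null; // Tez-style update that carried no counters

        pending.coalesce(next);
        // Both fetch failures survive; the earlier counters are retained.
        System.out.println(pending.failedFetchTasks + " " + pending.counters);
    }
}
```

A replace-only merge would instead drop `attempt_m_000001` and the counters here, which is exactly the loss described above.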
Either that, or we need to move the handling of the failures reported in the status from the task attempt to the task listener.

> AM lacks flow control for task events
> -------------------------------------
>
>                 Key: MAPREDUCE-5124
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5124
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am
>    Affects Versions: 2.0.3-alpha, 0.23.5
>            Reporter: Jason Lowe
>            Assignee: Peter Bacsko
>         Attachments: MAPREDUCE-5124-CoalescingPOC-1.patch, MAPREDUCE-5124-CoalescingPOC2.patch, MAPREDUCE-5124-proto.2.txt, MAPREDUCE-5124-prototype.txt
>
> The AM does not have any flow control to limit the incoming rate of events from tasks. If the AM is unable to keep pace with the rate of incoming events for a sufficient period of time then it will eventually exhaust the heap and crash. MAPREDUCE-5043 addressed a major bottleneck for event processing, but the AM could still get behind if it's starved for CPU and/or handling a very large job with tens of thousands of active tasks.

--
This message was sent by Atlassian JIRA (v6.4.14#64029)