[jira] [Updated] (MAPREDUCE-7053) Timed out tasks can fail to produce thread dump

Jason Lowe (JIRA) Wed, 14 Feb 2018 14:12:28 -0800

     [ https://issues.apache.org/jira/browse/MAPREDUCE-7053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Jason Lowe updated MAPREDUCE-7053:
----------------------------------
    Status: Patch Available  (was: Open)

Yeah, this is yet another latent bug that was exposed when the task attempt 
listener started rejecting status updates for tasks the AM no longer thinks 
are running.

As such, I'm proposing a fix where we do *not* immediately reject attempts 
that the AM thinks should not be running, but rather give them a grace 
period.  This patch adds the ability for the task heartbeat handler to track 
attempts that have unregistered recently.  It uses the same grace period for 
unregistered tasks that is currently used for tasks that have unregistered via 
the umbilical and are shutting down gracefully.  This keeps the AM from 
immediately rejecting a recently unregistered attempt, allowing that attempt to 
receive a stack dump signal and otherwise shut down cleanly by itself.  After 
the grace period expires, the AM will reject status updates from that attempt.
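
As a rough illustration of the grace-period idea only (this is not code from 
the patch; the class name, method names, and the use of plain string attempt 
IDs are all hypothetical), tracking recently unregistered attempts could look 
something like this:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical sketch: remembers when an attempt was unregistered so that
 * status updates arriving shortly afterwards are tolerated, not rejected.
 * None of these names come from the actual MAPREDUCE-7053 patch.
 */
public class RecentlyUnregisteredTracker {
    private final long gracePeriodMs;

    // attempt ID -> time (ms) at which the attempt was unregistered
    private final Map<String, Long> unregisteredAt = new ConcurrentHashMap<>();

    public RecentlyUnregisteredTracker(long gracePeriodMs) {
        this.gracePeriodMs = gracePeriodMs;
    }

    /** Record that the attempt was unregistered at the given time. */
    public void unregister(String attemptId, long nowMs) {
        unregisteredAt.put(attemptId, nowMs);
    }

    /**
     * Returns true while the attempt is still inside its grace period.
     * Unknown attempts, and attempts whose grace period has expired,
     * return false; expired entries are dropped from the map.
     */
    public boolean isWithinGracePeriod(String attemptId, long nowMs) {
        Long when = unregisteredAt.get(attemptId);
        if (when == null) {
            return false;
        }
        if (nowMs - when > gracePeriodMs) {
            unregisteredAt.remove(attemptId);
            return false;
        }
        return true;
    }
}
```

In this sketch a status update from an attempt would only be rejected once 
isWithinGracePeriod() returns false for it.  Passing the clock value in as a 
parameter is just a convenience to keep the example easy to exercise.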

> Timed out tasks can fail to produce thread dump
> -----------------------------------------------
>
>                 Key: MAPREDUCE-7053
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7053
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 3.1.0, 3.0.1, 2.10.0, 2.9.1, 2.8.4, 2.7.6
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>            Priority: Major
>         Attachments: MAPREDUCE-7053.001.patch
>
>
> TestMRJobs#testThreadDumpOnTaskTimeout has been failing sporadically 
> recently.  When the AM times out a task, it immediately removes it from the 
> list of known tasks and then connects to the NM to request a thread dump 
> followed by a kill.  If the task heartbeats in after it has been removed 
> from the list of known tasks but before the thread dump signal arrives, 
> then the task can exit with a "org.apache.hadoop.mapred.Task: Parent 
> died." message and no thread dump.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org
