[
https://issues.apache.org/jira/browse/MAPREDUCE-6771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15447008#comment-15447008
]
Haibo Chen commented on MAPREDUCE-6771:
---------------------------------------
Thanks [~jlowe] for bringing up the case in MAPREDUCE-4955. Looking at that
jira, it is indeed another case where diagnostics could be lost.
bq. The AM would either need to postpone recording the attempt completion event
until it receives the container completion event to see if there are any
diagnostics or there needs to be a way to record postmortem diagnostics for
attempts in the jhist file.
The diagnostics are included as part of a
TaskAttemptUnsuccessfulCompletionEvent, so my understanding is that there
should ideally be only one such event per attempt in the jhist file (if we emit
multiple instances, JobHistoryParser will always take the last instance seen in
the .jhist file). Please correct me if I am wrong. Therefore, I am thinking of
postponing the recording of the unsuccessful completion event.
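To illustrate the "last instance wins" behavior I am describing, here is a minimal sketch. It assumes my understanding of the parser is right, and it is not the real JobHistoryParser: the class LastEventWinsSketch, the replay method and the two-element record layout are all made up for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the parser behavior described above;
// this is NOT the real JobHistoryParser.
final class LastEventWinsSketch {

    // Each record is {attemptId, diagnostics}, in the order it appears in the file.
    static Map<String, String> replay(List<String[]> records) {
        Map<String, String> diagnosticsByAttempt = new LinkedHashMap<>();
        for (String[] record : records) {
            // A later record for the same attempt overwrites the earlier one.
            diagnosticsByAttempt.put(record[0], record[1]);
        }
        return diagnosticsByAttempt;
    }

    public static void main(String[] args) {
        List<String[]> records = Arrays.asList(
            new String[] {"attempt_0", ""},                        // emitted early, no diagnostics yet
            new String[] {"attempt_0", "Container killed by NM"}); // emitted later
        System.out.println(replay(records).get("attempt_0"));
    }
}
```

If that is how the parser behaves, emitting a second event would not lose anything, but a single event written at the right time keeps the file unambiguous.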
bq. postpone recording the attempt completion event until it receives the
container completion event to see if there are any diagnostics
TaskAttemptUnsuccessfulCompletionEvent is generated upon receipt of TA_KILL,
TA_TOO_MANY_FETCH_FAILURE and TA_FAILMSG. Postponing the event emission until a
container completion event is received makes the handling of TA_FAILMSG
semantically inconsistent with that of the other cases. I wonder if it is
semantically cleaner to postpone the completion event until the transition into
the final states (FAILED, KILLED). The
TaskAttemptUnsuccessfulCompletionEvent is currently emitted before the
transition into the FAIL_FINISHING_CONTAINER, FAILED or KILLED state, but given
that FAIL_FINISHING_CONTAINER will eventually transition into the FAILED state,
we could reduce the three cases to two (see the attachment, which shows the
transitions during which a TaskAttemptUnsuccessfulCompletionEvent is
generated). That is, right before a task attempt goes into KILLED or FAILED, a
TaskAttemptUnsuccessfulCompletionEvent is written into the .jhist file.
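A rough sketch of what I have in mind follows. It is a simplified model under my own assumptions, not the actual task attempt state machine: the class DeferredCompletionSketch, its methods and the in-memory history list standing in for the .jhist file are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of the proposal; NOT the real task attempt code.
final class DeferredCompletionSketch {
    enum State { RUNNING, FAIL_FINISHING_CONTAINER, FAILED, KILLED }

    private State state = State.RUNNING;
    private final List<String> diagnostics = new ArrayList<>();
    // Stand-in for the .jhist file: one string per unsuccessful-completion record.
    private final List<String> history = new ArrayList<>();

    // Diagnostics may arrive at any time before a final state is reached.
    void addDiagnostics(String message) {
        if (!isFinal()) {
            diagnostics.add(message);
        }
    }

    // The record is written in exactly one place: on entry into FAILED or KILLED.
    void transitionTo(State next) {
        if (isFinal()) {
            return; // final states are terminal, so at most one record is written
        }
        state = next;
        if (isFinal()) {
            history.add(state + ": " + String.join("; ", diagnostics));
        }
    }

    boolean isFinal() {
        return state == State.FAILED || state == State.KILLED;
    }

    List<String> history() {
        return history;
    }

    public static void main(String[] args) {
        DeferredCompletionSketch attempt = new DeferredCompletionSketch();
        attempt.transitionTo(State.FAIL_FINISHING_CONTAINER);       // e.g. after TA_FAILMSG
        attempt.addDiagnostics("Container killed by Node Manager"); // arrives late
        attempt.transitionTo(State.FAILED);
        System.out.println(attempt.history());
    }
}
```

The point of the sketch is that nothing is written while the attempt sits in FAIL_FINISHING_CONTAINER, so diagnostics that arrive with the container completion still make it into the single record.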
> Diagnostics information can be lost in .jhist if task containers are killed
> by Node Manager.
> --------------------------------------------------------------------------------------------
>
> Key: MAPREDUCE-6771
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6771
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: mrv2
> Affects Versions: 2.7.3
> Reporter: Haibo Chen
> Assignee: Haibo Chen
> Attachments: mapreduce6771.001.patch
>
>
> Task containers can go over their resource limit and be killed by the Node
> Manager. The MR AM then gets notified of the container status and diagnostics
> information through its heartbeat with the RM. However, it is possible that
> the diagnostics information never gets into the .jhist file, so when the job
> completes, the diagnostics information associated with the failed task
> attempts is empty. This makes it hard for users to find the root cause of job
> failures that are often caused by memory leaks.