[
https://issues.apache.org/jira/browse/MAPREDUCE-6771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15452936#comment-15452936
]
Haibo Chen commented on MAPREDUCE-6771:
---------------------------------------
bq. Note that we aren't stuck with TaskAttemptUnsuccessfulCompletion event for
doing diagnostics.
Agreed. My guess is that diagnostics are included in
TaskAttemptUnsuccessfulCompletionEvent because users only want to see
diagnostics when task attempts fail. Parsing a new event and ignoring such
events for successful task attempts would require additional changes.
bq. but waiting for a container completion event is not something the state
machine does today.
There is no need to wait for a container completion event. My proposal is to
wait for the transition into the FAILED state. As long as the task attempt goes
into the FAILED state, which does not necessarily need to be triggered by a
container completion event (a timeout, TA_TIMED_OUT, is already built into the
transitions from FAIL_FINISHING_CONTAINER to FAILED), the diagnostics will be
written into the jhist file. But your point that a wide window is susceptible
to an AM crash is still very convincing.
Given that there is no clearly preferable approach to address the case in
MAPREDUCE-4955, do you think I can go ahead and address the issue in this jira?
The symptom of MAPREDUCE-4955 and this one is the same, but the cause is not
quite the same. The case in MAPREDUCE-4955 happens when the AM thinks the task
attempt is already dead, that is, the diagnostics come in after a
taskUnsuccessfulCompletionEvent has already been generated, whereas the case in
this jira happens when the diagnostics come in while the task attempt is still
in the running state, that is, before a taskUnsuccessfulCompletionEvent is
generated. The case in this jira is easy to fix, and we can keep MAPREDUCE-4955
open to address the other case once we decide what to do.
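To make the two cases concrete, here is a minimal sketch in Java. It is a toy
model, not the real task attempt state machine and not the attached patch. Only
FAILED, FAIL_FINISHING_CONTAINER and TA_TIMED_OUT are taken from the discussion
above; the class DiagnosticsSketch, the other event names as used here, the
historyRecord field and the message strings are assumptions for illustration.
Diagnostics that arrive before FAILED are buffered and written on the
transition (the case in this jira); diagnostics that arrive afterwards are
still dropped (the MAPREDUCE-4955 case).

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model only; hypothetical names, not the real MR AM classes. */
final class DiagnosticsSketch {
    enum State { RUNNING, FAIL_FINISHING_CONTAINER, FAILED }

    enum Event { TA_DIAGNOSTICS_UPDATE, TA_FAILMSG, TA_CONTAINER_COMPLETED, TA_TIMED_OUT }

    static final class Attempt {
        State state = State.RUNNING;
        final List<String> diagnostics = new ArrayList<>();
        // Stand-in for the .jhist entry; stays null until the attempt is FAILED.
        String historyRecord = null;

        void handle(Event event, String message) {
            switch (event) {
                case TA_DIAGNOSTICS_UPDATE:
                    // Buffer diagnostics until FAILED; later ones are dropped
                    // in this sketch (the MAPREDUCE-4955 case).
                    if (state != State.FAILED) {
                        diagnostics.add(message);
                    }
                    break;
                case TA_FAILMSG:
                    if (state == State.RUNNING) {
                        state = State.FAIL_FINISHING_CONTAINER;
                    }
                    break;
                case TA_CONTAINER_COMPLETED:
                case TA_TIMED_OUT:
                    // Either event ends the wait, so a missing container
                    // completion event does not block the history record.
                    if (state == State.FAIL_FINISHING_CONTAINER) {
                        state = State.FAILED;
                        historyRecord = String.join("; ", diagnostics);
                    }
                    break;
            }
        }
    }

    public static void main(String[] args) {
        Attempt attempt = new Attempt();
        // Made-up diagnostic text, arriving while the attempt is still running.
        attempt.handle(Event.TA_DIAGNOSTICS_UPDATE, "container exceeded memory limit");
        attempt.handle(Event.TA_FAILMSG, null);
        attempt.handle(Event.TA_TIMED_OUT, null);
        // Arrives after FAILED, so it is not recorded in this sketch.
        attempt.handle(Event.TA_DIAGNOSTICS_UPDATE, "late message");
        System.out.println(attempt.state + " -> " + attempt.historyRecord);
        // prints "FAILED -> container exceeded memory limit"
    }
}
```

The sketch also shows the trade-off raised earlier: nothing is recorded between
the first diagnostics update and the FAILED transition, so a crash in that
window would lose the buffered messages.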
> Diagnostics information can be lost in .jhist if task containers are killed
> by Node Manager.
> --------------------------------------------------------------------------------------------
>
> Key: MAPREDUCE-6771
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6771
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: mrv2
> Affects Versions: 2.7.3
> Reporter: Haibo Chen
> Assignee: Haibo Chen
> Attachments: TaUnsuccessfullyEventEmission.jpg,
> mapreduce6771.001.patch
>
>
> Task containers can go over their resource limit and be killed by the Node
> Manager. The MR AM then gets notified of the container status and diagnostics
> information through its heartbeat with the RM. However, it is possible that
> the diagnostics information never gets into the .jhist file, so when the job
> completes, the diagnostics information associated with the failed task
> attempts is empty. This makes it hard for users to find the root cause of job
> failures, which are often caused by memory leaks.