[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated MAPREDUCE-5982:
----------------------------------
    Status: Open  (was: Patch Available)

Thanks for the patch, Chang!

Note that the point of this change is to let users locate any potential logs 
for task attempts that failed in the ASSIGNED state.  With a canned fake 
started event there's no way to determine which nodemanager tried to run the 
container, so we can't provide a good logs link.  We need to preserve as much 
information as we can about the task, and that includes the host, http port, 
etc.

The good news is that we have most of this information from the container that 
was assigned to the task attempt.  See the code for LaunchedContainerTransition 
for details.  It would be nice to see some of the code in that transition 
factored out so it can be reused when we are creating the start event for an 
attempt that failed in the ASSIGNED state.  Also I would hesitate to call it a 
fake event.  It's still a task started event, but we are missing just a few key 
components like the shuffle port and the start time.  If we factor out the code 
from LaunchedContainerTransition then we can drop the "fake" part.
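
To illustrate the factoring suggested above, here is a minimal standalone 
sketch.  It is not the actual MR AM code: AssignedContainer, StartedEvent, 
buildStartedEvent and UNKNOWN_SHUFFLE_PORT are hypothetical names invented for 
the example, and the real LaunchedContainerTransition does more than this.  The 
only point is that both paths share one builder, so the ASSIGNED-failure path 
keeps the host and http port from the assigned container.

```java
// Hypothetical stand-ins for illustration only; not the real MR AM classes.
final class StartEventSketch {
    // Assumed placeholder for a shuffle port that was never learned.
    static final int UNKNOWN_SHUFFLE_PORT = -1;

    /** What the container assigned to the attempt already tells us. */
    static final class AssignedContainer {
        final String containerId;
        final String nodeHost;
        final int nodeHttpPort;

        AssignedContainer(String containerId, String nodeHost, int nodeHttpPort) {
            this.containerId = containerId;
            this.nodeHost = nodeHost;
            this.nodeHttpPort = nodeHttpPort;
        }
    }

    /** Simplified "task attempt started" event. */
    static final class StartedEvent {
        final String containerId;
        final String host;
        final int httpPort;
        final int shufflePort;
        final long startTime;

        StartedEvent(String containerId, String host, int httpPort,
                     int shufflePort, long startTime) {
            this.containerId = containerId;
            this.host = host;
            this.httpPort = httpPort;
            this.shufflePort = shufflePort;
            this.startTime = startTime;
        }
    }

    /** Shared code factored out so both paths build the event the same way. */
    static StartedEvent buildStartedEvent(AssignedContainer c, int shufflePort,
                                          long startTime) {
        return new StartedEvent(c.containerId, c.nodeHost, c.nodeHttpPort,
                                shufflePort, startTime);
    }

    /** Normal path: the container launched and reported its shuffle port. */
    static StartedEvent onContainerLaunched(AssignedContainer c, int shufflePort,
                                            long now) {
        return buildStartedEvent(c, shufflePort, now);
    }

    /** Attempt failed while ASSIGNED: no shuffle port, but the host is known. */
    static StartedEvent onFailedWhileAssigned(AssignedContainer c, long now) {
        return buildStartedEvent(c, UNKNOWN_SHUFFLE_PORT, now);
    }

    public static void main(String[] args) {
        // Example values only.
        AssignedContainer c = new AssignedContainer("container_1", "nm-host", 8042);
        StartedEvent e = onFailedWhileAssigned(c, System.currentTimeMillis());
        System.out.println(e.host + ":" + e.httpPort + " shufflePort=" + e.shufflePort);
    }
}
```

With this shape there is nothing "fake" about the event: it is an ordinary 
started event with one field left at a placeholder value.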

Is forceFinishTime really necessary?  We can go ahead and set the launch time 
as we are processing the task started event and then just call setFinishTime.

In general I think we should worry about making sure we generate a proper task 
start event and then let the normal task unsuccessful completion event code 
handle things after that.  For example, in DeallocateContainerTransition I 
think we should be generating the job counter update events for this scenario, 
but we don't since we go down a different task unsuccessful completion event 
handling path when launchTime is zero.  Seems like we should just generate the 
missing start event when launchTime is zero, then fall through to the normal 
unsuccessful completion event handling code in all cases after that.
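
A rough sketch of the control flow I have in mind, written as a standalone 
example rather than against the real classes.  Attempt, the emitted list and 
the event label strings are all hypothetical stand-ins; the actual transition 
code and event types differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the suggested flow; not the real task attempt code.
final class AssignedFailureFlowSketch {

    /** Minimal stand-in for a task attempt's bookkeeping. */
    static final class Attempt {
        long launchTime;   // 0 means the attempt never reached a launched state
        long finishTime;
        final List<String> emitted = new ArrayList<>();

        Attempt(long launchTime) {
            this.launchTime = launchTime;
        }
    }

    /** Single unsuccessful-completion path used in all cases. */
    static void handleUnsuccessfulCompletion(Attempt attempt, long now) {
        if (attempt.launchTime == 0) {
            // Failed before launch: generate the missing start event first and
            // record the launch time while processing it.
            attempt.launchTime = now;
            attempt.emitted.add("TASK_ATTEMPT_STARTED");
        }
        // From here on the handling is identical for every attempt, so a plain
        // finish-time update is enough and the counter update is never skipped.
        attempt.finishTime = now;
        attempt.emitted.add("JOB_COUNTER_UPDATE");
        attempt.emitted.add("TASK_ATTEMPT_UNSUCCESSFUL_COMPLETION");
    }

    public static void main(String[] args) {
        Attempt neverLaunched = new Attempt(0L);
        handleUnsuccessfulCompletion(neverLaunched, System.currentTimeMillis());
        System.out.println(neverLaunched.emitted);
    }
}
```

The design choice here is that the launchTime check only decides whether a 
start event is emitted; it no longer selects a separate completion path.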

Nit: missing whitespace before new method in MRApp.


> Task attempts that fail from the ASSIGNED state can disappear
> -------------------------------------------------------------
>
>                 Key: MAPREDUCE-5982
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5982
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am
>    Affects Versions: 2.7.1, 2.2.1, 0.23.10
>            Reporter: Jason Lowe
>            Assignee: Chang Li
>         Attachments: MAPREDUCE-5982.2.patch, MAPREDUCE-5982.3.patch, 
> MAPREDUCE-5982.4.patch, MAPREDUCE-5982.patch
>
>
> If a task attempt fails in the ASSIGNED state, e.g. container launch fails, 
> then it can disappear from the job history.  The task overview page will show 
> subsequent attempts but the attempt in question is simply missing.  For 
> example attempt ID 1 appears but the attempt ID 0 is missing.  Similarly in 
> the job overview page the task attempt doesn't appear in any of the 
> failed/killed/succeeded counts or pages.  It's as if the task attempt never 
> existed, but the AM logs show otherwise.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
