[
https://issues.apache.org/jira/browse/MAPREDUCE-5465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13993652#comment-13993652
]
Jason Lowe commented on MAPREDUCE-5465:
---------------------------------------
The release audit warnings are unrelated, filed MAPREDUCE-5885. The
TestPipeApplication timeout is also unrelated, see MAPREDUCE-5868.
Thanks for updating the patch, Ming! Sorry for the long delay in getting back
to this. I've been thinking about the performance implications of this change.
I'm wondering if we should treat the finishing states as if they're the
corresponding completed states from external entities (i.e.: task/job). We
would send T_ATTEMPT_SUCCEEDED or T_ATTEMPT_FAILED and set task finish times to
the time the attempt said it succeeded or failed rather than the time the
container completed. Similarly we would map the internal finishing states to
their respective external SUCCEEDED/FAILED state rather than RUNNING. From the
task/job perspective, they're not particularly interested in when the attempt
exits; rather, they only care about when the task says its output is available.
This would allow the task and job to react to success/failure transitions in
the same timeframe that it does today, so there should be a minimal performance
impact. The only impact would be if the container needs to complete to free up
enough space for the next task's container to be allocated, and in most cases
the task will complete quickly enough that the AM will receive the new container
in the same heartbeat it would have before this change. Actually, this may end
up being slightly faster than what happens today, since today the AM connects
to the NM and sends the kill command before it considers the task completed.
Under this proposal the task would be considered complete as soon as it
indicated success or failure via the umbilical.
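To make the proposed mapping concrete, here is a minimal sketch. The enum
values and method name are simplified stand-ins I invented for illustration,
not the actual TaskAttemptImpl states or code:

```java
// Illustrative sketch of the proposed internal-to-external state mapping;
// these enums are simplified stand-ins for the real TaskAttempt states.
public class FinishingStateSketch {
  enum InternalState {
    RUNNING, SUCCESS_FINISHING_CONTAINER, FAIL_FINISHING_CONTAINER,
    SUCCEEDED, FAILED
  }
  enum ExternalState { RUNNING, SUCCEEDED, FAILED }

  // Report the finishing states as their terminal external states, so the
  // task/job react as soon as the umbilical reports success/failure rather
  // than waiting for the container to complete.
  static ExternalState toExternal(InternalState s) {
    switch (s) {
      case SUCCESS_FINISHING_CONTAINER:
      case SUCCEEDED:
        return ExternalState.SUCCEEDED;
      case FAIL_FINISHING_CONTAINER:
      case FAILED:
        return ExternalState.FAILED;
      default:
        return ExternalState.RUNNING;
    }
  }
}
```

The point of the mapping is that external observers never see a separate
"finishing" state at all; an attempt in a finishing state already reads as
SUCCEEDED or FAILED.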
Other comments on the latest patch:
- Rather than have the finishing states call the cleanup container transition
and make that transition special-case being called from finishing states, it'd
be cleaner to factor out the common code they're trying to leverage from the
cleanup container transition and call that instead. Transitions doing state or
event checks usually mean something's a bit off, since the transition should
already know what event triggered it and what state(s) it applies to.
- Similarly, the timeout transitions should have dedicated transition code that
not only warns in the AM log but also sets an attempt diagnostic message. It
can re-use some/all of the cleanup container transition so it's not replicating
code. With the diagnostic it will be much more likely the user will be aware
of the timeout issue and fix their task code. Tasks that time out during
finishing can still succeed, so users probably won't even know something went
wrong unless they bother to examine the AM log and happen to notice it.
- This change looks like some accidental reformatting:
{noformat}
--- a/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapred/LocalContainerLauncher.java
+++ b/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapred/LocalContainerLauncher.java
@@ -222,7 +222,7 @@ public void run() {
// remember the current attempt
futures.put(event.getTaskAttemptID(), future);
- } else if (event.getType() == EventType.CONTAINER_REMOTE_CLEANUP) {
+ } else if (event.getType() == EventType.CONTAINER_REMOTE_CLEANUP) {
// cancel (and interrupt) the current running task associated with the
// event
{noformat}
- Nit: a sendContainerCompleted utility method to send the CONTAINER_COMPLETED
event would be nice
- Nit: code should be formatted to 80 columns, comments for the state
transitions in particular.
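Putting the refactoring suggestions above together, a rough sketch of the shape
I have in mind follows. All class and method names here are invented for
illustration, and event dispatch and logging are reduced to string lists so the
control flow stays visible; the real code would live in the attempt's state
transitions:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: invented names, with event dispatch and the AM
// log reduced to string lists so the control flow is easy to follow.
class TaskAttemptTransitionSketch {
  final List<String> events = new ArrayList<>();
  final List<String> diagnostics = new ArrayList<>();
  final List<String> warnings = new ArrayList<>();

  // Common code factored out of the cleanup container transition so the
  // finishing states can call it directly instead of special-casing.
  void finalizeContainerCleanup(String attemptId) {
    events.add("CONTAINER_REMOTE_CLEANUP:" + attemptId);
  }

  // The existing cleanup transition becomes a thin wrapper over the helper.
  void cleanupContainerTransition(String attemptId) {
    finalizeContainerCleanup(attemptId);
  }

  // Dedicated timeout transition: warn in the AM log AND set an attempt
  // diagnostic so the user can actually notice the timeout, then reuse the
  // common cleanup code rather than replicating it.
  void finishingTimeoutTransition(String attemptId, long timeoutMs) {
    String msg = "Task attempt " + attemptId
        + " finished but its container has not exited after " + timeoutMs
        + " ms; forcing container cleanup";
    warnings.add(msg);       // stands in for LOG.warn(msg)
    diagnostics.add(msg);    // visible to the user, unlike the AM log
    finalizeContainerCleanup(attemptId);
  }

  // Small utility so each caller doesn't hand-build the same event.
  void sendContainerCompleted(String attemptId) {
    events.add("CONTAINER_COMPLETED:" + attemptId);
  }
}
```

The key property is that the timeout path shares the cleanup helper with the
normal path, and the only thing it adds is the warning plus the diagnostic.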
> Container killed before hprof dumps profile.out
> -----------------------------------------------
>
> Key: MAPREDUCE-5465
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5465
> Project: Hadoop Map/Reduce
> Issue Type: Improvement
> Components: mr-am, mrv2
> Affects Versions: trunk, 2.0.3-alpha
> Reporter: Radim Kolar
> Assignee: Ming Ma
> Attachments: MAPREDUCE-5465-2.patch, MAPREDUCE-5465-3.patch,
> MAPREDUCE-5465-4.patch, MAPREDUCE-5465-5.patch, MAPREDUCE-5465-6.patch,
> MAPREDUCE-5465.patch
>
>
> If profiling is enabled for a mapper or reducer, hprof dumps profile.out at
> process exit. It is dumped after the task has signaled to the AM that its
> work is finished.
> The AM kills the container of the finished task without waiting for hprof to
> finish its dumps. If hprof is dumping larger output (such as with depth=4
> when depth=3 works), it cannot finish the dump before being killed, making
> the entire dump unusable because the cpu and heap stats are missing.
> There needs to be a better delay before the container is killed if profiling
> is enabled.
--
This message was sent by Atlassian JIRA
(v6.2#6252)