[
https://issues.apache.org/jira/browse/MAPREDUCE-5465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13993652#comment-13993652
]
Jason Lowe commented on MAPREDUCE-5465:
---------------------------------------
The release audit warnings are unrelated, filed MAPREDUCE-5885. The
TestPipeApplication timeout is also unrelated, see MAPREDUCE-5868.
Thanks for updating the patch, Ming! Sorry for the long delay in getting back
to this. I've been thinking about the performance implications of this change.
I'm wondering if we should treat the finishing states as if they're the
corresponding completed states from external entities (i.e.: task/job). We
would send T_ATTEMPT_SUCCEEDED or T_ATTEMPT_FAILED and set task finish times to
the time the attempt said it succeeded or failed rather than the time the
container completed. Similarly we would map the internal finishing states to
their respective external SUCCEEDED/FAILED state rather than RUNNING. From the
task/job perspective, they're not particularly interested in when the attempt
exits; rather, they only care about when the task says its output is available.
This would allow the task and job to react to success/failure transitions in
the same timeframe that it does today, so there should be a minimal performance
impact. The only impact would be if the container needs to complete to free up
enough space for the next task's container to be allocated, and in most cases
the task will complete quickly enough that the AM will receive the new container
in the same heartbeat it would have before this change. Actually, this may end
up being slightly faster than what happens today, since today the AM connects
to the NM and sends the kill command before it considers the task completed.
Under this proposal the task would be considered complete as soon as it
indicated success or failure via the umbilical.
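To make the proposed mapping concrete, here is a minimal sketch. The enum
values and method name are simplified stand-ins I invented for illustration,
not the actual TaskAttemptImpl states or code:

```java
// Illustrative sketch of the proposed internal-to-external state mapping;
// these enums are simplified stand-ins for the real TaskAttempt states.
public class FinishingStateSketch {
  enum InternalState {
    RUNNING, SUCCESS_FINISHING_CONTAINER, FAIL_FINISHING_CONTAINER,
    SUCCEEDED, FAILED
  }
  enum ExternalState { RUNNING, SUCCEEDED, FAILED }

  // Report the finishing states as their terminal external states, so the
  // task/job react as soon as the umbilical reports success/failure rather
  // than waiting for the container to complete.
  static ExternalState toExternal(InternalState s) {
    switch (s) {
      case SUCCESS_FINISHING_CONTAINER:
      case SUCCEEDED:
        return ExternalState.SUCCEEDED;
      case FAIL_FINISHING_CONTAINER:
      case FAILED:
        return ExternalState.FAILED;
      default:
        return ExternalState.RUNNING;
    }
  }
}
```

The point of the mapping is that external observers never see a separate
"finishing" state at all; an attempt in a finishing state already reads as
SUCCEEDED or FAILED.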
Other comments on the latest patch:
- Rather than have the finishing states call the cleanup container transition
and make that transition special-case being called from finishing states, it'd
be cleaner to factor out the common code they're trying to leverage from the
cleanup container transition and call that instead. Transitions doing state or
event checks usually mean something's a bit off, since the transition should
already know what event triggered it and what state(s) it applies to.
- Similarly, the timeout transitions should have dedicated transition code that
not only warns in the AM log but also sets an attempt diagnostic message. It
can re-use some/all of the cleanup container transition so it's not replicating
code. With the diagnostic it will be much more likely the user will be aware
of the timeout issue and fix their task code. Tasks that time out during
finishing can still succeed, so users probably won't even know something went
wrong unless they bother to examine the AM log and happen to notice it.
- This change looks like some accidental reformatting:
{noformat}
--- a/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapred/LocalContainerLauncher.java
+++ b/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapred/LocalContainerLauncher.java
@@ -222,7 +222,7 @@ public void run() {
// remember the current attempt
futures.put(event.getTaskAttemptID(), future);
- } else if (event.getType() == EventType.CONTAINER_REMOTE_CLEANUP) {
+ } else if (event.getType() == EventType.CONTAINER_REMOTE_CLEANUP) {
// cancel (and interrupt) the current running task associated with the
// event
{noformat}
- Nit: a sendContainerCompleted utility method to send the CONTAINER_COMPLETED
event would be nice
- Nit: code should be formatted to 80 columns, comments for the state
transitions in particular.
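Putting the refactoring suggestions above together, a rough sketch of the shape
I have in mind follows. All class and method names here are invented for
illustration, and event dispatch and logging are reduced to string lists so the
control flow stays visible; the real code would live in the attempt's state
transitions:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: invented names, with event dispatch and the AM
// log reduced to string lists so the control flow is easy to follow.
class TaskAttemptTransitionSketch {
  final List<String> events = new ArrayList<>();
  final List<String> diagnostics = new ArrayList<>();
  final List<String> warnings = new ArrayList<>();

  // Common code factored out of the cleanup container transition so the
  // finishing states can call it directly instead of special-casing.
  void finalizeContainerCleanup(String attemptId) {
    events.add("CONTAINER_REMOTE_CLEANUP:" + attemptId);
  }

  // The existing cleanup transition becomes a thin wrapper over the helper.
  void cleanupContainerTransition(String attemptId) {
    finalizeContainerCleanup(attemptId);
  }

  // Dedicated timeout transition: warn in the AM log AND set an attempt
  // diagnostic so the user can actually notice the timeout, then reuse the
  // common cleanup code rather than replicating it.
  void finishingTimeoutTransition(String attemptId, long timeoutMs) {
    String msg = "Task attempt " + attemptId
        + " finished but its container has not exited after " + timeoutMs
        + " ms; forcing container cleanup";
    warnings.add(msg);       // stands in for LOG.warn(msg)
    diagnostics.add(msg);    // visible to the user, unlike the AM log
    finalizeContainerCleanup(attemptId);
  }

  // Small utility so each caller doesn't hand-build the same event.
  void sendContainerCompleted(String attemptId) {
    events.add("CONTAINER_COMPLETED:" + attemptId);
  }
}
```

The key property is that the timeout path shares the cleanup helper with the
normal path, and the only thing it adds is the warning plus the diagnostic.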
> Container killed before hprof dumps profile.out
> -----------------------------------------------
>
> Key: MAPREDUCE-5465
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5465
> Project: Hadoop Map/Reduce
> Issue Type: Improvement
> Components: mr-am, mrv2
> Affects Versions: trunk, 2.0.3-alpha
> Reporter: Radim Kolar
> Assignee: Ming Ma
> Attachments: MAPREDUCE-5465-2.patch, MAPREDUCE-5465-3.patch,
> MAPREDUCE-5465-4.patch, MAPREDUCE-5465-5.patch, MAPREDUCE-5465-6.patch,
> MAPREDUCE-5465.patch
>
>
> If profiling is enabled for a mapper or reducer, hprof dumps profile.out at
> process exit. It is dumped after the task has signaled to the AM that its
> work is finished.
> The AM kills the container of the finished task without waiting for hprof to
> finish its dumps. If hprof is dumping larger output (such as with depth=4
> when depth=3 works), it cannot finish the dump before being killed, making
> the entire dump unusable because the cpu and heap stats are missing.
> There needs to be a better delay before the container is killed if profiling
> is enabled.
--
This message was sent by Atlassian JIRA
(v6.2#6252)