[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074185#comment-14074185
 ] 

Zhijie Shen commented on MAPREDUCE-6002:
----------------------------------------

TaskUmbilicalProtocol#fsError and #fatalError are the two calls that 
result in TA_FAILMSG and consequently move a task attempt to FAILED. Checking 
whether the task attempt process is stopping, and only notifying the listener 
when the process is NOT stopping, can prevent the task attempt from being moved 
to FAILED because of an exception caused by stopping the process, as in the 
aforementioned case.
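
To make the idea concrete, here is a minimal sketch of such a guard. It is not the attached patch and does not use the real Hadoop classes: GuardedErrorReporter, ErrorListener and markStopping are hypothetical names, and the listener interface only stands in for the umbilical calls that lead to TA_FAILMSG.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch; none of these names come from the MR-6002 patch or Hadoop.
class GuardedErrorReporter {
    /** Stand-in for the umbilical calls (fsError/fatalError) that lead to TA_FAILMSG. */
    interface ErrorListener {
        void fatalError(String attemptId, String message);
    }

    private final AtomicBoolean stopping = new AtomicBoolean(false);
    private final ErrorListener listener;

    GuardedErrorReporter(ErrorListener listener) {
        this.listener = listener;
    }

    /** Intended to be called from a shutdown hook. */
    void markStopping() {
        stopping.set(true);
    }

    /** Forwards the error only when the process is NOT stopping; returns true if forwarded. */
    boolean reportFatalError(String attemptId, Throwable error) {
        if (stopping.get()) {
            return false; // the process is going down, so the error is suppressed
        }
        listener.fatalError(attemptId, String.valueOf(error));
        return true;
    }

    public static void main(String[] args) {
        List<String> received = new ArrayList<>();
        GuardedErrorReporter reporter =
            new GuardedErrorReporter((id, msg) -> received.add(id + ": " + msg));

        // In a real process the flag would be flipped by a shutdown hook.
        Runtime.getRuntime().addShutdownHook(new Thread(reporter::markStopping));

        // Reported: the process is not stopping yet.
        reporter.reportFatalError("attempt_0", new IOException("disk full"));

        // Simulate the hook having fired; the next error is dropped.
        reporter.markStopping();
        reporter.reportFatalError("attempt_0", new IOException("Filesystem closed"));

        System.out.println(received);
    }
}
```

Only the first error reaches the listener; the "Filesystem closed" error raised after the flag is set is dropped.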

In general, the solution makes sense to me. Just one concern: it may introduce 
another race condition in the opposite direction. For example, an exception 
that is NOT caused by stopping the task attempt process may happen JUST before 
the shutdown hook is invoked. Then, by the time we check whether the task 
attempt process is stopping, the check already returns true. In this extreme 
case, the exception is missed by the listener, and the task attempt is moved to 
PREEMPTED instead of FAILED.
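
The interleaving I have in mind can be replayed deterministically with a small sketch. All names here (MissedFailureRace, Attempt, replay) are hypothetical, and the model assumes the attempt ends up PREEMPTED whenever no error report gets through.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical model of the race; not Hadoop code.
class MissedFailureRace {
    enum Outcome { FAILED, PREEMPTED }

    static final class Attempt {
        final AtomicBoolean stopping = new AtomicBoolean(false);
        // Assumption: without an error report, the preempted attempt ends up PREEMPTED.
        Outcome outcome = Outcome.PREEMPTED;

        void report(Throwable error) {
            // The guard under discussion: notify only when NOT stopping.
            if (!stopping.get()) {
                outcome = Outcome.FAILED;
            }
        }
    }

    /** Replays the steps in a fixed order instead of relying on thread timing. */
    static Outcome replay(boolean shutdownHookRunsBeforeCheck) {
        Attempt attempt = new Attempt();
        // Step 1: a genuine failure, unrelated to shutdown, is thrown.
        Throwable genuine = new IllegalStateException("genuine task failure");
        // Step 2: the shutdown hook may fire between the throw and the check.
        if (shutdownHookRunsBeforeCheck) {
            attempt.stopping.set(true);
        }
        // Step 3: the error handler checks the flag and reports, or does not.
        attempt.report(genuine);
        return attempt.outcome;
    }

    public static void main(String[] args) {
        System.out.println("hook after check:  " + replay(false));
        System.out.println("hook before check: " + replay(true));
    }
}
```

The same genuine exception yields FAILED in the first replay and PREEMPTED in the second, purely because of where the shutdown hook lands.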

While marking a TA that is supposed to be PREEMPTED as FAILED, and vice versa, 
are both rare cases, IMHO they have different levels of downside. Marking a TA 
that is supposed to be PREEMPTED as FAILED is likely to make the task unable 
to retry. On the other hand, marking a TA that is supposed to be FAILED as 
PREEMPTED will let the task retry even if it has used up its retry quota, which 
is not too bad. Offering users more than what they are promised sounds better 
than offering less. Any thoughts?
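
The asymmetry can be sketched with a toy retry budget. This is a simplified, hypothetical model (RetryBudget and recordAndCanRetry are made-up names), and it assumes that only FAILED attempts count against the limit while PREEMPTED attempts do not.

```java
// Simplified, hypothetical model of attempt accounting; not Hadoop code.
class RetryBudget {
    enum Outcome { FAILED, PREEMPTED }

    private final int maxAttempts;
    private int failedAttempts = 0;

    RetryBudget(int maxAttempts) {
        this.maxAttempts = maxAttempts;
    }

    /** Records a finished attempt and returns true if the task may still be retried. */
    boolean recordAndCanRetry(Outcome outcome) {
        // Assumption: only FAILED attempts consume the retry quota.
        if (outcome == Outcome.FAILED) {
            failedAttempts++;
        }
        return failedAttempts < maxAttempts;
    }

    public static void main(String[] args) {
        // A preempted attempt mislabelled FAILED consumes the last retry.
        RetryBudget strict = new RetryBudget(1);
        System.out.println(strict.recordAndCanRetry(Outcome.FAILED));

        // A failed attempt mislabelled PREEMPTED gets one more try than promised.
        RetryBudget lenient = new RetryBudget(1);
        System.out.println(lenient.recordAndCanRetry(Outcome.PREEMPTED));
    }
}
```

In the first case the user gets fewer attempts than configured; in the second, one more than configured.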

> MR task should prevent report error to AM when process is shutting down
> -----------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6002
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6002
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: task
>    Affects Versions: 2.5.0
>            Reporter: Wangda Tan
>            Assignee: Wangda Tan
>         Attachments: MR-6002.patch
>
>
> With MAPREDUCE-5900, a preempted MR task should not be treated as failed. 
> But it is still possible for an MR task to fail and report to the AM when 
> preemption takes effect and the AM hasn't yet received the completed 
> container from the RM. This causes the task attempt to be marked failed 
> instead of preempted.
> For example, FileSystem has a shutdown hook that closes all FileSystem 
> instances. If a FileSystem is in use at the same time (e.g. reading split 
> details from HDFS), the MR task will fail and report the fatal error to the 
> MR AM. An exception will be raised:
> {code}
> 2014-07-22 01:46:19,613 FATAL [IPC Server handler 10 on 56903] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1405985051088_0018_m_000025_0 - exited : java.io.IOException: Filesystem closed
>       at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:707)
>       at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:776)
>       at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:837)
>       at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:645)
>       at java.io.DataInputStream.readByte(DataInputStream.java:265)
>       at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:308)
>       at org.apache.hadoop.io.WritableUtils.readVIntInRange(WritableUtils.java:348)
>       at org.apache.hadoop.io.Text.readString(Text.java:464)
>       at org.apache.hadoop.io.Text.readString(Text.java:457)
>       at org.apache.hadoop.mapred.MapTask.getSplitDetails(MapTask.java:357)
>       at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:731)
>       at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
>       at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at javax.security.auth.Subject.doAs(Subject.java:415)
>       at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1594)
>       at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
> {code}
> We should prevent this. Other exceptions may also happen while the process 
> is shutting down, and we shouldn't report any of them to the AM.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
