[ 
https://issues.apache.org/jira/browse/TEZ-3767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16060182#comment-16060182
 ] 

Siddharth Seth commented on TEZ-3767:
-------------------------------------

Exceptions from ShuffleCallable are supposed to be handled by the onFailure 
method in ShuffleRunnerFutureCallback, which already has checks in place for 
whether shutdown has been invoked or not.
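
A minimal sketch of that kind of shutdown check, for illustration only. The
class ShutdownAwareCallback and its members are hypothetical stand-ins, not
the actual ShuffleRunnerFutureCallback code in Tez; the point is only that a
failure arriving after shutdown is dropped instead of being reported.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for a callback that receives failures from a
// shuffle callable. Not the real Tez class.
class ShutdownAwareCallback {
    private final AtomicBoolean isShutDown = new AtomicBoolean(false);
    // Stands in for "errors reported to the AM" in this sketch.
    private final List<Throwable> reported = new ArrayList<>();

    void shutdown() {
        isShutDown.set(true);
    }

    void onFailure(Throwable t) {
        if (isShutDown.get()) {
            // Shutdown already requested: failures seen now are most likely
            // side effects of the teardown, so they are not reported.
            return;
        }
        reported.add(t);
    }

    List<Throwable> reportedErrors() {
        return reported;
    }
}
```

With this shape, an exception thrown while resources are being released
during shutdown never reaches the reporting path.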

The killSelf invocation in ShuffleScheduler ends up invoking close on the 
ShuffleScheduler, which is part of Shuffle. Normally, Shuffle is supposed to 
control the shutdown of this component. I think it would be better to extend 
the ExceptionReporter interface implemented by Shuffle to include the killSelf 
functionality. That way, Shuffle remains responsible for the ShuffleScheduler 
lifecycle, including on the killSelf path.
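
A rough sketch of the proposal above, under stated assumptions. All type
names (ExceptionReporterSketch, SchedulerSketch, ShuffleSketch), the killSelf
signature, and the events list are hypothetical and chosen for illustration;
they are not taken from the Tez code or from the attached patches. The sketch
is single-threaded and ignores synchronization.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical version of the reporter interface, extended with killSelf.
interface ExceptionReporterSketch {
    void reportException(Throwable t);
    void killSelf(Exception cause, String message);
}

// Hypothetical scheduler: on a fatal error it asks its owner to kill the
// attempt instead of closing itself.
class SchedulerSketch {
    private final ExceptionReporterSketch reporter;
    private boolean closed = false;

    SchedulerSketch(ExceptionReporterSketch reporter) {
        this.reporter = reporter;
    }

    void onFatalFetchError(Exception e) {
        reporter.killSelf(e, "fatal fetch error");
    }

    void close() {
        closed = true;
    }

    boolean isClosed() {
        return closed;
    }
}

// Hypothetical owner: it alone decides when the scheduler is closed.
class ShuffleSketch implements ExceptionReporterSketch {
    // Records what would be sent outward (error report or kill request).
    final List<String> events = new ArrayList<>();
    private final SchedulerSketch scheduler = new SchedulerSketch(this);
    private boolean isShutDown = false;

    SchedulerSketch scheduler() {
        return scheduler;
    }

    @Override
    public void reportException(Throwable t) {
        if (isShutDown) {
            return; // do not report errors caused by our own shutdown
        }
        events.add("reported:" + t.getMessage());
    }

    @Override
    public void killSelf(Exception cause, String message) {
        if (isShutDown) {
            return;
        }
        isShutDown = true;   // mark shutdown before releasing resources
        scheduler.close();   // the owner drives the scheduler lifecycle
        events.add("killed:" + message);
    }
}
```

Because the shutdown flag is set before the scheduler is closed, an error
raised afterwards (for example from a merge that was still running) is
dropped by reportException rather than reported.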

> Shuffle should not report error to AM during inputContext.killSelf()
> --------------------------------------------------------------------
>
>                 Key: TEZ-3767
>                 URL: https://issues.apache.org/jira/browse/TEZ-3767
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Rajesh Balamohan
>            Assignee: Rajesh Balamohan
>         Attachments: TEZ-3767.1.patch, TEZ-3767.2.patch
>
>
> {{ShuffleScheduler::killSelf}} kills the current attempt when it encounters 
> certain errors. As a part of cleanup, it invokes {{close}} which internally 
> releases the resources.
> If a merge is in progress at that point, it could throw the following 
> exception. This is caught in {{RunShuffleCallable}} and reported to the AM 
> immediately, which causes the task to fail.
> {noformat}
> Error: Error while running task ( failure ) : org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$ShuffleError: Error while doing final merge
>   at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:320)
>   at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:285)
>   at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.util.ConcurrentModificationException
>   at java.util.TreeMap$PrivateEntryIterator.nextEntry(TreeMap.java:1211)
>   at java.util.TreeMap$KeyIterator.next(TreeMap.java:1265)
>   at java.util.AbstractCollection.toArray(AbstractCollection.java:141)
>   at java.util.ArrayList.addAll(ArrayList.java:577)
>   at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.MergeManager.close(MergeManager.java:636)
>   at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:316)
>   ... 6 more
> {noformat}
> When {{isShutDown}} is set to true, it would be good to avoid sending error 
> messages to the AM.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
