[ https://issues.apache.org/jira/browse/TEZ-3767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16060182#comment-16060182 ]
Siddharth Seth commented on TEZ-3767:
-------------------------------------

Exceptions from ShuffleCallable are supposed to be handled by the onFailure method in ShuffleRunnerFutureCallback, which already has checks in place for whether shutdown has been invoked or not.

The killSelf invocation in ShuffleScheduler ends up invoking close on the ShuffleScheduler, which is part of Shuffle. Normally, Shuffle is supposed to control this component shutting down.

I think it will be better if we extend the ExceptionReporter interface implemented by Shuffle to include the killSelf functionality. With that, Shuffle will continue to be responsible for the ShuffleScheduler lifecycle, after it kills itself.

> Shuffle should not report error to AM during inputContext.killSelf()
> --------------------------------------------------------------------
>
>                 Key: TEZ-3767
>                 URL: https://issues.apache.org/jira/browse/TEZ-3767
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Rajesh Balamohan
>            Assignee: Rajesh Balamohan
>         Attachments: TEZ-3767.1.patch, TEZ-3767.2.patch
>
>
> {{ShuffleScheduler::killSelf}} kills the current attempt when it encounters
> certain errors. As a part of cleanup, it invokes {{close}} which internally
> releases the resources.
> If merge is happening in the middle, it could throw the following exception.
> This is caught in {{RunShuffleCallable}} and reported to AM immediately. This
> causes tasks to fail.
> {noformat}
> » Error: Error while running task ( failure ) : org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$ShuffleError: Error while doing final merge
>     at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:320)
>     at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:285)
>     at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: java.util.ConcurrentModificationException
>     at java.util.TreeMap$PrivateEntryIterator.nextEntry(TreeMap.java:1211)
>     at java.util.TreeMap$KeyIterator.next(TreeMap.java:1265)
>     at java.util.AbstractCollection.toArray(AbstractCollection.java:141)
>     at java.util.ArrayList.addAll(ArrayList.java:577)
>     at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.MergeManager.close(MergeManager.java:636)
>     at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:316)
>     ... 6 more
> {noformat}
> When {{isShutDown}} is set to true, it would be good to avoid sending error
> messages to AM.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
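
For illustration only, the suggestion in the comment (routing killSelf through the ExceptionReporter that Shuffle implements, so that Shuffle keeps ownership of the ShuffleScheduler lifecycle) might look roughly like the sketch below. This is not the Tez code and not the attached patch: the killSelf signature, the SchedulerSketch and ShuffleSketch classes, and the two lists standing in for inputContext.killSelf() and for error reports to the AM are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

// Simplified stand-in for the ExceptionReporter interface named in the comment.
// The killSelf method and its signature are assumptions for this sketch.
interface ExceptionReporter {
    void reportException(Throwable t);

    void killSelf(Exception cause, String message);
}

// Hypothetical stand-in for ShuffleScheduler. It never closes itself;
// it only asks its owner (via the reporter) to kill the attempt.
class SchedulerSketch {
    private final ExceptionReporter reporter;
    private boolean closed = false;

    SchedulerSketch(ExceptionReporter reporter) {
        this.reporter = reporter;
    }

    void onFatalFetchError(Exception cause) {
        reporter.killSelf(cause, "fatal fetch error");
    }

    void close() {
        closed = true;
    }

    boolean isClosed() {
        return closed;
    }
}

// Hypothetical stand-in for Shuffle, which owns the scheduler lifecycle.
class ShuffleSketch implements ExceptionReporter {
    private final AtomicBoolean isShutDown = new AtomicBoolean(false);
    // Stands in for errors that would be reported to the AM.
    final List<String> reportedToAm = new ArrayList<>();
    // Stands in for calls to inputContext.killSelf().
    final List<String> killRequests = new ArrayList<>();
    final SchedulerSketch scheduler = new SchedulerSketch(this);

    @Override
    public void reportException(Throwable t) {
        if (isShutDown.get()) {
            return; // errors raised during teardown are not sent to the AM
        }
        reportedToAm.add(t.toString());
    }

    @Override
    public void killSelf(Exception cause, String message) {
        // Only the first request has any effect; Shuffle closes the scheduler.
        if (!isShutDown.getAndSet(true)) {
            killRequests.add(message);
            scheduler.close();
        }
    }
}

class KillSelfSketchDemo {
    public static void main(String[] args) {
        ShuffleSketch shuffle = new ShuffleSketch();
        shuffle.scheduler.onFatalFetchError(new java.io.IOException("fetch failed"));
        // A late merge error, like the ConcurrentModificationException above.
        shuffle.reportException(new java.util.ConcurrentModificationException());
        System.out.println("kill requests: " + shuffle.killRequests.size()
            + ", errors reported to AM: " + shuffle.reportedToAm.size());
    }
}
```

In this sketch the shutdown flag is set before the scheduler is closed, so an exception raised by the teardown itself is dropped rather than reported as a task failure.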