[ 
https://issues.apache.org/jira/browse/MAPREDUCE-3355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13147471#comment-13147471
 ] 

Siddharth Seth commented on MAPREDUCE-3355:
-------------------------------------------

There's another extremely unlikely situation that could cause this. 
Canceling the timer has no effect on a timer task that has already started 
running, so an interrupt could arrive at any time after the cancel and 
interrupt the dispatch of the TA_CONTAINER_CLEANED event or the 
ContainerLaunchedEvent. This would require startContainer or stopContainer to 
complete at about the same moment the timer expires, combined with some very 
specific thread scheduling.
A possible fix would be to synchronize on the CommandTimer in the main task 
for the sections where we don't care about interrupts, and to have the 
CommandTimer always synchronize on itself.
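
As a rough illustration of that idea (not the actual ContainerLauncherImpl 
code), here is a minimal Java sketch. The class CommandTimerSketch, the 
disarmed flag, and the runCommand helper are all hypothetical names, and the 
flag is an assumption beyond what the comment proposes; Thread.sleep stands in 
for the start/stopContainer call. Because run() and disarm() both lock the 
timer task itself, an interrupt is either delivered before disarm() returns or 
never, so a late interrupt can be cleared before any event is dispatched.

```java
import java.util.Timer;
import java.util.TimerTask;

final class CommandTimerSketch {

    /** Hypothetical stand-in for the CommandTimer discussed in the comment. */
    static final class CommandTimer extends TimerTask {
        private final Thread commandThread;
        private boolean disarmed = false; // guarded by the monitor of this task

        CommandTimer(Thread commandThread) {
            this.commandThread = commandThread;
        }

        // The timer task always synchronizes on itself before interrupting.
        @Override
        public synchronized void run() {
            if (!disarmed) {
                commandThread.interrupt();
            }
        }

        // Called by the command thread once interrupts are no longer wanted.
        // After this returns, run() can no longer interrupt the thread.
        synchronized void disarm() {
            disarmed = true;
            cancel();
        }
    }

    /** Runs a simulated command; returns true if the timer interrupted it. */
    static boolean runCommand(Timer timer, long timeoutMs, long workMs) {
        CommandTimer commandTimer = new CommandTimer(Thread.currentThread());
        timer.schedule(commandTimer, timeoutMs);
        boolean timedOut = false;
        try {
            Thread.sleep(workMs); // assumption: placeholder for the remote call
        } catch (InterruptedException e) {
            timedOut = true;
        } finally {
            commandTimer.disarm();
        }
        // Clear an interrupt that landed after the work finished but before
        // disarm(), so it cannot hit the event dispatch that would follow.
        Thread.interrupted();
        return timedOut;
    }

    public static void main(String[] args) {
        Timer timer = new Timer(true); // daemon timer thread
        System.out.println("fast command timedOut=" + runCommand(timer, 2000L, 10L));
        System.out.println("slow command timedOut=" + runCommand(timer, 20L, 2000L));
        timer.cancel();
    }
}
```

In a real launcher, the event send would follow the Thread.interrupted() 
call, at which point the thread's interrupt status is known to be clear.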

                
> AM scheduling hangs frequently with sort job on 350 nodes
> ---------------------------------------------------------
>
>                 Key: MAPREDUCE-3355
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3355
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Vinod Kumar Vavilapalli
>            Priority: Blocker
>             Fix For: 0.23.1
>
>         Attachments: MAPREDUCE-3355-20111109.1.txt, 
> MAPREDUCE-3355-20111109.txt
>
>
> Another collaboration with [~karams]. The sort job hangs fairly often on a 
> 350-node cluster. Found this in the AM logs:
> {code}
> Exception in thread "ContainerLauncher #60" org.apache.hadoop.yarn.YarnException: java.lang.InterruptedException
>             at org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:170)
>             at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:379)
>             at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>             at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>             at java.lang.Thread.run(Thread.java:619)
> Caused by: java.lang.InterruptedException
>             at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1199)
>             at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:312)
>             at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:294)
>             at org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:168)
>             ... 4 more
> Exception in thread "ContainerLauncher #53" org.apache.hadoop.yarn.YarnException: java.lang.InterruptedException
>             at org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:170)
>             at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl.sendContainerLaunchFailedMsg(ContainerLauncherImpl.java:405)
>             at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:330)
>             at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>             at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>             at java.lang.Thread.run(Thread.java:619)
> Caused by: java.lang.InterruptedException
>             at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1199)
>             at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:312)
>             at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:294)
>             at org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:168)
>             ... 5 more
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
