Shanthoosh Venkataraman created SAMZA-1506:
----------------------------------------------

             Summary: Potential orphaned containers in LocalContainerRunner.
                 Key: SAMZA-1506
                 URL: https://issues.apache.org/jira/browse/SAMZA-1506
             Project: Samza
          Issue Type: Bug
            Reporter: Shanthoosh Venkataraman
            Assignee: Abhishek Shivanna
             Fix For: 0.14.0


We observed an orphaned container in the LinkedIn production environment 
(using samza-yarn). 

The ContainerHeartbeatMonitor, added as part of SAMZA-871 to solve this 
problem, was alive in the orphaned container's Java process but did not shut 
it down. 

ContainerHeartbeatMonitor uses a single-threaded ScheduledExecutorService to 
periodically check whether the container is orphaned.

From the following thread dump of the process, it is apparent that the worker 
thread in the ScheduledExecutorService found its task queue empty and went 
into the WAITING state (expecting new tasks to be added to the queue).

{code:java}
"Samza-ContainerHeartbeatMonitor-0" #34 prio=5 os_prio=0 tid=0x00007f9322896800 nid=0x38af waiting on condition [0x00007f92f363e000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x000000070078a0e8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
        at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1081)
        at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809)
        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1067)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
{code}

If the execution of a Runnable submitted to 
ScheduledExecutorService.scheduleAtFixedRate throws an exception, subsequent 
executions are suppressed. 
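
The suppression behaviour can be reproduced outside Samza. The following is a 
minimal standalone sketch, not Samza code: the class name 
SuppressedScheduleDemo and the method runFailingTask are made up for 
illustration. It schedules a task that throws an unchecked exception on its 
first run and counts how often the task is executed.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo class; it is not part of Samza.
class SuppressedScheduleDemo {

    /** Schedules a periodic task that always throws and returns how many times it ran. */
    public static int runFailingTask() {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        AtomicInteger runs = new AtomicInteger();
        try {
            ScheduledFuture<?> future = scheduler.scheduleAtFixedRate(() -> {
                runs.incrementAndGet();
                throw new IllegalStateException("simulated unchecked failure");
            }, 0, 10, TimeUnit.MILLISECONDS);

            try {
                // For a periodic task, get() returns only through an exception:
                // here, once the task has terminated abnormally.
                future.get();
            } catch (ExecutionException e) {
                // Expected: the cause is the IllegalStateException thrown above.
            }

            // Leave time for many more periods; none of them will execute.
            Thread.sleep(200);
            return runs.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return runs.get();
        } finally {
            scheduler.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("runs = " + runFailingTask()); // prints "runs = 1"
    }
}
```

After the failure the executor itself is still running, so its worker thread 
parks waiting for new tasks, which is consistent with the WAITING state seen 
in the thread dump.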

The existing ContainerHeartBeatClient implementation, which accesses the 
ApplicationMaster HTTP endpoint to get container liveness, handles only 
IOException. Any unchecked exception thrown from that code path will 
effectively shut down the ContainerHeartbeatMonitor, because the scheduled 
check is never run again (this is the suspected cause).
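
One possible mitigation, sketched here under assumptions and not necessarily 
the fix that will be adopted, is to catch unchecked exceptions inside the 
scheduled Runnable so that nothing escapes to the scheduler. The names 
GuardedHeartbeatSketch, guard and survivesFailures are hypothetical, and the 
failing lambda merely stands in for the real heartbeat check.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch; it is not the Samza implementation.
class GuardedHeartbeatSketch {

    /** Wraps a periodic check so that no unchecked exception reaches the scheduler. */
    public static Runnable guard(Runnable check) {
        return () -> {
            try {
                check.run();
            } catch (RuntimeException e) {
                // Report and continue; the next scheduled run still happens.
                System.err.println("heartbeat check failed: " + e);
            }
        };
    }

    /** Returns true if a check that always throws is still executed at least three times. */
    public static boolean survivesFailures() {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        CountDownLatch threeRuns = new CountDownLatch(3);
        try {
            scheduler.scheduleAtFixedRate(guard(() -> {
                threeRuns.countDown();
                throw new IllegalStateException("simulated unchecked failure");
            }), 0, 10, TimeUnit.MILLISECONDS);
            // Wait (with a generous timeout) for the third execution.
            return threeRuns.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            scheduler.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("survives failures: " + survivesFailures());
    }
}
```

Whether a failed check should merely be logged, or should be treated as a 
sign that the container is orphaned, is a separate policy decision that this 
sketch does not address.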

This requires further investigation.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
