Shanthoosh Venkataraman created SAMZA-1506:
----------------------------------------------
Summary: Potential orphaned containers in LocalContainerRunner.
Key: SAMZA-1506
URL: https://issues.apache.org/jira/browse/SAMZA-1506
Project: Samza
Issue Type: Bug
Reporter: Shanthoosh Venkataraman
Assignee: Abhishek Shivanna
Fix For: 0.14.0
We noticed an occurrence of an orphaned container in the LinkedIn production
environment (using samza-yarn).
The ContainerHeartbeatMonitor, added as part of SAMZA-871 to solve this problem,
was alive in the orphaned container's Java process but did not shut it down.
ContainerHeartbeatMonitor uses a single-threaded ScheduledExecutorService to
periodically check whether the container is orphaned.
From the following process thread dump, it is apparent that the worker thread
in the ScheduledExecutorService found the task queue empty and went into a
waiting state (expecting new tasks to be added to the queue).
{code:java}
"Samza-ContainerHeartbeatMonitor-0" #34 prio=5 os_prio=0 tid=0x00007f9322896800 nid=0x38af waiting on condition [0x00007f92f363e000]
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for <0x000000070078a0e8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
    at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1081)
    at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809)
    at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1067)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
{code}
If the execution of a Runnable submitted to
ScheduledExecutorService.scheduleAtFixedRate throws an exception, subsequent
executions are suppressed.
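The suppression behaviour can be reproduced outside Samza. The following is a minimal sketch, not Samza code: the class and method names (SuppressedScheduleDemo, countRunsBeforeSuppression) and the 20 ms period are made up for illustration. The task throws an unchecked exception on its second run; the executor captures the exception in the returned ScheduledFuture, never reschedules the task, and leaves its worker thread parked on the empty queue, which matches the thread dump above.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo class; not part of Samza.
class SuppressedScheduleDemo {

    /** Schedules a task that throws on its second run; returns how often it ran. */
    static int countRunsBeforeSuppression() {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        AtomicInteger runs = new AtomicInteger();
        try {
            ScheduledFuture<?> future = scheduler.scheduleAtFixedRate(() -> {
                if (runs.incrementAndGet() == 2) {
                    throw new IllegalStateException("simulated unchecked failure");
                }
            }, 0, 20, TimeUnit.MILLISECONDS);
            try {
                // A healthy periodic task never completes; get() returns here
                // only because the task's own exception ended the schedule.
                future.get(5, TimeUnit.SECONDS);
            } catch (ExecutionException expected) {
                // The cause is the IllegalStateException thrown by the task.
            }
            // Wait for several more periods to show that nothing runs again.
            Thread.sleep(200);
            return runs.get();
        } catch (InterruptedException | TimeoutException e) {
            throw new IllegalStateException(e);
        } finally {
            scheduler.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("runs = " + countRunsBeforeSuppression()); // prints "runs = 2"
    }
}
```

Nothing is logged unless the caller inspects the returned future, which is why a failure of this kind can go unnoticed.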
The existing ContainerHeartBeatClient implementation, which accesses the
ApplicationMaster HTTP endpoint to get container liveness, handles only
IOException. Any unchecked exception thrown from that code path will silently
stop the periodic check in ContainerHeartbeatMonitor (this is the suspected
cause).
This requires further investigation.
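If the suspected cause is confirmed, one possible mitigation is to make sure no unchecked exception escapes the scheduled Runnable. The sketch below is only an illustration under that assumption and is not the actual Samza fix: GuardedScheduleDemo, guarded and survivesFailure are hypothetical names, and catching Throwable is a design choice that would need review.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

// Hypothetical demo class; not part of Samza.
class GuardedScheduleDemo {

    /** Wraps a task so that nothing it throws reaches the scheduler. */
    static Runnable guarded(Runnable task, Consumer<Throwable> onFailure) {
        return () -> {
            try {
                task.run();
            } catch (Throwable t) {
                // Assumption: the failure is reported here and the schedule continues.
                // A real monitor might instead treat it as a failed heartbeat.
                onFailure.accept(t);
            }
        };
    }

    /** Returns true if the task keeps running after throwing on its second run. */
    static boolean survivesFailure(int targetRuns) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        AtomicInteger runs = new AtomicInteger();
        AtomicInteger failures = new AtomicInteger();
        CountDownLatch reached = new CountDownLatch(targetRuns);
        try {
            scheduler.scheduleAtFixedRate(guarded(() -> {
                int n = runs.incrementAndGet();
                reached.countDown();
                if (n == 2) {
                    throw new IllegalStateException("simulated unchecked failure");
                }
            }, t -> failures.incrementAndGet()), 0, 20, TimeUnit.MILLISECONDS);
            // Only reachable in time if runs continue past the failing one.
            return reached.await(5, TimeUnit.SECONDS) && failures.get() == 1;
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        } finally {
            scheduler.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("survived = " + survivesFailure(5));
    }
}
```

The guard only helps if the onFailure callback itself does not throw. Whether a failed liveness check should be retried or should trigger container shutdown is a separate decision that this sketch does not address.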