Hmm, could there be something in your job holding up the container shutdown process? Perhaps something ignoring SIGTERM/Thread.interrupt, by chance?
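The classic culprit is a processing loop that catches InterruptedException and carries on. Below is a minimal illustrative sketch (the class and method names are made up, not Samza's actual task API) of a loop that honors interruption by re-asserting the interrupt flag, so the JVM can exit promptly when the NM's SIGTERM path interrupts it; swallowing the exception without that line is what keeps containers alive:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative only: a worker loop that exits cleanly on Thread.interrupt.
// If the catch block swallowed the exception without re-interrupting,
// the loop would run forever and the container would outlive SIGTERM.
public class ShutdownDemo {
    static final AtomicBoolean keepRunning = new AtomicBoolean(true);

    static void runLoop() {
        while (keepRunning.get() && !Thread.currentThread().isInterrupted()) {
            try {
                Thread.sleep(10); // stand-in for per-message processing work
            } catch (InterruptedException e) {
                // The fix: restore the interrupt flag so the loop
                // condition sees it and the thread falls out.
                Thread.currentThread().interrupt();
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Thread worker = new Thread(ShutdownDemo::runLoop);
        worker.start();
        Thread.sleep(50);
        worker.interrupt();   // roughly what the NM's SIGTERM handling amounts to
        worker.join(1000);    // returns quickly because the interrupt is honored
        System.out.println(worker.isAlive() ? "stuck" : "exited");
    }
}
```

Might be worth a thread dump (`jstack <pid>`) on one of the orphaned containers to see where its threads are parked.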
Also, I think there's a YARN property specifying the amount of time the NM waits between sending a SIGTERM and a SIGKILL, though I can't find it at the moment.

-Jake

On Wed, May 18, 2016 at 10:32 AM, David Yu <david...@optimizely.com> wrote:
> From the NM log, I'm seeing:
>
> 2016-05-18 06:29:06,248 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
> Cleaning up container container_e01_1463512986427_0007_01_000002
> 2016-05-18 06:29:06,265 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application:
> Application *application_1463512986427_0007* transitioned from RUNNING to
> FINISHING_CONTAINERS_WAIT
>
> (*Highlighted* is the particular Samza application.)
>
> The status never transitioned out of FINISHING_CONTAINERS_WAIT :(
>
> On Wed, May 18, 2016 at 10:21 AM, David Yu <david...@optimizely.com> wrote:
>
> > Jacob,
> >
> > I have checked and made sure that the NM is running on the node:
> >
> > $ ps aux | grep java
> > ...
> > yarn 25623 0.5 0.8 2366536 275488 ? Sl May17 7:04
> > /usr/java/jdk1.8.0_51/bin/java -Dproc_nodemanager
> > ... org.apache.hadoop.yarn.server.nodemanager.NodeManager
> >
> > Thanks,
> > David
> >
> > On Wed, May 18, 2016 at 7:08 AM, Jacob Maes <jacob.m...@gmail.com> wrote:
> >
> > > Hey David,
> > >
> > > The only time I've seen orphaned containers is when the NM dies. If the
> > > NM isn't running, the RM has no means to kill the containers on a node.
> > > Can you verify that the NM was healthy at the time of the shutdown?
> > >
> > > If it wasn't healthy and/or it was restarted, one option that may help
> > > is NM Recovery:
> > > https://hadoop.apache.org/docs/r2.7.2/hadoop-yarn/hadoop-yarn-site/NodeManagerRestart.html
> > >
> > > With NM Recovery, the NM will resume control over containers that were
> > > running when the NM shut down. This option has virtually eliminated
> > > orphaned containers in our clusters.
> > >
> > > -Jake
> > >
> > > On Tue, May 17, 2016 at 11:54 PM, David Yu <david...@optimizely.com>
> > > wrote:
> > >
> > > > Samza version = 0.10.0
> > > > YARN version = Hadoop 2.6.0-cdh5.4.9
> > > >
> > > > We are experiencing issues when killing a Samza job:
> > > >
> > > > $ yarn application -kill application_1463512986427_0007
> > > >
> > > > Killing application application_1463512986427_0007
> > > >
> > > > 16/05/18 06:29:05 INFO impl.YarnClientImpl: Killed application
> > > > application_1463512986427_0007
> > > >
> > > > The RM shows that the job is killed. However, the Samza containers
> > > > are still left running.
> > > >
> > > > Any idea why this is happening?
> > > >
> > > > Thanks,
> > > > David