Hmm, could there be something in your job holding up the container shutdown process? Perhaps something ignoring SIGTERM/Thread.interrupt, by chance?
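The classic culprit is a processing loop that catches InterruptedException and carries on. Below is a minimal illustrative sketch (the class and method names are made up, not Samza's actual task API) of a loop that honors interruption by re-asserting the interrupt flag, so the JVM can exit promptly when the NM's SIGTERM path interrupts it; swallowing the exception without that line is what keeps containers alive:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative only: a worker loop that exits cleanly on Thread.interrupt.
// If the catch block swallowed the exception without re-interrupting,
// the loop would run forever and the container would outlive SIGTERM.
public class ShutdownDemo {
    static final AtomicBoolean keepRunning = new AtomicBoolean(true);

    static void runLoop() {
        while (keepRunning.get() && !Thread.currentThread().isInterrupted()) {
            try {
                Thread.sleep(10); // stand-in for per-message processing work
            } catch (InterruptedException e) {
                // The fix: restore the interrupt flag so the loop
                // condition sees it and the thread falls out.
                Thread.currentThread().interrupt();
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Thread worker = new Thread(ShutdownDemo::runLoop);
        worker.start();
        Thread.sleep(50);
        worker.interrupt();   // roughly what the NM's SIGTERM handling amounts to
        worker.join(1000);    // returns quickly because the interrupt is honored
        System.out.println(worker.isAlive() ? "stuck" : "exited");
    }
}
```

Might be worth a thread dump (`jstack <pid>`) on one of the orphaned containers to see where its threads are parked.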
Also, I think there's a YARN property specifying the amount of time the NM waits between sending a SIGTERM and a SIGKILL, though I can't find it at the moment.

-Jake

On Wed, May 18, 2016 at 10:32 AM, David Yu <david...@optimizely.com> wrote:
> From the NM log, I'm seeing:
>
> 2016-05-18 06:29:06,248 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
> Cleaning up container container_e01_1463512986427_0007_01_000002
> 2016-05-18 06:29:06,265 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application:
> Application *application_1463512986427_0007* transitioned from RUNNING to
> FINISHING_CONTAINERS_WAIT
>
> (*Highlighted* is the particular Samza application.)
>
> The status never transitioned out of FINISHING_CONTAINERS_WAIT :(
>
> On Wed, May 18, 2016 at 10:21 AM, David Yu <david...@optimizely.com> wrote:
>
> > Jacob,
> >
> > I have checked and made sure that the NM is running on the node:
> >
> > $ ps aux | grep java
> > ...
> > yarn 25623 0.5 0.8 2366536 275488 ? Sl May17 7:04
> > /usr/java/jdk1.8.0_51/bin/java -Dproc_nodemanager
> > ... org.apache.hadoop.yarn.server.nodemanager.NodeManager
> >
> > Thanks,
> > David
> >
> > On Wed, May 18, 2016 at 7:08 AM, Jacob Maes <jacob.m...@gmail.com> wrote:
> >
> > > Hey David,
> > >
> > > The only time I've seen orphaned containers is when the NM dies. If the
> > > NM isn't running, the RM has no means to kill the containers on a node.
> > > Can you verify that the NM was healthy at the time of the shutdown?
> > >
> > > If it wasn't healthy and/or it was restarted, one option that may help
> > > is NM Recovery:
> > > https://hadoop.apache.org/docs/r2.7.2/hadoop-yarn/hadoop-yarn-site/NodeManagerRestart.html
> > >
> > > With NM Recovery, the NM will resume control over containers that were
> > > running when the NM shut down. This option has virtually eliminated
> > > orphaned containers in our clusters.
> > >
> > > -Jake
> > >
> > > On Tue, May 17, 2016 at 11:54 PM, David Yu <david...@optimizely.com>
> > > wrote:
> > >
> > > > Samza version = 0.10.0
> > > > YARN version = Hadoop 2.6.0-cdh5.4.9
> > > >
> > > > We are experiencing issues when killing a Samza job:
> > > >
> > > > $ yarn application -kill application_1463512986427_0007
> > > >
> > > > Killing application application_1463512986427_0007
> > > >
> > > > 16/05/18 06:29:05 INFO impl.YarnClientImpl: Killed application
> > > > application_1463512986427_0007
> > > >
> > > > The RM shows that the job is killed. However, the Samza containers
> > > > are still left running.
> > > >
> > > > Any idea why this is happening?
> > > >
> > > > Thanks,
> > > > David