Hi, David and all,

The "ultimate" solution is probably to implement SAMZA-871
<https://issues.apache.org/jira/browse/SAMZA-871>, which would allow the Samza
JobCoordinator to directly determine whether a container is alive, without
depending on the cluster management system. This is also being considered
together with SAMZA-881, where we are refactoring the JobCoordinator code to
allow standalone deployment of Samza jobs. Hope that sets the direction
right. If you have other concerns about this, please speak up in those JIRAs.

Thanks!

-Yi

On Thu, May 19, 2016 at 2:33 PM, David Yu <david...@optimizely.com> wrote:

> Just stumbled upon this post and it seems to be the same issue:
>
> https://issues.apache.org/jira/browse/SAMZA-498
>
>
> We followed the fix to create a wrapper kill script and everything works.
>
> Do we have a plan to fix this in the next version of Samza?
>
> Thanks,
> David
>
> On Wed, May 18, 2016 at 11:53 AM, Jacob Maes <jacob.m...@gmail.com> wrote:
>
> > Hmm, could there be something in your job holding up the container shutdown
> > process? Perhaps something ignoring SIGTERM/Thread.interrupt, by chance?
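The symptom is easy to reproduce outside Samza. This sketch (an illustration, not anything from the thread) shows that a process which traps and ignores SIGTERM survives a polite kill and only dies on SIGKILL, which is exactly what would stall a container cleanup:

```shell
# A child that installs an empty SIGTERM trap, so SIGTERM is ignored.
bash -c 'trap "" TERM; sleep 300' &
pid=$!
sleep 0.5                   # give the trap time to install

kill -TERM "$pid"           # ignored by the child
sleep 0.5
if kill -0 "$pid" 2>/dev/null; then
  echo "still alive after SIGTERM"
fi

kill -KILL "$pid"           # SIGKILL cannot be trapped; the child dies
wait "$pid" 2>/dev/null || true
```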
> >
> > Also, I think there's a YARN property specifying the amount of time the NM
> > waits between sending a SIGTERM and a SIGKILL, though I can't find it at
> > the moment.
> >
> > -Jake
> >
> > On Wed, May 18, 2016 at 10:32 AM, David Yu <david...@optimizely.com>
> > wrote:
> >
> > > From the NM log, I'm seeing:
> > >
> > > 2016-05-18 06:29:06,248 INFO
> > > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
> > > Cleaning up container container_e01_1463512986427_0007_01_000002
> > >
> > > 2016-05-18 06:29:06,265 INFO
> > > org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application:
> > > Application *application_1463512986427_0007* transitioned from RUNNING to
> > > FINISHING_CONTAINERS_WAIT
> > >
> > > (*Highlighted* is the particular Samza application.)
> > >
> > > The status never transitioned from FINISHING_CONTAINERS_WAIT :(
> > >
> > >
> > >
> > > On Wed, May 18, 2016 at 10:21 AM, David Yu <david...@optimizely.com>
> > > wrote:
> > >
> > > > Jacob,
> > > >
> > > > I have checked and made sure that NM is running on the node:
> > > >
> > > > $ ps aux | grep java
> > > > ...
> > > > yarn     25623  0.5  0.8 2366536 275488 ?      Sl   May17   7:04
> > > > /usr/java/jdk1.8.0_51/bin/java -Dproc_nodemanager
> > > >  ... org.apache.hadoop.yarn.server.nodemanager.NodeManager
> > > >
> > > >
> > > >
> > > > Thanks,
> > > > David
> > > >
> > > > On Wed, May 18, 2016 at 7:08 AM, Jacob Maes <jacob.m...@gmail.com>
> > > wrote:
> > > >
> > > >> Hey David,
> > > >>
> > > >> The only time I've seen orphaned containers is when the NM dies. If the
> > > >> NM isn't running, the RM has no means to kill the containers on a node.
> > > >> Can you verify that the NM was healthy at the time of the shutdown?
> > > >>
> > > >> If it wasn't healthy and/or it was restarted, one option that may help
> > > >> is NM Recovery:
> > > >>
> > > >> https://hadoop.apache.org/docs/r2.7.2/hadoop-yarn/hadoop-yarn-site/NodeManagerRestart.html
> > > >>
> > > >> With NM Recovery, the NM will resume control over containers that were
> > > >> running when the NM shut down. This option has virtually eliminated
> > > >> orphaned containers in our clusters.
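Per the NodeManager Restart documentation referenced in this thread, enabling recovery in `yarn-site.xml` looks roughly like the following; the recovery directory is illustrative, and the NM must bind a fixed port, since an ephemeral port changes across restarts and breaks recovery:

```xml
<!-- Enable NM work-preserving restart. -->
<property>
  <name>yarn.nodemanager.recovery.enabled</name>
  <value>true</value>
</property>
<!-- Local directory where the NM persists container state (path is
     illustrative). -->
<property>
  <name>yarn.nodemanager.recovery.dir</name>
  <value>/var/lib/hadoop-yarn/nm-recovery</value>
</property>
<!-- A fixed RPC port so containers can be reclaimed after restart. -->
<property>
  <name>yarn.nodemanager.address</name>
  <value>0.0.0.0:45454</value>
</property>
```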
> > > >>
> > > >> -Jake
> > > >>
> > > >> On Tue, May 17, 2016 at 11:54 PM, David Yu <david...@optimizely.com>
> > > >> wrote:
> > > >>
> > > >> > Samza version = 0.10.0
> > > >> > YARN version = Hadoop 2.6.0-cdh5.4.9
> > > >> >
> > > >> > We are experiencing issues when killing a Samza job:
> > > >> >
> > > >> > $ yarn application -kill application_1463512986427_0007
> > > >> >
> > > >> > Killing application application_1463512986427_0007
> > > >> >
> > > >> > 16/05/18 06:29:05 INFO impl.YarnClientImpl: Killed application
> > > >> > application_1463512986427_0007
> > > >> >
> > > >> > RM shows that the job is killed. However, the Samza containers are
> > > >> > still left running.
> > > >> >
> > > >> > Any idea why this is happening?
> > > >> >
> > > >> > Thanks,
> > > >> > David
> > > >> >
> > > >>
> > > >
> > > >
> > >
> >
>
