Hey Jack,

Be sure to check the AM log in addition to the container logs. The AM log
will show cases where the AM killed a container (e.g. for exceeding
yarn.container.memory.mb).

The only time I've ever seen a container die with nothing in the container
log, syserr, or AM log, it was being killed locally (i.e. someone or some
script was running "kill -9")

-Jake

On Fri, Jun 10, 2016 at 7:07 PM, Jack Huang <jackhu...@mz.com> wrote:

> Hi Jake,
>
> Thanks for the reply. It appears that my task fails too many times within
> the yarn.container.retry.window.ms
> ​
> window. Increasing yarn.container.retry.count​seems to solve the issue for
> the time being.
> It still remains unknown why my tasks are failing so often. There is no
> error messages in the log to help...
>
> Thanks,
> Jack
>
>
> On Fri, Jun 10, 2016 at 1:21 PM, Jacob Maes <jacob.m...@gmail.com> wrote:
>
> > Hey Jack,
> >
> > The port collisions are fixed in
> > https://issues.apache.org/jira/browse/SAMZA-932. However, they usually
> > don't occur frequently enough to cause the job to fail. Usually only a
> > couple containers fail with this exception but the job restarts them and
> > eventually runs smoothly.
> >
> > Your GC log looks normal. Don't be discouraged by the term "Allocation
> > Failure". It just means that Garbage Collection was invoked because a
> > memory allocation failed. That's the most common cause of GC in any Java
> > application.
> >
> > -Jake
> >
> > On Fri, Jun 10, 2016 at 10:55 AM, Jack Huang <jackhu...@mz.com> wrote:
> >
> > > Hi all,
> > >
> > > I am having difficulty determining the reason my Samza task is failing.
> > It
> > > generally failed within 10 minutes of start. When I examine the YARN
> log
> > I
> > > see the following exception on some but not all containers:
> > >
> > > java.rmi.server.ExportException: Port already in use: 40029; nested
> > > exception is:
> > >         java.net.BindException: Address already in use
> > >         at
> > sun.rmi.transport.tcp.TCPTransport.listen(TCPTransport.java:341)
> > >         at
> > > sun.rmi.transport.tcp.TCPTransport.exportObject(TCPTransport.java:249)
> > >         at
> > > sun.rmi.transport.tcp.TCPEndpoint.exportObject(TCPEndpoint.java:411)
> > >         at sun.rmi.transport.LiveRef.exportObject(LiveRef.java:147)
> > >         at
> > > sun.rmi.server.UnicastServerRef.exportObject(UnicastServerRef.java:208)
> > >         at
> > >
> >
> java.rmi.server.UnicastRemoteObject.exportObject(UnicastRemoteObject.java:383)
> > >         at
> > >
> >
> java.rmi.server.UnicastRemoteObject.exportObject(UnicastRemoteObject.java:346)
> > >         at
> > >
> >
> javax.management.remote.rmi.RMIJRMPServerImpl.export(RMIJRMPServerImpl.java:118)
> > >         at
> > >
> >
> javax.management.remote.rmi.RMIJRMPServerImpl.export(RMIJRMPServerImpl.java:95)
> > >         at
> > >
> >
> javax.management.remote.rmi.RMIConnectorServer.start(RMIConnectorServer.java:404)
> > >         at
> org.apache.samza.metrics.JmxServer.<init>(JmxServer.scala:89)
> > >         at
> org.apache.samza.metrics.JmxServer.<init>(JmxServer.scala:43)
> > >         at
> > >
> >
> org.apache.samza.container.SamzaContainer$$anonfun$main$2.apply(SamzaContainer.scala:66)
> > >         at
> > >
> >
> org.apache.samza.container.SamzaContainer$$anonfun$main$2.apply(SamzaContainer.scala:66)
> > >         at
> > >
> >
> org.apache.samza.container.SamzaContainer$.safeMain(SamzaContainer.scala:91)
> > >         at
> > >
> org.apache.samza.container.SamzaContainer$.main(SamzaContainer.scala:66)
> > >         at
> > > org.apache.samza.container.SamzaContainer.main(SamzaContainer.scala)
> > > Caused by: java.net.BindException: Address already in use
> > >         at java.net.PlainSocketImpl.socketBind(Native Method)
> > >         at
> > > java.net.AbstractPlainSocketImpl.bind(AbstractPlainSocketImpl.java:387)
> > >         at java.net.ServerSocket.bind(ServerSocket.java:375)
> > >         at java.net.ServerSocket.<init>(ServerSocket.java:237)
> > >         at java.net.ServerSocket.<init>(ServerSocket.java:128)
> > >         at
> > >
> >
> sun.rmi.transport.proxy.RMIDirectSocketFactory.createServerSocket(RMIDirectSocketFactory.java:45)
> > >         at
> > >
> >
> sun.rmi.transport.proxy.RMIMasterSocketFactory.createServerSocket(RMIMasterSocketFactory.java:345)
> > >         at
> > > sun.rmi.transport.tcp.TCPEndpoint.newServerSocket(TCPEndpoint.java:666)
> > >         at
> > sun.rmi.transport.tcp.TCPTransport.listen(TCPTransport.java:330)
> > >         ... 16 more
> > >
> > > ​
> > >
> > >
> > > Aside from this, I also seeing a lot of garbage collection messages,
> but
> > no
> > > OutOfMemoryError:
> > >
> > > 2016-06-10T09:52:48.335+0000: 1.150: [GC (System.gc())
> > > 115345K->9996K(1005056K), 0.0097432 secs]
> > > 2016-06-10T09:52:48.345+0000: 1.160: [Full GC (System.gc())
> > > 9996K->9414K(1005056K), 0.0291634 secs]
> > > 2016-06-10T09:52:50.032+0000: 2.846: [GC (Allocation Failure)
> > > 271558K->18186K(1005056K), 0.0094124 secs]
> > > 2016-06-10T09:52:50.592+0000: 3.406: [GC (Allocation Failure)
> > > 280330K->13998K(1005056K), 0.0029778 secs]
> > > 2016-06-10T09:52:51.036+0000: 3.850: [GC (Allocation Failure)
> > > 276142K->11966K(1005056K), 0.0029768 secs]
> > > 2016-06-10T09:52:51.437+0000: 4.252: [GC (Allocation Failure)
> > > 274110K->11942K(1005056K), 0.0033398 secs]
> > > 2016-06-10T09:52:53.367+0000: 6.182: [GC (Metadata GC Threshold)
> > > 114006K->14958K(1037312K), 0.0050709 secs]
> > > 2016-06-10T09:52:53.372+0000: 6.187: [Full GC (Metadata GC Threshold)
> > > 14958K->8364K(1037312K), 0.0251294 secs]
> > > 2016-06-10T09:52:54.438+0000: 7.252: [GC (Allocation Failure)
> > > 302764K->14676K(1005056K), 0.0073039 secs]
> > > 2016-06-10T09:52:54.952+0000: 7.766: [GC (Allocation Failure)
> > > 309076K->18076K(1034752K), 0.0105222 secs]
> > > 2016-06-10T09:52:55.471+0000: 8.285: [GC (Allocation Failure)
> > > 342684K->14556K(1036288K), 0.0067447 secs]
> > > 2016-06-10T09:52:55.970+0000: 8.784: [GC (Allocation Failure)
> > > 339164K->15956K(1037312K), 0.0038883 secs]
> > > 2016-06-10T09:52:56.473+0000: 9.287: [GC (Allocation Failure)
> > > 342612K->13940K(1037312K), 0.0034693 secs]
> > > 2016-06-10T09:52:56.958+0000: 9.773: [GC (Allocation Failure)
> > > 340596K->15608K(1037824K), 0.0049325 secs]
> > > 2016-06-10T09:52:57.452+0000: 10.266: [GC (Allocation Failure)
> > > 343288K->18659K(1037824K), 0.0155791 secs]
> > > 2016-06-10T09:52:58.000+0000: 10.814: [GC (Allocation Failure)
> > > 346339K->19508K(1036800K), 0.0154724 secs]
> > > 2016-06-10T09:52:58.528+0000: 11.342: [GC (Allocation Failure)
> > > 346164K->23116K(1037312K), 0.0033848 secs]
> > > 2016-06-10T09:52:58.999+0000: 11.813: [GC (Allocation Failure)
> > > 349772K->26455K(1038336K), 0.0081673 secs]
> > > 2016-06-10T09:52:59.488+0000: 12.302: [GC (Allocation Failure)
> > > 354647K->27086K(1037824K), 0.0046321 secs]
> > > 2016-06-10T09:52:59.937+0000: 12.751: [GC (Allocation Failure)
> > > 355278K->30694K(1038336K), 0.0032607 secs]
> > > 2016-06-10T09:53:00.375+0000: 13.189: [GC (Allocation Failure)
> > > 358886K->34214K(1037824K), 0.0053298 secs]
> > > 2016-06-10T09:53:00.819+0000: 13.634: [GC (Allocation Failure)
> > > 362406K->34860K(1038848K), 0.0064049 secs]
> > > 2016-06-10T09:53:01.261+0000: 14.075: [GC (Allocation Failure)
> > > 364588K->38492K(1038848K), 0.0053598 secs]
> > > 2016-06-10T09:53:01.708+0000: 14.522: [GC (Allocation Failure)
> > > 368220K->40955K(1039360K), 0.0054838 secs]
> > > 2016-06-10T09:53:02.156+0000: 14.971: [GC (Allocation Failure)
> > > 371707K->42472K(1039360K), 0.0097664 secs]
> > > 2016-06-10T09:53:02.602+0000: 15.416: [GC (Allocation Failure)
> > > 373224K->46177K(1040384K), 0.0057107 secs]
> > > 2016-06-10T09:53:03.052+0000: 15.866: [GC (Allocation Failure)
> > > 378465K->46599K(1039872K), 0.0063725 secs]
> > > 2016-06-10T09:53:03.497+0000: 16.311: [GC (Allocation Failure)
> > > 378887K->50183K(1040384K), 0.0095219 secs]
> > > 2016-06-10T09:53:03.945+0000: 16.759: [GC (Allocation Failure)
> > > 382983K->52824K(1040384K), 0.0044979 secs]
> > > 2016-06-10T09:53:04.392+0000: 17.206: [GC (Allocation Failure)
> > > 385624K->54387K(1040896K), 0.0053487 secs]
> > > 2016-06-10T09:53:04.841+0000: 17.656: [GC (Allocation Failure)
> > > 388211K->56947K(1040896K), 0.0025053 secs]
> > > 2016-06-10T09:53:05.303+0000: 18.117: [GC (Allocation Failure)
> > > 390771K->59701K(1041408K), 0.0054432 secs]
> > > 2016-06-10T09:53:05.757+0000: 18.571: [GC (Allocation Failure)
> > > 394037K->60303K(1040896K), 0.0053059 secs]
> > > 2016-06-10T09:53:06.208+0000: 19.022: [GC (Allocation Failure)
> > > 394639K->63759K(1041408K), 0.0024715 secs]
> > > 2016-06-10T09:53:06.661+0000: 19.475: [GC (Allocation Failure)
> > > 398607K->66585K(1041408K), 0.0039495 secs]
> > > 2016-06-10T09:53:07.109+0000: 19.924: [GC (Allocation Failure)
> > > 401433K->67278K(1041408K), 0.0056410 secs]
> > > 2016-06-10T09:53:07.563+0000: 20.377: [GC (Allocation Failure)
> > > 402126K->70798K(1041408K), 0.0031031 secs]
> > > 2016-06-10T09:53:08.038+0000: 20.853: [GC (Allocation Failure)
> > > 405646K->73652K(1041408K), 0.0057616 secs]
> > >
> > > ​
> > >
> > > Can anyone help?
> > >
> > >
> > > Regards,
> > > Jack
> > >
> >
>

Reply via email to