Hey Jack, Be sure to check the AM log in addition to the container logs. The AM log will show cases where the AM killed a container (e.g. for exceeding yarn.container.memory.mb).
The only time I've ever seen a container die with nothing in the container log, syserr, or AM log, it was being killed locally (i.e. someone or some script was running "kill -9") -Jake On Fri, Jun 10, 2016 at 7:07 PM, Jack Huang <jackhu...@mz.com> wrote: > Hi Jake, > > Thanks for the reply. It appears that my task fails too many times within > the yarn.container.retry.window.ms > > window. Increasing yarn.container.retry.countseems to solve the issue for > the time being. > It still remains unknown why my tasks are failing so often. There is no > error messages in the log to help... > > Thanks, > Jack > > > On Fri, Jun 10, 2016 at 1:21 PM, Jacob Maes <jacob.m...@gmail.com> wrote: > > > Hey Jack, > > > > The port collisions are fixed in > > https://issues.apache.org/jira/browse/SAMZA-932. However, they usually > > don't occur frequently enough to cause the job to fail. Usually only a > > couple containers fail with this exception but the job restarts them and > > eventually runs smoothly. > > > > Your GC log looks normal. Don't be discouraged by the term "Allocation > > Failure". It just means that Garbage Collection was invoked because a > > memory allocation failed. That's the most common cause of GC in any Java > > application. > > > > -Jake > > > > On Fri, Jun 10, 2016 at 10:55 AM, Jack Huang <jackhu...@mz.com> wrote: > > > > > Hi all, > > > > > > I am having difficulty determining the reason my Samza task is failing. > > It > > > generally failed within 10 minutes of start. When I examine the YARN > log > > I > > > see the following exception on some but not all containers: > > > > > > java.rmi.server.ExportException: Port already in use: 40029; nested > > > exception is: > > > java.net.BindException: Address already in use > > > at > > sun.rmi.transport.tcp.TCPTransport.listen(TCPTransport.java:341) > > > at > > > sun.rmi.transport.tcp.TCPTransport.exportObject(TCPTransport.java:249) > > > at > > > sun.rmi.transport.tcp.TCPEndpoint.exportObject(TCPEndpoint.java:411) > > > at sun.rmi.transport.LiveRef.exportObject(LiveRef.java:147) > > > at > > > sun.rmi.server.UnicastServerRef.exportObject(UnicastServerRef.java:208) > > > at > > > > > > java.rmi.server.UnicastRemoteObject.exportObject(UnicastRemoteObject.java:383) > > > at > > > > > > java.rmi.server.UnicastRemoteObject.exportObject(UnicastRemoteObject.java:346) > > > at > > > > > > javax.management.remote.rmi.RMIJRMPServerImpl.export(RMIJRMPServerImpl.java:118) > > > at > > > > > > javax.management.remote.rmi.RMIJRMPServerImpl.export(RMIJRMPServerImpl.java:95) > > > at > > > > > > javax.management.remote.rmi.RMIConnectorServer.start(RMIConnectorServer.java:404) > > > at > org.apache.samza.metrics.JmxServer.<init>(JmxServer.scala:89) > > > at > org.apache.samza.metrics.JmxServer.<init>(JmxServer.scala:43) > > > at > > > > > > org.apache.samza.container.SamzaContainer$$anonfun$main$2.apply(SamzaContainer.scala:66) > > > at > > > > > > org.apache.samza.container.SamzaContainer$$anonfun$main$2.apply(SamzaContainer.scala:66) > > > at > > > > > > org.apache.samza.container.SamzaContainer$.safeMain(SamzaContainer.scala:91) > > > at > > > > org.apache.samza.container.SamzaContainer$.main(SamzaContainer.scala:66) > > > at > > > org.apache.samza.container.SamzaContainer.main(SamzaContainer.scala) > > > Caused by: java.net.BindException: Address already in use > > > at java.net.PlainSocketImpl.socketBind(Native Method) > > > at > > > java.net.AbstractPlainSocketImpl.bind(AbstractPlainSocketImpl.java:387) > > > at java.net.ServerSocket.bind(ServerSocket.java:375) > > > at java.net.ServerSocket.<init>(ServerSocket.java:237) > > > at java.net.ServerSocket.<init>(ServerSocket.java:128) > > > at > > > > > > sun.rmi.transport.proxy.RMIDirectSocketFactory.createServerSocket(RMIDirectSocketFactory.java:45) > > > at > > > > > > sun.rmi.transport.proxy.RMIMasterSocketFactory.createServerSocket(RMIMasterSocketFactory.java:345) > > > at > > > sun.rmi.transport.tcp.TCPEndpoint.newServerSocket(TCPEndpoint.java:666) > > > at > > sun.rmi.transport.tcp.TCPTransport.listen(TCPTransport.java:330) > > > ... 16 more > > > > > > > > > > > > > > > Aside from this, I also seeing a lot of garbage collection messages, > but > > no > > > OutOfMemoryError: > > > > > > 2016-06-10T09:52:48.335+0000: 1.150: [GC (System.gc()) > > > 115345K->9996K(1005056K), 0.0097432 secs] > > > 2016-06-10T09:52:48.345+0000: 1.160: [Full GC (System.gc()) > > > 9996K->9414K(1005056K), 0.0291634 secs] > > > 2016-06-10T09:52:50.032+0000: 2.846: [GC (Allocation Failure) > > > 271558K->18186K(1005056K), 0.0094124 secs] > > > 2016-06-10T09:52:50.592+0000: 3.406: [GC (Allocation Failure) > > > 280330K->13998K(1005056K), 0.0029778 secs] > > > 2016-06-10T09:52:51.036+0000: 3.850: [GC (Allocation Failure) > > > 276142K->11966K(1005056K), 0.0029768 secs] > > > 2016-06-10T09:52:51.437+0000: 4.252: [GC (Allocation Failure) > > > 274110K->11942K(1005056K), 0.0033398 secs] > > > 2016-06-10T09:52:53.367+0000: 6.182: [GC (Metadata GC Threshold) > > > 114006K->14958K(1037312K), 0.0050709 secs] > > > 2016-06-10T09:52:53.372+0000: 6.187: [Full GC (Metadata GC Threshold) > > > 14958K->8364K(1037312K), 0.0251294 secs] > > > 2016-06-10T09:52:54.438+0000: 7.252: [GC (Allocation Failure) > > > 302764K->14676K(1005056K), 0.0073039 secs] > > > 2016-06-10T09:52:54.952+0000: 7.766: [GC (Allocation Failure) > > > 309076K->18076K(1034752K), 0.0105222 secs] > > > 2016-06-10T09:52:55.471+0000: 8.285: [GC (Allocation Failure) > > > 342684K->14556K(1036288K), 0.0067447 secs] > > > 2016-06-10T09:52:55.970+0000: 8.784: [GC (Allocation Failure) > > > 339164K->15956K(1037312K), 0.0038883 secs] > > > 2016-06-10T09:52:56.473+0000: 9.287: [GC (Allocation Failure) > > > 342612K->13940K(1037312K), 0.0034693 secs] > > > 2016-06-10T09:52:56.958+0000: 9.773: [GC (Allocation Failure) > > > 340596K->15608K(1037824K), 0.0049325 secs] > > > 2016-06-10T09:52:57.452+0000: 10.266: [GC (Allocation Failure) > > > 343288K->18659K(1037824K), 0.0155791 secs] > > > 2016-06-10T09:52:58.000+0000: 10.814: [GC (Allocation Failure) > > > 346339K->19508K(1036800K), 0.0154724 secs] > > > 2016-06-10T09:52:58.528+0000: 11.342: [GC (Allocation Failure) > > > 346164K->23116K(1037312K), 0.0033848 secs] > > > 2016-06-10T09:52:58.999+0000: 11.813: [GC (Allocation Failure) > > > 349772K->26455K(1038336K), 0.0081673 secs] > > > 2016-06-10T09:52:59.488+0000: 12.302: [GC (Allocation Failure) > > > 354647K->27086K(1037824K), 0.0046321 secs] > > > 2016-06-10T09:52:59.937+0000: 12.751: [GC (Allocation Failure) > > > 355278K->30694K(1038336K), 0.0032607 secs] > > > 2016-06-10T09:53:00.375+0000: 13.189: [GC (Allocation Failure) > > > 358886K->34214K(1037824K), 0.0053298 secs] > > > 2016-06-10T09:53:00.819+0000: 13.634: [GC (Allocation Failure) > > > 362406K->34860K(1038848K), 0.0064049 secs] > > > 2016-06-10T09:53:01.261+0000: 14.075: [GC (Allocation Failure) > > > 364588K->38492K(1038848K), 0.0053598 secs] > > > 2016-06-10T09:53:01.708+0000: 14.522: [GC (Allocation Failure) > > > 368220K->40955K(1039360K), 0.0054838 secs] > > > 2016-06-10T09:53:02.156+0000: 14.971: [GC (Allocation Failure) > > > 371707K->42472K(1039360K), 0.0097664 secs] > > > 2016-06-10T09:53:02.602+0000: 15.416: [GC (Allocation Failure) > > > 373224K->46177K(1040384K), 0.0057107 secs] > > > 2016-06-10T09:53:03.052+0000: 15.866: [GC (Allocation Failure) > > > 378465K->46599K(1039872K), 0.0063725 secs] > > > 2016-06-10T09:53:03.497+0000: 16.311: [GC (Allocation Failure) > > > 378887K->50183K(1040384K), 0.0095219 secs] > > > 2016-06-10T09:53:03.945+0000: 16.759: [GC (Allocation Failure) > > > 382983K->52824K(1040384K), 0.0044979 secs] > > > 2016-06-10T09:53:04.392+0000: 17.206: [GC (Allocation Failure) > > > 385624K->54387K(1040896K), 0.0053487 secs] > > > 2016-06-10T09:53:04.841+0000: 17.656: [GC (Allocation Failure) > > > 388211K->56947K(1040896K), 0.0025053 secs] > > > 2016-06-10T09:53:05.303+0000: 18.117: [GC (Allocation Failure) > > > 390771K->59701K(1041408K), 0.0054432 secs] > > > 2016-06-10T09:53:05.757+0000: 18.571: [GC (Allocation Failure) > > > 394037K->60303K(1040896K), 0.0053059 secs] > > > 2016-06-10T09:53:06.208+0000: 19.022: [GC (Allocation Failure) > > > 394639K->63759K(1041408K), 0.0024715 secs] > > > 2016-06-10T09:53:06.661+0000: 19.475: [GC (Allocation Failure) > > > 398607K->66585K(1041408K), 0.0039495 secs] > > > 2016-06-10T09:53:07.109+0000: 19.924: [GC (Allocation Failure) > > > 401433K->67278K(1041408K), 0.0056410 secs] > > > 2016-06-10T09:53:07.563+0000: 20.377: [GC (Allocation Failure) > > > 402126K->70798K(1041408K), 0.0031031 secs] > > > 2016-06-10T09:53:08.038+0000: 20.853: [GC (Allocation Failure) > > > 405646K->73652K(1041408K), 0.0057616 secs] > > > > > > > > > > > > Can anyone help? > > > > > > > > > Regards, > > > Jack > > > > > >