Hey Jack,

The port collisions are fixed in
https://issues.apache.org/jira/browse/SAMZA-932. However, they usually
don't occur frequently enough to cause the job to fail. Usually only a
couple containers fail with this exception but the job restarts them and
eventually runs smoothly.

Your GC log looks normal. Don't be discouraged by the term "Allocation
Failure". It just means that Garbage Collection was invoked because a
memory allocation failed. That's the most common cause of GC in any Java
application.

-Jake

On Fri, Jun 10, 2016 at 10:55 AM, Jack Huang <jackhu...@mz.com> wrote:

> Hi all,
>
> I am having difficulty determining the reason my Samza task is failing. It
> generally failed within 10 minutes of start. When I examine the YARN log I
> see the following exception on some but not all containers:
>
> java.rmi.server.ExportException: Port already in use: 40029; nested
> exception is:
>         java.net.BindException: Address already in use
>         at sun.rmi.transport.tcp.TCPTransport.listen(TCPTransport.java:341)
>         at
> sun.rmi.transport.tcp.TCPTransport.exportObject(TCPTransport.java:249)
>         at
> sun.rmi.transport.tcp.TCPEndpoint.exportObject(TCPEndpoint.java:411)
>         at sun.rmi.transport.LiveRef.exportObject(LiveRef.java:147)
>         at
> sun.rmi.server.UnicastServerRef.exportObject(UnicastServerRef.java:208)
>         at
> java.rmi.server.UnicastRemoteObject.exportObject(UnicastRemoteObject.java:383)
>         at
> java.rmi.server.UnicastRemoteObject.exportObject(UnicastRemoteObject.java:346)
>         at
> javax.management.remote.rmi.RMIJRMPServerImpl.export(RMIJRMPServerImpl.java:118)
>         at
> javax.management.remote.rmi.RMIJRMPServerImpl.export(RMIJRMPServerImpl.java:95)
>         at
> javax.management.remote.rmi.RMIConnectorServer.start(RMIConnectorServer.java:404)
>         at org.apache.samza.metrics.JmxServer.<init>(JmxServer.scala:89)
>         at org.apache.samza.metrics.JmxServer.<init>(JmxServer.scala:43)
>         at
> org.apache.samza.container.SamzaContainer$$anonfun$main$2.apply(SamzaContainer.scala:66)
>         at
> org.apache.samza.container.SamzaContainer$$anonfun$main$2.apply(SamzaContainer.scala:66)
>         at
> org.apache.samza.container.SamzaContainer$.safeMain(SamzaContainer.scala:91)
>         at
> org.apache.samza.container.SamzaContainer$.main(SamzaContainer.scala:66)
>         at
> org.apache.samza.container.SamzaContainer.main(SamzaContainer.scala)
> Caused by: java.net.BindException: Address already in use
>         at java.net.PlainSocketImpl.socketBind(Native Method)
>         at
> java.net.AbstractPlainSocketImpl.bind(AbstractPlainSocketImpl.java:387)
>         at java.net.ServerSocket.bind(ServerSocket.java:375)
>         at java.net.ServerSocket.<init>(ServerSocket.java:237)
>         at java.net.ServerSocket.<init>(ServerSocket.java:128)
>         at
> sun.rmi.transport.proxy.RMIDirectSocketFactory.createServerSocket(RMIDirectSocketFactory.java:45)
>         at
> sun.rmi.transport.proxy.RMIMasterSocketFactory.createServerSocket(RMIMasterSocketFactory.java:345)
>         at
> sun.rmi.transport.tcp.TCPEndpoint.newServerSocket(TCPEndpoint.java:666)
>         at sun.rmi.transport.tcp.TCPTransport.listen(TCPTransport.java:330)
>         ... 16 more
>
> ​
>
>
> Aside from this, I also seeing a lot of garbage collection messages, but no
> OutOfMemoryError:
>
> 2016-06-10T09:52:48.335+0000: 1.150: [GC (System.gc())
> 115345K->9996K(1005056K), 0.0097432 secs]
> 2016-06-10T09:52:48.345+0000: 1.160: [Full GC (System.gc())
> 9996K->9414K(1005056K), 0.0291634 secs]
> 2016-06-10T09:52:50.032+0000: 2.846: [GC (Allocation Failure)
> 271558K->18186K(1005056K), 0.0094124 secs]
> 2016-06-10T09:52:50.592+0000: 3.406: [GC (Allocation Failure)
> 280330K->13998K(1005056K), 0.0029778 secs]
> 2016-06-10T09:52:51.036+0000: 3.850: [GC (Allocation Failure)
> 276142K->11966K(1005056K), 0.0029768 secs]
> 2016-06-10T09:52:51.437+0000: 4.252: [GC (Allocation Failure)
> 274110K->11942K(1005056K), 0.0033398 secs]
> 2016-06-10T09:52:53.367+0000: 6.182: [GC (Metadata GC Threshold)
> 114006K->14958K(1037312K), 0.0050709 secs]
> 2016-06-10T09:52:53.372+0000: 6.187: [Full GC (Metadata GC Threshold)
> 14958K->8364K(1037312K), 0.0251294 secs]
> 2016-06-10T09:52:54.438+0000: 7.252: [GC (Allocation Failure)
> 302764K->14676K(1005056K), 0.0073039 secs]
> 2016-06-10T09:52:54.952+0000: 7.766: [GC (Allocation Failure)
> 309076K->18076K(1034752K), 0.0105222 secs]
> 2016-06-10T09:52:55.471+0000: 8.285: [GC (Allocation Failure)
> 342684K->14556K(1036288K), 0.0067447 secs]
> 2016-06-10T09:52:55.970+0000: 8.784: [GC (Allocation Failure)
> 339164K->15956K(1037312K), 0.0038883 secs]
> 2016-06-10T09:52:56.473+0000: 9.287: [GC (Allocation Failure)
> 342612K->13940K(1037312K), 0.0034693 secs]
> 2016-06-10T09:52:56.958+0000: 9.773: [GC (Allocation Failure)
> 340596K->15608K(1037824K), 0.0049325 secs]
> 2016-06-10T09:52:57.452+0000: 10.266: [GC (Allocation Failure)
> 343288K->18659K(1037824K), 0.0155791 secs]
> 2016-06-10T09:52:58.000+0000: 10.814: [GC (Allocation Failure)
> 346339K->19508K(1036800K), 0.0154724 secs]
> 2016-06-10T09:52:58.528+0000: 11.342: [GC (Allocation Failure)
> 346164K->23116K(1037312K), 0.0033848 secs]
> 2016-06-10T09:52:58.999+0000: 11.813: [GC (Allocation Failure)
> 349772K->26455K(1038336K), 0.0081673 secs]
> 2016-06-10T09:52:59.488+0000: 12.302: [GC (Allocation Failure)
> 354647K->27086K(1037824K), 0.0046321 secs]
> 2016-06-10T09:52:59.937+0000: 12.751: [GC (Allocation Failure)
> 355278K->30694K(1038336K), 0.0032607 secs]
> 2016-06-10T09:53:00.375+0000: 13.189: [GC (Allocation Failure)
> 358886K->34214K(1037824K), 0.0053298 secs]
> 2016-06-10T09:53:00.819+0000: 13.634: [GC (Allocation Failure)
> 362406K->34860K(1038848K), 0.0064049 secs]
> 2016-06-10T09:53:01.261+0000: 14.075: [GC (Allocation Failure)
> 364588K->38492K(1038848K), 0.0053598 secs]
> 2016-06-10T09:53:01.708+0000: 14.522: [GC (Allocation Failure)
> 368220K->40955K(1039360K), 0.0054838 secs]
> 2016-06-10T09:53:02.156+0000: 14.971: [GC (Allocation Failure)
> 371707K->42472K(1039360K), 0.0097664 secs]
> 2016-06-10T09:53:02.602+0000: 15.416: [GC (Allocation Failure)
> 373224K->46177K(1040384K), 0.0057107 secs]
> 2016-06-10T09:53:03.052+0000: 15.866: [GC (Allocation Failure)
> 378465K->46599K(1039872K), 0.0063725 secs]
> 2016-06-10T09:53:03.497+0000: 16.311: [GC (Allocation Failure)
> 378887K->50183K(1040384K), 0.0095219 secs]
> 2016-06-10T09:53:03.945+0000: 16.759: [GC (Allocation Failure)
> 382983K->52824K(1040384K), 0.0044979 secs]
> 2016-06-10T09:53:04.392+0000: 17.206: [GC (Allocation Failure)
> 385624K->54387K(1040896K), 0.0053487 secs]
> 2016-06-10T09:53:04.841+0000: 17.656: [GC (Allocation Failure)
> 388211K->56947K(1040896K), 0.0025053 secs]
> 2016-06-10T09:53:05.303+0000: 18.117: [GC (Allocation Failure)
> 390771K->59701K(1041408K), 0.0054432 secs]
> 2016-06-10T09:53:05.757+0000: 18.571: [GC (Allocation Failure)
> 394037K->60303K(1040896K), 0.0053059 secs]
> 2016-06-10T09:53:06.208+0000: 19.022: [GC (Allocation Failure)
> 394639K->63759K(1041408K), 0.0024715 secs]
> 2016-06-10T09:53:06.661+0000: 19.475: [GC (Allocation Failure)
> 398607K->66585K(1041408K), 0.0039495 secs]
> 2016-06-10T09:53:07.109+0000: 19.924: [GC (Allocation Failure)
> 401433K->67278K(1041408K), 0.0056410 secs]
> 2016-06-10T09:53:07.563+0000: 20.377: [GC (Allocation Failure)
> 402126K->70798K(1041408K), 0.0031031 secs]
> 2016-06-10T09:53:08.038+0000: 20.853: [GC (Allocation Failure)
> 405646K->73652K(1041408K), 0.0057616 secs]
>
> ​
>
> Can anyone help?
>
>
> Regards,
> Jack
>

Reply via email to