Hi,

My Giraph job works fine on a smaller number of nodes, but when I try to run it on a 128-node cluster I get the following error. It seems that the failure of a single worker causes the entire job to fail. I have attached the error messages from the master log and the failed worker's log. Any help is appreciated.

[MASTER LOG]
2014-11-15 23:01:45,305 INFO org.apache.giraph.worker.BspServiceWorker: finishSuperstep: (waiting for rest of workers) ALL_EXCEPT_ZOOKEEPER - Attempt=0, Superstep=59
2014-11-15 23:01:46,169 FATAL org.apache.giraph.graph.GraphMapper: uncaughtException: OverrideExceptionHandler on thread org.apache.giraph.master.MasterThread, msg = unable to create new native thread, exiting...
java.lang.OutOfMemoryError: unable to create new native thread
    at java.lang.Thread.start0(Native Method)
    at java.lang.Thread.start(Thread.java:691)
    at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:943)
    at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1336)
    at java.lang.UNIXProcess.initStreams(UNIXProcess.java:172)
    at java.lang.UNIXProcess$2.run(UNIXProcess.java:145)
    at java.lang.UNIXProcess$2.run(UNIXProcess.java:143)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.lang.UNIXProcess.<init>(UNIXProcess.java:143)
    at java.lang.ProcessImpl.start(ProcessImpl.java:130)
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:1021)
    at java.lang.Runtime.exec(Runtime.java:615)
    at java.lang.Runtime.exec(Runtime.java:448)
    at java.lang.Runtime.exec(Runtime.java:345)
    at pga.MasterVertex.compute(MasterVertex.java:242)
    at org.apache.giraph.master.BspServiceMaster.doMasterCompute(BspServiceMaster.java:1691)
    at org.apache.giraph.master.BspServiceMaster.coordinateSuperstep(BspServiceMaster.java:1627)
    at org.apache.giraph.master.MasterThread.run(MasterThread.java:115)

[FAILED WORKER LOG]
2014-11-15 23:11:46,281 WARN org.apache.giraph.comm.netty.NettyServer: start: Likely failed to bind on attempt 0 to port 30007
org.jboss.netty.channel.ChannelException: Failed to bind to: qb114/208.100.93.114:30007
    at org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:298)
    at org.apache.giraph.comm.netty.NettyServer.start(NettyServer.java:326)
    at org.apache.giraph.comm.netty.NettyMasterServer.<init>(NettyMasterServer.java:49)
    at org.apache.giraph.master.BspServiceMaster.becomeMaster(BspServiceMaster.java:877)
    at org.apache.giraph.master.MasterThread.run(MasterThread.java:98)
Caused by: java.net.BindException: Address already in use
    at sun.nio.ch.Net.bind0(Native Method)
    at sun.nio.ch.Net.bind(Net.java:344)
    at sun.nio.ch.Net.bind(Net.java:336)
    at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:199)
    at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
    at org.jboss.netty.channel.socket.nio.NioServerSocketPipelineSink.bind(NioServerSocketPipelineSink.java:138)
    at org.jboss.netty.channel.socket.nio.NioServerSocketPipelineSink.handleServerSocket(NioServerSocketPipelineSink.java:90)
    at org.jboss.netty.channel.socket.nio.NioServerSocketPipelineSink.eventSunk(NioServerSocketPipelineSink.java:64)
    at org.jboss.netty.channel.Channels.bind(Channels.java:569)
    at org.jboss.netty.channel.AbstractChannel.bind(AbstractChannel.java:187)
    at org.jboss.netty.bootstrap.ServerBootstrap$Binder.channelOpen(ServerBootstrap.java:343)
    at org.jboss.netty.channel.Channels.fireChannelOpen(Channels.java:170)
    at org.jboss.netty.channel.socket.nio.NioServerSocketChannel.<init>(NioServerSocketChannel.java:80)
    at org.jboss.netty.channel.socket.nio.NioServerSocketChannelFactory.newChannel(NioServerSocketChannelFactory.java:158)
    at org.jboss.netty.channel.socket.nio.NioServerSocketChannelFactory.newChannel(NioServerSocketChannelFactory.java:86)
    at org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:277)
    ... 4 more
2014-11-15 23:11:46,305 INFO org.apache.giraph.comm.netty.NettyServer: start: Started server communication server: qb114/208.100.93.114:31007 with up to 16 threads on bind attempt 1 with sendBufferSize = 32768 receiveBufferSize = 524288 backlog = 874
2014-11-15 23:11:46,325 INFO org.apache.giraph.comm.netty.NettyClient: NettyClient: Using execution handler with 8 threads after requestEncoder.
2014-11-15 23:11:46,325 INFO org.apache.giraph.master.BspServiceMaster: becomeMaster: I am now the master!
2014-11-15 23:11:46,326 INFO org.apache.giraph.master.BspServiceMaster: /_hadoopBsp/job_201411152123_0003/_vertexInputSplitDir already exists, no need to create
2014-11-15 23:11:46,326 ERROR org.apache.giraph.master.MasterThread: masterThread: Master algorithm failed with NullPointerException
java.lang.NullPointerException
    at java.lang.String.<init>(String.java:505)
    at org.apache.giraph.master.BspServiceMaster.createInputSplits(BspServiceMaster.java:600)
    at org.apache.giraph.master.BspServiceMaster.createVertexInputSplits(BspServiceMaster.java:696)
    at org.apache.giraph.master.MasterThread.run(MasterThread.java:100)
2014-11-15 23:11:46,327 FATAL org.apache.giraph.graph.GraphMapper: uncaughtException: OverrideExceptionHandler on thread org.apache.giraph.master.MasterThread, msg = java.lang.NullPointerException, exiting...
java.lang.IllegalStateException: java.lang.NullPointerException
    at org.apache.giraph.master.MasterThread.run(MasterThread.java:185)
Caused by: java.lang.NullPointerException
    at java.lang.String.<init>(String.java:505)
    at org.apache.giraph.master.BspServiceMaster.createInputSplits(BspServiceMaster.java:600)
    at org.apache.giraph.master.BspServiceMaster.createVertexInputSplits(BspServiceMaster.java:696)
    at org.apache.giraph.master.MasterThread.run(MasterThread.java:100)

--
Thanks and regards,
Arghya Kusum Das
(225-362-4031)
