[ https://issues.apache.org/jira/browse/HAMA-890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
lujing.zui updated HAMA-890:
----------------------------
    Description:
I built a cluster that contains 4 groomservers.
I ran a Pipes application (matrixmultiplication), and on one groomserver a problem occurs when connecting to ZooKeeperSyncClient, so the entire job fails.
On the other groomservers, everything is fine.
I rebooted the problematic node, but that did not solve the problem.
As I understand it, both sides of this connection are on the same node, so a connection accept timeout seems impossible. iptables is off and the network is normal; pinging every node works.
I am so confused. Can anyone help me or give me a hint or suggestion?
Thanks so much!

The log is listed below:

14/03/15 16:21:05 INFO ipc.Server: Starting Socket Reader #1 for port 61002
14/03/15 16:21:05 INFO ipc.Server: IPC Server Responder: starting
14/03/15 16:21:05 INFO ipc.Server: IPC Server listener on 61002: starting
14/03/15 16:21:05 INFO ipc.Server: IPC Server handler 0 on 61002: starting
14/03/15 16:21:05 INFO ipc.Server: IPC Server handler 2 on 61002: starting
14/03/15 16:21:05 INFO ipc.Server: IPC Server handler 1 on 61002: starting
14/03/15 16:21:05 INFO ipc.Server: IPC Server handler 3 on 61002: starting
14/03/15 16:21:05 INFO message.HamaMessageManagerImpl: BSPPeer address:hd1.hadoop.lab port:61002
14/03/15 16:21:05 INFO ipc.Server: IPC Server handler 4 on 61002: starting
14/03/15 16:21:05 INFO Configuration.deprecation: mapred.cache.localFiles is deprecated. Instead, use mapreduce.job.cache.local.files
14/03/15 16:21:05 INFO sync.ZKSyncClient: Initializing ZK Sync Client
14/03/15 16:21:05 INFO sync.ZooKeeperSyncClientImpl: Start connecting to Zookeeper! At hd1.hadoop.lab/222.195.92.69:61002
14/03/15 16:21:08 ERROR bsp.BSPTask: Error running bsp setup and bsp function.
java.net.SocketTimeoutException: Accept timed out
    at java.net.PlainSocketImpl.socketAccept(Native Method)
    at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:375)
    at java.net.ServerSocket.implAccept(ServerSocket.java:478)
    at java.net.ServerSocket.accept(ServerSocket.java:446)
    at org.apache.hama.pipes.PipesApplication.start(PipesApplication.java:286)
    at org.apache.hama.pipes.PipesBSP.setup(PipesBSP.java:43)
    at org.apache.hama.bsp.BSPTask.runBSP(BSPTask.java:170)
    at org.apache.hama.bsp.BSPTask.run(BSPTask.java:144)
    at org.apache.hama.bsp.GroomServer$BSPPeerChild.main(GroomServer.java:1243)
14/03/15 16:21:08 ERROR bsp.BSPTask: Error cleaning up after bsp executed.
java.lang.NullPointerException
    at org.apache.hama.pipes.PipesBSP.cleanup(PipesBSP.java:95)
    at org.apache.hama.bsp.BSPTask.runBSP(BSPTask.java:177)
    at org.apache.hama.bsp.BSPTask.run(BSPTask.java:144)
    at org.apache.hama.bsp.GroomServer$BSPPeerChild.main(GroomServer.java:1243)
14/03/15 16:21:08 INFO ipc.Server: Stopping server on 61002
14/03/15 16:21:08 INFO ipc.Server: IPC Server handler 0 on 61002: exiting
14/03/15 16:21:08 INFO ipc.Server: IPC Server handler 2 on 61002: exiting
14/03/15 16:21:08 INFO ipc.Server: Stopping IPC Server listener on 61002
14/03/15 16:21:08 INFO ipc.Server: IPC Server handler 3 on 61002: exiting
14/03/15 16:21:08 INFO ipc.Server: IPC Server handler 4 on 61002: exiting
14/03/15 16:21:08 INFO ipc.Server: IPC Server handler 1 on 61002: exiting
14/03/15 16:21:08 INFO ipc.Server: Stopping IPC Server Responder
14/03/15 16:21:08 ERROR bsp.BSPTask: Shutting down ping service.
14/03/15 16:21:08 FATAL bsp.GroomServer: Error running child
java.net.SocketTimeoutException: Accept timed out
    at java.net.PlainSocketImpl.socketAccept(Native Method)
    at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:375)
    at java.net.ServerSocket.implAccept(ServerSocket.java:478)
    at java.net.ServerSocket.accept(ServerSocket.java:446)
    at org.apache.hama.pipes.PipesApplication.start(PipesApplication.java:286)
    at org.apache.hama.pipes.PipesBSP.setup(PipesBSP.java:43)
    at org.apache.hama.bsp.BSPTask.runBSP(BSPTask.java:170)
    at org.apache.hama.bsp.BSPTask.run(BSPTask.java:144)
    at org.apache.hama.bsp.GroomServer$BSPPeerChild.main(GroomServer.java:1243)
java.net.SocketTimeoutException: Accept timed out
    at java.net.PlainSocketImpl.socketAccept(Native Method)
    at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:375)
    at java.net.ServerSocket.implAccept(ServerSocket.java:478)
    at java.net.ServerSocket.accept(ServerSocket.java:446)
    at org.apache.hama.pipes.PipesApplication.start(PipesApplication.java:286)
    at org.apache.hama.pipes.PipesBSP.setup(PipesBSP.java:43)
    at org.apache.hama.bsp.BSPTask.runBSP(BSPTask.java:170)
    at org.apache.hama.bsp.BSPTask.run(BSPTask.java:144)
    at org.apache.hama.bsp.GroomServer$BSPPeerChild.main(GroomServer.java:1243)

> PipesApplication connect to ZooKeeperSyncClientImpl always timeout
> ------------------------------------------------------------------
>
>                 Key: HAMA-890
>                 URL: https://issues.apache.org/jira/browse/HAMA-890
>             Project: Hama
>          Issue Type: Bug
>    Affects Versions: 0.7.0
>         Environment: Hadoop 2.2.0 distributed mode
>            Reporter: lujing.zui

--
This message was sent by Atlassian JIRA (v6.2#6252)
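
A note on the "accept timeout seems impossible on one node" reasoning: the trace shows the exception being raised from `ServerSocket.accept` inside `PipesApplication.start`. An accept timeout only means that nothing connected to the listening socket before the timeout expired, so it can happen on a single host with a healthy network. The sketch below is a minimal standalone illustration of that behaviour in plain Java. It is not Hama code: the class name `AcceptTimeoutDemo`, the method `acceptTimesOut`, and the 500 ms timeout are hypothetical choices made for the example.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.SocketTimeoutException;

public class AcceptTimeoutDemo {

    /**
     * Opens a listening socket on the loopback interface and waits for a
     * client. Returns true if accept() timed out because no client connected
     * within timeoutMillis, false if a client did connect.
     */
    static boolean acceptTimesOut(int timeoutMillis) throws IOException {
        // Port 0 asks the OS for any free port; binding to loopback keeps
        // the external network out of the picture entirely.
        try (ServerSocket server =
                 new ServerSocket(0, 1, InetAddress.getLoopbackAddress())) {
            // With a non-zero SO_TIMEOUT, accept() blocks at most this long.
            server.setSoTimeout(timeoutMillis);
            try {
                server.accept().close();
                return false;
            } catch (SocketTimeoutException e) {
                // Nobody connected in time: same exception type as in the log.
                return true;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Timeout value is arbitrary and chosen only for the demonstration.
        System.out.println("accept timed out: " + acceptTimesOut(500));
    }
}
```

If the same pattern applies here, the useful question is why the process expected to connect back to the listening socket never did so within the timeout, rather than whether the network between nodes is reachable.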