[ 
https://issues.apache.org/jira/browse/GIRAPH-154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhiwei Gu updated GIRAPH-154:
-----------------------------

    Attachment: GIRAPH-154.patch

Passed unit tests and grid tests.
                
> Worker ports are not synched properly with its peers
> ----------------------------------------------------
>
>                 Key: GIRAPH-154
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-154
>             Project: Giraph
>          Issue Type: Bug
>          Components: bsp
>    Affects Versions: 0.2.0
>            Reporter: Zhiwei Gu
>            Assignee: Zhiwei Gu
>         Attachments: GIRAPH-154.patch
>
>
> When a worker tries multiple ports to set up the RPC server, the port it 
> finally binds to is not synced properly with its peer workers, so the peers 
> send messages to the default port instead.
> Here are some logs:
> ############################################################################
> Base port: 34900
> ############################################################################
> ############################################################################
> log for worker 161:
> ############################################################################
> IPC Server handler 98 on 36061: starting
> BasicRPCCommunications: Started RPC communication server: 
> gsta32085.tan.ygrid.yahoo.com/10.216.148.47:36061 with 100 handlers and 199 
> flush threads on bind attempt 1
> IPC Server handler 99 on 36061: starting
> setup: Registering health of this worker...
> getJobState: Job state already exists 
> (/_hadoopBsp/job_201203130609_14838/_masterJobState)
> getApplicationAttempt: Node 
> /_hadoopBsp/job_201203130609_14838/_applicationAttemptsDir already exists!
> getApplicationAttempt: Node 
> /_hadoopBsp/job_201203130609_14838/_applicationAttemptsDir already exists!
> registerHealth: Created my health node for attempt=0, superstep=-1 with 
> /_hadoopBsp/job_201203130609_14838/_applicationAttemptsDir/0/_superstepDir/-1/_workerHealthyDir/gsta32085.tan.ygrid.yahoo.com_161
>  and workerInfo= Worker(hostname=gsta32085.tan.ygrid.yahoo.com, 
> MRpartition=161, port=35061)
> process: partitionAssignmentsReadyChanged (partitions are assigned)
> startSuperstep: Ready for computation on superstep -1 since worker selection 
> and vertex range assignments are done in 
> /_hadoopBsp/job_201203130609_14838/_applicationAttemptsDir/0/_superstepDir/-1/_partitionAssignments
> Retrying connect to server: 
> gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 0 time(s).
> Retrying connect to server: 
> gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 1 time(s).
> Retrying connect to server: 
> gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 2 time(s).
> Retrying connect to server: 
> gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 3 time(s).
> Retrying connect to server: 
> gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 4 time(s).
> Retrying connect to server: 
> gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 5 time(s).
> Retrying connect to server: 
> gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 6 time(s).
> Retrying connect to server: 
> gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 7 time(s).
> Retrying connect to server: 
> gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 8 time(s).
> Retrying connect to server: 
> gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 9 time(s).
> Retrying connect to server: 
> gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 10 time(s).
> Retrying connect to server: 
> gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 11 time(s).
> Retrying connect to server: 
> gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 12 time(s).
> Retrying connect to server: 
> gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 13 time(s).
> Retrying connect to server: 
> gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 14 time(s).
> Retrying connect to server: 
> gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 15 time(s).
> Retrying connect to server: 
> gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 16 time(s).
> Retrying connect to server: 
> gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 17 time(s).
> Retrying connect to server: 
> gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 18 time(s).
> Retrying connect to server: 
> gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 19 time(s).
> Retrying connect to server: 
> gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 20 time(s).
> Retrying connect to server: 
> gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 21 time(s).
> Retrying connect to server: 
> gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 22 time(s).
> Retrying connect to server: 
> gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 23 time(s).
> Retrying connect to server: 
> gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 24 time(s).
> Retrying connect to server: 
> gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 25 time(s).
> Retrying connect to server: 
> gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 26 time(s).
> Retrying connect to server: 
> gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 27 time(s).
> Retrying connect to server: 
> gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 28 time(s).
> Retrying connect to server: 
> gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 29 time(s).
> Retrying connect to server: 
> gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 30 time(s).
> Retrying connect to server: 
> gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 31 time(s).
> Retrying connect to server: 
> gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 32 time(s).
> Retrying connect to server: 
> gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 33 time(s).
> Retrying connect to server: 
> gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 34 time(s).
> Retrying connect to server: 
> gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 35 time(s).
> Retrying connect to server: 
> gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 36 time(s).
> Retrying connect to server: 
> gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 37 time(s).
> Retrying connect to server: 
> gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 38 time(s).
> Retrying connect to server: 
> gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 39 time(s).
> Retrying connect to server: 
> gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 40 time(s).
> Retrying connect to server: 
> gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 41 time(s).
> Retrying connect to server: 
> gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 42 time(s).
> Retrying connect to server: 
> gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 43 time(s).
> Retrying connect to server: 
> gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 44 time(s).
> Retrying connect to server: 
> gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 45 time(s).
> Retrying connect to server: 
> gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 46 time(s).
> Retrying connect to server: 
> gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 47 time(s).
> Retrying connect to server: 
> gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 48 time(s).
> Retrying connect to server: 
> gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 49 time(s).
> PriviledgedActionException as:job_201203130609_14838 (auth:SIMPLE) 
> cause:java.net.ConnectException: Call to 
> gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061 failed on connection 
> exception: java.net.ConnectException: Connection refused
> connectAllRPCProxys: Failed on attempt 0 of 5 to connect to 
> (id=33,cur=Worker(hostname=gsta32085.tan.ygrid.yahoo.com, MRpartition=161, 
> port=35061),prev=null,ckpt_file=null)
> java.net.ConnectException: Call to 
> gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061 failed on connection 
> exception: java.net.ConnectException: Connection refused
>       at org.apache.hadoop.ipc.Client.wrapException(Client.java:1095)
>       at org.apache.hadoop.ipc.Client.call(Client.java:1071)
>       at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
>       at $Proxy8.getProtocolVersion(Unknown Source)
>       at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:396)
>       at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:370)
>       at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:420)
>       at 
> org.apache.giraph.comm.RPCCommunications$1.run(RPCCommunications.java:159)
>       at 
> org.apache.giraph.comm.RPCCommunications$1.run(RPCCommunications.java:155)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at javax.security.auth.Subject.doAs(Subject.java:396)
>       at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1082)
>       at 
> org.apache.giraph.comm.RPCCommunications.getRPCProxy(RPCCommunications.java:153)
>       at 
> org.apache.giraph.comm.RPCCommunications.getRPCProxy(RPCCommunications.java:51)
>       at 
> org.apache.giraph.comm.BasicRPCCommunications.startPeerConnectionThread(BasicRPCCommunications.java:599)
>       at 
> org.apache.giraph.comm.BasicRPCCommunications.connectAllRPCProxys(BasicRPCCommunications.java:542)
>       at 
> org.apache.giraph.comm.BasicRPCCommunications.setup(BasicRPCCommunications.java:513)
>       at 
> org.apache.giraph.graph.BspServiceWorker.setup(BspServiceWorker.java:550)
>       at org.apache.giraph.graph.GraphMapper.setup(GraphMapper.java:458)
>       at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:630)
>       at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>       at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>       at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at javax.security.auth.Subject.doAs(Subject.java:396)
>       at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1082)
>       at org.apache.hadoop.mapred.Child.main(Child.java:249)
> Caused by: java.net.ConnectException: Connection refused
>       at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>       at 
> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
>       at 
> org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
>       at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:656)
>       at 
> org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:434)
>       at 
> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:560)
>       at org.apache.hadoop.ipc.Client$Connection.access$2000(Client.java:184)
>       at org.apache.hadoop.ipc.Client.getConnection(Client.java:1202)
>       at org.apache.hadoop.ipc.Client.call(Client.java:1046)
>       ... 25 more
> ############################################################################
> log for worker 154:
> ############################################################################
> PriviledgedActionException as:job_201203130609_14838 (auth:SIMPLE) 
> cause:java.net.ConnectException: Call to 
> gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061 failed on connection 
> exception: java.net.ConnectException: Connection refused
> connectAllRPCProxys: Failed on attempt 4 of 5 to connect to 
> (id=33,cur=Worker(hostname=gsta32085.tan.ygrid.yahoo.com, MRpartition=161, 
> port=35061),prev=null,ckpt_file=null)
> java.net.ConnectException: Call to 
> gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061 failed on connection 
> exception: java.net.ConnectException: Connection refused
>       at org.apache.hadoop.ipc.Client.wrapException(Client.java:1095)
>       at org.apache.hadoop.ipc.Client.call(Client.java:1071)
>       at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
>       at $Proxy8.getProtocolVersion(Unknown Source)
>       at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:396)
>       at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:370)
>       at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:420)
>       at 
> org.apache.giraph.comm.RPCCommunications$1.run(RPCCommunications.java:159)
>       at 
> org.apache.giraph.comm.RPCCommunications$1.run(RPCCommunications.java:155)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at javax.security.auth.Subject.doAs(Subject.java:396)
>       at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1082)
>       at 
> org.apache.giraph.comm.RPCCommunications.getRPCProxy(RPCCommunications.java:153)
>       at 
> org.apache.giraph.comm.RPCCommunications.getRPCProxy(RPCCommunications.java:51)
>       at 
> org.apache.giraph.comm.BasicRPCCommunications.startPeerConnectionThread(BasicRPCCommunications.java:599)
>       at 
> org.apache.giraph.comm.BasicRPCCommunications.connectAllRPCProxys(BasicRPCCommunications.java:542)
>       at 
> org.apache.giraph.comm.BasicRPCCommunications.setup(BasicRPCCommunications.java:513)
>       at 
> org.apache.giraph.graph.BspServiceWorker.setup(BspServiceWorker.java:550)
>       at org.apache.giraph.graph.GraphMapper.setup(GraphMapper.java:458)
>       at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:630)
>       at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>       at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>       at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at javax.security.auth.Subject.doAs(Subject.java:396)
>       at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1082)
>       at org.apache.hadoop.mapred.Child.main(Child.java:249)
> Caused by: java.net.ConnectException: Connection refused
>       at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>       at 
> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
>       at 
> org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
>       at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:656)
>       at 
> org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:434)
>       at 
> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:560)
>       at org.apache.hadoop.ipc.Client$Connection.access$2000(Client.java:184)
>       at org.apache.hadoop.ipc.Client.getConnection(Client.java:1202)
>       at org.apache.hadoop.ipc.Client.call(Client.java:1046)
>       ... 25 more
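
The mismatch in the logs above can be illustrated with a short sketch. Worker 161 reports its server on 36061 "on bind attempt 1", while its health node advertises port=35061, which is the base port 34900 plus the partition number 161. The code below is not the attached patch and not Giraph's code: the class and method names (PortSyncSketch, candidatePort, bindWithRetries, register) are hypothetical, and the step of 1000 per bind attempt is only inferred from the difference between 36061 and 35061. The point of the sketch is that the advertised port is read back from the bound socket rather than recomputed from the defaults.

```java
import java.io.IOException;
import java.net.ServerSocket;

// Hypothetical sketch; none of these names come from Giraph or the patch.
class PortSyncSketch {
    /** What a worker advertises to its peers (modelled on the workerInfo log line). */
    static final class WorkerInfo {
        final String hostname;
        final int partition;
        final int port;

        WorkerInfo(String hostname, int partition, int port) {
            this.hostname = hostname;
            this.partition = partition;
            this.port = port;
        }
    }

    /** Port tried on a given bind attempt; the step size is an assumption. */
    static int candidatePort(int basePort, int taskId, int bindAttempt, int step) {
        return basePort + taskId + bindAttempt * step;
    }

    /** Tries successive candidate ports and returns the socket that bound. */
    static ServerSocket bindWithRetries(int basePort, int taskId,
                                        int maxAttempts, int step) throws IOException {
        IOException last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return new ServerSocket(candidatePort(basePort, taskId, attempt, step));
            } catch (IOException e) {
                last = e; // port unavailable, move on to the next candidate
            }
        }
        throw new IOException("Could not bind after " + maxAttempts + " attempts", last);
    }

    /** Advertises the port the server actually bound to, not the default one. */
    static WorkerInfo register(String hostname, int partition, ServerSocket server) {
        return new WorkerInfo(hostname, partition, server.getLocalPort());
    }

    public static void main(String[] args) throws IOException {
        // Ask the OS for a free port and keep it occupied, so that bind
        // attempt 0 fails just as it apparently did for worker 161.
        try (ServerSocket occupied = new ServerSocket(0)) {
            int defaultPort = occupied.getLocalPort();
            try (ServerSocket server = bindWithRetries(defaultPort, 0, 10, 1)) {
                WorkerInfo info = register("localhost", 161, server);
                System.out.println("default port:    " + defaultPort);
                System.out.println("advertised port: " + info.port);
            }
        }
    }
}
```

With the values from the logs, candidatePort(34900, 161, 0, 1000) is 35061 and candidatePort(34900, 161, 1, 1000) is 36061. A peer that only knows the first value gets "Connection refused", as both worker logs show.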

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
