Re: [jira] [Commented] (GIRAPH-300) Improve netty reliability with retrying failed connections, tracking requests, thread-safe hash partitioning

Avery Ching Wed, 15 Aug 2012 15:54:44 -0700

Yes, this will happen, but should be okay, since the connect retries will take care of it (hopefully). This already happened with the old code (as you mentioned).

I'm also working on a more robust implementation that will retry failed requests going forward (and re-establish broken connections).


Avery

On 8/15/12 3:04 PM, Eli Reisman (JIRA) wrote:

     [ https://issues.apache.org/jira/browse/GIRAPH-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13435564#comment-13435564 ]

Eli Reisman commented on GIRAPH-300:
------------------------------------

Getting errors like this during input superstep on about 20% of my workers, 
happens on small and large jobs. This happened before this patch got committed, 
but seems to be happening now too. Anyone seeing this on your runs?


Aug 15, 2012 9:55:25 PM org.jboss.netty.channel.DefaultChannelPipeline
WARNING: An exception was thrown by a user handler while handling an exception event ([id: 0x48433545] EXCEPTION: java.net.ConnectException: Connection timed out)
java.lang.IllegalStateException: exceptionCaught: Channel failed with remote address null
        at org.apache.giraph.comm.ResponseClientHandler.exceptionCaught(ResponseClientHandler.java:107)
        at org.jboss.netty.handler.codec.frame.FrameDecoder.exceptionCaught(FrameDecoder.java:244)
        at org.apache.giraph.comm.ByteCounter.handleUpstream(ByteCounter.java:61)
        at org.jboss.netty.channel.Channels.fireExceptionCaught(Channels.java:426)
        at org.jboss.netty.channel.socket.nio.NioClientSocketPipelineSink$Boss.connect(NioClientSocketPipelineSink.java:406)
        at org.jboss.netty.channel.socket.nio.NioClientSocketPipelineSink$Boss.processSelectedKeys(NioClientSocketPipelineSink.java:362)
        at org.jboss.netty.channel.socket.nio.NioClientSocketPipelineSink$Boss.run(NioClientSocketPipelineSink.java:284)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
Caused by: java.net.ConnectException: Connection timed out
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:567)
        at org.jboss.netty.channel.socket.nio.NioClientSocketPipelineSink$Boss.connect(NioClientSocketPipelineSink.java:400)
        ... 5 more

Improve netty reliability with retrying failed connections, tracking requests, 
thread-safe hash partitioning
------------------------------------------------------------------------------------------------------------

                 Key: GIRAPH-300
                 URL: https://issues.apache.org/jira/browse/GIRAPH-300
             Project: Giraph
          Issue Type: Improvement
            Reporter: Avery Ching
            Assignee: Avery Ching
         Attachments: GIRAPH-300.2.patch, GIRAPH-300.patch


* Upgrade to the most recent stable version of Netty (3.5.3.Final)
* Try multiple connection attempts up to n failures
* Track requests throughout the system by keeping track of the request id and 
then matching the request id to the response (minor refactoring of 
WritableRequest to make requests simpler and support the request id)
* Improved handling of netty exceptions by dumping the exception stack to help 
debug failures
* Fixes bug in HashWorkerPartitioner by making partitionList thread-safe (this 
causes divide by zero exceptions in real life)
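
The request tracking item above could be sketched roughly as follows. This
is only an illustration of the idea (assign an id on send, match it on
response); RequestTracker and its methods are hypothetical names and are not
taken from the patch or from WritableRequest.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical tracker for in-flight requests; not an actual Giraph class. */
class RequestTracker<R> {
  /** Source of unique request ids. */
  private final AtomicLong nextRequestId = new AtomicLong(0);
  /** Requests that were sent but have not been answered yet. */
  private final Map<Long, R> openRequests = new ConcurrentHashMap<Long, R>();

  /** Assigns an id to the request and remembers it until a response arrives. */
  long register(R request) {
    long requestId = nextRequestId.getAndIncrement();
    openRequests.put(requestId, request);
    return requestId;
  }

  /** Matches a response to its request; returns null for unknown or repeated ids. */
  R complete(long requestId) {
    return openRequests.remove(requestId);
  }

  /** Number of requests still waiting for a response. */
  int openRequestCount() {
    return openRequests.size();
  }
}
```

Whatever is still in the map after a timeout is a candidate for a resend.
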
Currently, netty connection failures cause issues with more than 75 workers in 
my setup.  This allows us to reach 200+ workers on a reasonably reliable network 
that doesn't kill connections.
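
For illustration, "multiple connection attempts up to n failures" might look
something like the sketch below. ConnectRetry, connectWithRetries,
maxFailures and waitMsecs are made-up names, and the connect attempt is
passed in as a plain Callable rather than a Netty bootstrap, so this shows
the retry loop only and is not the code in the patch.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

/** Hypothetical helper; shows a bounded retry loop, not the Giraph implementation. */
class ConnectRetry {
  /**
   * Runs connectAttempt until it succeeds or has failed maxFailures times
   * with an IOException (java.net.ConnectException is a subclass).
   */
  static <T> T connectWithRetries(Callable<T> connectAttempt, int maxFailures,
      long waitMsecs) throws Exception {
    if (maxFailures < 1) {
      throw new IllegalArgumentException("maxFailures must be at least 1");
    }
    IOException lastFailure = null;
    for (int failures = 0; failures < maxFailures; ++failures) {
      try {
        return connectAttempt.call();
      } catch (IOException e) {
        lastFailure = e;
        // Wait before the next attempt, but not after the final failure.
        if (failures + 1 < maxFailures) {
          Thread.sleep(waitMsecs);
        }
      }
    }
    throw new IOException(
        "Giving up after " + maxFailures + " failed attempts", lastFailure);
  }
}
```

Only IOException is retried here; any other exception is treated as a bug
and propagates immediately.
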
This code passes the local Hadoop regressions and the single node Hadoop 
instance regressions.  It also succeeded on large runs (200+ workers) on a real 
Hadoop cluster.
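
On the HashWorkerPartitioner item: a divide by zero can presumably occur when
one thread computes hash % partitionList.size() while another thread is
rebuilding the list and it is momentarily empty. One way to rule that out,
not necessarily the approach taken in the patch, is to publish an immutable
snapshot through a volatile field. SnapshotPartitioner and its methods are
hypothetical, and partitions are reduced to plain integer ids.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical partitioner; illustrates snapshot publication only. */
class SnapshotPartitioner {
  /** Readers always see either the old list or the complete new list. */
  private volatile List<Integer> partitionList = Collections.emptyList();

  /** Replaces the whole list in one step instead of clear() plus addAll(). */
  void setPartitions(List<Integer> newPartitionIds) {
    partitionList =
        Collections.unmodifiableList(new ArrayList<Integer>(newPartitionIds));
  }

  /** Picks a partition id by hash; fails clearly if nothing is assigned yet. */
  int partitionFor(Object vertexId) {
    List<Integer> snapshot = partitionList;  // read the field exactly once
    if (snapshot.isEmpty()) {
      throw new IllegalStateException("No partitions assigned yet");
    }
    // Mask the sign bit so the index is never negative.
    int index = (vertexId.hashCode() & Integer.MAX_VALUE) % snapshot.size();
    return snapshot.get(index);
  }
}
```
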

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
