[ https://issues.apache.org/jira/browse/GIRAPH-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13435564#comment-13435564 ]
Eli Reisman commented on GIRAPH-300:
------------------------------------
I'm getting errors like this during the input superstep on about 20% of my
workers, on both small and large jobs. This happened before this patch was
committed, but it still seems to be happening now. Is anyone else seeing this
on their runs?
Aug 15, 2012 9:55:25 PM org.jboss.netty.channel.DefaultChannelPipeline
WARNING: An exception was thrown by a user handler while handling an exception event ([id: 0x48433545] EXCEPTION: java.net.ConnectException: Connection timed out)
java.lang.IllegalStateException: exceptionCaught: Channel failed with remote address null
    at org.apache.giraph.comm.ResponseClientHandler.exceptionCaught(ResponseClientHandler.java:107)
    at org.jboss.netty.handler.codec.frame.FrameDecoder.exceptionCaught(FrameDecoder.java:244)
    at org.apache.giraph.comm.ByteCounter.handleUpstream(ByteCounter.java:61)
    at org.jboss.netty.channel.Channels.fireExceptionCaught(Channels.java:426)
    at org.jboss.netty.channel.socket.nio.NioClientSocketPipelineSink$Boss.connect(NioClientSocketPipelineSink.java:406)
    at org.jboss.netty.channel.socket.nio.NioClientSocketPipelineSink$Boss.processSelectedKeys(NioClientSocketPipelineSink.java:362)
    at org.jboss.netty.channel.socket.nio.NioClientSocketPipelineSink$Boss.run(NioClientSocketPipelineSink.java:284)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)
Caused by: java.net.ConnectException: Connection timed out
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:567)
    at org.jboss.netty.channel.socket.nio.NioClientSocketPipelineSink$Boss.connect(NioClientSocketPipelineSink.java:400)
    ... 5 more
> Improve netty reliability with retrying failed connections, tracking
> requests, thread-safe hash partitioning
> ------------------------------------------------------------------------------------------------------------
>
> Key: GIRAPH-300
> URL: https://issues.apache.org/jira/browse/GIRAPH-300
> Project: Giraph
> Issue Type: Improvement
> Reporter: Avery Ching
> Assignee: Avery Ching
> Attachments: GIRAPH-300.2.patch, GIRAPH-300.patch
>
>
> * Upgrade to the most recent stable version of Netty (3.5.3.Final)
> * Try multiple connection attempts up to n failures
> * Track requests throughout the system by keeping track of the request id and
> then matching the request id to the response (minor refactoring of
> WritableRequest to make requests simpler and support the request id)
> * Improved handling of netty exceptions by dumping the exception stack to
> help debug failures
> * Fixes bug in HashWorkerPartitioner by making partitionList thread-safe
> (this causes divide by zero exceptions in real life)
> Currently, netty connection failures cause issues with more than 75 workers
> in my setup. This patch lets us reach 200+ workers on a reasonably reliable
> network that doesn't kill connections.
> This code passes the local Hadoop regressions and the single node Hadoop
> instance regressions. It also succeeded on large runs (200+ workers) on a
> real Hadoop cluster.
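
The "try multiple connection attempts up to n failures" item in the description
above could be sketched roughly as follows. This is only an illustration, not
the code from the GIRAPH-300 patch: the class ConnectRetry, the method
connectWithRetries, and the parameters maxFailures and waitMillis are all
hypothetical names, and the connection attempt is passed in as a plain Callable
so the sketch does not depend on Netty.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

/** Hypothetical helper for illustration; not part of Giraph or Netty. */
public class ConnectRetry {
    /**
     * Runs the given connection attempt until it succeeds or until
     * maxFailures attempts have failed with an IOException
     * (java.net.ConnectException is a subclass of IOException).
     *
     * @param attempt     one connection attempt (assumed to be safe to repeat)
     * @param maxFailures number of failed attempts tolerated before giving up
     * @param waitMillis  pause between a failed attempt and the next one
     */
    public static <T> T connectWithRetries(Callable<T> attempt,
                                           int maxFailures,
                                           long waitMillis)
            throws IOException, InterruptedException {
        if (maxFailures < 1) {
            throw new IllegalArgumentException("maxFailures must be >= 1");
        }
        IOException lastFailure = null;
        for (int failures = 0; failures < maxFailures; failures++) {
            try {
                return attempt.call();
            } catch (IOException e) {
                // Remember the failure and retry unless this was the last try.
                lastFailure = e;
                if (failures + 1 < maxFailures) {
                    Thread.sleep(waitMillis);
                }
            } catch (Exception e) {
                // Anything other than an I/O failure is not retried.
                throw new IllegalStateException("Unexpected non-I/O failure", e);
            }
        }
        // Keep the last underlying exception as the cause to help debugging.
        throw new IOException(
            "Giving up after " + maxFailures + " failed attempts", lastFailure);
    }
}
```

A caller would wrap whatever opens the channel in the Callable. Keeping the last
exception as the cause preserves the original "Connection timed out" stack for
debugging, in the spirit of the exception-dumping item in the description.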