[
https://issues.apache.org/jira/browse/GIRAPH-306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13437451#comment-13437451
]
Eli Reisman commented on GIRAPH-306:
------------------------------------
Nice, I will try this out right away. Since Friday I have been able to get in
some better-instrumented runs to test a bunch of patches, and even on trunk I
am running into these errors all the time. I was not able to see logs for a
while on Friday and could not determine what was happening, but it's always
either memory issues or Netty connection errors. If this solves it I will be a
very happy guy; Giraph is performing very well up to a scale limit now and then
hitting this wall. Will report back the results...
Thanks again!
> Netty requests should be reliable and implement exactly once semantics
> ----------------------------------------------------------------------
>
> Key: GIRAPH-306
> URL: https://issues.apache.org/jira/browse/GIRAPH-306
> Project: Giraph
> Issue Type: Improvement
> Reporter: Avery Ching
> Priority: Critical
> Attachments: GIRAPH-306.patch
>
>
> One of the biggest scalability challenges is getting Giraph to run reliably
> on a large number of tasks (i.e. > 200). Several problems exist:
> 1) If the connection fails after the initial connection was made, the job
> will die.
> 2) Requests must be completed exactly once. This is difficult to implement,
> but required since we cannot have multiple retried requests succeed (i.e. a
> vertex gets more messages than expected).
> 3) Sometimes there are unresolved addresses, causing failure.
> This patch addresses these issues by re-establishing failed connections and
> keeping track of every request sent to every worker. If a request fails or
> exceeds a timeout, it will be resent. The server keeps track of requests
> that succeeded to ensure that the same request won't be processed more than
> once. The structure for keeping track of the succeeded requests on the
> server is efficient for handling increasing request ids (IncreasingBitSet).
> For handling unresolved addresses, I added retry logic to keep trying to
> resolve the address.
> This patch also adds several unit tests that use fault injection to simulate
> a lost response or a closed channel exception on the server. It also has
> unit tests for IncreasingBitSet to ensure it is working correctly and
> efficiently.
> This passes all unit tests (including the new ones). Additionally, I have
> some experimental results as well.
> Previously, I was unable to run reliably with more than 200 workers. With
> this change I can reliably run 500+ workers. I also ran with 600 workers
> successfully. This is a really big reliability win for us.
> I can see the code working to do reconnections and re-issue requests when
> necessary. It's very cool.
> For example:
> 2012-08-18 00:16:52,109 INFO org.apache.giraph.comm.NettyClient:
> checkAndFixChannel: Fixing disconnected channel to
> xxx.xxx.xxx.xxx/xx.xx.xx.xx:30455, open = false, bound = false
> 2012-08-18 00:16:52,111 INFO org.apache.giraph.comm.NettyClient:
> checkAndFixChannel: Connected to xxx.xxx.xxx.xxx/xxx.xxx.xxx.xxx:30455!
> 2012-08-18 00:16:52,123 INFO org.apache.giraph.comm.NettyClient:
> checkAndFixChannel: Fixing disconnected channel to
> xxx.xxx.xxx.xxx/xxx.xxx.xxx.xxx, open = false, bound = false
> 2012-08-18 00:16:52,124 INFO org.apache.giraph.comm.NettyClient:
> checkAndFixChannel: Connected to xxx.xxx.xxx.xxx/xxx.xxx.xxx.xxx:30117!
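
For anyone following along, here is a rough sketch of the server-side
duplicate check the description talks about. It is not the code from
GIRAPH-306.patch and not Giraph's IncreasingBitSet; the class name
SeenRequestIds and the methods markSeen and base are made up for
illustration. It assumes request ids are non-negative and arrive mostly in
increasing order, so the server only has to remember a small window of ids
above a moving base.

```java
import java.util.BitSet;

/**
 * Hypothetical stand-in for a server-side "already processed" tracker.
 * Illustrative only; this is not Giraph's IncreasingBitSet.
 */
final class SeenRequestIds {
  /** Every id below this value has already been processed. */
  private long base = 0;
  /** Bit i set means id (base + i) has already been processed. */
  private BitSet window = new BitSet();

  /**
   * Record a request id.
   *
   * @return true if the id is new and the request should be processed,
   *         false if it is a duplicate (for example, a client resend)
   */
  synchronized boolean markSeen(long requestId) {
    if (requestId < 0) {
      throw new IllegalArgumentException("negative request id");
    }
    if (requestId < base) {
      return false; // Older than the window, so already processed.
    }
    long offset = requestId - base;
    if (offset >= Integer.MAX_VALUE) {
      throw new IllegalStateException("request id too far ahead of base");
    }
    int index = (int) offset;
    if (window.get(index)) {
      return false; // Inside the window and already marked.
    }
    window.set(index);
    // Slide the window past the contiguous run of processed ids at the
    // front, so memory stays small when ids keep increasing.
    int run = window.nextClearBit(0);
    if (run > 0) {
      window = window.get(run, Math.max(run, window.length()));
      base += run;
    }
    return true;
  }

  /** Smallest id that has not yet been seen. */
  synchronized long base() {
    return base;
  }

  public static void main(String[] args) {
    SeenRequestIds seen = new SeenRequestIds();
    // Simulated arrivals: ids 1 and 3 are each resent once.
    long[] arrivals = {0, 1, 1, 3, 2, 3};
    for (long id : arrivals) {
      String outcome = seen.markSeen(id) ? "processed" : "dropped as duplicate";
      System.out.println("request " + id + " " + outcome);
    }
  }
}
```

With a check like this on the server, the client is free to resend any
request that failed or timed out: a resend of a request that actually
succeeded is dropped rather than delivered to the vertex a second time.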