[ 
https://issues.apache.org/jira/browse/GIRAPH-306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13437462#comment-13437462
 ] 

Eli Reisman commented on GIRAPH-306:
------------------------------------

FYI: Trunk plus this patch makes it most of the way through a restart when a 
worker dies, but it tries to reconnect with itself as well as with all the 
other workers, and it cannot reconnect with itself even when all the other 
connections seem to succeed. I'm not sure what would happen next with regard 
to the InputSplit the reincarnated worker was reading when it died, but we 
didn't get that far. That seems like a minor detail; otherwise the patch is 
doing everything you said it would. I will keep testing. Nice work!

                
> Netty requests should be reliable and implement exactly once semantics
> ----------------------------------------------------------------------
>
>                 Key: GIRAPH-306
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-306
>             Project: Giraph
>          Issue Type: Improvement
>            Reporter: Avery Ching
>            Priority: Critical
>         Attachments: GIRAPH-306.patch
>
>
> One of the biggest scalability challenges is getting Giraph to run reliably 
> on a large number of tasks (i.e. > 200).  Several problems exist:
> 1) If the connection fails after the initial connection was made, the job 
> will die.
> 2) Requests must be completed exactly once.  This is difficult to implement, 
> but required, since we cannot have multiple retries of the same request 
> succeed (e.g. a vertex would get more messages than expected).
> 3) Sometimes there are unresolved addresses, causing failure.
> This patch addresses these issues by re-establishing failed connections and 
> keeping track of every request sent to every worker.  If a request fails or 
> exceeds a timeout, it will be resent.  The server keeps track of requests 
> that succeeded to ensure that the same request won't be processed more than 
> once.  The structure for keeping track of the succeeded requests on the 
> server is efficient for handling increasing request ids (IncreasingBitSet).  
> For handling unresolved addresses, I added retry logic that keeps trying to 
> resolve the address.
> This patch also adds several unit tests that use fault injection to simulate 
> a lost response or a closed channel exception on the server.  It also has 
> unit tests for IncreasingBitSet to ensure it is working correctly and 
> efficiently.
> This passes all unit tests (including the new ones).  I also have some 
> experimental results.
> Previously, I was unable to run reliably with more than 200 workers.  With 
> this change I can reliably run 500+ workers.  I also ran with 600 workers 
> successfully.  This is a really big reliability win for us.
> I can see the code working to do reconnections and re-issue requests when 
> necessary.  It's very cool.
> For example:
> 2012-08-18 00:16:52,109 INFO org.apache.giraph.comm.NettyClient: 
> checkAndFixChannel: Fixing disconnected channel to 
> xxx.xxx.xxx.xxx/xx.xx.xx.xx:30455, open = false, bound = false
> 2012-08-18 00:16:52,111 INFO org.apache.giraph.comm.NettyClient: 
> checkAndFixChannel: Connected to xxx.xxx.xxx.xxx/xxx.xxx.xxx.xxx:30455!
> 2012-08-18 00:16:52,123 INFO org.apache.giraph.comm.NettyClient: 
> checkAndFixChannel: Fixing disconnected channel to 
> xxx.xxx.xxx.xxx/xxx.xxx.xxx.xxx, open = false, bound = false
> 2012-08-18 00:16:52,124 INFO org.apache.giraph.comm.NettyClient: 
> checkAndFixChannel: Connected to xxx.xxx.xxx.xxx/xxx.xxx.xxx.xxx:30117!
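
The server-side duplicate check described in the quoted text can be sketched 
roughly as follows. This is only an illustration of the idea, not Giraph's 
IncreasingBitSet, whose code is not shown in this message. The names 
SeenRequestIds, markIfNew and lowestUnseen are hypothetical, and the sketch 
assumes that each client numbers its requests with increasing ids starting 
at 0 and that the server keeps one tracker per client.

```java
import java.util.BitSet;

/**
 * Hypothetical tracker of processed request ids for one client.
 * Assumes ids are non-negative and mostly arrive in increasing order,
 * so only a small window above the watermark has to be stored.
 */
final class SeenRequestIds {
    /** Every id below this watermark has already been processed. */
    private long lowestUnseen = 0;
    /** Bit i set means id (lowestUnseen + i) has been processed. */
    private BitSet window = new BitSet();

    /** Returns true if the id is new and should be processed, false if it is a duplicate. */
    synchronized boolean markIfNew(long requestId) {
        if (requestId < 0) {
            throw new IllegalArgumentException("negative request id: " + requestId);
        }
        if (requestId < lowestUnseen) {
            return false; // below the watermark: already processed
        }
        long offset = requestId - lowestUnseen;
        if (offset >= Integer.MAX_VALUE) {
            throw new IllegalStateException("request id too far ahead: " + requestId);
        }
        int bit = (int) offset;
        if (window.get(bit)) {
            return false; // inside the window and already marked
        }
        window.set(bit);
        // Advance the watermark past the contiguous run of processed ids
        // and drop those bits so the window stays small.
        int firstClear = window.nextClearBit(0);
        if (firstClear > 0) {
            lowestUnseen += firstClear;
            window = window.get(firstClear, window.length());
        }
        return true;
    }

    synchronized long lowestUnseen() {
        return lowestUnseen;
    }
}

final class ExactlyOnceDemo {
    public static void main(String[] args) {
        SeenRequestIds seen = new SeenRequestIds();
        // Ids 1 and 3 each arrive twice, as they would after a client resend.
        long[] arrivals = {0, 1, 1, 3, 2, 3};
        int applied = 0;
        for (long id : arrivals) {
            if (seen.markIfNew(id)) {
                applied++; // a real server would process the request here
            }
        }
        System.out.println("applied " + applied + " of " + arrivals.length);
        // prints "applied 4 of 6"
    }
}
```

Because a resent request reuses its original id, the server can acknowledge 
it again without applying it twice. Compacting the window on each call keeps 
memory proportional to the number of out-of-order ids rather than to the 
total number of requests; a production structure would likely compact less 
eagerly.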

