> On Aug. 21, 2012, 4:53 p.m., Alessandro Presta wrote:
> > Really cool! It looks like some network failures are unavoidable scaling 
> > up, so resending requests on one end and ensuring unicity on the other 
> > seems a solid approach.
> > From your numbers, it's a pretty dramatic step up and I can't wait to try 
> > it.
> > 
> > Is there any significant impact on jobs that wouldn't require the 
> > additional precautions? In other words, if you take a job that succeeds 
> > with current trunk, do the communication-heavy phases take considerably 
> > longer?
> > 
> > Other than that, only some comments about maintainability.
> > 
> > Thanks!
> 
> Avery Ching wrote:
>     The additional overhead is an additional integer that is send/received 
> with every request (small price to pay for reliability and also small when we 
> are bulking requests).  Running tests now, but don't expect any noticeable 
> differences.

I was referring more to run time, wondering if the additional logic (e.g. 
iterating over all requests to check the connection in waitSomeRequests) 
penalizes us significantly. The one integer per request seems unimportant to me 
(as you say, we batch up messages/vertices in requests).
The benchmarks you posted on the JIRA have indeed high variance, so we can't 
say much except that the difference is not dramatic. I also think reliability 
matters more at this stage, especially since we have consistent failures over a 
certain number of workers.
Ship it!


> On Aug. 21, 2012, 4:53 p.m., Alessandro Presta wrote:
> > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/WorkerIdRequestId.java,
> >  line 24
> > <https://reviews.apache.org/r/6687/diff/3/?file=142980#file142980line24>
> >
> >     Just a matter of taste, but I think we could use a more meaningful name 
> > here.
> >     Some ideas: GlobalRequestId (as this is a global, unique identifier), 
> > WorkerRequestIdPair (super-explicit).
> >     Also I would specify in the comments that we're referring to the 
> > destination worker.
> 
> Avery Ching wrote:
>     I agree.  I changed the same to ClientRequestId since it is not globally 
> unique, but rather unique to a client.

Oh right. Sounds good.


> On Aug. 21, 2012, 4:53 p.m., Alessandro Presta wrote:
> > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/NettyClient.java,
> >  line 211
> > <https://reviews.apache.org/r/6687/diff/3/?file=142973#file142973line211>
> >
> >     I would turn this 13 into a named constant and briefly explain how you 
> > get to this number, otherwise this is difficult to maintain.
> 
> Avery Ching wrote:
>     Fixed.  This is a good idea.

Actually my bad here, I didn't realize it's simply 4 (int) + 8 (long) + 1 
(byte). So I guess it wasn't a big deal, sorry.


> On Aug. 21, 2012, 4:53 p.m., Alessandro Presta wrote:
> > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/RequestServerHandler.java,
> >  line 91
> > <https://reviews.apache.org/r/6687/diff/3/?file=142978#file142978line91>
> >
> >     I normally don't like having test code paths in non-test classes. Is 
> > there absolutely no way around this?
> 
> Avery Ching wrote:
>     I can understand the apprehension, but testing failure scenarios is a bit 
> tricky without in-code instrumentation.  The alternative would be to refactor 
> the logic and try to mock it in, but this is a bit yucky as well, since we 
> would only be factoring for testing the failure, not for any other purpose.

Yeah, I see what you're saying. It's alright then.


- Alessandro


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/6687/#review10573
-----------------------------------------------------------


On Aug. 21, 2012, 7:47 p.m., Avery Ching wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/6687/
> -----------------------------------------------------------
> 
> (Updated Aug. 21, 2012, 7:47 p.m.)
> 
> 
> Review request for giraph.
> 
> 
> Description
> -------
> 
> One of the biggest scalability challenges is getting Giraph to run reliably 
> on a large number of tasks (i.e. > 200). Several problems exist:
> 
> 1) If the connection fails after the initial connection was made, the job 
> will die.
> 2) Requests must be completed exactly once. This is difficult to implement, 
> but required since we cannot have multiple retried requests succeed (i.e. a 
> vertex gets more messages than expected).
> 3) Sometimes there are unresolved addresses, causing failure.
> 
> This patch addresses these issues by re-establishing failed connections and 
> keep tracking of every request sent to every worker. If the request fails or 
> passes a timeout, it will be resent. The server will keep track of requests 
> that succeeded to insure that the same request won't be processed more than 
> once. The structure for keeping track of the succeeded requests on the server 
> is efficient for handling increasing request ids (IncreasingBitSet). For 
> handling unresolved addresses, I added retry logic to keep trying to resolve 
> the problem.
> 
> This patch also adds several unit tests that use fault injection to simulate 
> a lost response or a closed channel exception on the server. It also has 
> unittests for IncreasingBitSet to insure it is working correctly and 
> efficiently.
> 
> This passes all unittests (including the new ones). Additionally, I have some 
> experience results as well.
> 
> Previously, I was unable to run reliably with more than 200 workers. With 
> this change I can reliably run 500+ workers. I also ran with 600 workers 
> successfully. This is a really big reliability win for us.
> 
> I can see the code working to do reconnections and re-issue requests when 
> necessary. It's very cool.
> 
> I.e.
> 
> 2012-08-18 00:16:52,109 INFO org.apache.giraph.comm.NettyClient: 
> checkAndFixChannel: Fixing disconnected channel to 
> xxx.xxx.xxx.xxx/xx.xx.xx.xx:30455, open = false, bound = false
> 2012-08-18 00:16:52,111 INFO org.apache.giraph.comm.NettyClient: 
> checkAndFixChannel: Connected to xxx.xxx.xxx.xxx/xxx.xxx.xxx.xxx:30455!
> 2012-08-18 00:16:52,123 INFO org.apache.giraph.comm.NettyClient: 
> checkAndFixChannel: Fixing disconnected channel to 
> xxx.xxx.xxx.xxx/xxx.xxx.xxx.xxx, open = false, bound = false
> 2012-08-18 00:16:52,124 INFO org.apache.giraph.comm.NettyClient: 
> checkAndFixChannel: Connected to xxx.xxx.xxx.xxx/xxx.xxx.xxx.xxx:30117!
> 
> 
> This addresses bug GIRAPH-306.
>     https://issues.apache.org/jira/browse/GIRAPH-306
> 
> 
> Diffs
> -----
> 
>   
> http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/AddressRequestIdGenerator.java
>  PRE-CREATION 
>   
> http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/ChannelRotater.java
>  1375393 
>   
> http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/ClientRequestId.java
>  PRE-CREATION 
>   
> http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/IncreasingBitSet.java
>  PRE-CREATION 
>   
> http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/NettyClient.java
>  1375393 
>   
> http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/NettyServer.java
>  1375393 
>   
> http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/NettyWorkerClient.java
>  1375393 
>   
> http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/RequestDecoder.java
>  1375393 
>   
> http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/RequestInfo.java
>  1375393 
>   
> http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/RequestServerHandler.java
>  1375393 
>   
> http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/ResponseClientHandler.java
>  1375393 
>   
> http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/WorkerRequestReservedMap.java
>  PRE-CREATION 
>   
> http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/WritableRequest.java
>  1375393 
>   
> http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/graph/BspServiceMaster.java
>  1375393 
>   
> http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/graph/BspServiceWorker.java
>  1375393 
>   
> http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/graph/GiraphJob.java
>  1375393 
>   
> http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/graph/WorkerInfo.java
>  1375393 
>   
> http://svn.apache.org/repos/asf/giraph/trunk/src/test/java/org/apache/giraph/comm/IncreasingBitSetTest.java
>  PRE-CREATION 
>   
> http://svn.apache.org/repos/asf/giraph/trunk/src/test/java/org/apache/giraph/comm/RequestFailureTest.java
>  PRE-CREATION 
>   
> http://svn.apache.org/repos/asf/giraph/trunk/src/test/java/org/apache/giraph/comm/RequestTest.java
>  1375393 
> 
> Diff: https://reviews.apache.org/r/6687/diff/
> 
> 
> Testing
> -------
> 
> mvn clean verify
> Lots of large test 500-600 workers with PageRankBenchmark
> 
> 
> Thanks,
> 
> Avery Ching
> 
>

Reply via email to