> On Aug. 21, 2012, 4:53 p.m., Alessandro Presta wrote:
> > Really cool! It looks like some network failures are unavoidable scaling 
> > up, so resending requests on one end and ensuring unicity on the other 
> > seems a solid approach.
> > From your numbers, it's a pretty dramatic step up and I can't wait to try 
> > it.
> > 
> > Is there any significant impact on jobs that wouldn't require the 
> > additional precautions? In other words, if you take a job that succeeds 
> > with current trunk, do the communication-heavy phases take considerably 
> > longer?
> > 
> > Other than that, only some comments about maintainability.
> > 
> > Thanks!

The additional overhead is an additional integer that is send/received with 
every request (small price to pay for reliability and also small when we are 
bulking requests).  Running tests now, but don't expect any noticeable 
differences.


> On Aug. 21, 2012, 4:53 p.m., Alessandro Presta wrote:
> > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/NettyClient.java,
> >  line 117
> > <https://reviews.apache.org/r/6687/diff/3/?file=142973#file142973line117>
> >
> >     Can you put a more explanatory comment here? You're just making the 
> > previously hardcoded WAITING_REQUEST_MSECS parameter configurable, right?
> >     So maybe something like "Waiting interval for outstanding requests to 
> > finish".

Changed.


> On Aug. 21, 2012, 4:53 p.m., Alessandro Presta wrote:
> > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/NettyClient.java,
> >  line 211
> > <https://reviews.apache.org/r/6687/diff/3/?file=142973#file142973line211>
> >
> >     I would turn this 13 into a named constant and briefly explain how you 
> > get to this number, otherwise this is difficult to maintain.

Fixed.  This is a good idea.


> On Aug. 21, 2012, 4:53 p.m., Alessandro Presta wrote:
> > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/ResponseClientHandler.java,
> >  line 44
> > <https://reviews.apache.org/r/6687/diff/3/?file=142979#file142979line44>
> >
> >     Typo (missing blank).

Good eye, fixed.


> On Aug. 21, 2012, 4:53 p.m., Alessandro Presta wrote:
> > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/RequestServerHandler.java,
> >  line 91
> > <https://reviews.apache.org/r/6687/diff/3/?file=142978#file142978line91>
> >
> >     I normally don't like having test code paths in non-test classes. Is 
> > there absolutely no way around this?

I can understand the apprehension, but testing failure scenarios is a bit 
tricky without in-code instrumentation.  The alternative would be to refactor 
the logic and try to mock it in, but this is a bit yucky as well, since we 
would only be factoring for testing the failure, not for any other purpose.  


> On Aug. 21, 2012, 4:53 p.m., Alessandro Presta wrote:
> > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/WorkerIdRequestId.java,
> >  line 24
> > <https://reviews.apache.org/r/6687/diff/3/?file=142980#file142980line24>
> >
> >     Just a matter of taste, but I think we could use a more meaningful name 
> > here.
> >     Some ideas: GlobalRequestId (as this is a global, unique identifier), 
> > WorkerRequestIdPair (super-explicit).
> >     Also I would specify in the comments that we're referring to the 
> > destination worker.

I agree.  I changed the same to ClientRequestId since it is not globally 
unique, but rather unique to a client.


> On Aug. 21, 2012, 4:53 p.m., Alessandro Presta wrote:
> > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/NettyClient.java,
> >  line 66
> > <https://reviews.apache.org/r/6687/diff/3/?file=142973#file142973line66>
> >
> >     Unrelated to your changes, but I think these should go in GiraphJob 
> > like all other config options, even if Netty-specific.

I'd like to address this in a followup JIRA if that is okay.


- Avery


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/6687/#review10573
-----------------------------------------------------------


On Aug. 21, 2012, 7:47 p.m., Avery Ching wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/6687/
> -----------------------------------------------------------
> 
> (Updated Aug. 21, 2012, 7:47 p.m.)
> 
> 
> Review request for giraph.
> 
> 
> Description
> -------
> 
> One of the biggest scalability challenges is getting Giraph to run reliably 
> on a large number of tasks (i.e. > 200). Several problems exist:
> 
> 1) If the connection fails after the initial connection was made, the job 
> will die.
> 2) Requests must be completed exactly once. This is difficult to implement, 
> but required since we cannot have multiple retried requests succeed (i.e. a 
> vertex gets more messages than expected).
> 3) Sometimes there are unresolved addresses, causing failure.
> 
> This patch addresses these issues by re-establishing failed connections and 
> keep tracking of every request sent to every worker. If the request fails or 
> passes a timeout, it will be resent. The server will keep track of requests 
> that succeeded to insure that the same request won't be processed more than 
> once. The structure for keeping track of the succeeded requests on the server 
> is efficient for handling increasing request ids (IncreasingBitSet). For 
> handling unresolved addresses, I added retry logic to keep trying to resolve 
> the problem.
> 
> This patch also adds several unit tests that use fault injection to simulate 
> a lost response or a closed channel exception on the server. It also has 
> unittests for IncreasingBitSet to insure it is working correctly and 
> efficiently.
> 
> This passes all unittests (including the new ones). Additionally, I have some 
> experience results as well.
> 
> Previously, I was unable to run reliably with more than 200 workers. With 
> this change I can reliably run 500+ workers. I also ran with 600 workers 
> successfully. This is a really big reliability win for us.
> 
> I can see the code working to do reconnections and re-issue requests when 
> necessary. It's very cool.
> 
> I.e.
> 
> 2012-08-18 00:16:52,109 INFO org.apache.giraph.comm.NettyClient: 
> checkAndFixChannel: Fixing disconnected channel to 
> xxx.xxx.xxx.xxx/xx.xx.xx.xx:30455, open = false, bound = false
> 2012-08-18 00:16:52,111 INFO org.apache.giraph.comm.NettyClient: 
> checkAndFixChannel: Connected to xxx.xxx.xxx.xxx/xxx.xxx.xxx.xxx:30455!
> 2012-08-18 00:16:52,123 INFO org.apache.giraph.comm.NettyClient: 
> checkAndFixChannel: Fixing disconnected channel to 
> xxx.xxx.xxx.xxx/xxx.xxx.xxx.xxx, open = false, bound = false
> 2012-08-18 00:16:52,124 INFO org.apache.giraph.comm.NettyClient: 
> checkAndFixChannel: Connected to xxx.xxx.xxx.xxx/xxx.xxx.xxx.xxx:30117!
> 
> 
> This addresses bug GIRAPH-306.
>     https://issues.apache.org/jira/browse/GIRAPH-306
> 
> 
> Diffs
> -----
> 
>   
> http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/AddressRequestIdGenerator.java
>  PRE-CREATION 
>   
> http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/ChannelRotater.java
>  1375393 
>   
> http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/ClientRequestId.java
>  PRE-CREATION 
>   
> http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/IncreasingBitSet.java
>  PRE-CREATION 
>   
> http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/NettyClient.java
>  1375393 
>   
> http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/NettyServer.java
>  1375393 
>   
> http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/NettyWorkerClient.java
>  1375393 
>   
> http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/RequestDecoder.java
>  1375393 
>   
> http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/RequestInfo.java
>  1375393 
>   
> http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/RequestServerHandler.java
>  1375393 
>   
> http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/ResponseClientHandler.java
>  1375393 
>   
> http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/WorkerRequestReservedMap.java
>  PRE-CREATION 
>   
> http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/WritableRequest.java
>  1375393 
>   
> http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/graph/BspServiceMaster.java
>  1375393 
>   
> http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/graph/BspServiceWorker.java
>  1375393 
>   
> http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/graph/GiraphJob.java
>  1375393 
>   
> http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/graph/WorkerInfo.java
>  1375393 
>   
> http://svn.apache.org/repos/asf/giraph/trunk/src/test/java/org/apache/giraph/comm/IncreasingBitSetTest.java
>  PRE-CREATION 
>   
> http://svn.apache.org/repos/asf/giraph/trunk/src/test/java/org/apache/giraph/comm/RequestFailureTest.java
>  PRE-CREATION 
>   
> http://svn.apache.org/repos/asf/giraph/trunk/src/test/java/org/apache/giraph/comm/RequestTest.java
>  1375393 
> 
> Diff: https://reviews.apache.org/r/6687/diff/
> 
> 
> Testing
> -------
> 
> mvn clean verify
> Lots of large test 500-600 workers with PageRankBenchmark
> 
> 
> Thanks,
> 
> Avery Ching
> 
>

Reply via email to