> On Aug. 21, 2012, 4:53 p.m., Alessandro Presta wrote: > > Really cool! It looks like some network failures are unavoidable scaling > > up, so resending requests on one end and ensuring unicity on the other > > seems a solid approach. > > From your numbers, it's a pretty dramatic step up and I can't wait to try > > it. > > > > Is there any significant impact on jobs that wouldn't require the > > additional precautions? In other words, if you take a job that succeeds > > with current trunk, do the communication-heavy phases take considerably > > longer? > > > > Other than that, only some comments about maintainability. > > > > Thanks!
The additional overhead is an additional integer that is send/received with every request (small price to pay for reliability and also small when we are bulking requests). Running tests now, but don't expect any noticeable differences. > On Aug. 21, 2012, 4:53 p.m., Alessandro Presta wrote: > > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/NettyClient.java, > > line 117 > > <https://reviews.apache.org/r/6687/diff/3/?file=142973#file142973line117> > > > > Can you put a more explanatory comment here? You're just making the > > previously hardcoded WAITING_REQUEST_MSECS parameter configurable, right? > > So maybe something like "Waiting interval for outstanding requests to > > finish". Changed. > On Aug. 21, 2012, 4:53 p.m., Alessandro Presta wrote: > > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/NettyClient.java, > > line 211 > > <https://reviews.apache.org/r/6687/diff/3/?file=142973#file142973line211> > > > > I would turn this 13 into a named constant and briefly explain how you > > get to this number, otherwise this is difficult to maintain. Fixed. This is a good idea. > On Aug. 21, 2012, 4:53 p.m., Alessandro Presta wrote: > > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/ResponseClientHandler.java, > > line 44 > > <https://reviews.apache.org/r/6687/diff/3/?file=142979#file142979line44> > > > > Typo (missing blank). Good eye, fixed. > On Aug. 21, 2012, 4:53 p.m., Alessandro Presta wrote: > > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/RequestServerHandler.java, > > line 91 > > <https://reviews.apache.org/r/6687/diff/3/?file=142978#file142978line91> > > > > I normally don't like having test code paths in non-test classes. Is > > there absolutely no way around this? I can understand the apprehension, but testing failure scenarios is a bit tricky without in-code instrumentation. The alternative would be to refactor the logic and try to mock it in, but this is a bit yucky as well, since we would only be factoring for testing the failure, not for any other purpose. > On Aug. 21, 2012, 4:53 p.m., Alessandro Presta wrote: > > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/WorkerIdRequestId.java, > > line 24 > > <https://reviews.apache.org/r/6687/diff/3/?file=142980#file142980line24> > > > > Just a matter of taste, but I think we could use a more meaningful name > > here. > > Some ideas: GlobalRequestId (as this is a global, unique identifier), > > WorkerRequestIdPair (super-explicit). > > Also I would specify in the comments that we're referring to the > > destination worker. I agree. I changed the same to ClientRequestId since it is not globally unique, but rather unique to a client. > On Aug. 21, 2012, 4:53 p.m., Alessandro Presta wrote: > > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/NettyClient.java, > > line 66 > > <https://reviews.apache.org/r/6687/diff/3/?file=142973#file142973line66> > > > > Unrelated to your changes, but I think these should go in GiraphJob > > like all other config options, even if Netty-specific. I'd like to address this in a followup JIRA if that is okay. - Avery ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/6687/#review10573 ----------------------------------------------------------- On Aug. 21, 2012, 7:47 p.m., Avery Ching wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/6687/ > ----------------------------------------------------------- > > (Updated Aug. 21, 2012, 7:47 p.m.) > > > Review request for giraph. > > > Description > ------- > > One of the biggest scalability challenges is getting Giraph to run reliably > on a large number of tasks (i.e. > 200). Several problems exist: > > 1) If the connection fails after the initial connection was made, the job > will die. > 2) Requests must be completed exactly once. This is difficult to implement, > but required since we cannot have multiple retried requests succeed (i.e. a > vertex gets more messages than expected). > 3) Sometimes there are unresolved addresses, causing failure. > > This patch addresses these issues by re-establishing failed connections and > keep tracking of every request sent to every worker. If the request fails or > passes a timeout, it will be resent. The server will keep track of requests > that succeeded to insure that the same request won't be processed more than > once. The structure for keeping track of the succeeded requests on the server > is efficient for handling increasing request ids (IncreasingBitSet). For > handling unresolved addresses, I added retry logic to keep trying to resolve > the problem. > > This patch also adds several unit tests that use fault injection to simulate > a lost response or a closed channel exception on the server. It also has > unittests for IncreasingBitSet to insure it is working correctly and > efficiently. > > This passes all unittests (including the new ones). Additionally, I have some > experience results as well. > > Previously, I was unable to run reliably with more than 200 workers. With > this change I can reliably run 500+ workers. I also ran with 600 workers > successfully. This is a really big reliability win for us. > > I can see the code working to do reconnections and re-issue requests when > necessary. It's very cool. > > I.e. > > 2012-08-18 00:16:52,109 INFO org.apache.giraph.comm.NettyClient: > checkAndFixChannel: Fixing disconnected channel to > xxx.xxx.xxx.xxx/xx.xx.xx.xx:30455, open = false, bound = false > 2012-08-18 00:16:52,111 INFO org.apache.giraph.comm.NettyClient: > checkAndFixChannel: Connected to xxx.xxx.xxx.xxx/xxx.xxx.xxx.xxx:30455! > 2012-08-18 00:16:52,123 INFO org.apache.giraph.comm.NettyClient: > checkAndFixChannel: Fixing disconnected channel to > xxx.xxx.xxx.xxx/xxx.xxx.xxx.xxx, open = false, bound = false > 2012-08-18 00:16:52,124 INFO org.apache.giraph.comm.NettyClient: > checkAndFixChannel: Connected to xxx.xxx.xxx.xxx/xxx.xxx.xxx.xxx:30117! > > > This addresses bug GIRAPH-306. > https://issues.apache.org/jira/browse/GIRAPH-306 > > > Diffs > ----- > > > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/AddressRequestIdGenerator.java > PRE-CREATION > > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/ChannelRotater.java > 1375393 > > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/ClientRequestId.java > PRE-CREATION > > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/IncreasingBitSet.java > PRE-CREATION > > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/NettyClient.java > 1375393 > > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/NettyServer.java > 1375393 > > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/NettyWorkerClient.java > 1375393 > > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/RequestDecoder.java > 1375393 > > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/RequestInfo.java > 1375393 > > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/RequestServerHandler.java > 1375393 > > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/ResponseClientHandler.java > 1375393 > > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/WorkerRequestReservedMap.java > PRE-CREATION > > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/WritableRequest.java > 1375393 > > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/graph/BspServiceMaster.java > 1375393 > > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/graph/BspServiceWorker.java > 1375393 > > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/graph/GiraphJob.java > 1375393 > > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/graph/WorkerInfo.java > 1375393 > > http://svn.apache.org/repos/asf/giraph/trunk/src/test/java/org/apache/giraph/comm/IncreasingBitSetTest.java > PRE-CREATION > > http://svn.apache.org/repos/asf/giraph/trunk/src/test/java/org/apache/giraph/comm/RequestFailureTest.java > PRE-CREATION > > http://svn.apache.org/repos/asf/giraph/trunk/src/test/java/org/apache/giraph/comm/RequestTest.java > 1375393 > > Diff: https://reviews.apache.org/r/6687/diff/ > > > Testing > ------- > > mvn clean verify > Lots of large test 500-600 workers with PageRankBenchmark > > > Thanks, > > Avery Ching > >
