----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/6687/#review10573 -----------------------------------------------------------
Really cool! It looks like some network failures are unavoidable scaling up, so resending requests on one end and ensuring unicity on the other seems a solid approach. >From your numbers, it's a pretty dramatic step up and I can't wait to try it. Is there any significant impact on jobs that wouldn't require the additional precautions? In other words, if you take a job that succeeds with current trunk, do the communication-heavy phases take considerably longer? Other than that, only some comments about maintainability. Thanks! http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/NettyClient.java <https://reviews.apache.org/r/6687/#comment22634> Unrelated to your changes, but I think these should go in GiraphJob like all other config options, even if Netty-specific. http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/NettyClient.java <https://reviews.apache.org/r/6687/#comment22635> Can you put a more explanatory comment here? You're just making the previously hardcoded WAITING_REQUEST_MSECS parameter configurable, right? So maybe something like "Waiting interval for outstanding requests to finish". http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/NettyClient.java <https://reviews.apache.org/r/6687/#comment22636> I would turn this 13 into a named constant and briefly explain how you get to this number, otherwise this is difficult to maintain. http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/NettyClient.java <https://reviews.apache.org/r/6687/#comment22642> Well spotted! No reason to synchronize the entire while loop. http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/RequestServerHandler.java <https://reviews.apache.org/r/6687/#comment22644> I normally don't like having test code paths in non-test classes. Is there absolutely no way around this? http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/ResponseClientHandler.java <https://reviews.apache.org/r/6687/#comment22645> Typo (missing blank). http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/WorkerIdRequestId.java <https://reviews.apache.org/r/6687/#comment22649> Just a matter of taste, but I think we could use a more meaningful name here. Some ideas: GlobalRequestId (as this is a global, unique identifier), WorkerRequestIdPair (super-explicit). Also I would specify in the comments that we're referring to the destination worker. - Alessandro Presta On Aug. 20, 2012, 6:47 p.m., Avery Ching wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/6687/ > ----------------------------------------------------------- > > (Updated Aug. 20, 2012, 6:47 p.m.) > > > Review request for giraph. > > > Description > ------- > > One of the biggest scalability challenges is getting Giraph to run reliably > on a large number of tasks (i.e. > 200). Several problems exist: > > 1) If the connection fails after the initial connection was made, the job > will die. > 2) Requests must be completed exactly once. This is difficult to implement, > but required since we cannot have multiple retried requests succeed (i.e. a > vertex gets more messages than expected). > 3) Sometimes there are unresolved addresses, causing failure. > > This patch addresses these issues by re-establishing failed connections and > keep tracking of every request sent to every worker. If the request fails or > passes a timeout, it will be resent. The server will keep track of requests > that succeeded to insure that the same request won't be processed more than > once. The structure for keeping track of the succeeded requests on the server > is efficient for handling increasing request ids (IncreasingBitSet). For > handling unresolved addresses, I added retry logic to keep trying to resolve > the problem. > > This patch also adds several unit tests that use fault injection to simulate > a lost response or a closed channel exception on the server. It also has > unittests for IncreasingBitSet to insure it is working correctly and > efficiently. > > This passes all unittests (including the new ones). Additionally, I have some > experience results as well. > > Previously, I was unable to run reliably with more than 200 workers. With > this change I can reliably run 500+ workers. I also ran with 600 workers > successfully. This is a really big reliability win for us. > > I can see the code working to do reconnections and re-issue requests when > necessary. It's very cool. > > I.e. > > 2012-08-18 00:16:52,109 INFO org.apache.giraph.comm.NettyClient: > checkAndFixChannel: Fixing disconnected channel to > xxx.xxx.xxx.xxx/xx.xx.xx.xx:30455, open = false, bound = false > 2012-08-18 00:16:52,111 INFO org.apache.giraph.comm.NettyClient: > checkAndFixChannel: Connected to xxx.xxx.xxx.xxx/xxx.xxx.xxx.xxx:30455! > 2012-08-18 00:16:52,123 INFO org.apache.giraph.comm.NettyClient: > checkAndFixChannel: Fixing disconnected channel to > xxx.xxx.xxx.xxx/xxx.xxx.xxx.xxx, open = false, bound = false > 2012-08-18 00:16:52,124 INFO org.apache.giraph.comm.NettyClient: > checkAndFixChannel: Connected to xxx.xxx.xxx.xxx/xxx.xxx.xxx.xxx:30117! > > > This addresses bug GIRAPH-306. > https://issues.apache.org/jira/browse/GIRAPH-306 > > > Diffs > ----- > > > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/AddressRequestIdGenerator.java > PRE-CREATION > > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/ChannelRotater.java > 1374192 > > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/IncreasingBitSet.java > PRE-CREATION > > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/NettyClient.java > 1374192 > > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/NettyServer.java > 1374192 > > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/NettyWorkerClient.java > 1374192 > > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/RequestDecoder.java > 1374192 > > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/RequestInfo.java > 1374192 > > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/RequestServerHandler.java > 1374192 > > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/ResponseClientHandler.java > 1374192 > > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/WorkerIdRequestId.java > PRE-CREATION > > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/WorkerRequestReservedMap.java > PRE-CREATION > > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/WritableRequest.java > 1374192 > > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/graph/BspServiceMaster.java > 1374192 > > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/graph/BspServiceWorker.java > 1374192 > > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/graph/GiraphJob.java > 1374192 > > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/graph/WorkerInfo.java > 1374192 > > http://svn.apache.org/repos/asf/giraph/trunk/src/test/java/org/apache/giraph/comm/IncreasingBitSetTest.java > PRE-CREATION > > http://svn.apache.org/repos/asf/giraph/trunk/src/test/java/org/apache/giraph/comm/RequestFailureTest.java > PRE-CREATION > > http://svn.apache.org/repos/asf/giraph/trunk/src/test/java/org/apache/giraph/comm/RequestTest.java > 1374192 > > Diff: https://reviews.apache.org/r/6687/diff/ > > > Testing > ------- > > mvn clean verify > Lots of large test 500-600 workers with PageRankBenchmark > > > Thanks, > > Avery Ching > >
