I found there was a bit more pressure on memory and a bit more time to finish the job, but not an outlandish amount of either.
On Tue, Aug 21, 2012 at 9:53 AM, Alessandro Presta <[email protected]>wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/6687/#review10573 > ----------------------------------------------------------- > > > Really cool! It looks like some network failures are unavoidable scaling > up, so resending requests on one end and ensuring unicity on the other > seems a solid approach. > From your numbers, it's a pretty dramatic step up and I can't wait to try > it. > > Is there any significant impact on jobs that wouldn't require the > additional precautions? In other words, if you take a job that succeeds > with current trunk, do the communication-heavy phases take considerably > longer? > > Other than that, only some comments about maintainability. > > Thanks! > > > > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/NettyClient.java > <https://reviews.apache.org/r/6687/#comment22634> > > Unrelated to your changes, but I think these should go in GiraphJob > like all other config options, even if Netty-specific. > > > > > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/NettyClient.java > <https://reviews.apache.org/r/6687/#comment22635> > > Can you put a more explanatory comment here? You're just making the > previously hardcoded WAITING_REQUEST_MSECS parameter configurable, right? > So maybe something like "Waiting interval for outstanding requests to > finish". > > > > > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/NettyClient.java > <https://reviews.apache.org/r/6687/#comment22636> > > I would turn this 13 into a named constant and briefly explain how you > get to this number, otherwise this is difficult to maintain. > > > > > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/NettyClient.java > <https://reviews.apache.org/r/6687/#comment22642> > > Well spotted! No reason to synchronize the entire while loop. > > > > > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/RequestServerHandler.java > <https://reviews.apache.org/r/6687/#comment22644> > > I normally don't like having test code paths in non-test classes. Is > there absolutely no way around this? > > > > > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/ResponseClientHandler.java > <https://reviews.apache.org/r/6687/#comment22645> > > Typo (missing blank). > > > > > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/WorkerIdRequestId.java > <https://reviews.apache.org/r/6687/#comment22649> > > Just a matter of taste, but I think we could use a more meaningful > name here. > Some ideas: GlobalRequestId (as this is a global, unique identifier), > WorkerRequestIdPair (super-explicit). > Also I would specify in the comments that we're referring to the > destination worker. > > > - Alessandro Presta > > > On Aug. 20, 2012, 6:47 p.m., Avery Ching wrote: > > > > ----------------------------------------------------------- > > This is an automatically generated e-mail. To reply, visit: > > https://reviews.apache.org/r/6687/ > > ----------------------------------------------------------- > > > > (Updated Aug. 20, 2012, 6:47 p.m.) > > > > > > Review request for giraph. > > > > > > Description > > ------- > > > > One of the biggest scalability challenges is getting Giraph to run > reliably on a large number of tasks (i.e. > 200). Several problems exist: > > > > 1) If the connection fails after the initial connection was made, the > job will die. > > 2) Requests must be completed exactly once. This is difficult to > implement, but required since we cannot have multiple retried requests > succeed (i.e. a vertex gets more messages than expected). > > 3) Sometimes there are unresolved addresses, causing failure. > > > > This patch addresses these issues by re-establishing failed connections > and keep tracking of every request sent to every worker. If the request > fails or passes a timeout, it will be resent. The server will keep track of > requests that succeeded to insure that the same request won't be processed > more than once. The structure for keeping track of the succeeded requests > on the server is efficient for handling increasing request ids > (IncreasingBitSet). For handling unresolved addresses, I added retry logic > to keep trying to resolve the problem. > > > > This patch also adds several unit tests that use fault injection to > simulate a lost response or a closed channel exception on the server. It > also has unittests for IncreasingBitSet to insure it is working correctly > and efficiently. > > > > This passes all unittests (including the new ones). Additionally, I have > some experience results as well. > > > > Previously, I was unable to run reliably with more than 200 workers. > With this change I can reliably run 500+ workers. I also ran with 600 > workers successfully. This is a really big reliability win for us. > > > > I can see the code working to do reconnections and re-issue requests > when necessary. It's very cool. > > > > I.e. > > > > 2012-08-18 00:16:52,109 INFO org.apache.giraph.comm.NettyClient: > checkAndFixChannel: Fixing disconnected channel to > xxx.xxx.xxx.xxx/xx.xx.xx.xx:30455, open = false, bound = false > > 2012-08-18 00:16:52,111 INFO org.apache.giraph.comm.NettyClient: > checkAndFixChannel: Connected to xxx.xxx.xxx.xxx/xxx.xxx.xxx.xxx:30455! > > 2012-08-18 00:16:52,123 INFO org.apache.giraph.comm.NettyClient: > checkAndFixChannel: Fixing disconnected channel to > xxx.xxx.xxx.xxx/xxx.xxx.xxx.xxx, open = false, bound = false > > 2012-08-18 00:16:52,124 INFO org.apache.giraph.comm.NettyClient: > checkAndFixChannel: Connected to xxx.xxx.xxx.xxx/xxx.xxx.xxx.xxx:30117! > > > > > > This addresses bug GIRAPH-306. > > https://issues.apache.org/jira/browse/GIRAPH-306 > > > > > > Diffs > > ----- > > > > > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/AddressRequestIdGenerator.javaPRE-CREATION > > > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/ChannelRotater.java1374192 > > > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/IncreasingBitSet.javaPRE-CREATION > > > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/NettyClient.java1374192 > > > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/NettyServer.java1374192 > > > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/NettyWorkerClient.java1374192 > > > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/RequestDecoder.java1374192 > > > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/RequestInfo.java1374192 > > > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/RequestServerHandler.java1374192 > > > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/ResponseClientHandler.java1374192 > > > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/WorkerIdRequestId.javaPRE-CREATION > > > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/WorkerRequestReservedMap.javaPRE-CREATION > > > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/WritableRequest.java1374192 > > > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/graph/BspServiceMaster.java1374192 > > > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/graph/BspServiceWorker.java1374192 > > > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/graph/GiraphJob.java1374192 > > > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/graph/WorkerInfo.java1374192 > > > http://svn.apache.org/repos/asf/giraph/trunk/src/test/java/org/apache/giraph/comm/IncreasingBitSetTest.javaPRE-CREATION > > > http://svn.apache.org/repos/asf/giraph/trunk/src/test/java/org/apache/giraph/comm/RequestFailureTest.javaPRE-CREATION > > > http://svn.apache.org/repos/asf/giraph/trunk/src/test/java/org/apache/giraph/comm/RequestTest.java1374192 > > > > Diff: https://reviews.apache.org/r/6687/diff/ > > > > > > Testing > > ------- > > > > mvn clean verify > > Lots of large test 500-600 workers with PageRankBenchmark > > > > > > Thanks, > > > > Avery Ching > > > > > >
