-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/6600/#review10330
-----------------------------------------------------------

This is a big win, thanks!
So it was a reliability issue masked as a scalability one: more workers -> increased probability of network failure -> waiting forever on lost requests.
Now is there anything that can be done to minimize those failures in the first place, or does it just depend on the cluster setup?

I didn't know we had that partitioning bug. When is updatePartitionOwners() called concurrently with getPartitionOwner()? I guess we might be processing vertex requests while doing the partition exchange?


http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/NettyClient.java
<https://reviews.apache.org/r/6600/#comment21950>

    This looks a bit weird (using a TimedLogger for the timing but doing the actual logging on the raw Logger) although I see where the problem is (need to call it multiple times without waiting for the next deadline).


http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/NettyServer.java
<https://reviews.apache.org/r/6600/#comment21947>

    Missing a javadoc here.


http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/utils/TimedLogger.java
<https://reviews.apache.org/r/6600/#comment21949>

    We could call isPrintable() here to avoid duplication.


- Alessandro Presta


On Aug. 14, 2012, 7:32 a.m., Avery Ching wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/6600/
> -----------------------------------------------------------
> 
> (Updated Aug. 14, 2012, 7:32 a.m.)
> 
> 
> Review request for giraph.
> 
> 
> Description
> -------
> 
> * Upgrade to the most recent stable version of Netty (3.5.3.Final)
> * Try multiple connection attempts up to n failures
> * Track requests throughout the system by keeping track of the request id and
>   then matching the request id to the response (minor refactoring of
>   WritableRequest to make requests simpler and support the request id)
> * Improved handling of netty exceptions by dumping the exception stack to
>   help debug failures
> * Fixes bug in HashWorkerPartitioner by making partitionList thread-safe
>   (this causes divide by zero exceptions in real life)
> 
> 
> This addresses bug GIRAPH-300.
>     https://issues.apache.org/jira/browse/GIRAPH-300
> 
> 
> Diffs
> -----
> 
>   http://svn.apache.org/repos/asf/giraph/trunk/pom.xml 1372575
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/NettyClient.java 1372575
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/NettyServer.java 1372575
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/NettyWorkerClient.java 1372575
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/RequestInfo.java PRE-CREATION
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/RequestServerHandler.java 1372575
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/ResponseClientHandler.java 1372575
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/SendPartitionMessagesRequest.java 1372575
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/SendPartitionMutationsRequest.java 1372575
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/SendVertexRequest.java 1372575
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/WritableRequest.java 1372575
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/graph/BspServiceMaster.java 1372575
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/graph/GiraphJob.java 1372575
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/HashWorkerPartitioner.java 1372575
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/utils/TimedLogger.java 1372575
>   http://svn.apache.org/repos/asf/giraph/trunk/src/test/java/org/apache/giraph/comm/ConnectionTest.java 1372575
> 
> Diff: https://reviews.apache.org/r/6600/diff/
> 
> 
> Testing
> -------
> 
> Currently, netty connection failures cause issues with more than 75 workers
> in my setup. This allows us to reach 200+ workers in a reasonably reliable
> network that doesn't kill connections.
> 
> This code passes the local Hadoop regressions and the single node Hadoop
> instance regressions. It also succeeded on large runs (200+ workers) on a
> real Hadoop cluster.
> 
> 
> Thanks,
> 
> Avery Ching
> 
>
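
For reference, on the "waiting forever on lost requests" point: a minimal sketch of tracking requests by id and matching each response back to its request, so that a lost request can be detected after a timeout instead of blocking forever. This is not the code in the patch. The class RequestTracker and its methods are hypothetical names used only for illustration, and the caller passes timestamps in explicitly to keep the example deterministic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical illustration only; not a class from the Giraph patch. */
class RequestTracker {
  /** Source of unique request ids, starting at 0. */
  private final AtomicLong nextId = new AtomicLong();
  /** Outstanding request id -> time the request was sent (milliseconds). */
  private final Map<Long, Long> sentAtMillis = new ConcurrentHashMap<>();

  /** Record a newly sent request and return the id to attach to it. */
  long register(long nowMillis) {
    long id = nextId.getAndIncrement();
    sentAtMillis.put(id, nowMillis);
    return id;
  }

  /** Match a response to its request; false if the id is unknown or already acknowledged. */
  boolean acknowledge(long requestId) {
    return sentAtMillis.remove(requestId) != null;
  }

  /** Ids of requests that have waited at least timeoutMillis without a response. */
  List<Long> overdue(long nowMillis, long timeoutMillis) {
    List<Long> result = new ArrayList<>();
    for (Map.Entry<Long, Long> entry : sentAtMillis.entrySet()) {
      if (nowMillis - entry.getValue() >= timeoutMillis) {
        result.add(entry.getKey());
      }
    }
    return result;
  }

  /** Number of requests still waiting for a response. */
  int outstanding() {
    return sentAtMillis.size();
  }
}
```

A client loop could poll overdue() periodically and resend or fail those requests, which turns a silent hang into a visible, recoverable event. Whether to resend or abort is a policy choice this sketch does not make.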

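
On the partitioning question: a minimal sketch of how a concurrent update could produce a divide by zero, and one way to avoid it. The assumption here (not confirmed by the diff) is that the owner lookup computes hash % list.size() while another thread is clearing and refilling the same list, so size() can momentarily be 0. The class SnapshotPartitioner below is hypothetical and uses plain strings as owners; it swaps in an immutable snapshot rather than mutating a shared list, which is one option among several (a synchronized list would be another).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical stand-in for a hash partitioner; not the Giraph class. */
class SnapshotPartitioner {
  /** Immutable snapshot of owners; replaced wholesale, never mutated in place. */
  private volatile List<String> owners = Collections.emptyList();

  /** Publish a new owner list with a single volatile write. */
  void updatePartitionOwners(List<String> newOwners) {
    owners = Collections.unmodifiableList(new ArrayList<>(newOwners));
  }

  /** Read the snapshot once so size() and get() always see the same list. */
  String getPartitionOwner(int vertexHash) {
    List<String> snapshot = owners;
    if (snapshot.isEmpty()) {
      // Fail with a clear message instead of an ArithmeticException from "% 0".
      throw new IllegalStateException("No partition owners published yet");
    }
    // floorMod keeps the index non-negative even for negative hash codes.
    return snapshot.get(Math.floorMod(vertexHash, snapshot.size()));
  }
}
```

Readers never observe a half-updated list with this approach, at the cost of copying the list on every update. That trade-off seems reasonable if updates only happen during the partition exchange, but that is an assumption about the call pattern.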