-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/6600/#review10330
-----------------------------------------------------------


This is a big win, thanks!
So it was a reliability issue masquerading as a scalability one: more workers -> 
increased probability of network failure -> waiting forever on lost requests.
Now, is there anything that can be done to minimize those failures in the first 
place, or does it just depend on the cluster setup?
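
To make the "waiting forever on lost requests" failure mode concrete, here is a 
minimal sketch of request-id bookkeeping in the spirit of the patch description. 
The class OpenRequestTracker and all of its method names are hypothetical and 
are not taken from the Giraph code; timestamps are passed in explicitly only to 
keep the example deterministic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical tracker for in-flight requests; not the actual Giraph class. */
class OpenRequestTracker {
  /** Source of unique request ids. */
  private final AtomicLong nextRequestId = new AtomicLong();
  /** Request id -> time in milliseconds at which the request was sent. */
  private final Map<Long, Long> openRequests = new ConcurrentHashMap<>();
  /** How long to wait for a response before a request counts as lost. */
  private final long timeoutMillis;

  OpenRequestTracker(long timeoutMillis) {
    this.timeoutMillis = timeoutMillis;
  }

  /** Record a newly sent request and return the id to attach to it. */
  long registerRequest(long nowMillis) {
    long requestId = nextRequestId.getAndIncrement();
    openRequests.put(requestId, nowMillis);
    return requestId;
  }

  /** Match a response to its request; false if the id is unknown or already done. */
  boolean completeRequest(long requestId) {
    return openRequests.remove(requestId) != null;
  }

  /** Ids of requests that have waited at least timeoutMillis (resend candidates). */
  List<Long> findTimedOut(long nowMillis) {
    List<Long> timedOut = new ArrayList<>();
    for (Map.Entry<Long, Long> entry : openRequests.entrySet()) {
      if (nowMillis - entry.getValue() >= timeoutMillis) {
        timedOut.add(entry.getKey());
      }
    }
    return timedOut;
  }
}
```

With something like this, the client can periodically ask for timed-out ids and 
resend or fail them instead of blocking indefinitely on a response that will 
never arrive.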

I didn't know we had that partitioning bug. When is updatePartitionOwners() 
called concurrently with getPartitionOwner()? I guess we might be processing 
vertex requests while doing the partition exchange?
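
For illustration, one way such a race could surface as a divide by zero, and 
one way to avoid it. This is a sketch under assumptions: I am assuming the 
owner list is briefly empty while it is being rebuilt, so a concurrent lookup 
computes hash % 0. SnapshotPartitioner and its String owners are hypothetical 
stand-ins, and publishing an immutable snapshot through a volatile field is my 
choice for the example, not necessarily what the patch does.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical stand-in for a hash-based partitioner; not the Giraph class. */
class SnapshotPartitioner {
  /** Readers always see a complete list, never a half-rebuilt one. */
  private volatile List<String> partitionOwners = Collections.emptyList();

  /** Build the new list off to the side, then publish it with a single write. */
  void updatePartitionOwners(List<String> newOwners) {
    partitionOwners = Collections.unmodifiableList(new ArrayList<>(newOwners));
  }

  /** Look up the owner for a vertex id hash using one consistent snapshot. */
  String getPartitionOwner(int vertexIdHash) {
    List<String> snapshot = partitionOwners; // read the volatile field once
    if (snapshot.isEmpty()) {
      // Fail with a clear message instead of an ArithmeticException from % 0.
      throw new IllegalStateException("No partition owners assigned yet");
    }
    // floorMod keeps the index non-negative even for negative hashes.
    return snapshot.get(Math.floorMod(vertexIdHash, snapshot.size()));
  }
}
```

The point of the snapshot is that size() and get() are called on the same list 
object, so an update in progress cannot change the list between the two calls.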


http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/NettyClient.java
<https://reviews.apache.org/r/6600/#comment21950>

    This looks a bit weird (using a TimedLogger for the timing but doing the
    actual logging on the raw Logger), although I see where the problem is (we
    need to log several times in a row without waiting for the next deadline).



http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/NettyServer.java
<https://reviews.apache.org/r/6600/#comment21947>

    Missing a javadoc here.



http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/utils/TimedLogger.java
<https://reviews.apache.org/r/6600/#comment21949>

    We could call isPrintable() here to avoid duplication.
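
    Roughly what I have in mind, as a sketch only. SimpleTimedLogger below is a
    hypothetical simplification, not the real TimedLogger: the method
    signatures, the explicit timestamp argument, and the in-memory list used
    instead of a real Logger are all assumptions made to keep it self-contained.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical rate-limited logger; not Giraph's TimedLogger. */
class SimpleTimedLogger {
  /** Minimum time in milliseconds between two printed messages. */
  private final long intervalMillis;
  /** Time of the last printed message; only meaningful once printedOnce is true. */
  private long lastPrintMillis;
  /** Whether anything has been printed yet (the first message always prints). */
  private boolean printedOnce = false;
  /** Stand-in for the underlying Logger so the example has no dependencies. */
  final List<String> printed = new ArrayList<>();

  SimpleTimedLogger(long intervalMillis) {
    this.intervalMillis = intervalMillis;
  }

  /** The single place where the deadline check lives. */
  boolean isPrintable(long nowMillis) {
    return !printedOnce || nowMillis - lastPrintMillis >= intervalMillis;
  }

  /** Print only when the deadline has passed, reusing isPrintable(). */
  void info(long nowMillis, String message) {
    if (isPrintable(nowMillis)) {
      printed.add(message);
      lastPrintMillis = nowMillis;
      printedOnce = true;
    }
  }
}
```

    A caller that needs to emit several lines per deadline can check
    isPrintable() once and then log each line directly, which keeps the
    deadline logic in one method.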


- Alessandro Presta


On Aug. 14, 2012, 7:32 a.m., Avery Ching wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/6600/
> -----------------------------------------------------------
> 
> (Updated Aug. 14, 2012, 7:32 a.m.)
> 
> 
> Review request for giraph.
> 
> 
> Description
> -------
> 
> * Upgrade to the most recent stable version of Netty (3.5.3.Final)
> * Try multiple connection attempts up to n failures
> * Track requests throughout the system by keeping track of the request id and 
> then matching the request id to the response (minor refactoring of 
> WritableRequest to make requests simpler and support the request id)
> * Improve handling of Netty exceptions by dumping the exception stack to 
> help debug failures
> * Fix a bug in HashWorkerPartitioner by making partitionList thread-safe 
> (the bug causes divide-by-zero exceptions in real life)
> 
> 
> This addresses bug GIRAPH-300.
>     https://issues.apache.org/jira/browse/GIRAPH-300
> 
> 
> Diffs
> -----
> 
>   http://svn.apache.org/repos/asf/giraph/trunk/pom.xml 1372575 
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/NettyClient.java 1372575 
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/NettyServer.java 1372575 
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/NettyWorkerClient.java 1372575 
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/RequestInfo.java PRE-CREATION 
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/RequestServerHandler.java 1372575 
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/ResponseClientHandler.java 1372575 
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/SendPartitionMessagesRequest.java 1372575 
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/SendPartitionMutationsRequest.java 1372575 
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/SendVertexRequest.java 1372575 
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/WritableRequest.java 1372575 
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/graph/BspServiceMaster.java 1372575 
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/graph/GiraphJob.java 1372575 
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/HashWorkerPartitioner.java 1372575 
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/utils/TimedLogger.java 1372575 
>   http://svn.apache.org/repos/asf/giraph/trunk/src/test/java/org/apache/giraph/comm/ConnectionTest.java 1372575 
> 
> Diff: https://reviews.apache.org/r/6600/diff/
> 
> 
> Testing
> -------
> 
> Currently, Netty connection failures cause issues with more than 75 workers 
> in my setup. This patch allows us to reach 200+ workers in a reasonably 
> reliable network that doesn't kill connections.
> 
> This code passes the local Hadoop regressions and the single node Hadoop 
> instance regressions. It also succeeded on large runs (200+ workers) on a 
> real Hadoop cluster.
> 
> 
> Thanks,
> 
> Avery Ching
> 
>
