> On Aug. 15, 2012, 2:03 p.m., Alessandro Presta wrote:
> > This is a big win, thanks!
> > So it was a reliability issue masked as a scalability one: more workers -> 
> > increased probability of network failure -> waiting forever on lost 
> > requests.
> > Now is there anything that can be done to minimize those failures in the 
> > first place, or does it just depend on the cluster setup?
> > 
> > I didn't know we had that partitioning bug. When is updatePartitionOwners() 
> > called concurrently with getPartitionOwner()? I guess we might be 
> > processing vertex requests while doing the partition exchange?

Thanks for the review!

Next patch can improve the situation by retrying. =)  The problem with 
updatePartitionOwners() is that it is called by the main thread in 
BspServiceWorker#setup(), but getPartitionOwners() is called by the netty 
server threads.


> On Aug. 15, 2012, 2:03 p.m., Alessandro Presta wrote:
> > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/NettyClient.java,
> >  line 389
> > <https://reviews.apache.org/r/6600/diff/1/?file=139832#file139832line389>
> >
> >     This looks a bit weird (using a TimedLogger for the timing but doing 
> > the actual logging on the raw Logger) although I see where the problem is 
> > (need to call it multiple times without waiting for the next deadline).

yeah, we can probably augment TimedLogger to have a level in a future patch.


> On Aug. 15, 2012, 2:03 p.m., Alessandro Presta wrote:
> > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/NettyServer.java,
> >  line 63
> > <https://reviews.apache.org/r/6600/diff/1/?file=139833#file139833line63>
> >
> >     Missing a javadoc here.

Thanks.  Fixed.


> On Aug. 15, 2012, 2:03 p.m., Alessandro Presta wrote:
> > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/utils/TimedLogger.java,
> >  line 51
> > <https://reviews.apache.org/r/6600/diff/1/?file=139845#file139845line51>
> >
> >     We could call isPrintable() here to avoid duplication.

Great suggestion.  Fixed.


- Avery


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/6600/#review10330
-----------------------------------------------------------


On Aug. 15, 2012, 6:52 p.m., Avery Ching wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/6600/
> -----------------------------------------------------------
> 
> (Updated Aug. 15, 2012, 6:52 p.m.)
> 
> 
> Review request for giraph.
> 
> 
> Description
> -------
> 
> * Upgrade to the most recent stable version of Netty (3.5.3.Final)
> * Try multiple connection attempts up to n failures
> * Track requests throughout the system by keeping track of the request id and 
> then matching the request id to the response (minor refactoring of 
> WritableRequest to make requests simpler and support the request id)
> * Improved handling of netty exceptions by dumping the exception stack to 
> help debug failures
> * Fixes bug in HashWorkerPartitioner by making partitionList thread-safe 
> (this causes divide by zero exceptions in real life)
> 
> 
> This addresses bug GIRAPH-300.
>     https://issues.apache.org/jira/browse/GIRAPH-300
> 
> 
> Diffs
> -----
> 
>   http://svn.apache.org/repos/asf/giraph/trunk/pom.xml 1372575 
>   
> http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/NettyClient.java
>  1372575 
>   
> http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/NettyServer.java
>  1372575 
>   
> http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/NettyWorkerClient.java
>  1372575 
>   
> http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/RequestInfo.java
>  PRE-CREATION 
>   
> http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/RequestServerHandler.java
>  1372575 
>   
> http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/ResponseClientHandler.java
>  1372575 
>   
> http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/SendPartitionMessagesRequest.java
>  1372575 
>   
> http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/SendPartitionMutationsRequest.java
>  1372575 
>   
> http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/SendVertexRequest.java
>  1372575 
>   
> http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/WritableRequest.java
>  1372575 
>   
> http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/graph/BspServiceMaster.java
>  1372575 
>   
> http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/graph/GiraphJob.java
>  1372575 
>   
> http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/HashWorkerPartitioner.java
>  1372575 
>   
> http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/utils/TimedLogger.java
>  1372575 
>   
> http://svn.apache.org/repos/asf/giraph/trunk/src/test/java/org/apache/giraph/comm/ConnectionTest.java
>  1372575 
> 
> Diff: https://reviews.apache.org/r/6600/diff/
> 
> 
> Testing
> -------
> 
> Currently, netty connection failures causes issues with more than 75 workers 
> in my setup. This allows us to reach over 200+ in a reasonably reliable 
> network that doesn't kill connections.
> 
> This code passes the local Hadoop regressions and the single node Hadoop 
> instance regressions. It also succeeded on large runs (200+ workers) on a 
> real Hadoop cluster.
> 
> 
> Thanks,
> 
> Avery Ching
> 
>

Reply via email to