[ 
https://issues.apache.org/jira/browse/GIRAPH-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13436144#comment-13436144
 ] 

Eli Reisman commented on GIRAPH-300:
------------------------------------

This patch is great. I have now completed several large runs on a cluster busy 
enough to upset Netty under the old code. Wow! The use case where Giraph 
shares a cluster with other Hadoop (and especially Pig) jobs that come on and 
off the grid during a long Giraph run has been a pain point all summer, and 
this is really helping. I think this use case will be typical for a lot of 
users, especially those who want to give Giraph a try on an existing Hadoop 
test cluster that people are debugging jobs on. Handling this situation with 
some grace is a big step. Great contribution, thanks again!

                
> Improve netty reliability with retrying failed connections, tracking 
> requests, thread-safe hash partitioning
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: GIRAPH-300
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-300
>             Project: Giraph
>          Issue Type: Improvement
>            Reporter: Avery Ching
>            Assignee: Avery Ching
>         Attachments: GIRAPH-300.2.patch, GIRAPH-300.patch
>
>
> * Upgrade to the most recent stable version of Netty (3.5.3.Final)
> * Try multiple connection attempts up to n failures
> * Track requests throughout the system by keeping track of the request id and 
> then matching the request id to the response (minor refactoring of 
> WritableRequest to make requests simpler and support the request id)
> * Improved handling of netty exceptions by dumping the exception stack to 
> help debug failures
> * Fixes bug in HashWorkerPartitioner by making partitionList thread-safe 
> (this causes divide by zero exceptions in real life)
> Currently, netty connection failures cause issues with more than 75 workers 
> in my setup.  This allows us to reach 200+ workers in a reasonably reliable 
> network that doesn't kill connections.
> This code passes the local Hadoop regressions and the single node Hadoop 
> instance regressions.  It also succeeded on large runs (200+ workers) on a 
> real Hadoop cluster.
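
For readers unfamiliar with the request-tracking idea in the description above, 
here is a minimal sketch of matching a request id to its response. It is not 
the code from the patch: the class name RequestTracker and its methods are 
hypothetical, and the real change works through WritableRequest and the netty 
handlers, which are not shown here.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/**
 * Hypothetical helper (not a Giraph class): assigns an id to each outgoing
 * request and remembers it until the matching response arrives.
 */
final class RequestTracker {
    // Monotonically increasing id source, safe to call from many threads.
    private final AtomicLong nextRequestId = new AtomicLong(0);
    // Requests that were sent but not yet acknowledged, keyed by request id.
    private final Map<Long, String> openRequests = new ConcurrentHashMap<>();

    /** Records a new outgoing request and returns the id to send with it. */
    long register(String description) {
        long requestId = nextRequestId.getAndIncrement();
        openRequests.put(requestId, description);
        return requestId;
    }

    /**
     * Marks the request as answered. Returns false if the id is unknown or
     * was already completed, which a caller could log as a duplicate response.
     */
    boolean complete(long requestId) {
        return openRequests.remove(requestId) != null;
    }

    /** Number of requests still waiting for a response. */
    int openCount() {
        return openRequests.size();
    }
}
```

With something like this, a sender would call register() before writing a 
request and complete() when the response carrying the same id comes back. Any 
ids still open after a connection failure are candidates for a resend.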

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
