Avery Ching created GIRAPH-300:
----------------------------------
Summary: Improve netty reliability with retrying failed
connections, tracking requests, thread-safe hash partitioning
Key: GIRAPH-300
URL: https://issues.apache.org/jira/browse/GIRAPH-300
Project: Giraph
Issue Type: Improvement
Reporter: Avery Ching
Assignee: Avery Ching
* Upgrade to the most recent stable version of Netty (3.5.3.Final)
* Retry failed connection attempts, giving up after n failures
* Track requests throughout the system by recording each request id and
matching it to the corresponding response (minor refactoring of
WritableRequest to make requests simpler and support the request id)
* Improve handling of Netty exceptions by dumping the exception stack trace to
help debug failures
* Fix a bug in HashWorkerPartitioner by making partitionList thread-safe (this
bug causes divide-by-zero exceptions in real life)
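As a rough illustration of the request-tracking item above, here is a minimal
sketch of matching responses to outstanding requests by id. RequestTracker and
its methods are hypothetical names chosen for this example; this is not the
code in the patch, and it does not use WritableRequest.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical tracker for illustration only; not the Giraph implementation. */
final class RequestTracker {
    /** Source of unique request ids; safe to call from several threads. */
    private final AtomicLong nextRequestId = new AtomicLong(0);
    /** Requests that were sent but have not been answered yet, keyed by id. */
    private final Map<Long, String> openRequests = new ConcurrentHashMap<>();

    /** Assigns a fresh id and remembers the request until its response arrives. */
    long register(String description) {
        long requestId = nextRequestId.getAndIncrement();
        openRequests.put(requestId, description);
        return requestId;
    }

    /**
     * Called when a response carrying a request id arrives.
     * Returns true if the id matched an open request, false if the id is
     * unknown or was already completed.
     */
    boolean complete(long requestId) {
        return openRequests.remove(requestId) != null;
    }

    /** Number of requests still waiting for a response. */
    int openCount() {
        return openRequests.size();
    }
}
```

Anything still listed as open after a timeout is a candidate for logging or
resending, which is the kind of visibility the request id is meant to give.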
Currently, Netty connection failures cause problems with more than 75 workers
in my setup. This change lets us reach 200+ workers on a reasonably reliable
network that doesn't kill connections.
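For reference, the retry behaviour can be sketched roughly as follows. This is
an illustrative example under assumed names (ConnectRetry, connectWithRetry,
maxFailures), not the patch itself; the connection attempt is passed in as a
plain Callable so the sketch does not depend on any Netty API.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

/** Hypothetical helper for illustration only; not the Giraph implementation. */
final class ConnectRetry {
    private ConnectRetry() { }

    /**
     * Runs the connection attempt until it succeeds or has failed
     * maxFailures times, then gives up with the last failure as the cause.
     */
    static <T> T connectWithRetry(Callable<T> attempt, int maxFailures)
            throws IOException {
        if (maxFailures < 1) {
            throw new IllegalArgumentException("maxFailures must be >= 1");
        }
        Exception lastFailure = null;
        for (int failures = 0; failures < maxFailures; failures++) {
            try {
                return attempt.call();
            } catch (Exception e) {
                // Remember the failure and report it so it can be debugged.
                lastFailure = e;
                System.err.println("Connection attempt " + (failures + 1)
                    + " of " + maxFailures + " failed: " + e);
            }
        }
        throw new IOException(
            "Giving up after " + maxFailures + " failed attempts", lastFailure);
    }
}
```

A real client would probably also wait between attempts; that is left out here
to keep the sketch short.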
This code passes the local Hadoop regressions and the single node Hadoop
instance regressions. It also succeeded on large runs (200+ workers) on a real
Hadoop cluster.