I ran this many times over the weekend in its original form ("version 1"), including with 1000+ workers, and had no problems. Barring any checkstyle issues I skipped ;) this thing is solid. In general, Netty seems less happy than before about maintaining large numbers of connections as we scale out, but I suspect that is transitional, and I haven't gotten very far into tuning the configuration for its current incarnation yet either.

On Mon, Aug 20, 2012 at 11:49 AM, Avery Ching (JIRA) <[email protected]> wrote:

>
> [ https://issues.apache.org/jira/browse/GIRAPH-306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>
> Avery Ching updated GIRAPH-306:
> -------------------------------
>
>     Attachment: GIRAPH-306.2.patch
>
> - Added small change to not connect to one's self
> - Also changed the backlog to be the number of workers by default
>
> Also updated https://reviews.apache.org/r/6687/
>
> > Netty requests should be reliable and implement exactly once semantics
> > ----------------------------------------------------------------------
> >
> >          Key: GIRAPH-306
> >          URL: https://issues.apache.org/jira/browse/GIRAPH-306
> >      Project: Giraph
> >   Issue Type: Improvement
> >     Reporter: Avery Ching
> >     Assignee: Avery Ching
> >     Priority: Critical
> >  Attachments: GIRAPH-306.2.patch, GIRAPH-306.patch
> >
> >
> > One of the biggest scalability challenges is getting Giraph to run reliably on a large number of tasks (i.e. > 200). Several problems exist:
> > 1) If the connection fails after the initial connection was made, the job will die.
> > 2) Requests must be completed exactly once. This is difficult to implement, but required since we cannot have multiple retried requests succeed (i.e. a vertex gets more messages than expected).
> > 3) Sometimes there are unresolved addresses, causing failure.
> > This patch addresses these issues by re-establishing failed connections and keep tracking of every request sent to every worker. If the request fails or passes a timeout, it will be resent. The server will keep track of requests that succeeded to insure that the same request won't be processed more than once. The structure for keeping track of the succeeded requests on the server is efficient for handling increasing request ids (IncreasingBitSet). For handling unresolved addresses, I added retry logic to keep trying to resolve the problem.
> > This patch also adds several unit tests that use fault injection to simulate a lost response or a closed channel exception on the server. It also has unittests for IncreasingBitSet to insure it is working correctly and efficiently.
> > This passes all unittests (including the new ones). Additionally, I have some experience results as well.
> > Previously, I was unable to run reliably with more than 200 workers. With this change I can reliably run 500+ workers. I also ran with 600 workers successfully. This is a really big reliability win for us.
> > I can see the code working to do reconnections and re-issue requests when necessary. It's very cool.
> > I.e.
> > 2012-08-18 00:16:52,109 INFO org.apache.giraph.comm.NettyClient: checkAndFixChannel: Fixing disconnected channel to xxx.xxx.xxx.xxx/xx.xx.xx.xx:30455, open = false, bound = false
> > 2012-08-18 00:16:52,111 INFO org.apache.giraph.comm.NettyClient: checkAndFixChannel: Connected to xxx.xxx.xxx.xxx/xxx.xxx.xxx.xxx:30455!
> > 2012-08-18 00:16:52,123 INFO org.apache.giraph.comm.NettyClient: checkAndFixChannel: Fixing disconnected channel to xxx.xxx.xxx.xxx/xxx.xxx.xxx.xxx, open = false, bound = false
> > 2012-08-18 00:16:52,124 INFO org.apache.giraph.comm.NettyClient: checkAndFixChannel: Connected to xxx.xxx.xxx.xxx/xxx.xxx.xxx.xxx:30117!
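
For anyone following along, here is a rough sketch of how server-side duplicate detection over mostly-increasing request ids could work, as described in the quoted issue. This is not the IncreasingBitSet from the patch: the class name SeenRequestIds and its methods are hypothetical, and the sliding-window layout is only my assumption about why such a structure stays small when ids arrive roughly in order.

```java
import java.util.BitSet;

/**
 * Hypothetical sketch (not Giraph's IncreasingBitSet): remembers which
 * request ids were already processed so that a resent request is not
 * applied a second time.
 */
class SeenRequestIds {
    /** Every id strictly below this value has already been seen. */
    private long base = 0;
    /** Bit i set means id (base + i) has been seen. */
    private final BitSet window = new BitSet();

    /**
     * Records a request id.
     *
     * @return true if the id is new (process the request), false if it is
     *         a duplicate (skip processing, just acknowledge it again)
     */
    public synchronized boolean markSeen(long requestId) {
        if (requestId < base) {
            return false; // below the window, so already seen
        }
        long distance = requestId - base;
        if (distance > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("id too far ahead: " + requestId);
        }
        int offset = (int) distance;
        if (window.get(offset)) {
            return false; // inside the window and already marked
        }
        window.set(offset);
        // Slide the window past the contiguous run of seen ids so that
        // memory stays proportional to the out-of-order gap only.
        // (Done on every call for simplicity; a real version could batch it.)
        int firstGap = window.nextClearBit(0);
        if (firstGap > 0) {
            BitSet shifted = window.get(firstGap, window.length());
            window.clear();
            window.or(shifted);
            base += firstGap;
        }
        return true;
    }

    /** Lowest id that has not been seen yet. */
    public synchronized long base() {
        return base;
    }

    /** Number of bits currently needed to track out-of-order ids. */
    public synchronized int windowBits() {
        return window.length();
    }
}
```

In this sketch the server would call markSeen(requestId) before applying a request and only apply it when the call returns true; a resent request whose first attempt already succeeded returns false and is simply acknowledged again.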
