Re: [jira] [Updated] (GIRAPH-306) Netty requests should be reliable and implement exactly once semantics

Eli Reisman Mon, 20 Aug 2012 14:11:34 -0700

I ran this many times this weekend in its original form ("version 1"), with
1000+ workers here and no problems. Barring any checkstyle issues I
skipped, ;) this thing is solid. In general, Netty is less happy than before
with large numbers of connections to maintain as we scale out, but I
suspect that is transitional, and I haven't gotten very far into tuning the
configuration for its current incarnation yet either.



On Mon, Aug 20, 2012 at 11:49 AM, Avery Ching (JIRA) <[email protected]> wrote:

>
>      [
> https://issues.apache.org/jira/browse/GIRAPH-306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel]
>
> Avery Ching updated GIRAPH-306:
> -------------------------------
>
>     Attachment: GIRAPH-306.2.patch
>
> - Added small change to not connect to one's self
> - Also changed the backlog to be the number of workers by default
>
> Also updated https://reviews.apache.org/r/6687/
>
> > Netty requests should be reliable and implement exactly once semantics
> > ----------------------------------------------------------------------
> >
> >                 Key: GIRAPH-306
> >                 URL: https://issues.apache.org/jira/browse/GIRAPH-306
> >             Project: Giraph
> >          Issue Type: Improvement
> >            Reporter: Avery Ching
> >            Assignee: Avery Ching
> >            Priority: Critical
> >         Attachments: GIRAPH-306.2.patch, GIRAPH-306.patch
> >
> >
> > One of the biggest scalability challenges is getting Giraph to run
> reliably on a large number of tasks (i.e. > 200).  Several problems exist:
> > 1) If the connection fails after the initial connection was made, the
> job will die.
> > 2) Requests must be completed exactly once.  This is difficult to
> implement, but required since we cannot have multiple retried requests
> succeed (i.e. a vertex gets more messages than expected).
> > 3) Sometimes there are unresolved addresses, causing failure.
> > This patch addresses these issues by re-establishing failed connections
> and keeping track of every request sent to every worker.  If the request
> fails or passes a timeout, it will be resent.  The server will keep track
> of requests that succeeded to ensure that the same request won't be
> processed more than once.  The structure for keeping track of the succeeded
> requests on the server is efficient for handling increasing request ids
> (IncreasingBitSet).  For handling unresolved addresses, I added retry logic
> to keep trying to resolve the problem.
> > This patch also adds several unit tests that use fault injection to
> simulate a lost response or a closed channel exception on the server.  It
> also has unit tests for IncreasingBitSet to ensure it is working correctly
> and efficiently.
> > This passes all unit tests (including the new ones).  Additionally, I
> have some experimental results as well.
> > Previously, I was unable to run reliably with more than 200 workers.
>  With this change I can reliably run 500+ workers.  I also ran with 600
> workers successfully.  This is a really big reliability win for us.
> > I can see the code working to do reconnections and re-issue requests
> when necessary.  It's very cool.
> > For example:
> > 2012-08-18 00:16:52,109 INFO org.apache.giraph.comm.NettyClient:
> checkAndFixChannel: Fixing disconnected channel to
> xxx.xxx.xxx.xxx/xx.xx.xx.xx:30455, open = false, bound = false
> > 2012-08-18 00:16:52,111 INFO org.apache.giraph.comm.NettyClient:
> checkAndFixChannel: Connected to xxx.xxx.xxx.xxx/xxx.xxx.xxx.xxx:30455!
> > 2012-08-18 00:16:52,123 INFO org.apache.giraph.comm.NettyClient:
> checkAndFixChannel: Fixing disconnected channel to
> xxx.xxx.xxx.xxx/xxx.xxx.xxx.xxx, open = false, bound = false
> > 2012-08-18 00:16:52,124 INFO org.apache.giraph.comm.NettyClient:
> checkAndFixChannel: Connected to xxx.xxx.xxx.xxx/xxx.xxx.xxx.xxx:30117!
>
> --
> This message is automatically generated by JIRA.
> If you think it was sent incorrectly, please contact your JIRA
> administrators:
> https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
> For more information on JIRA, see: http://www.atlassian.com/software/jira
>
>
>
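
The client-side bookkeeping described in the quoted issue (track every
request sent to a worker, resend any request that fails or passes a timeout)
can be sketched roughly as below. This is only an illustration under stated
assumptions, not the code from the GIRAPH-306 patch: the class name
OutstandingRequests, its method names, and the use of a caller-supplied
millisecond clock are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical sketch: remembers when each request id was last sent so that
 * unacknowledged requests can be found and resent after a timeout.
 */
class OutstandingRequests {
    /** Request id -> time (ms) of the most recent send attempt. */
    private final Map<Long, Long> sentAtMillis = new ConcurrentHashMap<>();
    private final long timeoutMillis;

    OutstandingRequests(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** Call on every send and on every resend (a resend restarts the timer). */
    void recordSend(long requestId, long nowMillis) {
        sentAtMillis.put(requestId, nowMillis);
    }

    /** Call when the server acknowledges the request. */
    void recordAck(long requestId) {
        sentAtMillis.remove(requestId);
    }

    /** Ids whose last send is at least timeoutMillis old and still unacked. */
    List<Long> dueForResend(long nowMillis) {
        List<Long> due = new ArrayList<>();
        for (Map.Entry<Long, Long> entry : sentAtMillis.entrySet()) {
            if (nowMillis - entry.getValue() >= timeoutMillis) {
                due.add(entry.getKey());
            }
        }
        return due;
    }

    /** True once every recorded request has been acknowledged. */
    boolean allAcked() {
        return sentAtMillis.isEmpty();
    }
}
```

Because a resent request keeps its original id, the receiving side can
recognize it as a retry, which is what makes the exactly-once check on the
server possible.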

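On the server side, the quoted issue says succeeded requests are tracked in a
structure that is efficient for increasing request ids (IncreasingBitSet).
The sketch below illustrates the general idea only and is not Giraph's
IncreasingBitSet: the class SeenRequestIds and its methods are hypothetical,
and it assumes ids start at 0 and mostly arrive in increasing order. It keeps
one counter for the contiguous prefix of ids already processed, plus a small
set for ids that arrived ahead of that prefix.

```java
import java.util.HashSet;
import java.util.Set;

/**
 * Hypothetical sketch of duplicate detection for mostly-increasing request
 * ids. Memory stays small as long as gaps in the id sequence are filled.
 */
class SeenRequestIds {
    /** Every id strictly below this value has already been processed. */
    private long nextExpected = 0;
    /** Processed ids that are >= nextExpected (arrived out of order). */
    private final Set<Long> ahead = new HashSet<>();

    /**
     * Marks an id as processed.
     *
     * @return true if the id is new (the request should be processed),
     *         false if it was seen before (a retried duplicate to skip)
     */
    synchronized boolean markSeen(long requestId) {
        if (requestId < 0) {
            throw new IllegalArgumentException("negative request id");
        }
        if (requestId < nextExpected || !ahead.add(requestId)) {
            return false;
        }
        // Fold any now-contiguous ids into the counter to keep the set small.
        while (ahead.remove(nextExpected)) {
            nextExpected++;
        }
        return true;
    }

    /** Number of ids currently held outside the contiguous prefix. */
    synchronized int outOfOrderCount() {
        return ahead.size();
    }
}
```

A server would presumably keep one such tracker per sending worker, process a
request only when markSeen returns true, and still acknowledge duplicates so
the sender stops retrying.
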