Hi everyone,

We partially implemented the first part (cleaning up rexi workers) for all of the fabric streaming requests, which should cover all_docs, changes, view map and view reduce:

https://github.com/apache/couchdb/commit/632f303a47bd89a97c831fd0532cb7541b80355d

The pattern there is the following:

 - With every request, spawn a monitoring process that is in charge of keeping track of all the workers as they are spawned.
 - If regular cleanup takes place, this monitoring process is killed, to avoid sending double the number of kill messages to the workers.
 - If the coordinating process doesn't run cleanup and just dies, the monitoring process performs the cleanup on its behalf.
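Here is a rough, purely illustrative sketch of that monitor-based cleanup (the module and function names are made up and are not the ones in the linked commit; it assumes workers can be represented as {Node, Ref} pairs and that rexi:kill/2 is the call that terminates a remote worker):

%% Illustrative sketch only, not the code from the linked commit.
-module(cleaner_sketch).
-export([spawn_cleaner/1, stop_cleaner/1]).

%% Called by the coordinator once it knows its workers.
%% Workers is assumed to be a list of {Node, Ref} pairs.
spawn_cleaner(Workers) ->
    Coordinator = self(),
    spawn(fun() ->
        Ref = erlang:monitor(process, Coordinator),
        receive
            {'DOWN', Ref, process, Coordinator, _Reason} ->
                %% The coordinator died without running its own cleanup,
                %% so kill the remaining workers on its behalf
                %% (assuming rexi:kill/2 is the right per-worker call).
                [rexi:kill(Node, WRef) || {Node, WRef} <- Workers];
            stop ->
                %% The coordinator ran regular cleanup itself; exit quietly
                %% so the workers don't get a second round of kill messages.
                ok
        end
    end).

%% Called from the coordinator's normal cleanup path.
stop_cleaner(Cleaner) ->
    Cleaner ! stop,
    ok.

The coordinator would call stop_cleaner/1 at the end of its normal cleanup path, which is what avoids the duplicate kill messages.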
Cheers,
-Nick

On Thu, Apr 18, 2019 at 5:16 PM Robert Samuel Newson <rnew...@apache.org> wrote:

> My view is a) the server was unavailable for this request due to all the other requests it’s currently dealing with, b) the connection was not idle, the client is not at fault.
>
> B.
>
> > On 18 Apr 2019, at 22:03, Done Collectively <sans...@inator.biz> wrote:
> >
> > Any reason 408 would be undesirable?
> >
> > https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/408
> >
> >
> > On Thu, Apr 18, 2019 at 10:37 AM Robert Newson <rnew...@apache.org> wrote:
> >
> >> 503 imo.
> >>
> >> --
> >> Robert Samuel Newson
> >> rnew...@apache.org
> >>
> >> On Thu, 18 Apr 2019, at 18:24, Adam Kocoloski wrote:
> >>> Yes, we should. Currently it’s a 500, maybe there’s something more appropriate:
> >>>
> >>> https://github.com/apache/couchdb/blob/8ef42f7241f8788afc1b6e7255ce78ce5d5ea5c3/src/chttpd/src/chttpd.erl#L947-L949
> >>>
> >>> Adam
> >>>
> >>>> On Apr 18, 2019, at 12:50 PM, Joan Touzet <woh...@apache.org> wrote:
> >>>>
> >>>> What happens when it turns out the client *hasn't* timed out and we just...hang up on them? Should we consider at least trying to send back some sort of HTTP status code?
> >>>>
> >>>> -Joan
> >>>>
> >>>> On 2019-04-18 10:58, Garren Smith wrote:
> >>>>> I'm +1 on this. With partition queries, we added a few more timeouts that can be enabled, which Cloudant enables. So having the ability to shed old requests when these timeouts get hit would be great.
> >>>>>
> >>>>> Cheers
> >>>>> Garren
> >>>>>
> >>>>> On Tue, Apr 16, 2019 at 2:41 AM Adam Kocoloski <kocol...@apache.org> wrote:
> >>>>>
> >>>>>> Hi all,
> >>>>>>
> >>>>>> For once, I’m coming to you with a topic that is not strictly about FoundationDB :)
> >>>>>>
> >>>>>> CouchDB offers a few config settings (some of them undocumented) to put a limit on how long the server is allowed to take to generate a response. The trouble with many of these timeouts is that, when they fire, they do not actually clean up all of the work that they initiated. A couple of examples:
> >>>>>>
> >>>>>> - Each HTTP response coordinated by the “fabric” application spawns several ephemeral processes via “rexi” on different nodes in the cluster to retrieve data and send it back to the process coordinating the response. If the request timeout fires, the coordinating process will be killed off, but the ephemeral workers might not be. In a healthy cluster they’ll exit on their own when they finish their jobs, but there are conditions under which they can sit around for extended periods of time waiting for an overloaded gen_server (e.g. couch_server) to respond.
> >>>>>>
> >>>>>> - Those named gen_servers (like couch_server) responsible for serializing access to important data structures will dutifully process messages received from old requests without any regard for (or even knowledge of) the fact that the client that sent the message timed out long ago. This can lead to a sort of death spiral in which the gen_server is ultimately spending ~all of its time serving dead clients and every client is timing out.
> >>>>>>
> >>>>>> I’d like to see us introduce a documented maximum request duration for all requests except the _changes feed, and then use that information to aid in load shedding throughout the stack. We can audit the codebase for gen_server calls with long timeouts (I know of a few on the critical path that set their timeouts to `infinity`) and we can design servers that efficiently drop old requests, knowing that the client who made the request must have timed out. A couple of topics for discussion:
> >>>>>>
> >>>>>> - the “gen_server that sheds old requests” is a very generic pattern, one that seems like it could be well-suited to its own behaviour. A cursory search of the internet didn’t turn up any prior art here, which surprises me a bit. I’m wondering if this is worth bringing up with the broader Erlang community.
> >>>>>>
> >>>>>> - setting and enforcing timeouts is a healthy pattern for read-only requests as it gives a lot more feedback to clients about the health of the server. When it comes to updates things are a little bit more muddy, just because there remains a chance that an update can be committed, but the caller times out before learning of the successful commit. We should try to minimize the likelihood of that occurring.
> >>>>>>
> >>>>>> Cheers, Adam
> >>>>>>
> >>>>>> P.S. I did say that this wasn’t _strictly_ about FoundationDB, but of course FDB has a hard 5 second limit on all transactions, so it is a bit of a forcing function :). Even putting FoundationDB aside, I would still argue to pursue this path based on our Ops experience with the current codebase.
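For anyone curious what the "gen_server that sheds old requests" idea quoted above might look like in practice, here is a minimal, hypothetical sketch (not CouchDB code): the caller attaches an absolute deadline to each request, and the server drops any message whose deadline has already passed before doing the expensive work. It assumes the caller and the server run on the same node, so monotonic timestamps are directly comparable.

%% Hypothetical sketch of a request-shedding gen_server; names are made up.
-module(shedding_server).
-behaviour(gen_server).

-export([start_link/0, call/2]).
-export([init/1, handle_call/3, handle_cast/2]).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

%% The caller supplies its own timeout; the matching absolute deadline is
%% forwarded so the server knows when a reply is no longer worth producing.
call(Request, TimeoutMs) ->
    Deadline = erlang:monotonic_time(millisecond) + TimeoutMs,
    gen_server:call(?MODULE, {Deadline, Request}, TimeoutMs).

init([]) ->
    {ok, #{}}.

handle_call({Deadline, Request}, _From, State) ->
    case erlang:monotonic_time(millisecond) > Deadline of
        true ->
            %% The message sat in the mailbox past its deadline; the client
            %% has already timed out, so skip the work and don't reply.
            {noreply, State};
        false ->
            {reply, do_work(Request), State}
    end;
handle_call(_Other, _From, State) ->
    {reply, ignored, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

%% Placeholder for the real (potentially slow) work.
do_work(Request) ->
    {ok, Request}.

A real version would also have to decide how such deadlines propagate across rexi to remote workers, and whether dropped update requests deserve some kind of reply, per the read-vs-write distinction raised above.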