Yes, we should. Currently it’s a 500; maybe there’s something more appropriate:

https://github.com/apache/couchdb/blob/8ef42f7241f8788afc1b6e7255ce78ce5d5ea5c3/src/chttpd/src/chttpd.erl#L947-L949
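
For illustration, this is the kind of thing I mean by “more appropriate” (a
hypothetical clause for the sake of discussion: the error atom, status code,
and wording are my own rather than what is at that link, assuming the error
mapping keeps its {Code, ErrorStr, ReasonStr} shape):

%% Hypothetical: report a shed/timed-out request as 503 rather than a
%% generic 500 internal error.
error_info({timeout, request}) ->
    {503, <<"service_unavailable">>,
     <<"The request was dropped; it could not be serviced within the configured request timeout.">>};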

Adam

> On Apr 18, 2019, at 12:50 PM, Joan Touzet <woh...@apache.org> wrote:
> 
> What happens when it turns out the client *hasn't* timed out and we
> just...hang up on them? Should we consider at least trying to send back
> some sort of HTTP status code?
> 
> -Joan
> 
> On 2019-04-18 10:58, Garren Smith wrote:
>> I'm +1 on this. With partition queries, we added a few more timeouts that
>> can be enabled, which Cloudant enables. So having the ability to shed old
>> requests when these timeouts get hit would be great.
>> 
>> Cheers
>> Garren
>> 
>> On Tue, Apr 16, 2019 at 2:41 AM Adam Kocoloski <kocol...@apache.org> wrote:
>> 
>>> Hi all,
>>> 
>>> For once, I’m coming to you with a topic that is not strictly about
>>> FoundationDB :)
>>> 
>>> CouchDB offers a few config settings (some of them undocumented) to put a
>>> limit on how long the server is allowed to take to generate a response. The
>>> trouble with many of these timeouts is that, when they fire, they do not
>>> actually clean up all of the work that they initiated. A couple of examples:
>>> 
>>> - Each HTTP response coordinated by the “fabric” application spawns
>>> several ephemeral processes via “rexi” on different nodes in the cluster to
>>> retrieve data and send it back to the process coordinating the response. If
>>> the request timeout fires, the coordinating process will be killed off, but
>>> the ephemeral workers might not be. In a healthy cluster they’ll exit on
>>> their own when they finish their jobs, but there are conditions under which
>>> they can sit around for extended periods of time waiting for an overloaded
>>> gen_server (e.g. couch_server) to respond (see the cleanup sketch after
>>> these two examples).
>>> 
>>> - Those named gen_servers (like couch_server) responsible for serializing
>>> access to important data structures will dutifully process messages
>>> received from old requests without any regard for (or even knowledge of)
>>> the fact that the client that sent the message timed out long ago. This can
>>> lead to a sort of death spiral in which the gen_server is ultimately
>>> spending ~all of its time serving dead clients and every client is timing
>>> out.
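>>> 
>>> To make the first example concrete, here is a toy sketch of the kind of
>>> cleanup I have in mind (the names are made up and this is not the actual
>>> fabric/rexi code, which also has to handle remote nodes):
>>> 
>>> -module(worker_cleanup_sketch).
>>> -export([spawn_worker/2]).
>>> 
>>> %% Link each ephemeral worker to its coordinator so that when a request
>>> %% timeout kills the coordinator, the worker dies with it instead of
>>> %% lingering in a call to an overloaded gen_server.
>>> spawn_worker(Coordinator, WorkFun) ->
>>>     spawn(fun() ->
>>>         %% If the coordinator is already dead, the link delivers a noproc
>>>         %% exit and this worker terminates right away; otherwise a later
>>>         %% abnormal exit of the coordinator takes the worker down with it.
>>>         link(Coordinator),
>>>         Coordinator ! {worker_result, self(), WorkFun()}
>>>     end).
>>> 
>>> The caveat is that a normal exit of the coordinator does not propagate
>>> over the link, so a real implementation still wants an explicit cleanup
>>> path; treat this purely as an illustration of the idea.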
>>> 
>>> I’d like to see us introduce a documented maximum request duration for all
>>> requests except the _changes feed, and then use that information to aid in
>>> load shedding throughout the stack. We can audit the codebase for
>>> gen_server calls with long timeouts (I know of a few on the critical path
>>> that set their timeouts to `infinity`) and we can design servers that
>>> efficiently drop old requests, knowing that the client who made the request
>>> must have timed out. A couple of topics for discussion:
>>> 
>>> - the “gen_server that sheds old requests” is a very generic pattern, one
>>> that seems like it could be well-suited to its own behaviour. A cursory
>>> search of the internet didn’t turn up any prior art here, which surprises
>>> me a bit. I’m wondering if this is worth bringing up with the broader
>>> Erlang community (see the sketch after these two discussion topics).
>>> 
>>> - setting and enforcing timeouts is a healthy pattern for read-only
>>> requests as it gives a lot more feedback to clients about the health of the
>>> server. When it comes to updates things are a little bit more muddy, just
>>> because there remains a chance that an update can be committed, but the
>>> caller times out before learning of the successful commit. We should try to
>>> minimize the likelihood of that occurring.
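>>> 
>>> To make the first topic concrete, here is the rough shape I have in mind
>>> for such a server (a sketch only; the module, message, and function names
>>> are invented for illustration):
>>> 
>>> -module(shedding_server_sketch).
>>> -behaviour(gen_server).
>>> -export([call/3]).
>>> -export([init/1, handle_call/3, handle_cast/2]).
>>> 
>>> %% Callers stamp each request with an absolute deadline, so the server
>>> %% can tell whether anyone could still be waiting for the answer.
>>> call(Server, Request, TimeoutMs) ->
>>>     Deadline = erlang:monotonic_time(millisecond) + TimeoutMs,
>>>     gen_server:call(Server, {Deadline, Request}, TimeoutMs).
>>> 
>>> init(Args) ->
>>>     {ok, Args}.
>>> 
>>> handle_call({Deadline, Request}, _From, State) ->
>>>     case erlang:monotonic_time(millisecond) > Deadline of
>>>         true ->
>>>             %% The caller has already timed out; do no further work on
>>>             %% its behalf and do not bother replying.
>>>             {noreply, State};
>>>         false ->
>>>             {reply, do_work(Request, State), State}
>>>     end.
>>> 
>>> handle_cast(_Msg, State) ->
>>>     {noreply, State}.
>>> 
>>> do_work(_Request, _State) ->
>>>     ok. %% placeholder for the real work
>>> 
>>> The point is that the deadline has to travel with the request; the timeout
>>> passed to gen_server:call/3 is invisible to the server, so without it the
>>> server keeps doing work for callers that gave up long ago. A generic
>>> behaviour would wrap exactly this bookkeeping.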
>>> 
>>> Cheers, Adam
>>> 
>>> P.S. I did say that this wasn’t _strictly_ about FoundationDB, but of
>>> course FDB has a hard 5 second limit on all transactions, so it is a bit of
>>> a forcing function :). Even putting FoundationDB aside, I would still argue
>>> to pursue this path based on our Ops experience with the current codebase.
>> 
> 
