Hi everyone,

We partially implemented the first part (cleaning up rexi workers) for all of the fabric streaming requests, which should cover all_docs, changes, view map and view reduce:

https://github.com/apache/couchdb/commit/632f303a47bd89a97c831fd0532cb7541b80355d

The pattern there is the following:

 - With every request, spawn a monitoring process that is in charge of keeping track of all the workers as they are spawned.
 - If regular cleanup takes place, this monitoring process is killed, to avoid sending double the number of kill messages to the workers.
 - If the coordinating process doesn't run cleanup and just dies, the monitoring process performs the cleanup on its behalf.
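Here is a rough, purely illustrative sketch of that monitor-based cleanup (the module and function names are made up and are not the ones in the linked commit; it assumes workers can be represented as {Node, Ref} pairs and that rexi:kill/2 is the call that terminates a remote worker):

%% Illustrative sketch only, not the code from the linked commit.
-module(cleaner_sketch).
-export([spawn_cleaner/1, stop_cleaner/1]).

%% Called by the coordinator once it knows its workers.
%% Workers is assumed to be a list of {Node, Ref} pairs.
spawn_cleaner(Workers) ->
    Coordinator = self(),
    spawn(fun() ->
        Ref = erlang:monitor(process, Coordinator),
        receive
            {'DOWN', Ref, process, Coordinator, _Reason} ->
                %% The coordinator died without running its own cleanup,
                %% so kill the remaining workers on its behalf
                %% (assuming rexi:kill/2 is the right per-worker call).
                [rexi:kill(Node, WRef) || {Node, WRef} <- Workers];
            stop ->
                %% The coordinator ran regular cleanup itself; exit quietly
                %% so the workers don't get a second round of kill messages.
                ok
        end
    end).

%% Called from the coordinator's normal cleanup path.
stop_cleaner(Cleaner) ->
    Cleaner ! stop,
    ok.

The coordinator would call stop_cleaner/1 at the end of its normal cleanup path, which is what avoids the duplicate kill messages.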
Cheers,
-Nick

On Thu, Apr 18, 2019 at 5:16 PM Robert Samuel Newson <rnew...@apache.org> wrote:

> My view is a) the server was unavailable for this request due to all the other requests it’s currently dealing with, b) the connection was not idle, the client is not at fault.
>
> B.
>
> > On 18 Apr 2019, at 22:03, Done Collectively <sans...@inator.biz> wrote:
> >
> > Any reason 408 would be undesirable?
> >
> > https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/408
> >
> >
> > On Thu, Apr 18, 2019 at 10:37 AM Robert Newson <rnew...@apache.org> wrote:
> >
> >> 503 imo.
> >>
> >> --
> >> Robert Samuel Newson
> >> rnew...@apache.org
> >>
> >> On Thu, 18 Apr 2019, at 18:24, Adam Kocoloski wrote:
> >>> Yes, we should. Currently it’s a 500, maybe there’s something more appropriate:
> >>>
> >>> https://github.com/apache/couchdb/blob/8ef42f7241f8788afc1b6e7255ce78ce5d5ea5c3/src/chttpd/src/chttpd.erl#L947-L949
> >>>
> >>> Adam
> >>>
> >>>> On Apr 18, 2019, at 12:50 PM, Joan Touzet <woh...@apache.org> wrote:
> >>>>
> >>>> What happens when it turns out the client *hasn't* timed out and we just...hang up on them? Should we consider at least trying to send back some sort of HTTP status code?
> >>>>
> >>>> -Joan
> >>>>
> >>>> On 2019-04-18 10:58, Garren Smith wrote:
> >>>>> I'm +1 on this. With partition queries, we added a few more timeouts that can be enabled, which Cloudant enables. So having the ability to shed old requests when these timeouts get hit would be great.
> >>>>>
> >>>>> Cheers
> >>>>> Garren
> >>>>>
> >>>>> On Tue, Apr 16, 2019 at 2:41 AM Adam Kocoloski <kocol...@apache.org> wrote:
> >>>>>
> >>>>>> Hi all,
> >>>>>>
> >>>>>> For once, I’m coming to you with a topic that is not strictly about FoundationDB :)
> >>>>>>
> >>>>>> CouchDB offers a few config settings (some of them undocumented) to put a limit on how long the server is allowed to take to generate a response. The trouble with many of these timeouts is that, when they fire, they do not actually clean up all of the work that they initiated. A couple of examples:
> >>>>>>
> >>>>>> - Each HTTP response coordinated by the “fabric” application spawns several ephemeral processes via “rexi” on different nodes in the cluster to retrieve data and send it back to the process coordinating the response. If the request timeout fires, the coordinating process will be killed off, but the ephemeral workers might not be. In a healthy cluster they’ll exit on their own when they finish their jobs, but there are conditions under which they can sit around for extended periods of time waiting for an overloaded gen_server (e.g. couch_server) to respond.
> >>>>>>
> >>>>>> - Those named gen_servers (like couch_server) responsible for serializing access to important data structures will dutifully process messages received from old requests without any regard for (or even knowledge of) the fact that the client that sent the message timed out long ago. This can lead to a sort of death spiral in which the gen_server is ultimately spending ~all of its time serving dead clients and every client is timing out.
> >>>>>>
> >>>>>> I’d like to see us introduce a documented maximum request duration for all requests except the _changes feed, and then use that information to aid in load shedding throughout the stack. We can audit the codebase for gen_server calls with long timeouts (I know of a few on the critical path that set their timeouts to `infinity`) and we can design servers that efficiently drop old requests, knowing that the client who made the request must have timed out. A couple of topics for discussion:
> >>>>>>
> >>>>>> - the “gen_server that sheds old requests” is a very generic pattern, one that seems like it could be well-suited to its own behaviour. A cursory search of the internet didn’t turn up any prior art here, which surprises me a bit. I’m wondering if this is worth bringing up with the broader Erlang community.
> >>>>>>
> >>>>>> - setting and enforcing timeouts is a healthy pattern for read-only requests as it gives a lot more feedback to clients about the health of the server. When it comes to updates things are a little bit more muddy, just because there remains a chance that an update can be committed, but the caller times out before learning of the successful commit. We should try to minimize the likelihood of that occurring.
> >>>>>>
> >>>>>> Cheers, Adam
> >>>>>>
> >>>>>> P.S. I did say that this wasn’t _strictly_ about FoundationDB, but of course FDB has a hard 5 second limit on all transactions, so it is a bit of a forcing function :). Even putting FoundationDB aside, I would still argue to pursue this path based on our Ops experience with the current codebase.
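For anyone curious what the "gen_server that sheds old requests" idea quoted above might look like in practice, here is a minimal, hypothetical sketch (not CouchDB code): the caller attaches an absolute deadline to each request, and the server drops any message whose deadline has already passed before doing the expensive work. It assumes the caller and the server run on the same node, so monotonic timestamps are directly comparable.

%% Hypothetical sketch of a request-shedding gen_server; names are made up.
-module(shedding_server).
-behaviour(gen_server).

-export([start_link/0, call/2]).
-export([init/1, handle_call/3, handle_cast/2]).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

%% The caller supplies its own timeout; the matching absolute deadline is
%% forwarded so the server knows when a reply is no longer worth producing.
call(Request, TimeoutMs) ->
    Deadline = erlang:monotonic_time(millisecond) + TimeoutMs,
    gen_server:call(?MODULE, {Deadline, Request}, TimeoutMs).

init([]) ->
    {ok, #{}}.

handle_call({Deadline, Request}, _From, State) ->
    case erlang:monotonic_time(millisecond) > Deadline of
        true ->
            %% The message sat in the mailbox past its deadline; the client
            %% has already timed out, so skip the work and don't reply.
            {noreply, State};
        false ->
            {reply, do_work(Request), State}
    end;
handle_call(_Other, _From, State) ->
    {reply, ignored, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

%% Placeholder for the real (potentially slow) work.
do_work(Request) ->
    {ok, Request}.

A real version would also have to decide how such deadlines propagate across rexi to remote workers, and whether dropped update requests deserve some kind of reply, per the read-vs-write distinction raised above.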