Indeed; "async" doesn't support cancellation. Deleting the request ID
doesn't cancel. BTW I haven't been meaning to talk about cancellation;
I've been talking about not starting something if it's already too late.
And I haven't been talking about anything that would impact "async"
calls; I'm referring to an approach only affecting the synchronous ones.
Maybe I haven't been clear.

https://solr.apache.org/guide/solr/latest/configuration-guide/collections-api.html#asynchronous-calls
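To make the intent concrete, here's a rough sketch of the pre-start check
(illustrative only; none of these class or method names exist in Solr --
the real thing would read the message node's ctime from its ZK Stat and
compare it against a configured threshold):

```java
import java.time.Duration;
import java.time.Instant;

// Illustrative sketch only -- not Solr code. The idea: before the Overseer
// starts processing a synchronous request, compare how long the message has
// sat in the queue (derived from the ZK node's ctime) against a threshold.
// "async" submissions would be untouched.
public class StaleMessageCheck {

    /** True if the message has waited in the queue longer than the threshold. */
    static boolean tooLateToStart(Instant ctime, Instant now, Duration threshold) {
        return Duration.between(ctime, now).compareTo(threshold) > 0;
    }

    public static void main(String[] args) {
        Instant ctime = Instant.parse("2024-02-02T12:00:00Z");
        Duration threshold = Duration.ofMinutes(1); // hypothetical node-wide setting

        // Dequeued 30 seconds after enqueue: start processing as usual.
        System.out.println(tooLateToStart(ctime, ctime.plusSeconds(30), threshold));

        // Dequeued 5 minutes later: fail fast with a timeout-style error instead,
        // since the synchronous caller has most likely given up by now.
        System.out.println(tooLateToStart(ctime, ctime.plus(Duration.ofMinutes(5)), threshold));
    }
}
```

The point being that nothing in-flight is interrupted; a message that has
already waited past the threshold simply never starts, and the synchronous
caller gets a timeout-style error.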
On Fri, Feb 2, 2024 at 4:12 PM Ilan Ginzburg <ilans...@gmail.com> wrote:
>
> A cluster wide timeout makes sense and is simpler if it is only used by the
> Overseer (or whatever entity processes a request) to decide not to start
> processing (that delay would not be request specific but depends on the
> load put by other concurrent activity in the cluster).
> If we consider a timeout for interrupting in progress processing (which
> carries its lot of challenges), it should be overridable per request.
> Creating a big collection (multiple shards and/or multiple replicas) takes
> time, and a cluster wide timeout would have to be large enough to
> accommodate this, then likely too long for simpler requests.
>
> To my knowledge (unless things changed recently and I've missed it) there's
> no way to cancel an async (or sync, for that matter) Collection API request.
>
> Ilan
>
> On Fri, Feb 2, 2024 at 7:40 AM David Smiley <dsmi...@apache.org> wrote:
> >
> > On Thu, Feb 1, 2024 at 1:53 PM Ilan Ginzburg <ilans...@gmail.com> wrote:
> > >
> > > I'd be in favor of the Overseer dropping synchronous requests for which
> > > the requestor is no longer waiting (ephemeral ZK node is gone).
> >
> > I agree! As you know, we've customized Solr to do exactly that for
> > collection creation. We suspect a misaligned timeout kept the
> > requestor/client present from Solr's perspective even though they
> > actually gave up in some other thread on their end and this wasn't
> > cancelled/stopped. Still, the proposal here seems more resilient than
> > fixing that; it's debatable. And if implemented, it may obsolete the
> > ephemeral node check in practice even though they are complementary.
> >
> > > For sync or async requests, we could let the caller set a timeout after
> > > which the processing should not start if it hasn't already,
> >
> > I thought it'd be nice to avoid a new per-request message and instead
> > use a node-wide setting.
> > A per-request setting would show up in
> > basically every API definition in Solr's new OpenAPI that we machine
> > generate -- solr/api/build/generated/openapi/*.json -- as does "async"
> > already. I don't think it's worth all the API littering considering
> > the Overseer mode (vs distributed processing) is what this concern
> > applies to, it's maybe a niche issue, and I hope to see the new
> > distributed mode used more and more.
> >
> > > or for async
> > > messages allow a cancellation call (that would cancel if processing has
> > > not started).
> >
> > This part works that way already I think.
> >
> > > Once processing has started, I suggest we let it finish (cancelling
> > > processing in progress would be more complicated).
> >
> > Yeah I'm not proposing anything to the contrary.
> >
> > > Ilan
> > >
> > > On Thu, Feb 1, 2024 at 6:46 AM 6harat <bharat.gulati.ce...@gmail.com> wrote:
> > >
> > > > Thanks David for starting this thread. We have also seen this behavior
> > > > from overseer resulting in "orphan collections" or "more than 1 replica
> > > > created" due to timeouts especially when our cluster is scaled up during
> > > > peak traffic days.
> >
> > We call them "orphan collections" too :-)
> >
> > > > While I am still at a nascent stage of my understanding of solr
> > > > internals, I wanted to highlight the below points (pardon me if these
> > > > don't make much sense):
> > > >
> > > > 1. There may be situations where we want solr to still honor the late
> > > > message and hence the functionality needs to be configurable and not a
> > > > default. For instance, during decommissioning of boxes (when we are
> > > > scaling down to our normal cluster size from peak), we send delete
> > > > replica commands for 20+ boxes in a short time frame.
> > > > Majority of these API hits inevitably
> > > > times out, however we rely upon the behaviour that the cluster after X
> > > > mins is able to reach the desired state.
> >
> > I'd argue that if you get a timeout telling any system something...
> > all bets are off on what happened and didn't happen. If you change
> > this to use Solr's async command style, it would be more reliable
> > and wouldn't relate to my proposal. Do note it kind of litters an ID
> > in ZK; it's ideal if your client can tend to deleting the ID but they
> > will be deleted eventually by SolrCloud.
> >
> > > > 2. How do we intend to communicate the timeout based rejection of
> > > > overseer message to the end-user
> >
> > I can only answer for "end user" if you mean the client talking to
> > Solr. It would simply get an error response indicating that a timeout
> > occurred.
> >
> > > > 3. In case of fail-over scenario where the overseer leader node goes
> > > > down and is re-elected, the election may have some overhead which may
> > > > inevitably result in many of the piled up messages being rejected due
> > > > to time constraints. Do we intend to pause the clock ticks during this
> > > > phase or should the guidance be to set the timeout higher than the sum
> > > > of such possible overheads?
> >
> > Definitely not pausing the clock. It may be worth repeating what we
> > all know -- in a distributed system, failures (to include timeouts)
> > are going to happen and clients need to be resilient to them (e.g. try
> > again).
> >
> > ~ David
> >
> > > > On Wed, Jan 31, 2024 at 11:18 PM David Smiley <dsmi...@apache.org> wrote:
> > > >
> > > > > I have a proposal and am curious what folks think.
> > > > > When the Overseer
> > > > > dequeues an admin command message to process, imagine it being
> > > > > enhanced to examine the "ctime" (creation time) of the ZK message
> > > > > node to determine how long it has been enqueued, and thus roughly
> > > > > how long the client has been waiting. If it's greater than a
> > > > > configured threshold (1 minute?), respond with an error of a
> > > > > timeout nature: "Sorry, the Overseer is so backed up that we fear
> > > > > you have given up; please try again". This would not apply to an
> > > > > "async" style submission.
> > > > >
> > > > > Motivation: Due to miscellaneous reasons at scale that are very user
> > > > > / situation dependent, the Overseer can get seriously backed up. The
> > > > > client, making a typical synchronous call to, say, create a
> > > > > collection, may reach its timeout (say a minute) and give up.
> > > > > Today, SolrCloud doesn't know this; it goes on its merry way and
> > > > > creates the collection anyway. Depending on how Solr is used, this
> > > > > can be an orphaned collection that the client doesn't want anymore.
> > > > > That is to say, the client wants a collection but it wanted it at
> > > > > the time it asked for it, with the name it asked for at that time.
> > > > > If it fails, it will come back later and propose a new name. This
> > > > > doesn't have to be collection creation specific; I'm thinking that
> > > > > in principle it doesn't really matter what the command is. If Solr
> > > > > takes too long for the Overseer to receive the message, just time
> > > > > out, basically.
> > > > >
> > > > > Thoughts?
> > > > >
> > > > > This wouldn't be a concern for the distributed mode of collection
> > > > > processing as there is no queue bottleneck; the receiving node
> > > > > processes the request immediately.
> > > > >
> > > > > ~ David Smiley
> > > > > Apache Lucene/Solr Search Developer
> > > > > http://www.linkedin.com/in/davidwsmiley
> > > > >
> > > > > ---------------------------------------------------------------------
> > > > > To unsubscribe, e-mail: dev-unsubscr...@solr.apache.org
> > > > > For additional commands, e-mail: dev-h...@solr.apache.org
> > > >
> > > > --
> > > > Regards
> > > > 6harat