> Let's imagine (quite realistically) that I'm updating our upgrade notes and
> providing advice for users to choose the Overseer or not. Given the
> benefits & risks, at what replica count threshold (for the biggest
> collection) would you advise continued use of the Overseer?
Even assuming we can figure out a value of 'N' that sounds reasonable, can be
shown to be stable in load testing, etc., is that enough to recommend
"non-overseer" mode? Sizing (which includes replica-count) is notoriously a
guess-and-check process in Solr. And even for users who have done everything
right and dialed in their replica-count with some benchmarking - what happens
when their requirements change and they need to add replicas to (e.g.)
accommodate a higher QPS? Is there an easy way for those users to switch back
to the Overseer, or do they have to risk instability going forward?

I guess I'm worried about basing recommendations on a factor like
replica-count, which has a tendency to drift over time, if the decision itself
(i.e. Overseer or not) is difficult to reverse after the fact. Not 100% sure
that's the case here, but that's my suspicion based on a hazy recollection of
some past discussions.

Best,

Jason

On Wed, Oct 1, 2025 at 10:10 AM Ilan Ginzburg <[email protected]> wrote:
>
> It's hard to provide a recommended threshold on collection size for
> distributed mode.
> I didn't run tests, and it obviously depends on the number of nodes in the
> cluster and how fast everything (including ZooKeeper) runs, but I'd say
> that below a couple hundred replicas total for a collection it should be
> ok.
> When a Solr node starts it marks all its replicas DOWN before marking them
> ACTIVE. If PRS is not used this could take a long time with distributed
> mode and be slower than the Overseer due to the lack of batching of
> updates.
>
> Indexing and query performance is obviously not impacted by distributed
> mode or Overseer performance, unless shard split performance is considered
> part of indexing performance.
>
> Ilan
>
> On Tue, Sep 30, 2025 at 10:20 PM David Smiley <[email protected]> wrote:
>
> > Let's imagine (quite realistically) that I'm updating our upgrade notes
> > and providing advice for users to choose the Overseer or not. Given the
> > benefits & risks, at what replica count threshold (for the biggest
> > collection) would you advise continued use of the Overseer?
> >
> > Sufficient stability is the characteristic I'm looking for in the above
> > question, not performance. I understand that it'd be ideal if the
> > cluster's state was structured differently to improve performance of
> > certain administrative operations, but performance is not a requirement
> > for stability. The most important performance considerations our users
> > have relate to index & query. There's a basic assumption that nodes can
> > restart in a "reasonable time"... maybe you'd care to try to define
> > that. I think your recommendations around restructuring the cluster
> > state would ultimately impact the performance of restarts and some
> > other administrative scenarios but shouldn't be a prerequisite for a
> > stable system.
> >
> > On Tue, Sep 30, 2025 at 4:20 AM Ilan Ginzburg <[email protected]> wrote:
> >
> > > Distributed mode doesn't behave nicely when there are many concurrent
> > > updates to a given collection's state.json.
> > >
> > > I'd recommend *against* making it the default at this time.
> > >
> > > The "root cause" is the presence of replica-specific information in
> > > state.json.
> > > In addition to relatively rare cases of changes to the sharding of
> > > the collection, state.json is updated when replicas are created or
> > > destroyed or moved or have their properties changed, and, when PRS is
> > > not used, also when replicas change state (which happens a lot when a
> > > Solr node restarts, for example).
> > >
> > > Therefore, before making distributed mode the default, something has
> > > to be done.
> > > As Pierre suggests, redesign Collection API operations that require
> > > multiple updates to be more efficient and group them when executing
> > > in distributed mode. Also make sure that smaller operations that
> > > happen concurrently are efficient enough.
> > > Another option is to remove replica information from state.json (keep
> > > collection metadata and shard definitions there), and create
> > > state-<shardname>.json for each shard with the replicas of that
> > > shard. Contention on anything replica-related will be restricted to
> > > replicas of the same shard.
> > > There will be more watches on ZooKeeper, but they will trigger less
> > > often and less data will be read each time. Also less data to
> > > compress/uncompress each time state.json is written or read (when so
> > > configured).
> > >
> > > Throttling goes against making SolrCloud as fast as we can.
> > >
> > > SolrCloud started with a single clusterstate.json file describing all
> > > collections (removed in 9.0), then moved to per-collection state.json
> > > files for scalability reasons.
> > > Maybe the time has come to split that big blob further?
> > >
> > > Ilan
> > >
> > > On Tue, Sep 30, 2025 at 12:40 AM Chris Hostetter
> > > <[email protected]> wrote:
> > >
> > > >
> > > > : I don't think this should prevent shipping a system that is
> > > > : objectively way simpler than the Overseer. Solr 10 will have both
> > > > : modes, no matter what the default is. Changing the default makes
> > > > : it easier to remove it in Solr 11. The impact on ease of
> > > > : understanding SolrCloud in 11 will be amazing!
> > > >
> > > > I'm not understanding your claim that changing a default from A(x)
> > > > to A(y) in 10.0 makes removing A(x) in 11.0 easier?
> > > >
> > > > You could change the default in 10.1, 10.2, etc... and it would
> > > > still be the same amount of effort to remove it in 11.0.
> > > >
> > > > No matter when you change the default, if the *option* to use A(x)
> > > > still exists in all versions < 11.0, then any "removal" of the code
> > > > implementing A(x) in 11.0 still needs to ensure that all versions >=
> > > > 11.0 have some code/process/documentation enabling users to migrate
> > > > their cluster to A(y).
> > > >
> > > >
> > > > -Hoss
> > > > http://www.lucidworks.com/
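
To make the state-split idea in the thread more concrete, here is a rough
sketch of the ZooKeeper layout it implies. The collection, shard, and replica
names below are made up for illustration, the JSON is heavily abbreviated
rather than actual Solr output, and the per-shard files are only a proposal
discussed in this thread, not something Solr writes today.

Currently, a single per-collection file holds the collection metadata, the
shard definitions, and the per-replica details:

    /collections/techproducts/state.json
    {"techproducts": {
        "configName": "_default",
        "router": {"name": "compositeId"},
        "shards": {
          "shard1": {
            "range": "80000000-ffffffff",
            "state": "active",
            "replicas": {
              "core_node3": {
                "core": "techproducts_shard1_replica_n1",
                "node_name": "host1:8983_solr",
                "state": "active",
                "leader": "true"}}}}}}

Under the proposal, state.json would keep only the collection metadata and the
shard definitions, and the per-replica details would move into one file per
shard:

    /collections/techproducts/state.json           (collection + shard definitions)
    /collections/techproducts/state-shard1.json    (replicas of shard1 only)
    /collections/techproducts/state-shard2.json    (replicas of shard2 only)

Concurrent replica-related updates would then contend only within a single
shard, and each ZooKeeper watch would fire with a smaller payload to read.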
