I agree; I'm worried about the same thing, i.e. defaulting to a mode that comes with caveats and isn't really tested and hardened.
On Wed, Oct 1, 2025 at 9:35 AM Jason Gerlowski <[email protected]> wrote:
>
> > Let's imagine (quite realistically) that I'm updating our upgrade notes and
> > providing advice for users to choose the Overseer or not. Given the
> > benefits & risks, at what replica count threshold (for the biggest
> > collection) would you advise continued use of the Overseer?
>
> Even assuming we can figure out a value of 'N' that sounds reasonable,
> can be shown to be stable in load testing, etc.... is that enough to
> recommend "non-overseer" mode?
>
> Sizing (which includes replica-count) is notoriously a guess-and-check
> process in Solr. And even for users who have done everything right and
> dialed in their replica-count with some benchmarking - what happens when
> their requirements change and they need to add replicas to (e.g.)
> accommodate a higher QPS? Is there an easy way for those users to switch
> back to the overseer, or do they have to risk instability going forward?
>
> I guess I'm worried about basing recommendations on a factor like
> replica-count, which has a tendency to drift over time, if the decision
> itself (i.e. overseer or not) is difficult to reverse after the fact.
> Not 100% sure that's the case here, but that's my suspicion based on a
> hazy recollection of some past discussions.
>
> Best,
>
> Jason
>
> On Wed, Oct 1, 2025 at 10:10 AM Ilan Ginzburg <[email protected]> wrote:
> >
> > It's hard to provide a recommended threshold on collection size for
> > distributed mode.
> > I didn't run tests, and it obviously depends on the number of nodes in
> > the cluster and how fast everything (including ZooKeeper) runs, but I'd
> > say that below a couple hundred replicas total for a collection it
> > should be ok.
> > When a Solr node starts, it marks all its replicas DOWN before marking
> > them ACTIVE. If PRS is not used, this could take a long time with
> > distributed mode and be slower than the Overseer due to the lack of
> > batching of updates.
> >
> > Indexing and query performance is obviously not impacted by distributed
> > mode or Overseer performance, unless shard split performance is
> > considered part of indexing performance.
> >
> > Ilan
> >
> > On Tue, Sep 30, 2025 at 10:20 PM David Smiley <[email protected]> wrote:
> > >
> > > Let's imagine (quite realistically) that I'm updating our upgrade
> > > notes and providing advice for users to choose the Overseer or not.
> > > Given the benefits & risks, at what replica count threshold (for the
> > > biggest collection) would you advise continued use of the Overseer?
> > >
> > > Sufficient stability is the characteristic I'm looking for in the
> > > above question, not performance. I understand that it'd be ideal if
> > > the cluster's state was structured differently to improve performance
> > > of certain administrative operations, but performance is not a
> > > requirement for stability. The most important performance
> > > considerations our users have relate to index & query. There's a
> > > basic assumption that nodes can restart in a "reasonable time"...
> > > maybe you'd care to try to define that. I think your recommendations
> > > around restructuring the cluster state would ultimately impact the
> > > performance of restarts and some other administrative scenarios but
> > > shouldn't be a prerequisite for a stable system.
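For concreteness on Jason's "can they switch back" question: the choice between the two modes is a node-level setting in solr.xml, not a per-collection property. A minimal sketch, assuming the option names introduced in Solr 9.x; verify them against the Reference Guide for the version you run:

    <solr>
      <solrcloud>
        <!-- When true, nodes write cluster state updates to ZooKeeper
             directly instead of queueing them to the Overseer.
             Assumed default: false (Overseer mode). -->
        <bool name="distributedClusterStateUpdates">${solr.distributedClusterStateUpdates:false}</bool>
        <!-- When true, Collection and ConfigSet API commands execute on
             the node that receives them rather than on the Overseer; as I
             understand it, this also requires distributed state updates. -->
        <bool name="distributedCollectionConfigSetExecution">${solr.distributedCollectionConfigSetExecution:false}</bool>
      </solrcloud>
    </solr>

If that holds, moving back to the Overseer is a configuration change plus a rolling restart, with the caveat that all nodes are presumably expected to agree on the setting.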
> > >
> > > On Tue, Sep 30, 2025 at 4:20 AM Ilan Ginzburg <[email protected]> wrote:
> > > >
> > > > Distributed mode doesn't behave nicely when there are many
> > > > concurrent updates to a given collection's state.json.
> > > >
> > > > I'd recommend *against* making it the default at this time.
> > > >
> > > > The "root cause" is the presence of replica-specific information in
> > > > state.json. In addition to relatively rare cases of changes to the
> > > > sharding of the collection, state.json is updated when replicas are
> > > > created or destroyed or moved or have their properties changed,
> > > > and, when PRS is not used, also when replicas change state (which
> > > > happens a lot when a Solr node restarts, for example).
> > > >
> > > > Therefore, before making distributed mode the default, something
> > > > has to be done.
> > > > As Pierre suggests, redesign Collection API operations that require
> > > > multiple updates to be more efficient, and group them when
> > > > executing in distributed mode. Also make sure that smaller
> > > > operations that happen concurrently are efficient enough.
> > > > Another option is to remove replica information from state.json
> > > > (keep collection metadata and shard definitions there), and create
> > > > state-<shardname>.json for each shard with the replicas of that
> > > > shard. Contention on anything replica-related will be restricted
> > > > to replicas of the same shard.
> > > > There will be more watches on ZooKeeper, but they will trigger less
> > > > often and less data will be read each time. There is also less
> > > > data to compress/uncompress each time state.json is written or
> > > > read (when so configured).
> > > >
> > > > Throttling goes against making SolrCloud as fast as we can.
> > > >
> > > > SolrCloud started with a single clusterstate.json file describing
> > > > all collections (removed in 9.0), then moved to per-collection
> > > > state.json files for scalability reasons.
> > > > Maybe the time has come to split that big blob further?
> > > >
> > > > Ilan
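To make the proposed split concrete, here is a hypothetical ZooKeeper layout for Ilan's idea. The state-<shardname>.json files do not exist in Solr today, and the collection/shard/replica fields below only loosely follow the shape of the current state.json format:

    /collections/techproducts/state.json  (collection metadata and shard definitions only)
    {
      "techproducts": {
        "configName": "techproducts_conf",
        "replicationFactor": "2",
        "shards": {
          "shard1": {"range": "80000000-ffffffff", "state": "active"},
          "shard2": {"range": "0-7fffffff", "state": "active"}
        }
      }
    }

    /collections/techproducts/state-shard1.json  (replicas of shard1 only)
    {
      "replicas": {
        "core_node3": {
          "core": "techproducts_shard1_replica_n1",
          "node_name": "node1:8983_solr",
          "state": "active",
          "leader": "true"
        }
      }
    }

Under such a layout, a restarting node that flips its replicas DOWN and then ACTIVE would rewrite only the state-<shardname>.json files for the shards it hosts, so concurrent writers contend per shard rather than per collection, at the cost of one extra ZooKeeper watch per shard.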
> > > >
> > > > On Tue, Sep 30, 2025 at 12:40 AM Chris Hostetter <[email protected]> wrote:
> > > > >
> > > > > : I don't think this should prevent shipping a system that is
> > > > > : objectively way simpler than the Overseer. Solr 10 will have
> > > > > : both modes, no matter what the default is. Changing the default
> > > > > : makes it easier to remove it in Solr 11. The impact on ease of
> > > > > : understanding SolrCloud in 11 will be amazing!
> > > > >
> > > > > I'm not understanding your claim that changing a default from A(x)
> > > > > to A(y) in 10.0 makes removing A(x) in 11.0 easier?
> > > > >
> > > > > You could change the default in 10.1, 10.2, etc... and it would
> > > > > still be the same amount of effort to remove it in 11.0.
> > > > >
> > > > > No matter when you change the default, if the *option* to use A(x)
> > > > > still exists in all versions < 11.0, then any "removal" of the
> > > > > code implementing A(x) in 11.0 still needs to ensure that all
> > > > > versions >= 11.0 have some code/process/documentation enabling
> > > > > users to migrate their cluster to A(y).
> > > > >
> > > > > -Hoss
> > > > > http://www.lucidworks.com/

--
Anshum Gupta
