Let's imagine (quite realistically) that I'm updating our upgrade notes and advising users on whether or not to keep using the Overseer. Given the benefits and risks, at what replica count threshold (for the biggest collection) would you advise continued use of the Overseer?
Sufficient stability is the characteristic I'm looking for in the question above, not performance. I understand it would be ideal if the cluster state were structured differently to improve the performance of certain administrative operations, but performance is not a requirement for stability. The most important performance considerations our users have relate to indexing and querying. There's a basic assumption that nodes can restart in a "reasonable time"; maybe you'd care to try to define that. I think your recommendations around restructuring the cluster state would ultimately improve the performance of restarts and some other administrative scenarios, but they shouldn't be a prerequisite for a stable system.

On Tue, Sep 30, 2025 at 4:20 AM Ilan Ginzburg <[email protected]> wrote:

> Distributed mode doesn't behave nicely when there are many concurrent
> updates to a given collection's state.json.
>
> I'd recommend *against* making it the default at this time.
>
> The "root cause" is the presence of replica-specific information in
> state.json. In addition to relatively rare changes to the sharding of
> the collection, state.json is updated when replicas are created,
> destroyed, moved, or have their properties changed, and, when PRS is not
> used, also when replicas change state (which happens a lot when a Solr
> node restarts, for example).
>
> Therefore, before making distributed mode the default, something has to
> be done.
> As Pierre suggests, redesign Collection API operations that require
> multiple updates to be more efficient, and group them when executing in
> distributed mode. Also make sure that smaller operations that happen
> concurrently are efficient enough.
> Another option is to remove replica information from state.json (keeping
> collection metadata and shard definitions there), and create
> state-<shardname>.json for each shard with the replicas of that shard.
> Contention on anything replica-related will be restricted to replicas of
> the same shard.
> There will be more watches on ZooKeeper, but they will trigger less
> often, and less data will be read each time. There is also less data to
> compress/uncompress each time state.json is written or read (when so
> configured).
>
> Throttling goes against making SolrCloud as fast as we can.
>
> SolrCloud started with a single clusterstate.json file describing all
> collections (removed in 9.0), then moved to per-collection state.json
> files for scalability reasons.
> Maybe the time has come to split that big blob further?
>
> Ilan
>
> On Tue, Sep 30, 2025 at 12:40 AM Chris Hostetter
> <[email protected]> wrote:
>
> > : I don't think this should prevent shipping a system that is
> > : objectively way simpler than the Overseer. Solr 10 will have both
> > : modes, no matter what the default is. Changing the default makes it
> > : easier to remove it in Solr 11. The impact on ease of understanding
> > : SolrCloud in 11 will be amazing!
> >
> > I'm not understanding your claim that changing a default from A(x) to
> > A(y) in 10.0 makes removing A(x) in 11.0 easier?
> >
> > You could change the default in 10.1, 10.2, etc., and it would still
> > be the same amount of effort to remove it in 11.0.
> >
> > No matter when you change the default, if the *option* to use A(x)
> > still exists in all versions < 11.0, then any "removal" of the code
> > implementing A(x) in 11.0 still needs to ensure that all versions
> > >= 11.0 have some code/process/documentation enabling users to migrate
> > their cluster to A(y).
> >
> > -Hoss
> > http://www.lucidworks.com/
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [email protected]
> > For additional commands, e-mail: [email protected]
> >
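
The trade-off Ilan describes (one state.json per collection vs. a per-shard split) can be illustrated with a rough back-of-envelope sketch. Everything below is an illustrative assumption, not Solr internals: the three-transition restart model (down, recovering, active), the even spread of replicas across nodes, and the sample numbers are all made up for the sake of the comparison.

```python
def zk_writes_per_restart(shards, replicas_per_shard, nodes):
    """Rough model of ZooKeeper state writes caused by one node restart.

    Assumptions (illustrative only): replicas are spread evenly across
    nodes, PRS is not in use, and each restarting replica publishes three
    state transitions (down -> recovering -> active).
    """
    total_replicas = shards * replicas_per_shard
    replicas_on_node = total_replicas // nodes  # assumed even spread
    transitions = 3                             # assumed per-replica transitions
    writes = replicas_on_node * transitions

    # Single state.json: every write rewrites the whole-collection blob,
    # so every write contends with, and notifies watchers of, state for
    # all replicas in the collection.
    whole_file = {"writes": writes, "watch_scope_replicas": total_replicas}

    # Per-shard state-<shardname>.json: the write count is unchanged, but
    # each write touches only one shard's file, so contention and watch
    # fan-out shrink to the replicas of that shard.
    per_shard = {"writes": writes, "watch_scope_replicas": replicas_per_shard}
    return whole_file, per_shard


whole, split = zk_writes_per_restart(shards=100, replicas_per_shard=3, nodes=30)
# Same number of writes either way (10 replicas on the node x 3 transitions),
# but each write's blast radius drops from 300 replicas to 3.
```

Under these toy numbers the split doesn't reduce the write count at all; what it reduces is how much state each write contends on and how many watchers each write wakes, which is the contention Ilan's proposal targets.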
