Let's imagine (quite realistically) that I'm updating our upgrade notes and
advising users on whether to keep using the Overseer.  Given the benefits
and risks, at what replica count threshold (for the biggest collection)
would you advise continued use of the Overseer?

Sufficient stability, not performance, is the characteristic I'm asking
about in the question above.  I understand that it'd be ideal if the
cluster's state were structured differently to improve the performance of
certain administrative operations, but performance is not a requirement for
stability.  The most important performance considerations our users have
relate to indexing & querying.  There's a basic assumption that nodes can
restart in a "reasonable time"... maybe you'd care to try to define that.
I think your recommendations around restructuring the cluster state would
ultimately impact the performance of restarts and some other administrative
scenarios, but they shouldn't be a prerequisite for a stable system.
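
To put rough numbers on the state.json contention described in the quoted
message, here is a back-of-the-envelope sketch in Python (not Solr code).
It assumes a worst-case round model of versioned conditional writes: in
each round every pending writer attempts an update, exactly one succeeds
per znode, and the rest retry.  The function names and the 8 shard x 3
replica collection are made up for illustration.

```python
from collections import Counter


def cas_attempts(writers_per_znode):
    """Count total conditional-write attempts under a worst-case round model.

    Assumption: in each round every pending writer reads the znode version
    and attempts a conditional write; exactly one succeeds per znode and
    the others must re-read and retry in the next round.
    """
    total = 0
    for pending in writers_per_znode:
        while pending > 0:
            total += pending  # every writer still pending tries this round
            pending -= 1      # one write wins, the others saw a stale version
    return total


def layout(replica_shards, per_shard):
    """Group replica state updates by the znode they would write to."""
    if per_shard:
        # hypothetical per-shard state file: one znode per shard
        return list(Counter(replica_shards).values())
    # single state.json: every replica update hits the same znode
    return [len(replica_shards)]


# Hypothetical collection: 8 shards x 3 replicas, all changing state at once
replicas = [f"shard{s}" for s in range(8) for _ in range(3)]
single = cas_attempts(layout(replicas, per_shard=False))
split = cas_attempts(layout(replicas, per_shard=True))
print(single, split)
```

Under those assumptions the single-file layout costs 300 write attempts and
the per-shard layout 48.  Real clusters won't hit the worst case, but it
shows why the cost grows with the replica count of the biggest collection.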

On Tue, Sep 30, 2025 at 4:20 AM Ilan Ginzburg <[email protected]> wrote:

> Distributed mode doesn't behave nicely when there are many concurrent
> updates to a given collection's state.json.
>
> I'd recommend *against* making it the default at this time.
>
> The "root cause" is the presence of replica specific information in
> state.json. In addition to relatively rare cases of changes to the sharding
> of the collection, state.json is updated when replicas are created or
> destroyed or moved or have their properties changed, and when PRS is not
> used, also when replicas change state (which happens a lot when a Solr node
> restarts for example).
>
> Therefore before making distributed mode the default, something has to be
> done.
> As Pierre suggests, redesign Collection API operations that require
> multiple updates to be more efficient and group them when executing in
> distributed mode. Also make sure that smaller operations that happen
> concurrently are efficient enough.
> Another option is to remove replica information from state.json (keep
> collection metadata and shard definitions there), and create
> state-*<shardname>*.json for each shard with the replicas of that shard.
> Contention on anything replica related will be restricted to replicas of
> the same shard.
> There will be more watches on ZooKeeper, but they will trigger less often
> and less data will be read each time. There will also be less data to
> compress/uncompress each time state.json is written or read (when so
> configured).
>
> Throttling goes against making SolrCloud as fast as we can.
>
> SolrCloud started with a single clusterstate.json file describing all
> collections (removed in 9.0), then moved to per collection state.json files
> for scalability reasons.
> Maybe the time has come to split that big blob further?
>
> Ilan
>
> On Tue, Sep 30, 2025 at 12:40 AM Chris Hostetter <[email protected]> wrote:
>
> >
> > : I don't think this should prevent shipping a system that is objectively
> > : way simpler than the Overseer.  Solr 10 will have both modes, no matter
> > : what the default is.  Changing the default makes it easier to remove it
> > : in Solr 11.  The impact on ease of understanding SolrCloud in 11 will be
> > : amazing!
> >
> > I'm not understanding your claim that changing a default from A(x) to
> > A(y) in 10.0 makes removing A(x) in 11.0 easier?
> >
> > You could change the default in 10.1, 10.2, etc... and it would still be
> > the same amount of effort to remove it in 11.0.
> >
> > No matter when you change the default, if the *option* to use A(x) still
> > exists in all versions < 11.0, then any "removal" of the code
> > implementing A(x) in 11.0 still needs to ensure that all versions >= 11.0
> > have some code/process/documentation enabling users to migrate their
> > cluster to A(y).
> >
> >
> > -Hoss
> > http://www.lucidworks.com/
> >
