> Let's imagine (quite realistically) that I'm updating our upgrade notes and
> providing advice for users to choose the Overseer or not.  Given the
> benefits & risks, at what replica count threshold (for the biggest
> collection) would you advise continued use of the Overseer?

Even assuming we can figure out a value for 'N' that sounds reasonable,
can be shown to be stable in load testing, etc. - is that enough to
recommend "non-overseer" mode?

Sizing (which includes replica-count) is notoriously a guess-and-check
process in Solr.  And even for users who have done everything right
and dialed in their replica-count with some benchmarking - what
happens when their requirements change and they need to add replicas
to (e.g.) accommodate higher QPS?  Is there an easy way for those
users to switch back to the Overseer, or do they have to risk
instability going forward?

I guess I'm worried about basing recommendations on a factor like
replica-count, which has a tendency to drift over time, if the decision
itself (i.e. Overseer or not) is difficult to reverse after the fact.
Not 100% sure that's the case here, but that's my suspicion based on a
hazy recollection of some past discussions.

Best,

Jason

On Wed, Oct 1, 2025 at 10:10 AM Ilan Ginzburg <[email protected]> wrote:
>
> It's hard to provide a recommended threshold on collection size for
> distributed mode.
> I didn't run tests, and it obviously depends on the number of nodes in the
> cluster and how fast everything (including ZooKeeper) runs, but I'd say
> that below a couple hundred replicas total for a collection it should be OK.
> When a Solr node starts, it marks all its replicas DOWN before marking them
> ACTIVE. If PRS is not used, this could take a long time with distributed
> mode and be slower than the Overseer due to the lack of batching of updates.
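>
> To make that concrete, here is a rough back-of-envelope sketch (every
> number below is made up, purely for illustration):
>
>   # Rough illustration: state.json writes triggered for one collection by a
>   # single node restart, when PRS is not used.
>   replicas_on_node = 100        # hypothetical replica count on the restarting node
>   transitions_per_replica = 2   # each replica is marked DOWN, then ACTIVE
>   updates = replicas_on_node * transitions_per_replica
>
>   # The Overseer batches queued updates into fewer ZooKeeper writes; in
>   # distributed mode each update is its own read-modify-write on state.json
>   # (plus retries on version conflicts).
>   overseer_batch_size = 50      # hypothetical average batch size
>   overseer_writes = -(-updates // overseer_batch_size)  # ceiling division
>   distributed_writes = updates
>   print(f"{updates} transitions -> ~{overseer_writes} ZK writes (Overseer), "
>         f">= {distributed_writes} (distributed)")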
>
> Indexing and query performance are obviously not impacted by distributed
> mode or Overseer performance, unless shard split performance is considered
> part of indexing performance.
>
> Ilan
>
> On Tue, Sep 30, 2025 at 10:20 PM David Smiley <[email protected]> wrote:
>
> > Let's imagine (quite realistically) that I'm updating our upgrade notes and
> > providing advice for users to choose the Overseer or not.  Given the
> > benefits & risks, at what replica count threshold (for the biggest
> > collection) would you advise continued use of the Overseer?
> >
> > Sufficient stability is the characteristic I'm looking for in the above
> > question, not performance.  I understand that it'd be ideal if the
> > cluster's state was structured differently to improve performance of
> > certain administrative operations, but performance is not a requirement for
> > stability.  The most important performance considerations our users have
> > relate to index & query.  There's a basic assumption that nodes can restart
> > in a "reasonable time"... maybe you'd care to try to define that.  I think
> > your recommendations around restructuring the cluster state would
> > ultimately impact the performance of restarts and some other administrative
> > scenarios but shouldn't be a prerequisite for a stable system.
> >
> > On Tue, Sep 30, 2025 at 4:20 AM Ilan Ginzburg <[email protected]> wrote:
> >
> > > Distributed mode doesn't behave nicely when there are many concurrent
> > > updates to a given collection's state.json.
> > >
> > > I'd recommend *against* making it the default at this time.
> > >
> > > The "root cause" is the presence of replica specific information in
> > > state.json. In addition to relatively rare cases of changes to the
> > sharding
> > > of the collection, state.json is updated when replicas are created or
> > > destroyed or moved or have their properties changed, and when PRS is not
> > > used, also when replicas change state (which happens a lot when a Solr
> > node
> > > restarts for example).
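> > >
> > > For illustration only, a stripped-down sketch of the kind of data this
> > > file carries (field names approximate, not the exact schema):
> > >
> > >   # Skeleton of a per-collection state.json, written as a Python dict.
> > >   # Any replica state flip rewrites this whole document when PRS is not used.
> > >   state_json = {
> > >       "techproducts": {
> > >           "replicationFactor": "3",
> > >           "router": {"name": "compositeId"},
> > >           "shards": {
> > >               "shard1": {
> > >                   "range": "80000000-7fffffff",
> > >                   "state": "active",
> > >                   "replicas": {
> > >                       "core_node2": {
> > >                           "node_name": "host1:8983_solr",
> > >                           "type": "NRT",
> > >                           "state": "active",  # DOWN / RECOVERING / ACTIVE land here
> > >                       },
> > >                   },
> > >               },
> > >           },
> > >       },
> > >   }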
> > >
> > > Therefore, before making distributed mode the default, something has to
> > > be done.
> > > As Pierre suggests, redesign Collection API operations that require
> > > multiple updates to be more efficient and group them when executing in
> > > distributed mode. Also make sure that smaller operations that happen
> > > concurrently are efficient enough.
> > > Another option is to remove replica information from state.json (keep
> > > collection metadata and shard definitions there), and create a
> > > state-<shardname>.json for each shard with the replicas of that shard.
> > > Contention on anything replica-related will then be restricted to
> > > replicas of the same shard.
> > > There will be more watches on ZooKeeper, but they will trigger less often
> > > and less data will be read each time. There will also be less data to
> > > compress/uncompress each time state.json is written or read (when so
> > > configured).
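> > >
> > > Roughly, the ZooKeeper layout would change as sketched below (the paths
> > > and file names here are hypothetical, just to show the idea):
> > >
> > >   # Today: one znode per collection holds everything, so concurrent
> > >   # replica updates on different shards contend on the same znode version.
> > >   current_layout = {
> > >       "/collections/techproducts/state.json":
> > >           "collection props + shard definitions + all replicas",
> > >   }
> > >   # Proposed: replica data moves to one znode per shard, so updates to
> > >   # different shards never conflict with each other.
> > >   proposed_layout = {
> > >       "/collections/techproducts/state.json":
> > >           "collection props + shard definitions only",
> > >       "/collections/techproducts/state-shard1.json": "replicas of shard1",
> > >       "/collections/techproducts/state-shard2.json": "replicas of shard2",
> > >   }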
> > >
> > > Throttling goes against making SolrCloud as fast as we can.
> > >
> > > SolrCloud started with a single clusterstate.json file describing all
> > > collections (removed in 9.0), then moved to per-collection state.json
> > > files for scalability reasons.
> > > Maybe the time has come to split that big blob further?
> > >
> > > Ilan
> > >
> > > On Tue, Sep 30, 2025 at 12:40 AM Chris Hostetter <[email protected]> wrote:
> > >
> > > >
> > > > : I don't think this should prevent shipping a system that is objectively
> > > > : way simpler than the Overseer.  Solr 10 will have both modes, no matter
> > > > : what the default is.  Changing the default makes it easier to remove it
> > > > : in Solr 11.  The impact on ease of understanding SolrCloud in 11 will be
> > > > : amazing!
> > > >
> > > > I'm not understanding your claim that changing a default from A(x) to
> > > > A(y) in 10.0 makes removing A(x) in 11.0 easier?
> > > >
> > > > You could change the default in 10.1, 10.2, etc... and it would still
> > > > be the same amount of effort to remove it in 11.0.
> > > >
> > > > No matter when you change the default, if the *option* to use A(x) still
> > > > exists in all versions < 11.0, then any "removal" of the code implementing
> > > > A(x) in 11.0 still needs to ensure that all versions >= 11.0 have some
> > > > code/process/documentation enabling users to migrate their cluster to A(y).
> > > >
> > > >
> > > > -Hoss
> > > > http://www.lucidworks.com/
> > > >
> > >
> >

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
