I agree; I'm worried about the same thing, i.e., defaulting to a mode that
comes with caveats and isn't really tested and hardened.
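
(For context, the mode in question is opted into today via flags in the
<solrcloud> section of solr.xml; the names below are from my reading of the
9.x ref guide, so worth double-checking:

  <solrcloud>
    <bool name="distributedClusterStateUpdates">true</bool>
    <bool name="distributedCollectionConfigSetExecution">true</bool>
  </solrcloud>

"Defaulting" would mean these effectively become true when unset.)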


On Wed, Oct 1, 2025 at 9:35 AM Jason Gerlowski <[email protected]>
wrote:

> > Let's imagine (quite realistically) that I'm updating our upgrade notes and
> > providing advice for users to choose the Overseer or not.  Given the
> > benefits & risks, at what replica count threshold (for the biggest
> > collection) would you advise continued use of the Overseer?
>
> Even assuming we can figure out a value of 'N' that sounds reasonable,
> can be shown to be stable in load testing, etc., is that enough to
> recommend "non-overseer" mode?
>
> Sizing (which includes replica-count) is notoriously a guess-and-check
> process in Solr.  And even for users who have done everything right
> and dialed in their replica-count with some benchmarking - what
> happens when their requirements change and they need to add replicas
> to (e.g.) accommodate a higher QPS?  Is there an easy way for those
> users to switch back to the overseer, or do they have to risk
> instability going forward?
>
> I guess I'm worried about basing recommendations on a factor like
> replica-count which has a tendency to drift over time, if the decision
> itself (i.e. overseer or not) is difficult to reverse after the fact.
> Not 100% sure that's the case here, but that's my suspicion based on a
> hazy recollection of some past discussions.
>
> Best,
>
> Jason
>
> On Wed, Oct 1, 2025 at 10:10 AM Ilan Ginzburg <[email protected]> wrote:
> >
> > It's hard to provide a recommended threshold on collection size for
> > distributed mode.
> > I didn't run tests and it obviously depends on the number of nodes in the
> > cluster and how fast everything (including ZooKeeper) runs, but I'd say
> > that below a couple hundred replicas total for a collection it should be
> > ok.
> > When a Solr node starts, it marks all its replicas DOWN before marking
> > them ACTIVE. If PRS is not used, this could take a long time in
> > distributed mode and be slower than with the Overseer due to the lack of
> > batching of updates.
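> >
> > For what it's worth, that is the contention PRS removes: replica state
> > moves out of state.json into per-replica child znodes of the state.json
> > node, with the state encoded in the znode name. An illustrative listing
> > (the name:version:state encoding, with a trailing :L for the leader, is
> > from my recollection of the PRS design, so treat it as a sketch):
> >
> >   /collections/techproducts/state.json   (no replica states inside)
> >   /collections/techproducts/state.json/core_node3:4:A:L
> >   /collections/techproducts/state.json/core_node5:2:A
> >   /collections/techproducts/state.json/core_node7:1:D
> >
> > A restarting node then flips its own small znodes rather than rewriting
> > the shared state.json for every replica state change.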
> >
> > Indexing and query performance is obviously not impacted by distributed
> > mode or Overseer performance, unless shard split performance is
> > considered part of indexing performance.
> >
> > Ilan
> >
> > On Tue, Sep 30, 2025 at 10:20 PM David Smiley <[email protected]> wrote:
> >
> > > Let's imagine (quite realistically) that I'm updating our upgrade notes and
> > > providing advice for users to choose the Overseer or not.  Given the
> > > benefits & risks, at what replica count threshold (for the biggest
> > > collection) would you advise continued use of the Overseer?
> > >
> > > Sufficient stability is the characteristic I'm looking for in the above
> > > question, not performance.  I understand that it'd be ideal if the
> > > cluster's state was structured differently to improve performance of
> > > certain administrative operations, but performance is not a requirement
> > > for stability.  The most important performance considerations our users
> > > have relate to index & query.  There's a basic assumption that nodes can
> > > restart in a "reasonable time"... maybe you'd care to try to define
> > > that.  I think your recommendations around restructuring the cluster
> > > state would ultimately impact the performance of restarts and some other
> > > administrative scenarios but shouldn't be a prerequisite for a stable
> > > system.
> > >
> > > On Tue, Sep 30, 2025 at 4:20 AM Ilan Ginzburg <[email protected]> wrote:
> > >
> > > > Distributed mode doesn't behave nicely when there are many concurrent
> > > > updates to a given collection's state.json.
> > > >
> > > > I'd recommend *against* making it the default at this time.
> > > >
> > > > The "root cause" is the presence of replica specific information in
> > > > state.json. In addition to relatively rare cases of changes to the
> > > sharding
> > > > of the collection, state.json is updated when replicas are created or
> > > > destroyed or moved or have their properties changed, and when PRS is
> not
> > > > used, also when replicas change state (which happens a lot when a
> Solr
> > > node
> > > > restarts for example).
> > > >
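> > > > To make the contention concrete, here's a rough sketch of what
> > > > state.json looks like today with replica info embedded (fields
> > > > trimmed, values invented for illustration):
> > > >
> > > >   {"techproducts": {
> > > >     "shards": {
> > > >       "shard1": {
> > > >         "range": "80000000-ffffffff",
> > > >         "replicas": {
> > > >           "core_node3": {"node_name": "host1:8983_solr",
> > > >                          "state": "active", "leader": "true"},
> > > >           "core_node5": {"node_name": "host2:8983_solr",
> > > >                          "state": "down"}}}},
> > > >     "router": {"name": "compositeId"}}}
> > > >
> > > > Every one of the replica-level changes listed above rewrites this
> > > > single znode, so concurrent writers contend on it.
> > > >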
> > > > Therefore, before making distributed mode the default, something has
> > > > to be done.
> > > > As Pierre suggests, redesign Collection API operations that require
> > > > multiple updates to be more efficient and group them when executing in
> > > > distributed mode. Also make sure that smaller operations that happen
> > > > concurrently are efficient enough.
> > > > Another option is to remove replica information from state.json (keep
> > > > collection metadata and shard definitions there), and create
> > > > state-*<shardname>*.json for each shard with the replicas of that
> > > > shard. Contention on anything replica-related will be restricted to
> > > > replicas of the same shard.
> > > > There will be more watches on ZooKeeper, but they will trigger less
> > > > often and less data will be read each time. There is also less data to
> > > > compress/uncompress each time state.json is written or read (when so
> > > > configured).
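> > > >
> > > > Sketched against the current layout (file names hypothetical, just
> > > > following the state-*<shardname>*.json idea above):
> > > >
> > > >   /collections/techproducts/state.json         (metadata + shard ranges)
> > > >   /collections/techproducts/state-shard1.json  (shard1's replicas)
> > > >   /collections/techproducts/state-shard2.json  (shard2's replicas)
> > > >
> > > > A replica change in shard1 then touches only state-shard1.json, and a
> > > > watch on that file never fires for shard2 activity.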
> > > >
> > > > Throttling goes against making SolrCloud as fast as we can.
> > > >
> > > > SolrCloud started with a single clusterstate.json file describing all
> > > > collections (removed in 9.0), then moved to per-collection state.json
> > > > files for scalability reasons.
> > > > Maybe the time has come to split that big blob further?
> > > >
> > > > Ilan
> > > >
> > > > On Tue, Sep 30, 2025 at 12:40 AM Chris Hostetter
> > > > <[email protected]> wrote:
> > > >
> > > > >
> > > > > : I don't think this should prevent shipping a system that is
> > > > > : objectively way simpler than the Overseer.  Solr 10 will have both
> > > > > : modes, no matter what the default is.  Changing the default makes
> > > > > : it easier to remove it in Solr 11.  The impact on ease of
> > > > > : understanding SolrCloud in 11 will be amazing!
> > > > >
> > > > > I'm not understanding your claim that changing a default from A(x)
> > > > > to A(y) in 10.0 makes removing A(x) in 11.0 easier?
> > > > >
> > > > > You could change the default in 10.1, 10.2, etc... and it would
> > > > > still be the same amount of effort to remove it in 11.0.
> > > > >
> > > > > No matter when you change the default, if the *option* to use A(x)
> > > > > still exists in all versions < 11.0, then any "removal" of the code
> > > > > implementing A(x) in 11.0 still needs to ensure that all
> > > > > versions >= 11.0 have some code/process/documentation enabling users
> > > > > to migrate their cluster to A(y).
> > > > >
> > > > >
> > > > > -Hoss
> > > > > http://www.lucidworks.com/
> > > > >
> > > > >
> > > > >
> > > > >
> > > >
> > >
>

-- 
Anshum Gupta
