Re: Moving away from Zookeeper in SolrJ

David Smiley Thu, 19 Dec 2024 20:06:05 -0800

It's on our roadmap:
https://cwiki.apache.org/confluence/display/SOLR/Roadmap
and I created an "umbrella" JIRA issue with most issues I could think of,
some minor or tangential/complementary:
https://issues.apache.org/jira/browse/SOLR-17605


I could create a SIP but it feels redundant.  Maybe a SIP that mostly links
to the JIRA issue and states the summary?  I think of SIPs as a way of
creating visibility for an important direction for the project, but that
can be accomplished right here on the dev list paired with JIRA, and
optionally a doc if desired.  I could still see SIPs being useful to
catalog major decisions / directions, although again that's somewhat
redundant with the Roadmap.

On Sat, Oct 19, 2024 at 1:47 AM David Smiley wrote:

> I strongly believe that we need to get ZooKeeper out of our clients (that
> use CloudSolrClient), and use Solr URLs (HTTP) for the cluster state
> instead.  I'm arguing to make this strategic direction clear, and we're
> already going in the right direction.  Realistically, I don't think
> solrj-zookeeper should be eliminated as it exists for Solr 10 but I could
> see doing so eventually (no rush!).  Starting with Solr 9.8, I'd like users
> to start using the Solr HTTP alternative option, encouraged by the release
> notes.  In Solr 10 we can remove any documentation in the ref guide on
> CloudSolrClient working with ZooKeeper.  Javadocs in
> CloudSolrClient.Builder can recommend Solr URLs instead of the ZooKeeper
> option.  I don't have a strong opinion on exactly when to deprecate it.
> Today is too soon.
>
> Why:
>
>    - Principled — ZooKeeper is conceptually behind Solr; clients
>    shouldn’t talk to it.
>    - Fewer dependencies for clients (no ZooKeeper or Netty).
>    - Better security — only Solr should talk to ZooKeeper!  Security
>    settings and key configuration files are stored in ZooKeeper.
>    - Eliminate impact of ZK storage on clients.  The change of where the
>    configSet name was stored in ZK is an example.  PRS is another.  And
>    other changes I’ve seen in a fork.
>    - Reduce complexity of SolrJ from an operational standpoint and bug
>    risks (e.g. no ZkStateReader there).  No ZooKeeper-related
>    configuration (jute.maxbuffer, etc.).
>    - Reduce complexity of SolrCloud by limiting the range of use of key
>    classes like ZkStateReader to only be in Solr instead of also existing in
>    SolrJ.  For example it’s not clear if/when LazyCollectionRefs are
>    used in SolrJ but with this separation, it’d be clear that they couldn’t
>    exist in SolrJ.
>    - Increase our options for classes in solrj-zookeeper, like adding
>    more dependencies (traces & metrics) without concern of burdening any
>    user/client.
>    - Reliably working with a collection after collection creation.  If
>    you’ve seen waitForActiveCollection after creating a collection in our
>    tests, this is what I mean (and it’s not strictly a test issue).  It's
>    sad; make them go away!
>
> Progress has been made on the alternative:  Ishan & Noble got the ball
> rolling years ago by introducing the HTTP alternative option.  I call it
> HttpCSP internally based on an abbreviation of its class name.  But I don't
> think anyone actually uses it, given how poorly it performed as reported
> in JIRA.  In Solr 9.1, SolrJ was modularized, creating the
> "solrj-zookeeper" module (opt-out), which was made opt-in for Solr 10.
> Finally, key performance improvements landed in Solr 9.8 for the HTTP
> option, making it viable for most users (IMO).  Credit to my colleagues
> Haythem & Aparna on some of these.
>
>
> That said, HttpCSP (and CloudSolrClient actually) hasn't reached its ideal
> state yet.  Some improvement possibilities / problems:
>
>    - The cached DocCollection (i.e. a collection's state) expires out of
>    a cache with a hard-coded TTL, even if it’s actively being used.  I
>    don’t think it should.  It’d lead to poor p99 client-experienced
>    request metrics for those requests that have to additionally (and
>    avoidably) fetch the DocCollection.
>    - There’s a DocCollection version staleness mechanism but IMO it’s not
>    robust.
>    - If all live nodes disappear temporarily (hard cluster restart), I
>    could imagine the client failing permanently.  (credit to Ilan)
>    - CloudSolrClient.getClusterState (and its equivalent method on the
>    provider) goes from a trivial getter to a slow remote call fetching the
>    entire cluster’s state; no cache.  We have code using it in various places;
>    surely users do too.  ClusterState itself has issues (out of scope of this
>    post), so I want to deprecate this method so that the client never touches
>    ClusterState.  Live nodes, DocCollection, and cluster properties would
>    still be accessible though.
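
To make the first bullet above concrete, here is a minimal sketch of a
state cache whose entries do not expire while they are actively used.  This
is not SolrJ code: the class name SlidingTtlCache, the loader function
(standing in for the remote DocCollection fetch), and the injectable clock
are all hypothetical and exist only for illustration.  It also assumes that
freshness is handled separately, e.g. by a version staleness check like the
one mentioned in the second bullet.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.LongSupplier;

/** Hypothetical sketch, not SolrJ code: entries stay cached while in use. */
class SlidingTtlCache<V> {

  private static final class Entry<V> {
    final V value;
    volatile long lastAccessNanos;

    Entry(V value, long now) {
      this.value = value;
      this.lastAccessNanos = now;
    }
  }

  private final Map<String, Entry<V>> entries = new ConcurrentHashMap<>();
  private final long ttlNanos;
  private final LongSupplier clock;         // injectable so the behavior is testable
  private final Function<String, V> loader; // stands in for the remote state fetch

  SlidingTtlCache(long ttlNanos, LongSupplier clock, Function<String, V> loader) {
    this.ttlNanos = ttlNanos;
    this.clock = clock;
    this.loader = loader;
  }

  V get(String collection) {
    long now = clock.getAsLong();
    Entry<V> e = entries.get(collection);
    if (e != null && now - e.lastAccessNanos < ttlNanos) {
      e.lastAccessNanos = now; // a hit extends the entry's life
      return e.value;
    }
    // Miss, or the entry sat idle for longer than the TTL: fetch again.
    V loaded = loader.apply(collection);
    entries.put(collection, new Entry<>(loaded, now));
    return loaded;
  }
}
```

With a real clock this would be constructed with System::nanoTime.  The
design choice is that only idle entries pay for a re-fetch, so a steadily
used collection never adds the extra round trip to a request.  Concurrent
misses for the same key may each call the loader; the sketch does not try
to de-duplicate them.
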
>
> The last one, basically banning ClusterState in SolrJ, is the biggest
> performance trap / issue that needs to be prioritized; I plan to create a
> JIRA or two.
>
> I suppose I could make a SIP out of this... albeit maybe the time for that
> was years ago when HttpCSP came into existence.  I'm just trying to see
> this through to a conclusion.
>
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley
>
