Moving away from ZooKeeper in SolrJ

David Smiley Fri, 18 Oct 2024 22:49:04 -0700

I strongly believe that we need to get ZooKeeper out of our clients (that
use CloudSolrClient), and use Solr URLs (HTTP) for the cluster state
instead.  I'm arguing to make this strategic direction clear, and we're
already going in the right direction.  Realistically, I don't think
solrj-zookeeper should be eliminated as it exists for Solr 10, but I could
see doing so eventually (no rush!).  Starting with Solr 9.8, I'd like users
to start using the Solr HTTP alternative option, encouraged by the release
notes.  In Solr 10 we can remove any documentation in the ref guide on
CloudSolrClient working with ZooKeeper.  Javadocs in
CloudSolrClient.Builder can recommend Solr URLs instead of the ZooKeeper
option.  I don't have a strong opinion on exactly when to deprecate it.
Today is too soon.


Why:

   - Principled — ZooKeeper is conceptually behind Solr; clients shouldn’t
   talk to it.
   - Fewer dependencies for clients (no ZooKeeper or Netty).
   - Better security — only Solr should talk to ZooKeeper!  Security
   settings and key configuration files are stored in ZooKeeper.
   - Eliminate the impact of ZK storage changes on clients.  The change of
   where the configSet name was stored in ZK is an example.  PRS is another,
   as are other changes I’ve seen in a fork.
   - Reduce SolrJ's operational complexity and bug risk (e.g. no
   ZkStateReader there).  No ZooKeeper-related configuration
   (jute.maxbuffer, etc.).
   - Reduce complexity of SolrCloud by limiting the range of use of key
   classes like ZkStateReader to only be in Solr instead of also existing in
   SolrJ.  For example, it’s not clear if/when LazyCollectionRefs are used
   in SolrJ, but with this separation it’d be clearer that they couldn’t
   exist in SolrJ.
   - Increase our options for classes in solrj-zookeeper, like adding more
   dependencies (traces & metrics) without concern of burdening any
   user/client.
   - Reliably working with a collection after collection creation.  If
   you’ve seen waitForActiveCollection after creating a collection in our
   tests, this is what I mean (and it’s not strictly a test issue).  It's
   sad; make them go away!

Progress has been made on the alternative:  Ishan & Noble got the ball
rolling years ago to introduce the HTTP alternative option.  I call it
HttpCSP internally, based on an abbreviation of its class name.  But I don't
think anyone actually uses it, given how poorly it performed, as reported
in JIRA.  In Solr 9.1, SolrJ was modularized, creating the
"solrj-zookeeper" module (opt-out), which was made opt-in for Solr 10.  Finally,
key performance improvements landed in Solr 9.8 for the HTTP option making
it viable for most users (IMO).  Credit to my colleagues Haythem & Aparna
on some of these.


That said, HttpCSP (and CloudSolrClient actually) hasn't reached its ideal
state yet.  Some improvement possibilities / problems:

   - The cached DocCollection (i.e. a collection's state) expires out of a
   cache with a hard-coded TTL, even if it’s actively being used.  I don’t
   think it should.  It’d lead to poor p99 client-experienced request
   metrics for requests that have to additionally fetch the DocCollection,
   which is avoidable.
   - There’s a DocCollection version staleness mechanism but IMO it’s not
   robust.
   - If all live nodes disappear temporarily (hard cluster restart), I
   could imagine the client failing permanently.  (credit to Ilan)
   - CloudSolrClient.getClusterState (and its equivalent method on the
   provider) goes from a trivial getter to a slow remote call fetching the
   entire cluster’s state; no cache.  We have code using it in various places;
   surely users do too.  The ClusterState class has issues (out of scope of
   this post), so I want to deprecate this method so that the client never
   touches ClusterState.  Live nodes, DocCollection, and cluster properties
   would remain accessible though.

The last one, basically banning ClusterState in SolrJ, is the biggest
performance trap / issue that needs to be prioritized; I plan to create a
JIRA or two.
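
To make the first item in the list above concrete (a cached DocCollection
expiring on a fixed TTL even while in active use), here is a minimal sketch
contrasting a fixed TTL with one that slides forward on every access.  This
is not SolrJ code: the class SlidingTtlCache, its constructor parameters,
the injected clock, and the loader function are all hypothetical names
invented for illustration, and the loader merely stands in for a remote
fetch of a collection's state.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.function.LongSupplier;

/** Hypothetical illustration only; not part of SolrJ. */
final class SlidingTtlCache<K, V> {
  private static final class Entry<V> {
    final V value;
    // Load time in fixed-TTL mode; last access time in sliding mode.
    long stampedAt;

    Entry(V value, long stampedAt) {
      this.value = value;
      this.stampedAt = stampedAt;
    }
  }

  private final Map<K, Entry<V>> entries = new HashMap<>();
  private final long ttl;              // same time unit as the clock
  private final boolean slideOnAccess; // true: reads keep the entry alive
  private final LongSupplier clock;    // injected so behavior is testable
  private final Function<K, V> loader; // stands in for a remote state fetch

  SlidingTtlCache(long ttl, boolean slideOnAccess,
                  LongSupplier clock, Function<K, V> loader) {
    this.ttl = ttl;
    this.slideOnAccess = slideOnAccess;
    this.clock = clock;
    this.loader = loader;
  }

  synchronized V get(K key) {
    long now = clock.getAsLong();
    Entry<V> e = entries.get(key);
    if (e == null || now - e.stampedAt >= ttl) {
      // Missing or expired: the caller pays for the (slow) load.
      e = new Entry<>(loader.apply(key), now);
      entries.put(key, e);
    } else if (slideOnAccess) {
      // Actively used entries never age out.
      e.stampedAt = now;
    }
    return e.value;
  }

  /** Counts loads for one key read at t = 0, 40, 80, 120, 160 with ttl = 60. */
  static int loadsForSteadyTraffic(boolean slideOnAccess) {
    long[] now = {0};
    int[] loads = {0};
    SlidingTtlCache<String, String> cache = new SlidingTtlCache<>(
        60, slideOnAccess, () -> now[0],
        name -> { loads[0]++; return "state-of-" + name; });
    for (long t = 0; t <= 160; t += 40) {
      now[0] = t;
      cache.get("collection1");
    }
    return loads[0];
  }

  public static void main(String[] args) {
    System.out.println("fixed TTL loads:   " + loadsForSteadyTraffic(false)); // 3
    System.out.println("sliding TTL loads: " + loadsForSteadyTraffic(true));  // 1
  }
}
```

With the fixed TTL, steady traffic still triggers periodic reloads, and
whichever request hits the expired entry absorbs the extra fetch.  With the
sliding variant the entry is loaded once, which also means freshness would
have to come entirely from some other mechanism, such as the version
staleness check mentioned in the list.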

I suppose I could make a SIP out of this... albeit maybe the time for that
was years ago when HttpCSP came into existence.  I'm just trying to see
this through to a conclusion.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley
