Hi! I'm all for (eventually) nudging folks towards HttpCSP. It's a "Good Thing", conceptually, to hide ZK!

But it sounds like there are still some significant (if few) downsides to using HttpCSP (the "all live nodes disappear" case being the biggest, to my mind). IMO we should close the more significant of those gaps before we start nudging folks in that direction. If I were a client I'd be pretty ticked if I switched to the "recommended" HttpCSP and had to restart my client pods after every sufficiently long network blip. WDYT about closing some of those gaps before tweaking example code, javadocs, etc.?

That said - really glad to hear about some of 9.8's HttpCSP performance improvements! Kudos to Haythem and Aparna! Any chance you have pointers to that work? (I grepped around a bit in CHANGES.txt but didn't see anything at a glance...) Are the perf improvements enough to bring HttpCSP level with ZkClientCSP performance-wise, or is there still a gap? I've been on teams in the past that wanted to use HttpCSP but were held back by some of the performance issues...

Best,

Jason

On Sat, Oct 19, 2024 at 1:47 AM David Smiley <dsmi...@apache.org> wrote:
>
> I strongly believe that we need to get ZooKeeper out of our clients (that
> use CloudSolrClient), and use Solr URLs (HTTP) for the cluster state
> instead. I'm arguing to make this strategic direction clear, and we're
> already going in the right direction. Realistically, I don't think
> solrj-zookeeper should be eliminated as it exists for Solr 10 but I could
> see doing so eventually (no rush!). Starting with Solr 9.8, I'd like users
> to start using the Solr HTTP alternative option, encouraged by the release
> notes. In Solr 10 we can remove any documentation in the ref guide on
> CloudSolrClient working with ZooKeeper. Javadocs in
> CloudSolrClient.Builder can recommend Solr URLs instead of the ZooKeeper
> option. I don't have a strong opinion on exactly when to deprecate it.
> Today is too soon.
>
> Why:
>
>    - Principled: ZooKeeper is conceptually behind Solr; clients shouldn’t
>      talk to it.
>    - Fewer dependencies for clients (no ZooKeeper or Netty).
>    - Better security: only Solr should talk to ZooKeeper! Security
>      settings and key configuration files are stored in ZooKeeper.
>    - Eliminate impact of ZK storage on clients. The change of where the
>      configSet name was stored in ZK is an example. PRS is another. And
>      other changes I’ve seen in a fork.
>    - Reduce complexity of SolrJ from an operational standpoint and bug
>      risks (e.g. no ZkStateReader there). No ZooKeeper-related
>      configuration (jute.maxbuffer, etc.)
>    - Reduce complexity of SolrCloud by limiting the range of use of key
>      classes like ZkStateReader to only be in Solr instead of also existing
>      in SolrJ. For example it’s not clear if/when LazyCollectionRefs are
>      used in SolrJ but with this separation, it’d be clearer that it
>      couldn’t exist in SolrJ.
>    - Increase our options for classes in solrj-zookeeper, like adding more
>      dependencies (traces & metrics) without concern of burdening any
>      user/client.
>    - Reliably working with a collection after collection creation. If
>      you’ve seen waitForActiveCollection after creating a collection in our
>      tests, this is what I mean (and it’s not strictly a test issue). It's
>      sad; make them go away!
>
> Progress has been made on the alternative: Ishan & Noble got the ball
> rolling years ago to introduce the HTTP alternative option. I call it
> HttpCSP internally based on an abbreviation of its class name. But I don't
> think anyone actually uses it, given how poorly it performed, as reported
> in JIRA. In Solr 9.1, SolrJ was modularized, creating the
> "solrj-zookeeper" module (opt-out), and made opt-in for Solr 10. Finally,
> key performance improvements landed in Solr 9.8 for the HTTP option making
> it viable for most users (IMO). Credit to my colleagues Haythem & Aparna
> on some of these.
>
> That said, HttpCSP (and CloudSolrClient actually) hasn't reached its ideal
> state yet.
> Some improvement possibilities / problems:
>
>    - The cached DocCollection (i.e. a collection's state) expires out of a
>      cache with a hard-coded TTL, even if it’s actively being used. I don’t
>      think it should. It’d lead to poor p99 client-experienced request
>      metrics for those that have to additionally fetch the DocCollection,
>      avoidably.
>    - There’s a DocCollection version staleness mechanism but IMO it’s not
>      robust.
>    - If all live nodes disappear temporarily (hard cluster restart), I
>      could imagine the client failing permanently. (credit to Ilan)
>    - CloudSolrClient.getClusterState (and its equivalent method on the
>      provider) goes from a trivial getter to a slow remote call fetching the
>      entire cluster’s state; no cache. We have code using it in various
>      places; surely users too. This class has issues (out of scope of this
>      post), so I want to deprecate this so that the client never touches
>      ClusterState. Live nodes, DocCollection, and cluster properties are
>      still accessible though.
>
> The last one, basically banning ClusterState in SolrJ, is the biggest
> performance trap / issue that needs to be prioritized; I plan to create a
> JIRA or two.
>
> I suppose I could make a SIP out of this... albeit maybe the time for that
> was years ago when HttpCSP came into existence. I'm just trying to see
> this through to a conclusion.
>
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@solr.apache.org
For additional commands, e-mail: dev-h...@solr.apache.org