It's on our roadmap: https://cwiki.apache.org/confluence/display/SOLR/Roadmap and I created an "umbrella" JIRA issue with most issues I could think of, some minor or tangential/complementary: https://issues.apache.org/jira/browse/SOLR-17605
I could create a SIP but it feels redundant. Maybe a SIP that mostly links to the JIRA issue and states the summary? I think of SIPs as being about creating visibility for an important direction for the project. But that can be accomplished right here on the dev list paired with JIRA and optionally a doc if desired. I could still see SIPs being useful to catalog major decisions / directions, albeit that's again somewhat redundant with the Roadmap.

On Sat, Oct 19, 2024 at 1:47 AM David Smiley <dsmi...@apache.org> wrote:

> I strongly believe that we need to get ZooKeeper out of our clients (that
> use CloudSolrClient), and use Solr URLs (HTTP) for the cluster state
> instead. I'm arguing to make this strategic direction clear, and we're
> already going in the right direction. Realistically, I don't think
> solrj-zookeeper should be eliminated as it exists for Solr 10 but I could
> see doing so eventually (no rush!). Starting with Solr 9.8, I'd like users
> to start using the Solr HTTP alternative option, encouraged by the release
> notes. In Solr 10 we can remove any documentation in the ref guide on
> CloudSolrClient working with ZooKeeper. Javadocs in
> CloudSolrClient.Builder can recommend Solr URLs instead of the ZooKeeper
> option. I don't have a strong opinion on exactly when to deprecate it.
> Today is too soon.
>
> Why:
>
>    - Principled: ZooKeeper is conceptually behind Solr; clients
>    shouldn’t talk to it.
>    - Fewer dependencies for clients (no ZooKeeper or Netty).
>    - Better security: only Solr should talk to ZooKeeper! Security
>    settings and key configuration files are stored in ZooKeeper.
>    - Eliminate impact of ZK storage on clients. The change of where the
>    configSet name was stored in ZK is an example. PRS is another. And
>    other changes I’ve seen in a fork.
>    - Reduce complexity of SolrJ from an operational standpoint and bug
>    risks (e.g. no ZkStateReader there). No ZooKeeper-related
>    configuration (jute.maxbuffer, etc.)
>    - Reduce complexity of SolrCloud by limiting the range of use of key
>    classes like ZkStateReader to only be in Solr instead of also existing in
>    SolrJ. For example it’s not clear if/when LazyCollectionRef’s are
>    used in SolrJ but with this separation, it’d be clearer that it couldn’t
>    exist in SolrJ.
>    - Increase our options for classes in solrj-zookeeper, like adding
>    more dependencies (traces & metrics) without concern of burdening any
>    user/client.
>    - Reliably working with a collection after collection creation. If
>    you’ve seen waitForActiveCollection after creating a collection in our
>    tests, this is what I mean (and it’s not strictly a test issue). It's
>    sad; make them go away!
>
> Progress has been made on the alternative: Ishan & Noble got the ball
> rolling years ago to introduce the HTTP alternative option. I call it
> HttpCSP internally based on an abbreviation of its class name. But I don't
> think anyone actually uses it based on how poorly it performed, as reported
> in JIRA. In Solr 9.1, SolrJ was modularized, creating the
> "solrj-zookeeper" module (opt-out), and made opt-in for Solr 10. Finally,
> key performance improvements landed in Solr 9.8 for the HTTP option making
> it viable for most users (IMO). Credit to my colleagues Haythem & Aparna
> on some of these.
>
>
> That said, HttpCSP (and CloudSolrClient actually) hasn't reached its ideal
> state yet. Some improvement possibilities / problems:
>
>    - The cached DocCollection (i.e. a collection's state) expires out of
>    a cache with a hard-coded TTL, even if it’s actively being used. I
>    don’t think it should. It’d lead to poor p99 client experienced
>    request metrics for those that have to additionally fetch the DocCollection,
>    avoidably.
>    - There’s a DocCollection version staleness mechanism but IMO it’s not
>    robust.
>    - If all live nodes disappear temporarily (hard cluster restart), I
>    could imagine the client failing permanently. (credit to Ilan)
>    - CloudSolrClient.getClusterState (and its equivalent method on the
>    provider) goes from a trivial getter to a slow remote call fetching the
>    entire cluster’s state; no cache. We have code using it in various places;
>    surely users too. This class has issues (out of scope of this post),
>    so I want to deprecate this so that the client never touches ClusterState.
>    Getting live-nodes, DocCollection, and cluster properties are still
>    accessible though.
>
> The last one, basically banning ClusterState in SolrJ, is the biggest
> performance trap / issue that needs to be prioritized; I plan to create a
> JIRA or two.
>
> I suppose I could make a SIP out of this... albeit maybe the time for that
> was years ago when HttpCSP came into existence. I'm just trying to see
> this through to a conclusion.
>
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley
>
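
To make the cache-expiry point from the quoted message concrete, here is a minimal sketch of a cache whose entries expire only after a period of inactivity rather than a fixed time after loading. This is not SolrJ code: the class name `SlidingTtlCache`, its constructor parameters, and the loader function are all hypothetical, and a real client would more likely use an existing caching library.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.LongSupplier;

/** Hypothetical illustration only; not how SolrJ caches collection state. */
class SlidingTtlCache<K, V> {

  private static final class Entry<V> {
    final V value;
    volatile long lastAccessNanos;

    Entry(V value, long now) {
      this.value = value;
      this.lastAccessNanos = now;
    }
  }

  private final Map<K, Entry<V>> entries = new ConcurrentHashMap<>();
  private final long ttlNanos;
  private final LongSupplier clock;    // e.g. System::nanoTime
  private final Function<K, V> loader; // stands in for a remote state fetch

  SlidingTtlCache(long ttlNanos, LongSupplier clock, Function<K, V> loader) {
    this.ttlNanos = ttlNanos;
    this.clock = clock;
    this.loader = loader;
  }

  /** Returns the cached value, reloading only if the entry has been idle longer than the TTL. */
  V get(K key) {
    long now = clock.getAsLong();
    Entry<V> e = entries.get(key);
    if (e == null || now - e.lastAccessNanos > ttlNanos) {
      // Missing or idle too long: pay the (assumed slow) load once.
      e = new Entry<>(loader.apply(key), now);
      entries.put(key, e);
    } else {
      // Actively used: slide the expiry window instead of letting it lapse.
      e.lastAccessNanos = now;
    }
    return e.value;
  }
}
```

With a fixed TTL, a busy client would periodically hit the slow reload path even though it never stopped using the entry; with the sliding window above, only idle entries are reloaded. The trade-off is that an entry in constant use is never refreshed by expiry, so this approach assumes a separate staleness check (such as the version mechanism mentioned in the quoted message) detects changed state.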