This drives me crazy too. +1 to Ilan's point. For a CloudSolrClient, its knowledge of cluster state should merely be a hint and not the final word -- it needs to go to ZK for that. For the HTTP-based ClusterStateProvider, the receiving Solr side needs to use non-cached information -- it must always go to ZK (maybe toggleable with a param if need be).
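To make the "hint, not the final word" idea concrete, here is a minimal sketch. Everything in it is hypothetical: HintedStateLookup and the authoritativeRead function are illustration-only names, not SolrJ or ZooKeeper APIs, and the function merely stands in for a direct (non-cached) read from ZK.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/**
 * Hypothetical sketch: a local cache of collection state that is treated
 * only as a hint. None of these names exist in SolrJ.
 */
final class HintedStateLookup {
    // Local, possibly stale view (e.g. whatever the watches have delivered so far).
    private final Map<String, String> cache = new ConcurrentHashMap<>();

    // Assumption: stands in for a direct, non-cached read from ZooKeeper.
    private final Function<String, Optional<String>> authoritativeRead;

    HintedStateLookup(Function<String, Optional<String>> authoritativeRead) {
        this.authoritativeRead = authoritativeRead;
    }

    /** Returns the state for a collection, asking the authority on a cache miss. */
    Optional<String> find(String collection) {
        String hinted = cache.get(collection);
        if (hinted != null) {
            return Optional.of(hinted);
        }
        // A cache miss is not proof of absence: the collection may have been
        // created already and the local view just has not caught up yet.
        Optional<String> fresh = authoritativeRead.apply(collection);
        fresh.ifPresent(state -> cache.put(collection, state));
        return fresh;
    }
}
```

Note that this only covers the "collection not found" case: a cache hit is still returned without checking the authority, so a stale positive entry would need separate handling.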
Still, here's a public service announcement on a guarantee that ZooKeeper does *not* have:
https://zookeeper.apache.org/doc/r3.5.9/zookeeperProgrammers.html#ch_zkGuarantees
see lack of "Simultaneously Consistent Cross-Client Views" in the note. After reading this (and being shocked by its implications), I added
https://github.com/apache/solr/blob/122c88a0748769432ef62cc3fb94c2226dd67aa7/solr/solrj/src/java/org/apache/solr/common/cloud/ZkStateReader.java#L2071
And I also tried to highlight this... seems maybe not the dev list (I can't find it now) but at least in JIRA somewhere. So maybe all ClusterStateProviders need to ask that a Zk "sync" is called to guarantee the view is up-to-date? I'm not sure what the cost is but it may be a cost we can't safely avoid.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley

On Wed, Sep 22, 2021 at 6:26 PM Ilan Ginzburg <[email protected]> wrote:

> Not sure Gus I would blame the create collection code. To the best of my
> recollection, when the create collection call returns the collection IS
> fully created.
> This doesn't mean though (and that's the problem IMO) that the cluster
> state on the node that issued the collection creation call is aware of it:
> its cache of cluster state is updated async at a later point once Zookeeper
> watches decide it's time).
>
> I would tend to blame the way cluster state is managed in general in the
> cluster.
>
> I didn't look at this test specifically, so the actual issue might still
> be different.
>
> Ilan
>
> On Wed, Sep 22, 2021 at 5:37 PM Gus Heck <[email protected]> wrote:
>
>> why it often can’t find the collection it’s currently supposed to be
>>> creating
>>
>>
>> This sounds like things that pestered us while writing TRA tests. IIRC
>> the problem basically comes from 2 things: 1) we return from create
>> collection before the collection is fully created and ready to use, 2)
>> watching code to determine when it IS ready is non-trivial. I think #1 is
>> the real problem and #2 is a bandaid that shouldn't be needed.
>>
>> I think I recall mark previously ranting about how insane and terrible it
>> would be if an RDBMS did this with CREATE TABLE...
>>
>> On Wed, Sep 22, 2021 at 11:24 AM Ishan Chattopadhyaya <
>> [email protected]> wrote:
>>
>>> Sure, Mark.
>>> Noble or I will get to this at the earliest, hopefully by end of this
>>> week.
>>> Unfortunately, I haven't been paying attention to test failures lately.
>>>
>>> On Wed, Sep 22, 2021 at 8:09 PM Mark Miller <[email protected]>
>>> wrote:
>>>
>>>> Perhaps I just have a unique test running experience, but this test has
>>>> been an outlier failure test in my test runs for months. Given that it’s
>>>> newer than most tests, I imagine it’s attention grabbing days are on a
>>>> downslope, so here is a poke if someone wants to check out why it often
>>>> can’t find the collection it’s currently supposed to be creating.
>>>>
>>>>
>>>> --
>>>> - Mark
>>>>
>>>> http://about.me/markrmiller
>>>>
>>>
>>
>> --
>> http://www.needhamsoftware.com (work)
>> http://www.the111shift.com (play)
>>
>
