I should also mention, I promise this test can be 100% reliable. It’s not code I’m going to ramp up on soon, though. Also, as I said, I may have a different test experience than others. Which tests run together and how things run will depend on hardware, core count, etc. It’s just the most common failure I see, and since it’s a new test, it should be easier to get attention on than an old one.
The issue itself could be a test problem, a real problem, or a real problem that’s not likely to be seen in production. They run the gamut. At the moment all I know is that it’s the test that causes me to have to rerun the tests the most.

Mark

On Sun, Sep 26, 2021 at 8:55 PM Mark Miller <[email protected]> wrote:

> I believe all tests still run with a 1 zk cluster, if still the case, zk
> consistency shouldn’t matter.
>
> It’s been a long while since I’ve looked into that particular doc/issue,
> but even with more than 1 zk instance I believe that is only an issue in
> a fairly specific case - when a client does something with zk and so it
> assumes it’s done and then triggers something else with the assumption the
> change is made. That something else may not see the change, though normally
> this would require it’s using a different zk client instance.
> Unfortunately, we don’t always currently use a single zk client per node,
> but even still, this is not a normal pattern. Most Solr ZK usage should not
> have an issue with this case as most behavior is driven directly by
> notifications from zookeeper or does not trigger something else with this
> assumption.
>
> Mark
>
> On Sun, Sep 26, 2021 at 8:24 AM David Smiley <[email protected]> wrote:
>
>> This drives me crazy too.
>>
>> +1 to Ilan's point. For a CloudSolrClient, its state knowledge should
>> merely be a hint and not the final word -- need to go to ZK for that. For
>> the HTTP based ClusterStateProvider, the receiving Solr side needs to use
>> non-cached information -- must go to ZK always (maybe toggle-able with a
>> param if need be).
>>
>> Still, here's a public service announcement on a guarantee that ZooKeeper
>> does *not* have:
>> https://zookeeper.apache.org/doc/r3.5.9/zookeeperProgrammers.html#ch_zkGuarantees
>> see lack of "Simultaneously Consistent Cross-Client Views" in the note.
>> After reading this (and being shocked by its implications), I added
>> https://github.com/apache/solr/blob/122c88a0748769432ef62cc3fb94c2226dd67aa7/solr/solrj/src/java/org/apache/solr/common/cloud/ZkStateReader.java#L2071
>> And I also tried to highlight this... seems maybe not the dev list (I can't
>> find it now) but at least in JIRA somewhere.
>> So maybe all ClusterStateProviders need to ask that a Zk "sync" is called
>> to guarantee the view is up-to-date? I'm not sure what the cost is but it
>> may be a cost we can't safely avoid.
>>
>> ~ David Smiley
>> Apache Lucene/Solr Search Developer
>> http://www.linkedin.com/in/davidwsmiley
>>
>>
>> On Wed, Sep 22, 2021 at 6:26 PM Ilan Ginzburg <[email protected]> wrote:
>>
>>> Not sure Gus I would blame the create collection code. To the best of my
>>> recollection, when the create collection call returns the collection IS
>>> fully created.
>>> This doesn't mean though (and that's the problem IMO) that the cluster
>>> state on the node that issued the collection creation call is aware of it:
>>> its cache of cluster state is updated async at a later point once Zookeeper
>>> watches decide it's time.
>>>
>>> I would tend to blame the way cluster state is managed in general in the
>>> cluster.
>>>
>>> I didn't look at this test specifically, so the actual issue might still
>>> be different.
>>>
>>> Ilan
>>>
>>> On Wed, Sep 22, 2021 at 5:37 PM Gus Heck <[email protected]> wrote:
>>>
>>>>> why it often can’t find the collection it’s currently supposed to be
>>>>> creating
>>>>
>>>>
>>>> This sounds like things that pestered us while writing TRA tests. IIRC
>>>> the problem basically comes from 2 things: 1) we return from create
>>>> collection before the collection is fully created and ready to use, 2)
>>>> watching code to determine when it IS ready is non-trivial. I think #1 is
>>>> the real problem and #2 is a bandaid that shouldn't be needed.
>>>>
>>>> I think I recall mark previously ranting about how insane and terrible
>>>> it would be if an RDBMS did this with CREATE TABLE...
>>>>
>>>> On Wed, Sep 22, 2021 at 11:24 AM Ishan Chattopadhyaya <
>>>> [email protected]> wrote:
>>>>
>>>>> Sure, Mark.
>>>>> Noble or I will get to this at the earliest, hopefully by end of this
>>>>> week.
>>>>> Unfortunately, I haven't been paying attention to test failures lately.
>>>>>
>>>>> On Wed, Sep 22, 2021 at 8:09 PM Mark Miller <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Perhaps I just have a unique test running experience, but this test
>>>>>> has been an outlier failure test in my test runs for months. Given that
>>>>>> it’s newer than most tests, I imagine it’s attention grabbing days are
>>>>>> on a downslope, so here is a poke if someone wants to check out why it
>>>>>> often can’t find the collection it’s currently supposed to be creating.
>>>>>>
>>>>>>
>>>>>> --
>>>>>> - Mark
>>>>>>
>>>>>> http://about.me/markrmiller
>>>>>>
>>>>>
>>>>
>>>> --
>>>> http://www.needhamsoftware.com (work)
>>>> http://www.the111shift.com (play)
>>>>
>>>
> --
> - Mark
>
> http://about.me/markrmiller
>
--
- Mark

http://about.me/markrmiller
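
To make the failure mode discussed in this thread a bit more concrete, here is a minimal Java sketch of the pattern Ilan describes: a create call that has fully completed in the authoritative store while a node's cached view has not caught up yet, plus the kind of "wait until it IS ready" polling Gus mentions. This is only an illustration under stated assumptions. Store, CachedView, refresh and waitFor are hypothetical names invented for the sketch, not Solr or ZooKeeper APIs, and the real watch and notification machinery is replaced by an explicit refresh() call.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

/** Hypothetical sketch only; none of these names are Solr or ZooKeeper APIs. */
public class StaleViewSketch {

    /** Stands in for the authoritative store (ZooKeeper in the thread). */
    static final class Store {
        private final Set<String> collections = ConcurrentHashMap.newKeySet();

        void create(String name) { collections.add(name); }

        boolean exists(String name) { return collections.contains(name); }
    }

    /** Stands in for a node's cached cluster state, updated only on refresh(). */
    static final class CachedView {
        private final Store store;
        private volatile Set<String> snapshot = Set.of();

        CachedView(Store store) { this.store = store; }

        /** Simulates the asynchronous watch callback finally catching up. */
        void refresh() { snapshot = Set.copyOf(store.collections); }

        /** Answers from the snapshot, so it can lag behind the store. */
        boolean exists(String name) { return snapshot.contains(name); }
    }

    /**
     * Polls the condition every 10 ms until it holds or the timeout expires.
     * Returns true if the condition held, false on timeout or interruption.
     */
    static boolean waitFor(BooleanSupplier condition, long timeoutMs) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        while (!condition.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(10);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }
}
```

In this sketch, a caller that checks view.exists(name) right after store.create(name) sees a miss even though the create is complete. Asking the store directly, or waiting on the view with waitFor, are the two workarounds the thread weighs against each other.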
