I don't know the fix for this specific test, but the way cluster state is maintained on a node does not depend on how many ZK nodes there are.
When a node does an action against ZK, it does its write to ZK. When it needs to read, it reads from its local cache. The local cache of the node is updated (following a write) by the ZK watch firing, followed by the callback reading the changed bits. So the cache is always updated some time after the write has completed. One therefore can't assume a change is immediately visible on a node, no matter the ZK config. That's why code often busy-waits for the update to become visible before continuing (a common pattern in the Collection API commands; a minimal sketch of this pattern appears after the quoted thread below).

Ilan

On Mon, Sep 27, 2021 at 8:13 AM Mark Miller <[email protected]> wrote:

> Okay, never mind. Somehow I cling to this idea that it's easier not to get drawn into every test or feature that's causing me problems, but I should have known the 30 seconds it takes to address most of these things will easily be dwarfed by the theoretical back and forth over them. I'll put in the fix for it.
>
> Mark
>
> On Sun, Sep 26, 2021 at 9:01 PM Mark Miller <[email protected]> wrote:
>
>> I should also mention, I promise this test can be 100% reliable. It's not code I'm going to ramp up on soon, though. Also, as I said, I may have a different test experience than others. What tests run together and how things run will depend on hardware, core count, etc. It's just the most common failure I see, and given it's a new test, they tend to be easier to get attention on vs old tests.
>>
>> The issue itself could be a test problem, a real problem, or a real problem that's not likely to be seen in production. They run the gamut. At the moment all I know is that it's the test that causes me to have to rerun the tests the most.
>>
>> Mark
>>
>> On Sun, Sep 26, 2021 at 8:55 PM Mark Miller <[email protected]> wrote:
>>
>>> I believe all tests still run with a 1-ZK cluster; if that's still the case, ZK consistency shouldn't matter.
>>>
>>> It's been a long while since I've looked into that particular doc/issue, but even with more than one ZK instance I believe it is only an issue in a fairly specific case: when a client does something with ZK, assumes it's done, and then triggers something else with the assumption the change is made. That something else may not see the change, though normally this would require it to be using a different ZK client instance. Unfortunately, we don't always currently use a single ZK client per node, but even still, this is not a normal pattern. Most Solr ZK usage should not have an issue with this case, as most behavior is driven directly by notifications from ZooKeeper or does not trigger something else with this assumption.
>>>
>>> Mark
>>>
>>> On Sun, Sep 26, 2021 at 8:24 AM David Smiley <[email protected]> wrote:
>>>
>>>> This drives me crazy too.
>>>>
>>>> +1 to Ilan's point. For a CloudSolrClient, its state knowledge should merely be a hint and not the final word -- we need to go to ZK for that. For the HTTP-based ClusterStateProvider, the receiving Solr side needs to use non-cached information -- it must go to ZK always (maybe toggle-able with a param if need be).
>>>>
>>>> Still, here's a public service announcement on a guarantee that ZooKeeper does *not* have:
>>>> https://zookeeper.apache.org/doc/r3.5.9/zookeeperProgrammers.html#ch_zkGuarantees
>>>> See the lack of "Simultaneously Consistent Cross-Client Views" in the note.
>>>> After reading this (and being shocked by its implications), I added
>>>> https://github.com/apache/solr/blob/122c88a0748769432ef62cc3fb94c2226dd67aa7/solr/solrj/src/java/org/apache/solr/common/cloud/ZkStateReader.java#L2071
>>>> and I also tried to highlight this... it seems maybe not on the dev list (I can't find it now), but at least in JIRA somewhere.
>>>> So maybe all ClusterStateProviders need to ask that a ZK "sync" is called to guarantee the view is up-to-date? I'm not sure what the cost is, but it may be a cost we can't safely avoid.
>>>>
>>>> ~ David Smiley
>>>> Apache Lucene/Solr Search Developer
>>>> http://www.linkedin.com/in/davidwsmiley
>>>>
>>>> On Wed, Sep 22, 2021 at 6:26 PM Ilan Ginzburg <[email protected]> wrote:
>>>>
>>>>> Not sure, Gus, that I would blame the create collection code. To the best of my recollection, when the create collection call returns, the collection IS fully created.
>>>>> This doesn't mean, though (and that's the problem IMO), that the cluster state on the node that issued the collection creation call is aware of it: its cache of cluster state is updated async at a later point, once ZooKeeper watches decide it's time.
>>>>>
>>>>> I would tend to blame the way cluster state is managed in general in the cluster.
>>>>>
>>>>> I didn't look at this test specifically, so the actual issue might still be different.
>>>>>
>>>>> Ilan
>>>>>
>>>>> On Wed, Sep 22, 2021 at 5:37 PM Gus Heck <[email protected]> wrote:
>>>>>
>>>>>>> why it often can't find the collection it's currently supposed to be creating
>>>>>>
>>>>>> This sounds like the things that pestered us while writing TRA tests. IIRC the problem basically comes from two things: 1) we return from create collection before the collection is fully created and ready to use, and 2) the watching code to determine when it IS ready is non-trivial. I think #1 is the real problem and #2 is a bandaid that shouldn't be needed.
>>>>>>
>>>>>> I think I recall Mark previously ranting about how insane and terrible it would be if an RDBMS did this with CREATE TABLE...
>>>>>>
>>>>>> On Wed, Sep 22, 2021 at 11:24 AM Ishan Chattopadhyaya <[email protected]> wrote:
>>>>>>
>>>>>>> Sure, Mark.
>>>>>>> Noble or I will get to this at the earliest, hopefully by the end of this week.
>>>>>>> Unfortunately, I haven't been paying attention to test failures lately.
>>>>>>>
>>>>>>> On Wed, Sep 22, 2021 at 8:09 PM Mark Miller <[email protected]> wrote:
>>>>>>>
>>>>>>>> Perhaps I just have a unique test-running experience, but this test has been an outlier failure in my test runs for months. Given that it's newer than most tests, I imagine its attention-grabbing days are on a downslope, so here is a poke if someone wants to check out why it often can't find the collection it's currently supposed to be creating.
>>>>>>>>
>>>>>>>> --
>>>>>>>> - Mark
>>>>>>>>
>>>>>>>> http://about.me/markrmiller
>>>>>>
>>>>>> --
>>>>>> http://www.needhamsoftware.com (work)
>>>>>> http://www.the111shift.com (play)
>>>
>>> --
>>> - Mark
>>>
>>> http://about.me/markrmiller
>>
>> --
>> - Mark
>>
>> http://about.me/markrmiller
>
> --
> - Mark
>
> http://about.me/markrmiller
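For readers following the busy-wait pattern Ilan describes at the top of the thread, here is a minimal sketch. The collectionVisibleInLocalState() helper is hypothetical and stands in for a read of the node's locally cached ClusterState; real Solr code typically uses something like ZkStateReader.waitForState() for the same purpose, but the exact API should be checked against the Solr version in use.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Sketch of the busy-wait pattern: after issuing a write through ZooKeeper
// (e.g. creating a collection), poll the node's locally cached cluster state
// until the watch-driven callback has applied the change.
public final class WaitForLocalStateExample {

  // Hypothetical helper: consults the node's cached ClusterState,
  // NOT ZooKeeper directly.
  static boolean collectionVisibleInLocalState(String collection) {
    return false; // placeholder
  }

  static void waitForCollectionVisible(String collection, long timeout, TimeUnit unit)
      throws InterruptedException, TimeoutException {
    long deadline = System.nanoTime() + unit.toNanos(timeout);
    while (!collectionVisibleInLocalState(collection)) {
      if (System.nanoTime() > deadline) {
        throw new TimeoutException("Collection " + collection
            + " not visible in local cluster state after " + timeout + " " + unit);
      }
      // Back off briefly; the ZK watch callback updates the cache asynchronously.
      Thread.sleep(100);
    }
  }
}
```

This is exactly the shape of the workaround Ilan mentions in the Collection API commands: the write has already succeeded in ZooKeeper, and the loop only waits for the node's own cache to catch up.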

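David's suggestion of calling ZooKeeper sync before reading could look roughly like the sketch below. It uses the plain ZooKeeper client API (the asynchronous sync() call followed by a read once its callback fires); the znode path passed by the caller is illustrative, and how this would be wired into a ClusterStateProvider is not shown.

```java
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

// Sketch: ask the ZooKeeper server this client is connected to to catch up
// with the leader before reading, so the subsequent read is not stale.
// sync() is asynchronous, so we block on a latch until its callback fires.
public final class SyncThenReadExample {

  static byte[] syncThenRead(ZooKeeper zk, String path)
      throws KeeperException, InterruptedException {
    CountDownLatch latch = new CountDownLatch(1);
    zk.sync(path, (rc, p, ctx) -> latch.countDown(), null);
    latch.await(); // wait for the sync to complete
    return zk.getData(path, false, null); // now read the up-to-date data
  }
}
```

As David notes, the open question is the cost: a sync adds a round trip on every read, which is why he frames it as something that may need to be toggle-able rather than unconditional.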