Re: PerReplicaStatesIntegrationTest

Mark Miller Sun, 26 Sep 2021 19:01:51 -0700

I should also mention, I promise this test can be 100% reliable. It’s not
code I’m going to ramp up on soon though. Also, as I said, I may have a
different test experience than others. Which tests run together and how
things run will depend on hardware, core count, etc. It’s just the most
common failure I see, and given it’s a new test, it should be easier to get
attention on than an old test.


The issue itself could be a test problem, a real problem, or a real
problem that’s not likely to be seen in production. They run the gamut. At
the moment all I know is that it’s the test that causes me to have to
rerun the tests the most.

Mark

On Sun, Sep 26, 2021 at 8:55 PM Mark Miller <[email protected]> wrote:

> I believe all tests still run with a 1-instance zk cluster; if that is
> still the case, zk consistency shouldn’t matter.
>
> It’s been a long while since I’ve looked into that particular doc/issue,
> but even with more than 1 zk instance I believe that is only an issue in
> a fairly specific case: when a client does something with zk, assumes
> it’s done, and then triggers something else on the assumption that the
> change has been made. That something else may not see the change, though
> normally this would require that it’s using a different zk client instance.
> Unfortunately, we don’t always currently use a single zk client per node,
> but even still, this is not a normal pattern. Most Solr ZK usage should not
> have an issue with this case as most behavior is driven directly by
> notifications from zookeeper or does not trigger something else with this
> assumption.
>
> Mark
>
> On Sun, Sep 26, 2021 at 8:24 AM David Smiley <[email protected]> wrote:
>
>> This drives me crazy too.
>>
>> +1 to Ilan's point.  For a CloudSolrClient, its state knowledge should
>> merely be a hint and not the final word -- need to go to ZK for that.  For
>> the HTTP based ClusterStateProvider, the receiving Solr side needs to use
>> non-cached information -- must go to ZK always (maybe toggle-able with a
>> param if need be).
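
A minimal sketch of the "cached state is only a hint" idea, in plain Java
with no Solr or ZooKeeper dependency. The class name HintedStateLookup and
the authoritative function are hypothetical stand-ins (the function plays
the role of a direct ZK read); this is not how CloudSolrClient is actually
implemented. It only covers a cache miss; a stale cache hit would still
need invalidation or a re-read.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical sketch: a local cache used only as a hint, with an authoritative fallback. */
class HintedStateLookup {
    // Local, possibly stale view (stands in for a client's cached cluster state).
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    // Authoritative source (stands in for a direct, uncached read from ZooKeeper).
    private final Function<String, Optional<String>> authoritative;

    HintedStateLookup(Function<String, Optional<String>> authoritative) {
        this.authoritative = authoritative;
    }

    /** Returns the cached value if present; on a miss, asks the authoritative source and caches the answer. */
    Optional<String> lookup(String collection) {
        String hint = cache.get(collection);
        if (hint != null) {
            return Optional.of(hint);
        }
        // A miss in the cache is not the final word: re-check the source of truth.
        Optional<String> fresh = authoritative.apply(collection);
        fresh.ifPresent(state -> cache.put(collection, state));
        return fresh;
    }

    public static void main(String[] args) {
        Map<String, String> sourceOfTruth = new ConcurrentHashMap<>();
        HintedStateLookup lookup =
            new HintedStateLookup(name -> Optional.ofNullable(sourceOfTruth.get(name)));

        System.out.println(lookup.lookup("c1")); // not created yet
        sourceOfTruth.put("c1", "active");       // created elsewhere; the local cache never heard about it
        System.out.println(lookup.lookup("c1")); // the miss falls through to the source of truth
    }
}
```

The design choice here is that "not in my cache" never turns directly into
"collection not found"; only the authoritative source can say that.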
>>
>> Still, here's a public service announcement on a guarantee that ZooKeeper
>> does *not* have:
>> https://zookeeper.apache.org/doc/r3.5.9/zookeeperProgrammers.html#ch_zkGuarantees
>> see lack of "Simultaneously Consistent Cross-Client Views" in the note.
>> After reading this (and being shocked by its implications), I added
>> https://github.com/apache/solr/blob/122c88a0748769432ef62cc3fb94c2226dd67aa7/solr/solrj/src/java/org/apache/solr/common/cloud/ZkStateReader.java#L2071
>> And I also tried to highlight this... apparently not on the dev list (I
>> can't find it now), but at least in JIRA somewhere.
>> So maybe all ClusterStateProviders need to ask that a ZK "sync" is called
>> to guarantee the view is up-to-date?  I'm not sure what the cost is but it
>> may be a cost we can't safely avoid.
>>
>> ~ David Smiley
>> Apache Lucene/Solr Search Developer
>> http://www.linkedin.com/in/davidwsmiley
>>
>>
>> On Wed, Sep 22, 2021 at 6:26 PM Ilan Ginzburg <[email protected]> wrote:
>>
>>> Gus, I'm not sure I would blame the create collection code. To the best
>>> of my recollection, when the create collection call returns the
>>> collection IS fully created.
>>> This doesn't mean though (and that's the problem IMO) that the cluster
>>> state on the node that issued the collection creation call is aware of it:
>>> its cache of cluster state is updated async at a later point (once
>>> ZooKeeper watches decide it's time).
>>>
>>> I would tend to blame the way cluster state is managed in general in the
>>> cluster.
>>>
>>> I didn't look at this test specifically, so the actual issue might still
>>> be different.
>>>
>>> Ilan
>>>
>>> On Wed, Sep 22, 2021 at 5:37 PM Gus Heck <[email protected]> wrote:
>>>
>>>>> why it often can’t find the collection it’s currently supposed to be
>>>>> creating
>>>>
>>>>
>>>> This sounds like things that pestered us while writing TRA tests. IIRC
>>>> the problem basically comes from 2 things: 1) we return from create
>>>> collection before the collection is fully created and ready to use, 2)
>>>> watching code to determine when it IS ready is non-trivial. I think #1 is
>>>> the real problem and #2 is a bandaid that shouldn't be needed.
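
A rough sketch of the kind of "wait until it IS ready" helper that point #2
refers to, written as a simple poll with a deadline rather than a real
watch. StateVisibilityWait and waitUntil are hypothetical names, and the
condition is whatever readiness check the caller supplies; nothing here is
taken from Solr's own test utilities.

```java
import java.time.Duration;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

/** Hypothetical sketch: poll a readiness condition until it holds or a timeout expires. */
class StateVisibilityWait {

    /**
     * Checks the condition at least once, then re-checks every pollInterval
     * until it returns true (result: true) or the timeout elapses (result: false).
     * If the waiting thread is interrupted, the interrupt flag is restored and false is returned.
     */
    static boolean waitUntil(BooleanSupplier condition, Duration timeout, Duration pollInterval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            // Subtraction form keeps the comparison safe if nanoTime wraps around.
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(pollInterval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        // Stand-in for "the collection shows up in my view of cluster state":
        // here it becomes true on the third check.
        AtomicInteger checks = new AtomicInteger();
        boolean visible = waitUntil(() -> checks.incrementAndGet() >= 3,
                                    Duration.ofSeconds(2), Duration.ofMillis(10));
        System.out.println("visible=" + visible + " after " + checks.get() + " checks");
    }
}
```

Polling like this is the bandaid variety: it hides the delay between the
create call returning and the local view catching up, but does not remove it.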
>>>>
>>>> I think I recall Mark previously ranting about how insane and terrible
>>>> it would be if an RDBMS did this with CREATE TABLE...
>>>>
>>>> On Wed, Sep 22, 2021 at 11:24 AM Ishan Chattopadhyaya <
>>>> [email protected]> wrote:
>>>>
>>>>> Sure, Mark.
>>>>> Noble or I will get to this at the earliest, hopefully by end of this
>>>>> week.
>>>>> Unfortunately, I haven't been paying attention to test failures lately.
>>>>>
>>>>> On Wed, Sep 22, 2021 at 8:09 PM Mark Miller <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Perhaps I just have a unique test running experience, but this test
>>>>>> has been an outlier failure test in my test runs for months. Given that
>>>>>> it’s newer than most tests, I imagine its attention-grabbing days are
>>>>>> on a downslope, so here is a poke if someone wants to check out why it
>>>>>> often
>>>>>> can’t find the collection it’s currently supposed to be creating.
>>>>>>
>>>>>>
>>>>>> --
>>>>>> - Mark
>>>>>>
>>>>>> http://about.me/markrmiller
>>>>>>
>>>>>
>>>>
>>>> --
>>>> http://www.needhamsoftware.com (work)
>>>> http://www.the111shift.com (play)
>>>>
>>> --
> - Mark
>
> http://about.me/markrmiller
>
-- 
- Mark

http://about.me/markrmiller
