This drives me crazy too.

+1 to Ilan's point.  For a CloudSolrClient, its knowledge of cluster state
should merely be a hint and not the final word -- it needs to go to ZK for
that.  For the HTTP-based ClusterStateProvider, the receiving Solr side needs
to use non-cached information -- it must always go to ZK (maybe toggle-able
with a param if need be).
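
To make the "hint, not the final word" idea concrete, here is a minimal
sketch in plain Java.  Everything in it is hypothetical (HintedStateLookup,
find, and the "authoritative" function standing in for a direct ZK read);
none of it is SolrJ API.  It only shows the shape: a cache miss is never
reported as "not found" without asking the authoritative source first.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Toy model: a local cache that is consulted first but never trusted for a miss. */
final class HintedStateLookup {
    // Locally cached view of collection state; may lag behind the real state.
    private final Map<String, String> cache = new ConcurrentHashMap<>();

    // Hypothetical stand-in for a direct, non-cached read (e.g. from ZK).
    private final Function<String, Optional<String>> authoritative;

    HintedStateLookup(Function<String, Optional<String>> authoritative) {
        this.authoritative = authoritative;
    }

    /** Returns the cached state if present, otherwise asks the authoritative source. */
    Optional<String> find(String collection) {
        String cached = cache.get(collection);
        if (cached != null) {
            return Optional.of(cached); // the hint was good enough
        }
        // Cache miss: the cache may simply be behind, so do not conclude "not found" yet.
        Optional<String> fresh = authoritative.apply(collection);
        fresh.ifPresent(state -> cache.put(collection, state));
        return fresh;
    }
}
```

Note that this sketch only covers the "missing from the cache" case from this
thread; a cached entry can be stale too, which a real client would have to
handle separately.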

Still, here's a public service announcement on a guarantee that ZooKeeper
does *not* have:
https://zookeeper.apache.org/doc/r3.5.9/zookeeperProgrammers.html#ch_zkGuarantees
See the note there on the lack of "Simultaneously Consistent Cross-Client Views".
After reading this (and being shocked by its implications), I added
https://github.com/apache/solr/blob/122c88a0748769432ef62cc3fb94c2226dd67aa7/solr/solrj/src/java/org/apache/solr/common/cloud/ZkStateReader.java#L2071
I also tried to highlight this; apparently not on the dev list (I can't
find it now), but at least in JIRA somewhere.
So maybe all ClusterStateProviders need to ask that a ZK "sync" be called
to guarantee the view is up-to-date?  I'm not sure what the cost is, but it
may be a cost we can't safely avoid.
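
For anyone who hasn't read that note, here is a toy model in plain Java of
why a read can miss a write that has already succeeded, and what a sync
changes.  This is NOT the ZooKeeper API; LaggingReplica and its methods are
made up for illustration.  In the real Java client the corresponding call is
the asynchronous ZooKeeper.sync(path, callback, ctx), issued before the read.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model only: one leader and one follower that applies updates late. */
final class LaggingReplica {
    // State committed by the ensemble (what the writing client observed).
    private final Map<String, String> leader = new HashMap<>();

    // State applied so far by the server that the reading client is connected to.
    private final Map<String, String> follower = new HashMap<>();

    /** Client A: the write succeeds as soon as the leader has it. */
    void writeViaLeader(String path, String data) {
        leader.put(path, data);
    }

    /** Client B: reads are served locally by the follower, which may be behind. */
    String readFromFollower(String path) {
        return follower.get(path);
    }

    /** Models a sync: the follower catches up with the leader before later reads. */
    void sync() {
        follower.clear();
        follower.putAll(leader);
    }
}
```

In this model, a readFromFollower call made right after writeViaLeader
returns null, and the same read made after sync() returns the written data.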

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Wed, Sep 22, 2021 at 6:26 PM Ilan Ginzburg <[email protected]> wrote:

> Not sure, Gus, that I would blame the create collection code. To the best of my
> recollection, when the create collection call returns the collection IS
> fully created.
> This doesn't mean though (and that's the problem IMO) that the cluster
> state on the node that issued the collection creation call is aware of it:
> its cache of cluster state is updated async at a later point (once Zookeeper
> watches decide it's time).
>
> I would tend to blame the way cluster state is managed in general in the
> cluster.
>
> I didn't look at this test specifically, so the actual issue might still
> be different.
>
> Ilan
>
> On Wed, Sep 22, 2021 at 5:37 PM Gus Heck <[email protected]> wrote:
>
>>> why it often can’t find the collection it’s currently supposed to be
>>> creating
>>
>>
>> This sounds like things that pestered us while writing TRA tests. IIRC
>> the problem basically comes from 2 things: 1) we return from create
>> collection before the collection is fully created and ready to use, 2)
>> watching code to determine when it IS ready is non-trivial. I think #1 is
>> the real problem and #2 is a bandaid that shouldn't be needed.
>>
>> I think I recall Mark previously ranting about how insane and terrible it
>> would be if an RDBMS did this with CREATE TABLE...
>>
>> On Wed, Sep 22, 2021 at 11:24 AM Ishan Chattopadhyaya <
>> [email protected]> wrote:
>>
>>> Sure, Mark.
>>> Noble or I will get to this at the earliest, hopefully by end of this
>>> week.
>>> Unfortunately, I haven't been paying attention to test failures lately.
>>>
>>> On Wed, Sep 22, 2021 at 8:09 PM Mark Miller <[email protected]>
>>> wrote:
>>>
>>>> Perhaps I just have a unique test running experience, but this test has
>>>> been an outlier failure test in my test runs for months. Given that it’s
>>>> newer than most tests, I imagine its attention-grabbing days are on a
>>>> downslope, so here is a poke if someone wants to check out why it often
>>>> can’t find the collection it’s currently supposed to be creating.
>>>>
>>>>
>>>> --
>>>> - Mark
>>>>
>>>> http://about.me/markrmiller
>>>>
>>>
>>
>> --
>> http://www.needhamsoftware.com (work)
>> http://www.the111shift.com (play)
>>
>
