On Mon, 22 Aug 2022 16:49:34 GMT, Martin Doerr <mdo...@openjdk.org> wrote:

> My concern is that we may not notice implementation problems any more when 
> retrying so often. Accidental cache line sharing should better get fixed in 
> the tests if possible. Context switching or cache capacity limits may cause 1 
> failure, not 100. What do you think?

I would not dare to say that 100 retries is going to be enough for everything; I 
don't even dare to say it is "too many": my test experiments ran with tens of 
thousands of retries before.

Here is the thing, though: on weak LL/SC hardware, we would enter the same kind 
of (but unbounded!) retry loop inside the strong CAS implementation. If the 
implementation is laggy, we would spin there a lot. In fact, I don't think 
anyone has actually measured how many retries we need for a strong CAS to 
succeed on such platforms. But as long as such a CAS eventually succeeds, it 
looks to be a performance question, not a functionality one. The tests affected 
by this PR are testing the functional part: weak CAS _eventually_ succeeds after 
aggressive retries/backoffs.
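To make that functional claim concrete, here is a minimal sketch (hypothetical names, not the affected test code) of a bounded retry loop around a weak CAS via `VarHandle.weakCompareAndSetPlain`, which is spec'd to be allowed to fail spuriously; a strong CAS on LL/SC hardware performs essentially the same loop, just unbounded:

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

// Hypothetical sketch: retry a weak CAS a bounded number of times.
public class WeakCasRetry {
    static volatile int value = 0;
    static final VarHandle VALUE;
    static {
        try {
            VALUE = MethodHandles.lookup()
                    .findStaticVarHandle(WeakCasRetry.class, "value", int.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // Weak CAS may fail spuriously even without contention, so retry up to
    // maxRetries times; how many retries suffice is the open question above.
    static boolean casWithRetries(int expected, int update, int maxRetries) {
        for (int i = 0; i < maxRetries; i++) {
            if (VALUE.weakCompareAndSetPlain(expected, update)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Uncontended, so this is overwhelmingly likely to succeed quickly.
        System.out.println(casWithRetries(0, 1, 100));
        System.out.println(value);
    }
}
```

The retry bound (100 here) is exactly the knob under discussion: too low and spurious failures look like test bugs; very high and real implementation problems may go unnoticed.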

The failing platforms I have (RISC-V) are remarkably slow, which makes 
dissecting what might be going on there tedious. We can dive in, for sure, but I 
reckon that would take weeks to resolve. It can wait, if you feel strongly about 
it.

-------------

PR: https://git.openjdk.org/jdk/pull/9889
