On Tue, 30 Aug 2022 14:18:11 GMT, Aleksey Shipilev <sh...@openjdk.org> wrote:

>> We have a few reports that existing Weak* VarHandle tests are still flaky, 
>> for example on large AArch64 machines or small RISC-V machines.
>> 
>> The flakiness is intrinsic to the nature of Weak* operations under tests, 
>> that can spuriously fail. The last attempt to fix these was 
>> [JDK-8155739](https://bugs.openjdk.org/browse/JDK-8155739). We need to 
>> strengthen these a bit more.
>> 
>> The actual values depend on the successful testing on known-failing 
>> platforms. I ballparked bumping the attempts 5x and introducing the delay 
>> would help without exploding test time in worst cases.
>
> Aleksey Shipilev has updated the pull request with a new target base due to a 
> merge or a rebase. The pull request now contains 12 commits:
> 
>  - Rewind back to 100 attempts, 1ms delay
>  - Merge branch 'master' into JDK-8292407-varhandle-weak-resilient
>  - Merge branch 'master' into JDK-8292407-varhandle-weak-resilient
>  - Pull Handles.get out of the weak retry loop
>  - Drop weakDelay to 1
>  - Merge branch 'master' into JDK-8292407-varhandle-weak-resilient
>  - Rework timeouts
>  - Merge branch 'master' into JDK-8292407-varhandle-weak-resilient
>  - Merge branch 'master' into JDK-8292407-varhandle-weak-resilient
>  - Update copyrights
>  - ... and 2 more: https://git.openjdk.org/jdk/compare/b3450e93...fd6aa17b

So, I have been playing with a HiFive Unmatched board, and I can see that even
in single-threaded mode weak CASes can spuriously fail when JIT compiler
threads are very active at the same time. I suspect L1-tagging LR/SC fails just 
because we context-switch out a lot. 

On HiFive Unmatched, without any delay, it is common to see long-tailed 
attempt counts that exceed 200 attempts. Adding the 1ms delay drops that long 
tail to fit under 10 attempts; I suspect because it provides a natural backoff 
on an over-subscribed system, as the thread would come back later if no
resources are currently available. Still, that does not seem to hold when the
rest of the system is running other parallel tests. Running
`java/lang/invoke/VarHandles` with 50 attempts and 1ms delay still
occasionally fails the weak CAS tests. 

Running with 100 attempts and 1ms delay seems to pass it well (tried several 
times) -- which is what the current version does. I think this is as good as it 
gets for a default mode. I also ran the tests on other platforms, and there
spurious failures are either not observed at all (x86 and friends), or we have
1..2 retries very rarely (for example, on AArch64 without LSE). This means the 
current limit of 100 attempts would almost never be reached on those platforms,
and thus we can avoid introducing any platform-specific test config selection.
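
For readers who have not looked at the tests, here is a minimal sketch of the
retry-with-delay pattern discussed above. It is not the actual test code: the
class, field, and method names (`WeakCasRetry`, `value`, `weakCasWithRetry`)
are hypothetical, and only the 100 attempts / 1ms delay values come from this
message.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

// Hypothetical illustration; names are not taken from the real VarHandle tests.
final class WeakCasRetry {
    // Values discussed in this message: 100 attempts, 1ms delay between them.
    static final int WEAK_ATTEMPTS = 100;
    static final long WEAK_DELAY_MS = 1;

    int value;

    static final VarHandle VALUE;
    static {
        try {
            VALUE = MethodHandles.lookup()
                    .findVarHandle(WeakCasRetry.class, "value", int.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // Retries a weak CAS until it succeeds or the attempt budget runs out.
    // Returns the number of attempts used, or -1 if every attempt failed
    // (or the thread was interrupted while backing off).
    static int weakCasWithRetry(WeakCasRetry target, int expected, int update) {
        for (int attempt = 1; attempt <= WEAK_ATTEMPTS; attempt++) {
            // A weak CAS may fail spuriously even when the value matches.
            if (VALUE.weakCompareAndSetPlain(target, expected, update)) {
                return attempt;
            }
            try {
                // Back off briefly so an over-subscribed system can make progress.
                Thread.sleep(WEAK_DELAY_MS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return -1;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        WeakCasRetry t = new WeakCasRetry();
        int attempts = weakCasWithRetry(t, 0, 42);
        System.out.println("attempts used: " + attempts + ", value: " + t.value);
    }
}
```

On platforms where spurious failures are rare, the loop should exit on the
first or second attempt, so the larger budget costs nothing there; the delay
is only paid after a failed attempt.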

-------------

PR: https://git.openjdk.org/jdk/pull/9889
