On Thu, 28 Oct 2021 08:47:31 GMT, Aleksey Shipilev <sh...@openjdk.org> wrote:
> `Unsafe.{load|store}Fence` falls back to `unsafe.cpp` for
> `OrderAccess::{acquire|release}Fence()`. It seems too heavy-handed (useless?)
> to call to runtime for a single memory barrier. We can simplify the native
> `Unsafe` interface by falling back to `fullFence` when `{load|store}Fence`
> intrinsics are not available. This would be similar to what
> `Unsafe.{loadLoad|storeStore}Fences` do.
>
> This is the behavior of these intrinsics now, on x86_64, using benchmarks
> from JDK-8276054:
>
> Benchmark          Mode  Cnt   Score    Error  Units
>
> # Default
> Single.acquire     avgt    3   0.407 ±  0.060  ns/op
> Single.full        avgt    3   4.693 ±  0.005  ns/op
> Single.loadLoad    avgt    3   0.415 ±  0.095  ns/op
> Single.plain       avgt    3   0.406 ±  0.002  ns/op
> Single.release     avgt    3   0.408 ±  0.047  ns/op
> Single.storeStore  avgt    3   0.408 ±  0.043  ns/op
>
> # -XX:DisableIntrinsic=_storeFence
> Single.acquire     avgt    3   0.408 ±  0.016  ns/op
> Single.full        avgt    3   4.694 ±  0.002  ns/op
> Single.loadLoad    avgt    3   0.406 ±  0.002  ns/op
> Single.plain       avgt    3   0.406 ±  0.001  ns/op
> Single.release     avgt    3   4.694 ±  0.003  ns/op  <--- upgraded to full
> Single.storeStore  avgt    3   4.690 ±  0.005  ns/op  <--- upgraded to full
>
> # -XX:DisableIntrinsic=_loadFence
> Single.acquire     avgt    3   4.691 ±  0.001  ns/op  <--- upgraded to full
> Single.full        avgt    3   4.693 ±  0.009  ns/op
> Single.loadLoad    avgt    3   4.693 ±  0.013  ns/op  <--- upgraded to full
> Single.plain       avgt    3   0.408 ±  0.072  ns/op
> Single.release     avgt    3   0.415 ±  0.016  ns/op
> Single.storeStore  avgt    3   0.416 ±  0.041  ns/op
>
> # -XX:DisableIntrinsic=_fullFence
> Single.acquire     avgt    3   0.406 ±  0.014  ns/op
> Single.full        avgt    3  15.836 ±  0.151  ns/op  <--- calls runtime
> Single.loadLoad    avgt    3   0.406 ±  0.001  ns/op
> Single.plain       avgt    3   0.426 ±  0.361  ns/op
> Single.release     avgt    3   0.407 ±  0.021  ns/op
> Single.storeStore  avgt    3   0.410 ±  0.061  ns/op
>
> # -XX:DisableIntrinsic=_fullFence,_loadFence
> Single.acquire     avgt    3  15.822 ±  0.282  ns/op  <--- upgraded, calls runtime
> Single.full        avgt    3  15.851 ±  0.127  ns/op  <--- calls runtime
> Single.loadLoad    avgt    3  15.829 ±  0.045  ns/op  <--- upgraded, calls runtime
> Single.plain       avgt    3   0.406 ±  0.001  ns/op
> Single.release     avgt    3   0.414 ±  0.156  ns/op
> Single.storeStore  avgt    3   0.422 ±  0.452  ns/op
>
> # -XX:DisableIntrinsic=_fullFence,_storeFence
> Single.acquire     avgt    3   0.407 ±  0.016  ns/op
> Single.full        avgt    3  15.347 ±  6.783  ns/op  <--- calls runtime
> Single.loadLoad    avgt    3   0.406 ±  0.001  ns/op
> Single.plain       avgt    3   0.406 ±  0.002  ns/op
> Single.release     avgt    3  15.828 ±  0.019  ns/op  <--- upgraded, calls runtime
> Single.storeStore  avgt    3  15.834 ±  0.045  ns/op  <--- upgraded, calls runtime
>
> # -XX:DisableIntrinsic=_fullFence,_loadFence,_storeFence
> Single.acquire     avgt    3  15.838 ±  0.030  ns/op  <--- upgraded, calls runtime
> Single.full        avgt    3  15.854 ±  0.277  ns/op  <--- calls runtime
> Single.loadLoad    avgt    3  15.826 ±  0.160  ns/op  <--- upgraded, calls runtime
> Single.plain       avgt    3   0.406 ±  0.003  ns/op
> Single.release     avgt    3  15.838 ±  0.019  ns/op  <--- upgraded, calls runtime
> Single.storeStore  avgt    3  15.844 ±  0.104  ns/op  <--- upgraded, calls runtime
>
> Additional testing:
>  - [x] Linux x86_64 fastdebug `tier1`

I'm not quite seeing the motivation here. Your claim is that the
non-intrinsic implementations involve a native call, and so are too
expensive; yet the new code still relies on `fullFence` being intrinsified,
otherwise it is still a native call, and a heavier barrier at that.

If these fences were intrinsified piecemeal then perhaps this is an issue on
some platform, but is that really the case? If you intrinsified one, wouldn't
you intrinsify all?

-------------

PR: https://git.openjdk.java.net/jdk/pull/6149
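
For readers following the thread, the fallback chain described in the quoted text can be sketched roughly as below. This is only an illustration under stated assumptions, not the actual `Unsafe` code: the class `FenceFallbackSketch` and its call counter are hypothetical, and `VarHandle.fullFence()` stands in for the intrinsic (or native) full fence.

```java
import java.lang.invoke.VarHandle;

// Hypothetical sketch of the proposed fallback; not the real Unsafe source.
final class FenceFallbackSketch {
    // Counts how often the full fence is reached, for illustration only.
    static int fullFenceCalls = 0;

    // Stand-in for the full fence. In the thread, this is the one fence
    // that still calls into the runtime when it is not intrinsified.
    static void fullFence() {
        fullFenceCalls++;
        VarHandle.fullFence();
    }

    // Java bodies that run only when the matching intrinsic is unavailable:
    // the weaker fence is upgraded to the full fence.
    static void loadFence()  { fullFence(); }
    static void storeFence() { fullFence(); }

    // The even weaker fences delegate one level up, which matches the
    // loadLoad/storeStore rows marked "upgraded" in the benchmark table.
    static void loadLoadFence()   { loadFence(); }
    static void storeStoreFence() { storeFence(); }

    public static void main(String[] args) {
        loadLoadFence();
        storeStoreFence();
        System.out.println("full fences executed: " + fullFenceCalls);
    }
}
```

The sketch also shows the point of the objection: every path ends in `fullFence()`, so if that one is not intrinsified either, each weaker fence still pays for a runtime call, now with a stronger barrier.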