On Wed, 12 Oct 2022 17:00:15 GMT, Andrew Haley <a...@openjdk.org> wrote:
>> A bug in GCC causes shared libraries linked with -ffast-math to disable
>> denormal arithmetic. This breaks Java's floating-point semantics.
>>
>> The bug is https://gcc.gnu.org/bugzilla/show_bug.cgi?id=55522
>>
>> One solution is to save and restore the floating-point control word around
>> System.loadLibrary(). This isn't perfect, because some shared library might
>> load another shared library at runtime, but it's a lot better than what we
>> do now.
>>
>> However, this fix is not complete. `dlopen()` is called from many places in
>> the JDK. I guess the best thing to do is find and wrap them all. I'd like to
>> hear people's opinions.
>
> Andrew Haley has updated the pull request incrementally with one additional
> commit since the last revision:
>
>   8295159: DSO created with -ffast-math breaks Java floating-point arithmetic

I now have some performance results. `java.lang.foreign.CallOverheadConstant`
is the test that I used to measure JNI overhead.

At present, without `-XX:+RestoreMXCSROnJNICalls`, it looks like this:

    Benchmark                          Mode  Cnt   Score   Error  Units
    CallOverheadConstant.jni_blank     avgt   40   9.968 ± 0.037  ns/op
    CallOverheadConstant.panama_blank  avgt   40   8.745 ± 0.012  ns/op

Enabling `-XX:+RestoreMXCSROnJNICalls` makes the overhead much worse:

    Benchmark                          Mode  Cnt   Score   Error  Units
    CallOverheadConstant.jni_blank     avgt   40  14.741 ± 0.031  ns/op
    CallOverheadConstant.panama_blank  avgt   40  14.620 ± 0.022  ns/op

and with JMH perfasm we can see why:

              0x00007f9f43d5698d: sub    rsp,0x8
      1.56%   0x00007f9f43d56991: vstmxcsr DWORD PTR [rsp]
     25.01%   0x00007f9f43d56996: mov    eax,DWORD PTR [rsp]
     11.09%   0x00007f9f43d56999: and    eax,0xffc0
              0x00007f9f43d5699e: cmp    eax,DWORD PTR [rip+0xe02d234]        # 0x00007f9f51d83bd8

That adds about 50% to the total JNI overhead and about 70% to the Panama
overhead. 25% of the total elapsed time is MXCSR! Reading MXCSR is expensive.
So we don't do that.

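As a rough cross-check of those percentages, here is a minimal sketch in Java
that recomputes the added overhead from the scores in the two tables above.
The class name `OverheadCheck` and the method `addedOverheadPercent` are
hypothetical names chosen for illustration; they are not part of the JDK or of
the benchmark.

```java
public class OverheadCheck {
    // Relative slowdown, in percent, of `after` compared with `before`.
    // Both arguments are average times in ns/op.
    static double addedOverheadPercent(double before, double after) {
        return (after - before) / before * 100.0;
    }

    public static void main(String[] args) {
        // Scores copied from the two benchmark runs above (ns/op).
        double jniBase = 9.968, jniRestore = 14.741;
        double panamaBase = 8.745, panamaRestore = 14.620;

        System.out.printf("JNI:    +%.1f%%%n",
                addedOverheadPercent(jniBase, jniRestore));
        System.out.printf("Panama: +%.1f%%%n",
                addedOverheadPercent(panamaBase, panamaRestore));
    }
}
```

This gives roughly 48% for JNI and 67% for Panama, which is where the rounded
50% and 70% figures come from.
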
So, after a lot of head scratching, I've invented an instruction sequence
which doesn't read MXCSR but does a little arithmetic. With that sequence,
`-XX:+RestoreMXCSROnJNICalls` gives:

    CallOverheadConstant.jni_blank     avgt   40  10.675 ± 0.100  ns/op
    CallOverheadConstant.panama_blank  avgt   40  10.284 ± 0.018  ns/op

That is 7% added overhead for JNI and 17% for Panama. 1 ns is 3.5 machine
cycles: that's a bit less than the latency of a load from L1 cache.

I'm wondering if I could get away with fixing `RestoreMXCSROnJNICalls` and
turning it on by default.

-------------

PR: https://git.openjdk.org/jdk/pull/10661