Hi Joe,
On 17-11-2016 19:33, joe darcy wrote:
>>>> Currently, optimization for building fdlibm is disabled, except for the
>>>> "solaris" OS target [1].
>>> The reason is that, historically, the Solaris compilers have had
>>> sufficient discipline and control regarding floating-point semantics and
>>> compiler optimizations to still implement the Java-mandated results when
>>> optimization was enabled. The gcc family of compilers, for example, has
>>> lacked such discipline.
>> Oh, I see. Thanks for clarifying that. I was wondering exactly why fdlibm
>> optimization is off even for x86_64, since, AFAICS (and only regarding gcc 5),
>> it does not affect precision, even though setting -O3 does not improve
>> performance as much as it does on PPC64.
>
> The fdlibm code relies on aliasing a two-element array of int with a double
> to do bit-level reads and writes of floating-point values. As I understand
> it, the C spec allows compilers to assume values of different types don't
> overlap in memory. The compilation environment has to be configured in such
> a way that the C compiler disables code generation and optimization
> techniques that would run afoul of these fdlibm coding practices.
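For comparison, a minimal Java sketch of the same bit-level accesses without any aliasing: Double.doubleToRawLongBits and Double.longBitsToDouble reinterpret the bits explicitly, so there is nothing for an optimizer to assume away. The helper names hi, lo and withHi are hypothetical, modeled on fdlibm's __HI/__LO macros; this is not the code of any particular port.

```java
public class FdlibmBits {
    // High 32 bits of the IEEE 754 representation (sign, exponent, top of the
    // significand): the word fdlibm's __HI(x) reads through an int* alias.
    // Negative x gives a negative int, like fdlibm's signed "hx".
    static int hi(double x) {
        return (int) (Double.doubleToRawLongBits(x) >>> 32);
    }

    // Low 32 bits of the significand: the word fdlibm's __LO(x) reads.
    static int lo(double x) {
        return (int) Double.doubleToRawLongBits(x);
    }

    // Returns x with its high word replaced and its low word kept,
    // which fdlibm does by assigning through __HI(x).
    static double withHi(double x, int high) {
        long bits = Double.doubleToRawLongBits(x);
        return Double.longBitsToDouble(((long) high << 32) | (bits & 0xFFFFFFFFL));
    }

    public static void main(String[] args) {
        // 1.0 is 0x3FF00000_00000000
        System.out.printf("hi(1.0) = 0x%08x, lo(1.0) = 0x%08x%n", hi(1.0), lo(1.0));
        // Putting the high word of 2.0 into 1.0 yields 2.0
        System.out.println(withHi(1.0, 0x40000000));
    }
}
```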
After discussing with the Power toolchain folks, we narrowed the issue on PPC64
down to FMA (fused multiply-add) contraction. -fno-strict-aliasing has no
effect: even when it is used together with aggressive optimization, the
precision issue remains. Thus -ffp-contract=off is the best option we have for
now to optimize fdlibm on PPC64.
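To illustrate why contraction alone can change results: a fused multiply-add rounds once, while a separate multiply and add round twice. A small Java sketch, assuming Java 9+ for Math.fma; Java semantics do not let the compiler fuse a * b + c on its own, so the two paths stay distinct. The operands are made up for illustration and are not taken from the PPC64 failures.

```java
public class FmaContraction {
    // Multiply, round, add, round: two roundings.
    static double separate(double a, double b, double c) {
        return a * b + c;
    }

    // Fused multiply-add: the exact a*b + c is rounded once.
    static double fused(double a, double b, double c) {
        return Math.fma(a, b, c);
    }

    public static void main(String[] args) {
        double e = 0x1.0p-30;   // 2^-30
        double a = 1.0 + e;     // exactly representable
        double b = 1.0 - e;     // exactly representable
        double c = -1.0;
        // The exact product is 1 - 2^-60, which rounds to 1.0 on its own,
        // so the separate path yields 0.0 while the fused path keeps -2^-60.
        System.out.println("separate = " + separate(a, b, c));
        System.out.println("fused    = " + fused(a, b, c));
    }
}
```

The separate path gives 0.0 and the fused path gives -2^-60: the same source expression, two different results, which is exactly the kind of divergence StrictMath cannot tolerate.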
>>> Methods in the Math class, such as pow, are often intrinsified and use a
>>> different algorithm, so a straight performance comparison may not be as
>>> fair or meaningful in those cases.
>> I agree. It's just that the issue with StrictMath methods was first noticed
>> because of that huge gap (Math vs StrictMath) on PPC64, which is not as
>> prominent on x64.
>
> Depending on how Math.{sin, cos} is implemented on PPC64, compiling the
> fdlibm sin/cos with more aggressive optimizations should not be expected to
> close the performance gap. In particular, if Math.{sin, cos} is an intrinsic
> on PPC64 (I haven't checked the sources) that uses a platform-specific
> feature (say, fused multiply-add instructions), then just compiling fdlibm
> more aggressively wouldn't necessarily make up that gap.
In our case (PPC64) it does close the gap. Non-optimized code suffers a lot,
for instance, from load-hit-store issues. Unlike on PPC64, the gap on x64 seems
to be quite small, as you said.
>
> To allow cross-platform and cross-release reproducibility, StrictMath is
> specified to use the particular fdlibm algorithms, which precludes using
> better algorithms developed more recently. If we were to start with a clean
> slate today, to get such reproducibility we would specify correctly-rounded
> behavior for all those methods, but such an approach was much less tractable
> technically 20+ years ago, without the benefit of the research that has been
> done in the interim, such as the work of Prof. Muller and associates:
> https://lipforge.ens-lyon.fr/projects/crlibm/.
>
>>
>>
>>> Accumulating the results of the functions and comparing the sums is not a
>>> sufficiently robust way of checking whether the optimized versions are
>>> indeed equivalent to the non-optimized ones. The specification of
>>> StrictMath requires a particular result for each set of floating-point
>>> arguments, and sums can round away low-order bits that differ.
>> That's a really good point, thanks for letting me know about that. I'll
>> re-test my change from that perspective.
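A minimal Java sketch of the point about sums, with made-up values chosen so that a one-ulp difference disappears in the sum but is caught by a per-element bitwise comparison. The helper names are hypothetical. Comparing Double.doubleToLongBits rather than using == distinguishes -0.0 from 0.0 and treats every NaN as equal to every other NaN.

```java
public class CompareResults {
    // Naive check: sum the results; low-order differences can be rounded away.
    static double sum(double[] xs) {
        double s = 0.0;
        for (double x : xs) {
            s += x;
        }
        return s;
    }

    // Per-element check: index of the first element whose bit pattern differs,
    // or -1 if all match. doubleToLongBits collapses all NaNs to one pattern.
    static int firstBitwiseMismatch(double[] expected, double[] actual) {
        if (expected.length != actual.length) {
            throw new IllegalArgumentException("length mismatch");
        }
        for (int i = 0; i < expected.length; i++) {
            if (Double.doubleToLongBits(expected[i]) != Double.doubleToLongBits(actual[i])) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Made-up values: the second element differs by one ulp.
        double[] reference = {1e16, 0.5};
        double[] optimized = {1e16, Math.nextUp(0.5)};
        System.out.println("sums equal: " + (sum(reference) == sum(optimized)));
        System.out.println("first mismatch: " + firstBitwiseMismatch(reference, optimized));
    }
}
```

This prints "sums equal: true" and "first mismatch: 1". The ulp of 1e16 is 2, so adding anything smaller than 1 rounds back to 1e16 and the one-ulp difference vanishes from the sum.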
>>
>>
>>> Running the JDK math library regression tests and corresponding JCK tests
>>> is recommended for work in this area.
>> Got it. Which test suite exactly do you mean by "the JDK math library
>> regression tests"? The jtreg tests?
>
> Specifically, the regression tests under test/java/lang/Math and
> test/java/lang/StrictMath in the jdk repository. There are some other math
> library tests in the hotspot repo, but I don't know where they are offhand.
>
> A note on methodology: when writing tests for my port, I've tried to
> include test cases that exercise all the branch points in the code. Due to
> the large input space (~2^64 for a single-argument method), random sampling
> alone is an inefficient way to find differences in behavior.
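One way to act on this, sketched in Java under stated assumptions: derive inputs from the high-word thresholds that fdlibm's branches test (ix = __HI(x) & 0x7fffffff), take the values on both sides of each threshold, and compare a candidate implementation bitwise against StrictMath. The thresholds 0x3fe921fb (|x| around pi/4, the kernel path) and 0x7ff00000 (Inf/NaN) are how I recall fdlibm's s_sin.c and should be checked against the actual source; the helper names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.DoubleUnaryOperator;

public class BranchBoundaryTest {
    // The double with the given high word and a zero low word, i.e. the
    // smallest magnitude that has that high word.
    static double fromHighWord(int highWord) {
        return Double.longBitsToDouble((long) highWord << 32);
    }

    // For each high word, emit the boundary value and its immediate neighbors,
    // with both signs, so inputs land on both sides of the branch.
    static double[] boundaryInputs(int... highWords) {
        List<Double> out = new ArrayList<>();
        for (int hw : highWords) {
            double t = fromHighWord(hw);
            for (double x : new double[] {Math.nextDown(t), t, Math.nextUp(t)}) {
                out.add(x);
                out.add(-x);
            }
        }
        double[] result = new double[out.size()];
        for (int i = 0; i < result.length; i++) {
            result[i] = out.get(i);
        }
        return result;
    }

    // Index of the first input where the two functions disagree bitwise, or -1.
    static int firstMismatch(DoubleUnaryOperator reference,
                             DoubleUnaryOperator candidate,
                             double[] inputs) {
        for (int i = 0; i < inputs.length; i++) {
            long r = Double.doubleToLongBits(reference.applyAsDouble(inputs[i]));
            long c = Double.doubleToLongBits(candidate.applyAsDouble(inputs[i]));
            if (r != c) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // s_sin.c (as recalled): "ix <= 0x3fe921fb" takes the kernel path, so the
        // boundary is at the next high word; "ix >= 0x7ff00000" catches Inf/NaN.
        double[] inputs = boundaryInputs(0x3fe921fb + 1, 0x7ff00000);
        // Placeholder candidate: StrictMath.sin itself, so this prints -1.
        System.out.println(firstMismatch(StrictMath::sin, StrictMath::sin, inputs));
    }
}
```

For a comparison written as ix <= H, pass H + 1 so both sides of the branch are covered; for ix >= H, pass H. A real check would replace the second StrictMath::sin with the optimized build under test, reached through some hypothetical test hook.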
>> For testing against the JCK/TCK, I'll need some help.
>>
>
> I believe the JCK/TCK does have additional testcases relevant here.
>
> HTH; thanks,
>
> -Joe
>
Thank you very much for the valuable comments.
I'll send a webrev accordingly for review.
I filed a bug: https://bugs.openjdk.java.net/browse/JDK-8170153
Best regards,
Gustavo