[
https://issues.apache.org/jira/browse/NUMBERS-143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17032038#comment-17032038
]
Alex Herbert commented on NUMBERS-143:
--------------------------------------
The following methods for computing the result sqrt(x^2 + y^2) were tested. All
methods compute a standard precision result that is passed to Math.sqrt(double).
||Method||Description||
|xx+yy|x*x + y*y|
|dekker|Dekker's method for extended precision multiplication and summation|
|fma|Math.fma(x, x, y*y)|
|hypot|Uses the Math.hypot implementation of the fdlibm hypot function|
|hypot2|Uses a re-implementation of the fdlibm hypot function with Dekker's
split to compute the extended precision numbers|
All methods except the standard x*x+y*y use some form of extended precision and
are sensitive to the input order of x and y. They thus require a sort on
magnitude.
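For concreteness, a minimal sketch of the {{fma}} method including the
magnitude sort (edge case and overflow handling omitted; this is an
illustration, not the benchmarked code):
{code:java}
// Sketch of the fma method: sort by magnitude then a single fused
// multiply-add. Requires Java 9 for Math.fma.
static double hypotFma(double x, double y) {
    x = Math.abs(x);
    y = Math.abs(y);
    if (x < y) {
        // sort so that |x| >= |y|; the larger term is computed exactly by the fma
        final double t = x;
        x = y;
        y = t;
    }
    // x*x incurs no round-off inside the fma; only y*y is rounded
    return Math.sqrt(Math.fma(x, x, y * y));
}
{code}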
Note: The reference implementation for hypot from fdlibm only sorts on the
upper 32-bits of the two input values (i.e. the exponent and the first 20 bits
of the mantissa). There are edge cases where numbers differing only in the
lower 32-bits have the result hypot(x, y) != hypot(y, x). This breaks the ISO
C99 specification for hypot. The Java implementation does not have this bug,
and all my implementations fix it too. If the bug is left in then the code is
faster, but it is not ISO compliant about 1 in a million times with
fixed-exponent random numbers.
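A sketch of the two sort strategies (illustrative, not the fdlibm source):
{code:java}
// fdlibm-style ordering uses only the high 32 bits (sign removed), so values
// differing only in the lower 32-bits of the mantissa compare as equal in
// magnitude.
static boolean fdlibmOrdered(double x, double y) {
    final int hx = (int) (Double.doubleToRawLongBits(x) >>> 32) & 0x7fffffff;
    final int hy = (int) (Double.doubleToRawLongBits(y) >>> 32) & 0x7fffffff;
    return hx >= hy;  // can hold for both (x, y) and (y, x)
}

// Sorting on the full magnitude removes the asymmetry.
static boolean fullyOrdered(double x, double y) {
    return Math.abs(x) >= Math.abs(y);
}
{code}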
h2. Accuracy Results
Accuracy was assessed by computing the expected value using Dekker's method for
double precision sqrt. This is similar to the {{dekker}} method but maintains
the extended precision result of the sum x^2 + y^2. The round-off from this
is used to compute an extended precision sqrt. In all tests I ran, this
produced the same result as using BigDecimal with JDK 9's BigDecimal sqrt
function on normal numbers. Using the Dekker sqrt allows much faster
computation than BigDecimal, so more data can be tested. Sub-normal numbers
require BigDecimal, so the datasets for sub-normal numbers are smaller.
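For reference, a minimal sketch of the BigDecimal comparison, assuming JDK 9's
{{BigDecimal.sqrt}} (the method name is illustrative):
{code:java}
import java.math.BigDecimal;
import java.math.MathContext;

// Exact x*x + y*y followed by a Decimal128 precision square root (JDK 9+).
static BigDecimal referenceHypot(double x, double y) {
    final BigDecimal bx = new BigDecimal(x);  // new BigDecimal(double) is exact
    final BigDecimal by = new BigDecimal(y);
    return bx.multiply(bx).add(by.multiply(by)).sqrt(MathContext.DECIMAL128);
}
{code}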
The use of Dekker's sqrt as the reference is only for the initial phase of the
test. The result of each method is computed. If all are the same as the
reference then this is scored as the same. Effectively the 'same' score
indicates this is about as good as we can get when using extended precision but
avoiding BigDecimal. If any results are different then the exact result is
computed using BigDecimal to Decimal128 precision. This is enough so that a
distance from the exact result can be measured in fractional units of least
precision (ULPs) of the smallest double surrounding the exact result. This is
[Kahan's|https://eecs.berkeley.edu/~wkahan/LOG10HAF.TXT] definition of ulp(x)
as:
{noformat}
"the gap between the two floating-point numbers nearest x".
{noformat}
This is different from the output of Math.ulp(x) when x falls on the boundary
of an exponent change; otherwise it is the same. For example this specifies
ulp(1.0) as epsilon/2 not epsilon, i.e. the ulp distance of a power of 2 is the
distance to the next lower number:
{noformat}
-----|------|------------|------------|--
  1-eps/2   1          1+eps       1+2eps
{noformat}
Under this definition, a computed result of the next double above 1 is 2 ulps
from an exact answer of 1. This has significance for complex cis numbers
whose magnitude should be exactly 1.
The distance of each result to the exact result is expressed relative to the
ulp of the exact result. The target is to be below 1. Below 0.5 indicates that
there is no closer double value to the exact result.
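A possible implementation of this ulp definition (the method name is
illustrative; infinities and NaN are not handled):
{code:java}
// Kahan's ulp(x): the gap between the two doubles nearest x. This differs
// from Math.ulp(x) only when x is a power of 2, where the gap below x is
// half the gap above it.
static double kahanUlp(double x) {
    x = Math.abs(x);
    return Math.ulp(Math.nextDown(x));
}
{code}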
h3. Datasets
||Dataset||Description||
|cis|A random angle theta is generated in [0, pi) for a cis complex number x=cos(theta), y=sin(theta). The magnitude should be 1.0 but may not be due to round-off in representing the sin and cos.|
|cis2|A random vector is generated using a normalised Gaussian sample (u=0, s=1) for x and y. The magnitude is computed and the vector normalised to length 1. This avoids using the trigonometric functions to compute the cis number.|
|polar|A random angle theta is generated in [0, pi) and a magnitude m in [0, 2) for a complex number x=m*cos(theta), y=m*sin(theta).|
|uniform|Creates x and y by sampling uniformly from [0, 1).|
|gaussian|A random vector is generated using a normalised Gaussian sample for x and y (u=0, s=1). The magnitude is not normalised.|
|range_pA_pB|Create random numbers x and y using a random 52-bit mantissa and the exponent set to +A for x and +B for y. Allows control of the relative magnitude of x and y (see the sketch after this table).|
|range_mA_mB|Create random numbers x and y using a random 52-bit mantissa and the exponent set to -A for x and -B for y. Allows control of the relative magnitude of x and y.|
|log-uniform|Create random numbers x and y using a random 52-bit mantissa and random exponents sampled uniformly from [-64, 64]. This range is arbitrary but large enough that matching exponents are rare (1 in 129).|
|mA sub-B|Create a random number x using a random 52-bit mantissa and the exponent set to -A. Create a random number y as a sub-normal number with B bits in the mantissa.|
|sub-A sub-B|Create two sub-normal numbers with A bits and B bits in the mantissa.|
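As a sketch, the fixed-exponent values can be constructed directly from the
IEEE 754 bit layout (the helper name is hypothetical):
{code:java}
import java.util.concurrent.ThreadLocalRandom;

// Build a double from a random 52-bit mantissa and a chosen unbiased
// exponent. The result is uniform in [2^exponent, 2^(exponent+1)).
static double randomFixedExponent(int exponent) {
    final long mantissa = ThreadLocalRandom.current().nextLong() >>> 12;
    return Double.longBitsToDouble(((long) (exponent + 1023) << 52) | mantissa);
}
{code}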
h3. Normal numbers
The table shows the number of samples, the fraction where all methods produced
the same result, and, when they differed, the maximum error in relative ulps
and the fraction of times the method was ok (matched the closest double to the
exact result).
||Dataset||Samples||Same||xx+yy error||xx+yy ok||dekker error||dekker ok||fma error||fma ok||hypot error||hypot ok||hypot2 error||hypot2 ok||
|cis|17179869184|0.73900|0.99996|0.82771|0.75000|0.78846|0.87499|0.78665|0.75000|0.81445|0.75000|0.81445|
|cis2|17179869184|0.79003|0.99991|0.85442|0.75000|0.82635|0.87500|0.82575|0.75000|0.84298|0.75004|0.84298|
|polar|17179869184|0.79680|1.21106|0.84211|0.85337|0.88288|1.01898|0.88014|0.96435|0.88119|0.96435|0.88119|
|uniform|17179869184|0.78512|1.21388|0.83362|0.85314|0.87221|1.02107|0.86893|0.96314|0.87063|0.96314|0.87063|
|gaussian|17179869184|0.78499|1.21062|0.83301|0.85334|0.87249|1.01854|0.86981|0.96869|0.87088|0.96869|0.87088|
|range_p0_p0|17179869184|0.79547|1.21498|0.84437|0.85253|0.88097|1.02279|0.87351|0.87224|0.88053|0.87224|0.88053|
|range_p0_p1|17179869184|0.77710|1.20136|0.82836|0.85325|0.86580|0.94046|0.86350|0.96684|0.86193|0.96684|0.86193|
|range_p0_p2|17179869184|0.78224|1.20161|0.82912|0.85312|0.86974|0.87358|0.86953|0.89684|0.86890|0.89666|0.86890|
|range_p0_p3|17179869184|0.78326|1.20006|0.82749|0.85310|0.87001|0.85810|0.87000|0.86346|0.86996|0.86346|0.86996|
|range_p0_p4|17179869184|0.78337|1.20307|0.82692|0.85319|0.87003|0.85419|0.87003|0.85537|0.87003|0.85537|0.87003|
|range_m1_m1|17179869184|0.79546|1.21180|0.84436|0.85068|0.88097|1.02544|0.87351|0.87350|0.88052|0.87350|0.88052|
|range_m1_m2|17179869184|0.77709|1.20329|0.82836|0.85331|0.86580|0.93927|0.86349|0.96876|0.86193|0.96876|0.86193|
|range_m1_m3|17179869184|0.78225|1.20053|0.82912|0.85328|0.86974|0.87329|0.86953|0.89568|0.86890|0.89497|0.86890|
|range_m1_m4|17179869184|0.78327|1.20052|0.82749|0.85330|0.87001|0.85830|0.87000|0.86327|0.86996|0.86327|0.86996|
|range_m1_m5|17179869184|0.78338|1.20251|0.82693|0.85336|0.87004|0.85423|0.87004|0.85528|0.87003|0.85528|0.87003|
|range_p128_p128|17179869184|0.79547|1.21224|0.84437|0.85108|0.88097|1.02458|0.87351|0.87279|0.88052|0.87279|0.88052|
|range_p128_p129|17179869184|0.77710|1.20129|0.82837|0.85330|0.86580|0.93925|0.86350|0.96611|0.86193|0.96611|0.86193|
|range_p128_p130|17179869184|0.78225|1.20361|0.82912|0.85333|0.86974|0.87436|0.86953|0.89569|0.86890|0.89569|0.86890|
|range_p128_p131|17179869184|0.78327|1.20486|0.82750|0.85337|0.87001|0.85823|0.87000|0.86323|0.86996|0.86323|0.86996|
|range_p128_p132|17179869184|0.78337|1.20341|0.82692|0.85330|0.87003|0.85448|0.87003|0.85555|0.87003|0.85555|0.87003|
|range_m128_m128|17179869184|0.79547|1.21313|0.84437|0.85108|0.88097|1.02269|0.87352|0.87294|0.88053|0.87294|0.88053|
|range_m128_m129|17179869184|0.77710|1.19993|0.82837|0.85327|0.86580|0.93951|0.86349|0.96956|0.86193|0.96956|0.86193|
|range_m128_m130|17179869184|0.78225|1.20425|0.82912|0.85317|0.86974|0.87377|0.86953|0.89632|0.86890|0.89632|0.86890|
|range_m128_m131|17179869184|0.78327|1.20026|0.82750|0.85298|0.87001|0.85817|0.87000|0.86340|0.86995|0.86340|0.86995|
|range_m128_m132|17179869184|0.78337|1.20259|0.82692|0.85328|0.87003|0.85419|0.87003|0.85517|0.87003|0.85517|0.87003|
|log-uniform|17179869184|0.78478|1.20308|0.82849|0.85319|0.87120|1.01436|0.87079|0.96162|0.87088|0.96162|0.87088|
Observations
* Just using x*x+y*y is correct most of the time (82%). The maximum error is
<1.22 ulp. It is the best method for cis numbers (i.e. hypot(x, y)=1.0) in
terms of the fraction of matches to the correct answer.
* Use of extended precision on only x is not enough to achieve <1 ulp
accuracy, as FMA achieves only <1.03 ulp accuracy when the exponents are
equal.
* hypot achieves 0.96 ulp accuracy.
* Dekker is the best at 0.86 ulp accuracy.
* There is no difference between hypot and hypot2 in these results to 5
significant figures. Differences are observed about 1 in a billion times and
are neither better nor worse on average (data not shown).
* FMA is better than the hypot method when the exponents are not equal. It is
worse when the exponents are equal.
* The FMA method's max error reduces as the gap in the exponents increases,
i.e. the round-off error in y*y becomes less significant. It does not matter
whether the values are smaller or larger than 1.
* The hypot method's max error jumps to worse when the exponent difference
moves from 0 to a gap of 1. This is due to the switch of algorithm from
extended precision on x and y to a different computation with extended
precision only on x.
* Dekker's method is consistent for its max error across the datasets (a
sketch of this extended precision computation follows this list). The
exception is cis numbers where the max error of 0.75 corresponds to computing
Math.nextDown(1.0) when the actual answer was 1.0 + 2^-54 (which cannot be
represented). If the exact result were any higher then Dekker's method would
compute either 1.0 or Math.nextUp(1.0), i.e. a double flanking the exact
result. This accuracy is also achieved by hypot but not FMA or the standard
precision computation.
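For reference, a minimal sketch of the extended precision x^2 + y^2
evaluation along the lines of the {{dekker}} method (overflow guards and edge
cases are omitted; this is an illustration, not the benchmarked code):
{code:java}
// Square a with its round-off term using Dekker's split and two-product.
// Assumes |a| is small enough that the split multiplier does not overflow.
static double[] squareExtended(double a) {
    final double c = (0x1.0p27 + 1) * a;
    final double hi = c - (c - a);  // high ~26 bits of a
    final double lo = a - hi;       // remaining low bits
    final double p = a * a;
    final double e = ((hi * hi - p) + 2 * hi * lo) + lo * lo;  // round-off of a*a
    return new double[] {p, e};
}

static double hypotDekker(double x, double y) {
    x = Math.abs(x);
    y = Math.abs(y);
    if (x < y) {  // sort so |x| >= |y|
        final double t = x;
        x = y;
        y = t;
    }
    final double[] xx = squareExtended(x);
    final double[] yy = squareExtended(y);
    final double s = xx[0] + yy[0];        // fast-two-sum (valid as xx[0] >= yy[0])
    final double r = yy[0] - (s - xx[0]);  // round-off of the sum
    // Fold all the round-off terms back into the standard precision result.
    return Math.sqrt(s + (r + (xx[1] + yy[1])));
}
{code}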
h3. Sub-normal numbers
||Dataset||Samples||Same||xx+yy error||xx+yy ok||dekker error||dekker ok||fma error||fma ok||hypot error||hypot ok||hypot2 error||hypot2 ok||
|m1022 sub-52|100663296|0.76693|1.19304|0.82828|0.85235|0.86786|0.93394|0.86665|1.19304|0.82742|0.95564|0.86569|
|m1022 sub-51|100663296|0.78281|1.19837|0.82816|0.85157|0.86991|0.86466|0.86981|1.19837|0.82816|0.88753|0.86950|
|sub-52 sub-52|100663296|0.75711|1.09731|0.80257|0.92538|0.80771|0.99137|0.80674|1.09055|0.80124|0.96407|0.80756|
|sub-52 sub-51|100663296|0.77987|1.08457|0.81104|0.92379|0.81175|0.96180|0.81172|1.08569|0.81102|0.96896|0.81166|
|sub-52 sub-50|100663296|0.80036|1.09044|0.82478|0.92507|0.82510|0.93468|0.82510|1.09044|0.82477|0.94702|0.82510|
|sub-32 sub-32|100663296|1.00000|0.50000|1.00000|0.50000|1.00000|0.50000|1.00000|0.50000|1.00000|0.50000|1.00000|
Observations
* x*x+y*y is still correct most of the time (82%). The maximum error is still
around <1.22 ulp.
* hypot has the same maximum error as the standard precision computation. This
is because the method does not use a split representation for the computation
and falls back to the standard precision computation. The reason is that the
split is performed by manipulating the exponent, assuming a leading 1 on the
mantissa. This is not possible with sub-normal numbers, which all have the
same exponent; the upper bits may be zero and only the lower 32-bits of the
mantissa may contain the number.
* Dekker's method has a maximum error of <0.93 ulp. This is higher than the
result for normal numbers. It is due to a second rounding that occurs when
scaling a result computed as a normal number to a sub-normal number. This is
why BigDecimal must be used to compute the exact reference result, as the
Dekker sqrt suffers from the same rounding inaccuracy.
* Very small sub-normal numbers are easier. More data may have shown
differences, but these results show all methods are equal and as good as it
is possible to get (0.5 ulp).
h2. Performance
Tested using JMH with different datasets:
||Dataset||Description||
|cis|A random angle theta is generated in [0, pi) for a cis complex number x=cos(theta), y=sin(theta). The magnitude should be 1.0.|
|uniform|Creates x and y by sampling uniformly from [0, 1).|
|gaussian|A random vector is generated using a normalised Gaussian sample for x and y (u=0, s=1). The magnitude is not normalised.|
|log-uniform|Create random numbers x and y using a random 52-bit mantissa and random exponents sampled uniformly from [-64, 64].|
|fixed-exponent|Create random numbers x and y using a random 52-bit mantissa and a fixed exponent of 0. The numbers are uniform in [1, 2).|
|edge|Create random numbers from POSITIVE_INFINITY, NEGATIVE_INFINITY, MAX_VALUE, -MAX_VALUE, MIN_VALUE, -MIN_VALUE, 0.0, -0.0, NaN.|
I tested by computing the hypot value for the complex numbers in a loop and
returning a newly allocated array of double values to JMH. A variant that used
a blackhole to consume each value showed similar conclusions. Data sizes of
10, 100, 1000, 10000 and 100000 were tested. The same conclusions can be
reached using only 10 and 10000 so I only show those results.
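A minimal sketch of the benchmark shape, assuming JMH annotations and an
illustrative class name (only the uniform dataset and Math.hypot are shown):
{code:java}
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Benchmark)
public class HypotPerformance {
    @Param({"10", "10000"})
    private int size;
    private double[] x;
    private double[] y;

    @Setup
    public void setup() {
        final java.util.SplittableRandom rng = new java.util.SplittableRandom(42);
        x = new double[size];
        y = new double[size];
        for (int i = 0; i < size; i++) {
            x[i] = rng.nextDouble();  // the 'uniform' dataset
            y[i] = rng.nextDouble();
        }
    }

    @Benchmark
    public double[] mathHypot() {
        final double[] r = new double[size];
        for (int i = 0; i < size; i++) {
            r[i] = Math.hypot(x[i], y[i]);
        }
        return r;  // returned to JMH to prevent dead code elimination
    }
}
{code}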
Note that the methods being tested contain many branches to check for
magnitudes, edge cases of infinity and NaN, zeros, etc. Branch prediction can
make a good effort to memorise the branches that should be taken on a small
dataset. So the size 10 result typically shows how fast the method is when
the correct path through the logic is taken. The size 10000 result is too
large for a branch predictor to memorise. For example
[Lemire|https://lemire.me/blog/2019/10/16/benchmarking-is-hard-processors-learn-to-predict-branches/]
recommends that 'branchy' code be tested using at least 2000 samples. Thus
the large size represents realistic performance on random data.
The following table shows results for the high accuracy methods. The FMA
method was tested using Math.fma from Java 9 (Fma) or a simulated version
using Dekker's extended precision method (FmaSim). The two produce the same
result at different speeds. The standard sum is not shown since in practice
it was the same speed as the hardware accelerated FMA method. The full Dekker
sqrt method is added for reference, as is Commons FastMath.hypot. The baseline
is the creation of the result array populated with the x component of the
input data; it thus encapsulates the array creation and the looping overhead
of the test.
The data are shown in order of performance on the large 10000 dataset, from
fastest (left) to slowest (right). The results are the median time (in ns)
from 5 iterations of 1 second.
||Size||Data||baseline||Fma||FmaSim||Dekker||Hypot2||FastMathHypot||DekkerSqrt||MathHypot||
|10|cis|21.8|40.5|67.1|150.6|83.2|161.0|173.4|178.2|
|10|edge|21.9|328.4|197.3|199.5|256.0|230.7|253.8|315.4|
|10|fixed-exponent|21.8|42.2|65.8|86.0|99.8|161.4|177.2|199.2|
|10|gaussian|21.8|40.4|66.3|146.4|82.7|170.2|176.3|177.9|
|10|log-uniform|21.9|42.8|66.3|88.3|79.0|162.2|177.0|169.5|
|10|uniform|21.8|41.9|66.5|87.5|83.3|161.4|176.4|193.7|
|10000|cis|13410.9|78835.6|94738.6|122327.0|141622.8|155651.6|185853.3|250018.2|
|10000|edge|13234.6|199845.1|213367.2|222034.2|249553.2|278202.6|232381.8|312181.7|
|10000|fixed-exponent|13190.2|80765.0|95878.9|122596.0|118547.3|156358.7|187509.1|231173.8|
|10000|gaussian|13343.8|76983.8|92164.7|123492.8|145467.8|156384.1|188861.2|249970.8|
|10000|log-uniform|13300.0|76143.8|95284.5|122141.8|114636.5|161191.4|188620.2|207361.7|
|10000|uniform|13268.3|76018.1|95402.2|121997.0|153419.8|156355.5|187633.7|258339.9|
Observations
* The fastest method is the hardware supported FMA.
* The FMA methods and the Dekker method each show consistent performance
across the large datasets, excluding the edge case data. These methods have
only 1 branch for standard data: the sort of x and y by magnitude.
* The Dekker method is slow on the cis and gaussian datasets at size 10. The
reason is unknown; this could be repeated on other runs and machines to
verify it is not an artifact.
* The hypot methods have a branch taken when the x and y magnitudes are
within a 2-fold factor. The small dataset shows signs of branch prediction
memorising the correct branch path. The fixed-exponent dataset, which always
requires the branch with more computations, is slower than the other
datasets. On the size 10 data the hypot method is usually faster than Dekker
and slower than the simulated FMA method. This is expected as the mean number
of floating point operations (flops) on the two branches is more than FMA and
less than the Dekker method. So if branch prediction is perfect, the size 10
dataset shows the method at full speed.
* On the size 10000 dataset the hypot method is slower than the Dekker method
on the cis, gaussian and uniform datasets. These datasets contain data where
the branch for high precision must be taken 41% or 50% of the time. On the
log-uniform dataset the high precision branch must be taken approximately 1
in 129 times (as the exponent is in the range [-64, 64]). On the
fixed-exponent dataset the high precision branch is taken 100% of the time.
This is the slower branch but it still has fewer flops than the Dekker
computation, which shows in the faster performance.
* FastMath.hypot is not as fast as most of the new custom implementations
based on fdlibm. It is faster than the full Dekker sqrt computation.
* Math.hypot is slowest, even though this is the new JDK 9 version which is
faster than JDK 8.
The slower speed of the FastMath.hypot and Math.hypot computations is due to
some combination of (1) edge case handling, (2) scaling, or (3) splitting
numbers.
For (1), these methods both do edge case handling at the start of the method
to check for infinities and NaN on both input numbers. In the normal case
this is unnecessary as the input will be finite. If the inputs are edge cases
then checking inf/NaN separately takes more time than the fdlibm method,
which uses the exponent; since inf and NaN share the same exponent the check
is done in one conditional. Also, the original fdlibm method does not check
edge cases on input. They are left to filter out of other conditions such as
extremely large (inf/NaN) or small (zero) exponents. Since these conditions
rarely occur, the main flow through the code for finite numbers has only one
major branch: the sort of the numbers by magnitude. All other branches are
edge cases. Even assuming branch prediction is perfect on the edge case data,
the fdlibm method still has fewer conditionals to evaluate for the edge
cases.
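A sketch of the single-conditional edge case test described above
(illustrative):
{code:java}
// One exponent test covers both infinity and NaN since they share the
// maximum biased exponent (0x7ff).
static boolean isInfiniteOrNaN(double v) {
    return (Double.doubleToRawLongBits(v) & 0x7fffffffffffffffL)
            >= 0x7ff0000000000000L;
}
// versus two separate library checks:
// Double.isInfinite(v) || Double.isNaN(v)
{code}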
For (2), FastMath.hypot and Math.hypot do scaling using Math.scalb or
similar, creating all or some of the scaling factors with
Double.longBitsToDouble. The fdlibm method uses fixed scaling factors and
avoids dynamically creating the scaling factor.
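For illustration, the contrast between the two styles (the constant and
helper below are hypothetical examples, not the actual factors used by either
implementation):
{code:java}
// A fixed power-of-two constant (fdlibm style) needs no runtime construction.
static final double SCALE_DOWN = 0x1.0p-600;

// Dynamically building 2^exp from raw bits for comparison.
static double powerOfTwo(int exp) {
    return Double.longBitsToDouble((long) (exp + 1023) << 52);
}
{code}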
For (3), Math.hypot also creates split numbers using round-trips of
Double.doubleToRawLongBits and Double.longBitsToDouble. The implementations I
have created use Dekker's method to create split representations of the
doubles. This uses 1 multiply and 2 additions per split. If I replace it with
a round-trip of raw bits then the method is slower in JMH benchmarks, but
still faster than Math.hypot (data not shown). So Dekker's split is faster.
However it does not produce the same split representation, and thus hypot2
has some 1 ulp differences from Math.hypot. Using a split with raw bits
allows hypot2 to be a clone that exactly matches the Math.hypot results on
normal numbers, but with faster performance.
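A sketch of the two split strategies (illustrative; the raw-bits version
follows the fdlibm style of zeroing the low 32 bits):
{code:java}
// Dekker's split: the high part via 1 multiply and 2 additions; the low
// part costs 1 further subtraction (lo = a - hi).
static double dekkerHigh(double a) {
    final double c = (0x1.0p27 + 1) * a;
    return c - (c - a);  // ~26 significant bits
}

// Raw-bits split: zero the low 32 bits, leaving a high part from the
// exponent and upper mantissa. This needs a round-trip through the raw bits.
static double rawBitsHigh(double a) {
    return Double.longBitsToDouble(
            Double.doubleToRawLongBits(a) & 0xffffffff00000000L);
}
{code}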
h2. Conclusions
Math.hypot is accurate but it is slow. It can be replaced with various methods:
||Method||Pros||Cons||
|Math.hypot|The standard. <1 ulp max error.|Low accuracy on sub-normals.|
|hypot clone|Matches Math.hypot exactly on normal numbers using a split with raw bits from the double. Would not correct the split for sub-normal numbers.|Better alternatives for speed or accuracy exist. Still low accuracy for sub-normals.|
|hypot2|<1 ulp max error. Fixes the maximum error on sub-normals.|Not exactly the same result as Math.hypot.|
|FMA|Extremely fast.|<1.03 ulp error is larger than Math.hypot. Requires Java 9 and hardware support.|
|FMA simulated|Fast.|<1.03 ulp error is larger than Math.hypot.|
|Dekker|Faster than Math.hypot. More accurate than Math.hypot.|Worse on average than Math.hypot on cis numbers (but has a lower maximum error).|
|Dekker Sqrt|Faster than Math.hypot. Potentially perfect accuracy on normal numbers (requires more testing).|Accuracy compromises speed. Note Dekker's method would be correct 87% of the time anyway.|
If the input data is known to be within a 2-fold factor of magnitude then the
alternative hypot2 implementation is a good choice. It has <1 ulp max error
and is faster than Dekker's method.
If the input data is known to be outside a 2-fold factor of magnitude then the
simulated FMA method is a good choice. It has <1 ulp max error and is very
fast. For this data it is more accurate than Math.hypot.
If the input data is sub-normal then Math.hypot should not be used. The
variant that does the splitting using Dekker's method is a substitute as that
still works on sub-normals, although results would differ occasionally.
If the input data is random then the magnitudes will be within a 2-fold
factor 41% (polar coordinates) or 50% (uniform data) of the time. Here hypot
has weaker performance as the branch to distinguish the two cases is
unpredictable. Either the FMA or Dekker methods are preferred.
Given that the target of Math.hypot is to achieve <1 ulp, and the processing
to scale the numbers takes a fair part of the run-time, it makes sense to
replace the method with an alternative that also achieves <1 ulp. This would
rule out a replacement that does all the scaling just to return
sqrt(x*x + y*y), as done by FastMath.hypot. The only choice for consistent
<1 ulp results is the Dekker method for extended precision. I would recommend
that this is used in Complex for its computation of the absolute of the
complex number.
h3. Postscript
This is only part of the data I have collected for this analysis. If anything
is not clear then more details can be posted. If my error analysis is
incorrect then at least it is consistently applied, so the conclusions on
relative accuracy remain valid. The max ulp errors for the FMA method agree
with data posted on this topic on [StackOverflow: Why hypot() function is so
slow?|https://stackoverflow.com/questions/3764978/why-hypot-function-is-so-slow].
The errors for hypot are lower than the 1.15 ulp stated in the post by
njuffa. That post may be testing a hypot implementation that does not sort
using the entire magnitude of the two inputs, may cover random data cases I
did not, may include sub-normals (where my max error is <1.20 ulp), or may
show that using Dekker's sqrt as a reference is not accurate enough to find
errors versus always using an arbitrary precision library.
> Investigate Math.hypot for computing the absolute of a complex number
> ---------------------------------------------------------------------
>
> Key: NUMBERS-143
> URL: https://issues.apache.org/jira/browse/NUMBERS-143
> Project: Commons Numbers
> Issue Type: Task
> Components: complex
> Reporter: Alex Herbert
> Priority: Minor
>
> {{Math.hypot}} computes the value {{sqrt(x^2+y^2)}} to within 1 ULP. The
> function uses the [e_hypot.c|https://www.netlib.org/fdlibm/e_hypot.c]
> implementation from the Freely Distributable Math Library (fdlibm).
> Pre-Java 9 this function used JNI to call an external implementation. The
> performance was slow. Java 9 ported the function to native Java (see
> [JDK-7130085 : Port fdlibm hypot to
> Java|https://bugs.java.com/bugdatabase/view_bug.do?bug_id=7130085]).
> This function is used to define the absolute value of a complex number. It is
> also used in sqrt() and log(). This ticket is to investigate the performance
> and accuracy of {{Math.hypot}} against alternatives for use in Complex.
>