Re: [PATCH V3] Practical Improvement to libgcc Complex Divide

Patrick McGehearty via Gcc-patches Tue, 21 Jul 2020 10:19:26 -0700

Ping



On 7/1/2020 11:30 AM, Patrick McGehearty via Gcc-patches wrote:

(Version 3)

(Added in version 3)
Support for half, float, extended, and long double precision has
been added to the prior work for double precision. Since half precision
is computed with float precision as per current libgcc practice,
the enhanced underflow/overflow tests provide no benefit for half
precision and would cost performance. Therefore half precision is
left unchanged.

The existing constants for each precision:
float: FLT_MAX, FLT_MIN;
double: DBL_MAX, DBL_MIN;
extended and/or long double: LDBL_MAX, LDBL_MIN
are used for avoiding the more common overflow/underflow cases.

Additional investigation showed that testing for when both parts of
the denominator had exponents roughly small enough to allow shifting
any subnormal values to normal values, all input values could be
scaled up without risking unnecessary overflow and gaining a clear
improvement in accuracy. The test and scaling values used all fit
within the allowed exponent range for each precision required by the C
standard. The remaining number of troubling results in version 3 is
measurably smaller than in versions 1 and 2.

The timing and precision tables below have been revised appropriately
to match the algorithms used in this version for double precision
and additional tables added to include results for other precisions.

In prior versions, I omitted mention of the bug report that started me
on this project: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59714
complex division is surprising on targets with FMA (was: on aarch64)
With the proposed method, whether using FMA or not, dividing
1.0+3.0i by 1.0+3.0i correctly returns 1.0+0.0i.

I also have added a reference to Beebe's "The Mathematical Function
Computation Handbook" [4] which was my starting point for research
into better complex divide methods.

(Added for Version 2)
In my initial research, I missed Elen Kalda's proposed patch
https://gcc.gnu.org/legacy-ml/gcc-patches/2019-08/msg01629.html [3]
Thanks to Joseph Myers for providing me with the pointer.
This version includes performance and accuracy comparisions
between Elen's proposed patch and my latest patch version for
double precision.

(from earlier Versions)

The following patch to libgcc/libgcc2.c __divdc3 provides an
opportunity to gain important improvements to the quality of answers
for the default complex divide routine (half, float, double, extended,
long double precisions) when dealing with very large or very small exponents.

The current code correctly implements Smith's method (1962) [1]
further modified by c99's requirements for dealing with NaN (not a
number) results. When working with input values where the exponents
are greater than *_MAX_EXP/2 or less than -(*_MAX_EXP)/2, results are
substantially different from the answers provided by quad precision
more than 1% of the time. This error rate may be unacceptable for many
applications that cannot a priori restrict their computations to the
safe range. The proposed method reduces the frequency of
"substantially different" answers by more than 99% for double
precision at a modest cost of performance.

Differences between current gcc methods and the new method will be
described. Then accuracy and performance differences will be discussed.

NOTATION

For all of the following, the notation is:
Input complex values:
   a+bi  (a= real part, b= imaginary part)
   c+di
Output complex value:
   e+fi = (a+bi)/(c+di)

For the result tables:
current = current method (SMITH)
b1div = method proposed by Elen Kalda
b2div = alternate method considered by Elen Kalda
new1 = new method using 1 divide and 2 multiplies
new = new method proposed by this patch

DESCRIPTIONS of different complex divide methods:

NAIVE COMPUTATION (-fcx-limited-range):
   e = (a*c + b*d)/(c*c + d*d)
   f = (b*c - a*d)/(c*c + d*d)

Note that c*c and d*d will overflow or underflow if either
c or d is outside the range 2^-538 to 2^512.

This method is available in gcc when the switch -fcx-limited-range is
used. That switch is also enabled by -ffast-math. Only one who has a
clear understanding of the maximum range of intermediate values
generated by a computation should consider using this switch.

SMITH's METHOD (current libgcc):
   if(fabs(c)<fabs(d) {
     r = c/d;
     denom = (c*r) + d;
     e = (a*r + b) / denom;
     f = (b*r - a) / denom;
   } else {
     r = d/c;
     denom = c + (d*r);
     e = (a + b*r) / denom;
     f = (b - a*r) / denom;
   }

Smith's method is the current default method available with __divdc3.

Elen Kalda's METHOD

This method applies the most significant part of the algorithm
proposed by Baudin&Smith (2012) in the paper "A Robust Complex
Division in Scilab" [2]. Elen's method also replaces two divides
by one divide and two multiplies due to the high cost of divide
on aarch64. In the comparison sections, this method will be
labeled b1div. A variation discussed in that patch which
does not replace the two divides will be labeled b2div.

   inline void improved_internal (MTYPE a, MTYPE b, MTYPE c, MTYPE d)
   {
     r = d/c;
     t = 1.0 / (c + (d * r));
     if (r != 0) {
         x = (a + (b * r)) * t;
         y = (b - (a * r)) * t;
     }  else {
     /* Changing the order of operations avoids the underflow of r impacting
      the result. */
         x = (a + (d * (b / c))) * t;
         y = (b - (d * (a / c))) * t;
     }
   }

   if (FABS (d) < FABS (c)) {
       improved_internal (a, b, c, d);
   } else {
       improved_internal (b, a, d, c);
       y = -y;
   }

NEW METHOD (proposed by patch) to replace the current default method:

The proposed method starts with an algorithm proposed by Baudin&Smith
(2012) in the paper "A Robust Complex Division in Scilab" [2]. The
patch makes additional modifications to that method for further
reductions in the error rate. The following code shows the #define
values for double precision. See the patch for #define values used
for other precisions.

   #define RBIG ((DBL_MAX)/2.0)
   #define RMIN (DBL_MIN)
   #define RMIN2 (0x1.0p-53)
   #define RMINSCAL (0x1.0p+51)

   /* prevent overflow when arguments are near max representable */
   if ((FABS (d) > RBIG) || (FABS (a) > RBIG) || (FABS (b) > RBIG) ) {
       a = a * 0.5;
       b = b * 0.5;
       c = c * 0.5;
       d = d * 0.5;
   }
   /* minimize overflow/underflow issues when c and d are small */
   else if (FABS (d) < RMIN2) {
       a = a * RMINSCAL;
       b = b * RMINSCAL;
       c = c * RMINSCAL;
       d = d * RMINSCAL;
   }
   r = c/d; denom = (c*r) + d;
   if( r > RMIN ) {
       e = (a*r + b) / denom   ;
       f = (b*r - a) / denom
   } else {
       e = (c * (a/d) + b) / denom;
       f = (c * (b/d) - a) / denom;
   }
[ only presenting the fabs(c) < fabs(d) case here, full code in patch. ]

Before any computation of the answer, the code checks for any input
values near maximum to allow down scaling to avoid overflow.  These
scalings almost never harm the accuracy since they are by 2. Values that
are over RBIG are relatively rare but it is easy to test for them and
allow aviodance of overflows.

Testing for RMIN2 reveals when both c and d are less than 2^-53 (for
double precision, see patch for other values).  By scaling all values
by 2^51, the code avoids many underflows in intermediate computations
that otherwise might occur. If scaling a and b by 2^51 causes either
to overflow, then the computation will overflow whatever method is
used.

Next, r (the ratio of c to d) is checked for being near zero. Baudin
and Smith checked r for zero. Checking for values less than DBL_MIN
(subnormal) covers roughly 10 times as many cases and improves overall
accuracy. If r is too small, then when it is used in a multiplication,
there is a high chance that the result will underflow to zero, losing
significant accuracy. That underflow is avoided by reordering the
computation.  When r is subnormal, the code replaces a*r (= a*(c/d))
with ((a/d)*c) which is mathematically the same but avoids the
unnecessary underflow.

TEST Data

Two sets of data are presented to test these methods. Both sets
contain 10 million pairs of complex values.  The exponents and
mantissas are generated using multiple calls to random() and then
combining the results. Only values which give results to complex
divide that are representable in the appropriate precision after
being computed in quad precision are used.

The first data set is labeled "moderate exponents".
are greater than *_MAX_EXP/2 or less than -(*_MAX_EXP)/2, results are
The exponent range is limited to -DBL_MAX_EXP/2 to DBL_MAX_EXP/2
or FLT_MAX_EXP or LDBL_MAX_EXP for the appropriate precisions.
The second data set is labeled "full exponents".
The exponent range for these cases is the full exponent range
for the precision.

ACCURACY Test results:

Note: The following accuracy tests are based on IEEE-754 arithmetic.

Note: All results are based on use of fused multiply-add. If
fused multiply-add is not used, the error rate increases slightly
for the 1 and 2 ulp cases for both current and new complex divide.
Differences at 8 ulp are barely measurable.

The complex divide methods are evaluated by determining what
percentage of values exceed different ulp (units in last place)
levels.  If a "2 ulp" test results show 1%, that would mean that 1% of
10,000,000 values (100,000) have either a real or imaginary part that
had a greater than or equal to a 2 bit difference from the quad
precision result.

Results are reported for differences greater than or equal to 1 ulp, 2
ulp, 8 ulp, 16 ulp, 24 ulp, and 52 ulp for double precision.  Even
when the patch avoids overflows and underflows, some input values are
expected to have errors due to the potential for catastrophic roundoff
from floating point subtraction. For example, when b*c and a*d are
nearly equal, the result of subtraction may lose several places of
accuracy. This patch does not attempt to detect or minimize this type
of error, but neither does it increase them.

I only show the results for Elen Kalda's method (with both 1 and
2 divides) and the new method for only 1 divide in the double
precision table.

In the following charts, lower values are better.

current - current complex divide in libgcc
b1div - Elen Kalda's method from Baudin & Smith with one divide
b2div - Elen Kalda's method from Baudin & Smith with two divides
new1  - This patch except with 1 divide and 2 multiplies
new   - This patch which uses 2 divides

===============================================================
Errors   Moderate Dataset
gtr eq     current    b1div      b2div        new1       new
======    ========   ========   ========   ========   ========
  1 ulp    0.24707%   0.92986%   0.24707%   0.92986%   0.24707%
  2 ulp    0.01762%   0.01770%   0.01762%   0.01770%   0.01762%
  8 ulp    0.00026%   0.00026%   0.00026%   0.00026%   0.00026%
16 ulp    0.00000%   0.00000%   0.00000%   0.00000%   0.00000%
24 ulp          0%         0%         0%         0%         0%
52 ulp          0%         0%         0%         0%         0%
===============================================================
Table 1: Errors with Moderate Dataset

Note in Table 1 that both the old and new methods give identical error
rates for data with moderate exponents. Errors exceeding 16 ulp are
exceedingly rare. There are substantial increases in the 1 ulp error
rates for the 1 divide/2 multiply methods as compared to the 2 divides
methods. These differences are minimal for 2 ulp and larger error
measurements. I chose to use the more accurate method of two divides
to avoid loss of accuracy relative to current behavior where current
behavior is not at risk of under/overflow.

===============================================================
Errors   Full Dataset
gtr eq     current    b1div      b2div        new1       new
======    ========   ========   ========   ========   ========
  1 ulp      2.05%   1.23842%    0.67130%   0.71774%   0.17100%
  2 ulp      1.88%   0.51615%    0.50354%   0.02052%   0.01319%
  8 ulp      1.79%   0.43227%    0.42531%   0.00358%   0.00350%
16 ulp      1.69%   0.34503%    0.33542%   0.00212%   0.00212%
24 ulp      1.59%   0.26367%    0.25189%   0.00138%   0.00138%
52 ulp      1.28%   0.01914%    0.00378%   0.00001%   0.00001%
===============================================================
Table 2: Errors with Full Dataset (double)

Table 2 shows significant differences in error rates, especially
between the current method and the new method. With the current code,
1.13% of test values have results where no bits of the mantissa match
the correct answer. That level of error is extremely rare with the new
code. Consider the results for errors greater than or equal to 24 ulp.
The current code sees those errors at a rate of 1.6%. Most would agree
that such answers are "substantially different" from the answers
calculated using extended precision. The b1div and b2div still shows
errors about about one-sixth of that rate. The new code reduces errors
at that level by more than a factor of 1000. These improvements are
important for serious computation of complex numbers.

Correctness for float
=============================
Errors   Moderate Dataset
gtr eq     current     new
======    ========   ========
  1 ulp   28.68070%  28.68070%
  2 ulp    0.64386%   0.64386%
  8 ulp    0.00403%   0.00403%
16 ulp    0.00001%   0.00001%
24 ulp          0%         0%
=============================
Table 3: Errors with Moderate Dataset (float)

We see again in Table 3 as in Table 1, there is no change in accuracy
between the current and new methods when the input data does not
contain very large or very small exponents with potential for
under/overflow.

=============================
Errors   Full Dataset
gtr eq     current     new
======    ========   ========
  1 ulp     19.97%  17.85193%
  2 ulp      3.20%   0.43153%
  8 ulp      2.16%   0.04139%
16 ulp      1.40%   0.00818%  (99.4% better)
24 ulp      0.98%   0.00016%
=============================
Table 4: Errors with Full Dataset (float)

In Table 4, we see as in Table 2, when extreme exponents are encountered,
the new method provides a substantial reduction in unacceptable answers.
The improvements are not quite as much as for double precision due
to the inherent limitations of 32-bit floats.

There is no change in accuracy for half-precision since the code is
unchanged. libgcc computes half-precision functions in float precision
allowing the existing methods to avoid overflow/underflow issues
for the allowed range of exponents for half-precision.

Extended precision (using x87 80-bit format on x86) and Long double
(using IEEE-754 128-bit on x86 and aarch64) both have 15-bit exponents
as compared to 11-bit exponents in double precision. We note that the
C standard also allows Long Double to be implemented in the equivalent
range of Double. The RMIN2 and RMINSCAL constants are selected to work
within the Double range as well as with extended and 128-bit ranges.
We will limit our performance and accurancy discussions to the 80-bit
and 128-bit formats as seen on x86 here.

The extended and long double precision investigations were more
limited.  Aarch64 does not supported extended precision but does
support the software implementation of 128-bit long double
precision. For x86, long double defaults to the 80-bit precision but
using the -mlong-double-128 flag switches to using the software
implementation of 128-bit precision. Both 80-bit and 128-bit
precisions have the same exponent range, with the 128-bit precision
has extended mantissas. Since this change is only aimed at avoiding
underflow/overflow for extreme exponents, I studied the extended
precision results on x86 for 100,000 values. The limited exponent
dataset showed no differences.  For the dataset with full exponent
range, the current and new values showed major differences (greater
than 32 ulp) in 567 cases out of 100,000 (0.56%). In every one of
these cases, the ratio of c/d or d/c (as appropriate) was zero or
subnormal, indicating the advantage of the new method and its
continued correctness where needed.

PERFORMANCE Test results

In order for a library change to be practical, it is necessary to show
the slowdown is tolerable. The slowdowns observed are much less than
would be seen by (for example) switching from hardware double precison
to a software quad precision, which on the tested machines causes a
slowdown of around 100x).

The actual slowdown depends on the machine architecture. It also
depends on the nature of the input data. If underflow/overflow is
rare, then implementations that have strong branch prediction will
only slowdown by a few cycles. If underflow/overflow is common, then
the branch predictors will be less successful and the cost will be
higher.

Results from two machines are presented as examples of the overhead
for the new method. The one labeled x86 is a 5 year old Intel x86
processor and the one labeled aarch64 is a 3 year old arm64 processor.

In the following chart, the times are averaged over a one million
value data set. All values are scaled to set the time of current
method to be 1.0. Lower values are better. A value of less than 1.0 would
be faster than the current method and a value greater than 1.0 would
be slower than the current method.

================================================
                Moderate set          full set
                x86  aarch64        x86  aarch64
========     ===============     ===============
float         1.15    1.15        1.96    1.96
double        1.07    1.40        1.31    1.78
long double   1.08    1.23        1.08    1.24
================================================
Table 5: Performance Comparisons (ratio new/current)

The above tables omit the timing for the 1 divide and 2 multiply
comparison with the 2 divide approach. I consider that optimization
vs accuracy issue separate from the rest of this proposal.
Roughly speaking the 1 divide and 2 multiply approach provides
a performance gain of 10 to 15% at a cost of 1 or 2 ulp about 0.5%
of the time for double precison. The loss of accurancy is somewhat
greater for float than for double precision.

For the proposed change, the moderate dataset shows less overhead for
the newer methods than the full dataset. That's because the moderate
dataset does not ever take the new branches which protect from
under/overflow. The better the branch predictor, the lower the cost
for these untaken branches. Both platforms are somewhat dated, with
the x86 having a better branch predictor which reduces the cost of the
additional branches in the new code. Of course, the relative slowdown
may be greater for some architectures, especially those with limited
branch prediction combined with a high cost of misprediction.

This cost is claimed to be tolerable on the grounds that:
(a) most applications will only spend a small fraction of their time
     calculating complex divide.
(b) it is much less than the cost of extended precision
(c) users are not forced to use it (as described below)
(d) the cost is worthwhile considering the accuracy improvement shown
     in Table 2 and 4.

Those users who find this degree of slowdown unsatisfactory may use
the switch -fcx-fortran-rules which does not use the library routine,
instead inlining Smith's method without the C99 requirement for
dealing with NaN results. The proposed patch for libgcc complex divide
does not affect the code generated by -fcx-fortran-rules.

As a side note, double precision is not slower than float for the
aarch64 and x64 platforms tested. A future change might investigate
the performance of promoting the float values of complex divide to
double precision and using the current Smith's method. Using double
precision for float values will avoid the round off effects and using
the larger mantissa will give better answers in the case of
catastrophic subtraction. Better accuracy with no loss of performance
seems a win all around, but would require performance testing on a
wide range of platforms to confirm the "no loss of performance"
observed on the two implementions tested so far.

SUMMARY

When input data to complex divide has exponents whose absolute value
is half than half of *_MAX_EXP, this patch makes no changes in
accuracy and has only a modest effect on performance.  When input data
contains values outside those ranges, the patch eliminates more than
99.5% of major errors with a tolerable cost in performance.

In comparison to Elen Kalda's method, this patch introduces more
performance overhead but reduces overflow/underflow more than 100
times as often.

There is an intermediate design choice which would catch 13x fewer
errors while introducing roughly 20% lower overhead. That could be
analyzed in depth if the overhead of the current proposal is
unacceptable.

REFERENCES

[1] Robert L. Smith. Algorithm 116: Complex division.  Commun. ACM,
  5(8):435, 1962.

[2] Michael Baudin and Robert L. Smith. "A robust complex division in
Scilab," October 2012, available at http://arxiv.org/abs/1210.4539.

[3] Elen Kalda: Complex division improvements in libgcc
https://gcc.gnu.org/legacy-ml/gcc-patches/2019-08/msg01629.html

[4] Nelson H.F. Beebe, "The Mathematical-Function Computation Handbook.
Springer International Publishing AG, 2017.
---
  libgcc/ChangeLog |   6 ++++
  libgcc/libgcc2.c | 105 ++++++++++++++++++++++++++++++++++++++++++++++++++++---
  2 files changed, 107 insertions(+), 4 deletions(-)

diff --git a/libgcc/ChangeLog b/libgcc/ChangeLog
index d6367e2..d1db004 100644
--- a/libgcc/ChangeLog
+++ b/libgcc/ChangeLog
@@ -1,3 +1,9 @@
+2020-06-24  Patrick McGehearty <patrick.mcgehea...@oracle.com>
+
+       * libgcc2.c (__divdc3): Enhance accuracy of complex divide by
+       avoiding underflow/overflow when input values have very large
+       or very small exponents.
+
  2020-06-25  Martin Liska  <mli...@suse.cz>

* libgcov-driver.c (merge_summary): Remove function as its name

diff --git a/libgcc/libgcc2.c b/libgcc/libgcc2.c
index e0a9fd7..feffb13 100644
--- a/libgcc/libgcc2.c
+++ b/libgcc/libgcc2.c
@@ -2036,13 +2036,35 @@ CONCAT3(__mul,MODE,3) (MTYPE a, MTYPE b, MTYPE c, MTYPE 
d)
  CTYPE
  CONCAT3(__div,MODE,3) (MTYPE a, MTYPE b, MTYPE c, MTYPE d)
  {
+#if defined(L_divsc3)
+#define RBIG   ((FLT_MAX)/2.0)
+#define RMIN   (FLT_MIN)
+#define RMIN2  (0x1.0p-21)
+#define RMINSCAL (0x1.0p+19)
+#endif
+
+#if defined(L_divdc3)
+#define RBIG   ((DBL_MAX)/2.0)
+#define RMIN   (DBL_MIN)
+#define RMIN2  (0x1.0p-53)
+#define RMINSCAL (0x1.0p+51)
+#endif
+
+#if (defined(L_divxc3) || defined(L_divtc3))
+#define RBIG   ((LDBL_MAX)/2.0)
+#define RMIN   (LDBL_MIN)
+#define RMIN2  (0x1.0p-53)
+#define RMINSCAL (0x1.0p+51)
+#endif
+
    MTYPE denom, ratio, x, y;
    CTYPE res;

- /* ??? We can get better behavior from logarithmic scaling instead of

-     the division.  But that would mean starting to link libgcc against
-     libm.  We could implement something akin to ldexp/frexp as gcc builtins
-     fairly easily...  */
+#if defined(L_divhc3)
+  /* half precision is handled with float precision.
+     no extra measures are needed to avoid overflow/underflow */
+
+  /* Scale by max(c,d) to reduce chances of denominator overflowing */
    if (FABS (c) < FABS (d))
      {
        ratio = c / d;
@@ -2057,6 +2079,81 @@ CONCAT3(__div,MODE,3) (MTYPE a, MTYPE b, MTYPE c, MTYPE 
d)
        x = ((b * ratio) + a) / denom;
        y = (b - (a * ratio)) / denom;
      }
+#else
+  /* float, double, extended, long double have significant potential
+    underflow/overflow errors than can be greatly reduced with
+    a limited number of adjustments  */
+
+  /* Scale by max(c,d) to reduce chances of denominator overflowing */
+  if (FABS (c) < FABS (d))
+    {
+      /* prevent underflow when denominator is near max representable */
+      if (FABS (d) >= RBIG)
+       {
+         a = a * 0.5;
+         b = b * 0.5;
+         c = c * 0.5;
+         d = d * 0.5;
+       }
+      /* avoid overflow/underflow issues when c and d are small */
+      /* scaling up helps avoid some underflows */
+      /* No new overflow possible since c&d < RMIN2 */
+      if (FABS (d) < RMIN2)
+       {
+         a = a * RMINSCAL;
+         b = b * RMINSCAL;
+         c = c * RMINSCAL;
+         d = d * RMINSCAL;
+       }
+      ratio = c / d;
+      denom = (c * ratio) + d;
+      /* choose alternate order of computation if ratio is subnormal */
+      if (FABS (ratio) > RMIN)
+       {
+         x = ((a * ratio) + b) / denom;
+         y = ((b * ratio) - a) / denom;
+       }
+      else
+       {
+         x = ((c * (a / d)) + b) / denom;
+         y = ((c * (b / d)) - a) / denom;
+       }
+    }
+  else
+    {
+      /* prevent underflow when denominator is near max representable */
+      if (FABS(c) >= RBIG)
+       {
+         a = a * 0.5;
+         b = b * 0.5;
+         c = c * 0.5;
+         d = d * 0.5;
+       }
+      /* avoid overflow/underflow issues when both c and d are small */
+      /* scaling up helps avoid some underflows */
+      /* No new overflow possible since both c&d are less than RMIN2 */
+      if (FABS(c) < RMIN2)
+       {
+         a = a * RMINSCAL;
+         b = b * RMINSCAL;
+         c = c * RMINSCAL;
+         d = d * RMINSCAL;
+       }
+      ratio = d / c;
+      denom = (d * ratio) + c;
+      /* choose alternate order of computation if ratio is subnormal */
+      if (FABS(ratio) > RMIN)
+       {
+         x = ((b * ratio) + a) / denom;
+         y = (b - (a * ratio)) / denom;
+       }
+      else
+       {
+         x = (a + (d * (b / c))) / denom;
+         y = (b - (d * (a / c))) / denom;
+       }
+    }
+#endif

/* Recover infinities and zeros that computed as NaN+iNaN; the only cases

       are nonzero/zero, infinite/finite, and finite/infinite.  */

Re: [PATCH V3] Practical Improvement to libgcc Complex Divide

Reply via email to