The following patch to libgcc/libgcc2.c __divdc3 substantially improves the accuracy of the default double precision complex divide routine when dealing with very large or very small exponents.
The current code correctly implements Smith's method (1962) [1], further modified by C99's requirements for dealing with NaN (not a number) results. When working with input values where the exponents are greater than 512 (i.e. 2.0^512) or less than -512 (i.e. 2.0^-512), results are substantially different from the answers provided by quad precision more than 1% of the time. Since the allowed exponent range for double precision numbers is -1076 to +1023, that error rate may be unacceptable for many applications. The proposed method reduces the frequency of "substantially different" answers by more than 99% at a modest cost in performance. Differences between the current gcc methods and the new method are described first; accuracy and performance differences are discussed after that.

(Added for Version 2) In my initial research, I missed Elen Kalda's proposed patch
https://gcc.gnu.org/legacy-ml/gcc-patches/2019-08/msg01629.html [3].
Thanks to Joseph Myers for providing me with the pointer. This version includes performance and accuracy comparisons between Elen's proposed patch and my latest patch version, which has been modified to eliminate two branches.

NOTATION

For all of the following, the notation is:

Input complex values:
  a+bi  (a = 64-bit real part, b = 64-bit imaginary part)
  c+di

Output complex value:
  e+fi = (a+bi)/(c+di)

For the result tables:
  current = current method (SMITH)
  b1div   = method proposed by Elen Kalda
  b2div   = alternate method considered by Elen Kalda
  new1    = new method using 1 divide and 2 multiplies
  new     = new method proposed by this patch

DESCRIPTIONS of different complex divide methods:

NAIVE COMPUTATION (-fcx-limited-range):

  e = (a*c + b*d)/(c*c + d*d)
  f = (b*c - a*d)/(c*c + d*d)

Note that c*c and d*d will overflow or underflow if either c or d is outside the range 2^-538 to 2^512. This method is available in gcc when the switch -fcx-limited-range is used; that switch is also enabled by -ffast-math. Only users who have a clear understanding of the maximum range of intermediate values generated by a computation should consider using this switch.

SMITH's METHOD (current libgcc):

  if (fabs (c) < fabs (d))
    {
      r = c/d;
      denom = (c*r) + d;
      e = (a*r + b) / denom;
      f = (b*r - a) / denom;
    }
  else
    {
      r = d/c;
      denom = c + (d*r);
      e = (a + b*r) / denom;
      f = (b - a*r) / denom;
    }

Smith's method is the current default method used by __divdc3.
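To make the difference between these two methods concrete, here is a minimal standalone sketch (not part of the patch; the inputs near 2^600 are chosen purely for illustration) comparing the naive formula with Smith's method on a quotient whose exact value is 1+0i:

  /* Sketch: naive vs. Smith complex divide at large exponents.
     Inputs are illustrative; the exact quotient is 1 + 0i.  */
  #include <stdio.h>
  #include <math.h>

  static void
  naive (double a, double b, double c, double d, double *e, double *f)
  {
    double denom = c*c + d*d;     /* overflows to inf for |c|,|d| > 2^512 */
    *e = (a*c + b*d) / denom;     /* inf/inf -> NaN for these inputs */
    *f = (b*c - a*d) / denom;     /* (inf-inf)/inf -> NaN for these inputs */
  }

  static void
  smith (double a, double b, double c, double d, double *e, double *f)
  {
    if (fabs (c) < fabs (d))
      {
        double r = c/d, denom = (c*r) + d;
        *e = (a*r + b) / denom;
        *f = (b*r - a) / denom;
      }
    else
      {
        double r = d/c, denom = c + (d*r);
        *e = (a + b*r) / denom;
        *f = (b - a*r) / denom;
      }
  }

  int
  main (void)
  {
    double a = 0x1.0p+600, b = 0x1.0p+600;
    double c = 0x1.0p+600, d = 0x1.0p+600;
    double e, f;
    naive (a, b, c, d, &e, &f);
    printf ("naive: %g %+gi\n", e, f);   /* prints NaNs */
    smith (a, b, c, d, &e, &f);
    printf ("smith: %g %+gi\n", e, f);   /* prints 1 +0i */
    return 0;
  }

Smith's method survives this case because it never forms c*c + d*d directly; the Full Dataset results below show where it, in turn, breaks down.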
Elen Kalda's METHOD

This method applies the most significant part of the algorithm proposed by Baudin & Smith (2012) in the paper "A Robust Complex Division in Scilab" [2]. It also replaces two divides by one divide and two multiplies due to the high cost of divide on aarch64. In the comparison sections, this method will be labeled b1div. A variation discussed in that patch which does not replace the two divides will be labeled b2div.

  inline void improved_internal (MTYPE a, MTYPE b, MTYPE c, MTYPE d)
  {
    r = d/c;
    t = 1.0 / (c + (d * r));
    if (r != 0)
      {
        x = (a + (b * r)) * t;
        y = (b - (a * r)) * t;
      }
    else
      {
        /* Changing the order of operations avoids the underflow of r
           impacting the result.  */
        x = (a + (d * (b / c))) * t;
        y = (b - (d * (a / c))) * t;
      }
  }

  if (FABS (d) < FABS (c))
    {
      improved_internal (a, b, c, d);
    }
  else
    {
      improved_internal (b, a, d, c);
      y = -y;
    }

NEW METHOD (proposed by patch) to replace the current default method:

The proposed method starts with an algorithm proposed by Baudin & Smith (2012) in the paper "A Robust Complex Division in Scilab" [2]. The patch makes additional modifications to that method for further reductions in the error rate.

  #define RBIG ((DBL_MAX)/2.0)
  #define RMIN (DBL_MIN)
  #define RMIN2 (0x1.0p-512)
  #define RMINSCAL (0x1.0p+510)

  /* prevent overflow when arguments are near max representable */
  if ((FABS (d) > RBIG) || (FABS (a) > RBIG) || (FABS (b) > RBIG))
    {
      a = a * 0.5;
      b = b * 0.5;
      c = c * 0.5;
      d = d * 0.5;
    }
  /* minimize overflow/underflow issues when c and d are small */
  else if (FABS (d) < RMIN2)
    {
      a = a * RMINSCAL;
      b = b * RMINSCAL;
      c = c * RMINSCAL;
      d = d * RMINSCAL;
    }
  r = c/d;
  denom = (c*r) + d;
  if (r > RMIN)
    {
      e = (a*r + b) / denom;
      f = (b*r - a) / denom;
    }
  else
    {
      e = (c * (a/d) + b) / denom;
      f = (c * (b/d) - a) / denom;
    }

[ Only the fabs(c) < fabs(d) case is presented here; the full code is in the patch. ]

Before any computation of the answer, the code checks for near maximum or near minimum inputs and scales the inputs to move all values away from the extremes. If the complex divide can be computed at all without generating infinities, these scalings will not affect the accuracy since they are by a power of 2. Values that are over RBIG are relatively rare, but testing for them is cheap and necessary to avoid unnecessary overflows. Testing for RMIN2 reveals when both c and d are less than 2^-512. By scaling all values by 2^510, the code avoids many underflows in intermediate computations that otherwise might occur. If scaling a and b by 2^510 causes either to overflow, then the computation will overflow whatever method is used.

Next, r (the ratio of c to d) is checked for being near zero. Baudin and Smith checked r for zero; checking for values less than DBL_MIN covers more cases and improves overall accuracy. If r is near zero, then when it is used in a multiplication, there is a high chance that the result will underflow to zero, losing significant accuracy. That underflow can be avoided if the computation is done in a different order. When r is subnormal, the code replaces a*r (= a*(c/d)) with ((a/d)*c), which is mathematically the same but avoids the unnecessary underflow.
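The effect of this reordering can be seen in the following standalone sketch (the values are illustrative, chosen so that r underflows, and it mirrors only the fabs(c) < fabs(d) path shown above). The exact real part of the quotient, (a*c + b*d)/(c*c + d*d), is 2^-300 here:

  /* Sketch: recovering accuracy when r = c/d underflows to zero.  */
  #include <stdio.h>

  int
  main (void)
  {
    double a = 0x1.0p+1000, b = 0.0;
    double c = 0x1.0p-900, d = 0x1.0p+200;
    double r = c / d;                            /* 2^-1100 underflows to 0.0 */
    double denom = (c * r) + d;                  /* == d */
    double e_old = ((a * r) + b) / denom;        /* a*r == 0, so 0.0 */
    double e_new = ((c * (a / d)) + b) / denom;  /* == 0x1p-300 */
    printf ("old: %a  new: %a\n", e_old, e_new);
    return 0;
  }

With the original ordering, the a*r term vanishes and every bit of the real part is lost; the reordered computation recovers the correctly rounded result.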
TEST Data

Two sets of data are presented to test these methods. Both sets contain 10 million pairs of 64-bit complex values. The exponents and mantissas are generated using multiple calls to random() and then combining the results. Only values whose complex divide results are representable in 64 bits after being computed in quad precision are used.

The first data set is labeled "moderate exponents"; its exponent range is limited to -512 to +511. The second data set is labeled "full exponents"; its exponent range is -1076 to +1024.

ACCURACY Test results:

Note: All results are based on use of fused multiply-add. If fused multiply-add is not used, the error rate increases slightly for the 2 ulp and 8 ulp cases.

The complex divide methods are evaluated by determining what percentage of values exceed different ulp (units in last place) levels. If a "2 ulp" test result shows 1%, that means 1% of the 10,000,000 values (100,000) have either a real or imaginary part that differs from the quad precision result by more than 2 ulp. Results are reported for differences greater than or equal to 1 ulp, 2 ulp, 8 ulp, 16 ulp, 24 ulp, and 52 ulp.

Even when the patch avoids overflows and underflows, some input values are expected to have errors due to the normal limitations of floating point subtraction. For example, when b*c and a*d are nearly equal, the subtraction has potential loss of accuracy. This patch does not attempt to detect or minimize this type of error, but neither does it increase it.

In the following charts, lower values are better.

===============================================================
            Errors, Moderate Dataset
gtr eq    current    b1div      b2div      new1       new
======   ========   ========   ========   ========   ========
 1 ulp   0.24707%   0.92986%   0.24707%   0.92986%   0.24707%
 2 ulp   0.01762%   0.01770%   0.01762%   0.01770%   0.01762%
 8 ulp   0.00026%   0.00026%   0.00026%   0.00026%   0.00026%
16 ulp   0.00000%   0.00000%   0.00000%   0.00000%   0.00000%
24 ulp         0%         0%         0%         0%         0%
52 ulp         0%         0%         0%         0%         0%
===============================================================
Table 1: Errors with Moderate Dataset

Note in Table 1 that both the old and new methods give identical error rates for data with moderate exponents. Errors exceeding 16 ulp are exceedingly rare. There are substantial increases in the 1 ulp error rates for the 1 divide/2 multiply methods (b1div, new1) as compared to the 2 divide methods, but these differences are minimal for 2 ulp and larger error measurements.

===============================================================
            Errors, Full Dataset
gtr eq    current    b1div      b2div      new1       new
======   ========   ========   ========   ========   ========
 1 ulp   1.87%      1.21724%   0.65349%   0.74080%   0.19755%
 2 ulp   1.70%      0.49497%   0.48573%   0.04254%   0.03857%
 8 ulp   1.61%      0.41617%   0.40929%   0.02394%   0.02394%
16 ulp   1.51%      0.33025%   0.32064%   0.01548%   0.01548%
24 ulp   1.41%      0.25092%   0.23914%   0.00945%   0.00945%
52 ulp   1.13%      0.01886%   0.00350%   0.00001%   0.00001%
===============================================================
Table 2: Errors with Full Dataset

Table 2 shows significant differences in error rates, especially between the current method and the new method. With the current code, 1.13% of test values produce results where no bits of the mantissa match the correct answer; that level of error is extremely rare with the new code. Consider the results for errors greater than or equal to 24 ulp: the current code sees those errors at a rate of 1.41%, and most would agree that such answers are "substantially different" from the answers calculated using extended precision. The b1div and b2div methods still show errors at about one-fifth of that rate. The new code reduces errors at that level by more than 99%. These improvements are important for serious computation with complex numbers.
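For reference, the per-value ulp distances behind Tables 1 and 2 can be measured with the standard trick of reinterpreting IEEE-754 doubles as sign-adjusted integers. The sketch below is an assumed harness, not the one actually used for these tables; it uses GCC's __float128 extension for the quad precision reference:

  #include <stdint.h>
  #include <string.h>
  #include <stdio.h>

  /* Map a finite double's bit pattern to an integer whose ordering
     matches the ordering of the doubles themselves.  */
  static int64_t
  ordered (double x)
  {
    int64_t i;
    memcpy (&i, &x, sizeof i);
    return (i < 0) ? INT64_MIN - i : i;
  }

  /* Distance in ulps between a computed double and a reference that
     has already been rounded from quad to double precision.  */
  static uint64_t
  ulp_diff (double computed, double reference)
  {
    int64_t a = ordered (computed), b = ordered (reference);
    return (a > b) ? (uint64_t) a - (uint64_t) b
                   : (uint64_t) b - (uint64_t) a;
  }

  int
  main (void)
  {
    __float128 ref_q = (__float128) 1.0 / 3.0;  /* quad precision reference */
    double computed = 1.0 / 3.0;
    printf ("%llu ulp\n",
            (unsigned long long) ulp_diff (computed, (double) ref_q));
    return 0;
  }

A full harness would apply ulp_diff to the real and imaginary parts of each quotient and count how many values meet or exceed each of the thresholds reported above.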
PERFORMANCE Test results

In order for a library change to be practical, it is necessary to show that the slowdown is tolerable. The slowdowns observed are much less than would be seen by (for example) switching to quad precision, which on many machines may cause a slowdown of 1000x.

The actual slowdown depends on the machine architecture. It also depends on the nature of the input data. If underflow/overflow is rare, then implementations with strong branch prediction will only slow down by a few cycles. If underflow/overflow is common, then the branch predictors will be less successful and the cost will be higher.

Results from two machines are presented as examples of the overhead for the new method. The one labeled x86 is a 5-year-old Intel x86 processor and the one labeled aarch64 is a 3-year-old arm64 processor.

In the following chart, the times are averaged over the full 10 million value data set. All values are scaled so that the current method is 1.0. Lower values are better: a value less than 1.0 is faster than the current method and a value greater than 1.0 is slower.

================================================
           Moderate set        full set
           x86    aarch64      x86    aarch64
=======    ==============      ==============
current    1.00   1.00         1.00   1.00
b1div      0.85   0.89         1.17   1.06
b2div      0.98   1.03         1.29   1.27
new1       0.87   1.12         1.15   1.28
new        1.01   1.35         1.25   1.58
================================================
Table 3: Performance Comparisons

There are two performance considerations in Table 3. First, there is a clear performance advantage of 10% to 15% in using the 1 divide and 2 multiplies methods instead of the 2 divide methods. Since most modern architectures have a much higher cost for floating point divide than for floating point multiply, this difference is not a surprise. However, choosing to replace two divides with one divide and two multiplies, trading a slight loss of accuracy for the performance gain, is a decision independent of the proposed changes that minimize major errors due to overflow/underflow. For that reason, the proposed patch does not include the change in the divide method.

Second, the moderate dataset shows less overhead for the newer methods than the full dataset does. That is because the moderate dataset never takes the new branches which protect against underflow/overflow; the better the branch predictor, the lower the cost of these untaken branches. Both platforms are somewhat dated, with the x86 having the better branch predictor, which reduces the cost of the additional branches in the new code. Of course, the relative slowdown may be greater for some architectures, especially those with limited branch prediction combined with a high cost of misprediction.

This cost is claimed to be tolerable on the grounds that:

  (a) most applications will only spend a small fraction of their
      time calculating complex divide;
  (b) it is much less than the cost of extended precision;
  (c) users are not forced to pay it (as described below);
  (d) the cost is worthwhile considering the accuracy improvement
      shown in Table 2.

Those users who find this degree of slowdown unsatisfactory may use the switch -fcx-fortran-rules, which does not use the library routine and instead inlines Smith's method without the C99 requirement for dealing with NaN results. The proposed patch for double precision complex divide does not affect the behavior of -fcx-fortran-rules.
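For completeness, here is a quick way to exercise the library path from user code: on typical targets, ordinary C complex division compiles to a call to __divdc3 (unless -fcx-limited-range or -fcx-fortran-rules is in effect). A minimal check, reusing the illustrative inputs from the earlier underflow sketch:

  #include <complex.h>
  #include <stdio.h>

  int
  main (void)
  {
    /* Exact quotient: 0x1p-300 - 0x1p+800 i.  */
    double complex num = 0x1.0p+1000 + 0.0 * I;
    double complex den = 0x1.0p-900 + 0x1.0p+200 * I;
    double complex q = num / den;    /* calls __divdc3 */
    printf ("%a %+ai\n", creal (q), cimag (q));
    return 0;
  }

With the current library the real part collapses to zero; with the patch applied it should come out as 0x1p-300.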
SUMMARY

When input data to complex divide has exponents between -512 and 511, this patch makes no changes in accuracy and has only a modest effect on performance. When input data contains values outside those ranges, the patch eliminates more than 99% of major errors with a tolerable cost in performance. In comparison to Elen Kalda's method, this patch introduces more performance overhead but produces major errors due to overflow/underflow roughly 20 times less often.

REFERENCES

[1] Robert L. Smith. Algorithm 116: Complex division. Commun. ACM, 5(8):435, 1962.

[2] Michael Baudin and Robert L. Smith. "A Robust Complex Division in Scilab," October 2012. Available at http://arxiv.org/abs/1210.4539.

[3] Elen Kalda. Complex division improvements in libgcc, August 2019. https://gcc.gnu.org/legacy-ml/gcc-patches/2019-08/msg01629.html

---
 libgcc/ChangeLog |  6 +++++
 libgcc/libgcc2.c | 67 +++++++++++++++++++++++++++++++++++++++++++++++++-------
 2 files changed, 65 insertions(+), 8 deletions(-)

diff --git a/libgcc/ChangeLog b/libgcc/ChangeLog
index e39bd90..2640b84 100644
--- a/libgcc/ChangeLog
+++ b/libgcc/ChangeLog
@@ -1,3 +1,9 @@
+2020-06-04  Patrick McGehearty  <patrick.mcgehea...@oracle.com>
+
+	* libgcc2.c (__divdc3): Enhance accuracy of double precision
+	complex divide for input values with very large and very
+	small exponents.
+
 2020-06-01  Uroš Bizjak  <ubiz...@gmail.com>
 
 	* config/i386/sfp-exceptions.c (struct fenv):
diff --git a/libgcc/libgcc2.c b/libgcc/libgcc2.c
index e0a9fd7..2a1d3dc 100644
--- a/libgcc/libgcc2.c
+++ b/libgcc/libgcc2.c
@@ -2036,26 +2036,77 @@ CONCAT3(__mul,MODE,3) (MTYPE a, MTYPE b, MTYPE c, MTYPE d)
 CTYPE
 CONCAT3(__div,MODE,3) (MTYPE a, MTYPE b, MTYPE c, MTYPE d)
 {
+#define RBIG ((DBL_MAX)/2.0)
+#define RMIN (DBL_MIN)
+#define RMIN2 (0x1.0p-512)
+#define RMINSCAL (0x1.0p+510)
   MTYPE denom, ratio, x, y;
   CTYPE res;
 
-  /* ??? We can get better behavior from logarithmic scaling instead of
-     the division.  But that would mean starting to link libgcc against
-     libm.  We could implement something akin to ldexp/frexp as gcc builtins
-     fairly easily...  */
+  /* Scale by max(c,d) to reduce the chance of the denominator overflowing.  */
   if (FABS (c) < FABS (d))
     {
+      /* Prevent overflow when the denominator is near max representable.  */
+      if (FABS (d) >= RBIG)
+	{
+	  a = a * 0.5;
+	  b = b * 0.5;
+	  c = c * 0.5;
+	  d = d * 0.5;
+	}
+      /* Avoid overflow/underflow issues when c and d are very small.  */
+      if (FABS (d) < RMIN2)
+	{
+	  a = a * RMINSCAL;
+	  b = b * RMINSCAL;
+	  c = c * RMINSCAL;
+	  d = d * RMINSCAL;
+	}
       ratio = c / d;
       denom = (c * ratio) + d;
-      x = ((a * ratio) + b) / denom;
-      y = ((b * ratio) - a) / denom;
+      /* Choose an alternate order of computation if ratio is subnormal.  */
+      if (FABS (ratio) > RMIN)
+	{
+	  x = ((a * ratio) + b) / denom;
+	  y = ((b * ratio) - a) / denom;
+	}
+      else
+	{
+	  x = ((c * (a / d)) + b) / denom;
+	  y = ((c * (b / d)) - a) / denom;
+	}
     }
   else
     {
+      /* Prevent overflow when the denominator is near max representable.  */
+      if (FABS (c) >= RBIG)
+	{
+	  a = a * 0.5;
+	  b = b * 0.5;
+	  c = c * 0.5;
+	  d = d * 0.5;
+	}
+      /* Avoid overflow/underflow issues when c and d are very small.  */
+      else if (FABS (c) < RMIN2)
+	{
+	  a = a * RMINSCAL;
+	  b = b * RMINSCAL;
+	  c = c * RMINSCAL;
+	  d = d * RMINSCAL;
+	}
       ratio = d / c;
       denom = (d * ratio) + c;
-      x = ((b * ratio) + a) / denom;
-      y = (b - (a * ratio)) / denom;
+      /* Choose an alternate order of computation if ratio is subnormal.  */
+      if (FABS (ratio) > RMIN)
+	{
+	  x = (a + (b * ratio)) / denom;
+	  y = (b - (a * ratio)) / denom;
+	}
+      else
+	{
+	  x = (a + (d * (b / c))) / denom;
+	  y = (b - (d * (a / c))) / denom;
+	}
     }
 
   /* Recover infinities and zeros that computed as NaN+iNaN; the only cases
-- 
1.8.3.1