The following patch to libgcc/libgcc2.c __divdc3 substantially improves the accuracy of the default double precision complex divide routine when dealing with very large or very small exponents.
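To make the problem concrete before describing the methods, the effect can be reproduced with a small standalone program. The sketch below is entirely hypothetical (it is not the test generator used for the results later in this note): it divides randomly generated complex doubles with wide exponents and compares each result against the same division carried out in long double, which serves only as a rough stand-in for the quad precision reference on targets where long double is wider than double.

  #include <complex.h>
  #include <float.h>
  #include <math.h>
  #include <stdio.h>
  #include <stdlib.h>

  /* Random double with a wide exponent range, to exercise the
     extreme-exponent cases discussed in this note.  */
  static double
  wide_random (void)
  {
    double m = 1.0 + (double) rand () / RAND_MAX;   /* mantissa in [1, 2] */
    int e = rand () % 2001 - 1000;                  /* exponent in [-1000, 1000] */
    return (rand () & 1 ? -m : m) * ldexp (1.0, e);
  }

  int
  main (void)
  {
    int bad = 0, n = 1000000;
    for (int i = 0; i < n; i++)
      {
        double a = wide_random (), b = wide_random ();
        double c = wide_random (), d = wide_random ();
        /* Double complex division; gcc normally routes this through
           __divdc3 unless -fcx-limited-range or -ffast-math is used.  */
        double _Complex z = (a + b * I) / (c + d * I);
        /* The same division in long double, a rough stand-in for the
           quad precision reference used for the tables below.  */
        long double _Complex zr = ((long double) a + (long double) b * I)
                                  / ((long double) c + (long double) d * I);
        long double lr = creall (zr), li = cimagl (zr);
        if (fabsl (lr) > DBL_MAX || fabsl (li) > DBL_MAX)
          continue;   /* reference result not representable as a double */
        double er = (double) lr, ei = (double) li;
        /* Count results where either part loses most of its mantissa.  */
        if ((er != 0.0 && fabs (creal (z) - er) > 1e-8 * fabs (er))
            || (ei != 0.0 && fabs (cimag (z) - ei) > 1e-8 * fabs (ei)))
          bad++;
      }
    printf ("%d of %d divides differ badly from the reference\n", bad, n);
    return 0;
  }

Built with a stock gcc at default options, the double _Complex divide in this sketch goes through __divdc3, so rebuilding libgcc with the patch changes only that routine.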
The current code correctly implements Smith's method (1962) [1], further modified by C99's requirements for dealing with NaN (not a number) results. When working with input values where the exponents are greater than 512 (i.e. 2.0^512) or less than -512 (i.e. 2.0^-512), results are substantially different from the answers provided by quad precision more than 1% of the time. Since the allowed exponent range for double precision numbers is -1076 to +1023, that error rate may be unacceptable for many applications. The proposed method reduces the frequency of "substantially different" answers by more than 99% at a modest cost in performance.

The differences between the current gcc methods and the new method are described first, followed by accuracy and performance comparisons.

NOTATION

For all of the following, the notation is:

Input complex values:
  a+bi  (a = 64-bit real part, b = 64-bit imaginary part)
  c+di
Output complex value:
  e+fi = (a+bi) / (c+di)

DESCRIPTIONS of different complex divide methods:

NAIVE COMPUTATION (-fcx-limited-range):

  e = (a*c + b*d) / (c*c + d*d)
  f = (b*c - a*d) / (c*c + d*d)

Note that c*c and d*d will overflow or underflow if either c or d is outside the range 2^-538 to 2^512. This method is available in gcc when the switch -fcx-limited-range is used; that switch is also enabled by -ffast-math. Only users with a clear understanding of the maximum range of intermediate values generated by their computation should consider using this switch.

SMITH'S METHOD (current libgcc):

  if (fabs (c) < fabs (d))
    {
      r = c/d;
      denom = (c*r) + d;
      e = (a*r + b) / denom;
      f = (b*r - a) / denom;
    }
  else
    {
      r = d/c;
      denom = c + (d*r);
      e = (a + b*r) / denom;
      f = (b - a*r) / denom;
    }

Smith's method is the current default method used by __divdc3.

NEW METHOD (proposed by patch) to replace the current default method:

The proposed method starts with an algorithm proposed by Baudin & Smith (2012) in the paper "A Robust Complex Division in Scilab" [2]. The patch makes additional modifications to that method for further reductions in the error rate.

  #define RBIG ((DBL_MAX)/2.0)
  #define RMIN (DBL_MIN)
  #define RMIN2 (0x1.0p-512)
  #define RMINSCAL (0x1.0p+510)

  /* prevent overflow when arguments are near max representable */
  if ((FABS (d) >= RBIG) || (FABS (a) >= RBIG) || (FABS (b) >= RBIG))
    {
      a = a * 0.5;
      b = b * 0.5;
      c = c * 0.5;
      d = d * 0.5;
    }
  /* minimize overflow/underflow issues when c and d are small */
  else if (FABS (d) < RMIN2)
    {
      a = a * RMINSCAL;
      b = b * RMINSCAL;
      c = c * RMINSCAL;
      d = d * RMINSCAL;
    }
  r = c/d;
  denom = (c*r) + d;
  if (FABS (r) > RMIN)
    {
      e = (a*r + b) / denom;
      f = (b*r - a) / denom;
    }
  else
    {
      e = (c * (a/d) + b) / denom;
      f = (c * (b/d) - a) / denom;
    }

  [ Only the FABS (c) < FABS (d) case is shown here; the full code is in the patch. ]

Before computing the answer, the code checks for near-maximum or near-minimum inputs and scales all four values away from the extremes. If the complex divide can be computed at all without generating infinities, these scalings do not affect the accuracy, since they are by a power of 2. Values over RBIG are relatively rare, but the test for them is cheap and necessary to avoid spurious overflows. Testing for RMIN2 reveals when both c and d are less than 2^-512. By scaling all values by 2^510, the code avoids many underflows in intermediate computations that might otherwise occur. If scaling a and b by 2^510 causes either to overflow, then the computation will overflow whatever method is used.

Next, r (the ratio of c to d) is checked for being near zero.
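Before discussing that test, here is a standalone rendering of both branches of the proposed algorithm as an ordinary C function, for readers who want to experiment outside of libgcc. It mirrors the patch below but is only a sketch: the function name divdc3_sketch is made up, and the C99 recovery of infinities and zeros from NaN+iNaN results, which the real __divdc3 performs afterwards, is omitted.

  #include <complex.h>
  #include <float.h>
  #include <math.h>

  #define RBIG     (DBL_MAX / 2.0)
  #define RMIN     (DBL_MIN)
  #define RMIN2    (0x1.0p-512)
  #define RMINSCAL (0x1.0p+510)

  /* Standalone sketch of the proposed algorithm for (a+bi)/(c+di).  */
  static double _Complex
  divdc3_sketch (double a, double b, double c, double d)
  {
    double r, denom, e, f;

    if (fabs (c) < fabs (d))
      {
        /* Prevent overflow when arguments are near max representable.  */
        if (fabs (d) >= RBIG || fabs (a) >= RBIG || fabs (b) >= RBIG)
          {
            a *= 0.5; b *= 0.5; c *= 0.5; d *= 0.5;
          }
        /* Scale up when c and d are both very small, to avoid underflow
           in intermediate products.  */
        else if (fabs (d) < RMIN2)
          {
            a *= RMINSCAL; b *= RMINSCAL; c *= RMINSCAL; d *= RMINSCAL;
          }
        r = c / d;
        denom = (c * r) + d;
        if (fabs (r) > RMIN)
          {
            e = ((a * r) + b) / denom;
            f = ((b * r) - a) / denom;
          }
        else
          {
            /* r is subnormal or zero: reorder a*(c/d) as (a/d)*c.  */
            e = ((c * (a / d)) + b) / denom;
            f = ((c * (b / d)) - a) / denom;
          }
      }
    else
      {
        if (fabs (c) >= RBIG || fabs (a) >= RBIG || fabs (b) >= RBIG)
          {
            a *= 0.5; b *= 0.5; c *= 0.5; d *= 0.5;
          }
        else if (fabs (c) < RMIN2)
          {
            a *= RMINSCAL; b *= RMINSCAL; c *= RMINSCAL; d *= RMINSCAL;
          }
        r = d / c;
        denom = (d * r) + c;
        if (fabs (r) > RMIN)
          {
            e = (a + (b * r)) / denom;
            f = (b - (a * r)) / denom;
          }
        else
          {
            e = (a + (d * (b / c))) / denom;
            f = (b - (d * (a / c))) / denom;
          }
      }
    return e + f * I;
  }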
Baudin and Smith checked r for zero. Checking instead for values less than DBL_MIN covers more cases and improves overall accuracy. If r is near zero, then when it is used in a multiplication there is a high chance that the result will underflow to zero, losing significant accuracy. That underflow can be avoided if the computation is done in a different order: when r is subnormal, the code replaces a*r (= a*(c/d)) with (a/d)*c, which is mathematically the same but avoids the unnecessary underflow.

TEST DATA

Two sets of data are used to test these methods. Both sets contain 10 million pairs of 64-bit complex values. The exponents and mantissas are generated using multiple calls to random() and then combining the results. Only values whose complex divide results are representable in 64 bits after being computed in quad precision are used. The first data set is labeled "moderate exponents"; its exponent range is limited to -512 to +511. The second data set is labeled "full exponents"; its exponent range is -1076 to +1024.

ACCURACY Test results:

Note: All results are based on use of fused multiply-add. If fused multiply-add is not used, the error rate increases slightly for the 2 ulp and 8 ulp cases.

The complex divide methods are evaluated by determining what percentage of values exceed different ulp (units in the last place) levels. If the "2 ulp" row shows 1%, that means 1% of the 10,000,000 values (100,000) have either a real or imaginary part that differs from the quad precision result by at least 2 ulp. Results are reported for differences greater than or equal to 2 ulp, 8 ulp, 16 ulp, 24 ulp, and 52 ulp.

Even when the patch avoids overflows and underflows, some input values are expected to have errors due to the normal limitations of floating point subtraction. For example, when b*c and a*d are nearly equal, the subtraction has a potential loss of accuracy. This patch does not attempt to detect or minimize this type of error, but neither does it increase it.

In the following charts, lower values are better.

  ====================================
  Errors       Moderate Dataset
  gtr eq       current     new
  ======       ========    ========
   2 ulp       0.01762%    0.01762%
   8 ulp       0.00026%    0.00026%
  16 ulp       0.00000%    0.00000%
  24 ulp       0%          0%
  52 ulp       0%          0%
  ====================================
  Table 1: Errors with Moderate Dataset

Note in Table 1 that both the old and new methods give identical error rates for data with moderate exponents. Errors exceeding 16 ulp are exceedingly rare.

  ====================================
  Errors       Full Dataset
  gtr eq       current     new
  ======       ========    ========
   2 ulp       1.70%       0.03874%
   8 ulp       1.61%       0.02403%
  16 ulp       1.51%       0.01554%
  24 ulp       1.41%       0.00947%
  52 ulp       1.13%       0.00001%
  ====================================
  Table 2: Errors with Full Dataset

Table 2 shows significant differences in error rates. With the current code, 1.13% of test values have results where no bits of the mantissa match the correct answer. That level of error is extremely rare with the new code.

Consider the results for errors greater than or equal to 24 ulp. The current code produces such errors at a rate of 1.4%. Most would agree that such answers are "substantially different" from the answers calculated using extended precision. The new code reduces errors at that level by more than 99%. These improvements are important for serious computation with complex numbers.

PERFORMANCE Test results

In order for a library change to be practical, it is necessary to show that the slowdown is tolerable.
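The nanoseconds-per-divide figures reported below can be gathered with a loop of the following shape. This harness is hypothetical (it is not the benchmark used for Table 3), and it fills the arrays with bland random data only to keep the sketch short; the real measurement would load the data sets described above.

  #include <complex.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <time.h>

  #define N 10000000

  int
  main (void)
  {
    double _Complex *num = malloc (N * sizeof *num);
    double _Complex *den = malloc (N * sizeof *den);
    double _Complex *quot = malloc (N * sizeof *quot);
    if (!num || !den || !quot)
      return 1;

    /* The real measurement would load the 10 million value data sets
       described above; random data keeps this sketch short.  */
    for (long i = 0; i < N; i++)
      {
        num[i] = (double) rand () + (double) rand () * I;
        den[i] = (double) rand () + 1.0 + (double) rand () * I;
      }

    struct timespec t0, t1;
    clock_gettime (CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < N; i++)
      quot[i] = num[i] / den[i];   /* each iteration calls __divdc3 */
    clock_gettime (CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    /* Print a value derived from the results so the divide loop cannot
       be optimized away.  */
    printf ("%.1f ns per divide (checksum %g)\n", ns / N,
            creal (quot[0]) + creal (quot[N - 1]));
    return 0;
  }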
The slowdowns observed are much less than would be seen by (for example) switching to quad precision, which on many machines may cause a slowdown of 1000x.

The actual slowdown depends on the machine architecture. It also depends on the nature of the input data. If underflow/overflow is rare, then implementations with strong branch prediction will only slow down by a few cycles. If underflow/overflow is common, then the branch predictors will be less successful and the cost will be higher.

Results from two machines are presented as examples of the overhead for the new method. The one labeled x86 is a 5-year-old Intel x86 processor and the one labeled aarch64 is a 3-year-old arm64 processor.

In the following chart, lower values are better. The values are nanoseconds per complex divide (ns) averaged over the full 10-million-value data sets.

  =====================================================
                 Moderate set          Full set
                x86      aarch64      x86      aarch64
  =========   =================    =================
  current     19.5 ns   34.9 ns    29.4 ns   32.9 ns
  new         20.0 ns   42.7 ns    37.4 ns   53.8 ns
  =========   =================    =================
  %slowdown      3%       22%         27%      64%
  =====================================================
  Table 3: Performance Comparisons

In Table 3, overhead is modest when the new branches are consistently false. The overhead grows when the new branches are taken to get correct answers. Both platforms are somewhat dated, with the x86 having a better branch predictor, which reduces the cost of the additional branches in the new code. Of course, the relative slowdown may be greater for some architectures, especially those with limited branch prediction combined with a high cost of misprediction.

This cost is claimed to be tolerable on the grounds that:

(a) most applications will only spend a small fraction of their time calculating complex divide;
(b) it is much less than the cost of extended precision;
(c) users are not forced to pay it (as described below);
(d) the cost is worthwhile considering the accuracy improvement shown in Table 2.

Those users who find this degree of slowdown unsatisfactory may use the switch -fcx-fortran-rules, which does not use the library routine and instead inlines Smith's method without the C99 requirement for dealing with NaN results. The proposed patch for double precision complex divide does not affect the behavior of -fcx-fortran-rules.

SUMMARY

When the input data to complex divide has exponents between -512 and 511, this patch makes no change in accuracy and has only a modest effect on performance. When the input data contains values outside that range, the patch eliminates more than 99% of major errors at a tolerable cost in performance.

REFERENCES

[1] Robert L. Smith. Algorithm 116: Complex division. Commun. ACM, 5(8):435, 1962.

[2] Michael Baudin and Robert L. Smith. "A robust complex division in Scilab," October 2012. Available at http://arxiv.org/abs/1210.4539.

---
 libgcc/ChangeLog |  6 ++++++
 libgcc/libgcc2.c | 64 ++++++++++++++++++++++++++++++++++++++++++++++++++++----
 2 files changed, 66 insertions(+), 4 deletions(-)

diff --git a/libgcc/ChangeLog b/libgcc/ChangeLog
index 06e7986..5a666d4 100644
--- a/libgcc/ChangeLog
+++ b/libgcc/ChangeLog
@@ -1,3 +1,9 @@
+2020-05-29  Patrick McGehearty  <patrick.mcgehea...@oracle.com>
+
+	* libgcc2.c (__divdc3): Enhance accuracy of double precision
+	complex divide for input values with very large and very
+	small exponents.
+
 2020-05-28  Dong JianQiang  <dongjianqia...@huawei.com>
 
 	PR gcov-profile/95332
diff --git a/libgcc/libgcc2.c b/libgcc/libgcc2.c
index e0a9fd7..2d32b7c 100644
--- a/libgcc/libgcc2.c
+++ b/libgcc/libgcc2.c
@@ -2036,6 +2036,10 @@ CONCAT3(__mul,MODE,3) (MTYPE a, MTYPE b, MTYPE c, MTYPE d)
 CTYPE
 CONCAT3(__div,MODE,3) (MTYPE a, MTYPE b, MTYPE c, MTYPE d)
 {
+#define RBIG ((DBL_MAX)/2.0)
+#define RMIN (DBL_MIN)
+#define RMIN2 (0x1.0p-512)
+#define RMINSCAL (0x1.0p+510)
   MTYPE denom, ratio, x, y;
   CTYPE res;
 
@@ -2043,19 +2047,71 @@ CONCAT3(__div,MODE,3) (MTYPE a, MTYPE b, MTYPE c, MTYPE d)
      the division.  But that would mean starting to link libgcc against
      libm.  We could implement something akin to ldexp/frexp as gcc
      builtins fairly easily...  */
+  /*
+     Scaling by max(c,d) shifts a division and avoids a multiplication,
+     reducing round-off errors by a modest amount (average of a half-bit)
+  */
   if (FABS (c) < FABS (d))
     {
+      /* prevent overflow when arguments are near max representable */
+      if ((FABS (d) >= RBIG) || (FABS (a) >= RBIG) || (FABS (b) >= RBIG))
+	{
+	  a = a * 0.5;
+	  b = b * 0.5;
+	  c = c * 0.5;
+	  d = d * 0.5;
+	}
+      /* avoid overflow/underflow issues when c and d are small */
+      else if (FABS (d) < RMIN2)
+	{
+	  a = a * RMINSCAL;
+	  b = b * RMINSCAL;
+	  c = c * RMINSCAL;
+	  d = d * RMINSCAL;
+	}
       ratio = c / d;
       denom = (c * ratio) + d;
-      x = ((a * ratio) + b) / denom;
-      y = ((b * ratio) - a) / denom;
+      if (FABS (ratio) > RMIN)
+	{
+	  x = ((a * ratio) + b) / denom;
+	  y = ((b * ratio) - a) / denom;
+	}
+      else
+	{
+	  x = ((c * (a / d)) + b) / denom;
+	  y = ((c * (b / d)) - a) / denom;
+	}
     }
   else
     {
+      /* prevent overflow when arguments are near max representable */
+      if ((FABS (c) >= RBIG) || (FABS (a) >= RBIG) || (FABS (b) >= RBIG))
+	{
+	  a = a * 0.5;
+	  b = b * 0.5;
+	  c = c * 0.5;
+	  d = d * 0.5;
+	}
+      /* avoid overflow issues when c and d are small */
+      else if (FABS (c) < RMIN2)
+	{
+	  a = a * RMINSCAL;
+	  b = b * RMINSCAL;
+	  c = c * RMINSCAL;
+	  d = d * RMINSCAL;
+	}
       ratio = d / c;
       denom = (d * ratio) + c;
-      x = ((b * ratio) + a) / denom;
-      y = (b - (a * ratio)) / denom;
+      if (FABS (ratio) > RMIN)
+	{
+	  x = (a + (b * ratio)) / denom;
+	  y = (b - (a * ratio)) / denom;
+	}
+      else
+	{
+	  x = (a + (d * (b / c))) / denom;
+	  y = (b - (d * (a / c))) / denom;
+	}
     }
 
   /* Recover infinities and zeros that computed as NaN+iNaN; the only cases
-- 
1.8.3.1