The following patch to libgcc/libgcc2.c __divdc3 substantially improves the accuracy of the default double precision complex divide routine when dealing with very large or very small exponents.
The current code correctly implements Smith's method (1962) [1], further modified by C99's requirements for dealing with NaN (not a number) results. When working with input values where the exponents are greater than 512 (i.e. 2.0^512) or less than -512 (i.e. 2.0^-512), results are substantially different from the answers provided by quad precision more than 1% of the time. Since the allowed exponent range for double precision numbers is -1076 to +1023, that error rate may be unacceptable for many applications. The proposed method reduces the frequency of "substantially different" answers by more than 99% at a modest cost in performance. Differences between the current gcc methods and the new method are described first; accuracy and performance differences are discussed after that.

(Added for Version 2) In my initial research, I missed Elen Kalda's proposed patch
https://gcc.gnu.org/legacy-ml/gcc-patches/2019-08/msg01629.html [3].
Thanks to Joseph Myers for providing me with the pointer. This version includes performance and accuracy comparisons between Elen's proposed patch and my latest patch version, which has been modified to eliminate two branches.

NOTATION

For all of the following, the notation is:

Input complex values:
  a+bi  (a = 64-bit real part, b = 64-bit imaginary part)
  c+di

Output complex value:
  e+fi = (a+bi)/(c+di)

For the result tables:
  current = current method (SMITH)
  b1div   = method proposed by Elen Kalda
  b2div   = alternate method considered by Elen Kalda
  new1    = new method using 1 divide and 2 multiplies
  new     = new method proposed by this patch

DESCRIPTIONS of different complex divide methods:

NAIVE COMPUTATION (-fcx-limited-range):

  e = (a*c + b*d)/(c*c + d*d)
  f = (b*c - a*d)/(c*c + d*d)

Note that c*c and d*d will overflow or underflow if either c or d is outside the range 2^-538 to 2^512. This method is available in gcc when the switch -fcx-limited-range is used; that switch is also enabled by -ffast-math. Only users who have a clear understanding of the maximum range of intermediate values generated by a computation should consider using this switch.

SMITH's METHOD (current libgcc):

  if (fabs (c) < fabs (d))
    {
      r = c/d;
      denom = (c*r) + d;
      e = (a*r + b) / denom;
      f = (b*r - a) / denom;
    }
  else
    {
      r = d/c;
      denom = c + (d*r);
      e = (a + b*r) / denom;
      f = (b - a*r) / denom;
    }

Smith's method is the current default method used by __divdc3.
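To make the difference between these two methods concrete, here is a minimal standalone sketch (not part of the patch; the inputs near 2^600 are chosen purely for illustration) comparing the naive formula with Smith's method on a quotient whose exact value is 1+0i:

  /* Sketch: naive vs. Smith complex divide at large exponents.
     Inputs are illustrative; the exact quotient is 1 + 0i.  */
  #include <stdio.h>
  #include <math.h>

  static void
  naive (double a, double b, double c, double d, double *e, double *f)
  {
    double denom = c*c + d*d;     /* overflows to inf for |c|,|d| > 2^512 */
    *e = (a*c + b*d) / denom;     /* inf/inf -> NaN for these inputs */
    *f = (b*c - a*d) / denom;     /* (inf-inf)/inf -> NaN for these inputs */
  }

  static void
  smith (double a, double b, double c, double d, double *e, double *f)
  {
    if (fabs (c) < fabs (d))
      {
        double r = c/d, denom = (c*r) + d;
        *e = (a*r + b) / denom;
        *f = (b*r - a) / denom;
      }
    else
      {
        double r = d/c, denom = c + (d*r);
        *e = (a + b*r) / denom;
        *f = (b - a*r) / denom;
      }
  }

  int
  main (void)
  {
    double a = 0x1.0p+600, b = 0x1.0p+600;
    double c = 0x1.0p+600, d = 0x1.0p+600;
    double e, f;
    naive (a, b, c, d, &e, &f);
    printf ("naive: %g %+gi\n", e, f);   /* prints NaNs */
    smith (a, b, c, d, &e, &f);
    printf ("smith: %g %+gi\n", e, f);   /* prints 1 +0i */
    return 0;
  }

Smith's method survives this case because it never forms c*c + d*d directly; the Full Dataset results below show where it, in turn, breaks down.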
Elen Kalda's METHOD

This method applies the most significant part of the algorithm proposed by Baudin & Smith (2012) in the paper "A Robust Complex Division in Scilab" [2]. It also replaces two divides by one divide and two multiplies due to the high cost of divide on aarch64. In the comparison sections, this method will be labeled b1div. A variation discussed in that patch which does not replace the two divides will be labeled b2div.

  inline void improved_internal (MTYPE a, MTYPE b, MTYPE c, MTYPE d)
  {
    r = d/c;
    t = 1.0 / (c + (d * r));
    if (r != 0)
      {
        x = (a + (b * r)) * t;
        y = (b - (a * r)) * t;
      }
    else
      {
        /* Changing the order of operations avoids the underflow of r
           impacting the result.  */
        x = (a + (d * (b / c))) * t;
        y = (b - (d * (a / c))) * t;
      }
  }

  if (FABS (d) < FABS (c))
    {
      improved_internal (a, b, c, d);
    }
  else
    {
      improved_internal (b, a, d, c);
      y = -y;
    }

NEW METHOD (proposed by patch) to replace the current default method:

The proposed method starts with an algorithm proposed by Baudin & Smith (2012) in the paper "A Robust Complex Division in Scilab" [2]. The patch makes additional modifications to that method for further reductions in the error rate.

  #define RBIG ((DBL_MAX)/2.0)
  #define RMIN (DBL_MIN)
  #define RMIN2 (0x1.0p-512)
  #define RMINSCAL (0x1.0p+510)

  /* prevent overflow when arguments are near max representable */
  if ((FABS (d) > RBIG) || (FABS (a) > RBIG) || (FABS (b) > RBIG))
    {
      a = a * 0.5;
      b = b * 0.5;
      c = c * 0.5;
      d = d * 0.5;
    }
  /* minimize overflow/underflow issues when c and d are small */
  else if (FABS (d) < RMIN2)
    {
      a = a * RMINSCAL;
      b = b * RMINSCAL;
      c = c * RMINSCAL;
      d = d * RMINSCAL;
    }
  r = c/d;
  denom = (c*r) + d;
  if (r > RMIN)
    {
      e = (a*r + b) / denom;
      f = (b*r - a) / denom;
    }
  else
    {
      e = (c * (a/d) + b) / denom;
      f = (c * (b/d) - a) / denom;
    }

[ Only the fabs(c) < fabs(d) case is presented here; the full code is in the patch. ]

Before any computation of the answer, the code checks for near maximum or near minimum inputs and scales the inputs to move all values away from the extremes. If the complex divide can be computed at all without generating infinities, these scalings will not affect the accuracy since they are by a power of 2. Values that are over RBIG are relatively rare, but testing for them is cheap and necessary to avoid unnecessary overflows. Testing for RMIN2 reveals when both c and d are less than 2^-512. By scaling all values by 2^510, the code avoids many underflows in intermediate computations that otherwise might occur. If scaling a and b by 2^510 causes either to overflow, then the computation will overflow whatever method is used.

Next, r (the ratio of c to d) is checked for being near zero. Baudin and Smith checked r for zero; checking for values less than DBL_MIN covers more cases and improves overall accuracy. If r is near zero, then when it is used in a multiplication, there is a high chance that the result will underflow to zero, losing significant accuracy. That underflow can be avoided if the computation is done in a different order. When r is subnormal, the code replaces a*r (= a*(c/d)) with ((a/d)*c), which is mathematically the same but avoids the unnecessary underflow.
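The effect of this reordering can be seen in the following standalone sketch (the values are illustrative, chosen so that r underflows, and it mirrors only the fabs(c) < fabs(d) path shown above). The exact real part of the quotient, (a*c + b*d)/(c*c + d*d), is 2^-300 here:

  /* Sketch: recovering accuracy when r = c/d underflows to zero.  */
  #include <stdio.h>

  int
  main (void)
  {
    double a = 0x1.0p+1000, b = 0.0;
    double c = 0x1.0p-900, d = 0x1.0p+200;
    double r = c / d;                            /* 2^-1100 underflows to 0.0 */
    double denom = (c * r) + d;                  /* == d */
    double e_old = ((a * r) + b) / denom;        /* a*r == 0, so 0.0 */
    double e_new = ((c * (a / d)) + b) / denom;  /* == 0x1p-300 */
    printf ("old: %a  new: %a\n", e_old, e_new);
    return 0;
  }

With the original ordering, the a*r term vanishes and every bit of the real part is lost; the reordered computation recovers the correctly rounded result.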
TEST Data

Two sets of data are presented to test these methods. Both sets contain 10 million pairs of 64-bit complex values. The exponents and mantissas are generated using multiple calls to random() and then combining the results. Only values whose complex divide results are representable in 64 bits after being computed in quad precision are used.

The first data set is labeled "moderate exponents"; its exponent range is limited to -512 to +511. The second data set is labeled "full exponents"; its exponent range is -1076 to +1024.

ACCURACY Test results:

Note: All results are based on use of fused multiply-add. If fused multiply-add is not used, the error rate increases slightly for the 2 ulp and 8 ulp cases.

The complex divide methods are evaluated by determining what percentage of values exceed different ulp (units in last place) levels. If a "2 ulp" test result shows 1%, that means 1% of the 10,000,000 values (100,000) have either a real or imaginary part that differs from the quad precision result by more than 2 ulp. Results are reported for differences greater than or equal to 1 ulp, 2 ulp, 8 ulp, 16 ulp, 24 ulp, and 52 ulp.

Even when the patch avoids overflows and underflows, some input values are expected to have errors due to the normal limitations of floating point subtraction. For example, when b*c and a*d are nearly equal, the subtraction has potential loss of accuracy. This patch does not attempt to detect or minimize this type of error, but neither does it increase it.

In the following charts, lower values are better.

===============================================================
            Errors, Moderate Dataset
gtr eq    current    b1div      b2div      new1       new
======   ========   ========   ========   ========   ========
 1 ulp   0.24707%   0.92986%   0.24707%   0.92986%   0.24707%
 2 ulp   0.01762%   0.01770%   0.01762%   0.01770%   0.01762%
 8 ulp   0.00026%   0.00026%   0.00026%   0.00026%   0.00026%
16 ulp   0.00000%   0.00000%   0.00000%   0.00000%   0.00000%
24 ulp         0%         0%         0%         0%         0%
52 ulp         0%         0%         0%         0%         0%
===============================================================
Table 1: Errors with Moderate Dataset

Note in Table 1 that both the old and new methods give identical error rates for data with moderate exponents. Errors exceeding 16 ulp are exceedingly rare. There are substantial increases in the 1 ulp error rates for the 1 divide/2 multiply methods (b1div, new1) as compared to the 2 divide methods, but these differences are minimal for 2 ulp and larger error measurements.

===============================================================
            Errors, Full Dataset
gtr eq    current    b1div      b2div      new1       new
======   ========   ========   ========   ========   ========
 1 ulp   1.87%      1.21724%   0.65349%   0.74080%   0.19755%
 2 ulp   1.70%      0.49497%   0.48573%   0.04254%   0.03857%
 8 ulp   1.61%      0.41617%   0.40929%   0.02394%   0.02394%
16 ulp   1.51%      0.33025%   0.32064%   0.01548%   0.01548%
24 ulp   1.41%      0.25092%   0.23914%   0.00945%   0.00945%
52 ulp   1.13%      0.01886%   0.00350%   0.00001%   0.00001%
===============================================================
Table 2: Errors with Full Dataset

Table 2 shows significant differences in error rates, especially between the current method and the new method. With the current code, 1.13% of test values produce results where no bits of the mantissa match the correct answer; that level of error is extremely rare with the new code. Consider the results for errors greater than or equal to 24 ulp: the current code sees those errors at a rate of 1.41%, and most would agree that such answers are "substantially different" from the answers calculated using extended precision. The b1div and b2div methods still show errors at about one-fifth of that rate. The new code reduces errors at that level by more than 99%. These improvements are important for serious computation with complex numbers.
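For reference, the per-value ulp distances behind Tables 1 and 2 can be measured with the standard trick of reinterpreting IEEE-754 doubles as sign-adjusted integers. The sketch below is an assumed harness, not the one actually used for these tables; it uses GCC's __float128 extension for the quad precision reference:

  #include <stdint.h>
  #include <string.h>
  #include <stdio.h>

  /* Map a finite double's bit pattern to an integer whose ordering
     matches the ordering of the doubles themselves.  */
  static int64_t
  ordered (double x)
  {
    int64_t i;
    memcpy (&i, &x, sizeof i);
    return (i < 0) ? INT64_MIN - i : i;
  }

  /* Distance in ulps between a computed double and a reference that
     has already been rounded from quad to double precision.  */
  static uint64_t
  ulp_diff (double computed, double reference)
  {
    int64_t a = ordered (computed), b = ordered (reference);
    return (a > b) ? (uint64_t) a - (uint64_t) b
                   : (uint64_t) b - (uint64_t) a;
  }

  int
  main (void)
  {
    __float128 ref_q = (__float128) 1.0 / 3.0;  /* quad precision reference */
    double computed = 1.0 / 3.0;
    printf ("%llu ulp\n",
            (unsigned long long) ulp_diff (computed, (double) ref_q));
    return 0;
  }

A full harness would apply ulp_diff to the real and imaginary parts of each quotient and count how many values meet or exceed each of the thresholds reported above.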
PERFORMANCE Test results

In order for a library change to be practical, it is necessary to show that the slowdown is tolerable. The slowdowns observed are much less than would be seen by (for example) switching to quad precision, which on many machines may cause a slowdown of 1000x.

The actual slowdown depends on the machine architecture. It also depends on the nature of the input data. If underflow/overflow is rare, then implementations with strong branch prediction will only slow down by a few cycles. If underflow/overflow is common, then the branch predictors will be less successful and the cost will be higher.

Results from two machines are presented as examples of the overhead for the new method. The one labeled x86 is a 5-year-old Intel x86 processor and the one labeled aarch64 is a 3-year-old arm64 processor.

In the following chart, the times are averaged over the full 10 million value data set. All values are scaled so that the current method is 1.0. Lower values are better: a value less than 1.0 is faster than the current method and a value greater than 1.0 is slower.

================================================
           Moderate set        full set
           x86    aarch64      x86    aarch64
=======    ==============      ==============
current    1.00   1.00         1.00   1.00
b1div      0.85   0.89         1.17   1.06
b2div      0.98   1.03         1.29   1.27
new1       0.87   1.12         1.15   1.28
new        1.01   1.35         1.25   1.58
================================================
Table 3: Performance Comparisons

There are two performance considerations in Table 3. First, there is a clear performance advantage of 10% to 15% in using the 1 divide and 2 multiplies methods instead of the 2 divide methods. Since most modern architectures have a much higher cost for floating point divide than for floating point multiply, this difference is not a surprise. However, choosing to replace two divides with one divide and two multiplies, trading a slight loss of accuracy for the performance gain, is a decision independent of the proposed changes that minimize major errors due to overflow/underflow. For that reason, the proposed patch does not include the change in the divide method.

Second, the moderate dataset shows less overhead for the newer methods than the full dataset does. That is because the moderate dataset never takes the new branches which protect against underflow/overflow; the better the branch predictor, the lower the cost of these untaken branches. Both platforms are somewhat dated, with the x86 having the better branch predictor, which reduces the cost of the additional branches in the new code. Of course, the relative slowdown may be greater for some architectures, especially those with limited branch prediction combined with a high cost of misprediction.

This cost is claimed to be tolerable on the grounds that:

  (a) most applications will only spend a small fraction of their
      time calculating complex divide;
  (b) it is much less than the cost of extended precision;
  (c) users are not forced to pay it (as described below);
  (d) the cost is worthwhile considering the accuracy improvement
      shown in Table 2.

Those users who find this degree of slowdown unsatisfactory may use the switch -fcx-fortran-rules, which does not use the library routine and instead inlines Smith's method without the C99 requirement for dealing with NaN results. The proposed patch for double precision complex divide does not affect the behavior of -fcx-fortran-rules.
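For completeness, here is a quick way to exercise the library path from user code: on typical targets, ordinary C complex division compiles to a call to __divdc3 (unless -fcx-limited-range or -fcx-fortran-rules is in effect). A minimal check, reusing the illustrative inputs from the earlier underflow sketch:

  #include <complex.h>
  #include <stdio.h>

  int
  main (void)
  {
    /* Exact quotient: 0x1p-300 - 0x1p+800 i.  */
    double complex num = 0x1.0p+1000 + 0.0 * I;
    double complex den = 0x1.0p-900 + 0x1.0p+200 * I;
    double complex q = num / den;    /* calls __divdc3 */
    printf ("%a %+ai\n", creal (q), cimag (q));
    return 0;
  }

With the current library the real part collapses to zero; with the patch applied it should come out as 0x1p-300.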
SUMMARY

When input data to complex divide has exponents between -512 and 511, this patch makes no changes in accuracy and has only a modest effect on performance. When input data contains values outside those ranges, the patch eliminates more than 99% of major errors with a tolerable cost in performance. In comparison to Elen Kalda's method, this patch introduces more performance overhead but produces major errors due to overflow/underflow roughly 20 times less often.

REFERENCES

[1] Robert L. Smith. Algorithm 116: Complex division. Commun. ACM, 5(8):435, 1962.

[2] Michael Baudin and Robert L. Smith. "A Robust Complex Division in Scilab," October 2012. Available at http://arxiv.org/abs/1210.4539.

[3] Elen Kalda. Complex division improvements in libgcc, August 2019. https://gcc.gnu.org/legacy-ml/gcc-patches/2019-08/msg01629.html

---
 libgcc/ChangeLog |  6 +++++
 libgcc/libgcc2.c | 67 +++++++++++++++++++++++++++++++++++++++++++++++++-------
 2 files changed, 65 insertions(+), 8 deletions(-)

diff --git a/libgcc/ChangeLog b/libgcc/ChangeLog
index e39bd90..2640b84 100644
--- a/libgcc/ChangeLog
+++ b/libgcc/ChangeLog
@@ -1,3 +1,9 @@
+2020-06-04  Patrick McGehearty  <patrick.mcgehea...@oracle.com>
+
+	* libgcc2.c (__divdc3): Enhance accuracy of double precision
+	complex divide for input values with very large and very
+	small exponents.
+
 2020-06-01  Uroš Bizjak  <ubiz...@gmail.com>
 
 	* config/i386/sfp-exceptions.c (struct fenv):
diff --git a/libgcc/libgcc2.c b/libgcc/libgcc2.c
index e0a9fd7..2a1d3dc 100644
--- a/libgcc/libgcc2.c
+++ b/libgcc/libgcc2.c
@@ -2036,26 +2036,77 @@ CONCAT3(__mul,MODE,3) (MTYPE a, MTYPE b, MTYPE c, MTYPE d)
 CTYPE
 CONCAT3(__div,MODE,3) (MTYPE a, MTYPE b, MTYPE c, MTYPE d)
 {
+#define RBIG ((DBL_MAX)/2.0)
+#define RMIN (DBL_MIN)
+#define RMIN2 (0x1.0p-512)
+#define RMINSCAL (0x1.0p+510)
   MTYPE denom, ratio, x, y;
   CTYPE res;
 
-  /* ??? We can get better behavior from logarithmic scaling instead of
-     the division.  But that would mean starting to link libgcc against
-     libm.  We could implement something akin to ldexp/frexp as gcc builtins
-     fairly easily...  */
+  /* Scale by max(c,d) to reduce the chance of the denominator overflowing.  */
   if (FABS (c) < FABS (d))
     {
+      /* Prevent overflow when the denominator is near max representable.  */
+      if (FABS (d) >= RBIG)
+	{
+	  a = a * 0.5;
+	  b = b * 0.5;
+	  c = c * 0.5;
+	  d = d * 0.5;
+	}
+      /* Avoid overflow/underflow issues when c and d are very small.  */
+      if (FABS (d) < RMIN2)
+	{
+	  a = a * RMINSCAL;
+	  b = b * RMINSCAL;
+	  c = c * RMINSCAL;
+	  d = d * RMINSCAL;
+	}
       ratio = c / d;
       denom = (c * ratio) + d;
-      x = ((a * ratio) + b) / denom;
-      y = ((b * ratio) - a) / denom;
+      /* Choose an alternate order of computation if ratio is subnormal.  */
+      if (FABS (ratio) > RMIN)
+	{
+	  x = ((a * ratio) + b) / denom;
+	  y = ((b * ratio) - a) / denom;
+	}
+      else
+	{
+	  x = ((c * (a / d)) + b) / denom;
+	  y = ((c * (b / d)) - a) / denom;
+	}
     }
   else
     {
+      /* Prevent overflow when the denominator is near max representable.  */
+      if (FABS (c) >= RBIG)
+	{
+	  a = a * 0.5;
+	  b = b * 0.5;
+	  c = c * 0.5;
+	  d = d * 0.5;
+	}
+      /* Avoid overflow/underflow issues when c and d are very small.  */
+      else if (FABS (c) < RMIN2)
+	{
+	  a = a * RMINSCAL;
+	  b = b * RMINSCAL;
+	  c = c * RMINSCAL;
+	  d = d * RMINSCAL;
+	}
       ratio = d / c;
       denom = (d * ratio) + c;
-      x = ((b * ratio) + a) / denom;
-      y = (b - (a * ratio)) / denom;
+      /* Choose an alternate order of computation if ratio is subnormal.  */
+      if (FABS (ratio) > RMIN)
+	{
+	  x = (a + (b * ratio)) / denom;
+	  y = (b - (a * ratio)) / denom;
+	}
+      else
+	{
+	  x = (a + (d * (b / c))) / denom;
+	  y = (b - (d * (a / c))) / denom;
+	}
     }
 
   /* Recover infinities and zeros that computed as NaN+iNaN; the only cases
-- 
1.8.3.1