Hi all,

Libgcc currently implements Smith's algorithm (https://dl.acm.org/citation.cfm?id=368661) for complex division. It is designed to prevent overflows and underflows, and it does that well in around 97.7% of cases. However, there has been some interest in improving on that result. As an example, Baudin improves the robustness to ~99.5% by introducing an extra underflow check, and also gains performance by replacing two divisions with two multiplications and one division (https://arxiv.org/abs/1210.4539).
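For reference, here is a double-only sketch of the textbook expansion next to Smith's rescaling as currently used by libgcc. This is my illustration, not the libgcc code itself (the real code is type-generic and appears in the patch below), but the Smith variant mirrors the code being replaced there:

#include <math.h>

/* Textbook expansion of (a + bi) / (c + di): the c*c + d*d term
   overflows or underflows long before the true quotient does.  */
static void
div_naive (double a, double b, double c, double d, double *e, double *f)
{
  double den = c * c + d * d;
  *e = (a * c + b * d) / den;
  *f = (b * c - a * d) / den;
}

/* Smith's algorithm: scale by the larger of |c| and |d| first, so
   nothing gets squared.  */
static void
div_smith (double a, double b, double c, double d, double *e, double *f)
{
  if (fabs (c) >= fabs (d))
    {
      double r = d / c;
      double den = c + d * r;
      *e = (a + b * r) / den;
      *f = (b - a * r) / den;
    }
  else
    {
      double r = c / d;
      double den = d + c * r;
      *e = (a * r + b) / den;
      *f = (b * r - a) / den;
    }
}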
This patch proposes implementing Baudin's algorithm in Libgcc (in case the increase in rounding error is acceptable).

In general, when deciding on the best way of doing complex division, there are three things to consider:
- speed
- accuracy
- robustness (how often overflow/underflow happens, i.e. how many division results are way off from the exact value)

Improving the algorithms by looking at the failures of specific cases is not simple. There are 4 real numbers to be manipulated to get the final result, and some minimum number of operations that needs to be done. The quite minimal Smith's algorithm currently does 9 operations, of which 6 are multiplications or divisions that are susceptible to overflow and underflow. It is not feasible to check for both underflow and overflow in all 6 operations without significantly impacting performance. Most complex division algorithms have been created to solve some specific difficult cases; however, because of the abundance of failure opportunities, an algorithm that works well for one type of division is likely to fail for other types.

The simplest way to dodge overflow and underflow without impacting performance much is to vary the order of operations. When naively expanding the complex division x/y = (...)/(real(y)*real(y) + imag(y)*imag(y)), the squaring in the denominator is the greatest source of overflow and underflow. Smith's improvement was to introduce a scaling factor r = imag(y)/real(y) (when |imag(y)| <= |real(y)|, mirrored otherwise), such that the denominator becomes real(y) + imag(y)*r and nothing gets squared. Baudin's improvement is to check explicitly for underflow in r and change the order of operations if necessary.

One way of comparing the algorithms is to generate lots of random complex numbers and see how the different algorithms perform. That approach is closer to the "real world" situation, where the complex numbers are often random, not powers of two (which, oddly, is the assumption many authors make, including Baudin) and have exponents in less "dangerous" ranges.

Smith's algorithm was compared to two versions of Baudin. The difference is that in one version (b2div), the real and imaginary parts of the result are both divided through by the same denominator d; in the other (b1div), the real and imaginary parts are multiplied through by the reciprocal of that d, effectively replacing two divisions with one division and two multiplications (both variants are sketched after the timings below). Since division is significantly slower on AArch64, that swap brings gains in performance (though at the cost of rounding error, unfortunately).

To compare the performance, I used a test that generates 1800 random complex double pairs (which fit nicely into the 64 KB cache) and divides each pair 200 000 times, effectively doing 360 000 000 operations.

       | CPU time
------------------
smiths | 7 296 924
b1div  | 6 558 148
b2div  | 8 658 079

On AArch64, b1div is 10.13% faster than Smith's, and b2div is 18.65% slower.
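To make the b1div/b2div distinction concrete, here are double-only sketches of the two variants for the branch where |d| <= |c|. Again an illustration rather than the patch itself; variable names follow the patch, and the underflow check on r is omitted for brevity:

/* b2div: both components are divided by the same denominator.  */
static void
div_b2div (double a, double b, double c, double d, double *e, double *f)
{
  double r = d / c;
  double den = c + d * r;
  *e = (a + b * r) / den;
  *f = (b - a * r) / den;
}

/* b1div: one division to form the reciprocal, then two multiplications.
   The reciprocal t is itself rounded, which is where the extra ulp
   error comes from.  */
static void
div_b1div (double a, double b, double c, double d, double *e, double *f)
{
  double r = d / c;
  double t = 1.0 / (c + d * r);
  *e = (a + b * r) * t;
  *f = (b - a * r) * t;
}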
For the accuracy, 1 000 000 pairs of complex doubles were divided and compared to the exact results (assuming that the division of the same numbers as long doubles gives the exact value).

       |  >2ulp |  >1ulp | >0.5ulp | <0.5ulp
---------------------------------------------
smiths | 22 937 | 23 944 | 124 228 | 828 891
b1div  |  4 450 | 59 716 | 350 792 | 585 042
b2div  |  4 036 | 24 488 | 127 144 | 844 332

Same data, but showing the percentage of divisions falling into each category:

       | >2ulp | >1ulp | >0.5ulp | <0.5ulp
---------------------------------------------
smiths | 2.29% | 2.39% |  12.42% |  82.89%
b1div  | 0.45% | 5.97% |  35.08% |  58.50%
b2div  | 0.40% | 2.45% |  12.71% |  84.43%

The rightmost column (<0.5ulp) shows the number of divisions for which the ulp errors of both the real and imaginary parts were inside 0.5 ulp, meaning both were correctly rounded. >0.5ulp gives the number of divisions for which the larger of the two ulp errors was between 0.5 and 1.0 ulp, meaning the result was rounded to one of the two nearest doubles, but not THE nearest one. Anything less accurate is not great...

FMA doesn't cause any problems on AArch64.

Bootstrapped and tested on aarch64-none-linux-gnu -- no problems with bootstrap, but some regressions in gcc/testsuite/gcc.dg/torture/builtin-math-7.c. The regressions are a result of the loss in accuracy. For the division (3. + 4.i) / 5i = 0.8 - 0.6i, Smith's gives:

  0.800000000000000044408920985006 - 0.599999999999999977795539507497 i
  ulp error in real part:      0.4
  ulp error in imaginary part: 0.2

b1div gives:

  0.800000000000000044408920985006 - 0.600000000000000088817841970013 i
  ulp error in real part:      0.4
  ulp error in imaginary part: 0.8

That means the imaginary part of the Baudin result is rounded to one of the two nearest doubles, but not the one which is nearest (the two imaginary parts above are adjacent doubles, 1 ulp apart, straddling the exact value -0.6). A similar thing happens to another division in that test. A standalone reproducer follows the patch below.

Some questions:
- Is the 10.13% improvement in performance with b1div worth the loss in accuracy?
- In the case of b1div, most of the results that ceased to be correctly rounded as a result of replacing the two divisions with multiplications ended up within 1.0 ulp. Maybe that is acceptable?
- The improved algorithm reduces the number of bad fails from 2.3% to 0.5%. Is that a significant/useful improvement?

Best wishes,
Elen

libgcc/ChangeLog:

2019-07-31  Elen Kalda  <elen.ka...@arm.com>

	* libgcc2.c (CONCAT3(__div,MODE,3)): Implement Baudin's algorithm.
diff --git a/libgcc/libgcc2.c b/libgcc/libgcc2.c
index 763b5fabd512d4efd3ca007d9a96d8778a5de422..9bc0168281dac231aeb4d7b9852a24e61128b77a 100644
--- a/libgcc/libgcc2.c
+++ b/libgcc/libgcc2.c
@@ -2033,58 +2033,75 @@ CONCAT3(__mul,MODE,3) (MTYPE a, MTYPE b, MTYPE c, MTYPE d)
 
 #if defined(L_divhc3) || defined(L_divsc3) || defined(L_divdc3) \
     || defined(L_divxc3) || defined(L_divtc3)
+
+/* Complex division x/y, where
+   a = real (x)
+   b = imag (x)
+   c = real (y)
+   d = imag (y)
+*/
 CTYPE
 CONCAT3(__div,MODE,3) (MTYPE a, MTYPE b, MTYPE c, MTYPE d)
 {
-  MTYPE denom, ratio, x, y;
+  MTYPE e, f, r, t;
   CTYPE res;
 
-  /* ??? We can get better behavior from logarithmic scaling instead of
-     the division.  But that would mean starting to link libgcc against
-     libm.  We could implement something akin to ldexp/frexp as gcc builtins
-     fairly easily...  */
-  if (FABS (c) < FABS (d))
+  inline void improved_internal (MTYPE a, MTYPE b, MTYPE c, MTYPE d)
+  {
+    r = d/c;
+    t = 1.0 / (c + (d * r));
+
+    if (r != 0)
     {
-      ratio = c / d;
-      denom = (c * ratio) + d;
-      x = ((a * ratio) + b) / denom;
-      y = ((b * ratio) - a) / denom;
+      e = (a + (b * r)) * t;
+      f = (b - (a * r)) * t;
     }
-  else
+    /* Changing the order of operations avoids the underflow of r impacting
+       the result.  */
+    else
     {
-      ratio = d / c;
-      denom = (d * ratio) + c;
-      x = ((b * ratio) + a) / denom;
-      y = (b - (a * ratio)) / denom;
+      e = (a + (d * (b / c))) * t;
+      f = (b - (d * (a / c))) * t;
     }
+  }
+
+  if (FABS (d) <= FABS (c))
+  {
+    improved_internal (a, b, c, d);
+  }
+  else
+  {
+    improved_internal (b, a, d, c);
+    f = -f;
+  }
 
   /* Recover infinities and zeros that computed as NaN+iNaN; the only cases
      are nonzero/zero, infinite/finite, and finite/infinite.  */
-  if (isnan (x) && isnan (y))
+  if (isnan (e) && isnan (f))
+  {
+    if (c == 0.0 && d == 0.0 && (!isnan (a) || !isnan (b)))
     {
-      if (c == 0.0 && d == 0.0 && (!isnan (a) || !isnan (b)))
-	{
-	  x = COPYSIGN (INFINITY, c) * a;
-	  y = COPYSIGN (INFINITY, c) * b;
-	}
-      else if ((isinf (a) || isinf (b)) && isfinite (c) && isfinite (d))
-	{
-	  a = COPYSIGN (isinf (a) ? 1 : 0, a);
-	  b = COPYSIGN (isinf (b) ? 1 : 0, b);
-	  x = INFINITY * (a * c + b * d);
-	  y = INFINITY * (b * c - a * d);
-	}
-      else if ((isinf (c) || isinf (d)) && isfinite (a) && isfinite (b))
-	{
-	  c = COPYSIGN (isinf (c) ? 1 : 0, c);
-	  d = COPYSIGN (isinf (d) ? 1 : 0, d);
-	  x = 0.0 * (a * c + b * d);
-	  y = 0.0 * (b * c - a * d);
-	}
+      e = COPYSIGN (INFINITY, c) * a;
+      f = COPYSIGN (INFINITY, c) * b;
     }
+    else if ((isinf (a) || isinf (b)) && isfinite (c) && isfinite (d))
+    {
+      a = COPYSIGN (isinf (a) ? 1 : 0, a);
+      b = COPYSIGN (isinf (b) ? 1 : 0, b);
+      e = INFINITY * (a * c + b * d);
+      f = INFINITY * (b * c - a * d);
+    }
+    else if ((isinf (c) || isinf (d)) && isfinite (a) && isfinite (b))
+    {
+      c = COPYSIGN (isinf (c) ? 1 : 0, c);
+      d = COPYSIGN (isinf (d) ? 1 : 0, d);
+      e = 0.0 * (a * c + b * d);
+      f = 0.0 * (b * c - a * d);
+    }
+  }
 
-  __real__ res = x;
-  __imag__ res = y;
+  __real__ res = e;
+  __imag__ res = f;
   return res;
 }
 #endif /* complex divide */
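For completeness, the regressing division from builtin-math-7.c can be reproduced standalone. This sketch assumes the program is linked against the patched libgcc; by default GCC lowers the complex double division to a call to __divdc3:

#include <complex.h>
#include <stdio.h>

int main (void)
{
  /* volatile keeps GCC from folding the division at compile time, so
     the runtime __divdc3 from libgcc is actually exercised.  */
  volatile _Complex double x = 3.0 + 4.0 * _Complex_I;
  volatile _Complex double y = 5.0 * _Complex_I;
  _Complex double z = x / y;

  /* The exact result is 0.8 - 0.6i; with b1div the imaginary part comes
     out as the neighbouring double, 0.8 ulp from the exact value.  */
  printf ("%.30f %.30f\n", creal (z), cimag (z));
  return 0;
}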
/* Accuracy test: divide NUMCNT pairs of random complex doubles with
   libgcc's double-precision division and measure the ulp error against
   the same division done in long double, taken as the exact result.  */

#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <complex.h>

#define FABS(x) ((x) >= 0 ? (x) : -(x))
#define NUMCNT 1000000

__complex double ans_i, x, y;
double r_ulp_err, i_ulp_err;
complex long double lx, ly, ans_l;

int ulp_count_05 = 0;
int ulp_count_1 = 0;
int ulp_count_2 = 0;
int ulp_count_correct = 0;

// mantissa range
double m_max = 2.0;
double m_min = -2.0;
// exponent range
int e_min = -1022;
int e_max = 1023;

double rand_double (void)
{
  double range = m_max - m_min;
  double div = (double)RAND_MAX / range;
  return m_min + (rand() / div);
}

int rand_int (void)
{
  int range = e_max - e_min;
  double div = RAND_MAX / range;
  return e_min + (int)(rand() / div);
}

// random mantissa scaled by a random power of two
double create_double (void)
{
  return rand_double() * pow(2, rand_int());
}

static inline uint64_t asuint64 (double f)
{
  union {double f; uint64_t i;} u = {f};
  return u.i;
}

static inline double asdouble (uint64_t i)
{
  union {uint64_t i; double f;} u = {i};
  return u.f;
}

static inline int ulpscale (double x)
{
  // extract the exponent bits
  int e = asuint64(x)>>52 & 0x7ff;
  if (!e)
    e++;
  // 0x3ff is for the bias
  return e - 0x3ff - 52;
}

static double ulperror (double got, long double want_ldbl)
{
  double want = want_ldbl;
  if (signbit(got) != signbit(want))
    return isnan(got) && isnan(want) ? 0 : INFINITY;
  if (!isfinite(want) || !isfinite(got))
    {
      if (isnan(got) != isnan(want))
	return INFINITY;
      if (isnan(want))
	return 0;
      if (isinf(got) && isinf(want))
	return 0;
      if (isinf(got))
	{
	  got = copysign(0x1p1023, got);
	  want *= 0.5;
	  want_ldbl *= 0.5;
	}
      if (isinf(want))
	{
	  want_ldbl = want = copysign(0x1p1023, want);
	  got *= 0.5;
	}
    }
  return scalbn(got - want_ldbl, -ulpscale(want));
}

int main (void)
{
  int i;
  for (i = 0; i < NUMCNT; i++)
    {
      x = create_double() + create_double() * _Complex_I;
      y = create_double() + create_double() * _Complex_I;

      // division done by libgcc
      ans_i = x / y;

      // same division, but with long doubles
      lx = x;
      ly = y;
      ans_l = lx / ly;

      r_ulp_err = ulperror(creal(ans_i), creall(ans_l));
      i_ulp_err = ulperror(cimag(ans_i), cimagl(ans_l));

      if (FABS(r_ulp_err) >= 2.0 || FABS(i_ulp_err) >= 2.0)
	ulp_count_2++;
      else if (FABS(r_ulp_err) >= 1.0 || FABS(i_ulp_err) >= 1.0)
	ulp_count_1++;
      else if (FABS(r_ulp_err) >= 0.5 || FABS(i_ulp_err) >= 0.5)
	ulp_count_05++;
      else if (FABS(r_ulp_err) < 0.5 && FABS(i_ulp_err) < 0.5)
	ulp_count_correct++;
    }

  printf("correctly rounded:       %2d\n", ulp_count_correct);
  printf("between 0.5 and 1.0 ulp: %2d\n", ulp_count_05);
  printf("between 1.0 and 2.0 ulp: %2d\n", ulp_count_1);
  printf("more than 2.0 ulp:       %2d\n", ulp_count_2);
}
/* Performance test: time 200 000 passes of complex division over
   NUMCNT random complex double pairs, using clock().  */

#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <complex.h>
#include <time.h>

#define NUMCNT 1800

__complex double ans, x, y, res;
__complex double x_arr[NUMCNT];
__complex double y_arr[NUMCNT];

double m_max = 2.0;
double m_min = -2.0;
int e_min = -1022;
int e_max = 1023;

double rand_double (void)
{
  double range = m_max - m_min;
  double div = (double)RAND_MAX / range;
  return m_min + (rand() / div);
}

int rand_int (void)
{
  int range = e_max - e_min;
  double div = RAND_MAX / range;
  return e_min + (int)(rand() / div);
}

double create_double (void)
{
  return rand_double() * pow(2, rand_int());
}

void populate_arrays (void)
{
  int i;
  for (i = 0; i < NUMCNT; i++)
    {
      x_arr[i] = create_double() + create_double() * _Complex_I;
      y_arr[i] = create_double() + create_double() * _Complex_I;
    }
}

int main (void)
{
  int i;
  int j;
  clock_t t0; // processor time

  populate_arrays();

  t0 = clock();
  for (j = 0; j < 200000; j++)
    {
      for (i = 0; i < NUMCNT; i++)
	{
	  ans = x_arr[i] / y_arr[i];
	  // consume every result
	  res += ans;
	}
    }
  t0 = clock() - t0;

  printf("res: %-39.28a %-39.28a i \n", creal(ans), cimag(ans));
  printf("t0: %4ld \n", t0);
}