Hi all,

Libgcc currently implements Smith's algorithm (https://dl.acm.org/citation.cfm?id=368661) for complex division. It is designed to prevent overflows and underflows, and it does that well in around 97.7% of cases. However, there has been some interest in improving on that result. As an example, Baudin improves the robustness to ~99.5% by introducing an extra underflow check, and also gains performance by replacing two divisions with two multiplications and one division (https://arxiv.org/abs/1210.4539).
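For reference, here is a double-only sketch of the textbook expansion next to Smith's rescaling as currently used by libgcc. This is my illustration, not the libgcc code itself (the real code is type-generic and appears in the patch below), but the Smith variant mirrors the code being replaced there:

#include <math.h>

/* Textbook expansion of (a + bi) / (c + di): the c*c + d*d term
   overflows or underflows long before the true quotient does.  */
static void
div_naive (double a, double b, double c, double d, double *e, double *f)
{
  double den = c * c + d * d;
  *e = (a * c + b * d) / den;
  *f = (b * c - a * d) / den;
}

/* Smith's algorithm: scale by the larger of |c| and |d| first, so
   nothing gets squared.  */
static void
div_smith (double a, double b, double c, double d, double *e, double *f)
{
  if (fabs (c) >= fabs (d))
    {
      double r = d / c;
      double den = c + d * r;
      *e = (a + b * r) / den;
      *f = (b - a * r) / den;
    }
  else
    {
      double r = c / d;
      double den = d + c * r;
      *e = (a * r + b) / den;
      *f = (b * r - a) / den;
    }
}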
This patch proposes implementing Baudin's algorithm in Libgcc (in case the increase in rounding error is acceptable).

In general, when deciding on the best way of doing complex division, there are three things to consider:
- speed
- accuracy
- robustness (how often overflow/underflow happens, i.e. how many division results are way off from the exact value)

Improving the algorithms by looking at the failures of specific cases is not simple. There are 4 real numbers to be manipulated to get the final result, and some minimum number of operations that needs to be done. The quite minimal Smith's algorithm currently does 9 operations, of which 6 are multiplications or divisions that are susceptible to overflow and underflow. It is not feasible to check for both underflow and overflow in all 6 operations without significantly impacting performance. Most complex division algorithms have been created to solve some specific difficult cases; however, because of the abundance of failure opportunities, an algorithm that works well for one type of division is likely to fail for other types.

The simplest way to dodge overflow and underflow without impacting performance much is to vary the order of operations. When naively expanding the complex division x/y = (...)/(real(y)*real(y) + imag(y)*imag(y)), the squaring in the denominator is the greatest source of overflow and underflow. Smith's improvement was to introduce a scaling factor r = imag(y)/real(y) (when |imag(y)| <= |real(y)|, mirrored otherwise), such that the denominator becomes real(y) + imag(y)*r and nothing gets squared. Baudin's improvement is to check explicitly for underflow in r and change the order of operations if necessary.

One way of comparing the algorithms is to generate lots of random complex numbers and see how the different algorithms perform. That approach is closer to the "real world" situation, where the complex numbers are often random, not powers of two (which, oddly, is the assumption many authors make, including Baudin) and have exponents in less "dangerous" ranges.

Smith's algorithm was compared to two versions of Baudin. The difference is that in one version (b2div), the real and imaginary parts of the result are both divided through by the same denominator d; in the other (b1div), the real and imaginary parts are multiplied through by the reciprocal of that d, effectively replacing two divisions with one division and two multiplications (both variants are sketched after the timings below). Since division is significantly slower on AArch64, that swap brings gains in performance (though at the cost of rounding error, unfortunately).

To compare the performance, I used a test that generates 1800 random complex double pairs (which fit nicely into the 64 KB cache) and divides each pair 200 000 times, effectively doing 360 000 000 operations.

       | CPU time
------------------
smiths | 7 296 924
b1div  | 6 558 148
b2div  | 8 658 079

On AArch64, b1div is 10.13% faster than Smith's, and b2div is 18.65% slower.
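To make the b1div/b2div distinction concrete, here are double-only sketches of the two variants for the branch where |d| <= |c|. Again an illustration rather than the patch itself; variable names follow the patch, and the underflow check on r is omitted for brevity:

/* b2div: both components are divided by the same denominator.  */
static void
div_b2div (double a, double b, double c, double d, double *e, double *f)
{
  double r = d / c;
  double den = c + d * r;
  *e = (a + b * r) / den;
  *f = (b - a * r) / den;
}

/* b1div: one division to form the reciprocal, then two multiplications.
   The reciprocal t is itself rounded, which is where the extra ulp
   error comes from.  */
static void
div_b1div (double a, double b, double c, double d, double *e, double *f)
{
  double r = d / c;
  double t = 1.0 / (c + d * r);
  *e = (a + b * r) * t;
  *f = (b - a * r) * t;
}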
For the accuracy, 1 000 000 pairs of complex doubles were divided and compared to the exact results (assuming that the division of the same numbers as long doubles gives the exact value).

       |  >2ulp |  >1ulp | >0.5ulp | <0.5ulp
---------------------------------------------
smiths | 22 937 | 23 944 | 124 228 | 828 891
b1div  |  4 450 | 59 716 | 350 792 | 585 042
b2div  |  4 036 | 24 488 | 127 144 | 844 332

Same data, but showing the percentage of divisions falling into each category:

       | >2ulp | >1ulp | >0.5ulp | <0.5ulp
---------------------------------------------
smiths | 2.29% | 2.39% |  12.42% |  82.89%
b1div  | 0.45% | 5.97% |  35.08% |  58.50%
b2div  | 0.40% | 2.45% |  12.71% |  84.43%

The rightmost column (<0.5ulp) shows the number of divisions for which the ulp errors of both the real and imaginary parts were inside 0.5 ulp, meaning both were correctly rounded. >0.5ulp gives the number of divisions for which the larger of the two ulp errors was between 0.5 and 1.0 ulp, meaning the result was rounded to one of the two nearest doubles, but not THE nearest one. Anything less accurate is not great...

FMA doesn't cause any problems on AArch64.

Bootstrapped and tested on aarch64-none-linux-gnu -- no problems with bootstrap, but some regressions in gcc/testsuite/gcc.dg/torture/builtin-math-7.c. The regressions are a result of the loss in accuracy. For the division (3. + 4.i) / 5i = 0.8 - 0.6i, Smith's gives:

  0.800000000000000044408920985006 - 0.599999999999999977795539507497 i
  ulp error in real part:      0.4
  ulp error in imaginary part: 0.2

b1div gives:

  0.800000000000000044408920985006 - 0.600000000000000088817841970013 i
  ulp error in real part:      0.4
  ulp error in imaginary part: 0.8

That means the imaginary part of the Baudin result is rounded to one of the two nearest doubles, but not the one which is nearest (the two imaginary parts above are adjacent doubles, 1 ulp apart, straddling the exact value -0.6). A similar thing happens to another division in that test. A standalone reproducer follows the patch below.

Some questions:
- Is the 10.13% improvement in performance with b1div worth the loss in accuracy?
- In the case of b1div, most of the results that ceased to be correctly rounded as a result of replacing the two divisions with multiplications ended up within 1.0 ulp. Maybe that is acceptable?
- The improved algorithm reduces the number of bad fails from 2.3% to 0.5%. Is that a significant/useful improvement?

Best wishes,
Elen

libgcc/ChangeLog:

2019-07-31  Elen Kalda  <elen.ka...@arm.com>

	* libgcc2.c (CONCAT3(__div,MODE,3)): Implement Baudin's algorithm.
diff --git a/libgcc/libgcc2.c b/libgcc/libgcc2.c
index 763b5fabd512d4efd3ca007d9a96d8778a5de422..9bc0168281dac231aeb4d7b9852a24e61128b77a 100644
--- a/libgcc/libgcc2.c
+++ b/libgcc/libgcc2.c
@@ -2033,58 +2033,75 @@ CONCAT3(__mul,MODE,3) (MTYPE a, MTYPE b, MTYPE c, MTYPE d)
 
 #if defined(L_divhc3) || defined(L_divsc3) || defined(L_divdc3) \
     || defined(L_divxc3) || defined(L_divtc3)
+
+/* Complex division x/y, where
+   a = real (x)
+   b = imag (x)
+   c = real (y)
+   d = imag (y)
+*/
 CTYPE
 CONCAT3(__div,MODE,3) (MTYPE a, MTYPE b, MTYPE c, MTYPE d)
 {
-  MTYPE denom, ratio, x, y;
+  MTYPE e, f, r, t;
   CTYPE res;
 
-  /* ??? We can get better behavior from logarithmic scaling instead of
-     the division.  But that would mean starting to link libgcc against
-     libm.  We could implement something akin to ldexp/frexp as gcc builtins
-     fairly easily...  */
-  if (FABS (c) < FABS (d))
+  inline void improved_internal (MTYPE a, MTYPE b, MTYPE c, MTYPE d)
+  {
+    r = d/c;
+    t = 1.0 / (c + (d * r));
+
+    if (r != 0)
     {
-      ratio = c / d;
-      denom = (c * ratio) + d;
-      x = ((a * ratio) + b) / denom;
-      y = ((b * ratio) - a) / denom;
+      e = (a + (b * r)) * t;
+      f = (b - (a * r)) * t;
     }
-  else
+    /* Changing the order of operations avoids the underflow of r impacting
+       the result.  */
+    else
     {
-      ratio = d / c;
-      denom = (d * ratio) + c;
-      x = ((b * ratio) + a) / denom;
-      y = (b - (a * ratio)) / denom;
+      e = (a + (d * (b / c))) * t;
+      f = (b - (d * (a / c))) * t;
     }
+  }
+
+  if (FABS (d) <= FABS (c))
+  {
+    improved_internal (a, b, c, d);
+  }
+  else
+  {
+    improved_internal (b, a, d, c);
+    f = -f;
+  }
 
   /* Recover infinities and zeros that computed as NaN+iNaN; the only cases
      are nonzero/zero, infinite/finite, and finite/infinite.  */
-  if (isnan (x) && isnan (y))
+  if (isnan (e) && isnan (f))
+  {
+    if (c == 0.0 && d == 0.0 && (!isnan (a) || !isnan (b)))
     {
-      if (c == 0.0 && d == 0.0 && (!isnan (a) || !isnan (b)))
-	{
-	  x = COPYSIGN (INFINITY, c) * a;
-	  y = COPYSIGN (INFINITY, c) * b;
-	}
-      else if ((isinf (a) || isinf (b)) && isfinite (c) && isfinite (d))
-	{
-	  a = COPYSIGN (isinf (a) ? 1 : 0, a);
-	  b = COPYSIGN (isinf (b) ? 1 : 0, b);
-	  x = INFINITY * (a * c + b * d);
-	  y = INFINITY * (b * c - a * d);
-	}
-      else if ((isinf (c) || isinf (d)) && isfinite (a) && isfinite (b))
-	{
-	  c = COPYSIGN (isinf (c) ? 1 : 0, c);
-	  d = COPYSIGN (isinf (d) ? 1 : 0, d);
-	  x = 0.0 * (a * c + b * d);
-	  y = 0.0 * (b * c - a * d);
-	}
+      e = COPYSIGN (INFINITY, c) * a;
+      f = COPYSIGN (INFINITY, c) * b;
     }
+    else if ((isinf (a) || isinf (b)) && isfinite (c) && isfinite (d))
+    {
+      a = COPYSIGN (isinf (a) ? 1 : 0, a);
+      b = COPYSIGN (isinf (b) ? 1 : 0, b);
+      e = INFINITY * (a * c + b * d);
+      f = INFINITY * (b * c - a * d);
+    }
+    else if ((isinf (c) || isinf (d)) && isfinite (a) && isfinite (b))
+    {
+      c = COPYSIGN (isinf (c) ? 1 : 0, c);
+      d = COPYSIGN (isinf (d) ? 1 : 0, d);
+      e = 0.0 * (a * c + b * d);
+      f = 0.0 * (b * c - a * d);
+    }
+  }
 
-  __real__ res = x;
-  __imag__ res = y;
+  __real__ res = e;
+  __imag__ res = f;
   return res;
 }
 #endif /* complex divide */
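For completeness, the regressing division from builtin-math-7.c can be reproduced standalone. This sketch assumes the program is linked against the patched libgcc; by default GCC lowers the complex double division to a call to __divdc3:

#include <complex.h>
#include <stdio.h>

int main (void)
{
  /* volatile keeps GCC from folding the division at compile time, so
     the runtime __divdc3 from libgcc is actually exercised.  */
  volatile _Complex double x = 3.0 + 4.0 * _Complex_I;
  volatile _Complex double y = 5.0 * _Complex_I;
  _Complex double z = x / y;

  /* The exact result is 0.8 - 0.6i; with b1div the imaginary part comes
     out as the neighbouring double, 0.8 ulp from the exact value.  */
  printf ("%.30f %.30f\n", creal (z), cimag (z));
  return 0;
}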
/* Accuracy test: divide NUMCNT pairs of random complex doubles with
   libgcc's double-precision division and measure the ulp error against
   the same division done in long double, taken as the exact result.  */

#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <complex.h>

#define FABS(x) ((x) >= 0 ? (x) : -(x))
#define NUMCNT 1000000

__complex double ans_i, x, y;
double r_ulp_err, i_ulp_err;
complex long double lx, ly, ans_l;

int ulp_count_05 = 0;
int ulp_count_1 = 0;
int ulp_count_2 = 0;
int ulp_count_correct = 0;

// mantissa range
double m_max = 2.0;
double m_min = -2.0;
// exponent range
int e_min = -1022;
int e_max = 1023;

double rand_double (void)
{
  double range = m_max - m_min;
  double div = (double)RAND_MAX / range;
  return m_min + (rand() / div);
}

int rand_int (void)
{
  int range = e_max - e_min;
  double div = RAND_MAX / range;
  return e_min + (int)(rand() / div);
}

// random mantissa scaled by a random power of two
double create_double (void)
{
  return rand_double() * pow(2, rand_int());
}

static inline uint64_t asuint64 (double f)
{
  union {double f; uint64_t i;} u = {f};
  return u.i;
}

static inline double asdouble (uint64_t i)
{
  union {uint64_t i; double f;} u = {i};
  return u.f;
}

static inline int ulpscale (double x)
{
  // extract the exponent bits
  int e = asuint64(x)>>52 & 0x7ff;
  if (!e)
    e++;
  // 0x3ff is for the bias
  return e - 0x3ff - 52;
}

static double ulperror (double got, long double want_ldbl)
{
  double want = want_ldbl;
  if (signbit(got) != signbit(want))
    return isnan(got) && isnan(want) ? 0 : INFINITY;
  if (!isfinite(want) || !isfinite(got))
    {
      if (isnan(got) != isnan(want))
	return INFINITY;
      if (isnan(want))
	return 0;
      if (isinf(got) && isinf(want))
	return 0;
      if (isinf(got))
	{
	  got = copysign(0x1p1023, got);
	  want *= 0.5;
	  want_ldbl *= 0.5;
	}
      if (isinf(want))
	{
	  want_ldbl = want = copysign(0x1p1023, want);
	  got *= 0.5;
	}
    }
  return scalbn(got - want_ldbl, -ulpscale(want));
}

int main (void)
{
  int i;
  for (i = 0; i < NUMCNT; i++)
    {
      x = create_double() + create_double() * _Complex_I;
      y = create_double() + create_double() * _Complex_I;

      // division done by libgcc
      ans_i = x / y;

      // same division, but with long doubles
      lx = x;
      ly = y;
      ans_l = lx / ly;

      r_ulp_err = ulperror(creal(ans_i), creall(ans_l));
      i_ulp_err = ulperror(cimag(ans_i), cimagl(ans_l));

      if (FABS(r_ulp_err) >= 2.0 || FABS(i_ulp_err) >= 2.0)
	ulp_count_2++;
      else if (FABS(r_ulp_err) >= 1.0 || FABS(i_ulp_err) >= 1.0)
	ulp_count_1++;
      else if (FABS(r_ulp_err) >= 0.5 || FABS(i_ulp_err) >= 0.5)
	ulp_count_05++;
      else if (FABS(r_ulp_err) < 0.5 && FABS(i_ulp_err) < 0.5)
	ulp_count_correct++;
    }

  printf("correctly rounded:       %2d\n", ulp_count_correct);
  printf("between 0.5 and 1.0 ulp: %2d\n", ulp_count_05);
  printf("between 1.0 and 2.0 ulp: %2d\n", ulp_count_1);
  printf("more than 2.0 ulp:       %2d\n", ulp_count_2);
}
/* Performance test: time 200 000 passes of complex division over
   NUMCNT random complex double pairs, using clock().  */

#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <complex.h>
#include <time.h>

#define NUMCNT 1800

__complex double ans, x, y, res;
__complex double x_arr[NUMCNT];
__complex double y_arr[NUMCNT];

double m_max = 2.0;
double m_min = -2.0;
int e_min = -1022;
int e_max = 1023;

double rand_double (void)
{
  double range = m_max - m_min;
  double div = (double)RAND_MAX / range;
  return m_min + (rand() / div);
}

int rand_int (void)
{
  int range = e_max - e_min;
  double div = RAND_MAX / range;
  return e_min + (int)(rand() / div);
}

double create_double (void)
{
  return rand_double() * pow(2, rand_int());
}

void populate_arrays (void)
{
  int i;
  for (i = 0; i < NUMCNT; i++)
    {
      x_arr[i] = create_double() + create_double() * _Complex_I;
      y_arr[i] = create_double() + create_double() * _Complex_I;
    }
}

int main (void)
{
  int i;
  int j;
  clock_t t0; // processor time

  populate_arrays();

  t0 = clock();
  for (j = 0; j < 200000; j++)
    {
      for (i = 0; i < NUMCNT; i++)
	{
	  ans = x_arr[i] / y_arr[i];
	  // consume every result
	  res += ans;
	}
    }
  t0 = clock() - t0;

  printf("res: %-39.28a %-39.28a i \n", creal(ans), cimag(ans));
  printf("t0: %4ld \n", t0);
}