Re: [PATCH V2] Practical Improvement to Double Precision Complex Divide
On Thu, 11 Jun 2020, Patrick McGehearty wrote:

> I will study real.c carefully along with the C99 std
> to determine if I can find useful values for RMIN2 and RMINSCAL
> for each format which are within range for all instances
> of that format. A quick skim of real.c shows we have ieee half precision
> and two arm half precision formats, for example.

The difficulty will be cases such as TFmode, which can be IEEE binary128 or IBM long double; those formats have very different exponent ranges.

-- 
Joseph S. Myers
jos...@codesourcery.com
Re: [PATCH V2] Practical Improvement to Double Precision Complex Divide
I will study real.c carefully along with the C99 std to determine if I can find useful values for RMIN2 and RMINSCAL for each format which are within range for all instances of that format. A quick skim of real.c shows we have IEEE half precision and two Arm half precision formats, for example.

I appreciate the pointer to real.c. I was sure that information about the various platform formats was somewhere in the src tree, but had not found it on my own.

- patrick

On 6/10/2020 5:39 PM, Joseph Myers wrote:
> On Wed, 10 Jun 2020, Patrick McGehearty wrote:
>
>> #ifdef L_divhc3
>> #define RBIG (correct value for half precision)
>> #define RMIN (correct value for half precision)
>> #define RMIN2 ... (correct value for half precision)
>> #define RMINSCAL ... (correct value for half precision)
>> #endif
>> #ifdef L_divsc3
>> #define RBIG FLT_MAX
>> #define RMIN FLT_MIN
>> #define RMIN2 ...
>> #define RMINSCAL ...
>> #endif
>> ...
>> would get the job done once I determine (and test) the correct
>> values for all formats.
>
> But some of those machine modes can represent several different
> floating-point formats, depending on the target. There are many more
> floating-point formats supported in real.c than there are names for
> floating-point machine modes. Not all the differences are relevant here
> (those differences relating to default NaNs, for example), but some are.
Re: [PATCH V2] Practical Improvement to Double Precision Complex Divide
On Wed, 10 Jun 2020, Patrick McGehearty wrote:

> #ifdef L_divhc3
> #define RBIG (correct value for half precision)
> #define RMIN (correct value for half precision)
> #define RMIN2 ... (correct value for half precision)
> #define RMINSCAL ... (correct value for half precision)
> #endif
> #ifdef L_divsc3
> #define RBIG FLT_MAX
> #define RMIN FLT_MIN
> #define RMIN2 ...
> #define RMINSCAL ...
> #endif
> ...
> would get the job done once I determine (and test) the correct
> values for all formats.

But some of those machine modes can represent several different floating-point formats, depending on the target. There are many more floating-point formats supported in real.c than there are names for floating-point machine modes. Not all the differences are relevant here (those differences relating to default NaNs, for example), but some are.

-- 
Joseph S. Myers
jos...@codesourcery.com
Re: [PATCH V2] Practical Improvement to Double Precision Complex Divide
Joseph,

Thank you again for your prompt and insightful response. It's now clear that I was engaged in wishful thinking that I only needed to handle double precision in my proposed change.

I understand now that the text above the routine:
- - - - -
#if defined(L_divhc3) || defined(L_divsc3) || defined(L_divdc3) \
    || defined(L_divxc3) || defined(L_divtc3)
- - - - -
implies the same code is to be built for
  L_divhc3  half precision
  L_divsc3  single precision
  L_divdc3  double precision
  L_divxc3  extended precision
  L_divtc3  quad precision

I will need to expand my test suites for these other floating point formats and do my best to understand how the libgcc build mechanisms work before I'm ready to resubmit.

At first glance, assuming only one of the L_div?c3 names is defined at a time, it seems like something like:

#ifdef L_divhc3
#define RBIG (correct value for half precision)
#define RMIN (correct value for half precision)
#define RMIN2 ... (correct value for half precision)
#define RMINSCAL ... (correct value for half precision)
#endif
#ifdef L_divsc3
#define RBIG FLT_MAX
#define RMIN FLT_MIN
#define RMIN2 ...
#define RMINSCAL ...
#endif
...

would get the job done once I determine (and test) the correct values for all formats.

- Patrick McGehearty

On 6/9/2020 6:11 PM, Joseph Myers wrote:
> On Wed, 10 Jun 2020, Patrick McGehearty wrote:
>> I see your point about other floating point formats.
>> According to the C standard, DOUBLE precision must
>> have a DBL_MAX of at least 1E+37. That is (slightly)
>> greater than 0x1.0p+63.
>>
>> Would
>> #define RMIN2 (0x1.0p-53)
>> #define RMINSCAL (0x1.0p+51)
>> be acceptable?
>> Those will be in range for any architecture that supports the C standard.
>
> But this code is used for different machine modes, not just double. In
> particular, on Arm / AArch64 it may be used to build __divhc3. And those
> constants are certainly outside the range of HFmode (IEEE binary16).
I think you need to work out appropriate logic for what values are correct for a given machine mode, in terms of the parameters of that mode (maximum and minimum exponents etc.). Then, if you can't represent the logic for determining those values (with the correct type, not hardcoding the absence of a suffix which would mean double) directly in C, see how c-cppbuiltin.c predefines extra macros if -fbuilding-libgcc (macros relating to built-in function suffixes and the number of bits in the mantissa of a floating-point mode are included there - it would certainly make sense to add more such macros if necessary).
Re: [PATCH V2] Practical Improvement to Double Precision Complex Divide
A follow-up note relating to the use of fused multiply-add in complex divide:

While reviewing bugs relating to complex divide in libgcc, I rediscovered bug 59714 - complex division is surprising on targets with FMA. The specific concern was that when you divide (1.0 + 3.0i) by itself with fused multiply-add active, you did not get (1.0 + 0.0i) as the answer; instead, the imaginary part came out an epsilon away from zero. The question raised in that bug report was whether it was appropriate to use fused multiply-add at all, i.e. whether, on the whole, it contributed to greater or lesser accuracy over a range of cases. No follow-up data had been posted.

My data sets (10M randomly generated values over the full exponent and mantissa range) showed the following with/without FMA:

Moderate exponent data set (exponents between -512 and 511):

error     current    current    new        new
exceeds   no fma     fma        no fma     fma
 1 ulp    0.34492%   0.24715%   0.34492%   0.24715%
 2 ulp    0.02360%   0.01764%   0.02360%   0.01764%

Full exponent data set (all exponents generated):

error     current    current    new        new
exceeds   no fma     fma        no fma     fma
 1 ulp    1.90432%   1.87195%   0.23167%   0.19746%
 2 ulp    1.70836%   1.70646%   0.04067%   0.03856%

While the differences between using fma and not using it are not dramatic, they seem fairly clear that fma provides slightly more accurate results more often than not. I concur with the current practice of using fma if the architecture provides it.

As a side note, my new method for complex divide gets the exact result for dividing (1.0 + 3.0i) by itself whether or not fused multiply-add is used. The calculation of the imaginary part involves subtracting two nearly equal values whose difference turns out to be less than DBL_MIN, and the new algorithm's changed order of operations keeps that subtraction out of a fused multiply-add.
- Patrick McGehearty

On 6/4/2020 6:38 PM, Patrick McGehearty via Gcc-patches wrote:

The following patch to libgcc/libgcc2.c __divdc3 provides an opportunity to gain important improvements to the quality of answers for the default double precision complex divide routine when dealing with very large or very small exponents.

The current code correctly implements Smith's method (1962) [1], further modified by C99's requirements for dealing with NaN (not a number) results. When working with input values where the exponents are greater than 512 (i.e. 2.0^512) or less than -512 (i.e. 2.0^-512), results are substantially different from the answers provided by quad precision more than 1% of the time. Since the allowed exponent range for double precision numbers is -1074 to +1023, the error rate may be unacceptable for many applications. The proposed method reduces the frequency of "substantially different" answers by more than 99% at a modest cost in performance.

Differences between the current gcc method and the new method will be described first; then accuracy and performance differences will be discussed.

(Added for Version 2) In my initial research, I missed Elen Kalda's proposed patch https://gcc.gnu.org/legacy-ml/gcc-patches/2019-08/msg01629.html [3]. Thanks to Joseph Myers for providing me with the pointer. This version includes performance and accuracy comparisons between Elen's proposed patch and my latest patch version, which has been modified to eliminate two branches.
NOTATION

For all of the following, the notation is:
Input complex values:
  a+bi  (a = 64-bit real part, b = 64-bit imaginary part)
  c+di
Output complex value:
  e+fi = (a+bi)/(c+di)

For the result tables:
  current = current method (SMITH)
  b1div   = method proposed by Elen Kalda
  b2div   = alternate method considered by Elen Kalda
  new1    = new method using 1 divide and 2 multiplies
  new     = new method proposed by this patch

DESCRIPTIONS of different complex divide methods:

NAIVE COMPUTATION (-fcx-limited-range):
  e = (a*c + b*d)/(c*c + d*d)
  f = (b*c - a*d)/(c*c + d*d)

Note that c*c and d*d will overflow or underflow if either c or d is outside the range 2^-538 to 2^512. This method is available in gcc when the switch -fcx-limited-range is used; that switch is also enabled by -ffast-math. Only one who has a clear understanding of the maximum range of intermediate values generated by a computation should consider using this switch.

SMITH's METHOD (current libgcc):
  if (fabs(c) < fabs(d)) {
    r = c/d;
    denom = (c*r) + d;
    e = (a*r + b) / denom;
    f = (b*r - a) / denom;
  } else {
    r = d/c;
    denom = c + (d*r);
    e = (a + b*r) / denom;
    f = (b - a*r) / denom;
  }

[...]

NEW METHOD (proposed by this patch):
  if ( (FABS (c) > RBIG) || (FABS (d) > RBIG)
    || (FABS (a) > RBIG) || (FABS (b) > RBIG) ) {
    a = a * 0.5;
    b = b * 0.5;
    c = c * 0.5;
    d = d * 0.5;
  }
  /* minimize overflow/underflow issues when c and d are small */
  else if (FABS (d) < RMIN2) {
    a = a * RMINSCAL;
    b = b * RMINSCAL;
    c = c * RMINSCAL;
    d = d * RMINSCAL;
  }
  r = c/d;
  denom = (c*r) + d;
  if (r > RMIN) {
    e = (a*r + b) / denom;
    f = (b*r - a) / denom;
  } else {
    e = (c * (a/d) + b) / denom;
    f = (c * (b/d) - a) / denom;
  }
  [ only presenting the fabs(c) < fabs(d) case here; full code in patch. ]
Re: [PATCH V2] Practical Improvement to Double Precision Complex Divide
On Wed, 10 Jun 2020, Patrick McGehearty wrote:

> I see your point about other floating point formats.
> According to the C standard, DOUBLE precision must
> have a DBL_MAX of at least 1E+37. That is (slightly)
> greater than 0x1.0p+63.
>
> Would
> #define RMIN2 (0x1.0p-53)
> #define RMINSCAL (0x1.0p+51)
> be acceptable?
> Those will be in range for any architecture that supports the C standard.

But this code is used for different machine modes, not just double. In particular, on Arm / AArch64 it may be used to build __divhc3. And those constants are certainly outside the range of HFmode (IEEE binary16).

I think you need to work out appropriate logic for what values are correct for a given machine mode, in terms of the parameters of that mode (maximum and minimum exponents etc.). Then, if you can't represent the logic for determining those values (with the correct type, not hardcoding the absence of a suffix, which would mean double) directly in C, see how c-cppbuiltin.c predefines extra macros if -fbuilding-libgcc (macros relating to built-in function suffixes and the number of bits in the mantissa of a floating-point mode are included there - it would certainly make sense to add more such macros if necessary).

-- 
Joseph S. Myers
jos...@codesourcery.com
Re: [PATCH V2] Practical Improvement to Double Precision Complex Divide
I see your point about other floating point formats. According to the C standard, DOUBLE precision must have a DBL_MAX of at least 1E+37. That is (slightly) greater than 0x1.0p+63.

Would
  #define RMIN2 (0x1.0p-53)
  #define RMINSCAL (0x1.0p+51)
be acceptable? Those will be in range for any architecture that supports the C standard.

I estimate using those values will cause a typical complex divide to run somewhat slower for the full exponent dataset, because the test for whether fabs(d) < RMIN2 [or fabs(c) < RMIN2] will be taken more often, adding four scaling multiplies to the overall operation. On the positive side, using the new scaling factor significantly reduces the frequency of 2 ulp or larger inaccuracies for the full range dataset.

The alternative would be to remove the RMIN2 test altogether. That provides a modest reduction in overhead but increases the inaccuracy for the 8 ulp case to 0.04147%.

Errors, Full Dataset

gtr eq   current   RMIN2=-512   RMIN2=-53   no RMIN2
======   =======   ==========   =========   ========
 1 ulp   1.87%     0.19755%     0.16788%    0.22169%
 2 ulp   1.70%     0.03857%     0.01007%    0.06182%
 8 ulp   1.61%     0.02394%     0.00199%    0.04147%
16 ulp   1.51%     0.01548%     0.00087%    0.02727%
24 ulp   1.41%     0.00945%     0.00042%    0.01694%
52 ulp   1.13%     < 0.1%       < 0.1%      < 0.1%

The RMIN2=-53 option greatly reduces 8 ulp or larger errors. The "no RMIN2" option has about double the error rate of the initial proposal of RMIN2=-512, but still has an error rate at the 8 ulp level 40 times lower than the current code.

I welcome any thoughts from the community about the relative importance of accuracy vs. performance tradeoffs at these levels.
- Patrick McGehearty

On 6/4/2020 6:45 PM, Joseph Myers wrote:
> On Fri, 5 Jun 2020, Patrick McGehearty wrote:
>> diff --git a/libgcc/libgcc2.c b/libgcc/libgcc2.c
>> index e0a9fd7..2a1d3dc 100644
>> --- a/libgcc/libgcc2.c
>> +++ b/libgcc/libgcc2.c
>> @@ -2036,26 +2036,77 @@ CONCAT3(__mul,MODE,3) (MTYPE a, MTYPE b, MTYPE c, MTYPE d)
>>  CTYPE
>>  CONCAT3(__div,MODE,3) (MTYPE a, MTYPE b, MTYPE c, MTYPE d)
>>  {
>> +#define RBIG ((DBL_MAX)/2.0)
>> +#define RMIN (DBL_MIN)
>> +#define RMIN2 (0x1.0p-512)
>> +#define RMINSCAL (0x1.0p+510)
>
> This code is used for many different machine modes and floating-point
> formats (the type and format corresponding to a particular machine mode
> may depend on the target for which GCC is configured). You can't hardcode
> particular values specific to DFmode (and to DFmode being IEEE binary64)
> here.
Re: [PATCH V2] Practical Improvement to Double Precision Complex Divide
On Fri, 5 Jun 2020, Patrick McGehearty wrote:

> diff --git a/libgcc/libgcc2.c b/libgcc/libgcc2.c
> index e0a9fd7..2a1d3dc 100644
> --- a/libgcc/libgcc2.c
> +++ b/libgcc/libgcc2.c
> @@ -2036,26 +2036,77 @@ CONCAT3(__mul,MODE,3) (MTYPE a, MTYPE b, MTYPE c, MTYPE d)
>  CTYPE
>  CONCAT3(__div,MODE,3) (MTYPE a, MTYPE b, MTYPE c, MTYPE d)
>  {
> +#define RBIG ((DBL_MAX)/2.0)
> +#define RMIN (DBL_MIN)
> +#define RMIN2 (0x1.0p-512)
> +#define RMINSCAL (0x1.0p+510)

This code is used for many different machine modes and floating-point formats (the type and format corresponding to a particular machine mode may depend on the target for which GCC is configured). You can't hardcode particular values specific to DFmode (and to DFmode being IEEE binary64) here.

-- 
Joseph S. Myers
jos...@codesourcery.com