Re: [PATCH V2] Practical Improvement to Double Precision Complex Divide

2020-06-10 Thread Joseph Myers
On Thu, 11 Jun 2020, Patrick McGehearty wrote:

> I will study real.c carefully along with the C99 std
> to determine if I can find useful values for RMIN2 and RMINSCAL
> for each format which are within range for all instances
> of that format. A quick skim of real.c shows we have ieee half precision
> and two arm half precision formats, for example.

The difficulty will be cases such as TFmode which can be IEEE binary128 or 
IBM long double, which have very different exponent ranges.
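
One way to tell those two formats apart inside libgcc2.c is by mantissa
width.  A sketch only, with illustrative values (the mantissa-digits macro
is of the kind -fbuilding-libgcc predefines; the value macro in the second
branch is hypothetical):

   /* Sketch only: IBM long double reports 106 mantissa bits and IEEE
      binary128 reports 113, so a __LIBGCC_TF_MANT_DIG__ style macro can
      select format-appropriate scaling constants.  All values below are
      illustrative, not tested.  */
   #ifdef L_divtc3
   # if __LIBGCC_TF_MANT_DIG__ == 106
   /* IBM long double (double-double): exponent range is double's,
      so double-range constants are safe.  */
   #  define RMIN2    (0x1.0p-512)
   #  define RMINSCAL (0x1.0p+510)
   # else
   /* IEEE binary128: needs constants of TFmode's own type and range.
      A bare double literal cannot express 0x1.0p-8192, so this assumes
      a hypothetical __LIBGCC_TF_MIN__ value macro (about 2^-16382).  */
   #  define RMIN2    (__LIBGCC_TF_MIN__ * 0x1.0p+8190)   /* about 2^-8192 */
   #  define RMINSCAL (0x1.0p-2 / RMIN2)                  /* about 2^+8190 */
   # endif
   #endif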

-- 
Joseph S. Myers
jos...@codesourcery.com


Re: [PATCH V2] Practical Improvement to Double Precision Complex Divide

2020-06-10 Thread Patrick McGehearty via Gcc-patches

I will study real.c carefully along with the C99 std
to determine if I can find useful values for RMIN2 and RMINSCAL
for each format which are within range for all instances
of that format. A quick skim of real.c shows we have ieee half precision
and two arm half precision formats, for example.

I appreciate the pointer to real.c as I was sure that information
about various platform formats was somewhere in the src tree, but
had not found it on my own.

- patrick


On 6/10/2020 5:39 PM, Joseph Myers wrote:

On Wed, 10 Jun 2020, Patrick McGehearty wrote:


#ifdef L_divhc3
#define RBIG  (correct value for half precision)
#define RMIN  (correct value for half precision)
#define RMIN2 ...  (correct value for half precision)
#define RMINSCAL ... (correct value for half precision)
#endif
#ifdef L_divsc3
#define RBIG FLT_MAX
#define RMIN FLT_MIN
#define RMIN2 ...
#define RMINSCAL ...
#endif
...
would get the job done once I determine (and test) the correct
values for all formats.

But some of those machine modes can represent several different
floating-point formats, depending on the target.  There are many more
floating-point formats supported in real.c than there are names for
floating-point machine modes.  Not all the differences are relevant here
(those differences relating to default NaNs, for example), but some are.





Re: [PATCH V2] Practical Improvement to Double Precision Complex Divide

2020-06-10 Thread Joseph Myers
On Wed, 10 Jun 2020, Patrick McGehearty wrote:

> #ifdef L_divhc3
> #define RBIG  (correct value for half precision)
> #define RMIN  (correct value for half precision)
> #define RMIN2 ...  (correct value for half precision)
> #define RMINSCAL ... (correct value for half precision)
> #endif
> #ifdef L_divsc3
> #define RBIG FLT_MAX
> #define RMIN FLT_MIN
> #define RMIN2 ...
> #define RMINSCAL ...
> #endif
> ...
> would get the job done once I determine (and test) the correct
> values for all formats.

But some of those machine modes can represent several different 
floating-point formats, depending on the target.  There are many more 
floating-point formats supported in real.c than there are names for 
floating-point machine modes.  Not all the differences are relevant here 
(those differences relating to default NaNs, for example), but some are.

-- 
Joseph S. Myers
jos...@codesourcery.com


Re: [PATCH V2] Practical Improvement to Double Precision Complex Divide

2020-06-10 Thread Patrick McGehearty via Gcc-patches

Joseph,
Thank you again for your prompt and insightful response.

It's now clear that I was engaged in wishful thinking that I only needed
to handle double precision in my proposed change. I understand now that
the text above the routine:
- - - - -
#if defined(L_divhc3) || defined(L_divsc3) || defined(L_divdc3) \
    || defined(L_divxc3) || defined(L_divtc3)
- - - - -
implies the same code is to be built for
L_divhc3  half precision
L_divsc3  single precision
L_divdc3  double precision
L_divxc3  extended precision
L_divtc3  quad precision
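
Roughly, each of those objects is built with exactly one L_div?c3 macro
defined, which selects the mode name and element type used by the shared
CONCAT3(__div,MODE,3) body.  A sketch of that selection (approximate, not
the verbatim libgcc2.c text):

   /* Approximate shape of the per-mode selection in libgcc2.c; the real
      file also sets the complex type and the function-name suffix.  */
   #if defined(L_divhc3)
   # define MTYPE HFtype   /* half precision:     __divhc3 */
   # define MODE  hc
   #elif defined(L_divsc3)
   # define MTYPE SFtype   /* single precision:   __divsc3 */
   # define MODE  sc
   #elif defined(L_divdc3)
   # define MTYPE DFtype   /* double precision:   __divdc3 */
   # define MODE  dc
   #elif defined(L_divxc3)
   # define MTYPE XFtype   /* extended precision: __divxc3 */
   # define MODE  xc
   #elif defined(L_divtc3)
   # define MTYPE TFtype   /* quad precision:     __divtc3 */
   # define MODE  tc
   #endif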

I will need to expand my test suites for these other floating point formats
and do my best to understand how the libgcc build mechanisms work before
I'm ready to resubmit.

At first glance, assuming only one of the L_div?c3 names is defined
at a time, it seems like something like:

#ifdef L_divhc3
#define RBIG  (correct value for half precision)
#define RMIN  (correct value for half precision)
#define RMIN2 ...  (correct value for half precision)
#define RMINSCAL ... (correct value for half precision)
#endif
#ifdef L_divsc3
#define RBIG FLT_MAX
#define RMIN FLT_MIN
#define RMIN2 ...
#define RMINSCAL ...
#endif
...
would get the job done once I determine (and test) the correct
values for all formats.
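
For instance (a sketch with illustrative values only, not the tested
constants), filling in the single-precision case by analogy with the
double-precision 0x1.0p-512 / 0x1.0p+510 pair, and keeping the constants
in float so nothing is silently promoted to double, might look like:

   /* Sketch only: RMIN2/RMINSCAL are scaled by analogy with the
      double-precision proposal (roughly half of the mode's exponent
      range); they are not final, tested values.  The 'f' suffixes keep
      the constants in single precision.  */
   #ifdef L_divsc3
   # define RBIG     (FLT_MAX / 2.0f)
   # define RMIN     (FLT_MIN)
   # define RMIN2    (0x1.0p-64f)
   # define RMINSCAL (0x1.0p+62f)
   #endif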

- Patrick McGehearty


On 6/9/2020 6:11 PM, Joseph Myers wrote:

On Wed, 10 Jun 2020, Patrick McGehearty wrote:


I see your point about other floating point formats.
According to the C standard, DOUBLE precision must
have a DBL_MAX of at least 1E+37. That is (slightly)
greater than 0x1.0p+63.

Would
#define RMIN2 (0x1.0p-53)
#define RMINSCAL (0x1.0p+51)
be acceptable?
Those will be in range for any architecture that supports the C standard.

But this code is used for different machine modes, not just double.  In
particular, on Arm / AArch64 it may be used to build __divhc3.  And those
constants are certainly outside the range of HFmode (IEEE binary16).

I think you need to work out appropriate logic for what values are correct
for a given machine mode, in terms of the parameters of that mode (maximum
and minimum exponents etc.).  Then, if you can't represent the logic for
determining those values (with the correct type, not hardcoding the
absence of a suffix which would mean double) directly in C, see how
c-cppbuiltin.c predefines extra macros if -fbuilding-libgcc (macros
relating to built-in function suffixes and the number of bits in the
mantissa of a floating-point mode are included there - it would certainly
make sense to add more such macros if necessary).





Re: [PATCH V2] Practical Improvement to Double Precision Complex Divide

2020-06-10 Thread Patrick McGehearty via Gcc-patches

A follow-up note relating to the use of fused multiply-add in complex divide:

While reviewing bugs relating to complex divide in libgcc, I
rediscovered bug 59714 - complex division is surprising on targets
with FMA.

The specific concern was that when you divide (1.0 + 3.0i) by itself
with fused multiply-add active, you do not get (1.0 + 0.0i) as the
answer.  Instead you get 1.0 plus an imaginary part a small epsilon
away from zero.
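
A rough sketch of where that epsilon comes from (my illustration, not text
from the bug report): Smith's method computes r = c/d, the double nearest
1/3, and the imaginary numerator b*r - a.  The product 3.0 * r is exactly
1 - 2^-54, which rounds back to 1.0, so without contraction the numerator
is exactly zero; fma(3.0, r, -1.0) keeps the -2^-54 residual, which then
shows up as a tiny nonzero imaginary part after the divide:

   /* Illustration only; assumes round-to-nearest binary64 arithmetic.
      Build with -ffp-contract=off so the first expression is not itself
      contracted into an fma by the compiler.  */
   #include <math.h>
   #include <stdio.h>

   int main (void)
   {
     double r = 1.0 / 3.0;                  /* double nearest 1/3           */
     double no_fma   = (3.0 * r) - 1.0;     /* 3.0*r rounds to 1.0 -> 0.0   */
     double with_fma = fma (3.0, r, -1.0);  /* exact residual: -0x1.0p-54   */
     printf ("%g %g\n", no_fma, with_fma);  /* prints 0 and about -5.55e-17 */
     return 0;
   }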

The question raised in that bug report was whether it is appropriate
to use fused multiply-add at all, i.e. whether, on the whole, it
contributes to greater or lesser accuracy over a range of cases. No
follow-up data has been posted.

My data sets (10M randomly generated values over the full exponent and
mantissa range) showed the following with/without FMA:

Moderate exponent data set (between -512 and 511 for exponent)
error      current   current   new       new
exceeds    no fma    fma       no fma    fma
1 ulp      0.34492%  0.24715%  0.34492%  0.24715%
2 ulp      0.02360%  0.01764%  0.02360%  0.01764%


Full exponent data set (all exponents generated)
error      current   current   new       new
exceeds    no fma    fma       no fma    fma
1 ulp      1.90432%  1.87195%  0.23167%  0.19746%
2 ulp      1.70836%  1.70646%  0.04067%  0.03856%

While the differences between using fma and not using fma are not
dramatic, the data fairly clearly show that fma provides slightly more
accurate results more often than not.

I concur with the current practice of using fma if the architecture
provides it.

As a side note, my new method for complex divide gets the exact result
for dividing (1.0 + 3.0i) by itself whether or not fused multiply-add
is used.  That's because the calculation of the imaginary part involves
subtracting two nearly equal values whose difference turns out to be
less than DBL_MIN, and in that case the new algorithm changes the order
of operations to one that does not involve a fused mul-add.

- Patrick McGehearty



On 6/4/2020 6:38 PM, Patrick McGehearty via Gcc-patches wrote:

The following patch to libgcc/libgcc2.c __divdc3 provides an
opportunity to gain important improvements to the quality of answers
for the default double precision complex divide routine when dealing
with very large or very small exponents.

The current code correctly implements Smith's method (1962) [1]
further modified by c99's requirements for dealing with NaN (not a
number) results. When working with input values where the exponents
are greater than 512 (i.e. 2.0^512) or less than -512 (i.e. 2.0^-512),
results are substantially different from the answers provided by quad
precision more than 1% of the time. Since the allowed exponent range
for double precision numbers is -1074 to +1023, the error rate may be
unacceptable for many applications. The proposed method reduces the
frequency of "substantially different" answers by more than 99% at a
modest cost of performance.

Differences between current gcc methods and the new method will be
described. Then accuracy and performance differences will be discussed.

(Added for Version 2)
In my initial research, I missed Elen Kalda's proposed patch
https://gcc.gnu.org/legacy-ml/gcc-patches/2019-08/msg01629.html [3]
Thanks to Joseph Myers for providing me with the pointer.
This version includes performance and accuracy comparisons
between Elen's proposed patch and my latest patch version which
has been modified to eliminate two branches.

NOTATION

For all of the following, the notation is:
Input complex values:
   a+bi  (a= 64bit real part, b= 64bit imaginary part)
   c+di
Output complex value:
   e+fi = (a+bi)/(c+di)

For the result tables:
current = current method (SMITH)
b1div = method proposed by Elen Kalda
b2div = alternate method considered by Elen Kalda
new1 = new method using 1 divide and 2 multiplies
new = new method proposed by this patch

DESCRIPTIONS of different complex divide methods:

NAIVE COMPUTATION (-fcx-limited-range):
   e = (a*c + b*d)/(c*c + d*d)
   f = (b*c - a*d)/(c*c + d*d)

Note that c*c and d*d will overflow or underflow if either
c or d is outside the range 2^-538 to 2^512.
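
As a concrete illustration (my example, assuming double expressions are
evaluated in binary64, i.e. FLT_EVAL_METHOD == 0), dividing a value by
itself with the naive formula already fails once the components reach
2^600:

   /* Illustration only: naive complex divide of (x + xi) by itself with
      x = 2^600.  The denominator c*c + d*d overflows to +inf and so do
      the numerators, so both results are NaN instead of the true
      quotient 1 + 0i.  */
   #include <stdio.h>

   int main (void)
   {
     double a = 0x1.0p600, b = 0x1.0p600;
     double c = 0x1.0p600, d = 0x1.0p600;
     double denom = c * c + d * d;         /* overflows to +inf      */
     double e = (a * c + b * d) / denom;   /* inf / inf       = NaN  */
     double f = (b * c - a * d) / denom;   /* (inf - inf)/inf = NaN  */
     printf ("e = %g, f = %g\n", e, f);    /* prints nan for both    */
     return 0;
   }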

This method is available in gcc when the switch -fcx-limited-range is
used. That switch is also enabled by -ffast-math. Only one who has a
clear understanding of the maximum range of intermediate values
generated by a computation should consider using this switch.

SMITH's METHOD (current libgcc):
   if (FABS (c) < FABS (d)) {
   r = c/d;
   denom = (c*r) + d;
   e = (a*r + b) / denom;
   f = (b*r - a) / denom;
   } else {
   r = d/c;
   denom = c + (d*r);
   e = (a + b*r) / denom;
   f = (b - a*r) / denom;
   }

NEW METHOD (this patch), shown for the FABS (c) < FABS (d) branch:
   /* prevent overflow when arguments are near max representable */
   if ((FABS (d) > RBIG) || (FABS (a) > RBIG) || (FABS (b) > RBIG)) {
   a = a * 0.5;
   b = b * 0.5;
   c = c * 0.5;
   d = d * 0.5;
   }
   /* minimize overflow/underflow issues when c and d are small */
   else if (FABS (d) < RMIN2) {
   a = a * RMINSCAL;
   b = b * RMINSCAL;
   c = c * RMINSCAL;
   d = d * RMINSCAL;
   }
   r = c/d; denom = (c*r) + d;
   if (r > RMIN) {
   e = (a*r + b) / denom;
   f = (b*r - a) / denom;
   } else {
   e = (c * (a/d) + b) / denom;
   f = (c * (b/d) - a) / denom;
   }
[ only presenting the fabs(c) < fabs(d) case here, full code in patch. ]


Re: [PATCH V2] Practical Improvement to Double Precision Complex Divide

2020-06-09 Thread Joseph Myers
On Wed, 10 Jun 2020, Patrick McGehearty wrote:

> I see your point about other floating point formats.
> According to the C standard, DOUBLE precision must
> have a DBL_MAX of at least 1E+37. That is (slightly)
> greater than 0x1.0p+63.
> 
> Would
> #define RMIN2 (0x1.0p-53)
> #define RMINSCAL (0x1.0p+51)
> be acceptable?
> Those will be in range for any architecture that supports the C standard.

But this code is used for different machine modes, not just double.  In 
particular, on Arm / AArch64 it may be used to build __divhc3.  And those 
constants are certainly outside the range of HFmode (IEEE binary16).

I think you need to work out appropriate logic for what values are correct 
for a given machine mode, in terms of the parameters of that mode (maximum 
and minimum exponents etc.).  Then, if you can't represent the logic for 
determining those values (with the correct type, not hardcoding the 
absence of a suffix which would mean double) directly in C, see how 
c-cppbuiltin.c predefines extra macros if -fbuilding-libgcc (macros 
relating to built-in function suffixes and the number of bits in the 
mantissa of a floating-point mode are included there - it would certainly 
make sense to add more such macros if necessary).
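
One possible shape for that (a sketch only: the __LIBGCC_DF_MAX__,
__LIBGCC_DF_MIN__ and __LIBGCC_DF_EPSILON__ names below are hypothetical
additions of the kind described above, not macros that exist today) would
express every constant in terms of the mode's own parameters, so both the
magnitude and the type track whatever format the mode actually is:

   /* Sketch only; the __LIBGCC_DF_*__ value macros are hypothetical.
      Expressing RMIN2/RMINSCAL via the mode's epsilon keeps the values
      and their types correct for whatever format DFmode is.  */
   #ifdef L_divdc3
   # define RBIG     (__LIBGCC_DF_MAX__ / 2)
   # define RMIN     (__LIBGCC_DF_MIN__)
   # define RMIN2    (__LIBGCC_DF_EPSILON__)      /* ~0x1.0p-52 for binary64 */
   # define RMINSCAL (1 / __LIBGCC_DF_EPSILON__)
   #endif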

-- 
Joseph S. Myers
jos...@codesourcery.com


Re: [PATCH V2] Practical Improvement to Double Precision Complex Divide

2020-06-09 Thread Patrick McGehearty via Gcc-patches

I see your point about other floating point formats.
According to the C standard, DOUBLE precision must
have a DBL_MAX of at least 1E+37. That is (slightly)
greater than 0x1.0p+63.

Would
#define RMIN2 (0x1.0p-53)
#define RMINSCAL (0x1.0p+51)
be acceptable?
Those will be in range for any architecture that supports the C standard.

I estimate that using those values will cause a typical complex divide
to run somewhat slower on the full exponent dataset, because the test
for whether fabs(d) < RMIN2 [or fabs(c) < RMIN2] will be taken more
often, adding four scaling multiplies to the overall operation.

On the positive side, using the new scaling factor significantly reduces
the frequency of 2 ulp or larger inaccuracies for the full range dataset.
The alternative would be to remove the RMIN2 test altogether.
That provides a modest reduction in overhead but increases
the inaccuracy for the 8 ulp case to 0.04147%.

Errors   Full Dataset
gtr eq   current   RMIN2=2^-512  RMIN2=2^-53  no RMIN2
=======================================================
 1 ulp   1.87%     0.19755%      0.16788%     0.22169%
 2 ulp   1.70%     0.03857%      0.01007%     0.06182%
 8 ulp   1.61%     0.02394%      0.00199%     0.04147%
16 ulp   1.51%     0.01548%      0.00087%     0.02727%
24 ulp   1.41%     0.00945%      0.00042%     0.01694%
52 ulp   1.13%     0.1%          0.1%         0.1%
=======================================================

The RMIN2=2^-53 option greatly reduces 8 ulp or larger errors.
The "no RMIN2" option has about double the error rate
of the initial proposal of RMIN2=2^-512, but still has an error
rate at the 8 ulp level 40 times lower than the current code.

I welcome any thoughts from the community about the relative importance
of accuracy vs performance tradeoffs at these levels.

- Patrick McGehearty


On 6/4/2020 6:45 PM, Joseph Myers wrote:

On Fri, 5 Jun 2020, Patrick McGehearty wrote:


diff --git a/libgcc/libgcc2.c b/libgcc/libgcc2.c
index e0a9fd7..2a1d3dc 100644
--- a/libgcc/libgcc2.c
+++ b/libgcc/libgcc2.c
@@ -2036,26 +2036,77 @@ CONCAT3(__mul,MODE,3) (MTYPE a, MTYPE b, MTYPE c, MTYPE d)
  CTYPE
  CONCAT3(__div,MODE,3) (MTYPE a, MTYPE b, MTYPE c, MTYPE d)
  {
+#define RBIG   ((DBL_MAX)/2.0)
+#define RMIN   (DBL_MIN)
+#define RMIN2  (0x1.0p-512)
+#define RMINSCAL (0x1.0p+510)

This code is used for many different machine modes and floating-point
formats (the type and format corresponding to a particular machine mode
may depend on the target for which GCC is configured).  You can't hardcode
particular values specific to DFmode (and to DFmode being IEEE binary64)
here.





Re: [PATCH V2] Practical Improvement to Double Precision Complex Divide

2020-06-04 Thread Joseph Myers
On Fri, 5 Jun 2020, Patrick McGehearty wrote:

> diff --git a/libgcc/libgcc2.c b/libgcc/libgcc2.c
> index e0a9fd7..2a1d3dc 100644
> --- a/libgcc/libgcc2.c
> +++ b/libgcc/libgcc2.c
> @@ -2036,26 +2036,77 @@ CONCAT3(__mul,MODE,3) (MTYPE a, MTYPE b, MTYPE c, MTYPE d)
>  CTYPE
>  CONCAT3(__div,MODE,3) (MTYPE a, MTYPE b, MTYPE c, MTYPE d)
>  {
> +#define RBIG   ((DBL_MAX)/2.0)
> +#define RMIN   (DBL_MIN)
> +#define RMIN2  (0x1.0p-512)
> +#define RMINSCAL (0x1.0p+510)

This code is used for many different machine modes and floating-point 
formats (the type and format corresponding to a particular machine mode 
may depend on the target for which GCC is configured).  You can't hardcode 
particular values specific to DFmode (and to DFmode being IEEE binary64) 
here.

-- 
Joseph S. Myers
jos...@codesourcery.com