Thank you for your interest in this project.

On 3/27/2021 6:18 PM, Bernhard Reutner-Fischer wrote:
> On Fri, 26 Mar 2021 23:14:41 +0000
> Patrick McGehearty via Gcc-patches <gcc-patches@gcc.gnu.org> wrote:
>
>> Changes in Version 9 since Version 8:
>>
>> Revised code to meet gcc coding standards in all files, especially
>>      with respect to adding spaces around operations and removing
>>      excess () in #define macro definitions.
> Did you gather cycle counter diffs for current against new for the
> normal and some edge cases?
I don't see additional value in looking at cycle counters in addition
to the timing information I presented. I prefer wall clock time to
cycle counters, perhaps because cycle counters are less reliably
implemented on some platforms (that's a separate discussion, though).
Every platform will have different counters and different timing.

Let me run through a rough operation count analysis of the double
precision case. I will also postulate instruction latencies
that might be typical of a recent processor. Real processors
will differ somewhat, but not by large amounts.

The current method uses Smith's method. Showing half of that method:
  if (fabs(c) < fabs(d))
    {
      ratio = c / d;
      denom = (c * ratio) + d;
      xx = ((a * ratio) + b) / denom;
      yy = ((b * ratio) - a) / denom;
    }
<similar code for the other branch of the if>
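For reference, here are both branches together in textbook form; the
variable names follow the snippet above, and this is a sketch rather
than the libgcc source:

  #include <math.h>

  /* Smith's method for (a + b*i) / (c + d*i) = xx + yy*i.
     Dividing by the larger of |c|, |d| keeps |ratio| <= 1, which
     avoids most overflow when forming the denominator.  */
  static void
  smith_div (double a, double b, double c, double d,
             double *xx, double *yy)
  {
    double ratio, denom;
    if (fabs (c) < fabs (d))
      {
        ratio = c / d;
        denom = (c * ratio) + d;
        *xx = ((a * ratio) + b) / denom;
        *yy = ((b * ratio) - a) / denom;
      }
    else
      {
        ratio = d / c;
        denom = c + (d * ratio);
        *xx = (a + (b * ratio)) / denom;
        *yy = (b - (a * ratio)) / denom;
      }
  }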

It requires:
two fabs computations (typically 1 cycle each)
one floating point compare (typically 2 cycles)
1 branch decision (50-50 prediction reliability)
        (1 cycle if predicted, 9 cycles if not predicted; ave=5)
3 multiplies, 3 divides, 3 add/subtract operations
  (assume 5 cycle latency for add/mul and 15 cycle for divide)
        (75 cycles)
Total: 84 cycles (typical average, old method)
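Spelled out as a trivial check (the latencies are the assumptions
stated above, not measurements):

  #include <stdio.h>

  /* Tally the assumed latencies for one branch of Smith's method.  */
  int main (void)
  {
    int fabs_c = 1;                /* per fabs */
    int cmp_c = 2;                 /* floating point compare */
    int branch_c = 5;              /* average of 1 predicted, 9 not */
    int addmul_c = 5, div_c = 15;  /* add/mul and divide latency */
    int total = 2 * fabs_c + cmp_c + branch_c
                + 3 * addmul_c    /* multiplies */
                + 3 * div_c       /* divides */
                + 3 * addmul_c;   /* adds/subtracts */
    printf ("old method estimate: %d cycles\n", total);  /* 84 */
    return 0;
  }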

I omit the NAN analysis as that is unchanged.

The cost of the new method is data dependent.
First, assume all scaling tests are false and ratio does not underflow.
Further, assume that data is consistently in those ranges, so that
all branch predictions except the first are predicted correctly.
I'll also assume we have an excellent branch predictor mechanism
that can handle predicting that pattern.

Then we have:
5 additional fabs, and 4 additional compare-and-branches (all predicted),
and one || operation, for 18 additional cycles.
[This optimistic count assumes fabs(a) and fabs(b) > RMIN, allowing
us to bypass the fabs < RMAX2 tests.]

New total: 84+18 = 102 (optimistic average, new method)

If we can't bypass the fabs < RMAX2 tests, we need to add
four more fabs, four && operations, and four compare
operations, for 12 more cycles.
Possible total: 84+18+12 = 114

Finally, if we hit a scaling operation, then we have four
multiply operations for an additional 20 cycles, plus
we've mispredicted the branch for 8 more (9 cycles instead of 1).
114 + 28 = 142 cycles.
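In rough outline, the tests being counted have this shape (simplified,
with stand-in threshold names; see the patch itself for the exact
conditions). Scaling all four operands by the same power of two leaves
the quotient unchanged and loses no bits:

  #include <float.h>
  #include <math.h>

  #define RBIG     (DBL_MAX / 2)      /* near max representable */
  #define RMIN     DBL_MIN            /* smallest normal */
  #define RMIN2    DBL_EPSILON
  #define RMINSCAL (1 / DBL_EPSILON)  /* power-of-two scale-up */
  #define RMAX2    (RBIG * RMIN2)

  /* Inside the fabs (c) < fabs (d) branch, before ratio = c / d.  */
  static void
  prescale (double *a, double *b, double *c, double *d)
  {
    if (fabs (*d) >= RBIG)
      {
        /* Denominator near the top of the range: scale down.  */
        *a /= 2; *b /= 2; *c /= 2; *d /= 2;
      }
    else if (fabs (*d) < RMIN2
             || (fabs (*a) < RMIN && fabs (*b) < RMAX2)
             || (fabs (*b) < RMIN && fabs (*a) < RMAX2))
      {
        /* Tiny denominator or tiny numerator part: scale up
           (the four multiplies counted above).  */
        *a *= RMINSCAL; *b *= RMINSCAL; *c *= RMINSCAL; *d *= RMINSCAL;
      }
  }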
This analysis is not particularly accurate. It does not account for
some architectures having several floating point units, or for
speculative execution hiding latency, or many other variations in
computer architecture. Still, it's good enough for a ballpark estimate
that says the typical case will be around 15-25% slower and the most
extreme cases will be around 70% slower.

When we look at the patch writeup, I report that for double precision
the "moderate set", which represents the typical case, shows a
measured slowdown of 4% to 24%. The "full set", which has a
substantial number of values that might cause branch predictors to go
wrong, costs 38% to 56% overhead. Apparently my 'back of the
envelope' analysis above is not too far off.

Oh, I failed to cover the case where fabs(ratio) < RMIN.
That happens in the full set about 1-2% of the time.
It requires two additional divide operations for an extra
30 cycles, plus maybe 10 cycles for the mispredicted branch.
That might bump the worst case up to about double the
normal case. That's acceptable when you consider that the slow
path prevents the code from returning a completely wrong result.

As a quick aside, the test for ratio underflowing reduces the error
rate by a factor of 4, from 1.70% to 0.35%, in the full exponent range
data set, with about half the total overhead of the new method. Adding
all the other tests and scaling code reduces errors larger than 12
bits to roughly 1 in 10 million, and errors of 18 bits or larger to
roughly 1 in 100 million. Those remaining errors are due to
subtraction cancellation, which is a harder problem than avoiding
underflow/overflow issues.
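To make the ratio-underflow hazard concrete, here is a small
standalone example with values chosen for illustration (it assumes
doubles are evaluated in double precision, i.e. FLT_EVAL_METHOD == 0):

  #include <stdio.h>

  /* When ratio = c / d lands in the subnormal range, only a few
     significant bits survive, so a * ratio can be wrong in most of
     its bits even though the true product is comfortably in range.  */
  int main (void)
  {
    double c = 0x1.23456789abcdep-100;
    double d = 0x1p+970;
    double ratio = c / d;      /* true value ~2^-1070: subnormal,
                                  only ~5 significant bits kept */
    double a = 0x1p+1000;
    double bad  = a * ratio;   /* built from the truncated ratio */
    double good = (a / d) * c; /* same value via steps that stay
                                  in the normal range */
    printf ("bad  = %a\ngood = %a\n", bad, good);
    return 0;
  }

Here bad and good disagree after only the first several leading bits,
because most of the bits of ratio were lost to subnormal rounding.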



> Can you please share data for -Os between current and your proposed new?
As for data for -Os for current and proposed new, it's not very
interesting. There are no differences, as complex divide is not
inlined by gcc at -Os. Instead, there is a simple call to __divdc3.
I checked for inlining with gcc 4.8, gcc 8.3, and gcc 11.0.1, as well
as with a compiler that has my code changes.

If one uses -fcx-fortran-rules, then Smith's method for complex divide
is inlined with -Os. If one uses -fcx-limited-range, then the naive
method for complex divide is inlined with -Os.
The new code does not change the code generated when those flags are
used.
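For anyone who wants to reproduce that, a minimal test case (the
function name is just for illustration; the exact assembly will vary
by target):

  /* divtest.c: at -Os, gcc emits a call to __divdc3 for this divide.
     With -Os -fcx-fortran-rules the Smith-method sequence is inlined,
     and with -Os -fcx-limited-range the naive sequence is inlined.
     Inspect with: gcc -Os -S divtest.c  */
  _Complex double
  divtest (_Complex double x, _Complex double y)
  {
    return x / y;
  }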

The libgcc_s.so.1 (with symbols) does get larger by roughly 8K with the additional code.

- Patrick McGehearty
