On 10/1/21 6:07 PM, Brodie Gaslam via R-devel wrote:
On Thursday, September 30, 2021, 01:25:02 PM EDT, <luke-tier...@uiowa.edu> 
wrote:

On Thu, 30 Sep 2021, brodie gaslam via R-devel wrote:

André,

I'm not an R core member, but happen to have looked a little bit at this
issue myself.  I've seen similar things on Skylake and Coffee Lake 2
(9700, one generation past your latest) too.  I think it would make sense
to have some handling of this, although I would want to see the trade-off:
the performance impact on CPUs that are not affected by this, and on
vectors that don't actually contain NAs, and the like.  I think the
performance impact is likely to be small so long as branch prediction is
active, but since branch prediction is involved you might need to check
with different ratios of NAs (not for your NA bailout branch, but e.g. for
the interaction of what you add with the existing `na.rm=TRUE` logic).
I would want to see realistic examples where this matters, not
microbenchmarks, before thinking about complicating the code. Not all
but most cases where sum(x) returns NaN/NA would eventually result in
an error; getting to the error faster is not likely to be useful.
That's a very good point, and I'll admit I did not consider it
sufficiently.  There are examples such as `rowSums`/`colSums` where some
rows/columns evaluate to NA, yet the result still contains meaningful
data.  The same goes for any loop that applies `sum` to list elements
where some contain NAs and others do not; `tapply` or any other
group-based aggregation comes to mind.

My understanding is that arm64 does not support proper long doubles
(they are the same as regular doubles).
Mine is the same.

Then there is the issue that "long double" (where it is not equivalent to "double") is implemented differently on different platforms, with different properties. We have run into that on Power, where "long double" is implemented as a sum of doubles ("double-double"). If we could rely just on "double", we would not have to worry about such things.

So code using long doubles isn't getting the hoped-for improved
precision. Since that architecture is becoming more common we should
probably be looking at replacing uses of long doubles with better
algorithms that can work with regular doubles, e.g. Kahan summation or
variants for sum.
This is probably the bigger issue.  If the future is ARM/AMD, the value of
Intel x87-only optimizations becomes questionable.

More generally, there is the question of whether to completely replace long
double with algorithmic precision methods, at a cost of performance on
systems that do support hardware long doubles (Intel or otherwise), or
whether both code paths are kept and selected at compile time.  Or
maybe the aggregation functions gain a low-precision flag for simple
double precision accumulation.

I'm curious to look at the performance and precision implications of e.g.
Kahan summation if no one has done that yet.  Some quick poking around
shows people using processor-specific intrinsics to take advantage of
advanced multi-element instructions, but I imagine R would not want to do
that.  Assuming others have not done this already, I will have a look and
report back.
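
For reference, a minimal sketch of what a compensated (Kahan) summation
loop could look like in plain C with only regular doubles; this is
illustrative, not R's code, and it assumes the compiler is not allowed to
reassociate the floating-point arithmetic (i.e. no -ffast-math):

    #include <stddef.h>

    /* Kahan (compensated) summation: `c` tracks the low-order bits lost
       when each element is added to the accumulator, recovering much of
       the precision a long double accumulator would otherwise give. */
    static double kahan_sum(const double *x, size_t n)
    {
        double sum = 0.0, c = 0.0;
        for (size_t i = 0; i < n; i++) {
            double y = x[i] - c;   /* apply the running compensation */
            double t = sum + y;    /* low-order bits of y may be lost here */
            c = (t - sum) - y;     /* recover what was lost */
            sum = t;
        }
        return sum;
    }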

Processor-specific (or even compiler-specific) code is better avoided, but sometimes it is possible to write portable code that is tailored to run fast on common platforms, while still working correctly with acceptable performance on others.

Sometimes one can give hints to the compiler via OpenMP pragmas to vectorize code and/or use vectorized instructions, e.g. when it is ok to ignore some specific corner cases (R uses this in mayHaveNaNOrInf to tell the compiler it is ok to assume associativity of addition in a specific loop/variable, hence allowing it to vectorize better).
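
As an illustration of the kind of hint meant here (a sketch, not the actual
mayHaveNaNOrInf code), an OpenMP simd reduction clause tells the compiler it
may treat the additions in one specific loop as associative, which lets it
vectorize the reduction; without OpenMP support the pragma is simply ignored:

    /* Illustrative only: compile with e.g. -fopenmp-simd. */
    double vector_sum(const double *x, long n)
    {
        double s = 0.0;
    #pragma omp simd reduction(+:s)
        for (long i = 0; i < n; i++)
            s += x[i];
        return s;
    }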

Since we're on the topic I want to point out that the default NA in R
starts off as a signaling NA:

      example(numToBits)   # for `bitC`
      bitC(NA_real_)
      ## [1] 0 11111111111 | 0000000000000000000000000000000000000000011110100010
      bitC(NA_real_ + 0)
      ## [1] 0 11111111111 | 1000000000000000000000000000000000000000011110100010

Notice the leading bit of the significand starts off as zero, which marks
it as a signaling NA, but becomes 1, i.e. non-signaling (quiet), after any
operation[2].
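
(For the curious, that bit can be inspected directly; a small sketch,
assuming the usual IEEE 754 layout where bit 51, the top bit of the
significand, is the "quiet" bit:)

    #include <math.h>
    #include <stdint.h>
    #include <string.h>

    /* Sketch: returns 1 for a quiet NaN, 0 for a signaling NaN or non-NaN. */
    static int is_quiet_nan(double x)
    {
        if (!isnan(x)) return 0;
        uint64_t bits;
        memcpy(&bits, &x, sizeof bits);
        return (int) ((bits >> 51) & 1);
    }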

This is meaningful because the mere act of loading a signaling NA into the
x87 FPU is sufficient to trigger the slowdowns, even if the NA is not
actually used in arithmetic operations.  This happens under some
optimization levels.  I don't know of any benefit of starting off with a
signaling NA, especially since the encoding is lost pretty much as soon as
it is used.  If folks are interested I can provide a patch to turn the NA
quiet by default.
In principle this might be a good idea, but the current bit pattern is
unfortunately baked into a number of packages and documents on
internals, as well as serialized objects. The work needed to sort that
out is probably not worth the effort.
One reason why we might not need to sort this out is precisely the
instability shown above.  Anything that relies on the signaling bit set to
a particular value will behave differently with `NA_real_` vs
`NA_real_ + x`.  `R_IsNA` only checks the lower word, so it doesn't care
about the signaling bit or the 19 subsequent ones.  Anything that does
likely has unpredictable behavior **currently**.

Similarly, the documentation[1] only specifies the low word:

On such platforms NA is represented by the NaN value with low-word 0x7a2
(1954 in decimal).
This is consistent with the semantics of `R_IsNA`.
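
In other words, a check with those semantics only needs the low 32 bits of
the payload; a simplified sketch (not a verbatim copy of R's R_IsNA) would be:

    #include <math.h>
    #include <stdint.h>
    #include <string.h>

    /* Sketch: treat any NaN whose low word equals 0x7a2 (1954) as NA,
       ignoring the quiet/signaling bit and the rest of the high word. */
    static int is_R_NA(double x)
    {
        if (!isnan(x)) return 0;
        uint64_t bits;
        memcpy(&bits, &x, sizeof bits);
        return (uint32_t) (bits & 0xFFFFFFFFu) == 1954;
    }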

The signalling vs non-signalling bit may also impact propagation of the NaN payload on some platforms. Some non-portable code could be looking at all the bits. Changing the representation based on performance issues in a specific processor may not be the right thing to do in principle. There would have to be a strong benefit of the change to outweigh these risks.

It also doesn't seem to affect the performance issue here since
setting b[1] <- NA_real_ + 0 produces the same slowdown (at least on
my current Intel machine).
The subtlety here is that the slowdown happens merely by loading the
signaling NaN onto the x87 FPU.  Consider the following:

     x <- t(1:1e7+1L)
     system.time(rowSums(x))
     ##    user  system elapsed
     ##   1.685   0.007   1.693

If we apply the following patch to make NA_REAL quiet:

Index: src/main/arithmetic.c
===================================================================
--- src/main/arithmetic.c    (revision 80996)
+++ src/main/arithmetic.c    (working copy)
@@ -117,7 +117,7 @@
      /* The gcc shipping with Fedora 9 gets this wrong without
       * the volatile declaration. Thanks to Marc Schwartz. */
      volatile ieee_double x;
-    x.word[hw] = 0x7ff00000;
+    x.word[hw] = 0x7ff80000;
      x.word[lw] = 1954;
      return x.value;
  }

We get a ~25x speedup:

     x <- t(1:1e7+1L)
     system.time(rowSums(x))
     ##    user  system elapsed
     ##   0.068   0.000   0.068

This despite no NAs anywhere in sight.  I observe the slow behavior on
Skylake, Coffee Lake 2 and others, across different OSes, so long as
the -O2 compilation flag is used.

This likely happens because the compiler tries to prepare for the `keepNA`
branch in [1]:

         case INTSXP:
         {
         int *ix = INTEGER(x) + (R_xlen_t)n*j;
         for (cnt = 0, sum = 0., i = 0; i < n; i++, ix++)
             if (*ix != NA_INTEGER) {cnt++; sum += *ix;}
             else if (keepNA) {sum = NA_REAL; break;}
         break;
         }

by loading an NA_REAL into the FPU.  At least that was my interpretation
of the following machine code (dumped from src/main/array.o).  I barely
understand ASM, but it looks like 9bd4 loads R_NaReal, and this happens
before the test for NA at 9be8:

             case LGLSXP:   # looks like INTSXP/LGLSXP branches collapsed
             {
                 int *ix = LOGICAL(x) + (R_xlen_t)n * j;
     9bc8:       4d 01 e3                add    r11,r12
                 for (R_xlen_t i = 0; i < n; i++, ra++, ix++)
     9bcb:       48 85 db                test   rbx,rbx
     9bce:       0f 8e fc fe ff ff       jle    9ad0 <do_colsum+0x260>
                     if (keepNA) {
                         if (*ix != NA_LOGICAL) *ra += *ix;
                         else *ra = NA_REAL;
     9bd4:       dd 05 00 00 00 00       fld    QWORD PTR [rip+0x0]        # 9bda <do_colsum+0x36a>
                        9bd6: R_X86_64_PC32     R_NaReal-0x4
     9bda:       48 8b 95 60 ff ff ff    mov    rdx,QWORD PTR [rbp-0xa0]
                 for (R_xlen_t i = 0; i < n; i++, ra++, ix++)
     9be1:       31 c0                   xor    eax,eax
     9be3:       eb 2c                   jmp    9c11 <do_colsum+0x3a1>
     9be5:       0f 1f 00                nop    DWORD PTR [rax]
                         if (*ix != NA_LOGICAL) *ra += *ix;
     9be8:       45 39 d0                cmp    r8d,r10d
     9beb:       74 5b                   je     9c48 <do_colsum+0x3d8>
     9bed:       44 89 85 78 ff ff ff    mov    DWORD PTR [rbp-0x88],r8d
     9bf4:       db 85 78 ff ff ff       fild   DWORD PTR [rbp-0x88]
     9bfa:       db 2a                   fld    TBYTE PTR [rdx]
     9bfc:       de c1                   faddp  st(1),st
     9bfe:       db 3a                   fstp   TBYTE PTR [rdx]

... SNIP

     9c48:       db 3a                   fstp   TBYTE PTR [rdx]
     9c4a:       db 2a                   fld    TBYTE PTR [rdx]
     9c4c:       eb b2                   jmp    9c00 <do_colsum+0x390>

If no NA_LOGICAL (NA_INTEGER) values are encountered, the NaN is not used.
However, it is always loaded, and for signaling NaNs this alone appears
to switch the FPU into turtle mode.

But more important than whether I can interpret ASM correctly (dubious),
simply changing the NA_REAL value to be of the quiet variety dramatically
improves performance of `rowSums` with integers.

Ironically, this doesn't happen with the REALSXP branch because that one
relies on NaN propagation in the FPU so doesn't load the NA_REAL for the
early-break case when it encounters one.  Of course if it does encounter a
NaN, quiet or not, we get a slowdown.

Compiling with -O3 also fixes this.

In general it is very difficult to tell performance from the assembly, because a lot of optimizations happen at the hardware level; short of being a compiler expert who understands the concrete processor in detail, one can only make guesses and try performance improvements based on them. If your guess about the reason for the slowdown is right, there should be a way to suggest a small change to the C code, along the lines of what -O3 does, so that the compiler (a recent version of GCC) produces code which doesn't load the value eagerly and runs faster.
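
One untested possibility, assuming the eager load is indeed the culprit,
would be to keep NA_REAL out of the hot loop altogether, e.g. by recording
the NA with a flag and assigning NA_REAL only after the loop; this is a
sketch against the INTSXP branch quoted above, not a proposed patch:

         case INTSXP:
         {
         int *ix = INTEGER(x) + (R_xlen_t)n*j;
         int hasNA = 0;  /* sketch: defer the NA_REAL store */
         for (cnt = 0, sum = 0., i = 0; i < n; i++, ix++)
             if (*ix != NA_INTEGER) {cnt++; sum += *ix;}
             else if (keepNA) {hasNA = 1; break;}
         if (hasNA) sum = NA_REAL;
         break;
         }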

Bad performance of `rowSums` on integers alone is not really that big a
deal given the output is numeric anyway, and that's why I never reported
this despite having known about it for a while.  But André's e-mail pushed
me into saying something about it.  There is a risk that code relies on the
full bit pattern of NA_REAL, but such code is probably broken already,
since the signaling bit is even more unstable than the general payload
bits.

Yes, the signalling bit is often lost sooner than the payload.

Best
Tomas


Best,

B.

[1]: https://github.com/r-devel/r-svn/blob/6891db49680629427e3d5927053531b8fa5d8ee3/src/main/array.c#L1939
[2]: https://cran.r-project.org/doc/manuals/r-release/R-data.html#Special-values

Best,

luke


Best,

B.

[1]: https://randomascii.wordpress.com/2012/05/20/thats-not-normalthe-performance-of-odd-floats/
[2]: https://en.wikipedia.org/wiki/NaN#Encoding





On Thursday, September 30, 2021, 06:52:59 AM EDT, GILLIBERT, Andre 
<andre.gillib...@chu-rouen.fr> wrote:

Dear R developers,

By default, R uses the "long double" data type to get extra precision for 
intermediate computations, with a small performance tradeoff.

Unfortunately, on all Intel x86 computers I have ever seen, long doubles
(implemented in the x87 FPU) are extremely slow whenever a special
representation (NA, NaN or infinities) is used, probably because it triggers
poorly optimized microcode in the CPU firmware.  A function such as sum()
becomes more than a hundred times slower!
Test code:
a=runif(1e7);system.time(for(i in 1:100)sum(a))
b=a;b[1]=NA;system.time(sum(b))

The slowdown factors are as follows on a few Intel CPUs:

1) Pentium Gold G5400 (Coffee Lake, 8th generation) with R 64 bits: 140 times slower with NA

2) Pentium G4400 (Skylake, 6th generation) with R 64 bits: 150 times slower with NA

3) Pentium G3220 (Haswell, 4th generation) with R 64 bits: 130 times slower with NA

4) Celeron J1900 (Atom Silvermont) with R 64 bits: 45 times slower with NA

I do not have access to more recent Intel CPUs, but I doubt that it has 
improved much.

Recent AMD CPUs show no significant slowdown.
There is no significant slowdown on Intel CPUs (more recent than Sandy Bridge)
for 64-bit floating point calculations based on SSE2.  Therefore, operators
using doubles, such as '+', are unaffected.

I do not know whether recent ARM CPUs have slowdowns on FP64... Maybe somebody 
can test.

Since NAs are not rare in real life, I think it would be worth an extra check
in functions based on long doubles, such as sum().  The check for special
representations does not necessarily have to be done at each iteration for
cumulative functions; as sketched below, it can be done once per block.
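
As a sketch of that idea (illustrative only, not the proposed patch), the
long double accumulator could be tested for NaN once per block of a few
thousand elements, bailing out as soon as a special value has propagated
into it:

    #include <math.h>
    #include <stddef.h>

    /* Sketch: long double accumulation with a periodic NaN check, so the
       x87 special-value penalty is paid for at most one block.  The block
       size is arbitrary here. */
    static double sum_with_block_check(const double *x, size_t n)
    {
        const size_t block = 4096;
        long double s = 0.0L;
        for (size_t i = 0; i < n; ) {
            size_t end = (n - i < block) ? n : i + block;
            for (; i < end; i++)
                s += x[i];
            if (isnan((double) s))  /* NA/NaN already in: result is NaN/NA */
                return (double) s;
        }
        return (double) s;
    }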
If you are interested, I can write a bunch of patches to fix the main functions using 
long doubles: cumsum, cumprod, sum, prod, rowSums, colSums, matrix multiplication 
(matprod="internal").

What do you think of that?

--
Sincerely
André GILLIBERT


--
Luke Tierney
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa                  Phone:            319-335-3386
Department of Statistics and        Fax:              319-335-3017
      Actuarial Science
241 Schaeffer Hall                  email:  luke-tier...@uiowa.edu
Iowa City, IA 52242                WWW:  http://www.stat.uiowa.edu


______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
