On Tue, 26 Apr 2005, Vadim Ogranovich wrote:

Thank you for sharing the benchmark results. The improvement is very
substantial; I am looking forward to the release of the byte compiler!

The arithmetic shows that x[i]<- is still the bottleneck. I suspect that
this is due to a very involved dispatching/search for the appropriate
function at the C level. There might be a significant gain if loops
somehow cached the result of the initial dispatch. This is probably what
you referred to as additional improvements in the engine itself.

I'd be surprised if dispatching were the issue: have you (C-level) profiled to find out? Please do so: these statements do tend to get perpetuated as fact.


You cannot cache the result, as [<- can change the class of x, as could other operations done by the rhs (e.g. if it were x[i] <- g(x, i) the function g could change its argument).
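
A small illustrative sketch (an assumed example, not from the thread) of how a subassignment can change x out from under the loop:

x <- 1:5          # integer vector
x[2] <- 2.5       # subassignment coerces x to double
class(x)          # "numeric"
x[3] <- "a"       # ... and now to character
class(x)          # "character"

so a dispatch decision cached on the first iteration need not be valid for the next.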

There are a large number of allocations going on (e.g. you need to create a length-one vector x[i-1]), and

gc.time(TRUE)
[1] 0 0 0 0 0
system.time(f1(x, iA), gcFirst=TRUE)
[1] 6.46 0.02 6.49 0.00 0.00
gc.time()
[1] 1.83 1.30 1.85 0.00 0.00

(although note the comment in ?gc.time). It is typical for gc()-ing to be a major component of the time in such simple tasks, and allocation can account for even more.

Note the use of gcFirst=TRUE (one could call gc() before gc.time to be even fairer). Comparable figures for f2:

system.time(f2(x, iA), gcFirst=TRUE)
[1] 2.13 0.00 2.13 0.00 0.00
gc.time()
[1] 0.69 0.54 0.71 0.00 0.00

It's quite typical for gc()-ing to take 1/3 of the time (as reported by gc.time) in trivial problems like this.
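
Putting those pieces together, a minimal measurement recipe along the lines above (a sketch using only the calls already shown; see ?gc.time):

gc()                                    # collect before measuring
gc.time(TRUE)                           # enable gc timing (returns the current counters)
system.time(f1(x, iA), gcFirst=TRUE)    # gcFirst gc()s again just before timing
gc.time()                               # time attributed to garbage collection during the run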

Though the examples are very simple, I think they are typical of any
simulation of a dynamic system where the new state is computed as a
transformation of the previous state. In my example, f1(), the state
was a scalar and the transformation was the identity.

I don't believe they are typical, as the transformation is so trivial. Of course, `simulation of dynamic systems' is not a typical task for R.

In any case the timing of the byte-compilation is remarkable!

Thanks,
Vadim



-----Original Message-----
From: Luke Tierney [mailto:[EMAIL PROTECTED]
Sent: Tuesday, April 26, 2005 5:50 PM
To: Vadim Ogranovich
Cc: Peter Dalgaard; Jason Liao; r-devel@stat.math.ethz.ch
Subject: RE: [R] when can we expect Prof Tierney's compiled R?

For what it's worth (probably not much as these simple
benchmarks are rarely representative of real code and so need
to be taken with a huge chunk of salt) here is what happens
with your examples in R 2.1.0 with the current byte compiler.
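
(A minimal setup sketch, assuming cmpfun() below comes from the experimental compiler package; the package name is assumed here:

     library(compiler)   # assumed: provides cmpfun()
)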

Define your examples as functions:

     n = 1e6; iA = seq(2,n); x = double(n);
     f1 <- function(x, iA) for (i in iA) x[i] = x[i-1]   # assign from the left neighbour
     f2 <- function(x, iA) for (i in iA) x[i-1]          # extraction only
     f3 <- function(x, iA) for (i in iA) 1               # empty body: loop overhead
     f4 <- function(x, iA) for (i in iA) x[i] = 1.0      # assignment of a constant
     f5 <- function(x, iA) for (i in iA) i-1             # index arithmetic only
     f6 <- function(x, iA) for (i in iA) i               # loop variable only

Make byte compiled versions:

     f1c <- cmpfun(f1)
     f2c <- cmpfun(f2)
     f3c <- cmpfun(f3)
     f4c <- cmpfun(f4)
     f5c <- cmpfun(f5)
     f6c <- cmpfun(f6)

and run them:

    > system.time(f1(x, iA))
     [1] 5.43 0.04 5.56 0.00 0.00
    > system.time(f1c(x, iA))
     [1] 1.77 0.03 1.81 0.00 0.00

    > system.time(f2(x, iA))
     [1] 1.72 0.01 1.74 0.00 0.00
    > system.time(f2c(x, iA))
     [1] 0.63 0.00 0.63 0.00 0.00

    > system.time(f3(x, iA))
     [1] 0.19 0.00 0.20 0.00 0.00
    > system.time(f3c(x, iA))
     [1] 0.14 0.00 0.15 0.00 0.00

    > system.time(f4(x, iA))
     [1] 3.78 0.03 3.82 0.00 0.00
    > system.time(f4c(x, iA))
     [1] 1.26 0.02 1.30 0.00 0.00

    > system.time(f5(x, iA))
     [1] 0.99 0.00 1.00 0.00 0.00
    > system.time(f5c(x, iA))
     [1] 0.30 0.00 0.31 0.00 0.00

    > system.time(f6(x, iA))
     [1] 0.21 0.00 0.23 0.00 0.00
    > system.time(f6c(x, iA))
     [1] 0.17 0.00 0.20 0.00 0.00

I'll let you do the arithmetic.  The byte compiler does get
rid of a fair bit of interpreter overhead (which is large in
these kinds of examples compared to most real code) but there
is still considerable room for improvement.  The byte code
engine currently uses the same internal C code for doing the
actual work as the interpreter, so improvements there would
help both interpreted and byte compiled code.
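
Spelling out that arithmetic, a small sketch using the user-time column of the timings reported above:

     interp   <- c(f1 = 5.43, f2 = 1.72, f3 = 0.19, f4 = 3.78, f5 = 0.99, f6 = 0.21)
     compiled <- c(f1 = 1.77, f2 = 0.63, f3 = 0.14, f4 = 1.26, f5 = 0.30, f6 = 0.17)
     round(interp / compiled, 1)   # speed-up factors, roughly 1.2x to 3.3x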

Best,

luke

On Fri, 22 Apr 2005, Vadim Ogranovich wrote:

If we are on the subject of byte compilation, let me bring up a couple of
examples which have been puzzling me for some time. I'd like to know
a) if compilation is likely to improve the performance for this type of
computation, and b) to at least roughly understand the reasons for the
observed numbers, specifically why x[i]<- assignment is so much slower
than x[i] extraction.

The loops below are typical of any recursive calculation where the new
value of a vector element is based on its immediate neighbour, say to
the left. Specifically, we assign the previous value to the current element.

# this shows that the assignment x[i]<- is the bottleneck in the loop
n = 1e6; iA = seq(2,n); x = double(n); system.time(for (i in iA) x[i] = x[i-1])
[1] 4.29 0.00 4.30 0.00 0.00
n = 1e6; iA = seq(2,n); x = double(n); system.time(for (i in iA) x[i-1])
[1] 1.46 0.01 1.46 0.00 0.00

# the overhead of the loop itself is reasonably low, just 0.17 sec
n = 1e6; iA = seq(2,n); x = double(n); system.time(for (i in iA) 1)
[1] 0.17 0.01 0.17 0.00 0.00

# pure assignment (w/o the extraction x[i]) takes 3.09 sec. Thus x[i] as
# extraction is (3.09 - 0.17)/(0.79 - 0.17) = 4.7 times faster than
# x[i]<- as assignment. This looks a bit odd.
n = 1e6; iA = seq(2,n); x = double(n); system.time(for (i in iA) x[i] = 1.0)
[1] 3.08 0.00 3.09 0.00 0.00

# this shows that just evaluation of (i-1) takes about (0.79 - 0.24) = 0.55 sec
# on my machine (AMD 64 bit). Looks too slow.
n = 1e6; iA = seq(2,n); x = double(n); system.time(for (i in iA) i-1)
[1] 0.79 0.00 0.79 0.00 0.00
n = 1e6; iA = seq(2,n); x = double(n); system.time(for (i in iA) i)
[1] 0.24 0.01 0.24 0.00 0.00

Thanks,
Vadim

-----Original Message-----
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Luke Tierney
Sent: Friday, April 22, 2005 7:33 AM
To: Peter Dalgaard
Cc: Jason Liao; r-help@stat.math.ethz.ch
Subject: Re: [R] when can we expect Prof Tierney's compiled R?

On Wed, 20 Apr 2005, Peter Dalgaard wrote:

Luke Tierney <[EMAIL PROTECTED]> writes:

Vectorized operations in R are also as fast as compiled C (because that
is what they are :-)). A compiler such as the one I'm working on will be
able to make most difference for non-vectorizable or not very
vectorizable code. It may also be able to reduce the need for
intermediate allocations in vectorizable code, which may have other
benefits beyond just speed improvements.

Actually, it has struck me a couple of times that these operations are
not as fast as they could be, since they are outside the scope of fast
BLAS routines, but "embarrassingly parallel" code could easily be
written for the relevant hardware. Even on uniprocessor systems there
might be speedups that the C compiler cannot find (e.g. because it
cannot assume that source and destination of the operation are distinct).

My guess is that for anything beyond basic operations we are doing OK
on uniprocessors, but it would be useful to do some testing to be sure.
For the basic operations I suspect we are paying a heavy price for the
way we handle recycling, both in terms of overhead as such and in terms
of inhibiting compiler optimizations. For performance it would probably
be better to code the scalar-vector, equal-length-vector, and general
cases separately, though keeping the code maintainable may be a bit of
a challenge. Again, testing on a range of platforms and compilers would
be useful.
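
As a rough illustration of those three cases (an assumed example, not from the original message), recycling in elementwise arithmetic looks like this:

     1:6 + 2      # scalar-vector case: the scalar is recycled over the vector
     1:6 + 1:6    # equal-length case: purely elementwise, no recycling needed
     1:6 + 1:4    # general case: the shorter operand is recycled, with a
                  # warning since 6 is not a multiple of 4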

With multiprocessors likely to become more widely available it would be
good to look into ways of factoring the vectorized math code so we can
slide in one that uses threads when appropriate. This should dovetail
nicely with compilation to identify larger vectorized expressions that
can be parallelized as a unit; I hope to look into this a bit this summer.

luke










--
Brian D. Ripley,                  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
