On Tue, May 12, 2020 at 12:06:52PM -0500, Nicholas Geovanis wrote:
>
> You don't mention which distro you are running on the EC2 instance, nor
> whether R or the C libraries differ in release levels. Moreover, that EC2
> instance type is AMD-based not Intel. So if not an apples-to-oranges
> comparison, it might be fujis-to-mcintoshs.
The distro on EC2 was Amazon Linux -- the current Amazon Linux build they
offer that isn't Amazon Linux 2. Sorry, I don't remember the exact build
number. I did also try Amazon Linux 2 and got similar results, tainted by
the fact that I had some trouble on THAT occasion building the tidyverse
libraries in R, and may have ended up with a not-entirely-clean install as
a result. So best to ignore that and concentrate on the Amazon Linux (not
Amazon Linux 2) attempts, of which I had a couple, and which were
consistent.

Locally I am running Buster. On EC2 I commissioned a fresh machine and
installed R from the Amazon Linux repositories, which, by the way, gives
3.4.1 "Single Candle" -- not quite what is in the Debian repositories, but
only different by a minor version. EC2 used to offer Debian but they don't
any more; the closest I could get would be Ubuntu.

But we are talking about the same R code and the same data running with a
13-fold performance difference -- I don't believe that is down to the
Linux distro per se, or to AMD vs Intel. Something else is going on here.
The EC2 box has 128GB of RAM, and I could see the R instance using about
1.3GB, which is what it does on my local box too (24GB of RAM here).

I do get that we are talking about virtualised CPUs on the EC2 box, and
virtualisation introduces a penalty of some sort -- but 13-fold? Compared
to 10-year-old physical hardware? Sounds wrong to me. As I say, something
else is going on here. Especially when one considers, as I mentioned
before, that past experiments with Java have not shown a significant
performance difference between EC2 and my environment.

>
> Long ago I built R from source a couple times a year. It has an
> unfathomable number of libraries and switches, any one of them could have
> a decisive effect on performance. Two different builds could be quite
> different in behavior.
>

Right -- that's what prompted my original question. I was/am hoping
someone might be in a position to say "well, it could be the fact that we
set the XYZ flags in the build of R in Debian..." That would give me a
rabbit hole to chase off down (I've put a couple of checks I plan to try
in a P.S. below). D.R. Evans' helpful point about his experiences with
multi-CPU usage in recent Debian builds of R, for example -- even though
that's not what I'm seeing in my runs, it does imply that thought has gone
into optimal CPU usage in the Debian R build...

Overnight I've now run the job on my own machine, by splitting it into 10
pieces and running them in 5 parallel batches of 2 (sketched in the second
P.S.). I was loath to do that at first, as the machine is old and
self-built and I worried about overheating it, but, me of little faith, it
handled it fine. So the question has become academic, but I would still
like some sort of explanation so I can adjust for the future.

Mark
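
P.S. For the record, here are the checks I plan to run on both boxes to
chase the build-flags theory -- a sketch, not a recipe. As far as I know,
sessionInfo() has reported the BLAS/LAPACK libraries in use since R 3.4.0,
so it should work on both the 3.4.1 build and the Buster one:

  ## Inside R on each box: what is this build actually linked against?
  sessionInfo()     # look for the "BLAS:" and "LAPACK:" lines
  extSoftVersion()  # versions of the external libraries this build uses

  ## From the shell, if the build ships a shared libR (Debian's does):
  ##   ldd "$(R RHOME)/lib/libR.so" | grep -i -E 'blas|lapack'

  ## And a distro-independent timing probe, to see whether the gap
  ## lives in the BLAS or in the interpreter:
  set.seed(1); m <- matrix(rnorm(2000^2), 2000)
  system.time(m %*% m)                   # dominated by the BLAS
  system.time(for (i in 1:5e6) sqrt(i))  # dominated by the interpreter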
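
P.P.S. The overnight split, roughly, in case it's useful to anyone -- a
sketch using the parallel package, where dat and process_piece() stand in
for my actual data frame and per-piece function:

  library(parallel)

  ## Cut the rows into 10 pieces and run them 5 at a time, so each of
  ## the 5 workers gets through 2 pieces over the course of the run.
  pieces  <- split(seq_len(nrow(dat)),
                   cut(seq_len(nrow(dat)), 10, labels = FALSE))
  results <- mclapply(pieces,
                      function(idx) process_piece(dat[idx, , drop = FALSE]),
                      mc.cores = 5)
  out <- do.call(rbind, results)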