On Tue, May 12, 2020 at 12:06:52PM -0500, Nicholas Geovanis wrote:
>
> You don't mention which distro you are running on the EC2 instance, nor
> whether R or the C libraries differ in release levels. Moreover, that EC2
> instance type is AMD-based not Intel. So if not an apples-to-oranges
> comparison, it might be fujis-to-mcintoshs.
The distro on EC2 was Amazon Linux -- the current Amazon Linux build they
offer that isn't Amazon Linux 2. Sorry, I don't remember the exact build
number. I did also try Amazon Linux 2 and got similar results, tainted by
the fact that I had some trouble on THAT occasion building the tidyverse
libraries in R, and may have ended up with a not-entirely-clean install as
a result. So best to ignore that and concentrate on the Amazon Linux (not
Amazon Linux 2) attempts, of which I had a couple, and which were
consistent.

Locally I am running Buster. On EC2 I commissioned a fresh machine and
installed R from the Amazon Linux repositories, which, by the way, gives
3.4.1 "Single Candle" -- not quite what is in the Debian repositories, but
only different by a minor version. EC2 used to offer Debian but they don't
any more; the closest I could get would be Ubuntu.

But we are talking about the same R code and the same data running with a
13-fold performance difference -- I don't believe that is down to the
Linux distro per se, or to AMD vs Intel. Something else is going on here.
The EC2 box has 128GB of RAM, and I could see the R instance using about
1.3GB, which is what it does on my local box too (24GB of RAM here).

I do get that we are talking about virtualised CPUs on the EC2 box, and
virtualisation introduces a penalty of some sort -- but 13-fold? Compared
to 10-year-old physical hardware? Sounds wrong to me. As I say, something
else is going on here. Especially when one considers, as I mentioned
before, that past experiments with Java have not shown a significant
performance difference between EC2 and my environment.

>
> Long ago I built R from source a couple times a year. It has an
> unfathomable number of libraries and switches, any one of them could have
> a decisive effect on performance. Two different builds could be quite
> different in behavior.
>

Right -- that's what prompted my original question. I was/am hoping
someone might be in a position to say "well, it could be the fact that we
set the XYZ flags in the build of R in Debian..." That would give me a
rabbit hole to chase off down (I've put a couple of checks I plan to try
in a P.S. below). D.R. Evans' helpful point about his experiences with
multi-CPU usage in recent Debian builds of R, for example -- even though
that's not what I'm seeing in my runs, it does imply that thought has gone
into optimal CPU usage in the Debian R build...

Overnight I've now run the job on my own machine, by splitting it into 10
pieces and running them in 5 parallel batches of 2 (sketched in the second
P.S.). I was loath to do that at first, as the machine is old and
self-built and I worried about overheating it, but, me of little faith, it
handled it fine. So the question has become academic, but I would still
like some sort of explanation so I can adjust for the future.

Mark
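
P.S. For the record, here are the checks I plan to run on both boxes to
chase the build-flags theory -- a sketch, not a recipe. As far as I know,
sessionInfo() has reported the BLAS/LAPACK libraries in use since R 3.4.0,
so it should work on both the 3.4.1 build and the Buster one:

  ## Inside R on each box: what is this build actually linked against?
  sessionInfo()     # look for the "BLAS:" and "LAPACK:" lines
  extSoftVersion()  # versions of the external libraries this build uses

  ## From the shell, if the build ships a shared libR (Debian's does):
  ##   ldd "$(R RHOME)/lib/libR.so" | grep -i -E 'blas|lapack'

  ## And a distro-independent timing probe, to see whether the gap
  ## lives in the BLAS or in the interpreter:
  set.seed(1); m <- matrix(rnorm(2000^2), 2000)
  system.time(m %*% m)                   # dominated by the BLAS
  system.time(for (i in 1:5e6) sqrt(i))  # dominated by the interpreter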
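
P.P.S. The overnight split, roughly, in case it's useful to anyone -- a
sketch using the parallel package, where dat and process_piece() stand in
for my actual data frame and per-piece function:

  library(parallel)

  ## Cut the rows into 10 pieces and run them 5 at a time, so each of
  ## the 5 workers gets through 2 pieces over the course of the run.
  pieces  <- split(seq_len(nrow(dat)),
                   cut(seq_len(nrow(dat)), 10, labels = FALSE))
  results <- mclapply(pieces,
                      function(idx) process_piece(dat[idx, , drop = FALSE]),
                      mc.cores = 5)
  out <- do.call(rbind, results)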