On 08/05/2011 08:36 PM, Simon Urbanek wrote:

On Jul 19, 2011, at 12:56 PM, Simon Urbanek wrote:


On Jul 19, 2011, at 2:26 AM, Prof Brian Ripley wrote:

On Mon, 18 Jul 2011, Alireza Mahani wrote:

Simon,

Thank you for elaborating on the limitations of R in handling float types. I
think I'm pretty much there with you.

As for the insufficiency of single-precision math (and hence limitations of
GPU), my personal take so far has been that double-precision becomes crucial
when some sort of error accumulation occurs. For example, in differential
equations where boundary values are integrated to arrive at interior values,
etc. On the other hand, in my personal line of work (Hierarchical Bayesian
models for quantitative marketing), we have so much inherent uncertainty and
noise at so many levels of the problem (and no significant sources of error
accumulation) that the single vs. double precision issue is often
inconsequential for us. So I think it really depends on the field as well as
the nature of the problem.
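
(A quick, purely illustrative C sketch of the accumulation point; this is not
code from the thread, just a toy: summing 0.1 ten million times shows a visible
drift in single precision, while the double-precision result stays within a
tiny fraction of the exact 1,000,000.)

#include <stdio.h>

int main(void)
{
    /* Add 0.1 ten million times; the exact answer is 1,000,000. */
    float  sf = 0.0f;
    double sd = 0.0;
    for (long i = 0; i < 10000000L; i++) {
        sf += 0.1f;   /* single precision: rounding error accumulates visibly */
        sd += 0.1;    /* double precision: accumulated error is negligible here */
    }
    printf("float : %.4f\ndouble: %.4f\n", sf, sd);
    return 0;
}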

The main reason to use only double precision in R was that on modern CPUs 
double precision calculations are as fast as single-precision ones, and with 
64-bit CPUs they are a single access.  So the extra precision comes 
more-or-less for free.  You also under-estimate the extent to which stability 
of commonly used algorithms relies on double precision.  (There are stable 
single-precision versions, but they are no longer commonly used.  And as Simon 
said, in some cases stability is ensured by using extra precision where 
available.)

I disagree slightly with Simon on GPUs: I am told by local experts that the 
double-precision on the latest GPUs (those from the last year or so) is 
perfectly usable.  See the performance claims on 
http://en.wikipedia.org/wiki/Nvidia_Tesla of about 50% of the SP performance in 
DP.


That would be good news. Unfortunately, those still seem to be targeted at a
specialized market and are not really graphics cards in the traditional sense.
Although that is sort of required for the purpose, it removes the benefit of
ubiquity. So, yes, I agree with you that it may be an interesting way forward,
but I fear it's too much of a niche to be widely supported. I may want to ask
our GPU specialists here to see if they have any around so I could re-visit our
OpenCL R benchmarks. Last time we abandoned our OpenCL plans for R precisely
because of the lack of double-precision speed.


A quick update - it turns out we have a few Tesla/Fermi machines here, so I ran
some very quick benchmarks on them. The test case was the same as in the
original OpenCL comparisons posted here a while ago, when Apple introduced
OpenCL: dnorm on long vectors:

64M, single:
-- GPU -- total: 4894.1 ms, compute: 234.5 ms, compile: 4565.7 ms, real: 328.3 ms
-- CPU -- total: 2290.8 ms

64M, double:
-- GPU -- total: 5448.4 ms, compute: 634.1 ms, compile: 4636.4 ms, real: 812.0 ms
-- CPU -- total: 2415.8 ms

128M, single:
-- GPU -- total: 5843.7 ms, compute: 469.2 ms, compile: 5040.5 ms, real: 803.1 ms
-- CPU -- total: 4568.9 ms

128M, double:
-- GPU -- total: 6042.8 ms, compute: 1093.9 ms, compile: 4583.3 ms, real: 1459.5 ms
-- CPU -- total: 4946.8 ms

The CPU times are based on a dual Xeon X5690 machine (12 cores @ 3.47GHz) using
OpenMP, but are very approximate, because there were two other jobs running on
the machine -- still, it should be a good ballpark figure. The GPU times are
from a Tesla S2050 using OpenCL, addressed as one device, so presumably
comparable to the performance of a single Tesla M2050.
The figures to compare are GPU.real (which is computation + host memory I/O)
and CPU.total, because we can assume that the kernel is compiled in advance,
but the memory transfer cannot be avoided (unless you find a good way to chain
calls, which is not realistic in R).
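
(For orientation, the device side of a benchmark like this is just an
element-wise kernel. The sketch below is illustrative OpenCL C with made-up
names, not the actual benchmark kernel; the single-precision variant would use
float and drop the fp64 pragma.)

/* Illustrative standard-normal density kernel, double-precision variant. */
#pragma OPENCL EXTENSION cl_khr_fp64 : enable

__kernel void dnorm_kernel(__global double *out,
                           __global const double *x,
                           const unsigned int n)
{
    size_t i = get_global_id(0);
    if (i < n)
        out[i] = 0.3989422804014327 * exp(-0.5 * x[i] * x[i]);  /* 1/sqrt(2*pi) */
}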

So the good news is that the new GPUs fulfill their promise: double precision
is only about twice as slow as single precision. They also scale approximately
linearly: the real time for 64M doubles is almost the same as for 128M singles.
And they do outperform the CPUs, although not by an order of magnitude.

The double precision support is very good news, and even though we are still
using the GPUs in a suboptimal manner, they are faster than the CPUs. The only
practical drawback is that using OpenCL requires serious work; it's not as easy
as slapping omp pragmas on existing code. Also, the HPC Teslas are quite
expensive, so I don't expect to see them in desktops anytime soon. However, for
people who are thinking about big computation, it may be an interesting way to
go. Given that it's not mainstream, I don't expect core R to have OpenCL
support just yet, but it may be worth keeping in mind for the future as we
design the parallelization framework in R.
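
(To make the contrast concrete: the "omp pragma" style of CPU parallelization
amounts to something like the following C sketch; the function name and layout
are illustrative, not the benchmark code used above. Compile with -fopenmp.)

#include <math.h>
#include <stddef.h>

/* Illustrative CPU baseline: standard-normal density over a long vector,
   parallelized by a single OpenMP pragma on the loop. */
void dnorm_omp(const double *x, double *out, size_t n)
{
    const double c = 0.3989422804014327;   /* 1 / sqrt(2 * pi) */
    #pragma omp parallel for
    for (long i = 0; i < (long) n; i++)
        out[i] = c * exp(-0.5 * x[i] * x[i]);
}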

+1. Chip vendors nowadays also offer CPU runtimes for executing OpenCL code on
common x86 multi-core CPUs (e.g. the Opteron series or the Core i7 family), so
it may become more ubiquitous soon.
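
(For instance, with such a runtime installed, the standard host API can be
asked for a CPU device directly; a minimal C sketch, nothing beyond the stock
OpenCL calls:)

#include <stdio.h>
#include <CL/cl.h>

int main(void)
{
    cl_platform_id platform;
    cl_device_id device;
    char name[256];

    /* Take the first platform and request a CPU device from it; vendor CPU
       runtimes expose multi-core x86 processors as CL_DEVICE_TYPE_CPU. */
    if (clGetPlatformIDs(1, &platform, NULL) != CL_SUCCESS ||
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, 1, &device, NULL) != CL_SUCCESS) {
        fprintf(stderr, "no CPU OpenCL device found\n");
        return 1;
    }
    clGetDeviceInfo(device, CL_DEVICE_NAME, sizeof(name), name, NULL);
    printf("CPU device: %s\n", name);
    return 0;
}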

Best,
Tobias

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
