I've been speed-testing some code to understand the complexity/speed trade-off of various approaches. I want to offer my colleagues the easiest way to use a GPU for a decent speed-up, without forcing anyone to write C-like code if possible.
Below is a piece of code that applies sin() five times to an array of 100 floats, first via pycuda.cumath and then via an ElementwiseKernel. The cumath version is consistently a bit slower than the ElementwiseKernel version. I don't really care about the tiny speed difference, but I do want to understand what's causing it and what's happening behind the scenes. I'm trying to follow the underlying code (cumath.py and elementwise.py) but I'm having trouble keeping up. Could someone confirm my assumptions, please?

Assumptions:

1) When a cumath operation is performed (e.g. cumath.sin()), the result isn't copied back from the GPU to the CPU (a slow operation).

2) If multiple cumath operations are applied in sequence to a piece of data, the data stays on the GPU throughout (i.e. it doesn't have to be copied back to the CPU and then back up to the GPU for each subsequent cumath operation).

3) Calling .get() on a gpuarray (e.g. via the commented-out "print sinop" in the code) is the only thing in the example below that causes GPU memory to be copied back to the CPU.

4) cumath operations are a tiny bit slower than the equivalent ElementwiseKernel because the autogenerated code isn't quite as smart as my hand-coded ElementwiseKernel (<- total guess!)?

In the code below I've observed that if I apply a .get() (e.g. sinop.get() or result_kernel_gpu.get()) before recording the second timer (timer2.record()), the execution time is longer; presumably the extra time accounts for copying the GPU data back to the CPU.

Hoping someone can enlighten me,
Ian

CODE: cudatest.py

import pycuda.gpuarray as gpuarray
import pycuda.driver as drv
import pycuda.autoinit
import pycuda.cumath
from pycuda.elementwise import ElementwiseKernel
import numpy

nbr_items = 100  # constant to determine length of our array

# set up two timers
timer1 = drv.Event()
timer2 = drv.Event()

timer1.record()

################
# cumath version

# build a float32 array of 100 floats [0.0, 1.0, ..., 99.0]
a_cumath = gpuarray.to_gpu(numpy.arange(0, nbr_items).astype(numpy.float32))

# make 'sin' into a shortcut for the pycuda sin function
sin = pycuda.cumath.sin

# apply sin 5 times to the array
sinop = sin(sin(sin(sin(sin(a_cumath)))))

timer2.record()
timer2.synchronize()

#print sinop  # use this to visually compare results between cumath and ElementwiseKernel

# roughly this takes 0.0003 seconds on my 9800GT
print "cumath took:", timer1.time_till(timer2)*1e-3, "seconds"

###########################
# ElementwiseKernel version

# build a kernel which applies sin() five times to the input
kernel = ElementwiseKernel(
    "float *input, float *output",
    "output[i] = sin(sin(sin(sin(sin(input[i])))))",
    "afn")

timer1.record()

# build the same array [0.0, 1.0, ..., 99.0]
a_kernel = gpuarray.to_gpu(numpy.arange(0, nbr_items).astype(numpy.float32))
# build an empty result array and send it to the gpu
result_kernel_gpu = gpuarray.to_gpu(numpy.zeros(nbr_items).astype(numpy.float32))

# call the kernel
kernel(a_kernel, result_kernel_gpu)

timer2.record()
timer2.synchronize()

#print result_kernel_gpu  # use this to visually compare results between cumath and ElementwiseKernel

# roughly this takes 0.0002 seconds on my 9800GT
print "ElementwiseKernel took:", timer1.time_till(timer2)*1e-3, "seconds"
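P.S. As a sanity check on assumptions 1-3, here is a minimal sketch (separate from the timing code above) of how I read the API: cumath.sin() should hand back a GPUArray that stays in device memory, and only .get() should produce a host-side numpy array.

# sketch: checking that cumath results stay on the GPU until .get() is called
import numpy
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
import pycuda.cumath

a = gpuarray.to_gpu(numpy.arange(100, dtype=numpy.float32))
b = pycuda.cumath.sin(a)   # b should be a GPUArray, still resident in device memory
print type(b)              # expecting <class 'pycuda.gpuarray.GPUArray'>
c = b.get()                # explicit device-to-host copy
print type(c)              # expecting <type 'numpy.ndarray'>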
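And this is the kind of variant I mean when I say the reported time grows if the result is pulled back before the second timer fires (a sketch only; it assumes the imports, nbr_items, sin and the timer events already defined in cudatest.py above):

# sketch: include the device-to-host copy inside the timed region
timer1.record()
a_cumath = gpuarray.to_gpu(numpy.arange(0, nbr_items).astype(numpy.float32))
sinop = sin(sin(sin(sin(sin(a_cumath)))))
result_on_host = sinop.get()   # copy the result back to the CPU before stopping the timer
timer2.record()
timer2.synchronize()
print "cumath incl. copy-back took:", timer1.time_till(timer2)*1e-3, "seconds"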
--
Ian Ozsvald (A.I. researcher, screencaster)
i...@ianozsvald.com
http://IanOzsvald.com
http://morconsulting.com/
http://TheScreencastingHandbook.com
http://ProCasts.co.uk/examples.html
http://twitter.com/ianozsvald