I've been speed testing some code to understand the complexity/speed
trade-off of various approaches. I want to offer my colleagues the
easiest way to use a GPU to get a decent speed-up without forcing
anyone to write C-like code if possible.

Below is a piece of code that applies sin() five times to an array of
100 floats, first via pycuda.cumath and then via an ElementwiseKernel.
The cumath version is consistently a little slower than the
ElementwiseKernel version. I don't really care about the tiny speed
difference, but I do want to understand what's causing it and what's
happening behind the scenes.

I'm trying to follow the underlying code (cumath.py and
elementwise.py) but I'm struggling to follow it all. Could someone
confirm my assumptions please?

Assumptions:
1) When a cumath operation is performed (e.g. cumath.sin()) the result
isn't copied back from the GPU to the CPU (a slow operation).
2) If multiple cumath operations are applied in sequence to a piece of
data, the data stays on the GPU throughout (i.e. it doesn't have to be
copied back to the CPU and then to the GPU again for each subsequent
cumath operation).
3) Calling .get() on a gpuarray (which printing it, e.g. "print sinop"
in the code, does implicitly) is the only thing in the example below
that causes GPU memory to be copied back to the CPU (see the sketch
after this list).
4) cumath operations are a tiny bit slower than the equivalent
ElementwiseKernel because the autogenerated code isn't quite as smart
as my hand-coded ElementwiseKernel (<- total guess!)?
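
To be concrete, here's a tiny sketch (separate from the timed code
below; the variable names are just illustrative) of how I picture
assumptions 1-3 working:

import pycuda.autoinit
import pycuda.gpuarray as gpuarray
import pycuda.cumath
import numpy

# data is copied to the GPU once
a = gpuarray.to_gpu(numpy.arange(100).astype(numpy.float32))
# each cumath call returns a new gpuarray that (I assume) stays on the
# GPU, so chaining them involves no host round-trips (assumptions 1 and 2)
b = pycuda.cumath.sin(pycuda.cumath.sin(a))
# only an explicit .get() (or printing, which calls it) copies the
# result back to the CPU as a numpy array (assumption 3)
b_host = b.get()
print b_host[:5]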

In the code below I've observed that if I apply a .get() (e.g.
sinop.get() or result_kernel_gpu.get()) before recording the second
timer (timer2.record()), the execution time is longer; presumably the
extra time accounts for copying the GPU data back to the CPU. A
minimal sketch of that variant follows.
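
Here is roughly what that variant looks like as a self-contained
snippet (again just illustrative, not the timed code below; the point
is the .get() inside the timed region):

import pycuda.autoinit
import pycuda.driver as drv
import pycuda.gpuarray as gpuarray
import pycuda.cumath
import numpy

timer1 = drv.Event()
timer2 = drv.Event()

a_cumath = gpuarray.to_gpu(numpy.arange(100).astype(numpy.float32))

timer1.record()
sinop = pycuda.cumath.sin(a_cumath)
sinop.get()  # copy the result back to the CPU before stopping the timer
timer2.record()
timer2.synchronize()
print "cumath + .get() took:", timer1.time_till(timer2)*1e-3, "seconds"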

Hoping someone can enlighten me,
Ian.

CODE:
cudatest.py

import pycuda.gpuarray as gpuarray
import pycuda.driver as drv
import pycuda.autoinit
import pycuda.cumath
from pycuda.elementwise import ElementwiseKernel
import numpy

nbr_items = 100 # constant to determine length of our array

# setup two timers
timer1 = drv.Event()
timer2 = drv.Event()
timer1.record()

################
# cumath version

# build a float32 array of 100 floats [0.0, 1.0, ..., 99.0]
a_cumath = gpuarray.to_gpu(
    numpy.array(numpy.arange(0, nbr_items)).astype(numpy.float32))
# make 'sin' into a shortcut for the pycuda sin function
sin = pycuda.cumath.sin
# apply sin 5 times to the array
sinop = sin(sin(sin(sin(sin(a_cumath)))))

timer2.record()
timer2.synchronize()

# print sinop  # use this to visually compare results between cumath and ElementwiseKernel
# roughly this takes 0.0003 seconds on my 9800GT
print "cumath took:", timer1.time_till(timer2)*1e-3," seconds"

###########################
# ElementwiseKernel version

# build a kernel which applies sin() five times to the input
kernel = ElementwiseKernel(
        "float *input, float *output",
        "output[i] = sin(sin(sin(sin(sin(input[i])))))",
        "afn"
        )

timer1.record()

# build the same array [0.0, 1.0, ..., 99.0]
a_kernel = gpuarray.to_gpu(
    numpy.array(numpy.arange(0, nbr_items)).astype(numpy.float32))
# build an empty result array and send it to the gpu
result_kernel_gpu = gpuarray.to_gpu(
    numpy.zeros(nbr_items).astype(numpy.float32))
# call the kernel
kernel(a_kernel, result_kernel_gpu)

timer2.record()
timer2.synchronize()

# print result_kernel_gpu  # use this to visually compare results with the cumath version
# roughly this takes 0.0002 seconds on my 9800GT
print "ElementwiseKernel took:", timer1.time_till(timer2)*1e-3," seconds"



-- 
Ian Ozsvald (A.I. researcher, screencaster)
i...@ianozsvald.com

http://IanOzsvald.com
http://morconsulting.com/
http://TheScreencastingHandbook.com
http://ProCasts.co.uk/examples.html
http://twitter.com/ianozsvald
