Hi Andreas,
thanks for the quick answer! It seems that my question is somehow related to
scikits.cuda. The code attached to this email illustrates the issue: by
uncommenting the line "#model.fit(a_gpu, b_gpu)", the profiling output
"compiler.py:185(compile)" vanishes. The manual PyCUDA kernel seems to be
compiled beforehand (during the initialization of the model). The call to
"culinalg.dot", however, still seems to trigger the compiler after the model
has been initialized ...
For this toy example, there seems to be no big time difference for the second
call of "fit" (i.e., whether the first "fit" line is uncommented or not; on my
machine, it takes about 0.41 seconds in both cases). However, for the project I
am working on (which is too big to share), it makes a difference (1.35 seconds
instead of 1.662 seconds, as mentioned in my previous email).
Cheers
Fabian
Andreas Kloeckner <[email protected]> wrote on Thursday,
19 February 2015 at 17:08:
<[email protected]> writes:
> Hi,
>
> is it possible to "precompile" the invoked kernels beforehand? My code makes
> use of several CUDA kernels, which are basically called within a "fit"
> function. Profiling the code with cProfile yields:
>
> 42272 function calls (42228 primitive calls) in 1.662 seconds
> ...
>
> 11 0.000 0.000 0.344 0.031 compiler.py:185(compile)
> 11 0.002 0.000 0.346 0.031 compiler.py:245(__init__)
> 4 0.000 0.000 0.317 0.079 compiler.py:33(preprocess_source)
> 11 0.000 0.000 0.342 0.031 compiler.py:66(compile_plain)
> ...
>
> Thus, about 0.344 of the 1.662 seconds are spent on compiling the
> code. When executing the function "fit" twice, the code is not
> compiled again (hence, saving these 0.344 seconds for the second call
> of "fit"). I would like to somehow precompile all involved kernels as
> soon as the object the "fit" function belongs to is initialized...
>
>
> Can one invoke the overall compilation process beforehand?
Sure! That's what the SourceModule constructor does. Just keep the
instance around.
Andreas
import sys
import math
import copy
import time
import numpy
import logging
import pycuda
import pycuda.gpuarray as gpuarray
import scikits.cuda.linalg as culinalg
from pycuda.compiler import SourceModule
import pycuda.cumath as cumath
import cProfile
class MyModel(object):
def __init__(self, device_id=0):
self.device_id = device_id
self._init_device()
def __del__(self):
self.ctx.pop()
def _init_device(self):
""" Initializes the device.
"""
# sanity check for device id
if (self.device_id < 0) or (self.device_id > pycuda.driver.Device.count() - 1):
raise Exception("Invalid device id. Maximum device id is " + str(pycuda.driver.Device.count() - 1))
# init driver and context
pycuda.driver.init()
self.device = pycuda.driver.Device(self.device_id)
self.ctx = self.device.make_context()
# init linalg for cuda
culinalg.init()
# init kernels
self._init_kernels()
def _init_kernels(self):
kernel_elementwise_multiply_mod = SourceModule("""
__global__ void elementwise_multiply(float *dest, float *a, int n, int m)
{
int tidx = threadIdx.x + blockDim.x * blockIdx.x;
int tidy = threadIdx.y + blockDim.y * blockIdx.y;
if (tidx >= n || tidy >= m){
return;
}
dest[tidx*m + tidy] = a[tidx*m + tidy]*a[tidx*m + tidy];
}
""")
self.kernel_elementwise_multiply = kernel_elementwise_multiply_mod.get_function("elementwise_multiply")
def _elementwise_mult(self, a):
n = a.shape[0]
m = a.shape[1]
dest = gpuarray.empty((n, m), numpy.float32)
blocksize = 16
block = (blocksize, blocksize, 1)
grid = (int(math.ceil(float(n) / blocksize)), int(math.ceil(float(m) / blocksize)))
self.kernel_elementwise_multiply(dest, a, numpy.int32(n), numpy.int32(m), block=block, grid=grid)
return dest
def _mult(self, a, b, transa=False, transb=False):
if transa:
transa = "T"
else:
transa = "N"
if transb:
transb = "T"
else:
transb = "N"
return culinalg.dot(a, b, transa=transa, transb=transb)
def fit(self, a_gpu, b_gpu):
test = self._mult(a_gpu, b_gpu)
test2 = self._elementwise_mult(a_gpu)
return test2 - test
# initialize model
model = MyModel(device_id=0)
# test arrays
a = numpy.random.randn(8000*8000).astype(numpy.float32).reshape((8000,8000))
b = numpy.random.randn(8000*8000).astype(numpy.float32).reshape((8000,8000))
a_gpu = gpuarray.to_gpu(a)
b_gpu = gpuarray.to_gpu(b)
# NOTE: By uncommenting this line, the profiling output "compiler.py:185(compile)" vanishes
#model.fit(a_gpu, b_gpu)
# this call is profiled
def run():
res = model.fit(a_gpu, b_gpu)
print(res.get())
start = time.time()
cProfile.run("run()")
end = time.time()
print("Elapsed time: " + str(end-start))
_______________________________________________
PyCUDA mailing list
[email protected]
http://lists.tiker.net/listinfo/pycuda