Thanks Lev. My matrix size is going to be large, somewhere near n=100000. So, how can I compare CPU and GPU matrix math? I thought my technique was good enough, but apparently not.
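For a sense of scale, and as one possible way to time the two paths, here is a rough Python 3 sketch. It assumes n is the side length of a dense square float32 matrix. `matrix_bytes` and `best_of` are hypothetical helper names used for illustration; they are not part of PyCUDA or NumPy. The timing helper takes an optional `sync` callable because a GPU kernel launch returns before the work has finished, so the clock should stop only after a synchronization point.

```python
import time


def matrix_bytes(n, itemsize=4):
    """Bytes needed for one dense n x n matrix (itemsize=4 for float32)."""
    return n * n * itemsize


def best_of(fn, repeats=5, sync=None):
    """Return the shortest wall-clock time of `repeats` calls to fn().

    `sync` is an optional callable run before the clock stops, so that
    asynchronous GPU work is included in the measurement.
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()
        best = min(best, time.perf_counter() - start)
    return best


if __name__ == "__main__":
    n = 100000
    print("one matrix: %.1f GB" % (matrix_bytes(n) / 1e9))
    print("a, b and c: %.1f GB" % (3 * matrix_bytes(n) / 1e9))
    # Stand-in workload; in a real test fn would be the whole GPU path
    # (copy in, kernel, copy out) or the np.dot call.
    print("timing demo: %.6f s" % best_of(lambda: sum(range(10000))))
```

Under that assumption a single matrix is about 40 GB, so the three arrays would not fit on a typical card and the product would have to be tiled. For the GPU path, `sync` could be something like `pycuda.driver.Context.synchronize`. Note also that in the script quoted below the timer starts after the kernel launch, so it mostly measures the copy back to the host.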
On Sun, Nov 8, 2015 at 11:33 PM, Lev Givon <[email protected]> wrote:
> Received from Keith Brown on Sun, Nov 08, 2015 at 10:23:18PM EST:
>> I have several thousand matrices where I need to calculate their dot
>> product. So, it seems pyCuda should do the trick (i hope). I am
>> running into an issue with block sizes.
>>
>> Here is my code
>>
>> #!/usr/bin/env python
>> import sys,time
>> from string import Template
>> import numpy as np
>> from pycuda import driver, compiler, gpuarray, tools
>> from pycuda.compiler import SourceModule
>> import pycuda.autoinit
>>
>>
>> def main():
>>     d={}
>>     size=4
>>     d['size']=size
>>
>>     src=Template("""
>>     __global__ void MatrixMulKernel(float *a, float *b, float *c)
>>     {
>>         int tx = threadIdx.x;
>>         int ty = threadIdx.y;
>>         float Pvalue = 0;
>>
>>         for (int k = 0; k < $size; ++k) {
>>             float Aelement = a[ty * $size + k];
>>             float Belement = b[k * $size + tx];
>>             Pvalue += Aelement * Belement;
>>         }
>>         c[ty * $size + tx] = Pvalue;
>>
>>     }
>>     """)
>>
>>     #src.substitute(d)
>>
>>     a_cpu = np.random.randn(size,size).astype(np.float32)
>>     b_cpu = np.random.randn(size,size).astype(np.float32)
>>
>>     a_gpu=gpuarray.to_gpu(a_cpu)
>>     b_gpu=gpuarray.to_gpu(b_cpu)
>>     c_gpu = gpuarray.empty((size,size), np.float32)
>>
>>     src.substitute(d)
>>     mod = compiler.SourceModule(src.substitute(d))
>>     mm=mod.get_function("MatrixMulKernel")
>>     v=mm(a_gpu,b_gpu,c_gpu,
>>         block=(16,16,1),
>>         )
>>     start=time.time()
>>
>>     gpu_ans=c_gpu.get()
>>     stop=time.time()
>>     print "Gpu",stop-start
>>
>>     start=time.time()
>>     cpu_ans=np.dot(a_cpu,b_cpu)
>>     stop=time.time()
>>     print "Cpu",stop-start
>>
>>
>>     #print gpu_ans
>>     #print cpu_ans
>>     print np.allclose(gpu_ans,cpu_ans,atol=1e-03)
>>
>>
>> Couple of issues:
>> When I increase size of matrix it seems it gets less accurate than CPU
>> dot product therefore I have to use np.allclose to compare.
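The accuracy observation above can be reproduced without a GPU. The following is a minimal Python 3 sketch, not a model of what any particular device does: it emulates float32 accumulation with the standard `struct` module and adds the same three numbers in two different orders. `f32` and `sum_f32` are hypothetical helper names introduced for this illustration.

```python
import struct


def f32(x):
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def sum_f32(values):
    """Accumulate left to right, rounding to float32 after every add."""
    total = 0.0
    for v in values:
        total = f32(total + f32(v))
    return total


big = 1.0e8  # exactly representable in float32

# Same three terms, two summation orders.
forward = sum_f32([big, 1.0, -big])    # the 1.0 is lost when added to 1e8
reordered = sum_f32([big, -big, 1.0])  # the large terms cancel first

print(forward, reordered)  # 0.0 1.0
```

Neither order is "the" correct one in float32, which is why a tolerance-based comparison such as np.allclose is a reasonable check; comparing both results against a float64 reference is one way to see which is closer.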
>
> It isn't necessarily clear that the CPU answer is "more accurate"; since the
> summations performed on the GPU may occur in a different order than those on
> the CPU and since floating point addition is not associative, the difference
> between the GPU and CPU results may become more pronounced for the larger
> summations required when computing the dot product of large matrices.
>
>> Also, what is the optimal block size I should be using?
>
> It depends on your matrix size; you generally want to set the block (and grid)
> size to maximize the number of threads active at a specific time.
>
> If your matrices are very small (4 x 4), it isn't clear that using the GPU
> will save you much time compared to using numpy because of the cost of copying
> the matrices to and from GPU memory.
>
> Note that if you are dealing with large matrices, you may wish to check out
> the CUBLAS functions for matrix multiplication; a dot() function that uses
> those functions is available in scikit-cuda [1], although the Python code that
> makes the function easy to use may impose some noticeable overhead if you plan
> to invoke it several thousand times.
>
> [1] http://scikit-cuda.rtfd.org
> --
> Lev Givon
> Bionet Group | Neurokernel Project
> http://lebedov.github.io/
> http://neurokernel.github.io/

_______________________________________________
PyCUDA mailing list
[email protected]
http://lists.tiker.net/listinfo/pycuda
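On the block and grid question, one common convention is to fix the block shape and derive the grid by ceiling division so that every output element is covered by a thread. The Python 3 sketch below only does that arithmetic; `launch_dims` is a hypothetical helper name and the 16 x 16 default is just the block shape from the quoted script, not a recommendation.

```python
def launch_dims(rows, cols, block=(16, 16)):
    """Return (block, grid) tuples covering a rows x cols output matrix.

    block is (threads_x, threads_y); x indexes columns and y indexes rows.
    """
    bx, by = block
    grid_x = (cols + bx - 1) // bx  # ceiling division
    grid_y = (rows + by - 1) // by
    return (bx, by, 1), (grid_x, grid_y)


print(launch_dims(4, 4))        # ((16, 16, 1), (1, 1))
print(launch_dims(1000, 1000))  # ((16, 16, 1), (63, 63))
```

The two tuples would be passed as the `block=` and `grid=` arguments of the kernel call. For this to work the kernel has to compute its row and column from blockIdx, blockDim and threadIdx together, and skip threads whose row or column is not less than the matrix size. The quoted kernel does neither, so with size=4 and a 16 x 16 block most of its threads index past the end of the arrays.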
