Thanks Lev.
My matrix size is going to be large, somewhere near n=100000.
So, how can I fairly benchmark CPU versus GPU matrix math? I thought my
timing technique was good enough, but apparently not.
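For concreteness, here is a minimal sketch of one fairer way to time both
sides (the size is a stand-in and the kernel launch is elided; CUDA events
bracket the GPU work because kernel launches are asynchronous):

import time
import numpy as np
import pycuda.autoinit
import pycuda.driver as driver
import pycuda.gpuarray as gpuarray

n = 2048  # stand-in size; a dense float32 matrix at n=100000 is ~40 GB
a_cpu = np.random.randn(n, n).astype(np.float32)

# CPU side: wall-clock time around the BLAS call
start = time.time()
c_cpu = np.dot(a_cpu, a_cpu)
print "Cpu", time.time() - start

# GPU side: record events around the work and synchronize before
# reading the clock, since launches return before the kernel finishes
a_gpu = gpuarray.to_gpu(a_cpu)
t0, t1 = driver.Event(), driver.Event()
t0.record()
# ... launch the matrix-multiply kernel on a_gpu here ...
t1.record()
t1.synchronize()
print "Gpu", t0.time_till(t1) / 1000.0  # time_till() returns milliseconds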

On Sun, Nov 8, 2015 at 11:33 PM, Lev Givon <[email protected]> wrote:
> Received from Keith Brown on Sun, Nov 08, 2015 at 10:23:18PM EST:
>> I have several thousand matrices for which I need to calculate dot
>> products. So, it seems PyCUDA should do the trick (I hope). I am
>> running into an issue with block sizes.
>>
>> Here is my code
>>
>> #!/usr/bin/env python
>> import sys,time
>> from string import Template
>> import numpy as np
>> from pycuda import driver, compiler, gpuarray, tools
>> from pycuda.compiler import SourceModule
>> import pycuda.autoinit
>>
>>
>> def main():
>>     d={}
>>     size=4
>>     d['size']=size
>>
>>     src=Template("""
>> // naive single-block matrix multiply: the indices use only threadIdx,
>> // so one block must cover the whole matrix (at most 32 x 32 threads)
>> __global__ void MatrixMulKernel(float *a, float *b, float *c)
>> {
>>     int tx = threadIdx.x;
>>     int ty = threadIdx.y;
>>     float Pvalue = 0;
>>
>>     for (int k = 0; k < $size; ++k) {
>>         float Aelement = a[ty * $size + k];
>>         float Belement = b[k * $size + tx];
>>         Pvalue += Aelement * Belement;
>>     }
>>     c[ty * $size + tx] = Pvalue;
>> }
>>     """)
>>
>>     a_cpu = np.random.randn(size,size).astype(np.float32)
>>     b_cpu = np.random.randn(size,size).astype(np.float32)
>>
>>     a_gpu=gpuarray.to_gpu(a_cpu)
>>     b_gpu=gpuarray.to_gpu(b_cpu)
>>     c_gpu = gpuarray.empty((size,size), np.float32)
>>
>>     mod = compiler.SourceModule(src.substitute(d))
>>     mm=mod.get_function("MatrixMulKernel")
>>     start=time.time()
>>     # the block must cover the whole matrix: the kernel has no
>>     # blockIdx logic, and a 16x16 block on a 4x4 matrix would
>>     # read and write out of bounds
>>     mm(a_gpu,b_gpu,c_gpu,
>>             block=(size,size,1),
>>             )
>>     # the launch is asynchronous; get() blocks until the kernel
>>     # finishes, so the copy back is included in the timing
>>     gpu_ans=c_gpu.get()
>>     stop=time.time()
>>     print "Gpu",stop-start
>>
>>     start=time.time()
>>     cpu_ans=np.dot(a_cpu,b_cpu)
>>     stop=time.time()
>>     print "Cpu",stop-start
>>
>>
>>     #print gpu_ans
>>     #print cpu_ans
>>     print np.allclose(gpu_ans,cpu_ans,atol=1e-03)
>>
>>
>> A couple of issues:
>> When I increase the size of the matrix, the GPU result seems to get less
>> accurate than the CPU dot product, so I have to compare with np.allclose.
>
> It isn't necessarily clear that the CPU answer is "more accurate": since the
> summations performed on the GPU may occur in a different order than those on
> the CPU, and since floating point addition is not associative, the difference
> between the GPU and CPU results may become more pronounced for the larger
> summations required when computing the dot product of large matrices.
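>
> For example, a minimal float32 illustration (the same three values summed
> in two orders):
>
> import numpy as np
>
> a = np.float32(1e8)
> b = np.float32(-1e8)
> c = np.float32(0.1)
> print (a + b) + c   # 0.1
> print a + (b + c)   # 0.0, c is absorbed when added to b's large magnitude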
>
>> Also, what is the optimal block size I should be using?
>
> It depends on your matrix size; you generally want to set the block (and grid)
> size to maximize the number of threads active at any given time.
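>
> For example, reusing the names from your script (a sketch; the kernel must
> then fold blockIdx into its index math and bounds-check against n):
>
> n = 1000
> block = (16, 16, 1)
> grid = ((n + block[0] - 1) // block[0],   # ceil(n / 16) blocks in x
>         (n + block[1] - 1) // block[1])   # ceil(n / 16) blocks in y
> # in the kernel: int tx = blockIdx.x * blockDim.x + threadIdx.x;
> #                int ty = blockIdx.y * blockDim.y + threadIdx.y;
> mm(a_gpu, b_gpu, c_gpu, block=block, grid=grid)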
>
> If your matrices are very small (4 x 4), it isn't clear that using the GPU
> will save you much time compared to using numpy because of the cost of
> copying the matrices to and from GPU memory.
>
> Note that if you are dealing with large matrices, you may wish to check out
> the CUBLAS functions for matrix multiplication; a dot() function that uses
> those functions is available in scikit-cuda [1], although the Python code
> that makes the function easy to use may impose some noticeable overhead if
> you plan to invoke it several thousand times.
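>
> A minimal example (assuming scikit-cuda and CUBLAS are installed;
> linalg.dot() dispatches to the CUBLAS gemm routines):
>
> import numpy as np
> import pycuda.autoinit
> import pycuda.gpuarray as gpuarray
> import skcuda.linalg as linalg
>
> linalg.init()  # must be called once before using the CUBLAS wrappers
> a_gpu = gpuarray.to_gpu(np.random.randn(512, 512).astype(np.float32))
> b_gpu = gpuarray.to_gpu(np.random.randn(512, 512).astype(np.float32))
> c_gpu = linalg.dot(a_gpu, b_gpu)  # sgemm under the hood for float32
> print np.allclose(c_gpu.get(), np.dot(a_gpu.get(), b_gpu.get()), atol=1e-3)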
>
> [1] http://scikit-cuda.rtfd.org
> --
> Lev Givon
> Bionet Group | Neurokernel Project
> http://lebedov.github.io/
> http://neurokernel.github.io/
>
