Meant to paste the following in my initial post (this is with a 5000-by-250 matrix... somehow, after I ran it a couple of times, I was getting a memory error with the 10K matrix):
In [79]: wwt_gpu, gputime = matrixmul_opt(Wsub,Wsub.T)
number of registers used: 25
In [80]: gputime
Out[80]: 0.5596442222595215
In [81]: wwt_gpu, gputime = matrixmul_opt(Wsub,Wsub.T)
number of registers used: 25
In [82]: gputime
Out[82]: 0.5658891201019287
In [83]: wwt_gpu, gputime = matrixmul_opt(Wsub,Wsub.T)
number of registers used: 25
In [84]: gputime
Out[84]: 0.5873079299926758
In [85]: t = time(); wwt_cpu = Wsub.dot(Wsub.T); dt_cpu = time()-t
In [86]: dt_cpu
Out[86]: 0.4057891368865967
In [87]: t = time(); wwt_cpu = Wsub.dot(Wsub.T); dt_cpu = time()-t
In [88]: dt_cpu
Out[88]: 0.36435890197753906
In [89]: t = time(); wwt_cpu = Wsub.dot(Wsub.T); dt_cpu = time()-t
In [90]: dt_cpu
Out[90]: 0.3597581386566162

On Sun, Sep 30, 2012 at 6:04 PM, W. Bryan Smith <[email protected]> wrote:
> hi all,
>
> i am just getting started with pycuda, and wanted to check the performance
> on matrix multiplication.
>
> i copied the demo at
> http://wiki.tiker.net/PyCuda/Examples/DemoMetaMatrixmulCheetah, and i can
> get it to run just fine. but the performance, as measured by the returned
> gputime, is consistently a little slower than using numpy's builtin
> linalg.dot() function. i have left all the default settings as defined in
> the demo file, and on a test matrix of size (10000,250) the pycuda version
> of the inner product takes about 10-15% longer than the numpy version. are
> there default settings i can tweak to make this faster? or, alternatively,
> is there something else i should be doing to test this?
>
> I am running the CUDA-5.0 libraries, pycuda 2012.1, and Cheetah 2.4.4 on
> OS X 10.7.4
>
> thanks,
> bryan
>
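For reference, the CPU side of the session above can be reproduced with plain numpy. This is a minimal sketch, not the original run: `matrixmul_opt` and the real `Wsub` come from the demo and the thread, while the random placeholder matrix here (and the smaller 1000-row size, to keep it quick) are assumptions. Repeating the measurement and taking the best of several runs, as the session does, gives a steadier number than a single timing.

```python
import numpy as np
from time import time

# Placeholder data: the thread used a 5000-by-250 Wsub; a smaller random
# matrix (an assumption, not the original data) keeps this quick to run.
Wsub = np.random.rand(1000, 250).astype(np.float32)

# Time the inner product W @ W.T via numpy's BLAS-backed dot, best of 3 runs.
dt_cpu = min(
    (lambda t0: (Wsub.dot(Wsub.T), time() - t0)[1])(time())
    for _ in range(3)
)
wwt_cpu = Wsub.dot(Wsub.T)

print(wwt_cpu.shape)  # (1000, 1000)
print(dt_cpu)
```

Note that `gputime` in the demo measures only the kernel, so any remaining gap against a tuned multithreaded BLAS is about the kernel itself, not transfer overhead.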
_______________________________________________
PyCUDA mailing list
[email protected]
http://lists.tiker.net/listinfo/pycuda
