I'm seeing performance of around a 7.5x speed up on my 8500GT and 1 year old Xeon. 140ms GPU vs 1047ms CPU on 10,000,000 points.
For me, the slower memory throughput kills the idea of using MKL and numpy for the reduction. MKL can do the reduction step in less than .01ms and the GPU is doing it in .9ms. But the transfer time of that 2MB (512x1024 floats) is about 2ms, giving the GPU the edge. I am guessing here, but my thought on the slower than expected performance comes from the writing of the bins. temp_grid[id*interv+bin]+=1.0; If I am remembering correctly, the GPU cannot write to the same memory slot simultaneously. If the memory is aligned such that bin=0 for each core is in the same slot, then if two cores were attempting to adjust the value in bin=0, it would take 2 cycles instead of 1. 10 writing means 10 cycles. Etc. Even if the memory is not aligned, you will be attempting to write to the same block a large number of times. Utilizing shared memory to temporarily hold the bin counts and then putting them into the global output matrix SHOULD increase your time. If I have a chance, I'll work on it and post an updated kernel. Dominic -----Original Message----- From: pycuda-boun...@tiker.net [mailto:pycuda-boun...@tiker.net] On Behalf Of Francisco Villaescusa Navarro Sent: Friday, April 06, 2012 11:26 AM To: Thomas Wiecki Cc: pycuda@tiker.net Subject: Re: [PyCUDA] Histograms with PyCUDA Thanks for all the suggestions! Regarding removing sqrt: it seems that the code only gains about ~1%, and you lose the capacity to easily define linear intervals... I have tried with sqrt and sqrtf, but there is not difference in the total time (or it is very small). The code to find the histogram of an array with values between 0 and 1 should be something as: import numpy as np import time import pycuda.driver as cuda import pycuda.autoinit import pycuda.gpuarray as gpuarray import pycuda.cumath as cumath from pycuda.compiler import SourceModule from pycuda import compiler grid_gpu_template = """ __global__ void grid(float *values, int size, float *temp_grid) { unsigned int id = threadIdx.x; int i,bin; const uint interv = %(interv)s; for(i=id;i<size;i+=blockDim.x){ bin=(int)(values[i]*interv); if (bin==interv){ bin=interv-1; } temp_grid[id*interv+bin]+=1.0; } } """ reduction_gpu_template = """ __global__ void reduction(float *temp_grid, float *his) { unsigned int id = blockIdx.x*blockDim.x+threadIdx.x; const uint interv = %(interv)s; const uint threads = %(max_number_of_threads)s; if(id<interv){ for(int i=0;i<threads;i++){ his[id]+=temp_grid[id+interv*i]; } } } """ number_of_points=100000000 max_number_of_threads=512 interv=1024 blocks=interv/max_number_of_threads if interv%max_number_of_threads!=0: blocks+=1 values=np.random.random(number_of_points).astype(np.float32) grid_gpu = grid_gpu_template % { 'interv': interv, } mod_grid = compiler.SourceModule(grid_gpu) grid = mod_grid.get_function("grid") reduction_gpu = reduction_gpu_template % { 'interv': interv, 'max_number_of_threads': max_number_of_threads, } mod_redt = compiler.SourceModule(reduction_gpu) redt = mod_redt.get_function("reduction") values_gpu=gpuarray.to_gpu(values) temp_grid_gpu =gpuarray.zeros((max_number_of_threads,interv),dtype=np.float32) hist=np.zeros(interv,dtype=np.float32) hist_gpu=gpuarray.to_gpu(hist) start=time.clock()*1e3 grid (values_gpu ,np .int32 (number_of_points ),temp_grid_gpu,grid=(1,1),block=(max_number_of_threads,1,1)) redt(temp_grid_gpu,hist_gpu,grid=(blocks, 1),block=(max_number_of_threads,1,1)) hist=hist_gpu.get() print 'Time used to grid with GPU:',time.clock()*1e3-start,' ms' start=time.clock()*1e3 bins_histo=np.linspace(0.0,1.0,interv+1) hist_CPU=np.histogram(values,bins=bins_histo)[0] print 'Time used to grid with CPU:',time.clock()*1e3-start,' ms' print 'max difference between methods=',np.max(hist_CPU-hist) ################ Results: Time used to grid with GPU: 680.0 ms Time used to grid with CPU: 9320.0 ms max difference between methods= 0.0 So it seems that with this algorithm we can't achieve factors larger than ~15 Fran. _______________________________________________ PyCUDA mailing list PyCUDA@tiker.net http://lists.tiker.net/listinfo/pycuda _______________________________________________ PyCUDA mailing list PyCUDA@tiker.net http://lists.tiker.net/listinfo/pycuda