Hi Francisco,

Just my two cents on your kernel: I've learned that pow() should be avoided (back in my old days ;-) ). Just try:
distance = sqrt((pos_x[i]-x)*(pos_x[i]-x)+(pos_y[i]-y)*(pos_y[i]-y)+(pos_z[i]-z)*(pos_z[i]-z));

and, to limit memory accesses (although the compiler should do this by itself):

tmpx = (pos_x[i]-x);
tmpy = (pos_y[i]-y);
tmpz = (pos_z[i]-z);
distance = sqrt(tmpx*tmpx + tmpy*tmpy + tmpz*tmpz);

I am really sorry but I can't test it, so if it is stupid feel free to tell me.

Pierre.

On Thursday, 05 April 2012 at 22:30 +0200, Francisco Villaescusa Navarro wrote:
> Hi all,
>
> I have implemented your suggestions by writing the following (in my problem
> I have an array with positions (pos_x, pos_y, pos_z), and I want to compute
> the distances of that set of particles to another one located at (x,y,z),
> and make a histogram of that distribution of distances):
>
> distances_gpu_template = """
> __global__ void dis(float *pos_x, float *pos_y, float *pos_z, float x,
>                     float y, float z, int size, int *aux)
> {
>     unsigned int idx = blockIdx.x*blockDim.x+threadIdx.x;
>     unsigned int idy = blockIdx.y*blockDim.y+threadIdx.y;
>     unsigned int id  = idy*gridDim.x*blockDim.x+idx;
>     int i, bin;
>     const uint interv = %(interv)s;
>     float distance;
>
>     for(i=id; i<size; i+=blockDim.x){
>         distance = sqrt(pow(pos_x[i]-x,2)+pow(pos_y[i]-y,2)+pow(pos_z[i]-z,2));
>         bin = (int)(distance*interv/sqrt(3.01));
>         aux[id*interv+bin] += 1;
>     }
> }
> """
>
> reduction_gpu_template = """
> __global__ void red(int *aux, int *his)
> {
>     unsigned int idx = blockIdx.x*blockDim.x+threadIdx.x;
>     unsigned int idy = blockIdx.y*blockDim.y+threadIdx.y;
>     unsigned int id  = idy*gridDim.x*blockDim.x+idx;
>     const uint interv = %(interv)s;
>
>     for(int i=0; i<512; i++){
>         his[id] += aux[id+interv*i];
>     }
> }
> """
>
> The code runs and generates the same results as numpy.histogram,
> although I can only speed up the calculation by a factor of 10-15,
> whereas I expected this factor to be close to 100 for big arrays.
>
> Do you think that there could be a better approach to the problem?
> Would it be faster to compute the matrix of values with several blocks
> instead of the only one I'm using here (my GPU has 512 "cores")?
>
> I have also checked whether making the reduction on the GPU or the CPU
> changes the results, and the answer is no, since it is a very quick
> operation, and as long as it only has to be done once it does not matter
> which unit, GPU or CPU, is used.
>
> Fran.
>
> On 04/04/2012, at 23:07, David Mertens wrote:
>
> > On Wed, Apr 4, 2012 at 3:51 PM, Pazzula, Dominic J
> > <dominic.j.pazz...@citi.com> wrote:
> >
> >     Basically, yes. Each block calculates its own histogram.
> >     Each block returns an array with the histogram for that
> >     block. On the CPU, sum up those "sub" histograms into the final.
> >
> > I'm not so sure that performing the reduction on the CPU is the
> > right way here. But I could be wrong. If you really want to tackle
> > the problem, you should try the reduction on both the GPU and the
> > CPU and benchmark the results.
> >
> >     From: Francisco Villaescusa Navarro [mailto:villaescusa.franci...@gmail.com]
> >     Sent: Wednesday, April 04, 2012 3:49 PM
> >     To: Pazzula, Dominic J [ICG-IT]
> >     Cc: 'David Mertens'; 'Francisco Villaescusa Navarro'; 'pycuda@tiker.net'
> >     Subject: Re: [PyCUDA] Histograms with PyCUDA
> >
> >     Thanks a lot for the replies!
> >
> >     I'm not sure I fully understand what you are saying, so please let
> >     me say it in my own words (if I'm wrong please let me know):
> >
> >     I transfer the array with the numbers I want to grid to the GPU.
> >     Over each element of that array I overwrite the value of the bin
> >     that corresponds to that element, and I return that array
> >     (containing integer numbers with the positions of the bins) to the
> >     CPU, where I make the reduction.
> >
> > The first half of what you said isn't quite what I proposed. I had
> > in mind that you would allocate a new set of memory on the device
> > with size N_blocks x N_bins. You would have to perform atomic
> > operations on the bin increments, which isn't great for performance
> > because you could serialize multiple updates on the same bin, but at
> > least you're distributing those atomic operations across many
> > processors rather than on a single CPU. Proper bin size is critical
> > for good performance: if your bins are too big, you'll essentially
> > end up with serialized updates; if the bins are too small, you'll
> > allocate far more memory than you need.
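For what it's worth, here is how I read that per-block idea. It is only a rough, untested sketch: NBINS, the unsigned counters, the grid-stride loop and the requirement that part is zeroed before the launch are my own assumptions, not something fixed by the thread.

#define NBINS 512   /* illustrative bin count, pick it to match interv */

/* Each block accumulates its own partial histogram in
   part[blockIdx.x*NBINS .. blockIdx.x*NBINS + NBINS - 1].
   part must hold gridDim.x*NBINS unsigned ints and be zeroed beforehand. */
__global__ void bin_per_block(const float *pos_x, const float *pos_y,
                              const float *pos_z, float x, float y, float z,
                              int size, unsigned int *part)
{
    unsigned int *mine = part + blockIdx.x*NBINS;   /* this block's slice */
    int tid    = blockIdx.x*blockDim.x + threadIdx.x;
    int stride = gridDim.x*blockDim.x;              /* stride over all threads of all blocks */

    for(int i = tid; i < size; i += stride){
        float dx = pos_x[i] - x;                    /* cache the differences... */
        float dy = pos_y[i] - y;
        float dz = pos_z[i] - z;
        float d  = sqrtf(dx*dx + dy*dy + dz*dz);    /* ...and avoid pow() */
        int bin  = (int)(d*NBINS/sqrtf(3.01f));     /* same scaling as in the kernel above */
        if(bin >= NBINS) bin = NBINS - 1;           /* guard against rounding past the last bin */
        atomicAdd(&mine[bin], 1u);                  /* collisions only within this block */
    }
}

The atomicAdd()s only contend between threads of the same block, which is how I understand "distributing those atomic operations across many processors".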
> > On 04/04/2012, at 22:34, Pazzula, Dominic J wrote:
> >
> >     Exactly what I was about to propose. Doing the reduction
> >     would probably be faster on the CPU. NumPy + MKL would
> >     thread what is essentially a series of element-wise array
> >     additions.
> >
> > Actually, I would argue that the reduction of the binning from
> > N_blocks x N_bins down to a single histogram of size N_bins would be
> > very well suited to a parallel implementation, better suited than
> > the binning operation, and should be much faster on the GPU than on
> > the CPU. It also saves you the data transfer back to the CPU when
> > you're done.
> >
> > David
> >
> > --
> > "Debugging is twice as hard as writing the code in the first place.
> > Therefore, if you write the code as cleverly as possible, you are,
> > by definition, not smart enough to debug it." -- Brian Kernighan
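And, in the same untested-sketch spirit, the GPU-side reduction David argues for could look something like this; NBINS must match the binning kernel and nparts is the number of blocks that kernel was launched with (again my own assumptions):

#define NBINS 512   /* must match the binning kernel above */

/* One thread per bin: sum that bin over all per-block partial histograms. */
__global__ void reduce_bins(const unsigned int *part, unsigned int *hist,
                            int nparts)
{
    int bin = blockIdx.x*blockDim.x + threadIdx.x;
    if(bin >= NBINS) return;

    unsigned int total = 0;
    for(int b = 0; b < nparts; b++)
        total += part[b*NBINS + bin];   /* walk this bin across all blocks */
    hist[bin] = total;
}

Launched with enough threads to cover NBINS, this keeps both the binning and the reduction on the GPU, so only the final NBINS counters ever travel back to the host.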