Hi Francisco,

Just my 2 cents on your kernel: I've learned that pow should be avoided (in
my old days ;-) ). Just try:

distance=sqrt((pos_x[i]-x)*(pos_x[i]-x)+(pos_y[i]-y)*(pos_y[i]-y)+(pos_z[i]-z)*(pos_z[i]-z));

and to limit memory accesses (though the compiler should do this by itself):

tmpx=(pos_x[i]-x);tmpy=(pos_y[i]-y);tmpz=(pos_z[i]-z);
distance=sqrt(tmpx*tmpx+tmpy*tmpy+tmpz*tmpz);
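
Put together, the inner loop of your kernel would then look something like
this (just a sketch; I also used sqrtf instead of sqrt to stay in single
precision, assuming that is what you want):

for(i=id;i<size;i+=blockDim.x){
    /* temporaries so each position component is read only once */
    float tmpx = pos_x[i]-x;
    float tmpy = pos_y[i]-y;
    float tmpz = pos_z[i]-z;
    distance = sqrtf(tmpx*tmpx + tmpy*tmpy + tmpz*tmpz);
    bin = (int)(distance*interv/sqrtf(3.01f));
    aux[id*interv+bin] += 1;
}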

I am really sorry but I can't test it, so if it is stupid feel free to
tell me.

Pierre.

On Thursday, 5 April 2012 at 22:30 +0200, Francisco Villaescusa Navarro wrote:
> Hi all,
> 
> 
> I have implemented your suggestions as follows (in my problem I have
> an array with positions (pos_x, pos_y, pos_z), and I want to compute
> the distances from that set of particles to another one located at
> (x,y,z), and make a histogram of that distribution of distances):
> 
> 
> distances_gpu_template = """
> __global__ void dis(float *pos_x, float *pos_y, float *pos_z, float x,
> float y, float z, int size, int *aux)
> {
>     unsigned int idx = blockIdx.x*blockDim.x+threadIdx.x;
>     unsigned int idy = blockIdx.y*blockDim.y+threadIdx.y;
>     unsigned int id = idy*gridDim.x*blockDim.x+idx;
>     int i,bin;
>     const uint interv = %(interv)s;
>     float distance;
> 
> 
>     for(i=id;i<size;i+=blockDim.x){
> 
> distance=sqrt(pow(pos_x[i]-x,2)+pow(pos_y[i]-y,2)+pow(pos_z[i]-z,2));
>         bin=(int)(distance*interv/sqrt(3.01));
>         aux[id*interv+bin]+=1;
>     }
> }
> """
> 
> 
> reduction_gpu_template = """
> __global__ void red(int *aux, int *his)
> {
>     unsigned int idx = blockIdx.x*blockDim.x+threadIdx.x;
>     unsigned int idy = blockIdx.y*blockDim.y+threadIdx.y;
>     unsigned int id = idy*gridDim.x*blockDim.x+idx;
>     const uint interv = %(interv)s;
> 
> 
>     for(int i=0;i<512;i++){
>         his[id]+=aux[id+interv*i];
>     }
> }
> """
> 
> 
> The code runs and generates the same results as numpy.histogram,
> but I can only speed up the calculation by a factor of 10-15,
> whereas I expected this factor to be close to 100 for big arrays.
> 
> 
> Do you think there could be a better approach to the problem?
> Would it be faster to compute the matrix of values with several blocks
> instead of the single one I'm using here (my GPU has 512 "cores")?
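
(On the several-blocks question: the usual pattern is a grid-stride loop,
where each thread starts at its global index and jumps by the total number
of threads, so the same kernel works with any grid size. An untested sketch,
with a 1D grid for simplicity:)

int id = blockIdx.x*blockDim.x + threadIdx.x;   /* global thread index */
int stride = blockDim.x*gridDim.x;              /* total number of threads */
for(int i=id; i<size; i+=stride){
    /* same loop body as in dis() above */
}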
> 
> 
> I have also checked whether doing the reduction on the GPU or on the
> CPU changes the results, and it does not: the reduction is a very quick
> operation and only has to be done once, so it doesn't matter whether
> the GPU or the CPU performs it.
> 
> 
> Fran.
> 
> 
> 
> On 04/04/2012, at 23:07, David Mertens wrote:
> 
> > On Wed, Apr 4, 2012 at 3:51 PM, Pazzula, Dominic J
> > <dominic.j.pazz...@citi.com> wrote:
> >         Basically, yes.  Each block calculates its own histogram.
> >         Each block returns an array with the histogram for that
> >         block.  On the CPU, sum up those “sub” histograms into the
> >         final one.
> > 
> > I'm not so sure that performing the reduction on the CPU is the
> > right way here. But I could be wrong. If you really want to tackle
> > the problem, you should try the reduction on both the GPU and the
> > CPU and benchmark the results.
> > 
> >         From: Francisco Villaescusa Navarro
> >         [mailto:villaescusa.franci...@gmail.com]
> >         Sent: Wednesday, April 04, 2012 3:49 PM
> >         To: Pazzula, Dominic J [ICG-IT]
> >         Cc: 'David Mertens'; 'Francisco Villaescusa Navarro';
> >         'pycuda@tiker.net'
> >         Subject: Re: [PyCUDA] Histograms with PyCUDA
> > 
> >         Thanks a lot for the replies!
> > 
> >         I'm not sure I fully understand what you are saying, so
> >         please let me put it in my own words (if I'm wrong, please
> >         let me know):
> > 
> >         I transfer the array with the numbers I want to grid to the
> >         GPU. For each element of that array I overwrite its value
> >         with the index of the bin that element falls into, and I
> >         return that array (now containing integer bin indices) to
> >         the CPU, where I do the reduction.
> > 
> > The first half of what you said isn't quite what I proposed. I had
> > in mind that you would allocate a new set of memory on the device
> > with size N_blocks x N_bins. You would have to perform atomic
> > operations on the bin increments, which isn't great for performance
> > because you could serialize multiple updates on the same bin, but at
> > least you're distributing those atomic operations across many
> > processors rather than on a single CPU. Proper bin size is critical
> > for good performance: if your bins are too big, you'll essentially
> > end up with serialized updates. If the bins are too small, you'll
> > allocate far more memory than you need.
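
(A rough, untested sketch of the per-block histogram David describes, with
the same arguments as the dis() kernel plus an N_blocks x NBINS output
array; the kernel and array names are just placeholders, NBINS plays the
role of %(interv)s, and the card needs to support shared-memory atomics:)

#define NBINS 128   /* number of bins; in the templates this would be %(interv)s */

__global__ void block_hist(float *pos_x, float *pos_y, float *pos_z,
                           float x, float y, float z, int size, int *partial)
{
    __shared__ int hist[NBINS];                   /* this block's sub-histogram */
    for(int b=threadIdx.x; b<NBINS; b+=blockDim.x)
        hist[b] = 0;
    __syncthreads();

    int stride = blockDim.x*gridDim.x;
    for(int i=blockIdx.x*blockDim.x+threadIdx.x; i<size; i+=stride){
        float tmpx = pos_x[i]-x, tmpy = pos_y[i]-y, tmpz = pos_z[i]-z;
        float distance = sqrtf(tmpx*tmpx + tmpy*tmpy + tmpz*tmpz);
        int bin = (int)(distance*NBINS/sqrtf(3.01f));
        atomicAdd(&hist[bin], 1);                 /* serialized only on bin collisions */
    }
    __syncthreads();

    /* write this block's row of the N_blocks x NBINS array */
    for(int b=threadIdx.x; b<NBINS; b+=blockDim.x)
        partial[blockIdx.x*NBINS + b] = hist[b];
}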
> > 
> >         On 04/04/2012, at 22:34, Pazzula, Dominic J wrote:
> > 
> >         Exactly what I was about to propose.  Doing the reduction
> >         would probably be faster on the CPU. NumPy + MKL would
> >         thread what is essentially a series of element-wise array
> >         additions.
> > 
> > Actually, I would argue that the reduction of the binning from
> > N_blocks x N_bins down to a single histogram of size N_bins would be
> > very well suited for a parallel implementation, better suited than
> > the binning operation, and should be much faster on the GPU than the
> > CPU. It also saves you on data transfer back to the CPU when you're
> > done.
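
(Untested again, but the GPU-side reduction David mentions can be as simple
as one thread per bin summing its column of the N_blocks x N_bins array;
the names are placeholders:)

__global__ void reduce_hist(int *partial, int *his, int nblocks, int nbins)
{
    int bin = blockIdx.x*blockDim.x + threadIdx.x;
    if(bin >= nbins) return;
    int sum = 0;
    for(int b=0; b<nblocks; b++)
        sum += partial[b*nbins + bin];   /* this bin, summed over all blocks */
    his[bin] = sum;
}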
> > 
> > David
> > 
> > 
> > -- 
> >  "Debugging is twice as hard as writing the code in the first place.
> >   Therefore, if you write the code as cleverly as possible, you are,
> >   by definition, not smart enough to debug it." -- Brian Kernighan
> > 
> 
> 

_______________________________________________
PyCUDA mailing list
PyCUDA@tiker.net
http://lists.tiker.net/listinfo/pycuda
