Joe <homerun4...@gmail.com> writes:
> In the meantime I added a scan function to find out how many
> indices each thread will write.
> These results are written to shared memory and that part works
> fine.
>
> However, the final write of the results to global memory
> is very slow and takes up nearly all of the runtime (18 seconds out of 20).
>
> Is there something I am missing in the following code?
>
> int cnt;
> // holds the index at which this thread starts writing into the global array
> // (computed by each thread earlier, via the scan)
>
> thrd_chk_start and thrd_chk_end define the range of data that
> each thread processes.
> Typically (thrd_chk_end - thrd_chk_start) is between 25 and 100.
>
>
> for (int i = thrd_chk_start; i < thrd_chk_end; i++)
> {
>     if (condition)
>     {
>         out[(hIdx * nearNeigh_n) + cnt] = i;
>         cnt += 1;
>     }
> }
>
> The line with out[...] is very slow; does anyone know why that is?
> Is it because the indices are not known to the compiler beforehand, or something similar?
> All other writes to global memory are much faster than this.

Depending on how scattered these writes are, it might be helpful to turn
off caching for them. See the CUDA docs for how.
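
For what it's worth, here is a rough, untested sketch of one way to do that,
using an inline-PTX store with the .cs (cache-streaming) operator, which
marks the written data as evict-first since the kernel never re-reads it.
The names store_streaming, scatter and per_thread are invented for the
example, it assumes a 64-bit build on an sm_20 or newer card, and the toy
kernel only stands in for the loop from your post. Newer toolkits also
expose intrinsics such as __stcs()/__stcg(), and ptxas has switches to
change the default caching behavior for a whole compilation unit.

#include <cstdio>

// Store 'val' to global memory with the .cs (cache-streaming) hint,
// marking the line evict-first since this kernel won't re-read it.
__device__ __forceinline__ void store_streaming(int *ptr, int val)
{
    asm volatile("st.global.cs.u32 [%0], %1;" :: "l"(ptr), "r"(val) : "memory");
}

// Toy stand-in for the compaction loop in the original post: each thread
// writes 'per_thread' values starting at an offset it computed earlier
// (the offset plays the role of (hIdx * nearNeigh_n) + cnt).
__global__ void scatter(int *out, int per_thread)
{
    int tid   = blockIdx.x * blockDim.x + threadIdx.x;
    int start = tid * per_thread;
    for (int i = 0; i < per_thread; i++)
        store_streaming(&out[start + i], tid + i);
}

int main()
{
    const int threads = 256, blocks = 64, per_thread = 4;
    const int n = threads * blocks * per_thread;

    int *d_out = 0;
    cudaMalloc(&d_out, n * sizeof(int));
    scatter<<<blocks, threads>>>(d_out, per_thread);
    cudaDeviceSynchronize();

    int h[4];
    cudaMemcpy(h, d_out, sizeof(h), cudaMemcpyDeviceToHost);
    printf("out[0..3] = %d %d %d %d\n", h[0], h[1], h[2], h[3]);

    cudaFree(d_out);
    return 0;
}

Keep in mind that if each thread writes a different, variable number of
elements, the stores from one warp are unlikely to be contiguous, and
uncoalesced stores are slow no matter what the cache does, so this may
only be part of the answer.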

Andreas

