Hello,
in the meantime I added a scan function to find out how many
indices will be written by each thread.
These results are written to shared memory, and that part works
fine.
However, the final write of the results to global memory
is very slow and takes up nearly all of the runtime
(18 seconds out of 20).
Is there something I am missing in the following code?
int cnt; // holds the index where this thread starts to write to the
         // global array; this is computed by each thread earlier

// thrd_chk_start and thrd_chk_end delimit the range of data that
// each thread processes.
// Typically (thrd_chk_end - thrd_chk_start) is between 25 and 100.
for (int i = thrd_chk_start; i < thrd_chk_end; i++)
{
    if (condition)
    {
        out[(hIdx * nearNeigh_n) + cnt] = i;
        cnt += 1;
    }
}
The line with out[...] is very slow; does anyone know
a reason for that? Are the indices not known to the compiler
beforehand, or something like that?
All other writes to global memory are far faster than this one.
I tried blockDim.x between 64 and 1024, but it makes little difference.
Kind regards,
Joe
On 14.10.2015 at 21:38, Joe wrote:
Hello,
in the meantime I made a first attempt at this issue.
Unfortunately it is still very slow.
I am using atomicAdd with a shared variable at the moment.
I read somewhere that it is recommended to use shared memory
instead of global memory because it is faster; is this true in general?
This is about the code I am using:
if (condition)
{
    cnt = atomicAdd(&shrd_cnt, 1);
    out[cnt] = i;
}
Is there anything wrong with that?
I am still thinking about how it could be done without
atomicAdd. One idea would be to initialize the
"out" array with e.g. -1,
and then, before each write, check the array to see whether one
of the other threads has already added something:

while (out[j] == -1)
{
    j++;
}
out[j] = i;

But I think this is prone to errors if the memory is accessed
by several threads at the same time.
Andreas, you mentioned doing "a scan within each block".
Do you mean something like this: each thread checks how many indices
it wants to add to the global list and then adds them in a second
step?
Kind regards,
Joe
>> Hello,
I have a question and hope that you can help me.
A block is working on a common problem; the threads each iterate
through a part of the problem.
Now, if some condition is met, a thread should write its threadId
to a 1D output array which is smaller than the total number of threads.
I would rather not store all of the results as integers,
since the condition is only met in very rare cases.
The two options I found would be
1.) to store all results in a bitfield which is as long as there are
threads, and use bitwise atomicAnd, or
2.) to share a common index within a block and use the
return value of atomicAdd to store the threadId there.
Is one of these ideas to be preferred? Or do you have
better suggestions for doing this?
This sounds tricky. A reasonable design might be to allocate space so
that every block has room to write out twice or three times its expected
number of outputs, use a scan within each block to compute indices, and
have some sort of failure indication (+do-over) if the allocated output
space overruns.
Andreas
_______________________________________________
PyCUDA mailing list
[email protected]
http://lists.tiker.net/listinfo/pycuda