Hello,
I'm working on a neural network-based project which relies heavily on array
operations. I have moved from NumPy to PyOpenCL in order to speed up these
operations, and have gotten great results (3.7x speedup on my laptop). I'm
looking forward to even better results when I move to a better graphics card.
However, in order to get optimal performance, I want to properly handle
asynchronous behavior, but I am not sure how to do that when using built-in
Array functions (+, -, /, abs(), fill(), etc.).
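To make the distinction concrete, here is a toy, pure-Python model of the
event-based pattern (the Event class and enqueue helper are hypothetical
stand-ins, not the PyOpenCL API): each enqueued operation returns an "event",
and later operations declare which events they depend on via wait_for, instead
of waiting on the whole queue:

```python
# Toy pure-Python model of event-based dependencies (hypothetical names,
# not PyOpenCL): each enqueued operation returns an "event", and later
# operations list the events they depend on via wait_for.

class Event:
    """Marks completion of one enqueued operation."""
    def __init__(self):
        self.completed = False

def enqueue(op, wait_for=()):
    """Run op after its dependencies complete; return its event."""
    assert all(e.completed for e in wait_for), "dependency not finished"
    op()
    evt = Event()
    evt.completed = True
    return evt

order = []
x_evt = enqueue(lambda: order.append("fill x"))
y_evt = enqueue(lambda: order.append("fill y"))
# The multiply depends only on the two fills, not on everything queued:
m_evt = enqueue(lambda: order.append("multiply"), wait_for=[x_evt, y_evt])
```

The point is that a dependency is scoped to the operations named in wait_for,
whereas a finish() call is a barrier on the entire queue.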
I have defined several custom kernels, and used them successfully, along with
some more primitive operations. These custom kernels all return event objects,
which I can then use with the wait_for argument to synchronize execution.
However, it seems that the only way to do this with the built-in functions is
by using queue.finish(), since they do not return event objects. Is there a
more sophisticated way to do so?
Here is some hypothetical code:
###### using built-in functions, and queue.finish() for synchronization ######

def my_method(x, y):
    c = x * y          # built-in Array multiply; no event returned
    queue.finish()     # full barrier: wait for everything on the queue
    event = cl_custom_kernel(c, x, y)
    event.wait()
    return c

def main():
    results = []
    for a in range(10):
        x = cl.zeros(queue, (100, 100), dtype=np.float32)
        y = cl.zeros(queue, (100, 100), dtype=np.float32)
        queue.finish()
        x.fill(1.0)
        y.fill(2.0)
        queue.finish()
        z = my_method(x, y)
        results.append(z)
###### using custom kernels and events for synchronization ######
# cl_fill(arr, val) is a kernel which does arr.fill(val)
# cl_multiply(a, b, c) is a kernel which does c = a*b

def my_method(x, y, c, x_event, y_event):
    mult_event = cl_multiply(x, y, c, wait_for=[x_event, y_event])
    final_event = cl_custom_kernel(c, x, y, wait_for=[mult_event])
    return final_event

def main():
    xs, ys, zs = [], [], []
    for a in range(10):
        x = cl.zeros(queue, (100, 100), dtype=np.float32)
        y = cl.zeros(queue, (100, 100), dtype=np.float32)
        z = cl.zeros(queue, (100, 100), dtype=np.float32)
        xs.append(x)
        ys.append(y)
        zs.append(z)
    queue.finish()
    events = []
    for x, y, z in zip(xs, ys, zs):
        x_evt = cl_fill(x, 1.0)
        y_evt = cl_fill(y, 2.0)
        evt = my_method(x, y, z, x_evt, y_evt)
        events.append(evt)
    for evt in events:
        evt.wait()
Now, let's assume I want each iteration to run in parallel, so that I can
saturate the graphics card. Ideally, we would queue the allocation operations,
then queue the fill operations (but use the wait_for argument so that they wait
until allocation is complete), then we call the method which queues
multiplication, and then queues the custom kernel. But each operation should
wait until the operation before it finishes.
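As a rough illustration only, the dependency structure I'm after can be
emulated in plain Python with concurrent.futures, where a Future plays the
role of an OpenCL event and result() plays the role of wait_for (the fill
and multiply helpers here are hypothetical NumPy stand-ins, not PyOpenCL
calls):

```python
# Pure-Python sketch of the intended dependency chain (hypothetical
# helpers; NumPy stands in for the device arrays). A Future acts as an
# "event"; calling result() on a dependency acts like wait_for=[...].
from concurrent.futures import ThreadPoolExecutor
import numpy as np

pool = ThreadPoolExecutor(max_workers=8)

def fill(arr, val, wait_for=()):
    def work():
        for f in wait_for:
            f.result()          # block until each dependency completes
        arr.fill(val)
        return arr
    return pool.submit(work)    # the returned Future is our "event"

def multiply(x, y, c, wait_for=()):
    def work():
        for f in wait_for:
            f.result()
        np.multiply(x, y, out=c)
        return c
    return pool.submit(work)

results = []
for _ in range(10):
    x = np.zeros((100, 100), dtype=np.float32)
    y = np.zeros((100, 100), dtype=np.float32)
    z = np.zeros((100, 100), dtype=np.float32)
    x_evt = fill(x, 1.0)
    y_evt = fill(y, 2.0)
    # Each iteration's chain waits only on its own fills, so the ten
    # iterations can overlap with one another:
    z_evt = multiply(x, y, z, wait_for=[x_evt, y_evt])
    results.append(z_evt)

final = [f.result() for f in results]  # like evt.wait() on each chain
```

This is only a CPU-side analogy, but it is the shape of scheduling I'd like to
express against the device queue.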
Is there any way to do this using the built-in functions? Or do I have to
build custom kernels for everything so that I have access to the events for
each operation?
Thanks,
Lewis
_______________________________________________
PyOpenCL mailing list
[email protected]
http://lists.tiker.net/listinfo/pyopencl