Hello,

I'm working on a neural-network-based project which relies heavily on array 
operations. I have moved from NumPy to PyOpenCL in order to speed up these 
operations, and have gotten great results (a 3.7x speedup on my laptop). I'm 
looking forward to even better results when I move to a better graphics card. 
However, in order to get optimal performance, I want to handle asynchronous 
behavior properly, but I am not sure how to do that when using the built-in 
Array functions (+, -, /, abs(), fill(), etc.).

I have defined several custom kernels and used them successfully, along with 
some more primitive operations. These custom kernels all return event objects, 
which I can then use with the wait_for argument to synchronize execution. 
However, it seems that the only way to synchronize with the built-in functions 
is to call queue.finish(), since they do not return event objects. Is there a 
more sophisticated way to do this?

Here is some hypothetical code:


###### using built-in functions, and queue.finish() for synchronization ######
def my_method(x,y):
        c = x*y
        queue.finish()
        event = cl_custom_kernel(c,x,y)
        event.wait()
        return c

def main():
        results = []
        for a in range(10):
                x = cl.zeros(queue,(100,100), dtype=np.float32)
                y = cl.zeros(queue,(100,100), dtype=np.float32)
                queue.finish()
                x.fill(1.0)
                y.fill(2.0)
                queue.finish()
                z = my_method(x,y)
                results.append(z)

###### using custom kernels and events for synchronization ######
# cl_fill(arr,val) is a kernel which does arr.fill(val)
# cl_multiply(a,b,c) is a kernel which does c=a*b

def my_method(x,y,c,x_event,y_event):
        mult_event = cl_multiply(x,y,c,wait_for=[x_event,y_event])
        final_event = cl_custom_kernel(c,x,y,wait_for=[mult_event])
        return final_event

def main():
        xs = []
        ys = []
        zs = []
        for a in range(10):
                x = cl.zeros(queue,(100,100), dtype=np.float32)
                y = cl.zeros(queue,(100,100), dtype=np.float32)
                z = cl.zeros(queue,(100,100), dtype=np.float32)
                xs.append(x)
                ys.append(y)
                zs.append(z)
        queue.finish()
        
        events = []
        for x,y,z in zip(xs,ys,zs):
                x_evt = cl_fill(x,1.0)
                y_evt = cl_fill(y,2.0)
                evt = my_method(x,y,z,x_evt,y_evt)
                events.append(evt)
        for evt in events:
                evt.wait()

Now, let's assume I want each iteration to run in parallel, so that I can 
saturate the graphics card. Ideally, we would queue the allocation operations, 
then queue the fill operations (using the wait_for argument so that they wait 
until allocation is complete), then call the method, which queues the 
multiplication and then the custom kernel. Each operation should wait until 
the operation before it finishes.
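One idea I've been toying with, since the default command queue is in-order, is 
to enqueue a marker after a batch of built-in operations and use the marker's 
event as a stand-in for the events those operations never return. A rough 
sketch (I'm not sure this is the intended approach, and I haven't profiled it):

```python
import numpy as np
import pyopencl as cl
import pyopencl.array as cl_array

ctx = cl.create_some_context(interactive=False)
queue = cl.CommandQueue(ctx)  # in-order by default

x = cl_array.zeros(queue, (100, 100), dtype=np.float32)
y = cl_array.zeros(queue, (100, 100), dtype=np.float32)

# The built-in operations don't hand back events...
x.fill(1.0)
y.fill(2.0)

# ...but a marker enqueued afterwards completes only once everything
# queued before it on this in-order queue has finished, so its event
# can stand in for the missing ones.
fills_done = cl.enqueue_marker(queue)

c = x * y
mult_done = cl.enqueue_marker(queue)

# A custom kernel could then take wait_for=[mult_done], e.g.:
#   cl_custom_kernel(c, x, y, wait_for=[mult_done])
mult_done.wait()
```

Of course, on a single in-order queue the marker mostly just gives me a handle 
to wait on; to actually overlap iterations I suspect I would need multiple 
queues (or an out-of-order queue), with the marker events providing the 
cross-queue dependencies. I also noticed that Array instances carry an .events 
list, which looks related, but I haven't figured out whether the built-in 
operations populate it.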

Is there any way to do this using the built-in functions? Or do I have to 
build custom kernels for everything so that I have access to the events for 
each operation?

Thanks,
Lewis

_______________________________________________
PyOpenCL mailing list
[email protected]
http://lists.tiker.net/listinfo/pyopencl
