On Mon, 2012-02-13 at 21:22 +0100, Daniele Pianu wrote:
> On Sun, 2012-02-12 at 15:15 -0500, Andreas Kloeckner wrote:
> > I know that to get true overlapping on Nv, those buffers have to be
> > what's called "page-locked" on the Nvidia side. This requires
> > CL_MEM_ALLOC_HOST_PTR (which has a different meaning, as you may
> > know). Also, it seems you're using CUDA 3.2? The Nv CL drivers have
> > matured significantly since 3.2, I'd advise you to use something newer.
> 
> Unfortunately, I cannot upgrade my installation since I'm using the
> laboratory computers and I've no administrator privileges. Regarding the
> page-locked memory, you were right. Also, I didn't have to use two
> buffers for each data half: a single buffer with the proper slicing
> suffices. Now everything works beautifully :D For anyone who may find
> the code useful, here it is:
> 
> 
> # Init the host and device memory buffers used to enable
> # pinned memory
> pinInBuffer = cl.Buffer(self.context, clmem.READ_ONLY|
>                         clmem.ALLOC_HOST_PTR, dataSize)
> pinOutBuffer = cl.Buffer(self.context, clmem.WRITE_ONLY|
>                          clmem.ALLOC_HOST_PTR, dataSize)
> devInBuffer = cl.Buffer(self.context, clmem.READ_ONLY, dataSize)
> devOutBuffer = cl.Buffer(self.context, clmem.WRITE_ONLY, dataSize)
> 
> # Get numpy arrays used to fill and retrieve data from
> # pinned memory
> (dataIn,ev) = cl.enqueue_map_buffer(self.cmdQueues[0], pinInBuffer, 
>                                     clmap.WRITE,
>                                     0, (dataSize,), np.uint8, 'C')
> (dataOut,ev) = cl.enqueue_map_buffer(self.cmdQueues[0], pinOutBuffer, 
>                                      clmap.READ,
>                                      0, (dataSize,), np.uint8, 'C')
>             
> # Fill the array obtained from memory maps
> dataIn[:] = np.frombuffer(data, dtype=np.uint8)
> 
> 
> # Non-blocking copy of the first half
> # TODO: could it be blocking? We can't start before the first
> # chunk is copied anyway
> cl.enqueue_copy(self.cmdQueues[0], devInBuffer, dataIn[:halfSize],
>                 is_blocking=False)
> self.cmdQueues[0].flush()
> 
> 
> # Launch kernel on the first half
> program.aes_ecb(self.cmdQueues[0], (halfSize>>4,), (256,), keyBuffer,
>                 devInBuffer, devOutBuffer,
>                 T0buff, T1buff, T2buff, T3buff,
>                 np.uint32(0))

Instead, you can use events and wait_for here:

event0 = cl.enqueue_copy(self.cmdQueues[0], devInBuffer,
                         dataIn[:halfSize],
                         is_blocking=False)
program.aes_ecb(self.cmdQueues[0], (halfSize>>4,), (256,), keyBuffer,
                devInBuffer, devOutBuffer,
                T0buff, T1buff, T2buff, T3buff,
                np.uint32(0), wait_for=[event0])

This way OpenCL will ensure the proper ordering of the commands in the
queue.
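
For completeness, here is a rough sketch of how both halves could be
pipelined with events, reusing the names from the code above. It is
only a sketch under a few assumptions: that self.cmdQueues holds at
least two queues, that enqueue_copy's device_offset keeps both halves
inside the single device buffer, and that the trailing np.uint32
argument of aes_ecb is a 16-byte block offset; adjust these to your
actual setup.

# Sketch: one in-order queue per half.  Within each half the events
# enforce copy -> kernel -> copy back, while the copy of the second
# half can overlap the kernel running on the first half.
done_events = []
for i, queue in enumerate(self.cmdQueues[:2]):
    lo, hi = i * halfSize, (i + 1) * halfSize

    # Non-blocking host->device copy of this half; device_offset keeps
    # both halves inside the single devInBuffer.
    copy_in = cl.enqueue_copy(queue, devInBuffer, dataIn[lo:hi],
                              device_offset=lo, is_blocking=False)

    # Kernel on this half (assumption: the last argument is the block
    # offset, counted in 16-byte AES blocks).
    run = program.aes_ecb(queue, (halfSize >> 4,), (256,), keyBuffer,
                          devInBuffer, devOutBuffer,
                          T0buff, T1buff, T2buff, T3buff,
                          np.uint32(lo >> 4), wait_for=[copy_in])

    # Non-blocking device->host copy back into the pinned output array.
    copy_out = cl.enqueue_copy(queue, dataOut[lo:hi], devOutBuffer,
                               device_offset=lo, is_blocking=False,
                               wait_for=[run])
    done_events.append(copy_out)
    queue.flush()

# Block until both halves have been copied back into dataOut.
cl.wait_for_events(done_events)

The flush() calls only push the commands to the device; it is the
events that guarantee each kernel starts only after its input half has
arrived.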

Best regards 


-- 
Tomasz Rybak  GPG/PGP key ID: 2AD5 9860
Fingerprint A481 824E 7DD3 9C0E C40A  488E C654 FB33 2AD5 9860
http://member.acm.org/~tomaszrybak
