On Mon, 2012-02-13 at 21:22 +0100, Daniele Pianu wrote:
> On Sun, 2012-02-12 at 15:15 -0500, Andreas Kloeckner wrote:
> > I know that to get true overlapping on Nv, those buffers have to be
> > what's called "page-locked" on the Nvidia side. This requires
> > CL_MEM_ALLOC_HOST_PTR (which has a different meaning, as you may
> > know). Also, it seems you're using CUDA 3.2? The Nv CL drivers have
> > matured significantly since 3.2, I'd advise you to use something newer.
>
> Unfortunately, I cannot upgrade my installation, since I'm using the
> laboratory computers and have no administrator privileges. Regarding the
> page-locked memory, you were right. Also, I didn't have to use two
> buffers for each data half: a single buffer with the proper slicing
> suffices. Now everything works beautifully :D For anyone who may find
> the code useful, here it is:
>
> # Allocate the host-side and device-side buffers; ALLOC_HOST_PTR
> # enables pinned (page-locked) memory
> pinInBuffer = cl.Buffer(self.context, clmem.READ_ONLY |
>                         clmem.ALLOC_HOST_PTR, dataSize)
> pinOutBuffer = cl.Buffer(self.context, clmem.WRITE_ONLY |
>                          clmem.ALLOC_HOST_PTR, dataSize)
> devInBuffer = cl.Buffer(self.context, clmem.READ_ONLY, dataSize)
> devOutBuffer = cl.Buffer(self.context, clmem.WRITE_ONLY, dataSize)
>
> # Map the pinned buffers to numpy arrays used for filling and
> # retrieving data
> (dataIn, ev) = cl.enqueue_map_buffer(self.cmdQueues[0], pinInBuffer,
>                                      clmap.WRITE,
>                                      0, (dataSize,), np.uint8, 'C')
> (dataOut, ev) = cl.enqueue_map_buffer(self.cmdQueues[0], pinOutBuffer,
>                                       clmap.READ,
>                                       0, (dataSize,), np.uint8, 'C')
>
> # Fill the mapped array with the input data
> dataIn[:] = np.frombuffer(data, dtype=np.uint8)
>
> # Non-blocking copy of the first half
> # TODO: could it be blocking? Actually we can't start before the first
> # chunk is copied anyway
> cl.enqueue_copy(self.cmdQueues[0], devInBuffer, dataIn[:halfSize],
>                 is_blocking=False)
> self.cmdQueues[0].flush()
>
> # Launch the kernel on the first half
> program.aes_ecb(self.cmdQueues[0], (halfSize>>4,), (256,), keyBuffer,
>                 devInBuffer, devOutBuffer,
>                 T0buff, T1buff, T2buff, T3buff,
>                 np.uint32(0))
Instead, you can use events and wait_for here:
event0 = cl.enqueue_copy(self.cmdQueues[0], devInBuffer,
                         dataIn[:halfSize],
                         is_blocking=False)
program.aes_ecb(self.cmdQueues[0], (halfSize>>4,), (256,), keyBuffer,
                devInBuffer, devOutBuffer,
                T0buff, T1buff, T2buff, T3buff,
                np.uint32(0), wait_for=[event0])

Note that wait_for takes a sequence of events; (event0) is just event0
itself, so use [event0] (or the one-element tuple (event0,)).
This way, OpenCL itself will ensure the proper ordering of commands in the queue, and the intermediate flush() is no longer needed.
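The same idea extends to both halves and to the copy-back: each command takes wait_for from the event of the command it depends on, and the two halves run on separate queues so the second half's transfer can overlap the first half's kernel. Below is a minimal sketch of that pattern, not the original poster's code: the function name run_overlapped, the per-half (dev_in, dev_out, extra_args) tuples, and the use of one device buffer pair per half are all my assumptions.

```python
def run_overlapped(queues, program, data_in, data_out, halves, half_size):
    """Hypothetical sketch: event-chained, double-buffered execution.

    queues  -- a pair of pyopencl CommandQueue objects
    halves  -- a pair of (dev_in, dev_out, extra_args) tuples, one per half
               (assumed layout; the original post sliced a single buffer)
    """
    # Imported here so the sketch can be read/loaded without an OpenCL runtime.
    import pyopencl as cl

    done = []
    for i, (dev_in, dev_out, extra_args) in enumerate(halves):
        q = queues[i]
        lo, hi = i * half_size, (i + 1) * half_size

        # Host -> device copy of this half (non-blocking).
        ev_copy = cl.enqueue_copy(q, dev_in, data_in[lo:hi],
                                  is_blocking=False)

        # The kernel waits only on its own half's copy, so the other
        # queue's transfer can proceed concurrently.
        ev_kernel = program.aes_ecb(q, (half_size >> 4,), (256,),
                                    *extra_args, wait_for=[ev_copy])

        # Device -> host copy-back waits on the kernel.
        ev_out = cl.enqueue_copy(q, data_out[lo:hi], dev_out,
                                 is_blocking=False, wait_for=[ev_kernel])
        done.append(ev_out)

    # A single wait at the end replaces all intermediate flush()/finish().
    cl.wait_for_events(done)
```

With the dependencies stated explicitly, the runtime is free to schedule the two queues' transfers and kernels in whatever overlapped order the hardware allows.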
Best regards
--
Tomasz Rybak GPG/PGP key ID: 2AD5 9860
Fingerprint A481 824E 7DD3 9C0E C40A 488E C654 FB33 2AD5 9860
http://member.acm.org/~tomaszrybak
_______________________________________________
PyOpenCL mailing list
[email protected]
http://lists.tiker.net/listinfo/pyopencl
