On Sun, 2012-02-12 at 15:15 -0500, Andreas Kloeckner wrote:
> I know that to get true overlapping on Nv, those buffers have to be
> what's called "page-locked" on the Nvidia side. This requires
> CL_MEM_ALLOC_HOST_PTR (which has a different meaning, as you may
> know). Also, it seems you're using CUDA 3.2? The Nv CL drivers have
> matured significantly since 3.2, I'd advise you to use something newer.
Unfortunately, I can't upgrade my installation: I'm using the
laboratory computers and have no administrator privileges. Regarding the
page-locked memory, you were right. Also, I didn't need two buffers for
each half of the data: a single buffer with the proper slicing
suffices. Now everything works beautifully :D For anyone who may find
the code useful, here it is:
# Allocate the pinned (page-locked) host buffers and the plain device
# buffers
pinInBuffer = cl.Buffer(self.context,
                        clmem.READ_ONLY | clmem.ALLOC_HOST_PTR, dataSize)
pinOutBuffer = cl.Buffer(self.context,
                         clmem.WRITE_ONLY | clmem.ALLOC_HOST_PTR, dataSize)
devInBuffer = cl.Buffer(self.context, clmem.READ_ONLY, dataSize)
devOutBuffer = cl.Buffer(self.context, clmem.WRITE_ONLY, dataSize)
# Get numpy arrays used for filling and retrieving data from
# pinned-memory
(dataIn, ev) = cl.enqueue_map_buffer(self.cmdQueues[0], pinInBuffer,
                                     clmap.WRITE,
                                     0, (dataSize,), np.uint8, 'C')
(dataOut, ev) = cl.enqueue_map_buffer(self.cmdQueues[0], pinOutBuffer,
                                      clmap.READ,
                                      0, (dataSize,), np.uint8, 'C')
# Fill the array obtained from memory maps
dataIn[:] = np.frombuffer(data, dtype=np.uint8)
# Non-blocking copy of the first half
# TODO: could this be blocking? The kernel can't start before the first
# chunk is copied anyway
cl.enqueue_copy(self.cmdQueues[0], devInBuffer, dataIn[:halfSize],
                is_blocking=False)
self.cmdQueues[0].flush()
# Launch kernel on the first half
program.aes_ecb(self.cmdQueues[0], (halfSize>>4,), (256,), keyBuffer,
                devInBuffer, devOutBuffer,
                T0buff, T1buff, T2buff, T3buff,
                np.uint32(0))
# Start copying the second half
cl.enqueue_copy(self.cmdQueues[1], devInBuffer,
                dataIn[halfSize-roundoffSize:],
                device_offset=halfSize-roundoffSize, is_blocking=False)
self.cmdQueues[0].flush()
self.cmdQueues[1].flush()
# Launch kernel on the second half
program.aes_ecb(self.cmdQueues[1], (halfSize>>4,), (256,), keyBuffer,
                devInBuffer, devOutBuffer,
                T0buff, T1buff, T2buff, T3buff,
                np.uint32(halfSize>>4))
# Non-blocking read of the first half
cl.enqueue_copy(self.cmdQueues[0], dataOut[:halfSize], devOutBuffer,
                is_blocking=False)
self.cmdQueues[0].flush()
self.cmdQueues[1].flush()
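# Finally, read the second half (enqueue_copy is blocking by default)
cl.enqueue_copy(self.cmdQueues[1], dataOut[halfSize-roundoffSize:],
                devOutBuffer,
                device_offset=halfSize-roundoffSize)
# The first-half read on queue 0 was non-blocking, so wait for it as
# well before touching dataOut
self.cmdQueues[0].finish()
result = dataOut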
# Done
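
A note on sizes: halfSize and roundoffSize are computed elsewhere and
are not shown in the snippet above. Each kernel launch uses halfSize>>4
work-items (one per 16-byte AES block) with a work-group size of 256,
and on OpenCL 1.x the global size must be a multiple of the local size,
so halfSize needs to be a multiple of 16*256 = 4096 bytes. What follows
is only a minimal sketch of one way to pick such a size. The helper name
split_in_halves and the "round up and pad" policy are assumptions made
for illustration and may differ from what the real code does.

```python
AES_BLOCK_BYTES = 16    # one work-item per 16-byte block (halfSize >> 4)
WORK_GROUP_SIZE = 256   # local size used in the kernel launches above


def split_in_halves(data_size, granule=AES_BLOCK_BYTES * WORK_GROUP_SIZE):
    """Hypothetical helper: return (half_size, padded_size) in bytes.

    half_size is half of data_size rounded up to a multiple of granule,
    so that half_size >> 4 is a valid global size for a local size of 256.
    padded_size is the buffer size needed to hold two such halves.
    """
    if data_size <= 0:
        raise ValueError("data_size must be positive")
    half = -(-data_size // 2)                  # ceil(data_size / 2)
    half_size = -(-half // granule) * granule  # round up to the granule
    return half_size, 2 * half_size
```

For example, split_in_halves(10000) gives (8192, 16384): each half
covers 512 AES blocks, i.e. two work-groups of 256 work-items. With this
policy the buffers would have to be allocated with the padded size, and
the padding bytes are encrypted along with the real data.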
Daniele