On Sun, 2012-02-12 at 15:15 -0500, Andreas Kloeckner wrote:
> I know that to get true overlapping on Nv, those buffers have to be
> what's called "page-locked" on the Nvidia side. This requires
> CL_MEM_ALLOC_HOST_PTR (which has a different meaning, as you may
> know). Also, it seems you're using CUDA 3.2? The Nv CL drivers have
> matured significantly since 3.2, I'd advise you to use something newer.

Unfortunately, I cannot upgrade my installation, since I'm using the
laboratory computers and have no administrator privileges. Regarding the
page-locked memory, you were right. Also, it turns out I didn't need two
buffers, one per data half: a single buffer with the proper slicing
suffices. Now everything works beautifully :D For anyone who may find
the code useful, here it is:


# Imports and flag aliases assumed by the snippet below
import numpy as np
import pyopencl as cl
clmem = cl.mem_flags
clmap = cl.map_flags

# Host buffers allocated with ALLOC_HOST_PTR so the driver hands out
# pinned (page-locked) memory, plus the plain device buffers
pinInBuffer = cl.Buffer(self.context, clmem.READ_ONLY |
                        clmem.ALLOC_HOST_PTR, dataSize)
pinOutBuffer = cl.Buffer(self.context, clmem.WRITE_ONLY |
                         clmem.ALLOC_HOST_PTR, dataSize)
devInBuffer = cl.Buffer(self.context, clmem.READ_ONLY, dataSize)
devOutBuffer = cl.Buffer(self.context, clmem.WRITE_ONLY, dataSize)

# Map the pinned buffers to get numpy arrays for filling and
# retrieving the data
(dataIn,ev) = cl.enqueue_map_buffer(self.cmdQueues[0], pinInBuffer, 
                                    clmap.WRITE,
                                    0, (dataSize,), np.uint8, 'C')
(dataOut,ev) = cl.enqueue_map_buffer(self.cmdQueues[0], pinOutBuffer, 
                                     clmap.READ,
                                     0, (dataSize,), np.uint8, 'C')
            
# Fill the mapped array with the input data
dataIn[:] = np.frombuffer(data, dtype=np.uint8)


# Non-blocking copy of the first half to the device
# TODO: could it be blocking? Actually we can't start before the first
# chunk is copied anyway
cl.enqueue_copy(self.cmdQueues[0], devInBuffer, dataIn[:halfSize],
                is_blocking=False)
self.cmdQueues[0].flush()


# Launch kernel on the first half
program.aes_ecb(self.cmdQueues[0], (halfSize>>4,), (256,), keyBuffer,
                devInBuffer, devOutBuffer,
                T0buff, T1buff, T2buff, T3buff,
                np.uint32(0))
                        
# Start copying the second half on the second queue (this overlaps
# with the first kernel)
cl.enqueue_copy(self.cmdQueues[1], devInBuffer, 
                dataIn[halfSize-roundoffSize:],
                device_offset=halfSize-roundoffSize, is_blocking=False)
            
self.cmdQueues[0].flush()
self.cmdQueues[1].flush()
            
# Launch kernel on the second half
program.aes_ecb(self.cmdQueues[1], (halfSize>>4,), (256,), keyBuffer,
                devInBuffer, devOutBuffer,
                T0buff, T1buff, T2buff, T3buff,
                np.uint32(halfSize>>4))

# Non-blocking read of the first half back into pinned memory
cl.enqueue_copy(self.cmdQueues[0], dataOut[:halfSize], devOutBuffer,
                is_blocking=False)
            
self.cmdQueues[0].flush()
self.cmdQueues[1].flush()

# Finally, read the second half (blocking by default, so this also
# waits for queue 1 to drain)
cl.enqueue_copy(self.cmdQueues[1], dataOut[halfSize-roundoffSize:],
                devOutBuffer,
                device_offset=halfSize-roundoffSize)

# Make sure the non-blocking read of the first half has finished too
self.cmdQueues[0].finish()

result = dataOut

# Done


Daniele


_______________________________________________
PyOpenCL mailing list
[email protected]
http://lists.tiker.net/listinfo/pyopencl
