Jerome Kieffer <jerome.kief...@esrf.fr> writes:
> As some of you may have noticed, Nvidia dropped the capability to
> profile OpenCL code since CUDA 8. I am looking into the profiling info
> available in PyOpenCL's events to see if it would be possible to
> re-generate this file.
>
> Did anybody look into this? It would save me from re-inventing the wheel.
>
> I found some "oddities" while trying to profile multi-queue processing.
> I collected ~100 events, evenly distributed across 5 queues.
>
> Every single event has a different command queue (as obtained from
> event.command_queue), but they all point to the same object at the
> C level according to their event.command_queue.int_ptr.
>
> This would be consistent with the fact that using multiple queues runs
> at exactly the same speed as using only one :(
>
> Did anybody manage to (actually) interleave sending buffers, retrieving
> buffers and computation on the GPU with PyOpenCL?
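[Editor's sketch: one way to re-generate a profile file from PyOpenCL event
timestamps. PyOpenCL exposes per-event device timestamps (in nanoseconds) as
evt.profile.start / evt.profile.end, provided the queue was created with
PROFILING_ENABLE. The converter below works on plain (name, queue_id,
start_ns, end_ns) tuples so it can be shown with made-up numbers; the output
is Chrome's trace-event JSON, viewable in chrome://tracing.]

```python
import json

def events_to_trace(records):
    """Convert (name, queue_id, start_ns, end_ns) tuples into
    Chrome trace-event dicts (load the JSON in chrome://tracing)."""
    trace = []
    for name, queue_id, start_ns, end_ns in records:
        trace.append({
            "name": name,
            "ph": "X",                        # "complete" event: start + duration
            "ts": start_ns / 1000.0,          # trace format wants microseconds
            "dur": (end_ns - start_ns) / 1000.0,
            "pid": 0,
            "tid": queue_id,                  # one timeline row per queue
        })
    return trace

# In real code the tuples would come from PyOpenCL events collected on
# queues created with properties=cl.command_queue_properties.PROFILING_ENABLE:
#     records.append(("copy", qid, evt.profile.start, evt.profile.end))
# The timestamps below are invented, for illustration only.
records = [
    ("h2d",    0, 1000, 5000),
    ("kernel", 1, 4000, 9000),   # overlaps the copy on queue 0
]
trace = events_to_trace(records)
print(json.dumps(trace, indent=1))
```

With real events, gaps and serialization between queues become visible at a
glance, which also makes the "all queues map to one object" oddity easy to spot.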
I have (at one point) verified that this does work. In order for
overlapped transfers to actually happen, you need to allocate the
host-side end of the transfer with ALLOC_HOST_PTR (or some such--I don't
remember precisely)--the same as 'page-locked' memory in CUDA.

Another (mostly speculative--but interesting) option might be to go
through the (experimental!) CUDA backend for pocl--that goes through the
CUDA API, and, as a result, restores the ability to profile.

Andreas
_______________________________________________
PyOpenCL mailing list
PyOpenCL@tiker.net
https://lists.tiker.net/listinfo/pyopencl
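[Editor's sketch of the pattern described above: pinned host memory via
ALLOC_HOST_PTR, plus separate queues for transfers and kernels so a copy on
one queue can overlap a kernel on the other. All PyOpenCL calls here assume a
working OpenCL platform and are not exercised by the pure-Python chunking
helper at the bottom; names (q_io, q_krn, prog.scale) are illustrative.]

```python
def split_chunks(n, chunk):
    """Yield (offset, length) pairs covering range(n) in chunk-sized pieces."""
    for off in range(0, n, chunk):
        yield off, min(chunk, n - off)

def pipelined_upload(ctx, n=1 << 20, chunk=1 << 18):
    """Sketch: upload data chunk by chunk on one queue while kernels run
    on another, so copy of chunk i+1 overlaps compute on chunk i.
    Requires an OpenCL platform; device_offset is in bytes."""
    import numpy as np
    import pyopencl as cl
    mf = cl.mem_flags
    q_io  = cl.CommandQueue(ctx)   # host<->device copies
    q_krn = cl.CommandQueue(ctx)   # kernel launches
    # ALLOC_HOST_PTR asks the driver for pinned ("page-locked") host
    # memory; mapping it yields a host array the GPU can DMA asynchronously.
    staging = cl.Buffer(ctx, mf.READ_WRITE | mf.ALLOC_HOST_PTR, n * 4)
    host, _ = cl.enqueue_map_buffer(q_io, staging, cl.map_flags.WRITE,
                                    0, (n,), np.float32)
    dev = cl.Buffer(ctx, mf.READ_WRITE, n * 4)
    uploads = []
    for off, length in split_chunks(n, chunk):
        up = cl.enqueue_copy(q_io, dev, host[off:off + length],
                             device_offset=off * 4, is_blocking=False)
        # A kernel on q_krn should wait only on its own chunk's upload,
        # e.g.:  prog.scale(q_krn, (length,), None, dev, wait_for=[up])
        uploads.append(up)
    cl.wait_for_events(uploads)

# The chunking helper alone can be demonstrated without a device:
print(list(split_chunks(10, 4)))
```

Whether the copies really overlap is then exactly what event profiling (above)
should show; if every queue resolves to the same underlying object, it won't.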