Jerome Kieffer <jerome.kief...@esrf.fr> writes:
> As some of you may have noticed, Nvidia dropped the ability to
> profile OpenCL code since CUDA 8. I am looking into the profiling
> info available in PyOpenCL's events to see whether it would be
> possible to regenerate this file.
>
> Did anybody look into this? It would save me from re-inventing the wheel.
>
> I found some "oddities" while trying to profile mulit-queue processing.
> I collected ~100 events, evenly distributed in 5 queues. 
>
> Every single event has a different command queue (as obtained from
> event.command_queue) but they all point to the same object at the
> C-level according to their event.command_queue.int_ptr.
>
> This would be consistent with the fact that using multiple queues runs
> at exactly the same speed as using only one :(
>
> Did anybody manage to (actually) interleave sending buffers, retrieving
> buffers and calculation on the GPU with PyOpenCL ?

I have (at one point) verified that this does work. In order for
overlapped transfers to actually happen, you need to allocate the
host-side end of the transfer with ALLOC_HOST_PTR (or some such--I don't
remember precisely)--the same as 'page-locked' memory in CUDA.

Another (mostly speculative--but interesting) option might be to go
through the (experimental!) CUDA backend for pocl--that goes through the
CUDA API, and, as a result, restores the ability to profile.
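And on the original question of rebuilding a profile file: once the queues are
created with PROFILING_ENABLE, each event exposes profile.queued, profile.submit,
profile.start, and profile.end as nanosecond counters, which should be enough to
reconstruct a timeline. A minimal sketch--the (label, event) list format and the
CSV layout are my assumptions, not an existing PyOpenCL facility:

```python
# Minimal sketch: turn PyOpenCL event profiling counters into CSV rows.
# `events` is assumed to be a list of (label, cl.Event) pairs collected
# from kernels/copies enqueued on PROFILING_ENABLE queues.

def profile_table(events):
    """CSV rows: label, start_us, duration_us, relative to the earliest start."""
    times = [(label, evt.profile.start, evt.profile.end) for label, evt in events]
    t0 = min(start for _, start, _ in times)
    rows = ["label,start_us,duration_us"]
    for label, start, end in sorted(times, key=lambda t: t[1]):
        rows.append("%s,%.3f,%.3f" % (label, (start - t0) / 1e3,
                                      (end - start) / 1e3))
    return rows
```

Sorting by the start counter also makes the multi-queue question above easy to
check: if the intervals from different queues never overlap, nothing is
actually running concurrently.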

Andreas

_______________________________________________
PyOpenCL mailing list
PyOpenCL@tiker.net
https://lists.tiker.net/listinfo/pyopencl
