On Fri, 20 Apr 2018 11:17:15 -0500
Andreas Kloeckner <li...@informa.tiker.net> wrote:

> I have (at one point) verified that this does work. In order for
> overlapped transfers to actually happen, you need to allocate the
> host-side end of the transfer with ALLOC_HOST_PTR (or some such--I don't
> remember precisely)--the same as 'page-locked' memory in CUDA.

Yes, this is what Vincent noticed. We are still working on it.
My question was also about all processing and I/O appearing in the same
queue even though they were submitted to different ones. If that is what
actually occurs, I would consider it a bug (unless the profiler forces
everything into a single queue?).
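
For reference, here is a minimal sketch of the pattern under discussion:
one page-locked staging buffer per queue, each filled and then uploaded
with a non-blocking copy. It assumes pyopencl and numpy are installed.
The names chunk_bounds and overlapped_upload are hypothetical and are
not taken from our code. Whether the transfers really overlap depends
on the driver and the device.

```python
def chunk_bounds(n_items, n_chunks):
    """Split range(n_items) into contiguous (start, stop) pairs."""
    step = -(-n_items // n_chunks)  # ceiling division
    return [(i, min(i + step, n_items)) for i in range(0, n_items, step)]


def overlapped_upload(n_items=1 << 20, n_queues=2):
    """Upload one chunk per queue; return (start, end) profile times in ns."""
    # Imported lazily so that chunk_bounds stays usable without OpenCL.
    import numpy as np
    import pyopencl as cl

    mf = cl.mem_flags
    ctx = cl.create_some_context()
    prof = cl.command_queue_properties.PROFILING_ENABLE
    queues = [cl.CommandQueue(ctx, properties=prof) for _ in range(n_queues)]

    events, keep_alive = [], []
    for queue, (start, stop) in zip(queues, chunk_bounds(n_items, n_queues)):
        count = stop - start
        nbytes = count * np.dtype(np.float32).itemsize
        # ALLOC_HOST_PTR asks the driver for host-accessible memory,
        # which is usually page-locked (assumption: driver dependent).
        pinned = cl.Buffer(ctx, mf.READ_WRITE | mf.ALLOC_HOST_PTR, size=nbytes)
        host, _ = cl.enqueue_map_buffer(
            queue, pinned, cl.map_flags.WRITE, 0, (count,), np.float32)
        host[:] = np.arange(start, stop, dtype=np.float32)
        device = cl.Buffer(ctx, mf.READ_WRITE, size=nbytes)
        # Non-blocking copy, so the other queue can be fed right away.
        events.append(cl.enqueue_copy(queue, device, host, is_blocking=False))
        keep_alive.append((pinned, host, device))

    for queue in queues:
        queue.finish()
    return [(evt.profile.start, evt.profile.end) for evt in events]
```

If the (start, end) intervals returned for the two queues never
intersect, the transfers were serialized, at least as far as the
OpenCL profiling counters can tell.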

> Another (mostly speculative--but interesting) option might be to go
> through the (experimental!) CUDA backend for pocl--that goes through
> the CUDA API, and, as a result, restores the ability to profile.

Thanks for the hint. I recompiled pocl with CUDA support (it backports
smoothly to Debian 9) and it works.

nvprof can now profile kernels that run through pocl, whereas it sees
none when the Nvidia OpenCL driver is used.
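
A minimal sketch of how a script can pick the pocl platform explicitly
when both drivers are installed. It assumes that pocl reports itself as
"Portable Computing Language" in its platform name. The names
find_platform_index and make_pocl_context are hypothetical helpers, not
part of pyopencl.

```python
def find_platform_index(names, needle="Portable Computing Language"):
    """Return the index of the first platform name containing needle."""
    for index, name in enumerate(names):
        if needle.lower() in name.lower():
            return index
    return None


def make_pocl_context():
    """Create a context on the GPU devices exposed by the pocl platform."""
    # Imported lazily so that the helper above works without OpenCL.
    import pyopencl as cl

    platforms = cl.get_platforms()
    index = find_platform_index([p.name for p in platforms])
    if index is None:
        raise RuntimeError("no pocl platform found")
    devices = platforms[index].get_devices(device_type=cl.device_type.GPU)
    return cl.Context(devices=devices)
```

A script that builds its context this way can then be started under
the profiler as "nvprof python script.py", where script.py stands for
your own program.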

This information is worth sharing with other OpenCL developers.

Jérôme Kieffer
tel +33 476 882 445

