On Fri, 20 Apr 2018 11:17:15 -0500 Andreas Kloeckner <li...@informa.tiker.net> wrote:
> I have (at one point) verified that this does work. In order for
> overlapped transfers to actually happen, you need to allocate the
> host-side end of the transfer with ALLOC_HOST_PTR (or some such--I don't
> remember precisely)--the same as 'page-locked' memory in CUDA.

Yes, this is what Vincent noticed. We are still working on it.

My question was also about all processing and I/O appearing in the same
queue even though they were submitted to different ones. If that is what
actually occurs, I would consider it a bug (unless the profiler enforces
a single queue?).

> Another (mostly speculative--but interesting) option might be to go
> through the (experimental!) CUDA backend for pocl--that goes through
> the CUDA API, and, as a result, restores the ability to profile.

Thanks for the hint. I recompiled pocl with CUDA support (it backports
smoothly to Debian 9) and it works. nvprof is now usable for profiling
kernels run through pocl, whereas it sees none when using the Nvidia
OpenCL driver.

This is information worth sharing with other OpenCL developers.

Cheers,
--
Jérôme Kieffer
tel +33 476 882 445

_______________________________________________
PyOpenCL mailing list
PyOpenCL@tiker.net
https://lists.tiker.net/listinfo/pyopencl
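
Addendum: one way to check, without an external profiler, whether work submitted to two queues really overlaps is to compare OpenCL event timestamps. The following is a minimal sketch and not the code discussed in this thread. The names `intervals_overlap`, `profile_two_queues` and the `busy` kernel are made up for illustration, and it assumes that both queues are on the same device so that their profiling timestamps are comparable. The staging buffer is allocated with ALLOC_HOST_PTR and mapped, as suggested in the quoted text.

```python
def intervals_overlap(a, b):
    """Return True if intervals a=(start, end) and b=(start, end) share any time."""
    return a[0] < b[1] and b[0] < a[1]


def profile_two_queues(n=1 << 24):
    """Run a kernel on one queue and an upload on another; report whether they overlapped."""
    # Imported lazily so that intervals_overlap() is usable without OpenCL installed.
    import numpy as np
    import pyopencl as cl

    ctx = cl.create_some_context()
    props = cl.command_queue_properties.PROFILING_ENABLE
    q_compute = cl.CommandQueue(ctx, properties=props)
    q_io = cl.CommandQueue(ctx, properties=props)
    mf = cl.mem_flags
    nbytes = 4 * n  # float32

    # Host-side staging area allocated by the driver (typically page-locked).
    pinned = cl.Buffer(ctx, mf.READ_WRITE | mf.ALLOC_HOST_PTR, size=nbytes)
    host, _ = cl.enqueue_map_buffer(
        q_io, pinned, cl.map_flags.WRITE, 0, (n,), np.float32)
    host[:] = 1.0

    d_in = cl.Buffer(ctx, mf.READ_WRITE, size=nbytes)
    d_work = cl.Buffer(ctx, mf.READ_WRITE, size=nbytes)

    # Hypothetical kernel whose only purpose is to keep the device busy.
    prg = cl.Program(ctx, """
    __kernel void busy(__global float *x) {
        int i = get_global_id(0);
        float v = (float)i;
        for (int k = 0; k < 2000; ++k) v = v * 0.999f + 0.5f;
        x[i] = v;
    }
    """).build()

    ev_kernel = prg.busy(q_compute, (n,), None, d_work)
    q_compute.flush()  # submit the kernel before enqueuing the transfer
    ev_copy = cl.enqueue_copy(q_io, d_in, host, is_blocking=False)
    q_compute.finish()
    q_io.finish()

    # Profiling timestamps are in nanoseconds.
    kernel_span = (ev_kernel.profile.start, ev_kernel.profile.end)
    copy_span = (ev_copy.profile.start, ev_copy.profile.end)
    return intervals_overlap(kernel_span, copy_span)
```

Calling `profile_two_queues()` returns True when the upload ran at least partly while the kernel was executing. A False result does not by itself prove a driver bug, since a short kernel or a device with no separate copy engine would give the same answer.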