Hi everybody,

I have definitely managed to interleave sends, retrieves, and GPU computation with PyOpenCL on Nvidia devices using multiple queues, although I have done the profiling by wrapping the PyOpenCL events on the Python host and enriching the timing information with the originating queue. As far as I know it also works with image objects; if you are interested in an MWE, please let me know.
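In the meantime, here is a minimal sketch of the pattern I mean (the kernel, chunk sizes and the "records" bookkeeping are made up for illustration): one in-order queue per chunk, non-blocking transfers, and every event wrapped on the Python host together with its queue index so the profile counters can be read back afterwards.

import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
props = cl.command_queue_properties.PROFILING_ENABLE
queues = [cl.CommandQueue(ctx, properties=props) for _ in range(5)]

prg = cl.Program(ctx, """
__kernel void twice(__global float *a) { a[get_global_id(0)] *= 2.0f; }
""").build()

n = 1 << 18
chunks = [np.random.rand(n).astype(np.float32) for _ in queues]
outs = [np.empty_like(c) for c in chunks]
bufs = [cl.Buffer(ctx, cl.mem_flags.READ_WRITE, c.nbytes) for c in chunks]

records = []  # (queue index, label, event), kept on the Python host
for i, (q, c, o, b) in enumerate(zip(queues, chunks, outs, bufs)):
    records.append((i, "up", cl.enqueue_copy(q, b, c, is_blocking=False)))
    records.append((i, "krn", prg.twice(q, (n,), None, b)))
    records.append((i, "down", cl.enqueue_copy(q, o, b, is_blocking=False)))

for q in queues:
    q.finish()

for i, label, evt in records:
    dt_ms = (evt.profile.end - evt.profile.start) * 1e-6
    print(f"queue {i} {label:>4}: {dt_ms:.3f} ms")

Sorting the records by evt.profile.start then shows whether the transfers of one queue actually overlap the kernels of another.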

Regards

Jonathan


On 04/24/2018 09:54 PM, pyopencl-requ...@tiker.net wrote:


Today's Topics:

    1. Profiling events in PyOpenCL (Jerome Kieffer)
    2. Re: Profiling events in PyOpenCL (Andreas Kloeckner)
    3. Re: Profiling events in PyOpenCL (Vincent Favre-Nicolin)
    4. Re: Profiling events in PyOpenCL (Jerome Kieffer)
    5. Re: Profiling events in PyOpenCL (Andreas Kloeckner)
    6. Re: Profiling events in PyOpenCL (Jerome Kieffer)


----------------------------------------------------------------------

Message: 1
Date: Fri, 20 Apr 2018 17:26:15 +0200
From: Jerome Kieffer <jerome.kief...@esrf.fr>
To: pyopencl@tiker.net
Subject: [PyOpenCL] Profiling events in PyOpenCL

Dear all,

As some of you may have noticed, Nvidia dropped the ability to profile
OpenCL code as of CUDA 8. I am looking into the profiling information
available in PyOpenCL's events to see whether it would be possible to
regenerate such a profile file from them.

Has anybody looked into this? It would save me from reinventing the wheel.
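To make the question concrete: every event on a profiling-enabled queue
exposes four nanosecond counters (queued/submit/start/end), which should
be enough to rebuild a timeline file offline. A rough sketch of what I
have in mind ("dump_timeline", the CSV layout and the "events" list of
(name, event) pairs are made up for illustration):

def dump_timeline(events, fname="timeline.csv"):
    # events: list of (name, event) pairs collected while enqueueing
    # work on a queue created with PROFILING_ENABLE; the events must
    # be complete before their profile counters are read.
    with open(fname, "w") as f:
        f.write("name,queued_ns,submit_ns,start_ns,end_ns\n")
        for name, evt in events:
            p = evt.profile
            f.write(f"{name},{p.queued},{p.submit},{p.start},{p.end}\n")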

I found some "oddities" while trying to profile multi-queue processing.
I collected ~100 events, evenly distributed over 5 queues.

Every single event has a different command queue (as obtained from
event.command_queue), but they all point to the same object at the
C level according to event.command_queue.int_ptr.
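The check described above, reduced to a self-contained sketch (buffer
contents and sizes are arbitrary):

import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
props = cl.command_queue_properties.PROFILING_ENABLE
queues = [cl.CommandQueue(ctx, properties=props) for _ in range(5)]

a = np.zeros(16, dtype=np.float32)
buf = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, a.nbytes)
events = [cl.enqueue_copy(q, buf, a) for q in queues]

# int_ptr exposes the underlying cl_command_queue handle; with five
# distinct queues this should print 5, but I observe 1.
handles = {e.command_queue.int_ptr for e in events}
print(len(handles), "distinct queue handle(s)")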

This would be consistent with the fact that using multiple queues runs
at exactly the same speed as using only one :(

Did anybody manage to actually interleave sending buffers, retrieving
buffers, and computing on the GPU with PyOpenCL?

Thanks for your help


