Re: [pocl-devel] NVIDIA device backend for POCL

Pekka Jääskeläinen Wed, 20 Mar 2013 02:19:29 -0700

On 03/20/2013 09:21 AM, Kalle Raiskila wrote:
> This could be done in several ways. One is to have the host code's main
> thread adding command_nodes (better name needed, perhaps :)) to the queue,
> and a background thread eating up the queue, synchronously. One consumer
> thread is probably needed per device/queue.


This can be done in at least two ways:

1) By moving the current implementation of clFinish to the device layer
(command queues are device-specific after all). Then the device can execute
the queue as it sees best. This has been my favorite plan so far.

In practice this implies at least one host worker thread per device driver,
like you say, because the out-of-order case needs to block and wait for
events. The alternative is an "execute all ready commands" interface that
returns when it would have to block on unsatisfied events. Perhaps we can
assume for now that a couple of host threads are not expensive and simply
block in the CQ execution.

2) Keep the task scheduler logic in clFinish and let it command the device
layer on a per-command basis (perhaps using an is_available() API to ask
whether it's OK to add more tasks to the device). This could be made to work
with the current sequential loop in clFinish, executed by multiple
synchronized threads.

However, putting the queue execution, and with it the possible task
scheduler implementations, into the device layer lets the devices execute
larger parts of the command queue independently, with reduced host-device
synchronization. In 2) we need to synchronize at every command; in 1) the
device might execute several commands without needing to bounce back to
the host after every one (if we know no one is depending on the events).
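
To make the "execute all ready commands" interface mentioned under 1) a bit
more concrete, here is a minimal sketch in C. All type and function names
(event_t, command_node_t, execute_all_ready) are hypothetical and not taken
from pocl, and running a command is reduced to setting a flag. It only shows
the control flow: run whatever has its event satisfied, and return to the
caller once nothing more can run.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical types for illustration; none of these names come from pocl. */
typedef struct event { bool complete; } event_t;

typedef struct command_node {
    event_t *wait_for;  /* event this command depends on, or NULL */
    event_t *signals;   /* event completed when the command finishes, or NULL */
    bool done;
} command_node_t;

/* Run every command whose dependency is satisfied. Returns the number of
   commands executed. The call returns once a full pass over the queue finds
   nothing runnable: either everything is done, or the remaining commands
   wait on events that something else has to complete. */
static size_t execute_all_ready(command_node_t *cmds, size_t n)
{
    assert(cmds != NULL || n == 0);
    size_t executed = 0;
    bool progress = true;
    while (progress) {
        progress = false;
        for (size_t i = 0; i < n; ++i) {
            command_node_t *c = &cmds[i];
            if (c->done)
                continue;
            if (c->wait_for != NULL && !c->wait_for->complete)
                continue;  /* would have to block: skip for now */
            /* A real driver would launch the kernel or transfer here. */
            c->done = true;
            if (c->signals != NULL)
                c->signals->complete = true;
            ++executed;
            progress = true;
        }
    }
    return executed;
}
```

A caller without a worker thread would invoke this again whenever an outside
event completes; a worker thread would instead block at the point where this
function returns.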

BTW, one point about out-of-order (OOO) and in-order queues (IOQ): the task
scheduler could treat an IOQ as an OOO queue with implicit dependencies
(event synchronization?) added. For example, overlapping a buffer transfer
with a kernel execution should be legal even in an IOQ, as long as we ensure
the buffer being transferred is not used by the concurrently executing
kernel.
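
One possible way to derive such implicit dependencies is sketched below in
C. It assumes each command can list the buffers it reads and writes (here as
a bitmask, one bit per buffer), which is an assumption of this example and
not something the current code provides. The names command_t, conflicts and
add_implicit_deps are hypothetical. A later command waits for an earlier one
only when they touch the same buffer and at least one of them writes it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical command description; bit k of a mask stands for buffer k. */
typedef struct command {
    uint32_t reads;   /* buffers the command reads */
    uint32_t writes;  /* buffers the command writes */
    uint32_t deps;    /* output: bit j set when we must wait for command j */
} command_t;

/* Two commands conflict when they share a buffer and one of them writes it. */
static int conflicts(const command_t *earlier, const command_t *later)
{
    return (earlier->writes & (later->reads | later->writes)) != 0
        || (earlier->reads & later->writes) != 0;
}

/* Turn in-order submission into explicit dependencies: command i depends on
   every earlier command j it conflicts with. Commands with no conflict end
   up with no dependency between them and may overlap. */
static void add_implicit_deps(command_t *cmds, size_t n)
{
    assert(n <= 32);  /* deps is a 32-bit mask in this sketch */
    for (size_t i = 0; i < n; ++i) {
        cmds[i].deps = 0;
        for (size_t j = 0; j < i; ++j)
            if (conflicts(&cmds[j], &cmds[i]))
                cmds[i].deps |= (uint32_t)1 << j;
    }
}
```

With a kernel that reads buffer 0 and writes buffer 1, followed by a transfer
that writes buffer 2, the transfer gets no dependency on the kernel, so the
two could overlap. A second kernel reading buffer 2 would depend on the
transfer only.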

Multi-device/multi-CQ execution is an open question. If clFinish launches the
"task graph execution" and there are multiple devices (command queues) that
are synchronized with each other through events, how do we ensure progress?
clFinish() blocks, yet we need to launch multiple command queues to make
the whole multi-device task graph proceed. One way is for the host program
to be multithreaded too, with the clFinish calls made from separate host
threads. Alternatively, the host calls clFlush() for all command queues
and waits for the final event. For the latter to work, the device drivers
need independent command queue worker threads that start processing the CQ
in the background already at clFlush().
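
The progress problem can be illustrated with a small single-threaded model
in C. It is only a sketch: queue_t, queue_step and drain_all are hypothetical
names, the queues are in-order, and "executing" a command just completes its
event. Draining one queue alone stalls on an event owned by the other queue,
while advancing all queues in turn (which is roughly what background workers
started at clFlush() would achieve) lets the whole graph finish.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model types; not pocl or OpenCL API. */
typedef struct event { bool complete; } event_t;
typedef struct command { event_t *wait_for; event_t *signals; } command_t;
typedef struct queue {
    command_t *cmds;
    size_t count;
    size_t next;  /* index of the first command not yet executed */
} queue_t;

/* Advance one in-order queue until it is empty or its head command has to
   wait for an event. Returns the number of commands executed. */
static size_t queue_step(queue_t *q)
{
    size_t ran = 0;
    while (q->next < q->count) {
        command_t *c = &q->cmds[q->next];
        if (c->wait_for != NULL && !c->wait_for->complete)
            break;
        if (c->signals != NULL)
            c->signals->complete = true;
        ++q->next;
        ++ran;
    }
    return ran;
}

/* Advance all queues in turn. Returns true when every queue has drained,
   false when no queue can make progress (a dependency nobody satisfies). */
static bool drain_all(queue_t *qs, size_t nq)
{
    assert(qs != NULL || nq == 0);
    for (;;) {
        size_t ran = 0, pending = 0;
        for (size_t i = 0; i < nq; ++i) {
            ran += queue_step(&qs[i]);
            pending += qs[i].count - qs[i].next;
        }
        if (pending == 0)
            return true;
        if (ran == 0)
            return false;
    }
}
```

In a real implementation each queue would be advanced by its own worker
thread instead of this round-robin loop, but the dependency pattern is the
same.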

-- 
Pekka

