On 03/20/2013 09:21 AM, Kalle Raiskila wrote:
> This could be done in several ways. One is to have the host code's main
> thread adding command_nodes (better name needed, perhaps :)) to the queue,
> and a background thread eating up the queue, synchronously. One consumer
> thread is probably needed per device/queue.

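For reference, the arrangement described in the quote could look roughly like the sketch below. It is Python only for brevity, and every name in it (CommandQueue, enqueue, finish, shutdown) is made up for illustration; none of it is pocl or OpenCL API. It assumes one consumer thread per queue that runs commands strictly one at a time.

```python
import queue
import threading


class CommandQueue:
    """Hypothetical stand-in for one device's command queue."""

    _STOP = object()  # sentinel telling the worker thread to exit

    def __init__(self):
        self._pending = queue.Queue()
        self.executed = []  # names of finished commands, in execution order
        self._worker = threading.Thread(target=self._consume, daemon=True)
        self._worker.start()

    def enqueue(self, name, fn):
        """Called from the host's main thread; returns immediately."""
        self._pending.put((name, fn))

    def finish(self):
        """Block until everything enqueued so far has run (cf. clFinish)."""
        self._pending.join()

    def shutdown(self):
        """Ask the worker to exit once the queue has drained."""
        self._pending.put(self._STOP)
        self._worker.join()

    def _consume(self):
        # The single consumer thread: runs commands synchronously, in order.
        while True:
            item = self._pending.get()
            if item is self._STOP:
                self._pending.task_done()
                return
            name, fn = item
            fn()
            self.executed.append(name)
            self._pending.task_done()
```

The host thread only ever calls enqueue() and finish(); all blocking on command execution happens in the worker.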
This can be done in at least two ways:

1) Move the current implementation of clFinish to the device layer (command
queues are device-specific, after all). The device can then execute the queue
as it sees best. This has been my favorite plan so far. In practice it implies
at least one host worker thread per device driver, like you say, because of
the need to block-wait for events in the out-of-order case. The alternative is
an "execute all ready commands" interface which returns when it would have to
block on unsatisfied events. Perhaps we can assume for now that a couple of
host threads are not expensive and simply block-wait in the CQ execution.

2) Keep the task scheduler logic in clFinish and have it command the device
layer on a per-command basis (perhaps using an is_available() API for asking
whether it's OK to add more tasks to the device). This could be made to work
with the current sequential loop in clFinish, executed by multiple
synchronized threads.

However, putting the queue execution in the device layer, and refactoring the
possible task scheduler implementations accordingly, lets the devices execute
larger parts of the command queue independently, with reduced host-device
synchronization. In 2) we need to synchronize at every command; in 1) the
device might execute several commands without needing to bounce back to the
host after every one (if we know no one is depending on the events).

BTW, a point about out-of-order (OOO) queues and in-order queues (IOQ): the
IOQs could be treated as OOO queues in the task scheduler, but with implicit
dependencies (event synchronization?) added. For example, overlapping a buffer
transfer with a kernel execution should be legal even in an IOQ, as long as we
ensure the concurrently transferred buffer is not used by the concurrently
executed kernel.

Multi-device/multi-CQ execution is one open question.

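To make the IOQ-as-OOO idea and the "execute all ready commands" interface a bit more concrete, here is a rough sketch. It is Python only for brevity, and Event, Command, add_implicit_deps and execute_all_ready are hypothetical names, not pocl or OpenCL API. It assumes each command declares the buffers it reads and writes, and it only marks commands as done instead of running anything.

```python
class Event:
    """Hypothetical event; `done` flips when the producing command completes."""

    def __init__(self, name):
        self.name = name
        self.done = False


class Command:
    def __init__(self, name, reads=(), writes=(), wait_for=()):
        self.name = name
        self.reads = set(reads)
        self.writes = set(writes)
        self.wait_for = list(wait_for)  # events that must be done first
        self.event = Event(name)        # signalled when this command completes


def add_implicit_deps(in_order_cmds):
    """Treat an in-order queue as out-of-order: a command waits for an
    earlier one only if both touch a buffer and at least one writes it."""
    for i, cmd in enumerate(in_order_cmds):
        for earlier in in_order_cmds[:i]:
            hazard = (cmd.writes & (earlier.reads | earlier.writes)) or (
                cmd.reads & earlier.writes
            )
            if hazard:
                cmd.wait_for.append(earlier.event)


def execute_all_ready(pending):
    """Run waves of ready commands; return instead of blocking.

    Commands in the same wave have no dependencies on each other, so a
    device could overlap them. Blocked commands are left in `pending`.
    """
    waves = []
    while pending:
        wave = [c for c in pending if all(e.done for e in c.wait_for)]
        if not wave:
            break  # would have to block on an unsatisfied event
        for c in wave:
            c.event.done = True  # stand-in for actually running the command
            pending.remove(c)
        waves.append([c.name for c in wave])
    return waves


def demo():
    other_queue_event = Event("other_queue")  # signalled by another CQ
    cmds = [
        Command("kernel1", reads={"a"}, writes={"out1"}),
        Command("write_b", writes={"b"}),
        Command("kernel2", reads={"b", "out1"}, writes={"out2"}),
        Command("read_out", reads={"out2"}, wait_for=[other_queue_event]),
    ]
    add_implicit_deps(cmds)
    first = execute_all_ready(cmds)   # stops at read_out
    other_queue_event.done = True
    second = execute_all_ready(cmds)  # now read_out can go
    return first, second
```

In demo(), the transfer write_b lands in the same wave as kernel1 because they share no buffer, which is the overlap case mentioned above, while read_out stays pending until the event from the other queue is satisfied.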
If clFinish launches the "task graph execution" and there are multiple devices
(command queues) synchronized with each other through events, how do we ensure
progress? clFinish() blocks, yet we need to launch multiple command queues to
make the whole multi-device task graph proceed.

One way is for the host program to be multithreaded as well, with the clFinish
calls made in separate host threads. Alternatively, the host calls clFlush()
for all command queues and waits for the final event. For the latter to work,
the device drivers need independent command queue worker threads which start
processing the CQ in the background already at clFlush().

-- 
Pekka

_______________________________________________
pocl-devel mailing list
https://lists.sourceforge.net/lists/listinfo/pocl-devel
