Dear Christian,

First of all, please make sure to keep the list cc'd in your
replies. Without the benefit of a searchable archive, it is much
harder to justify the time spent answering questions.

Christian Hacker <[email protected]> writes:
> Thank you for your reply. If I'm understanding you correctly, it is
> acceptable to have numpy arrays of objects allocated on the host, and then
> assigning GPUArray instances as elements of those arrays. I didn't take
> into account the overhead from launching the kernel - that may explain why
> things work so slowly. I will attempt to test the simulator with larger
> network topologies once I have pycuda set up on a machine with a
> sufficiently powerful GPU.
>
> If you will indulge my ignorance a little more, there is another problem I
> would request advice for. I have run into a possible bottleneck in the
> learning algorithm, specifically where the simulator must compare the
> calculated error of the current learning cycle to a user-defined threshold
> value to determine if further learning is required. Currently I am storing
> this threshold value in a (1, 1) GPUArray and using the overloaded
> comparison operators to check it against the calculated network error, also
> stored on the GPU. The issue is that the code driving the simulator is all
> host-side: a conditional statement checks the result of the comparison and
> decides whether to continue working. Because a comparison of two
> GPUArrays returns a GPUArray holding a binary integer, while a Python
> conditional requires a plain host-side value, I have no choice but to
> transfer a single binary integer from the device to the host - every
> single learning cycle. Given the variety of operations the simulator must
> conduct each learning cycle, it would be unwieldy, and perhaps impossible,
> to use an if_positive(...) function to sidestep this issue. So, following
> all of that prologue, here is another question:
>
> Is it possible to write a custom kernel (or even a Python function) that
> can return integer values to the Python interpreter after evaluating GPU
> array data, without requiring the transfer of any of that data from the
> device to the host?

Yes, but only in a limited way. With enough mapping/unmapping logic,
device kernels can indeed write to host memory. However, I would
expect the latency incurred in this process to be comparable to (if
not worse than) that of reading from the device.

Quite simply, if data resides on the device, the only way to get it off
of there is a read. Perhaps the best option (and quite an easy one, if I
understand your situation correctly) would be to continue the computation
(overlapped with the transfer) and defer the convergence check until the
transfer finishes. Here's an example of code that does this:

https://github.com/inducer/pycuda/blob/master/pycuda/sparse/cg.py
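To make the control flow concrete without requiring a GPU, here is a minimal host-side sketch of the deferred convergence check. Plain NumPy arrays stand in for GPUArrays, a Jacobi iteration stands in for your learning loop, and the function name `solve_deferred` is my own invention, not part of PyCUDA: the point is only that iteration k+1 is "launched" before the residual of iteration k is inspected, which is where the device-to-host transfer would be overlapped with computation.

```python
import numpy as np

def solve_deferred(A, b, tol=1e-8, max_iters=500):
    """Jacobi iteration with a convergence check deferred by one step.

    On the GPU, the residual norm of iteration k would be copied to the
    host asynchronously while iteration k+1 runs; here `pending_residual`
    models that in-flight transfer, so only the control flow is shown.
    """
    x = np.zeros_like(b)
    D = np.diag(A)
    R = A - np.diagflat(D)
    pending_residual = None  # residual of the previous iteration, "in flight"
    for k in range(max_iters):
        # start iteration k before looking at the previous residual
        x = (b - R @ x) / D
        new_residual = np.linalg.norm(A @ x - b)
        # only now inspect the residual from the *previous* iteration,
        # as if its transfer had just completed
        if pending_residual is not None and pending_residual < tol:
            return x, k
        pending_residual = new_residual
    return x, max_iters

# usage: a small diagonally dominant system, for which Jacobi converges
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x, iters = solve_deferred(A, b)
```

The cost of this scheme is that the loop always runs one iteration past convergence, which is usually a small price next to a forced synchronization on every cycle.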

Andreas


_______________________________________________
PyCUDA mailing list
[email protected]
http://lists.tiker.net/listinfo/pycuda
