Hi Andreas,
I didn't know what settings you had in your temp_cl_profiler.conf file,
so I made a run with the default settings.
$ python matrix-multiply.py
$ cat opencl_profile.log
# OPENCL_PROFILE_LOG_VERSION 1.0
# OPENCL_DEVICE 0 Tesla C1060
# TIMESTAMPFACTOR fcec360a5d451e4
method,gputime,cputime,occupancy
method=[ memcpyHtoDasync ] gputime=[ 3487.264 ] cputime=[ 4092.000 ]
method=[ memcpyHtoDasync ] gputime=[ 2722.880 ] cputime=[ 3283.000 ]
method=[ matrixMul ] gputime=[ 91482.430 ] cputime=[ 13.000 ] occupancy=[ 0.750 ]
method=[ matrixMul ] gputime=[ 91467.359 ] cputime=[ 12.000 ] occupancy=[ 0.750 ]
method=[ matrixMul ] gputime=[ 91516.477 ] cputime=[ 13.000 ] occupancy=[ 0.750 ]
method=[ memcpyDtoHasync ] gputime=[ 1119.680 ] cputime=[ 1893.000 ]
method=[ memcpyDtoHasync ] gputime=[ 2664.768 ] cputime=[ 3464.000 ]
method=[ memcpyDtoHasync ] gputime=[ 1117.952 ] cputime=[ 1823.000 ]
method=[ memcpyDtoHasync ] gputime=[ 2690.976 ] cputime=[ 3514.000 ]
The last 3 lines are the extra memory transfers. gputime and cputime are
not accurate because other jobs were running on the host computer at the
time. For instance, gputime for the 1st, 2nd, 7th and last rows should
be equal (5.12 MB transfers).
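In case it helps to compare runs, log lines in this format can be pulled apart with a short script. This is just a sketch based on the sample output above (the field layout is taken from that output, not from any official format spec; the profiler reports times in microseconds):

```python
import re

# Matches lines like:
#   method=[ matrixMul ] gputime=[ 91482.430 ] cputime=[ 13.000 ]
# Trailing fields such as occupancy=[ ... ] are ignored.
LINE_RE = re.compile(
    r"method=\[\s*(?P<method>\S+)\s*\]\s*"
    r"gputime=\[\s*(?P<gputime>[\d.]+)\s*\]\s*"
    r"cputime=\[\s*(?P<cputime>[\d.]+)\s*\]"
)

def parse_profile(text):
    """Return a list of (method, gputime_us, cputime_us) records."""
    records = []
    for line in text.splitlines():
        m = LINE_RE.search(line)
        if m:
            records.append((m.group("method"),
                            float(m.group("gputime")),
                            float(m.group("cputime"))))
    return records

# A few lines from the log above, used as sample input:
sample = """\
method=[ memcpyHtoDasync ] gputime=[ 3487.264 ] cputime=[ 4092.000 ]
method=[ matrixMul ] gputime=[ 91482.430 ] cputime=[ 13.000 ]
method=[ memcpyDtoHasync ] gputime=[ 1119.680 ] cputime=[ 1893.000 ]
"""

for method, gpu, cpu in parse_profile(sample):
    print(f"{method:18s} gputime={gpu:10.3f} us  cputime={cpu:8.3f} us")
```

Counting how many memcpyDtoHasync records show up per run makes the "extra transfers" easy to spot mechanically.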
My NVIDIA driver is 190.29 and the device is a Tesla C1060.
Feel free to ask for additional information.
Regards,
Nicolas
Andreas Klöckner wrote:
Hi Nicolas,
On Tuesday 02 February 2010, Bonnel wrote:
I was just playing with the profiler from NVIDIA and I'm wondering why
all data from the graphics card is read back. I thought memory was read
back only when using cl.enqueue_read_buffer. Here is the result I get
from profiling matrix-multiply.py:
method          memory transfer size
memcpyHtoDasync 5.12e+06
memcpyHtoDasync 5.12e+06
memcpyDtoHasync 2.56e+06
memcpyDtoHasync 5.12e+06
memcpyDtoHasync 2.56e+06
memcpyDtoHasync 5.12e+06
As there is only one cl.enqueue_read_buffer call, there should be only
one memcpyDtoHasync call.
I recently had an informative conversation with someone on the Nvidia
driver team, and they indicated that CL may 'transparently' issue
transfers after kernel launches based on the flags with which the buffer
was created.
Now I'm faced with two problems. First, all the Nvidia profiler does for
me is crash. I've figured out that I can invoke it from the command line
by specifying
export OPENCL_PROFILE=1
export OPENCL_PROFILE_CONFIG='temp_cl_profiler.conf'
and then find data in "opencl_profile_0.log". However, no matter what I
put in temp_cl_profiler.conf, I can't see the extra transfers you are
seeing. Can you grab and post the generated config file, perhaps by
import os; print open(os.environ["OPENCL_PROFILE_CONFIG"], "r").read()
That would be very helpful. (If you could generate a survey of what the
file can look like, that would of course help even more!)
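As a data point, the file named by OPENCL_PROFILE_CONFIG appears to use the same one-option-per-line format as the CUDA profiler's CUDA_PROFILE_CONFIG file. A sketch, with option names taken from NVIDIA's CUDA profiler documentation (whether the OpenCL profiler under driver 190.x honors all of them is an assumption):

```
# temp_cl_profiler.conf
# One profiler option per line; '#' starts a comment.
# Option names from the CUDA profiler docs.
timestamp
gridsize
threadblocksize
memtransfersize
```

The memtransfersize option is what should produce the "memory transfer size" column shown in Nicolas's output.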
As far as flags were concerned, COPY_HOST_PTR was a natural suspect, but
removing that didn't change the timings. It would really help if I could
observe the extra transfers.
Thanks for posting your observations!
Andreas
_______________________________________________
PyOpenCL mailing list
[email protected]
http://host304.hostmonster.com/mailman/listinfo/pyopencl_tiker.net