Attached is a rather contrived example of computing FFTs with pyfft. The names say it all: serial.py, streams.py, streams-time.py.
I have tried to make an example that uses streams. However, I'm not convinced that it actually works as expected. If you can convince me that it works, or improve the code, I promise to clean it up and put it on the wiki.

1: The GPU time width plot in the Compute Visual Profiler does not show any overlap between streams 1 and 2. From what I read on the CUDA forums, this may be caused by the profiler itself, and running the code outside the profiler might not give the same behaviour. Is this true? (And how do you profile your code if the profiler is broken?)

2: The streamed version runs faster than the serial version. However, I have a nagging suspicion that this speedup comes only from faster mem-copies and not from any overlap between streams. E.g., printing the time after each line shows that the "python time" of the first mem-copy is ~0.3 ms, the "python time" of the first fft call is ~6 ms, and that of the second fft call is ~0.1 ms. 6 ms happens to be the duration of the mem-copy according to the visual profiler! Can anyone confirm this: is the first fft call blocking until the data has been copied to the device? (The get_async call also seems to be blocking, according to the "python time".)

Any help appreciated.

-----------------------------------------------
Magnus Paulsson
Assistant Professor
School of Computer Science, Physics and Mathematics
Linnaeus University
Phone: +46-480-446308
Mobile: +46-70-6942987
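One way to check for overlap without the profiler might be to bracket each asynchronous operation with CUDA events and compare the resulting GPU-side intervals, rather than relying on the "python time", which only measures how long the host spent in the call. The following is a minimal sketch, not taken from the attached scripts: the helper names (`record_span`, `spans_to_intervals`, `total_overlap_ms`) are hypothetical, and it assumes that `pycuda.driver.Event` timings between events recorded in different streams are accurate enough for this purpose, which may not hold on every CUDA version.

```python
def interval_overlap_ms(a, b):
    """Overlap in ms between two (start, end) intervals; 0.0 if disjoint."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def total_overlap_ms(stream1, stream2):
    """Total time during which both streams were busy.

    Each argument is a list of (start, end) intervals for one stream.
    Operations within a single stream run one after another, so the
    intervals of one stream are assumed not to overlap each other.
    """
    return sum(interval_overlap_ms(a, b) for a in stream1 for b in stream2)


def record_span(stream, enqueue):
    """Hypothetical helper: bracket one async operation with CUDA events.

    `enqueue` is a zero-argument callable that issues the operation on
    `stream` (e.g. a set_async call, an FFT, or a get_async call).
    Needs PyCUDA and a GPU, so the import is done lazily here.
    """
    import pycuda.driver as drv
    start, end = drv.Event(), drv.Event()
    start.record(stream)
    enqueue()
    end.record(stream)
    return start, end


def spans_to_intervals(t0, spans):
    """Convert (start, end) event pairs to ms offsets from event `t0`.

    Call only after all events have completed, e.g. after
    synchronizing both streams; `t0` is an event recorded before any
    work was issued.
    """
    return [(t0.time_till(s), t0.time_till(e)) for s, e in spans]
```

If `total_overlap_ms` returns roughly zero for the two streams, the work was effectively serialized on the GPU; a clearly positive value would indicate real overlap.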
_______________________________________________
PyCUDA mailing list
PyCUDA@tiker.net
http://lists.tiker.net/listinfo/pycuda