Eric Blossom wrote:
> advantage of it.  Again from reading, it appears that you need at
> least 64 elements that you can apply an instruction to, to be in its
> target zone.  For certain parts of our graphs, this is probably OK
> (e.g., FEC decode, FIRs, FFTs), but I'm kind of dubious about
> anything with a dependency chain (IIRs, PLLs, equalizers, etc.)
32 threads in a so-called "warp" execute together in a Single
Instruction Multiple Threads (SIMT) manner on a particular Streaming
Multiprocessor (SM). Control flow among the 32 threads can diverge,
but when that happens, each divergent path is executed serially, with
the threads not on that path masked off (first sketch below).

Your observations are correct. At least for now, CUDA's strength is
still largely restricted to computation-intensive, data-parallel
processing, which is where the other 99% of nVidia's business lies
(the graphics processing, of course). But once GPGPU processing takes
off, things could change.

> I'm also not sure if you can launch multiple kernels simultaneously
> (CUDA-speak).  If you could launch multiple kernels, we'd have a
> better chance of using the parallelism.
>
> Eric

Currently, no. But it is possible to execute several parallel tasks
within the same kernel by diverging the control flow, while grouping
each task (each variant of the control flow) into whole warps of 32
threads (padding if necessary) so that no divergence occurs inside any
single warp (second sketch below). The nvcc compiler will at least
take care of register allocation, so the combined kernel uses no more
registers than the most demanding single task requires.
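To make the divergence point concrete, here is a minimal sketch (all
names are mine, purely for illustration): odd and even lanes of the
same warp take different branches, so the hardware runs the two paths
one after the other, masking off the inactive threads on each pass.

#include <cuda_runtime.h>
#include <stdio.h>

/* Illustrative kernel, not from any real code base: odd and even
   threads of the same warp branch differently, which forces the warp
   to execute both paths serially. */
__global__ void diverge(float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x & 1)
        out[i] = 2.0f * i;    /* path taken by odd lanes */
    else
        out[i] = i + 1.0f;    /* path taken by even lanes */
}

int main(void)
{
    const int N = 64;                 /* two warps' worth of threads */
    float *d_out, h_out[N];
    cudaMalloc(&d_out, N * sizeof(float));
    diverge<<<1, N>>>(d_out);
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    printf("out[0]=%g out[1]=%g\n", h_out[0], h_out[1]);
    cudaFree(d_out);
    return 0;
}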
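And here is a sketch of the several-tasks-in-one-kernel trick, again
under my own naming (taskA/taskB stand in for two independent
data-parallel jobs). The branch condition is the warp index rather
than the thread index, so every thread of a given warp takes the same
path and no intra-warp divergence occurs; the block size is a multiple
of 32 so each task gets whole warps.

#include <cuda_runtime.h>

#define WARP_SIZE 32

/* Two hypothetical independent tasks fused into one kernel. */
__device__ void taskA(float *a, int i) { a[i] *= 0.5f; }  /* e.g. scale  */
__device__ void taskB(float *b, int i) { b[i] += 1.0f; }  /* e.g. offset */

__global__ void fused(float *a, float *b)
{
    int warp = threadIdx.x / WARP_SIZE;  /* warp index within the block  */
    int lane = threadIdx.x % WARP_SIZE;  /* thread's slot within its warp */
    int i = blockIdx.x * WARP_SIZE + lane;

    /* Branching on the warp index keeps each warp uniform: warp 0 of
       every block runs task A, warp 1 runs task B, and neither pays
       the serialization penalty of intra-warp divergence. */
    if (warp == 0)
        taskA(a, i);
    else
        taskB(b, i);
}

int main(void)
{
    const int N = 256;
    float *a, *b;
    cudaMalloc(&a, N * sizeof(float));
    cudaMalloc(&b, N * sizeof(float));
    cudaMemset(a, 0, N * sizeof(float));
    cudaMemset(b, 0, N * sizeof(float));
    /* 64 threads per block = 2 warps: warp 0 -> task A, warp 1 -> task B */
    fused<<<N / WARP_SIZE, 2 * WARP_SIZE>>>(a, b);
    cudaDeviceSynchronize();
    cudaFree(a);
    cudaFree(b);
    return 0;
}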
-Yu