Stefan,

First off, I definitely want to encourage investigations of this sort. Even though I share some of Sylvain's and Tom's doubts about whether VOLK is the right place to do this, I definitely want to encourage *trying* it: we could be entirely wrong about whether or not this will work, and the only way to know for sure is to try it.

That said, I do think there is a way *within* VOLK to deal with the issue of the input size (i.e. vector size) having a large impact on performance: namely, the custom dispatcher. This is a concept that exists in VOLK, but it has largely gone unnoticed because, by and large, the default dispatcher does a good (or at least good-enough) job of selecting the proper proto-kernel. For off-loading concepts such as utilizing GPUs via OpenCL, a custom dispatcher *could* select the appropriate proto-kernel (including directing the OpenCL implementation to select a CPU- vs. GPU-based implementation, if multiple OpenCL implementations are available) on a per-'work()' call basis from the GNU Radio scheduler. In other words, instead of relying on volk_profile to select the best proto-kernel for all calls to that particular VOLK kernel, the dispatcher could have something more akin to the FFTW 'wisdom', where different proto-kernels are called for different sizes of matrices/vectors (including the CPU SIMDized call instead of the OpenCL call for smaller input sizes, etc.).

Anyway, I definitely think this is something that should be looked into more, and if you are interested in pursuing this, either as a GSoC project or otherwise, I would definitely encourage it, as well as offer assistance/advice where I can.

Doug

On Thu, Dec 17, 2015 at 7:58 PM, Stefan Wunsch < [email protected]> wrote:

> On 12/18/2015 12:30 AM, Tom Rondeau wrote:
> > On Thu, Dec 17, 2015 at 1:14 PM, Sylvain Munaut <[email protected]> wrote:
> >
> >> Hi,
> >>
> >>> RUN_VOLK_TESTS: volk_32f_x2_matrix_nxn_multiply_puppet_32f(1000000,10)
> >>> generic completed in 28482ms
> >>> a_opencl completed in 13364.3ms
> >>
> >> Question is how does that number change for smaller problem sizes ?
> >> And what would be the average problem size encountered in real env.
> >>
> >> For SIMD optimization the result of "who's the fastest" doesn't vary
> >> too much depending on problem size because they don't have much setup
> >> / teardown size.
> >> For OpenCL I very much doubt that would be the case and if you end up
> >> with an app making a lot of "smallish" (and given the default buffer
> >> size of GR, I feel the calls to volk aren't processing millions of
> >> samples at a time in a single call)
> >>
> >>
> >> Cheers,
> >>
> >> Sylvain
> >>
> >
> >
> > Stefan,
> >
> > This is a great start. But Sylvain makes good points about the data
> > transfer issue. That's definitely a problem we have to think about. It's
> > why we have avoided pursuing GPU support in VOLK in the past. Now, if
> > heterogeneous processor technologies change, so might this problem.
> >
> > On the other hand, Doug Geiger has made progress on building OpenCL support
> > into the buffer structure of the scheduler. What you've done here might
> > work better as a block designed around this concept.
> >
> > Tom
> >
>
> Hi,
>
> I just wondered why it has not been done yet, but I see the problems now
> (Sylvain made the point).
> If a proper device selection and initialization is integrated into VOLK,
> probably the same processings could be used for the scheduler (e.g.,
> with a generic fallback). But as well, I think that I don't know enough
> about all of this ;)
>
> Greetings
> Stefan
>
> _______________________________________________
> Discuss-gnuradio mailing list
> [email protected]
> https://lists.gnu.org/mailman/listinfo/discuss-gnuradio
>

--
Doug Geiger
[email protected]
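
P.S. To make the size-aware dispatcher idea from the reply above a bit more concrete, here is a minimal, language-neutral sketch in Python. It is not VOLK's actual dispatcher API: the class `SizeAwareDispatcher`, the three stand-in proto-kernels, and the size thresholds are all hypothetical and only illustrate the idea of a 'wisdom'-style table that maps input sizes to proto-kernels. In practice the thresholds would have to come from profiling on the target machine.

```python
import bisect
from typing import Callable, List, Sequence, Tuple

Kernel = Callable[[Sequence[float], Sequence[float]], List[float]]


# Hypothetical stand-ins for proto-kernels. All three compute the same
# element-wise product; in a real system they would differ only in how
# (generic C, SIMD, OpenCL) the work is carried out.
def multiply_generic(a: Sequence[float], b: Sequence[float]) -> List[float]:
    return [x * y for x, y in zip(a, b)]


def multiply_simd(a: Sequence[float], b: Sequence[float]) -> List[float]:
    return [x * y for x, y in zip(a, b)]


def multiply_opencl(a: Sequence[float], b: Sequence[float]) -> List[float]:
    return [x * y for x, y in zip(a, b)]


class SizeAwareDispatcher:
    """Pick a proto-kernel per call, based on the number of input points."""

    def __init__(self, wisdom: List[Tuple[int, str, Kernel]]) -> None:
        # Each entry is (minimum input size, name, kernel).
        if not wisdom:
            raise ValueError("wisdom table must not be empty")
        self._wisdom = sorted(wisdom, key=lambda entry: entry[0])
        self._thresholds = [entry[0] for entry in self._wisdom]

    def select(self, num_points: int) -> Tuple[int, str, Kernel]:
        # Find the last entry whose minimum size is <= num_points;
        # fall back to the first entry for sizes below every threshold.
        idx = bisect.bisect_right(self._thresholds, num_points) - 1
        return self._wisdom[max(idx, 0)]

    def __call__(self, a: Sequence[float], b: Sequence[float]) -> List[float]:
        _, _, kernel = self.select(len(a))
        return kernel(a, b)


# Placeholder thresholds (assumptions, not measured values).
dispatcher = SizeAwareDispatcher([
    (0, "generic", multiply_generic),
    (64, "simd", multiply_simd),
    (100_000, "opencl", multiply_opencl),
])
```

With such a table, a small buffer handed over by one 'work()' call would stay on the CPU path, while only very large inputs would pay the setup and transfer cost of the OpenCL path; `dispatcher.select(n)` shows which entry would be used for `n` points.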
