Stefan,
 First off, I want to encourage investigations of this sort. Even though I
share some of Sylvain's and Tom's doubts about whether VOLK is the right
place to do this, it is still worth *trying*: you never know, and we could
be entirely wrong about whether or not this will work. The only way to know
for sure is to try it.

 That said, I do think there are ways *within* VOLK to deal with the issue
of the input size (i.e. vector size) having a large impact on performance,
namely the custom dispatcher. This is a concept that exists in VOLK but
has largely gone unnoticed, because by and large the default dispatcher
does a good (or at least good-enough) job of selecting the proper
proto-kernel. For off-loading concepts such as utilizing GPUs via OpenCL,
a custom dispatcher *could* select the appropriate proto-kernel (including
directing the OpenCL implementation to select a CPU- vs. GPU-based
implementation, if multiple OpenCL implementations are available) on a
per-'work()'-call basis from the GNU Radio scheduler. In other words,
instead of relying on volk_profile to select the single best proto-kernel
for all calls to a particular VOLK kernel, the dispatcher could use
something more akin to FFTW 'wisdom', where different proto-kernels are
called for different sizes of matrices/vectors (for example, the CPU SIMD
call instead of the OpenCL call for smaller input sizes).
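
 To make the 'wisdom' idea a bit more concrete, here is a rough sketch in
Python. It is not VOLK code and does not use any VOLK API: the class name
SizeAwareDispatcher and its methods (profile, load_wisdom, select) are
made up for illustration, and the "proto-kernels" are plain Python
callables. It only shows the shape of the idea, assuming we time each
implementation at a few sizes and then pick per call based on input length.

```python
import bisect
import time


class SizeAwareDispatcher:
    """Hypothetical per-size dispatcher (illustration only, not a VOLK API)."""

    def __init__(self, impls):
        self.impls = dict(impls)  # name -> callable(data)
        self.sizes = []           # sorted input sizes we have wisdom for
        self.best = []            # best implementation name per size

    def load_wisdom(self, table):
        """Install a {size: impl_name} table, e.g. one saved earlier."""
        for name in table.values():
            if name not in self.impls:
                raise KeyError("unknown implementation: " + name)
        self.sizes = sorted(table)
        self.best = [table[n] for n in self.sizes]

    def profile(self, sizes, make_input, repeats=3):
        """Time every implementation at each size and keep the fastest."""
        table = {}
        for n in sizes:
            data = make_input(n)
            timings = {}
            for name, fn in self.impls.items():
                start = time.perf_counter()
                for _ in range(repeats):
                    fn(data)
                timings[name] = time.perf_counter() - start
            table[n] = min(timings, key=timings.get)
        self.load_wisdom(table)

    def select(self, n):
        """Return the implementation name to use for an input of length n."""
        if not self.sizes:
            raise RuntimeError("no wisdom: call profile() or load_wisdom()")
        # Largest profiled size <= n; below the smallest, use the smallest.
        i = bisect.bisect_right(self.sizes, n) - 1
        return self.best[max(i, 0)]

    def __call__(self, data):
        # The selection happens on every call, i.e. per 'work()' invocation.
        return self.impls[self.select(len(data))](data)
```

 With a wisdom table such as {1024: "simd", 1000000: "opencl"}, a call on
4096 samples would go to the "simd" entry and a call on a million or more
samples to the "opencl" entry. A real dispatcher would also have to
account for device setup and data transfer, which this sketch ignores.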

 Anyway, I think this is something that should be looked into more. If you
are interested in pursuing this, either as a GSoC project or otherwise, I
would definitely encourage it and will offer assistance/advice where I
can.

 Doug


On Thu, Dec 17, 2015 at 7:58 PM, Stefan Wunsch <
[email protected]> wrote:

>
>
> On 12/18/2015 12:30 AM, Tom Rondeau wrote:
> > On Thu, Dec 17, 2015 at 1:14 PM, Sylvain Munaut <[email protected]> wrote:
> >
> >> Hi,
> >>
> >>> RUN_VOLK_TESTS: volk_32f_x2_matrix_nxn_multiply_puppet_32f(1000000,10)
> >>> generic completed in 28482ms
> >>> a_opencl completed in 13364.3ms
> >>
> >> The question is how that number changes for smaller problem sizes,
> >> and what the average problem size would be in a real environment.
> >>
> >> For SIMD optimization, the result of "who's the fastest" doesn't vary
> >> much with problem size because there isn't much setup / teardown
> >> overhead.
> >> For OpenCL I very much doubt that would be the case, especially if
> >> you end up with an app making a lot of "smallish" calls (and given
> >> the default buffer size of GR, I feel the calls to VOLK aren't
> >> processing millions of samples at a time in a single call).
> >>
> >>
> >> Cheers,
> >>
> >>     Sylvain
> >>
> >
> >
> > Stefan,
> >
> > This is a great start. But Sylvain makes good points about the data
> > transfer issue. That's definitely a problem we have to think about. It's
> > why we have avoided pursuing GPU support in VOLK in the past. Now, if
> > heterogeneous processor technologies change, so might this problem.
> >
> > On the other hand, Doug Geiger has made progress on building OpenCL
> > support into the buffer structure of the scheduler. What you've done
> > here might work better as a block designed around this concept.
> >
> > Tom
> >
>
> Hi,
>
> I just wondered why it has not been done yet, but I see the problems now
> (Sylvain made the point).
> If proper device selection and initialization were integrated into VOLK,
> the same processing could probably be used for the scheduler (e.g.,
> with a generic fallback). But then again, I don't think I know enough
> about all of this ;)
>
> Greetings
> Stefan
>
> _______________________________________________
> Discuss-gnuradio mailing list
> [email protected]
> https://lists.gnu.org/mailman/listinfo/discuss-gnuradio
>



-- 
Doug Geiger
[email protected]