On Tue, Jan 17, 2012 at 10:36 AM, Josh Blum <[email protected]> wrote:

>
>
> On 01/16/2012 09:51 AM, ziyang wrote:
> > On 01/13/2012 09:30 PM, Josh Blum wrote:
> >>> To reduce the computation load of the processor, I tried two methods:
> >>> 1) modify the gr.quadrature_demod_cf block, replace some multiplication
> >>> operations with volk-based operations (gr.multiply and gr.multiply_const
> >>> modules in gr_blocks);
> >> I like it. Make sure to contribute patches like that back. :-)
> > Actually, what I did was writing a new quadrature_demod block without
> > the multiplication and delay operations, and connect extra gr.multiply
> > and gr.delay blocks instead in the flow graph. Because my understanding
> > is that the volk functions take a vector (multiple values) as input, and
> > I didn't figure out a way to do the single-item-operation in the volk
> > style.
> >
>
> I don't recommend using the extra blocks; that would probably cause more
> overhead. Looking at gr_quadrature_demod_cf::work, it looks like you can
> vectorize the operation of the conjugate multiply, then the atan, then
> the gain scaler. So, that would be one for loop that operates on 4
> samples at a time, and calls 3 volk functions.
>
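
The loop structure described above can be sketched in plain Python. This is only an illustration under stated assumptions: `conj_multiply`, `phase`, and `scale` are hypothetical stand-ins for the three Volk calls, not actual Volk kernels, and the chunk size of 4 simply mirrors the "4 samples at a time" wording. The arithmetic follows the usual quadrature demodulator definition, out[i] = gain * arg(in[i+1] * conj(in[i])).

```python
import cmath
import math


def conj_multiply(a, b):
    """Stage 1 (stand-in for a vector kernel): a[i] * conj(b[i])."""
    return [x * y.conjugate() for x, y in zip(a, b)]


def phase(v):
    """Stage 2 (stand-in for a vector kernel): atan2(imag, real) per item."""
    return [math.atan2(z.imag, z.real) for z in v]


def scale(v, gain):
    """Stage 3 (stand-in for a vector kernel): multiply by a constant gain."""
    return [gain * p for p in v]


def quadrature_demod(samples, gain, chunk=4):
    """Demodulate in small chunks, calling the three stages once per chunk.

    One sample of history is needed, so N inputs yield N - 1 outputs.
    """
    out = []
    n = len(samples) - 1
    for start in range(0, n, chunk):
        stop = min(start + chunk, n)
        cur = samples[start + 1:stop + 1]   # in[i + 1]
        prev = samples[start:stop]          # in[i]
        out.extend(scale(phase(conj_multiply(cur, prev)), gain))
    return out


if __name__ == "__main__":
    # A tone whose phase advances 0.1 rad per sample.
    tone = [cmath.exp(1j * 0.1 * k) for k in range(11)]
    print(quadrature_demod(tone, gain=2.0))
```

Because each chunk is small, the intermediate results of one stage are still hot when the next stage reads them, which is the point of looping over short chunks rather than over the whole buffer three times.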

Right now, the Volk atan2 function is implemented only for SSE, and it only
works if libsimdmath is installed. Otherwise it falls back to a generic
implementation that is considerably slower than GNU Radio's LUT atan2.
There's no NEON implementation, so for now the fastest option on the E100
is to use GNU Radio's built-in atan2.
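
For anyone unfamiliar with the idea of a lookup-table atan2, here is a minimal sketch of the general technique. It is not GNU Radio's implementation: the table size, the nearest-entry lookup, and the names `TABLE_SIZE`, `ATAN_TABLE`, and `lut_atan2` are all assumptions made for illustration.

```python
import math

TABLE_SIZE = 256  # hypothetical resolution; larger tables give smaller error
# atan(r) tabulated for r in [0, 1], i.e. angles in the first octant.
ATAN_TABLE = [math.atan(i / TABLE_SIZE) for i in range(TABLE_SIZE + 1)]


def lut_atan2(y, x):
    """Approximate atan2(y, x) with a first-octant table plus symmetry."""
    ax, ay = abs(x), abs(y)
    if ax == 0.0 and ay == 0.0:
        return 0.0
    if ax >= ay:
        # Ratio is in [0, 1]: look it up directly.
        angle = ATAN_TABLE[round(ay / ax * TABLE_SIZE)]
    else:
        # Ratio would exceed 1: use atan(r) = pi/2 - atan(1/r).
        angle = math.pi / 2 - ATAN_TABLE[round(ax / ay * TABLE_SIZE)]
    if x < 0:
        angle = math.pi - angle  # reflect into the left half-plane
    if y < 0:
        angle = -angle           # reflect into the lower half-plane
    return angle


if __name__ == "__main__":
    for deg in (10, 100, -135):
        t = math.radians(deg)
        print(deg, lut_atan2(math.sin(t), math.cos(t)), t)
```

The trade-off is accuracy for speed: the table replaces the polynomial or iterative evaluation with one division, one lookup, and a few sign fixes.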

I spent some quality time a couple of months ago during SDR Forum writing a
vectorized atan2 algorithm in Volk via Orc, but I was unable to get the
entire algorithm to fit within the register constraints that the Orc runtime
compiler applies. The end goal is to vectorize the entire algorithm so that
it writes out to memory only once, which will be far faster than running
three separate vector operations across a large buffer that won't fit into
cache. I'll get back to it one of these days, but it looks like parts of
Orc's compiler will have to be improved. Terry, Orc code is easily read and
looks like vector pseudocode, so my Orc implementation might be of use if
you're interested in writing a custom NEON implementation for Volk. It's
based on the libsimdmath implementation, which is in turn based on Cephes,
and uses all sorts of Crazy Math Tricks.

--n


> >> Also, you may consider timing a particular operation as a performance
> >> metric, rather than counting the number of demodulated packets.
> >>
> > I was wondering if there are examples from which I can learn how to do
> > this?
>
> Sorry, I guess there isn't much in the way of examples.
>
> You can time individual work functions by adding some code before and
> after. We have some high resolution timers in
> gruel/include/gruel/high_res_timers.h
>
> I have also seen people time the block in a simple flow graph with a
> null source, head, your_block, null_sink. You can time tb.run() and
> compare run duration vs the non-vectorized code.
>
> -Josh
>
> _______________________________________________
> Discuss-gnuradio mailing list
> [email protected]
> https://lists.gnu.org/mailman/listinfo/discuss-gnuradio
>
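
As a rough illustration of the "time the operation itself" approach, here is a small standalone harness. It does not use GNU Radio or the gruel timers; `three_pass`, `fused`, and `best_time` are hypothetical names, and the two functions are plain-Python stand-ins for a three-pass and a single-pass demodulator. The timings it prints say nothing about SSE or NEON performance; the sketch only shows the measurement pattern.

```python
import math
import time


def three_pass(samples, gain):
    """Three separate passes over the whole buffer."""
    prod = [samples[i + 1] * samples[i].conjugate()
            for i in range(len(samples) - 1)]
    ang = [math.atan2(p.imag, p.real) for p in prod]
    return [gain * a for a in ang]


def fused(samples, gain):
    """One pass: each output is computed and written exactly once."""
    out = []
    for i in range(len(samples) - 1):
        p = samples[i + 1] * samples[i].conjugate()
        out.append(gain * math.atan2(p.imag, p.real))
    return out


def best_time(fn, *args, repeats=5):
    """Run fn several times; return (shortest wall-clock seconds, result)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        t0 = time.perf_counter()
        result = fn(*args)
        best = min(best, time.perf_counter() - t0)
    return best, result


if __name__ == "__main__":
    samples = [complex(math.cos(0.05 * k), math.sin(0.05 * k))
               for k in range(100_000)]
    for fn in (three_pass, fused):
        elapsed, _ = best_time(fn, samples, 1.0)
        print(f"{fn.__name__}: {elapsed * 1e3:.2f} ms")
```

Taking the minimum over several runs reduces the influence of scheduler noise. The same pattern carries over to the flow graph approach quoted above: record a timestamp, call tb.run(), record another, and compare the durations for the vectorized and non-vectorized blocks.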