On 01/17/2012 07:54 PM, Nick Foster wrote:
On Tue, Jan 17, 2012 at 10:36 AM, Josh Blum <[email protected]> wrote:
On 01/16/2012 09:51 AM, ziyang wrote:
> On 01/13/2012 09:30 PM, Josh Blum wrote:
>>> To reduce the computation load of the processor, I tried two methods:
>>> 1) modify the gr.quadrature_demod_cf block, replace some multiplication
>>> operations with volk-based operations (gr.multiply and gr.multiply_const
>>> modules in gr_blocks);
>> I like it. Make sure to contribute patches like that back. :-)
> Actually, what I did was write a new quadrature_demod block without
> the multiplication and delay operations, and connect extra gr.multiply
> and gr.delay blocks in the flow graph instead. My understanding is
> that the volk functions take a vector (multiple values) as input, and
> I didn't figure out a way to do the single-item operation in the volk
> style.
>
I don't recommend using the extra blocks; that would probably cause more
overhead. Looking at gr_quadrature_demod_cf::work, it looks like you can
vectorize the conjugate multiply, then the atan, then the gain scalar
multiply. So, that would be one for loop that operates on 4 samples at
a time and calls 3 volk functions.
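The three stages named above can be sketched in plain Python. This is only an illustration of the arithmetic, not Volk code: the function names are made up for the example, each stage is an ordinary list comprehension standing in for one vector call, and the output is one item shorter than the input because the first sample has no predecessor.

```python
import cmath
import math


def conj_multiply(samples):
    """Stage 1: multiply each sample by the conjugate of the previous one."""
    return [cur * prev.conjugate() for prev, cur in zip(samples, samples[1:])]


def phase(products):
    """Stage 2: take atan2(imag, real) of every product."""
    return [math.atan2(p.imag, p.real) for p in products]


def scale(values, gain):
    """Stage 3: apply the gain as a scalar multiply."""
    return [gain * v for v in values]


def quadrature_demod(samples, gain):
    # Three whole-buffer passes, one per stage, mirroring three vector calls.
    return scale(phase(conj_multiply(samples)), gain)


# A complex tone that advances 0.1 rad per sample demodulates to a constant.
tone = [cmath.exp(1j * 0.1 * n) for n in range(9)]
demod = quadrature_demod(tone, gain=2.0)
```

Each output should be close to gain times the per-sample phase step, which is a convenient sanity check when swapping in a faster atan2.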
Right now, the Volk atan2 function is only implemented for SSE and
only works if libsimdmath is installed. If not, it will fall back to a
generic implementation which is considerably slower than GNU Radio's
LUT atan2. There's no NEON implementation, so right now the fastest
option on E100 is to use GNU Radio's built-in atan2.
I spent some quality time a couple of months ago during SDR Forum
writing a vectorized atan2 algorithm in Volk via Orc. I was unable to
get the entire algorithm to fit within the register constraints the
Orc runtime compiler applies. The end goal is to get the entire
algorithm vectorized so it only needs to write out to memory once,
which is going to be far faster than running three vector operations
across a large buffer which won't fit into cache. I'll get back to it
one of these days but it looks like parts of Orc's compiler will have
to be improved. Terry, Orc code is easily read and looks like vector
pseudocode, so my Orc implementation might be of use if you're
interested in writing a custom NEON implementation for Volk. It's based
on the libsimdmath implementation, which is in turn based on Cephes,
and uses all sorts of Crazy Math Tricks.
--n
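As a rough illustration of the "write out to memory once" goal, here is a fused version in plain Python: the conjugate multiply, atan2, and gain are all applied per sample inside a single loop, so no intermediate buffers are produced. This is a sketch of the structure only. It is not the Orc or libsimdmath algorithm, the function name is made up for the example, and Python itself will not show the cache benefit a fused SIMD kernel would.

```python
import cmath
import math


def quadrature_demod_fused(samples, gain):
    """Do conjugate multiply, atan2 and gain per sample; write each output once."""
    out = [0.0] * (len(samples) - 1)
    for i in range(1, len(samples)):
        prev, cur = samples[i - 1], samples[i]
        # Conjugate multiply expanded into real arithmetic: cur * conj(prev).
        re = cur.real * prev.real + cur.imag * prev.imag
        im = cur.imag * prev.real - cur.real * prev.imag
        out[i - 1] = gain * math.atan2(im, re)
    return out


# A tone rotating by -0.3 rad per sample, demodulated with a gain of 1.5.
tone = [cmath.exp(-1j * 0.3 * n) for n in range(6)]
fused = quadrature_demod_fused(tone, gain=1.5)
```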
Thank you for your help, Nick. Right now, I really want a faster atan
implementation, but I use Python and occasionally C++ most of the time,
so I'm not sure I can handle a custom NEON implementation, because Orc,
NEON, libsimdmath, and Cephes are all completely new to me.
Thanks.
Best Regards,
Terry
>> Also, you may consider timing a particular operation as a performance
>> metric, rather than counting the number of demodulated packets.
>>
> I was wondering if there are examples from which I can learn how to do
> this?
Sorry, I guess there isn't much in the way of examples.
You can time individual work functions by adding some code before and
after. We have some high resolution timers in
gruel/include/gruel/high_res_timers.h
I have also seen people time the block in a simple flow graph with a
null source, head, your_block, null_sink. You can time tb.run() and
compare the run duration against the non-vectorized code.
-Josh
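The run-duration comparison described above can be sketched with a small timing helper. This is a minimal sketch, assuming Python 3 (for time.perf_counter). The helper name and the two workload functions are hypothetical stand-ins; in a real test you would pass the run method of a flow graph built as null source, head, your block, null sink instead.

```python
import time


def time_call(fn, repeats=5):
    """Return the shortest wall-clock time, in seconds, of `repeats` calls to fn."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def baseline():
    # Hypothetical stand-in for the non-vectorized code path.
    return sum(i * i for i in range(20000))


def candidate():
    # Hypothetical stand-in for the vectorized code path.
    return sum(i * i for i in range(2000))


t_base = time_call(baseline)
t_cand = time_call(candidate)
print("baseline %.6f s, candidate %.6f s" % (t_base, t_cand))
```

Taking the minimum over several runs reduces the effect of scheduler noise. Note that a flow graph may need to be rebuilt between runs, since a head block stops passing items once its count is reached.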
_______________________________________________
Discuss-gnuradio mailing list
[email protected]
https://lists.gnu.org/mailman/listinfo/discuss-gnuradio