Hello Philip,

I know NEON performs better, but it is far more complicated to use.
A naive radix-2 butterfly in floating point could look like this:

static void Rank(cpx_float_t *_A, cpx_float_t *_B, floatmix_t *_W,
                 int _nblks, int _bsize) {
  __asm__ ("mov        r0,  %0           \n\t" //  _A
           "mov        r1,  %1           \n\t" //  _B
           "mov        r3,  %3           \n\t" //  blkIdx = _nblks
           "1:                           \n\t" //  for blkIdx
           "mov        r2,  %2           \n\t" //  wtmp = _W
           "mov        r4,  %4           \n\t" //  szIdx = _bsize
           "2:                           \n\t" //  for szIdx
           "vld1.32   {d1}, [r1]         \n\t" //  load _B
           "vld1.32   {d2}, [r2]         \n\t" //  load wtmp
           "add       r2, r2, %3, lsl #3 \n\t" //  wtmp += _nblks*8
           "vmul.f32  d0, d2, d1         \n\t" //  d0 = (Wr*Br, Wi*Bi)
           "vrev64.32 d3, d2             \n\t" //  d3 = (Wi, Wr)
           "vmul.f32  d3, d3, d1         \n\t" //  d3 = (Wi*Br, Wr*Bi)
           "vneg.f32  s1, s1             \n\t" //  d0 = (Wr*Br, -Wi*Bi)
           "vtrn.32   d0, d3             \n\t" //  swap d0[1] and d3[0]
           "vadd.f32  d3, d0, d3         \n\t" //  d3 = _B W
           "vld1.32   {d0}, [r0]         \n\t" //  load _A
           "vsub.f32  d1, d0, d3         \n\t" //  _B = _A - _B W
           "vadd.f32  d0, d0, d3         \n\t" //  _A = _A + _B W
           "vst1.64   {d0}, [r0]!        \n\t" //  store _A and update
           "vst1.64   {d1}, [r1]!        \n\t" //  store _B and update
           "subs      r4, r4, #1         \n\t" //  szIdx--
           "bgt 2b                       \n\t" //  branch if > 0
           "add       r0, r0, %4, lsl #3 \n\t" //  _A += _bsize*8
           "add       r1, r1, %4, lsl #3 \n\t" //  _B += _bsize*8
           "subs      r3, r3, #1         \n\t" //  blkIdx--
           "bgt 1b                       \n\t" //  branch if > 0
           :: "r"(_A), "r"(_B), "r"(_W), "r" (_nblks), "r" (_bsize)
           : "d0", "d1", "d2", "d3", "r0", "r1", "r2", "r3", "r4",
             "cc", "memory");              //  subs changes the flags
}

and it is twice as slow as the fixed-point version.
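
To check the output of the NEON routine above, a plain C version of the same loop structure can help. This is only a sketch: the name rank_ref is made up, and I assume here that cpx_float_t is a pair of floats (re, im) and that each twiddle is stored the same way, 8 bytes per entry. It is meant for comparing results, not for speed.

```c
#include <assert.h>

/* Assumed layout: two packed 32-bit floats, real part first. */
typedef struct { float re, im; } cpx_float_t;

/* Hypothetical scalar reference for one radix-2 rank.
 * Walks nblks blocks of bsize butterflies, stepping the twiddle
 * pointer by nblks entries per butterfly, as the asm does. */
static void rank_ref(cpx_float_t *a, cpx_float_t *b,
                     const cpx_float_t *w, int nblks, int bsize)
{
    /* The asm loops are do-while style, so both counts must be > 0. */
    assert(nblks > 0 && bsize > 0);

    for (int blk = 0; blk < nblks; blk++) {
        const cpx_float_t *wtmp = w;           /* wtmp = _W */
        for (int k = 0; k < bsize; k++) {
            /* t = B * W (complex multiply) */
            float tr = b->re * wtmp->re - b->im * wtmp->im;
            float ti = b->re * wtmp->im + b->im * wtmp->re;
            wtmp += nblks;                     /* wtmp += _nblks entries */

            b->re = a->re - tr;                /* B = A - B W */
            b->im = a->im - ti;
            a->re = a->re + tr;                /* A = A + B W */
            a->im = a->im + ti;
            a++;
            b++;
        }
        a += bsize;                            /* skip the other half */
        b += bsize;
    }
}
```

Running both versions on the same input and comparing the buffers element by element (with a small tolerance) should show quickly whether the NEON code is right.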

To your knowledge, is it possible to link the assembly-optimised
"omxSP.a" library compiled with RVCT against a "hello FFT world" built with gcc?


> Michele Bavaro wrote:
>> Hello Philip,
>> thank you for coming back on this subject.
>> I modified the library developed by Gregory Heckler, the source code is
>> here:
>> http://github.com/gps-sdr/gps-sdr/tree/6153c01317f34a26b2fb41926505b9d97f764e90/objects
>> To give you an example, the DIT butterfly looks like this:
>> #define BUTTERFLY_FWD(_A, _B, _W)                                    \
>>   __asm__ ("LDR    r0, [%0]            \n\t"                         \
>>         "LDR    r2, [%1]            \n\t"                            \
>>         "MOV    r3, #0              \n\t"                            \
>>         "SHADD16  r0, r0, r3        \n\t"                            \
>>         "SHADD16  r2, r2, r3        \n\t"                            \
>>         "LDR    r3, [%2]            \n\t"                            \
>>         "SMUADX r5, r2, r3          \n\t"                            \
>>         "SMUSD  r4, r2, r3          \n\t"                            \
>>         "ADD    r5, r5, #8192       \n\t"                            \
>>         "ADD    r4, r4, #8192       \n\t"                            \
>>         "ASR    r4, r4, #14         \n\t"                            \
>>         "PKHBT  r3, r4, r5, LSL #2  \n\t"                            \
>>         "QSUB16 r2, r0, r3          \n\t"                            \
>>         "QADD16 r0, r0, r3          \n\t"                            \
>>         "STR    r0, [%0]            \n\t"                            \
>>         "STR    r2, [%1]            \n\t"                            \
>>         ::"r" (_A), "r" (_B), "r" (_W)                               \
>>         :"r0", "r2", "r3", "r4", "r5", "memory")
>> and just uses ARM assembly (NEON is complicated to use with this basic
>> radix2 implementation).
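
For anyone reading the macro above without the ARMv6 SIMD instruction reference at hand, this is my understanding of what it computes, written as a plain C sketch. The names cpx16_t, sat16 and butterfly_q14 are made up for illustration. I assume the real part sits in the low halfword and the imaginary part in the high halfword, that the twiddles are in Q14, and that >> on a negative value is an arithmetic shift (true for gcc on ARM, but implementation-defined in C).

```c
#include <assert.h>
#include <stdint.h>

/* Assumed packing: re in the low halfword, im in the high halfword. */
typedef struct { int16_t re, im; } cpx16_t;

/* Saturate to 16 bits, as QADD16 / QSUB16 do per halfword. */
static int16_t sat16(int32_t x)
{
    if (x > INT16_MAX) return INT16_MAX;
    if (x < INT16_MIN) return INT16_MIN;
    return (int16_t)x;
}

/* Hypothetical scalar model of BUTTERFLY_FWD with Q14 twiddles. */
static void butterfly_q14(cpx16_t *a, cpx16_t *b, const cpx16_t *w)
{
    assert(a && b && w);

    /* SHADD16 with zero: halve both inputs to leave headroom. */
    int32_t ar = a->re >> 1, ai = a->im >> 1;
    int32_t br = b->re >> 1, bi = b->im >> 1;

    /* SMUSD gives the real part, SMUADX the imaginary part;
     * add 8192 to round, then drop the 14 fractional bits. */
    int32_t tr = (br * w->re - bi * w->im + 8192) >> 14;
    int32_t ti = (br * w->im + bi * w->re + 8192) >> 14;

    /* PKHBT keeps only 16 bits of each result (no saturation here). */
    tr = (int16_t)tr;
    ti = (int16_t)ti;

    b->re = sat16(ar - tr);   /* QSUB16: B = A - B W */
    b->im = sat16(ai - ti);
    a->re = sat16(ar + tr);   /* QADD16: A = A + B W */
    a->im = sat16(ai + ti);
}
```

The halving at the start means each rank scales the data by 1/2, so an N-point FFT built from this butterfly comes out scaled by 1/N.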
> You'll need to use NEON to get performance. I'll try and look over the
> algorithm and see if I can make some suggestions.
>> As user space, I am using the Angstrom image v0.92:
>> http://www.gumstix.net/overo-gm-images/v0.92/
>> on my Overo Water. I use the free CodeSourcery 2009q1 toolchain, although
>> today Koen suggested that I try something else.
> Use the toolchains created by OE. A quick way to do that is set your
> path into the oe/tmp/cross/... directory.
> Philip
>> Regards,
>> Michele
>>> Michele Bavaro wrote:
>>>> Hello everyone,
>>>> I'm porting my software GPS receiver to the OMAP, so I need fast
>>>> signal processing libraries, and FFTs in particular.
>>>> I have adapted an open source library to do the radix-2 butterfly using
>>>> ARM assembly. It works, but my 256-point, 16-bit fixed-point FFT still
>>>> takes about 60us. That's more than 12 times slower than the 4.7us
>>>> advertised with NEON!
>>> What open source FFT library? You could try posting the code and seeing
>>> if anyone has any suggestions. (Post the code to the Beagle list also,
>>> there are some good NEON people there)
>>>> Frustrated, I downloaded the OpenMAX libraries and compiled them with the
>>>> evaluation version of RVCT, but I can't manage to link the object file
>>>> with code compiled with the CodeSourcery GNU toolchain.
>>> What user space are you using? Angstrom or something else? You'll need
>>> to use a tool chain that matches your user space.
>>> Philip
>>>> I tried to translate the assembly, but unfortunately it's a very
>>>> challenging task for me.
>>>> Can someone point me in the right direction on this subject?
>>>> Should I keep working on my fixed point 16 bit FFT? Should I buy the ARM
>>>> toolchain and port all the software? Should I just give up and try using
>>>> the DSP maybe?
>>>> Thank you in advance for any reply, and good luck with the OpenSDR, which
>>>> I'm watching very closely.
>>>> Cheers,
>>>> Michele

