Hello Philip,
I know NEON performs better, but it is far more complicated to use.
A naive radix-2 butterfly in floating point could look like this:
static void Rank(cpx_float_t *_A, cpx_float_t *_B, floatmix_t *_W,
int _nblks, int _bsize) {
__asm__ ("mov r0, %0 \n\t" // _A
"mov r1, %1 \n\t" // _B
"mov r3, %3 \n\t" // blkIdx = _nblks
"1: \n\t" // for blkIdx
"mov r2, %2 \n\t" // wtmp = _W
"mov r4, %4 \n\t" // szIdx = _bsize
"2: \n\t" // for szIdx
"vld1.32 {d1}, [r1] \n\t" // load _B
"vld1.32 {d2}, [r2] \n\t" // load wtmp
"add r2, r2, %3, lsl #3 \n\t" // wtmp += _nblks*8
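"vmul.f32 d0, d2, d1 \n\t" // complex mul, d1 = (Br, Bi), d2 = (Wr, Wi): d0 = (Wr*Br, Wi*Bi)
"vrev64.32 d3, d2 \n\t" // d3 = (Wi, Wr)
"vmul.f32 d3, d3, d1 \n\t" // d3 = (Wi*Br, Wr*Bi)
"vneg.f32 s1, s1 \n\t" // s1 is the high half of d0: d0 = (Wr*Br, -Wi*Bi)
"vtrn.32 d0, d3 \n\t" // d0 = (Wr*Br, Wi*Br), d3 = (-Wi*Bi, Wr*Bi)
"vadd.f32 d3, d0, d3 \n\t" // d3 = _B W = (Wr*Br - Wi*Bi, Wi*Br + Wr*Bi)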
"vld1.32 {d0}, [r0] \n\t" // load _A
"vsub.f32 d1, d0, d3 \n\t" // _B = _A - _B W
"vadd.f32 d0, d0, d3 \n\t" // _A = _A + _B W
"vst1.64 {d0}, [r0]! \n\t" // store _A and update
"vst1.64 {d1}, [r1]! \n\t" // store _B and update
"subs r4, r4, #1 \n\t" // szIdx--
"bgt 2b \n\t" // branch if > 0
"add r0, r0, %4, lsl #3 \n\t" // _A += _bsize*8
"add r1, r1, %4, lsl #3 \n\t" // _B += _bsize*8
"subs r3, r3, #1 \n\t" // blkIdx--
"bgt 1b \n\t" // branch if > 0
:: "r"(_A), "r"(_B), "r"(_W), "r" (_nblks), "r" (_bsize)
: "d0", "d1", "d2", "d3", "r0", "r1", "r2", "r3", "r4", "cc", "memory");
}
and this is twice as slow as the fixed-point version.
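For checking the assembly against something simpler, the same loop can be written in portable C. This is only a sketch under assumptions: cpx_ref_t is a hypothetical stand-in for cpx_float_t (assumed to be an interleaved re/im pair of floats), the twiddle table is assumed to hold the same kind of element with a stride of _nblks entries, and _A and _B are assumed to point into one buffer where each block of _bsize A values is followed by _bsize B values, as the pointer arithmetic in the assembly suggests.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for cpx_float_t: interleaved (re, im) floats. */
typedef struct { float re, im; } cpx_ref_t;

/* Reference for one FFT rank: A' = A + B*W, B' = A - B*W. */
static void rank_ref(cpx_ref_t *a, cpx_ref_t *b, const cpx_ref_t *w,
                     int nblks, int bsize)
{
    /* The assembly loops always run at least once, so require positive counts. */
    assert(nblks > 0 && bsize > 0);

    for (int blk = 0; blk < nblks; blk++) {
        /* Each block spans bsize A values followed by bsize B values. */
        size_t base = (size_t)blk * 2u * (size_t)bsize;

        for (int i = 0; i < bsize; i++) {
            cpx_ref_t *pa = &a[base + (size_t)i];
            cpx_ref_t *pb = &b[base + (size_t)i];
            /* Twiddles restart for every block and advance by nblks entries. */
            const cpx_ref_t *pw = &w[(size_t)i * (size_t)nblks];

            /* t = B * W (complex multiply) */
            float tr = pb->re * pw->re - pb->im * pw->im;
            float ti = pb->re * pw->im + pb->im * pw->re;

            float ar = pa->re, ai = pa->im;
            pb->re = ar - tr;  pb->im = ai - ti;   /* B' = A - B*W */
            pa->re = ar + tr;  pa->im = ai + ti;   /* A' = A + B*W */
        }
    }
}
```

Running both versions on the same input and comparing the buffers is a quick way to validate the NEON code before timing it.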
To your knowledge, is it possible to link the assembly-optimised
"omxSP.a" library compiled by RVCT with a "hello FFT world" program built with gcc?
Cheers,
Michele
> Michele Bavaro wrote:
>> Hello Philip,
>>
>> thank you for coming back on this subject.
>> I modified the library developed by Gregory Heckler, the source code is
>> here:
>>
>> http://github.com/gps-sdr/gps-sdr/tree/6153c01317f34a26b2fb41926505b9d97f764e90/objects
>>
>> To give you an example, the DIT butterfly looks like this:
>>
>>
>> #define BUTTERFLY_FWD(_A, _B, _W) \
>> __asm__ ("LDR r0, [%0] \n\t" \
>> "LDR r2, [%1] \n\t" \
>> "MOV r3, #0 \n\t" \
>> "SHADD16 r0, r0, r3 \n\t" \
>> "SHADD16 r2, r2, r3 \n\t" \
>> "LDR r3, [%2] \n\t" \
>> "SMUADX r5, r2, r3 \n\t" \
>> "SMUSD r4, r2, r3 \n\t" \
>> "ADD r5, r5, #8192 \n\t" \
>> "ADD r4, r4, #8192 \n\t" \
>> "ASR r4, r4, #14 \n\t" \
>> "PKHBT r3, r4, r5, LSL #2 \n\t" \
>> "QSUB16 r2, r0, r3 \n\t" \
>> "QADD16 r0, r0, r3 \n\t" \
>> "STR r0, [%0] \n\t" \
>> "STR r2, [%1] \n\t" \
>> ::"r" (_A), "r" (_B), "r" (_W) \
>> :"r0", "r2", "r3", "r4", "r5", "memory")
>>
>>
>>
>> and just uses ARM assembly (NEON is complicated to use with this basic
>> radix2 implementation).
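A plain C model of the quoted BUTTERFLY_FWD macro may make its scaling easier to follow. It is a sketch under assumptions, not the library's code: cpx_q_ref_t, sat16 and butterfly_fwd_ref are hypothetical names, each 32-bit word is assumed to hold the real part in the low halfword and the imaginary part in the high halfword, the twiddles are assumed to be Q14 (which the +8192 rounding and the 14-bit shift suggest), and >> on negative values is assumed to be an arithmetic shift, as it is with gcc.

```c
#include <stdint.h>

/* Hypothetical 16-bit complex sample: re in the low halfword, im in the high. */
typedef struct { int16_t re, im; } cpx_q_ref_t;

/* Saturate to 16 bits, as QADD16/QSUB16 do per halfword. */
static int16_t sat16(int32_t x)
{
    if (x > INT16_MAX) return INT16_MAX;
    if (x < INT16_MIN) return INT16_MIN;
    return (int16_t)x;
}

/* Model of BUTTERFLY_FWD: A' = A/2 + (B/2)*W, B' = A/2 - (B/2)*W. */
static void butterfly_fwd_ref(cpx_q_ref_t *a, cpx_q_ref_t *b,
                              const cpx_q_ref_t *w)
{
    /* SHADD16 with a zero operand halves both halfwords (one bit of headroom). */
    int32_t ar = a->re >> 1, ai = a->im >> 1;
    int32_t br = b->re >> 1, bi = b->im >> 1;

    /* SMUSD gives the real part, SMUADX the imaginary part;
       add 2^13 and shift by 14 to round the Q14 product.
       Assumes the result fits in 16 bits (true for unit-magnitude twiddles). */
    int32_t tr = (br * w->re - bi * w->im + 8192) >> 14;
    int32_t ti = (br * w->im + bi * w->re + 8192) >> 14;

    b->re = sat16(ar - tr);  b->im = sat16(ai - ti);   /* QSUB16 */
    a->re = sat16(ar + tr);  a->im = sat16(ai + ti);   /* QADD16 */
}
```

Because every stage halves its inputs, the output of an N-point FFT built from this butterfly is scaled down by N overall.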
>
> You'll need to use NEON to get performance. I'll try and look over the
> algorithm and see if I can make some suggestions.
>
>>
>> As user space, I am using the Angstrom image v0.92:
>>
>> http://www.gumstix.net/overo-gm-images/v0.92/
>>
>> on my Overo Water. I use the CodeSourcery 2009q1 free toolchain, even
>> though Koen suggested today that I try something else.
>
> Use the toolchains created by OE. A quick way to do that is set your
> path into the oe/tmp/cross/... directory.
>
> Philip
>
>>
>>
>> Regards,
>> Michele
>>
>>
>>
>>
>>> Michele Bavaro wrote:
>>>> Hello everyone,
>>>>
>>>> I'm porting my software GPS receiver to the OMAP, so I need fast
>>>> signal processing libraries, and in particular FFTs.
>>>>
>>>> I have adapted an open source library to do the radix-2 butterfly using
>>>> ARM assembly. It works, but my 256-point, 16-bit fixed-point FFT still
>>>> takes about 60us. That's 12 times slower than the 4.7us advertised with
>>>> NEON!
>>> What open source FFT library? You could try posting the code and seeing
>>> if anyone has any suggestions. (Post the code to the Beagle list also,
>>> there are some good NEON people there)
>>>
>>>> Frustrated, I downloaded the OpenMAX libraries and compiled them with the
>>>> evaluation version of RVCT, but I can't manage to link the object file
>>>> with code compiled with the CodeSourcery GNU toolchain.
>>> What user space are you using? Angstrom or something else? You'll need
>>> to use a tool chain that matches your user space.
>>>
>>> Philip
>>>
>>>> I tried to translate the assembly, but unfortunately it's a very
>>>> challenging task for me.
>>>>
>>>> Can someone point me in the right direction on this subject?
>>>> Should I keep working on my fixed point 16 bit FFT? Should I buy the ARM
>>>> toolchain and port all the software? Should I just give up and try using
>>>> the DSP maybe?
>>>>
>>>> Thank you in advance for any reply, and good luck with the OpenSDR, which
>>>> I'm watching very closely.
>>>>
>>>> Cheers,
>>>> Michele
>>>>
>>>>
>>>>
>>
>>
>>
>