Hello Philip,

I know NEON performs better, but it is far more complicated to use. Written naively, a basic radix-2 butterfly in floating point could look like this:

static void Rank(cpx_float_t *_A, cpx_float_t *_B, floatmix_t *_W,
                 int _nblks, int _bsize)
{
    __asm__ ("mov r0, %0 \n\t"                 // _A
             "mov r1, %1 \n\t"                 // _B
             "mov r3, %3 \n\t"                 // blkIdx = _nblks
             "1: \n\t"                         // for blkIdx
             "mov r2, %2 \n\t"                 // wtmp = _W
             "mov r4, %4 \n\t"                 // szIdx = _bsize
             "2: \n\t"                         // for szIdx
             "vld1.32 {d1}, [r1] \n\t"         // load _B
             "vld1.32 {d2}, [r2] \n\t"         // load wtmp
             "add r2, r2, %3, lsl #3 \n\t"     // wtmp += _nblks*8
             "vmul.f32 d0, d2, d1 \n\t"        // complex mul: (Wr*Br, Wi*Bi)
             "vrev64.32 d3, d2 \n\t"           // swap W to (Wi, Wr)
             "vmul.f32 d3, d3, d1 \n\t"        // (Wi*Br, Wr*Bi)
             "vneg.f32 s1, s1 \n\t"            // negate Wi*Bi (upper half of d0)
             "vtrn.32 d0, d3 \n\t"             // regroup the partial products
             "vadd.f32 d3, d0, d3 \n\t"        // d3 = _B W
             "vld1.32 {d0}, [r0] \n\t"         // load _A
             "vsub.f32 d1, d0, d3 \n\t"        // _B = _A - _B W
             "vadd.f32 d0, d0, d3 \n\t"        // _A = _A + _B W
             "vst1.64 {d0}, [r0]! \n\t"        // store _A and update
             "vst1.64 {d1}, [r1]! \n\t"        // store _B and update
             "subs r4, r4, #1 \n\t"            // szIdx--
             "bgt 2b \n\t"                     // branch if > 0
             "add r0, r0, %4, lsl #3 \n\t"     // _A += _bsize*8
             "add r1, r1, %4, lsl #3 \n\t"     // _B += _bsize*8
             "subs r3, r3, #1 \n\t"            // blkIdx--
             "bgt 1b \n\t"                     // branch if > 0
             :: "r"(_A), "r"(_B), "r"(_W), "r" (_nblks), "r" (_bsize)
             : "d0", "d1", "d2", "d3", "r0", "r1", "r2", "r3", "r4", "memory");
}

and it is twice as slow as the fixed-point version.

To your knowledge, is it possible to link the assembly-optimised "omxSP.a" library compiled by RVCT with a "hello FFT world" built with gcc?

Cheers,
Michele

> Michele Bavaro wrote:
>> Hello Philip,
>>
>> thank you for coming back on this subject.
>> I modified the library developed by Gregory Heckler, the source code is
>> here:
>>
>> http://github.com/gps-sdr/gps-sdr/tree/6153c01317f34a26b2fb41926505b9d97f764e90/objects
>>
>> To give you an example, the DIT butterfly looks like this:
>>
>>
>> #define BUTTERFLY_FWD(_A, _B, _W) \
>> __asm__ ("LDR r0, [%0] \n\t" \
>> "LDR r2, [%1] \n\t" \
>> "MOV r3, #0 \n\t" \
>> "SHADD16 r0, r0, r3 \n\t" \
>> "SHADD16 r2, r2, r3 \n\t" \
>> "LDR r3, [%2] \n\t" \
>> "SMUADX r5, r2, r3 \n\t" \
>> "SMUSD r4, r2, r3 \n\t" \
>> "ADD r5, r5, #8192 \n\t" \
>> "ADD r4, r4, #8192 \n\t" \
>> "ASR r4, r4, #14 \n\t" \
>> "PKHBT r3, r4, r5, LSL #2 \n\t" \
>> "QSUB16 r2, r0, r3 \n\t" \
>> "QADD16 r0, r0, r3 \n\t" \
>> "STR r0, [%0] \n\t" \
>> "STR r2, [%1] \n\t" \
>> ::"r" (_A), "r" (_B), "r" (_W) \
>> :"r0", "r2", "r3", "r4", "r5", "memory")
>>
>>
>>
>> and just uses ARM assembly (NEON is complicated to use with this basic
>> radix2 implementation).
>
> You'll need to use NEON to get performance. I'll try and look over the
> algorithm and see if I can make some suggestions.
>
>>
>> As user space, I am using the Angstrom image v0.92:
>>
>> http://www.gumstix.net/overo-gm-images/v0.92/
>>
>> on my Overo Water. I use the CodeSourcery 2009q1 free toolchain, even
>> though today I've been suggested to try something else by Koen.
>
> Use the toolchains created by OE. A quick way to do that is set your
> path into the oe/tmp/cross/... directory.
>
> Philip
>
>>
>>
>> Regards,
>> Michele
>>
>>
>>
>>
>>> Michele Bavaro wrote:
>>>> Hello everyone,
>>>>
>>>> I'm porting my software GPS receiver on the OMAP, therefore I need
>>>> fast
>>>> signal processing libraries, and in particular FFTs.
>>>>
>>>> I have somehow adapted an open source library to do radix2 butterfly
>>>> using
>>>> ARM assembly. It works, but my 256 points fixed point 16 bit FFT still
>>>> takes about 60us. That's 12 times slower than 4.7us advertised with
>>>> NEON!
>>> What open source FFT library?
>>> You could try posting the code and seeing
>>> if anyone has any suggestions. (Post the code the Beagle list also,
>>> there are some good NEON people there)
>>>
>>>> Frustrated, I downloaded and compiled with the evaluation version of
>>>> RVCT
>>>> the openMAX libraries, but I don't manage to link the object file with
>>>> code compiled with the CodeSourcery gnu toolchain.
>>> What user space are you using? Angstrom or something else. You'll need
>>> to use a tool chain that matches your user space.
>>>
>>> Philip
>>>
>>>> I tried to translate the assembly, but unfortunately it's a very
>>>> challenging task for me.
>>>>
>>>> Can someone point me in the right direction on this subject?
>>>> Should I keep working on my fixed point 16 bit FFT? Should I buy the
>>>> ARM
>>>> toolchain and port all the software? Should I just give up and try
>>>> using
>>>> the DSP maybe?
>>>>
>>>> Thank you in advance for any reply, and good luck with the OpenSDR,
>>>> which
>>>> I'm watching very closely.
>>>>
>>>> Cheers,
>>>> Michele
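
For reference, the arithmetic performed by the Rank() assembly at the top of this message can be sketched in plain C, which may be handy for checking the output of any NEON rewrite. This is only a sketch under assumptions: the definitions of cpx_float_t and floatmix_t are not shown in the thread, so the struct below (two packed floats, real part first) and the use of the same type for the twiddles are guesses, and the names butterfly() and rank_ref() are made up for illustration.

```c
#include <assert.h>

/* Assumed layout: two packed 32-bit floats, real part first, matching
   what a single vld1.32 {dN} load would pick up. Not from the thread. */
typedef struct { float re, im; } cpx_float_t;

/* One radix-2 butterfly: a' = a + b*w, b' = a - b*w. */
static void butterfly(cpx_float_t *a, cpx_float_t *b, cpx_float_t w)
{
    /* Complex product t = b * w. */
    float tr = b->re * w.re - b->im * w.im;
    float ti = b->re * w.im + b->im * w.re;
    float ar = a->re, ai = a->im;

    b->re = ar - tr;
    b->im = ai - ti;
    a->re = ar + tr;
    a->im = ai + ti;
}

/* One FFT rank with the same loop structure as the assembly:
   nblks blocks of bsize butterflies, twiddle stride of nblks elements. */
static void rank_ref(cpx_float_t *a, cpx_float_t *b, const cpx_float_t *w,
                     int nblks, int bsize)
{
    /* The assembly loops always run at least once, so require that here. */
    assert(nblks > 0 && bsize > 0);

    for (int blk = 0; blk < nblks; ++blk) {
        const cpx_float_t *wtmp = w;       /* wtmp = _W */
        for (int k = 0; k < bsize; ++k) {
            butterfly(a++, b++, *wtmp);
            wtmp += nblks;                 /* wtmp += _nblks*8 bytes */
        }
        a += bsize;                        /* skip over the B half */
        b += bsize;
    }
}
```

Running rank_ref() and Rank() on copies of the same buffer and comparing the results should show quickly whether a hand-optimised version still computes the same butterflies.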