Hi Philip,
actually, that code implements an in-place FFT.
The base radix-2 butterfly has a complex mul of the B branch with the
twiddle factor:
void bfly(cpx_int16_t *_A, cpx_int16_t *_B, cpx_int16_t *_W)
{
    sint32_t bi, bq;

    // Pre-scale both inputs by 1/2 (one bit of headroom per stage)
    _A->i >>= 1;
    _A->q >>= 1;
    _B->i >>= 1;
    _B->q >>= 1;

    bi = _B->i*_W->i - _B->q*_W->q; // Complex MUL - real
    bq = _B->i*_W->q + _B->q*_W->i; // Complex MUL - imag

    // Round to nearest and drop the 14 fractional bits of the product
    bi = (bi + 8192) >> 14;
    bq = (bq + 8192) >> 14;

    // In place: B = A - B*W, then A = A + B*W
    _B->i = _A->i - (sint16_t)bi;
    _B->q = _A->q - (sint16_t)bq;
    _A->i += (sint16_t)bi;
    _A->q += (sint16_t)bq;
}
The code I sent you translates this into ARM assembly.
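One detail of the macro quoted further down that is easy to miss: only the
real accumulator (r4) gets an explicit ASR #14. The imaginary one (r5) is
shifted inside PKHBT, because the top halfword of (r5 << 2) holds the same
bits as the low halfword of (r5 >> 14). A rough C model of that packing step
is sketched below. pack_q14 is a name made up for illustration; I assume that
i sits in the low halfword and q in the high one, and that right-shifting a
negative value is an arithmetic shift (true for the usual ARM compilers, but
implementation-defined in C).

```c
#include <stdint.h>

/* Hypothetical helper (not part of the library): models
 *   ASR   r4, r4, #14
 *   PKHBT r3, r4, r5, LSL #2
 * on the two rounded 32-bit accumulators. */
uint32_t pack_q14(int32_t re_acc, int32_t im_acc)
{
    /* Low halfword: bits 15..0 of the explicitly shifted real part. */
    uint32_t lo = (uint32_t)(re_acc >> 14) & 0x0000FFFFu;

    /* High halfword: bits 31..16 of (im_acc << 2), which are bits 29..14
     * of im_acc, i.e. the low 16 bits of (im_acc >> 14). */
    uint32_t hi = ((uint32_t)im_acc << 2) & 0xFFFF0000u;

    return hi | lo;
}
```

So the result word carries both scaled halves, ready for QSUB16/QADD16,
with one shift instruction saved.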
To fully exploit NEON, though, I think you need to group 4 butterflies
together. The openMAX code does this in a way that is not so easy for me to
follow, and reduces most of the computation to something like:
VUZP dW1,dW2
VUZP dX1,dX3
VMULL qT0,dX1,dW1
VMLSL qT0,dX3,dW2 ;// real part
VMULL qT1,dX3,dW1
VMLAL qT1,dX1,dW2 ;// imag part
VRSHRN dX1,qT0,#15
VRSHRN dX3,qT1,#15
VZIP dX1,dX3
I understand that they deinterleave B into Dn (real) and Dm (imag), and
likewise W, and then do 4 multiplies (16*4 = 64 bits) at a time. But really,
they use a split-radix, out-of-place, clever (too clever for me) algorithm.
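To check my reading of the sequence above, here is a plain C model of what I
believe the nine instructions compute for one group of four complex values.
This is only a sketch: cmul4_q15 and rshrn15 are names I made up, the
twiddles are assumed to be Q15 with |w| <= 1 (suggested by the #15 shift),
and an arithmetic right shift is assumed for negative values. It is not the
openMAX code.

```c
#include <stdint.h>

#define LANES 4

/* Models VRSHRN #15 on one lane: round to nearest, shift right by 15,
 * keep the low 16 bits (no saturation). */
int16_t rshrn15(int32_t x)
{
    return (int16_t)(((int64_t)x + (1 << 14)) >> 15);
}

/* x and w each hold LANES interleaved complex values: re0, im0, re1, ...
 * x is overwritten with x * w, lane by lane. */
void cmul4_q15(int16_t x[2 * LANES], const int16_t w[2 * LANES])
{
    int16_t xr[LANES], xi[LANES], wr[LANES], wi[LANES];
    int k;

    /* VUZP: split interleaved data into real and imaginary vectors. */
    for (k = 0; k < LANES; k++) {
        xr[k] = x[2 * k];  xi[k] = x[2 * k + 1];
        wr[k] = w[2 * k];  wi[k] = w[2 * k + 1];
    }

    for (k = 0; k < LANES; k++) {
        /* VMULL + VMLSL: real part, xr*wr - xi*wi (32-bit accumulator). */
        int32_t t0 = (int32_t)xr[k] * wr[k] - (int32_t)xi[k] * wi[k];
        /* VMULL + VMLAL: imaginary part, xi*wr + xr*wi. */
        int32_t t1 = (int32_t)xi[k] * wr[k] + (int32_t)xr[k] * wi[k];

        /* VRSHRN + VZIP: narrow to 16 bits and re-interleave. */
        x[2 * k]     = rshrn15(t0);
        x[2 * k + 1] = rshrn15(t1);
    }
}
```

On NEON each of the loops collapses into one or two instructions working on
all four lanes at once, which is where the speed-up over my one-butterfly
macro would come from.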
The question is whether it is possible to link libraries compiled with RVCT
using the GNU ARM toolchain (btw Koen, I started using it today), whether it
makes sense to translate all the assembly code (nightmare), or whether I
should just buy RVCT.
=> Poor people have a slow FFT.
Cheers,
Michele
> Michele Bavaro wrote:
>> Hello Philip,
>>
>> thank you for coming back on this subject.
>> I modified the library developed by Gregory Heckler, the source code is
>> here:
>>
>> http://github.com/gps-sdr/gps-sdr/tree/6153c01317f34a26b2fb41926505b9d97f764e90/objects
>>
>> To give you an example, the DIT butterfly looks like this:
>
> So basically, you need to calculate (where a, b, c, w are complex)
>
> c[0] = a[0] + b[1] * w
> c[1] = a[1] + b[0] * w
>
> If this is correct, I'll try and come up with a NEON way exploiting the
> SIMD nature of NEON.
>
> Philip
>
>
>
>>
>>
>> #define BUTTERFLY_FWD(_A, _B, _W) \
>> __asm__ ("LDR r0, [%0] \n\t" \
>> "LDR r2, [%1] \n\t" \
>> "MOV r3, #0 \n\t" \
>> "SHADD16 r0, r0, r3 \n\t" \
>> "SHADD16 r2, r2, r3 \n\t" \
>> "LDR r3, [%2] \n\t" \
>> "SMUADX r5, r2, r3 \n\t" \
>> "SMUSD r4, r2, r3 \n\t" \
>> "ADD r5, r5, #8192 \n\t" \
>> "ADD r4, r4, #8192 \n\t" \
>> "ASR r4, r4, #14 \n\t" \
>> "PKHBT r3, r4, r5, LSL #2 \n\t" \
>> "QSUB16 r2, r0, r3 \n\t" \
>> "QADD16 r0, r0, r3 \n\t" \
>> "STR r0, [%0] \n\t" \
>> "STR r2, [%1] \n\t" \
>> ::"r" (_A), "r" (_B), "r" (_W) \
>> :"r0", "r2", "r3", "r4", "r5", "memory")
>>
>>
>>
>> and just uses ARM assembly (NEON is complicated to use with this basic
>> radix2 implementation).
>>
>> As user space, I am using the Angstrom image v0.92:
>>
>> http://www.gumstix.net/overo-gm-images/v0.92/
>>
>> on my Overo Water. I use the free CodeSourcery 2009q1 toolchain, even
>> though Koen suggested today that I try something else.
>>
>>
>> Regards,
>> Michele
>>
>>
>>
>>
>>> Michele Bavaro wrote:
>>>> Hello everyone,
>>>>
>>>> I'm porting my software GPS receiver to the OMAP, so I need fast
>>>> signal processing libraries, and in particular FFTs.
>>>>
>>>> I have adapted an open source library to do the radix-2 butterfly in
>>>> ARM assembly. It works, but my 256-point, 16-bit fixed-point FFT still
>>>> takes about 60us. That's about 12 times slower than the 4.7us
>>>> advertised with NEON!
>>> What open source FFT library? You could try posting the code and seeing
>>> if anyone has any suggestions. (Post the code to the Beagle list also;
>>> there are some good NEON people there.)
>>>
>>>> Frustrated, I downloaded the openMAX libraries and compiled them with
>>>> the evaluation version of RVCT, but I can't manage to link the object
>>>> file with code compiled with the CodeSourcery GNU toolchain.
>>> What user space are you using? Angstrom or something else? You'll need
>>> to use a toolchain that matches your user space.
>>>
>>> Philip
>>>
>>>> I tried to translate the assembly, but unfortunately it's a very
>>>> challenging task for me.
>>>>
>>>> Can someone point me in the right direction on this subject?
>>>> Should I keep working on my fixed-point 16-bit FFT? Should I buy the
>>>> ARM toolchain and port all the software? Should I just give up and
>>>> maybe try using the DSP?
>>>>
>>>> Thank you in advance for any reply, and good luck with the OpenSDR,
>>>> which I'm watching very closely.
>>>>
>>>> Cheers,
>>>> Michele
>>>>
>>>>
>>>>
>>
>>
>>
>