Hi Philip,

Actually, that code implements an in-place FFT.
The basic radix-2 butterfly performs a complex multiply of the B branch by
the twiddle factor:

void bfly(cpx_int16_t *_A, cpx_int16_t *_B, cpx_int16_t *_W)
{
  sint32_t bi, bq;

  _A->i >>= 1;                        // Scale both inputs by 1/2
  _A->q >>= 1;
  _B->i >>= 1;
  _B->q >>= 1;

  bi = _B->i*_W->i - _B->q*_W->q;     // Complex MUL - real
  bq = _B->i*_W->q + _B->q*_W->i;     // Complex MUL - imag

  bi = (bi + 8192) >> 14;             // Round to nearest, drop 14 fractional bits
  bq = (bq + 8192) >> 14;

  _B->i = _A->i - (sint16_t)bi;
  _B->q = _A->q - (sint16_t)bq;

  _A->i += (sint16_t) bi;
  _A->q += (sint16_t) bq;
}
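
For anyone who wants to try the butterfly on a PC before touching assembly,
here is a minimal self-contained sketch of the same arithmetic. The typedefs
and the names bfly_q14 and Q14_ONE are my assumptions (the post does not show
the type definitions); the twiddle is taken to be Q14, which is what the
rounding constant 8192 and the shift by 14 suggest. Like bfly() above, it
relies on >> being an arithmetic shift for negative values.

```c
#include <stdint.h>

/* Assumed type definitions: the original post does not show them. */
typedef int16_t sint16_t;
typedef int32_t sint32_t;
typedef struct { sint16_t i, q; } cpx_int16_t;   /* i = real, q = imag */

#define Q14_ONE 16384   /* 1.0 in Q14 (hypothetical name) */

/* Same arithmetic as bfly() above, restated so this block compiles alone. */
static void bfly_q14(cpx_int16_t *a, cpx_int16_t *b, const cpx_int16_t *w)
{
  sint32_t bi, bq;

  a->i >>= 1;  a->q >>= 1;            /* scale both inputs by 1/2 */
  b->i >>= 1;  b->q >>= 1;

  bi = (sint32_t)b->i * w->i - (sint32_t)b->q * w->q;   /* real part */
  bq = (sint32_t)b->i * w->q + (sint32_t)b->q * w->i;   /* imag part */

  bi = (bi + 8192) >> 14;             /* round, back to 16-bit scale */
  bq = (bq + 8192) >> 14;

  b->i = (sint16_t)(a->i - bi);       /* B' = A - B*W (uses old A) */
  b->q = (sint16_t)(a->q - bq);
  a->i = (sint16_t)(a->i + bi);       /* A' = A + B*W */
  a->q = (sint16_t)(a->q + bq);
}
```

With A = (1000, 2000), B = (400, -600) and W = (Q14_ONE, 0), i.e. W = 1.0,
the call leaves A = (700, 700) and B = (300, 1300).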

The code I sent you translates this into ARM assembly.
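
To make the mapping easier to follow, here is a rough C model of the packed
16-bit instructions used in the BUTTERFLY_FWD macro quoted further down. It
is only a sketch: the helper names are hypothetical, I assume a little-endian
layout with the real part (i) in the low halfword, and the saturating
QADD16/QSUB16 at the end of the macro are not modelled.

```c
#include <stdint.h>

/* Packed complex word: low halfword = real (i), high halfword = imag (q). */
static int16_t lo16(uint32_t x) { return (int16_t)(x & 0xFFFFu); }
static int16_t hi16(uint32_t x) { return (int16_t)(x >> 16); }

static uint32_t pack16(int16_t lo, int16_t hi)
{
  return (uint32_t)(uint16_t)lo | ((uint32_t)(uint16_t)hi << 16);
}

/* SHADD16 with a zero operand: each halfword is halved. */
static uint32_t shadd16_zero(uint32_t x)
{
  return pack16((int16_t)(lo16(x) >> 1), (int16_t)(hi16(x) >> 1));
}

/* SMUSD: lo*lo - hi*hi, the real part of the complex product. */
static int32_t smusd(uint32_t a, uint32_t b)
{
  return (int32_t)lo16(a) * lo16(b) - (int32_t)hi16(a) * hi16(b);
}

/* SMUADX: lo*hi + hi*lo, the imaginary part of the complex product. */
static int32_t smuadx(uint32_t a, uint32_t b)
{
  return (int32_t)lo16(a) * hi16(b) + (int32_t)hi16(a) * lo16(b);
}

/* PKHBT rd, rn, rm, LSL #2: low half from rn, high half from (rm << 2). */
static uint32_t pkhbt_lsl2(uint32_t rn, uint32_t rm)
{
  return (rn & 0xFFFFu) | ((rm << 2) & 0xFFFF0000u);
}

/* Q14 product b*w, packed as {real, imag}, following the macro's steps. */
static uint32_t cmul_q14_packed(uint32_t b, uint32_t w)
{
  uint32_t re = (uint32_t)((smusd(b, w) + 8192) >> 14);  /* ADD, ASR #14 */
  uint32_t im = (uint32_t)(smuadx(b, w) + 8192);         /* ADD only     */
  return pkhbt_lsl2(re, im);  /* LSL #2 plus "take top half" == >> 14    */
}
```

The point of the PKHBT trick is that the imaginary part never needs its own
ASR: shifting left by 2 and keeping the top halfword is the same as shifting
right by 14 and keeping the low 16 bits.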

To fully exploit NEON, though, I think you need to group 4 butterflies
together. The OpenMAX code does this in a way that is not so easy for me to
follow, and reduces most of the computation to something like:

VUZP    dW1,dW2
VUZP    dX1,dX3
VMULL   qT0,dX1,dW1
VMLSL   qT0,dX3,dW2                       ;// real part
VMULL   qT1,dX3,dW1
VMLAL   qT1,dX1,dW2                       ;// imag part
VRSHRN  dX1,qT0,#15
VRSHRN  dX3,qT1,#15
VZIP    dX1,dX3

I understand that they deinterleave B into Dn (real) and Dm (imag), do the
same for W, and then do 4 multiplies at a time (4 x 16 = 64 bits). But really,
they use a split-radix, out-of-place, clever (too clever for me) algorithm.
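
As far as I can tell, the lane arithmetic of that NEON sequence can be
modelled in plain C as below. This is a scalar sketch, not the OpenMAX code:
the function names are hypothetical, the twiddles are assumed to be Q15
(because of the #15 in VRSHRN), and the 32-bit wraparound in the extreme
corner case (all operands equal to -32768) is not modelled.

```c
#include <stdint.h>

#define LANES 4   /* four 16-bit lanes per 64-bit D register */

/* Rounding shift right by 15, narrowed to 16 bits (models VRSHRN #15). */
static int16_t rshrn15(int32_t x)
{
  return (int16_t)((x + (1 << 14)) >> 15);
}

/* x and w each hold 4 interleaved complex values: re0, im0, re1, im1, ...
   On return x holds x*w, still interleaved. */
static void cmul4_q15(int16_t x[2 * LANES], const int16_t w[2 * LANES])
{
  int16_t xr[LANES], xi[LANES], wr[LANES], wi[LANES];
  int k;

  for (k = 0; k < LANES; k++) {        /* VUZP: split real and imag */
    xr[k] = x[2 * k];  xi[k] = x[2 * k + 1];
    wr[k] = w[2 * k];  wi[k] = w[2 * k + 1];
  }

  for (k = 0; k < LANES; k++) {
    /* VMULL + VMLSL: real part, widened to 32 bits */
    int32_t re = (int32_t)xr[k] * wr[k] - (int32_t)xi[k] * wi[k];
    /* VMULL + VMLAL: imaginary part, widened to 32 bits */
    int32_t im = (int32_t)xi[k] * wr[k] + (int32_t)xr[k] * wi[k];

    x[2 * k]     = rshrn15(re);        /* VRSHRN, then VZIP re-interleaves */
    x[2 * k + 1] = rshrn15(im);
  }
}
```

On NEON each of the inner-loop statements is a single instruction working on
all four lanes at once, which is where the speed-up over the one-butterfly-
at-a-time ARM code would come from.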

The question is whether it is possible to link libraries compiled by RVCT
with the GNU ARM toolchain (btw Koen, I started using it today), whether it
makes sense to translate all the assembly code (nightmare), or whether I
should just buy RVCT.

=> Poor people have a slow FFT.

Cheers,
Michele







> Michele Bavaro wrote:
>> Hello Philip,
>>
>> thank you for coming back on this subject.
>> I modified the library developed by Gregory Heckler, the source code is
>> here:
>>
>> http://github.com/gps-sdr/gps-sdr/tree/6153c01317f34a26b2fb41926505b9d97f764e90/objects
>>
>> To give you an example, the DIT butterfly looks like this:
>
> So basically, you need to calculate (where a, b, c, w are complex)
>
> c[0] = a[0] + b[1] * w
> c[1] = a[1] + b[0] * w
>
> If this is correct, I'll try and come up with a NEON way exploiting the
> SIMD nature of NEON.
>
> Philip
>
>
>
>>
>>
>> #define BUTTERFLY_FWD(_A, _B, _W)                                    \
>>   __asm__ ("LDR    r0, [%0]            \n\t"                         \
>>         "LDR    r2, [%1]            \n\t"                            \
>>         "MOV    r3, #0              \n\t"                            \
>>         "SHADD16  r0, r0, r3        \n\t"                            \
>>         "SHADD16  r2, r2, r3        \n\t"                            \
>>         "LDR    r3, [%2]            \n\t"                            \
>>         "SMUADX r5, r2, r3          \n\t"                            \
>>         "SMUSD  r4, r2, r3          \n\t"                            \
>>         "ADD    r5, r5, #8192       \n\t"                            \
>>         "ADD    r4, r4, #8192       \n\t"                            \
>>         "ASR    r4, r4, #14         \n\t"                            \
>>         "PKHBT  r3, r4, r5, LSL #2  \n\t"                            \
>>         "QSUB16 r2, r0, r3          \n\t"                            \
>>         "QADD16 r0, r0, r3          \n\t"                            \
>>         "STR    r0, [%0]            \n\t"                            \
>>         "STR    r2, [%1]            \n\t"                            \
>>         ::"r" (_A), "r" (_B), "r" (_W)                               \
>>         :"r0", "r2", "r3", "r4", "r5", "memory")
>>
>>
>>
>> and just uses ARM assembly (NEON is complicated to use with this basic
>> radix2 implementation).
>>
>> As user space, I am using the Angstrom image v0.92:
>>
>> http://www.gumstix.net/overo-gm-images/v0.92/
>>
>> on my Overo Water. I use the free CodeSourcery 2009q1 toolchain, even
>> though Koen suggested today that I try something else.
>>
>>
>> Regards,
>> Michele
>>
>>
>>
>>
>>> Michele Bavaro wrote:
>>>> Hello everyone,
>>>>
>>>> I'm porting my software GPS receiver to the OMAP, therefore I need fast
>>>> signal processing libraries, and in particular FFTs.
>>>>
>>>> I have somehow adapted an open source library to do the radix-2
>>>> butterfly using ARM assembly. It works, but my 256-point fixed-point
>>>> 16-bit FFT still takes about 60us. That's about 12 times slower than
>>>> the 4.7us advertised with NEON!
>>> What open source FFT library? You could try posting the code and seeing
>>> if anyone has any suggestions. (Post the code to the Beagle list also;
>>> there are some good NEON people there.)
>>>
>>>> Frustrated, I downloaded the openMAX libraries and compiled them with
>>>> the evaluation version of RVCT, but I can't manage to link the object
>>>> file with code compiled with the CodeSourcery GNU toolchain.
>>> What user space are you using? Angstrom or something else. You'll need
>>> to use a tool chain that matches your user space.
>>>
>>> Philip
>>>
>>>> I tried to translate the assembly, but unfortunately it's a very
>>>> challenging task for me.
>>>>
>>>> Can someone point me in the right direction on this subject?
>>>> Should I keep working on my fixed point 16 bit FFT? Should I buy the ARM
>>>> toolchain and port all the software? Should I just give up and try using
>>>> the DSP maybe?
>>>>
>>>> Thank you in advance for any reply, and good luck with the OpenSDR, which
>>>> I'm watching very closely.
>>>>
>>>> Cheers,
>>>> Michele
>>>>
>>>>
>>>>
>>
>>
>>
>

