Janne Grunau <[email protected]> writes:

> Approximately as fast as the ARM NEON version on Apple's A7.

I'm sorry I don't have the time to read your code carefully, but maybe
this is a good place to write down some notes on NEON FFT (based on
experience from doing a 32-bit fixed-point FFT some time ago). I think the
same approach should also work for 32-bit floats, and for other archs
with similar SIMD capabilities.

1. Write code to do four parallel fft4 in-place, in registers only (I'd
   put real and imaginary parts in separate NEON registers, four each).
   This needs only 16 instructions (vadd, vsub, vhadd, vhsub, whichever
   is appropriate), no moves, and only a few additional registers for
   temporaries.

2. Write code to do one fft16 in-place in registers, by doing four fft4
   as above, transposing and applying twiddle factors (which all fit in
   spare registers), then another four fft4.

3. Write a loop which does several fft16, by loading the few needed
   twiddle factors into registers up front, then reading 16 elements at
   a time, applying the above fft16, and storing the result.

4. Use this fft16 as a building block for fft256. This needs two passes
   of fft16 (each with a different memory access order), and one of the
   passes needs to read and apply twiddle factors from a larger table in
   memory.

To do fft128, one also needs code to do two fft8 in parallel in
registers, and some loop around that. Not quite as neat (since one can't
pack real and imaginary parts in separate 128-bit registers), but not
too ugly either.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.