Janne Grunau <[email protected]> writes:

> Approximately as fast as the ARM NEON version on Apple's A7.

I'm sorry I don't have the time to read your code carefully, but maybe
this is a good place to write down some notes on neon fft (based on
experience from doing 32-bit fixed point fft some time ago). I think the
same approach should work also for 32-bit floats, or other archs with
similar SIMD capabilities.

1. Write code to do four parallel fft4 in-place in registers only (I'd
   put real and imaginary parts in separate neon registers, four each).
   Needs only 16 instructions (vadd, vsub, vhadd, vhsub, whatever
   appropriate), no moves, and pretty few additional registers for
   temporaries.

2. Write code to do one fft16 in-place in registers, by doing four fft4
   as above, transposing and applying twiddle factors (which all fit in
   spare registers), then another four fft4.

3. Write a loop which does several fft16, by loading the few needed
   twiddle factors in registers up front, then reading 16 elements at a
   time, apply the above fft16, store.

4. Use this fft16 as a building block for fft256. Needs two passes of
   fft16 (but with different memory access order), and one of the passes
   needs to read and apply twiddle factors from a larger table in
   memory.

To do fft128, one also needs code to do two fft8 in parallel in
registers, and some loop around that. Not quite as neat (since one can't
pack real and imaginary parts in separate 128-bit registers), but not
too ugly either.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.
_______________________________________________
libav-devel mailing list
[email protected]
https://lists.libav.org/mailman/listinfo/libav-devel
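
As an illustration of step 1 in the notes above, here is a minimal sketch
in plain C rather than NEON assembly. It assumes 32-bit floats and a
forward transform; the names vec4, vadd, vsub and fft4x4 are made up for
this example and are not taken from any existing code. Each vec4 stands
in for one 128-bit register, and each lane carries one of the four
independent fft4s. The body of fft4x4 is exactly 16 lane-wise adds and
subtracts, which is where the instruction count comes from; a fixed point
version would presumably use the halving variants (vhadd, vhsub) in some
of these places to keep the values in range.

```c
#include <stddef.h>

/* Stand-in for one 128-bit register holding four 32-bit floats.
   Hypothetical type, for illustration only. */
typedef struct { float lane[4]; } vec4;

/* Lane-wise add, playing the role of a single vadd instruction. */
vec4 vadd(vec4 a, vec4 b)
{
    vec4 r;
    for (size_t i = 0; i < 4; i++)
        r.lane[i] = a.lane[i] + b.lane[i];
    return r;
}

/* Lane-wise subtract, playing the role of a single vsub instruction. */
vec4 vsub(vec4 a, vec4 b)
{
    vec4 r;
    for (size_t i = 0; i < 4; i++)
        r.lane[i] = a.lane[i] - b.lane[i];
    return r;
}

/* Four independent forward fft4s, one per lane, computed in place.
   re[k] and im[k] hold the real and imaginary parts of element k
   of all four transforms. */
void fft4x4(vec4 re[4], vec4 im[4])
{
    /* First butterfly stage: 8 operations. */
    vec4 ar = vadd(re[0], re[2]), ai = vadd(im[0], im[2]);
    vec4 br = vsub(re[0], re[2]), bi = vsub(im[0], im[2]);
    vec4 cr = vadd(re[1], re[3]), ci = vadd(im[1], im[3]);
    vec4 dr = vsub(re[1], re[3]), di = vsub(im[1], im[3]);

    /* Second stage: 8 more operations. Multiplying d by -i or +i only
       swaps its real and imaginary parts and flips one sign, so it is
       absorbed into the choice between add and subtract. */
    re[0] = vadd(ar, cr);  im[0] = vadd(ai, ci);   /* X0 = a + c   */
    re[2] = vsub(ar, cr);  im[2] = vsub(ai, ci);   /* X2 = a - c   */
    re[1] = vadd(br, di);  im[1] = vsub(bi, dr);   /* X1 = b - i*d */
    re[3] = vsub(br, di);  im[3] = vadd(bi, dr);   /* X3 = b + i*d */
}
```

To use it, fill lane j of re[0..3] and im[0..3] with the four complex
inputs of transform j and call fft4x4(re, im); the outputs come back in
the same layout. Keeping real and imaginary parts in separate registers
is what lets the multiplications by +i and -i disappear, since no lane
shuffling is needed.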
