Here are some extra implementations that extend Christophe's work. Differences with v1:
* AVX/FMA3: Removed the main loop and related bookkeeping for x64, since said
  loop would be run only once anyway.
* FMA3: Replaced mulps+subps with FMA3 instructions, meaning two fewer
  instructions run per loop in that version.
* Removed some unnecessary preprocessor guards and added some missing ones.

Knowing that AMD currently has lackluster performance with ymm registers, I
could add an FMA4 version of this function using xmm registers, which would
benefit said processors, unlike the AVX/FMA3 ymm ones.
Thoughts?

James Almer (3):
  x86/synth_filter: add synth_filter_sse
  x86/synth_filter: add synth_filter_avx
  x86/synth_filter: add synth_filter_fma3

 libavcodec/x86/dcadsp.asm    | 138 ++++++++++++++++++++++++++++++++-----------
 libavcodec/x86/dcadsp_init.c |  55 +++++++++++------
 2 files changed, 143 insertions(+), 50 deletions(-)

--
1.8.3.2
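
For readers unfamiliar with the FMA3 change mentioned in the list of differences, here is a rough scalar sketch in C of what replacing a multiply followed by a subtract with one fused negative multiply-add amounts to. It is not taken from the patches: the function names are made up for illustration, the loops stand in for packed SIMD lanes, and the double-precision intermediate is only a portable stand-in for the single rounding that a real fused instruction (for example vfnmadd231ps) performs.

```c
#include <stddef.h>

/* Hypothetical helper: two-step form, modelling mulps followed by subps.
 * The product is rounded to float before the subtraction. */
static void mul_then_sub(float *acc, const float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        float prod = a[i] * b[i];  /* step 1: multiply (mulps) */
        acc[i] = acc[i] - prod;    /* step 2: subtract (subps) */
    }
}

/* Hypothetical helper: fused form, acc = -(a * b) + acc in one step.
 * The double intermediate only approximates the single rounding of a
 * hardware FMA; it is used here so the sketch builds without any
 * special compiler flags or CPU features. */
static void fused_neg_mul_add(float *acc, const float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        acc[i] = (float)((double)acc[i] - (double)a[i] * (double)b[i]);
}
```

Both helpers give the same result whenever the product is exactly representable as a float; the fused form simply does in one operation what the other does in two.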
