pierluigilenoci wrote: @RKSimon Thank you for testing on actual hardware — you're right, the test values are wrong. My VDBPSADBW algorithm implementation is incorrect.
After reviewing the GCC reference implementation (`gcc/testsuite/gcc.target/i386/avx512bw-vdbpsadbw-2.c`), I can see the algorithm has two distinct phases:

1. **Shuffle phase**: Uses all four 2-bit fields of imm8 to shuffle src2 into a temp buffer (my code only used bits[1:0] and bits[3:2])
2. **SAD phase**: Uses a sliding/overlapping comparison pattern, not simple aligned block-vs-block SAD

I'll rework the implementation to match the correct algorithm and update all test values. Sorry for the incorrect numbers; I should have verified against hardware or the reference implementation before pushing.

I'll also incorporate @tbaederr's suggestions (which I believe are already applied in the latest push). Will update the PR shortly.

https://github.com/llvm/llvm-project/pull/188887
_______________________________________________
cfe-commits mailing list
[email protected]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
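
For reference, the two phases described above can be sketched in scalar C for a single 128-bit lane. This is a minimal sketch, assuming the lane layout given in Intel's published pseudocode for VDBPSADBW; the helper name `dbpsadbw_lane` is hypothetical and is not taken from the PR or from the GCC test. Its output should be checked against hardware or the reference implementation before it is used to derive test values.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: scalar model of VDBPSADBW for one 128-bit lane.
 * Assumption: layout follows Intel's published pseudocode; verify against
 * hardware before relying on it. */
static void dbpsadbw_lane(const uint8_t src1[16], const uint8_t src2[16],
                          uint8_t imm8, uint16_t dst[8]) {
    uint8_t tmp[16];

    /* Shuffle phase: each of the four 2-bit fields of imm8 selects one
     * 32-bit dword of src2 for the corresponding dword of tmp. */
    for (int j = 0; j < 4; ++j) {
        int sel = (imm8 >> (2 * j)) & 3;
        for (int b = 0; b < 4; ++b)
            tmp[4 * j + b] = src2[4 * sel + b];
    }

    /* SAD phase: within each 64-bit half, output word w compares a fixed
     * 4-byte block of src1 (offset 0 for w = 0,1; offset 4 for w = 2,3)
     * against a 4-byte window of tmp that starts at byte offset w. */
    for (int half = 0; half < 2; ++half) {
        int base = 8 * half;
        for (int w = 0; w < 4; ++w) {
            int s1 = base + 4 * (w / 2); /* aligned src1 block */
            int t = base + w;            /* sliding tmp window */
            unsigned sum = 0;
            for (int k = 0; k < 4; ++k)
                sum += (unsigned)abs((int)src1[s1 + k] - (int)tmp[t + k]);
            dst[4 * half + w] = (uint16_t)sum;
        }
    }
}
```

With `src1` all zeros, `src2[i] = i`, and `imm8 = 0xE4` (the identity shuffle), the sliding window is visible directly: the first four output words are the sums of `src2[0..3]`, `src2[1..4]`, `src2[2..5]`, and `src2[3..6]`.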
