pierluigilenoci wrote:

@RKSimon Thank you for testing on actual hardware — you're right, the test 
values are wrong. My VDBPSADBW algorithm implementation is incorrect.

After reviewing the GCC reference implementation 
(`gcc/testsuite/gcc.target/i386/avx512bw-vdbpsadbw-2.c`), I can see the 
algorithm has two distinct phases:

1. **Shuffle phase**: Uses all four 2-bit fields of imm8 to shuffle the dwords 
of src2, within each 128-bit lane, into a temp buffer (my code only used 
bits[1:0] and bits[3:2])
2. **SAD phase**: Uses a sliding/overlapping comparison pattern, not simple 
aligned block-vs-block SAD
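
As a rough sketch of how I read those two phases, here is a scalar model of a 
single 128-bit lane. The helper name `dbpsadbw_lane` is hypothetical (it is not 
from the PR or the GCC test), and the byte offsets reflect my understanding of 
the reference loop, so they should still be checked against hardware:

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: scalar model of one 128-bit lane of VDBPSADBW.
 * src1/src2 are 16 unsigned bytes each, dst receives 8 unsigned words. */
static void dbpsadbw_lane(uint16_t dst[8], const uint8_t src1[16],
                          const uint8_t src2[16], uint8_t imm8)
{
    uint8_t tmp[16];

    /* Shuffle phase: 2-bit field j of imm8 picks which dword of src2
     * lands in dword j of tmp. All four fields are used. */
    for (int j = 0; j < 4; ++j) {
        int sel = (imm8 >> (2 * j)) & 3;
        for (int k = 0; k < 4; ++k)
            tmp[4 * j + k] = src2[4 * sel + k];
    }

    /* SAD phase: each 64-bit half yields four word results. The src1
     * quadruplet is fixed per pair of results, while the tmp window
     * slides by one byte per result (offsets 0, 1, 2, 3). */
    for (int q = 0; q < 2; ++q) {
        int b = 8 * q;              /* byte offset of this 64-bit half */
        uint16_t *d = dst + 4 * q;  /* four word results for this half */
        d[0] = d[1] = d[2] = d[3] = 0;
        for (int j = 0; j < 4; ++j) {
            d[0] += (uint16_t)abs(src1[b + j]     - tmp[b + j]);
            d[1] += (uint16_t)abs(src1[b + j]     - tmp[b + j + 1]);
            d[2] += (uint16_t)abs(src1[b + j + 4] - tmp[b + j + 2]);
            d[3] += (uint16_t)abs(src1[b + j + 4] - tmp[b + j + 3]);
        }
    }
}
```

With `imm8 = 0xE4` the shuffle is the identity, which makes it easy to check 
the sliding windows in isolation. The wider vector forms would repeat this per 
128-bit lane.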

I'll rework the implementation to match the correct algorithm and update all 
test values. Sorry for the incorrect numbers — I should have verified against 
hardware or the reference implementation before pushing.

I'll also incorporate @tbaederr's suggestions (which I believe are already 
applied in the latest push).

Will update the PR shortly.

https://github.com/llvm/llvm-project/pull/188887