pierluigilenoci wrote:

I've pushed a complete rewrite of the VDBPSADBW constexpr algorithm. The 
previous implementation was fundamentally wrong — it only used 2 of the 4 imm8 
bit fields and had an incorrect SAD computation structure.

**Root cause**: The old code extracted just `BlockOffsetA = (imm & 0x3) * 4` 
and `BlockOffsetB = ((imm >> 2) & 0x3) * 4`, then compared fixed blocks from 
src1 against sliding windows in src2. The real instruction does the opposite: 
it shuffles src2 using all four 2-bit fields, then computes sliding SADs 
between src1 and the shuffled result.

**Correct algorithm** (verified against GCC's reference in 
`gcc/testsuite/gcc.target/i386/avx512bw-vdbpsadbw-2.c`):

**Phase 1 — Shuffle src2**: Within each 128-bit lane, for each group j (0..3), 
the 2-bit field `(imm >> (2*j)) & 3` selects which 4-byte block of src2 to 
place at position `4*j` in the temporary array.
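
To illustrate Phase 1 in isolation, here is a minimal scalar sketch of the shuffle for a single 128-bit lane. The names `Bytes16` and `shuffle_lane` are hypothetical and chosen for illustration only; this is not the code in `ExprConstant.cpp` or `InterpBuiltin.cpp`.

```cpp
#include <array>
#include <cstdint>

using Bytes16 = std::array<std::uint8_t, 16>;

// Hypothetical helper: shuffle one 128-bit lane of src2 by imm8.
// Each 2-bit field of imm selects the 4-byte source block that is
// copied to destination block j.
Bytes16 shuffle_lane(const Bytes16& src2, unsigned imm) {
    Bytes16 tmp{};
    for (int j = 0; j < 4; ++j) {
        unsigned sel = (imm >> (2 * j)) & 3;   // which source block
        for (int k = 0; k < 4; ++k)
            tmp[4 * j + k] = src2[4 * sel + k];
    }
    return tmp;
}
```

With `src2 = [1..16]` and `imm = 4`, the fields are 0, 1, 0, 0, so the sketch yields `[1,2,3,4, 5,6,7,8, 1,2,3,4, 1,2,3,4]`.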

**Phase 2 — Sliding SAD**: For every group of 4 output u16 values at index i 
(stepping by 4), where `2i` is the byte offset of that group in src1 and in the
shuffled array `tmp`:
```
dst[i]   = Σ|src1[2i+j]   - tmp[2i+j]  |  for j=0..3
dst[i+1] = Σ|src1[2i+j]   - tmp[2i+j+1]|  for j=0..3
dst[i+2] = Σ|src1[2i+j+4] - tmp[2i+j+2]|  for j=0..3
dst[i+3] = Σ|src1[2i+j+4] - tmp[2i+j+3]|  for j=0..3
```
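
The four formulas above can be sketched as a scalar loop over one 128-bit lane. This is only a model of the formulas under the assumption that `tmp` already holds the shuffled src2 bytes; `sliding_sad_lane`, `Bytes16`, and `Words8` are hypothetical names, not identifiers from the patch.

```cpp
#include <array>
#include <cstdint>
#include <cstdlib>

using Bytes16 = std::array<std::uint8_t, 16>;
using Words8  = std::array<std::uint16_t, 8>;

// Hypothetical helper: sliding SAD for one 128-bit lane.
// tmp is assumed to be the already-shuffled src2 lane.
Words8 sliding_sad_lane(const Bytes16& src1, const Bytes16& tmp) {
    Words8 dst{};
    for (int i = 0; i < 8; i += 4) {        // two groups of 4 outputs per lane
        for (int j = 0; j < 4; ++j) {
            // uint8_t operands promote to int, so the subtraction cannot wrap.
            dst[i]     += std::abs(src1[2 * i + j]     - tmp[2 * i + j]);
            dst[i + 1] += std::abs(src1[2 * i + j]     - tmp[2 * i + j + 1]);
            dst[i + 2] += std::abs(src1[2 * i + j + 4] - tmp[2 * i + j + 2]);
            dst[i + 3] += std::abs(src1[2 * i + j + 4] - tmp[2 * i + j + 3]);
        }
    }
    return dst;
}
```

For `src1 = [0..15]` and `tmp = [1,2,3,4, 5,6,7,8, 1,2,3,4, 1,2,3,4]` (the shuffle of `[1..16]` under `imm = 4`), this returns `[4, 8, 4, 0, 28, 28, 44, 44]`.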

**Verification**: `_mm_dbsad_epu8([0..15], [1..16], 4)` now produces `[4, 8, 4, 
0, 28, 28, 44, 44]`, matching the hardware output from your earlier test.

Both `ExprConstant.cpp` and `InterpBuiltin.cpp` are rewritten with the same 
corrected algorithm, and all `TEST_CONSTEXPR` expected values have been 
recomputed.

https://github.com/llvm/llvm-project/pull/188887
_______________________________________________
cfe-commits mailing list
[email protected]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
