| Field | Value |
| --- | --- |
| Issue | 68117 |
| Summary | Regression: Infinite recursion in x86-64 AVX-512 shuffle optimization |
| Labels | backend:X86, clang:codegen, regression, crash-on-valid |
| Assignees | |
| Reporter | bjacob |
# Summary
A minimized, valid C testcase using AVX-512 intrinsics crashes Clang (infinite recursion) at optimization levels `-O1` and above.
# Regression window:
(See Compiler Explorer experiment: https://godbolt.org/z/zbGoT3Gc9).
- Crashes in Clang >= 16 (including current trunk).
- Worked in Clang <= 15.
# Minimized testcase
Compiler Explorer link:
https://godbolt.org/z/zbGoT3Gc9
```c
#include <immintrin.h>
#include <stdint.h>
static __m512bh bitcast_16xf32_to_32xbf16(__m512 a) {
  return *(const __m512bh *)(&a);
}
static __m512bh load_32xbf16(const uint16_t *src) {
  return bitcast_16xf32_to_32xbf16(_mm512_loadu_ps((const float *)src));
}
static __m512bh broadcast_load_2xbf16(const uint16_t *src) {
  return bitcast_16xf32_to_32xbf16(_mm512_set1_ps(*(const float *)src));
}
void dotprod_16x2xbf16_times_broadcasted_2xbf16_into_16xf32(
    float *out_ptr, const uint16_t *lhs_ptr, const uint16_t *rhs_ptr) {
  __m512 acc = _mm512_loadu_ps(out_ptr);
  __m512bh rhs = load_32xbf16(rhs_ptr);
  rhs_ptr += 32;
  acc =
      _mm512_dpbf16_ps(acc, rhs, broadcast_load_2xbf16(lhs_ptr));
  lhs_ptr += 2;
  _mm512_storeu_ps(out_ptr, acc);
}
```
Compile with these flags:
```
clang -O3 -mavx -mavx2 -mfma -mf16c -mavx512f -mavx512vl -mavx512cd -mavx512bw -mavx512dq -mavx512bf16
```
Note: the crash already reproduces with `-emit-llvm`, so it's not specific to object-code generation.
# Explanation of the testcase
AVX-512-BF16 adds the `vdpbf16ps` instruction, which has a variant where the RHS operand is an `m32bcst`: a 32-bit memory operand that the instruction broadcasts across all 32-bit lanes. With intrinsics, this form is obtained by passing the result of a broadcast as that operand, which is the intent of this C code. It's a very typical pattern, and it works perfectly with Clang <= 15, as demonstrated by the Compiler Explorer link above (https://godbolt.org/z/zbGoT3Gc9).
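For readers unfamiliar with the instruction, here is a rough scalar sketch of what the broadcast form is meant to compute. The helper names `bf16_to_f32` and `dpbf16_broadcast_ref` are made up for illustration and are not part of any intrinsics header. The sketch assumes bf16 values are stored as raw 16-bit patterns and does not model the rounding or denormal handling of the real instruction.

```c
#include <stdint.h>
#include <string.h>

/* Widen one bf16 value (given as its raw 16-bit pattern) to float.
   bf16 is the upper 16 bits of an IEEE-754 binary32 value. */
static float bf16_to_f32(uint16_t bits) {
  uint32_t wide = (uint32_t)bits << 16;
  float f;
  memcpy(&f, &wide, sizeof f);
  return f;
}

/* Hypothetical scalar model of the broadcast form: each of the 16 float
   lanes accumulates the dot product of its own bf16 pair with the single
   shared (broadcast) bf16 pair. Rounding and denormal behaviour of the
   hardware instruction are NOT modelled here. */
static void dpbf16_broadcast_ref(float acc[16], const uint16_t pairs[32],
                                 const uint16_t bcast[2]) {
  const float b0 = bf16_to_f32(bcast[0]);
  const float b1 = bf16_to_f32(bcast[1]);
  for (int lane = 0; lane < 16; ++lane) {
    acc[lane] += bf16_to_f32(pairs[2 * lane]) * b0;
    acc[lane] += bf16_to_f32(pairs[2 * lane + 1]) * b1;
  }
}
```

In terms of the testcase, `acc` corresponds to the 16 floats at `out_ptr`, `pairs` to the 32 bf16 values loaded by `load_32xbf16`, and `bcast` to the two bf16 values that `broadcast_load_2xbf16` replicates across all lanes.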
# Backtrace in LLDB shows infinite recursion in codegen:
```
(lldb) bt 100
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=2, address=0x16f5ffff8)
* frame #0: 0x000000010011d834 clang-18`llvm::TargetLoweringBase::getValueType(this=0x00000001281183c0, DL=0x000000011d4312e0, Ty=<unavailable>, AllowUnknown=<unavailable>) const at TargetLowering.h:1567
frame #1: 0x000000010055f35c clang-18`llvm::BasicTTIImplBase<llvm::X86TTIImpl>::getTypeLegalizationCost(this=0x000000011d2257d8, Ty=0x000000011dc73d28) const at BasicTTIImpl.h:822:25
frame #2: 0x000000010056ecfc clang-18`llvm::X86TTIImpl::getVectorInstrCost(this=0x000000011d2257d8, Opcode=62, Val=0x000000011dc73d28, CostKind=TCK_RecipThroughput, Index=0, Op0=0x0000000000000000, Op1=0x0000000000000000) at X86TargetTransformInfo.cpp:4381:42
frame #3: 0x0000000100580c18 clang-18`llvm::BasicTTIImplBase<llvm::X86TTIImpl>::getPermuteShuffleOverhead(this=0x000000011d2257d8, VTy=0x000000011dc73d28, CostKind=TCK_RecipThroughput) at BasicTTIImpl.h:117:24
frame #4: 0x0000000100568710 clang-18`llvm::BasicTTIImplBase<llvm::X86TTIImpl>::getShuffleCost(this=0x000000011d2257d8, Kind=SK_PermuteTwoSrc, Tp=0x000000011dc73d28, Mask=ArrayRef<int> @ 0x000000016f600610, CostKind=TCK_RecipThroughput, Index=0, SubTp=0x000000011dc73d28, Args=ArrayRef<const llvm::Value *> @ 0x000000016f600600) at BasicTTIImpl.h:995:16
frame #5: 0x0000000100567e7c clang-18`llvm::X86TTIImpl::getShuffleCost(this=0x000000011d2257d8, Kind=SK_PermuteTwoSrc, BaseTp=0x000000011dc73d28, Mask=ArrayRef<int> @ 0x000000016f600e30, CostKind=TCK_RecipThroughput, Index=0, SubTp=0x000000011dc73d28, Args=ArrayRef<const llvm::Value *> @ 0x000000016f600e20) at X86TargetTransformInfo.cpp:2080:17
frame #6: 0x000000010056f2a0 clang-18`llvm::X86TTIImpl::getVectorInstrCost(this=0x000000011d2257d8, Opcode=62, Val=0x000000011dc73d28, CostKind=TCK_RecipThroughput, Index=1, Op0=0x0000000000000000, Op1=0x0000000000000000) at X86TargetTransformInfo.cpp:4465:21
frame #7: 0x0000000100580c18 clang-18`llvm::BasicTTIImplBase<llvm::X86TTIImpl>::getPermuteShuffleOverhead(this=0x000000011d2257d8, VTy=0x000000011dc73d28, CostKind=TCK_RecipThroughput) at BasicTTIImpl.h:117:24
frame #8: 0x0000000100568710 clang-18`llvm::BasicTTIImplBase<llvm::X86TTIImpl>::getShuffleCost(this=0x000000011d2257d8, Kind=SK_PermuteTwoSrc, Tp=0x000000011dc73d28, Mask=ArrayRef<int> @ 0x000000016f6013d0, CostKind=TCK_RecipThroughput, Index=0, SubTp=0x000000011dc73d28, Args=ArrayRef<const llvm::Value *> @ 0x000000016f6013c0) at BasicTTIImpl.h:995:16
frame #9: 0x0000000100567e7c clang-18`llvm::X86TTIImpl::getShuffleCost(this=0x000000011d2257d8, Kind=SK_PermuteTwoSrc, BaseTp=0x000000011dc73d28, Mask=ArrayRef<int> @ 0x000000016f601bf0, CostKind=TCK_RecipThroughput, Index=0, SubTp=0x000000011dc73d28, Args=ArrayRef<const llvm::Value *> @ 0x000000016f601be0) at X86TargetTransformInfo.cpp:2080:17
frame #10: 0x000000010056f2a0 clang-18`llvm::X86TTIImpl::getVectorInstrCost(this=0x000000011d2257d8, Opcode=62, Val=0x000000011dc73d28, CostKind=TCK_RecipThroughput, Index=1, Op0=0x0000000000000000, Op1=0x0000000000000000) at X86TargetTransformInfo.cpp:4465:21
frame #11: 0x0000000100580c18 clang-18`llvm::BasicTTIImplBase<llvm::X86TTIImpl>::getPermuteShuffleOverhead(this=0x000000011d2257d8, VTy=0x000000011dc73d28, CostKind=TCK_RecipThroughput) at BasicTTIImpl.h:117:24
frame #12: 0x0000000100568710 clang-18`llvm::BasicTTIImplBase<llvm::X86TTIImpl>::getShuffleCost(this=0x000000011d2257d8, Kind=SK_PermuteTwoSrc, Tp=0x000000011dc73d28, Mask=ArrayRef<int> @ 0x000000016f602190, CostKind=TCK_RecipThroughput, Index=0, SubTp=0x000000011dc73d28, Args=ArrayRef<const llvm::Value *> @ 0x000000016f602180) at BasicTTIImpl.h:995:16
```