https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101472
Bug ID: 101472
Summary: AVX-512 wrong code for consecutive masked scatters
Product: gcc
Version: 11.1.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: dlustig at nvidia dot com
Target Milestone: ---

Created attachment 51162
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=51162&action=edit
Test case

$ cat two_scatters.c
#include <immintrin.h>

void two_scatters(void* base_addr, __mmask8 k1, __mmask8 k2, __m512i vindex,
                  __m256i a) {
    _mm512_mask_i64scatter_epi32(base_addr, k1, vindex, a, 1);
    _mm512_mask_i64scatter_epi32(base_addr, k2, vindex, a, 1);
}

$ g++-11 -S -O3 -march=skylake-avx512 -Wall -Wextra -fno-strict-aliasing -fwrapv -fno-aggressive-loop-optimizations two_scatters.c -o -
...
_Z12two_scattersPvhhDv8_xDv4_x:
        kmovb   %edx, %k2
        vpscatterqd     %ymm1, (%rdi,%zmm0,1){%k2}
        ret
...

Only one vpscatterqd instruction is generated, even though I would expect two.

The optimizer seems to think the first store is redundant with the second due
to matching addresses, and hence optimizes it away. However, since two
different masks are being used, the scatters are not actually redundant.

Perturbing the example by passing two different base addresses, or inserting
an asm("nop"); in between, etc., will cause both scatters to get emitted:
https://godbolt.org/z/3b8v86on4

Stand-alone executable test case attached as well.

GCC version info:

$ gcc-11 -v
Using built-in specs.
COLLECT_GCC=gcc-11
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/11/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none:amdgcn-amdhsa
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 11.1.0-1ubuntu1~18.04.1' --with-bugurl=file:///usr/share/doc/gcc-11/README.Bugs --enable-languages=c,ada,c++,go,brig,d,fortran,objc,obj-c++,m2 --prefix=/usr --with-gcc-major-version-only --program-suffix=-11 --program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-plugin --enable-default-pie --with-system-zlib --enable-libphobos-checking=release --with-target-system-zlib=auto --enable-objc-gc=auto --enable-multiarch --disable-werror --disable-cet --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-offload-targets=nvptx-none=/build/gcc-11-YRKbe7/gcc-11-11.1.0/debian/tmp-nvptx/usr,amdgcn-amdhsa=/build/gcc-11-YRKbe7/gcc-11-11.1.0/debian/tmp-gcn/usr --without-cuda-driver --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 11.1.0 (Ubuntu 11.1.0-1ubuntu1~18.04.1)
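
To illustrate why the two scatters in the test case are not redundant, here is a minimal scalar sketch of the masked-scatter semantics in portable C. It is not the attached test case and does not use AVX-512. The names model_mask_i64scatter_epi32 and count_written_lanes are hypothetical helpers introduced only for this illustration, and the model assumes the documented per-lane behavior (lane i is stored at base + vindex[i] * scale only when bit i of the mask is set).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar model of _mm512_mask_i64scatter_epi32:
 * for each of the 8 lanes whose mask bit is set, store a[i] at
 * byte address base + vindex[i] * scale. Lanes with a clear mask
 * bit leave memory untouched. */
static void model_mask_i64scatter_epi32(void *base, uint8_t k,
                                        const int64_t vindex[8],
                                        const int32_t a[8], int scale)
{
    assert(base != NULL);
    for (int i = 0; i < 8; i++) {
        if ((k >> i) & 1) {
            memcpy((char *)base + vindex[i] * scale, &a[i], sizeof a[i]);
        }
    }
}

/* Runs the two scatters from the report against a small buffer and
 * returns how many 32-bit slots were written. If drop_first is nonzero,
 * the first scatter is skipped, mimicking the miscompiled output. */
static int count_written_lanes(uint8_t k1, uint8_t k2, int drop_first)
{
    int32_t buf[8];
    int64_t vindex[8];
    int32_t a[8];

    for (int i = 0; i < 8; i++) {
        buf[i] = -1;                           /* sentinel: slot not written */
        vindex[i] = (int64_t)i * (int64_t)sizeof(int32_t); /* byte offsets, scale 1 */
        a[i] = 100 + i;                        /* never equal to the sentinel */
    }

    if (!drop_first) {
        model_mask_i64scatter_epi32(buf, k1, vindex, a, 1);
    }
    model_mask_i64scatter_epi32(buf, k2, vindex, a, 1);

    int written = 0;
    for (int i = 0; i < 8; i++) {
        if (buf[i] != -1) {
            written++;
        }
    }
    return written;
}
```

Under this model, with k1 = 0x0F and k2 = 0xF0, running both scatters writes all eight slots, while dropping the first one writes only the upper four. With identical indices and data, the first scatter is only dead when every lane enabled by k1 is also enabled by k2, which the compiler cannot assume for two independent mask arguments.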