[Bug target/96906] Failure to optimize __builtin_ia32_psubusw128 compared to 0 to __builtin_ia32_pminuw128 compared to operand
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96906

Jakub Jelinek changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |FIXED
             Status|UNCONFIRMED                 |RESOLVED

--- Comment #10 from Jakub Jelinek ---
.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96906

--- Comment #9 from Hongtao.liu ---
Fixed in GCC 11.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96906

--- Comment #8 from CVS Commits ---
The master branch has been updated by hongtao Liu:

https://gcc.gnu.org/g:70310982492071f98eacdac0747521769b0f0328

commit r11-5697-g70310982492071f98eacdac0747521769b0f0328
Author: liuhongt
Date:   Mon Nov 30 13:27:16 2020 +0800

    Optimize vpsubusw compared to 0 into vpcmpleuw or vpcmpnleuw [PR96906]

    For signed comparisons, it handles cases that are eq or neq to 0.
    For unsigned comparisons, it additionally handles cases that are le
    or gt to 0 (equivalent to eq or neq to 0). Transform case eq to leu
    and case neq to gtu.

    I.e. for -mavx512bw -mavx512vl, transform the eq case code from

            vpsubusw        %xmm1, %xmm0, %xmm0
            vpxor   %xmm1, %xmm1, %xmm1
            vpcmpeqw        %xmm1, %xmm0, %k0
    to
            vpcmpleuw       %xmm1, %xmm0, %k0

    and, for -mavx512bw -mavx512vl, transform the neq case code from

            vpsubusw        %xmm1, %xmm0, %xmm0
            vpxor   %xmm1, %xmm1, %xmm1
            vpcmpneqw       %xmm1, %xmm0, %k0
    to
            vpcmpnleuw      %xmm1, %xmm0, %k0

    gcc/ChangeLog

            PR target/96906
            * config/i386/sse.md (_ucmp3): Add a new define_split
            after this insn.

    gcc/testsuite/ChangeLog

            * gcc.target/i386/avx512bw-pr96906-1.c: New test.
            * gcc.target/i386/pr96906-1.c: Add -mno-avx512f.
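The equivalences this commit message relies on can be checked with a scalar model of a single 16-bit lane. The following is only a sketch, not code from the patch: sat_sub_u16 and count_mismatches are hypothetical names, and the loop samples the input space rather than covering it exhaustively.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar model of one vpsubusw lane:
   unsigned subtraction that saturates at 0. */
static uint16_t sat_sub_u16(uint16_t a, uint16_t b)
{
    return a > b ? (uint16_t)(a - b) : 0;
}

/* Count sampled input pairs on which an original comparison and its
   rewritten form disagree. */
static unsigned count_mismatches(void)
{
    unsigned bad = 0;
    for (uint32_t a = 0; a <= UINT16_MAX; a += 257)
        for (uint32_t b = 0; b <= UINT16_MAX; b += 251) {
            uint16_t d = sat_sub_u16((uint16_t)a, (uint16_t)b);
            bad += (d == 0) != (a <= b);   /* eq  to 0 -> leu */
            bad += (d != 0) != (a > b);    /* neq to 0 -> gtu */
            bad += (d <= 0u) != (d == 0);  /* unsigned le 0 is eq 0  */
            bad += (d > 0u) != (d != 0);   /* unsigned gt 0 is neq 0 */
        }
    return bad;
}

/* Aborts if any sampled pair disagrees. */
static void self_check(void)
{
    assert(count_mismatches() == 0);
}
```

Calling self_check() from a test harness should pass silently. The saturating subtract yields 0 exactly when a <= b as unsigned values, which is why the subtract, the zeroing vpxor and the equality compare can collapse into one unsigned compare.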
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96906

--- Comment #7 from Hongtao.liu ---
(In reply to Jakub Jelinek from comment #6)
> Implemented now for non-AVX512*. Hongtao, do you think you could have a
> look at the avx512{bw,vl}/avx512bw splitter(s)?

Yes, I'll do it. Thanks for the patch.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96906

--- Comment #6 from Jakub Jelinek ---
Implemented now for non-AVX512*. Hongtao, do you think you could have a look at the avx512{bw,vl}/avx512bw splitter(s)?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96906

--- Comment #5 from CVS Commits ---
The master branch has been updated by Jakub Jelinek:

https://gcc.gnu.org/g:32b0abb24b8702ec9954448739682ace6fa5ccf5

commit r11-5398-g32b0abb24b8702ec9954448739682ace6fa5ccf5
Author: Jakub Jelinek
Date:   Thu Nov 26 08:44:15 2020 +0100

    i386: Optimize psubusw compared to 0 into pminuw compared to op0 [PR96906]

    The following patch renames VI12_AVX2 iterator to VI12_AVX2_AVX512BW
    for consistency with some other iterators, as I need VI12_AVX2 without
    AVX512BW for this change.
    The real meat is a combiner split which combine can use to optimize
    psubusw compared to 0 into pminuw compared to op0 (and similarly for
    psubusb compared to 0 into pminub compared to op0). According to Agner
    Fog's tables, psubus[bw] and pminu[bw] timings are the same, but the
    advantage of pminu[bw] is that the comparison doesn't need a zero
    operand, so e.g. for -msse4.1 it causes changes like

    -       psubusw %xmm1, %xmm0
    -       pxor    %xmm1, %xmm1
    +       pminuw  %xmm0, %xmm1
            pcmpeqw %xmm1, %xmm0

    and similarly for avx2:

    -       vpsubusb        %ymm1, %ymm0, %ymm0
    -       vpxor   %xmm1, %xmm1, %xmm1
    -       vpcmpeqb        %ymm1, %ymm0, %ymm0
    +       vpminub %ymm1, %ymm0, %ymm1
    +       vpcmpeqb        %ymm0, %ymm1, %ymm0

    I haven't done the AVX512{BW,VL} define_split(s), they'll need
    to match the UNSPEC_PCMP which are used for avx512 comparisons.

    2020-11-26  Jakub Jelinek

            PR target/96906
            * config/i386/sse.md (VI12_AVX2): Remove V64QI/V32HI modes.
            (VI12_AVX2_AVX512BW): New mode iterator.
            (_3, uavg3_ceil, _uavg3): Use VI12_AVX2_AVX512BW iterator
            instead of VI12_AVX2.
            (*_3): Likewise.
            (*_uavg3): Likewise.
            (*_3): Add a new define_split after this insn.

            * gcc.target/i386/pr96906-1.c: New test.
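Why the pminu[bw] form gives the same comparison result can be seen on a scalar model of one lane. This is a sketch for illustration only, not code from the patch: sat_sub_u8, min_u8 and forms_agree_everywhere are hypothetical names, and 8-bit lanes (the psubusb/pminub case) are used so the check can be exhaustive. The same argument applies unchanged to 16-bit lanes.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar model of one psubusb lane:
   unsigned subtraction that saturates at 0. */
static uint8_t sat_sub_u8(uint8_t a, uint8_t b)
{
    return a > b ? (uint8_t)(a - b) : 0;
}

/* Hypothetical scalar model of one pminub lane. */
static uint8_t min_u8(uint8_t a, uint8_t b)
{
    return a < b ? a : b;
}

/* Returns 1 if "(a -sat b) == 0" and "min(a, b) == a" agree for every
   pair of 8-bit inputs, 0 otherwise. */
static int forms_agree_everywhere(void)
{
    for (unsigned a = 0; a <= UINT8_MAX; a++)
        for (unsigned b = 0; b <= UINT8_MAX; b++) {
            int before = sat_sub_u8((uint8_t)a, (uint8_t)b) == 0;
            int after = min_u8((uint8_t)a, (uint8_t)b) == (uint8_t)a;
            if (before != after)
                return 0;
        }
    return 1;
}

/* Aborts if the two forms ever disagree. */
static void self_check(void)
{
    assert(forms_agree_everywhere());
}
```

Both conditions hold exactly when a <= b as unsigned values. The rewritten form compares the minimum against op0 itself, so no zeroed register is needed.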
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96906

Jakub Jelinek changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jakub at gcc dot gnu.org

--- Comment #4 from Jakub Jelinek ---
Created attachment 49621
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=49621&action=edit
gcc11-pr96906.patch

Looking over Agner Fog's table, pminu[bw] and psubus[bw] seem to have the same timing. This untested patch does the optimization in the combiner for SSE2/SSE4.1/AVX2, but handling AVX512BW and AVX512BW+AVX512VL will need further define_insn_and_split patterns that I don't have cycles for right now (they must match the unspec comparisons used there).
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96906

--- Comment #3 from Hongtao.liu ---
(gdb) f 6
#6  0x00f04a33 in expand_vect_cond_optab_fn (stmt=0x7fffea1193f0, optab=vcond_optab)
    at ../../../gcc/gnu-toolchain/master/gcc/internal-fn.c:2612
2612      expand_insn (icode, 6, ops);
(gdb) p icode
$22 = CODE_FOR_vcondv8hiv8hi

Shouldn't CODE_FOR_vconduv8hiv8hi be used? Then the sign info could be handled by an extra param in ix86_expand_int_vcond.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96906

--- Comment #2 from Hongtao.liu ---
(In reply to Hongtao.liu from comment #1)
> For vector compare to integer mask, gcc use UNSPEC, that's why it's not
> handled by general part, maybe peephole2 should be added for this.

Or in ix86_expand_int_vcond.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96906

Hongtao.liu changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |crazylht at gmail dot com

--- Comment #1 from Hongtao.liu ---
For vector compare to integer mask, GCC uses UNSPEC; that's why it's not handled by the general part. Maybe a peephole2 should be added for this.