[Bug target/104610] memcmp () == 0 can be optimized better for avx512f
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104610 --- Comment #23 from CVS Commits ---
The master branch has been updated by hongtao Liu :

https://gcc.gnu.org/g:8c40b72036c967fbb1d1150515cf70aec382f0a2

commit r14-5002-g8c40b72036c967fbb1d1150515cf70aec382f0a2
Author: liuhongt
Date:   Mon Oct 9 15:07:54 2023 +0800

    Improve memcmpeq for 512-bit vector with vpcmpeq + kortest.

    When the 2 vectors are equal, the kmask is all-ones and kortest sets CF;
    otherwise CF is cleared.  So the CF bit can be used to check the result
    of the comparison.

    Before:
            vmovdqu (%rsi), %ymm0
            vpxorq  (%rdi), %ymm0, %ymm0
            vptest  %ymm0, %ymm0
            jne     .L2
            vmovdqu 32(%rsi), %ymm0
            vpxorq  32(%rdi), %ymm0, %ymm0
            vptest  %ymm0, %ymm0
            je      .L5
    .L2:
            movl    $1, %eax
            xorl    $1, %eax
            vzeroupper
            ret

    After:
            vmovdqu64 (%rsi), %zmm0
            xorl    %eax, %eax
            vpcmpeqd (%rdi), %zmm0, %k0
            kortestw %k0, %k0
            setc    %al
            vzeroupper
            ret

    gcc/ChangeLog:

            PR target/104610
            * config/i386/i386-expand.cc (ix86_expand_branch): Handle
            512-bit vector with vpcmpeq + kortest.
            * config/i386/i386.md (cbranchxi4): New expander.
            * config/i386/sse.md (cbranch<mode>4): Extend to V16SImode
            and V8DImode.

    gcc/testsuite/ChangeLog:

            * gcc.target/i386/pr104610-2.c: New test.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104610 --- Comment #22 from Hongtao.liu ---
For a 64-byte memory comparison

int compare (const char* s1, const char* s2)
{
  return __builtin_memcmp (s1, s2, 64) == 0;
}

we're generating

        vmovdqu (%rsi), %ymm0
        vpxorq  (%rdi), %ymm0, %ymm0
        vptest  %ymm0, %ymm0
        jne     .L2
        vmovdqu 32(%rsi), %ymm0
        vpxorq  32(%rdi), %ymm0, %ymm0
        vptest  %ymm0, %ymm0
        je      .L5
.L2:
        movl    $1, %eax
        xorl    $1, %eax
        vzeroupper
        ret

An alternative is using vpcmpeq + kortest and checking the carry bit:

        vmovdqu64 (%rsi), %zmm0
        xorl    %eax, %eax
        vpcmpeqd (%rdi), %zmm0, %k0
        kortestw %k0, %k0
        setc    %al
        vzeroupper

Not sure if it's better or not.
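As a cross-check of the proposed sequence (a scalar sketch, not the compiler's expansion; compare64_model is a hypothetical name): vpcmpeqd produces a 16-bit k-mask with one bit per equal dword of the two 64-byte blocks, and kortestw sets CF exactly when that mask is all-ones, which is what setc then materializes.

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of "vpcmpeqd (%rdi), %zmm0, %k0; kortestw %k0, %k0; setc %al":
   build a 16-bit mask with bit i set when dword i of the two 64-byte blocks
   is equal, then report whether the mask is all-ones (the CF condition).  */
static int compare64_model(const char *s1, const char *s2)
{
    uint32_t a, b;
    unsigned mask = 0;
    for (int i = 0; i < 16; i++) {      /* 16 dwords == 64 bytes */
        memcpy(&a, s1 + 4 * i, 4);
        memcpy(&b, s2 + 4 * i, 4);
        mask |= (a == b) << i;          /* one k-mask bit per dword */
    }
    return mask == 0xFFFFu;             /* kortest sets CF iff all-ones */
}
```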
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104610 --- Comment #21 from Uroš Bizjak ---
Just before the patch from comment #20, the compiler creates (-O2 -mavx):

--cut here--
        vmovdqa .LC1(%rip), %xmm0
        vmovdqa %xmm0, -24(%rsp)
        vmovdqu (%rdi), %xmm0
        vpxor   .LC0(%rip), %xmm0, %xmm0
        vptest  %xmm0, %xmm0
        je      .L5
.L2:
        movl    $1, %eax
        testl   %eax, %eax
        sete    %al
        ret
.L5:
        vmovdqu 16(%rdi), %xmm0
        vpxor   -24(%rsp), %xmm0, %xmm0
        vptest  %xmm0, %xmm0
        jne     .L2
        xorl    %eax, %eax
        testl   %eax, %eax
        sete    %al
        ret
--cut here--

Please note the creative way of returning 0 and 1:

        movl    $1, %eax
        testl   %eax, %eax
        sete    %al
        ret

Even the new code (from comment #20) is unnecessarily convoluted:

.L2:
        movl    $1, %eax
        xorl    $1, %eax
        ret
.L5:
        xorl    %eax, %eax
        xorl    $1, %eax
        ret
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104610 --- Comment #20 from CVS Commits ---
The master branch has been updated by Roger Sayle :

https://gcc.gnu.org/g:4afbebcdc5780d28e52b7d65643e462c7c3882ce

commit r14-2159-g4afbebcdc5780d28e52b7d65643e462c7c3882ce
Author: Roger Sayle
Date:   Wed Jun 28 11:11:34 2023 +0100

    i386: Add cbranchti4 pattern to i386.md (for -m32 compare_by_pieces).

    This patch fixes some very odd (unanticipated) code generation by
    compare_by_pieces with -m32 -mavx, since the recent addition of the
    cbranchoi4 pattern.  The issue is that cbranchoi4 is available with
    TARGET_AVX, but cbranchti4 is currently conditional on TARGET_64BIT,
    which results in the odd behaviour (thanks to OPTAB_WIDEN) that with
    -m32 -mavx, compare_by_pieces ends up (inefficiently) widening 128-bit
    comparisons to 256 bits before performing PTEST.  This patch fixes
    this by providing a cbranchti4 pattern that's available with either
    TARGET_64BIT or TARGET_SSE4_1.

    For the test case below (again from PR 104610):

    int foo(char *a)
    {
      static const char t[] = "0123456789012345678901234567890";
      return __builtin_memcmp(a, &t[0], sizeof(t)) == 0;
    }

    GCC with -m32 -O2 -mavx currently produces the bonkers:

    foo:
            pushl   %ebp
            movl    %esp, %ebp
            andl    $-32, %esp
            subl    $64, %esp
            movl    8(%ebp), %eax
            vmovdqa .LC0, %xmm4
            movl    $0, 48(%esp)
            vmovdqu (%eax), %xmm2
            movl    $0, 52(%esp)
            movl    $0, 56(%esp)
            movl    $0, 60(%esp)
            movl    $0, 16(%esp)
            movl    $0, 20(%esp)
            movl    $0, 24(%esp)
            movl    $0, 28(%esp)
            vmovdqa %xmm2, 32(%esp)
            vmovdqa %xmm4, (%esp)
            vmovdqa (%esp), %ymm5
            vpxor   32(%esp), %ymm5, %ymm0
            vptest  %ymm0, %ymm0
            jne     .L2
            vmovdqu 16(%eax), %xmm7
            movl    $0, 48(%esp)
            movl    $0, 52(%esp)
            vmovdqa %xmm7, 32(%esp)
            vmovdqa .LC1, %xmm7
            movl    $0, 56(%esp)
            movl    $0, 60(%esp)
            movl    $0, 16(%esp)
            movl    $0, 20(%esp)
            movl    $0, 24(%esp)
            movl    $0, 28(%esp)
            vmovdqa %xmm7, (%esp)
            vmovdqa (%esp), %ymm1
            vpxor   32(%esp), %ymm1, %ymm0
            vptest  %ymm0, %ymm0
            je      .L6
    .L2:
            movl    $1, %eax
            xorl    $1, %eax
            vzeroupper
            leave
            ret
    .L6:
            xorl    %eax, %eax
            xorl    $1, %eax
            vzeroupper
            leave
            ret

    with this patch, we now generate the (slightly) more sensible:

    foo:
            vmovdqa .LC0, %xmm0
            movl    4(%esp), %eax
            vpxor   (%eax), %xmm0, %xmm0
            vptest  %xmm0, %xmm0
            jne     .L2
            vmovdqa .LC1, %xmm0
            vpxor   16(%eax), %xmm0, %xmm0
            vptest  %xmm0, %xmm0
            je      .L5
    .L2:
            movl    $1, %eax
            xorl    $1, %eax
            ret
    .L5:
            xorl    %eax, %eax
            xorl    $1, %eax
            ret

    2023-06-28  Roger Sayle

    gcc/ChangeLog
            * config/i386/i386-expand.cc (ix86_expand_branch): Also use
            ptest for TImode comparisons on 32-bit architectures.
            * config/i386/i386.md (cbranch<mode>4): Change from SDWIM to
            SWIM1248x to exclude/avoid TImode being conditional on -m64.
            (cbranchti4): New define_expand for TImode on both TARGET_64BIT
            and/or with TARGET_SSE4_1.
            * config/i386/predicates.md (ix86_timode_comparison_operator):
            New predicate that depends upon TARGET_64BIT.
            (ix86_timode_comparison_operand): Likewise.

    gcc/testsuite/ChangeLog
            * gcc.target/i386/pieces-memcmp-2.c: New test case.
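The ptest-based branch the patch enables for TImode can be modelled in portable C (a sketch under stated assumptions, not GCC's expansion; ptest_eq16_model is a hypothetical name): vpxor computes the 128-bit XOR of the two operands, and vptest sets ZF iff the result is all-zero, so the je/jne is simply a 128-bit equality test.

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of the TImode "vpxor a, b; vptest" idiom: ZF is set iff
   every bit of the 128-bit XOR is zero, i.e. the 16-byte blocks match.  */
static int ptest_eq16_model(const void *a, const void *b)
{
    uint64_t a0, a1, b0, b1;
    memcpy(&a0, a, 8);
    memcpy(&b0, b, 8);
    memcpy(&a1, (const char *)a + 8, 8);
    memcpy(&b1, (const char *)b + 8, 8);
    return ((a0 ^ b0) | (a1 ^ b1)) == 0;   /* the ZF condition of vptest */
}
```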
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104610 --- Comment #19 from Hongtao.liu ---
I'm wondering whether targetm.overlap_op_by_pieces_p would help here.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104610 --- Comment #18 from Hongtao.liu --- Fixed in GCC13.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104610 --- Comment #17 from CVS Commits ---
The master branch has been updated by hongtao Liu :

https://gcc.gnu.org/g:850a13d754497faae91afabc6958780f1d63a574

commit r13-580-g850a13d754497faae91afabc6958780f1d63a574
Author: liuhongt
Date:   Tue Mar 1 13:41:52 2022 +0800

    Expand __builtin_memcmp_eq with ptest for OImode.

    gcc/ChangeLog:

            PR target/104610
            * config/i386/i386-expand.cc (ix86_expand_branch): Use ptest
            for OImode when code is EQ or NE.
            * config/i386/i386.md (cbranchoi4): New expander.

    gcc/testsuite/ChangeLog:

            * gcc.target/i386/pr104610.c: New test.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104610

Hongtao.liu changed:

           What                |Removed |Added
 ---------------------------------------------
 Attachment #52495 is obsolete|0       |1

--- Comment #16 from Hongtao.liu ---
Created attachment 52692
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52692&action=edit
Patch pending for GCC 13
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104610 --- Comment #15 from Hongtao.liu ---
Could someone help mark this as blocking PR 105073?  The patch is ready and
waiting for GCC 13.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104610

Bug 104610 depends on bug 104704, which changed state.

Bug 104704 Summary: [12 Regression] ix86_gen_scratch_sse_rtx doesn't work with explicit XMM7/XMM15/XMM31 usage
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104704

           What    |Removed |Added
 ---------------------------------
 Status            |NEW     |RESOLVED
 Resolution        |---     |FIXED
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104610 --- Comment #14 from H.J. Lu ---
(In reply to H.J. Lu from comment #13)
> (In reply to Hongtao.liu from comment #8)
> > Created attachment 52495 [details]
> > untested patch.
>
> I see these regressions with -m32:
>
> FAIL: gcc.dg/lower-subreg-1.c scan-rtl-dump subreg1 "Splitting reg"
> FAIL: gcc.target/i386/iamcu/test_basic_64bit_returning.c execution, -O0
> FAIL: gcc.target/i386/iamcu/test_basic_64bit_returning.c execution, -O1
> FAIL: gcc.target/i386/iamcu/test_basic_64bit_returning.c execution, -O2
> FAIL: gcc.target/i386/iamcu/test_basic_64bit_returning.c execution, -O3 -g
> FAIL: gcc.target/i386/iamcu/test_basic_64bit_returning.c execution, -Os
> FAIL: gcc.target/i386/iamcu/test_struct_returning.c execution, -O0
> FAIL: gcc.target/i386/iamcu/test_struct_returning.c execution, -O1
> FAIL: gcc.target/i386/iamcu/test_struct_returning.c execution, -Og -g

-m64 regression:

FAIL: gcc.target/i386/pr82580.c scan-assembler-not \\mmovzb
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104610 --- Comment #13 from H.J. Lu ---
(In reply to Hongtao.liu from comment #8)
> Created attachment 52495 [details]
> untested patch.

I see these regressions with -m32:

FAIL: gcc.dg/lower-subreg-1.c scan-rtl-dump subreg1 "Splitting reg"
FAIL: gcc.target/i386/iamcu/test_basic_64bit_returning.c execution, -O0
FAIL: gcc.target/i386/iamcu/test_basic_64bit_returning.c execution, -O1
FAIL: gcc.target/i386/iamcu/test_basic_64bit_returning.c execution, -O2
FAIL: gcc.target/i386/iamcu/test_basic_64bit_returning.c execution, -O3 -g
FAIL: gcc.target/i386/iamcu/test_basic_64bit_returning.c execution, -Os
FAIL: gcc.target/i386/iamcu/test_struct_returning.c execution, -O0
FAIL: gcc.target/i386/iamcu/test_struct_returning.c execution, -O1
FAIL: gcc.target/i386/iamcu/test_struct_returning.c execution, -Og -g
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104610

H.J. Lu changed:

           What    |Removed |Added
 ---------------------------------
 Depends on        |        |104704

--- Comment #12 from H.J. Lu ---
(In reply to Hongtao.liu from comment #8)
> Created attachment 52495 [details]
> untested patch.
>
> With the patch, it exposes one potential issue related to dse (or
> ix86_gen_scratch_sse_rtx usage).  In dse1, it tries to replace a load insn
> with an equivalent value, but the inserted new insns (insn 45, insn 44,
> insn 46) will set xmm31, but dse is not aware of that, and xmm31 is alive
> and will be used by insn 10 which is exactly after the new added insns, and
> it breaks data flow.

I opened PR 104704.

Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104704
[Bug 104704] [12 Regression] ix86_gen_scratch_sse_rtx doesn't work with explicit XMM7/XMM15/XMM31 usage
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104610 --- Comment #11 from H.J. Lu --- Don't worry about vzeroupper. It's ok to have vzeroupper.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104610 --- Comment #10 from Hongtao.liu ---
(In reply to H.J. Lu from comment #9)
> ix86_gen_scratch_sse_rtx was added to prevent combine from changing
> store of vector registers with constant value to store of constant
> value. You can change ix86_gen_scratch_sse_rtx to return a pseudo
> register and watch the regressions in GCC testsuite. If we can fix
> these regressions, ix86_gen_scratch_sse_rtx isn't needed.

It regresses.  I'm thinking of adding a peephole2 to split the memory mov
into mov + movd + shufd, which can prevent the regressions for pr100865; for
vzeroupper, I don't have a good way to avoid those regressions.

gcc.target/i386/pr100865-11b.c scan-assembler-times vmovdqa64[\\t ]%xmm[0-9]+, 16
gcc.target/i386/pr100865-12b.c scan-assembler-times vmovdqa64[\\t ]%xmm[0-9]+, 16
gcc.target/i386/pr100865-8a.c scan-assembler-times (?:vpbroadcastd|vpshufd)[\\t ]+[^\n]*, %xmm[0-9]+ 1
gcc.target/i386/pr100865-8b.c scan-assembler-times vmovdqa64[\\t ]%xmm[0-9]+, 16
gcc.target/i386/pr100865-8c.c scan-assembler-times vpshufd[\\t ]+[^\n]*, %xmm[0-9]+ 1
gcc.target/i386/pr100865-9b.c scan-assembler-times vmovdqa64[\\t ]%xmm[0-9]+, 16
gcc.target/i386/pr100865-9c.c scan-assembler-times vpshufd[\\t ]+[^\n]*, %xmm[0-9]+ 1
gcc.target/i386/pr82941-1.c scan-assembler-not vzeroupper
gcc.target/i386/pr82942-1.c scan-assembler-not vzeroupper
gcc.target/i386/pr82990-1.c scan-assembler-not vzeroupper
gcc.target/i386/pr82990-3.c scan-assembler-not vzeroupper
gcc.target/i386/pr82990-5.c scan-assembler-not vzeroupper
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104610

H.J. Lu changed:

           What            |Removed     |Added
 ---------------------------------------------
 Last reconfirmed          |            |2022-02-23
 Status                    |UNCONFIRMED |NEW
 Ever confirmed            |0           |1

--- Comment #9 from H.J. Lu ---
ix86_gen_scratch_sse_rtx was added to prevent combine from changing a store
of a vector register with a constant value into a store of the constant
value.  You can change ix86_gen_scratch_sse_rtx to return a pseudo register
and watch the regressions in the GCC testsuite.  If we can fix these
regressions, ix86_gen_scratch_sse_rtx isn't needed.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104610 --- Comment #8 from Hongtao.liu ---
Created attachment 52495
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52495&action=edit
untested patch.

With the patch, it exposes one potential issue related to dse (or to the use
of ix86_gen_scratch_sse_rtx).  In dse1, it tries to replace a load insn with
an equivalent value, but the inserted new insns (insn 45, insn 44, insn 46)
set xmm31.  dse is not aware of that; xmm31 is live and will be used by
insn 10, which comes right after the new insns, so the data flow is broken.

And I think for the i386 part, maybe we shouldn't use
ix86_gen_scratch_sse_rtx in ix86_expand_vector_move, which is called by
emit_move_insn and used in many pre-reload passes; it may break the data
flow if there are other explicit hard registers in use.

dump before vs after dse:

+(insn 45 8 44 2 (set (reg:DI 91)
+        (const_int 4855531112742205610 [0x43624fd242db38aa])) "gcc/testsuite/gcc.target/i386/avx512f-typecast-1.c":48:8 80 {*movdi_internal}
+     (nil))
+(insn 44 45 46 2 (set (reg:V4DI 67 xmm31)
+        (vec_duplicate:V4DI (reg:DI 91))) "gcc/testsuite/gcc.target/i386/avx512f-typecast-1.c":48:8 7768 {*avx512vl_vec_dup_gprv4di}
+     (expr_list:REG_DEAD (reg:DI 91)
+        (nil)))
+(insn 46 44 10 2 (set (reg:OI 90)
+        (reg:OI 67 xmm31)) "gcc/testsuite/gcc.target/i386/avx512f-typecast-1.c":48:8 78 {*movoi_internal_avx}
+     (expr_list:REG_EQUAL (const_wide_int 0x43624fd242db38aa43624fd242db38aa43624fd242db38aa43624fd242db38aa)
+        (nil)))
 (insn 10 46 13 2 (set (mem/j/c:V16SF (plus:DI (reg/f:DI 19 frame)
                 (const_int -128 [0xff80])) [4 bd.x+0 S64 A512])
         (reg:V16SF 67 xmm31)) "gcc/testsuite/gcc.target/i386/avx512f-typecast-1.c":48:8 1707 {movv16sf_internal}
      (expr_list:REG_DEAD (reg:V16SF 67 xmm31)
         (nil)))
 (insn 13 10 14 2 (set (reg:OI 86 [ MEM [(void *)] ])
-        (mem/c:OI (plus:DI (reg/f:DI 19 frame)
-                (const_int -128 [0xff80])) [0 MEM [(void *)]+0 S32 A512])) "gcc/testsuite/gcc.target/i386/avx512f-typecast-1.c":49:7 78 {*movoi_internal_avx}
-     (nil))
+        (reg:OI 90)) "gcc/testsuite/gcc.target/i386/avx512f-typecast-1.c":49:7 78 {*movoi_internal_avx}
+     (expr_list:REG_DEAD (reg:OI 90)
+        (nil)))
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104610 --- Comment #7 from Hongtao.liu ---
(In reply to Hongtao.liu from comment #6)
> After supporting cbranchoi4, gcc generates
>
> _Z1fPc:
> .LFB0:
>         .cfi_startproc
>         vmovdqa .LC1(%rip), %ymm0
>         vpxor   (%rdi), %ymm0, %ymm0
>         vptest  %ymm0, %ymm0
>         sete    %al
>         vzeroupper
>
> which is optimal, as clang/llvm does.

Also extend cbranchti4 to ptest when TARGET_SSE4_1 and CODE == NE || CODE == EQ,
so gcc generates

        movdqu  (%rdi), %xmm0
        movdqa  .LC1(%rip), %xmm1
        pxor    %xmm1, %xmm0
        ptest   %xmm0, %xmm0
        sete    %al

for

bool f128(char *a)
{
  char t[] = "012345678901234";
  return __builtin_memcmp(a, &t[0], sizeof(t)) == 0;
}

The original codegen is

        movabsq $14692989455579448, %rax
        xorq    8(%rdi), %rax
        movabsq $3978425819141910832, %rdx
        xorq    (%rdi), %rdx
        orq     %rdx, %rax
        sete    %al
        ret
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104610 --- Comment #6 from Hongtao.liu ---
(In reply to Hongtao.liu from comment #5)
> But there's a cbranch_optab check in can_compare_p, and i386 supports
> V8SI/V4DI/V4SI/V2DI, but not OI/TI; adding support for them?
>
> (define_expand "cbranch<mode>4"
>   [(set (reg:CC FLAGS_REG)
>         (compare:CC (match_operand:VI48_AVX 1 "register_operand")
>                     (match_operand:VI48_AVX 2 "nonimmediate_operand")))
>    (set (pc) (if_then_else
>                (match_operator 0 "bt_comparison_operator"
>                  [(reg:CC FLAGS_REG) (const_int 0)])
>                (label_ref (match_operand 3))

After supporting cbranchoi4, gcc generates

_Z1fPc:
.LFB0:
        .cfi_startproc
        vmovdqa .LC1(%rip), %ymm0
        vpxor   (%rdi), %ymm0, %ymm0
        vptest  %ymm0, %ymm0
        sete    %al
        vzeroupper

which is optimal, as clang/llvm does.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104610 --- Comment #5 from Hongtao.liu ---
(In reply to Hongtao.liu from comment #4)
> There's no optab or target_hook for backend to participate in optimization

But there's a cbranch_optab check in can_compare_p, and i386 supports
V8SI/V4DI/V4SI/V2DI, but not OI/TI; adding support for them?

(define_expand "cbranch<mode>4"
  [(set (reg:CC FLAGS_REG)
        (compare:CC (match_operand:VI48_AVX 1 "register_operand")
                    (match_operand:VI48_AVX 2 "nonimmediate_operand")))
   (set (pc) (if_then_else
               (match_operator 0 "bt_comparison_operator"
                 [(reg:CC FLAGS_REG) (const_int 0)])
               (label_ref (match_operand 3))
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104610 --- Comment #4 from Hongtao.liu ---
(In reply to Hongtao.liu from comment #3)
> There's no optab or target_hook for backend to participate in optimization
> of Participation in optimization.

Typo: the last "optimization" should be "compare_by_pieces".
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104610 --- Comment #3 from Hongtao.liu ---
(In reply to Hongtao.liu from comment #2)
> in Gimple, there're
>
> _1 = __builtin_memcmp_eq (a_5(D), &t[0], 32);
> _2 = _1 == 0;
> _6 = (int) _2;
>
> So it's related to codegen optimization with vectorized codes for
> __builtin_memcmp_eq, guess we can start with size multiple of 16 bytes?

There's no optab or target_hook for backend to participate in optimization
of Participation in optimization.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104610 --- Comment #2 from Hongtao.liu ---
in Gimple, there're

_1 = __builtin_memcmp_eq (a_5(D), &t[0], 32);
_2 = _1 == 0;
_6 = (int) _2;

So it's related to codegen optimization with vectorized codes for
__builtin_memcmp_eq; guess we can start with sizes that are a multiple of
16 bytes?

Also, I saw that when the size is 9, llvm generates

f(char*):                               # @f(char*)
        movabs  rcx, 3979270244072042800
        xor     rcx, qword ptr [rdi]
        movzx   edx, byte ptr [rdi + 8]
        xor     eax, eax
        or      rdx, rcx
        setne   al
        ret

while gcc generates

f(char*):
        movabsq $3979270244072042800, %rax
        cmpq    %rax, (%rdi)
        je      .L5
.L2:
        movl    $1, %eax
        ret
.L5:
        cmpb    $0, 8(%rdi)
        jne     .L2
        xorl    %eax, %eax
        ret
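LLVM's branchless 9-byte sequence above amounts to: XOR the first 8 bytes against the expected word, OR in the trailing byte's difference, and materialize the flag once. A portable C sketch of the same idea, comparing two buffers rather than a buffer against a constant (eq9_model is a hypothetical name; in the constant case one operand folds into immediates as in the listing):

```c
#include <stdint.h>
#include <string.h>

/* Branchless 9-byte equality: one 8-byte word difference ORed with the
   trailing byte difference is zero iff all 9 bytes match.  */
static int eq9_model(const char *a, const char *b)
{
    uint64_t x, y;
    memcpy(&x, a, 8);                       /* unaligned 8-byte loads */
    memcpy(&y, b, 8);
    uint64_t tail = (unsigned char)a[8] ^ (unsigned char)b[8];
    return ((x ^ y) | tail) == 0;           /* single setcc, no branch */
}
```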
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104610 --- Comment #1 from Andrew Pinski ---
Note even without avx512f, LLVM does:

        movdqu  (%rdi), %xmm0
        movdqu  16(%rdi), %xmm1
        pcmpeqb .LCPI0_0(%rip), %xmm1
        pcmpeqb .LCPI0_1(%rip), %xmm0
        pand    %xmm1, %xmm0
        pmovmskb %xmm0, %eax
        cmpl    $65535, %eax            # imm = 0xFFFF
        sete    %al
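A scalar model of that SSE2 sequence (a sketch; eq32_model is a hypothetical name): each pcmpeqb yields a per-byte equality vector, pand intersects the two results, pmovmskb packs one bit per byte into a 16-bit mask, and cmpl $65535 checks that all 16 bits (hence all 32 bytes) matched. Intersecting the two masks after packing, as below, is equivalent to the listing's pand before pmovmskb.

```c
/* Model of pcmpeqb + pand + pmovmskb + cmpl $65535 over two 16-byte halves:
   build one equality bit per byte, AND the halves, require all-ones.  */
static int eq32_model(const char *a, const char *b)
{
    unsigned m0 = 0, m1 = 0;
    for (int i = 0; i < 16; i++) {
        m0 |= (a[i] == b[i]) << i;            /* pmovmskb of first compare  */
        m1 |= (a[16 + i] == b[16 + i]) << i;  /* pmovmskb of second compare */
    }
    return (m0 & m1) == 0xFFFF;               /* pand + cmpl $65535 + sete  */
}
```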