[Bug target/104610] memcmp () == 0 can be optimized better for avx512f

2023-10-29 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104610

--- Comment #23 from CVS Commits  ---
The master branch has been updated by hongtao Liu :

https://gcc.gnu.org/g:8c40b72036c967fbb1d1150515cf70aec382f0a2

commit r14-5002-g8c40b72036c967fbb1d1150515cf70aec382f0a2
Author: liuhongt 
Date:   Mon Oct 9 15:07:54 2023 +0800

Improve memcmpeq for 512-bit vector with vpcmpeq + kortest.

When 2 vectors are equal, kmask is allones and kortest will set CF,
else CF will be cleared.

So CF bit can be used to check for the result of the comparison.

Before:
vmovdqu (%rsi), %ymm0
vpxorq  (%rdi), %ymm0, %ymm0
vptest  %ymm0, %ymm0
jne .L2
vmovdqu 32(%rsi), %ymm0
vpxorq  32(%rdi), %ymm0, %ymm0
vptest  %ymm0, %ymm0
je  .L5
.L2:
movl    $1, %eax
xorl    $1, %eax
vzeroupper
ret

After:
vmovdqu64   (%rsi), %zmm0
xorl    %eax, %eax
vpcmpeqd    (%rdi), %zmm0, %k0
kortestw    %k0, %k0
setc    %al
vzeroupper
ret

gcc/ChangeLog:

PR target/104610
* config/i386/i386-expand.cc (ix86_expand_branch): Handle
512-bit vector with vpcmpeq + kortest.
* config/i386/i386.md (cbranchxi4): New expander.
* config/i386/sse.md: (cbranch<mode>4): Extend to V16SImode
and V8DImode.

gcc/testsuite/ChangeLog:

* gcc.target/i386/pr104610-2.c: New test.
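
For readers following the CF reasoning in this commit message, here is a minimal C sketch of the same flag logic. It is not code from the patch: memeq64_kortest is a hypothetical helper name, the intrinsic path is only compiled when AVX512F is enabled, and the fallback is a portable model of what vpcmpeqd + kortestw compute.

```c
#include <stdint.h>
#include <string.h>
#if defined(__AVX512F__)
#include <immintrin.h>
#endif

/* Hypothetical helper (not part of GCC): returns 1 when the two
   64-byte buffers are equal, 0 otherwise.  */
static int memeq64_kortest(const void *p, const void *q)
{
#if defined(__AVX512F__)
    /* vmovdqu64 + vpcmpeqd: bit i of the mask is set when dword i matches.  */
    __m512i a = _mm512_loadu_si512(p);
    __m512i b = _mm512_loadu_si512(q);
    __mmask16 k = _mm512_cmpeq_epi32_mask(a, b);
    /* kortestw: CF is set only when (k | k) is all ones.  */
    return _mm512_kortestc(k, k);
#else
    /* Portable model of the same flag logic, one mask bit per dword.  */
    uint32_t a[16], b[16];
    uint16_t k = 0;
    memcpy(a, p, sizeof a);
    memcpy(b, q, sizeof b);
    for (int i = 0; i < 16; i++)
        if (a[i] == b[i])
            k |= (uint16_t)(1u << i);
    /* The carry flag kortestw would produce: 1 only for an all-ones mask.  */
    return k == 0xFFFF;
#endif
}
```

A single mismatching dword clears one mask bit, so the mask is no longer all ones and the result is 0, which is why setc can replace the branchy sequence.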

[Bug target/104610] memcmp () == 0 can be optimized better for avx512f

2023-10-10 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104610

--- Comment #22 from Hongtao.liu  ---
For 64-byte memory comparison

int compare (const char* s1, const char* s2)
{
  return __builtin_memcmp (s1, s2, 64) == 0;
}

We're generating

vmovdqu (%rsi), %ymm0
vpxorq  (%rdi), %ymm0, %ymm0
vptest  %ymm0, %ymm0
jne .L2
vmovdqu 32(%rsi), %ymm0
vpxorq  32(%rdi), %ymm0, %ymm0
vptest  %ymm0, %ymm0
je  .L5
.L2:
movl    $1, %eax
xorl    $1, %eax
vzeroupper
ret

An alternative is to use vpcmpeq + kortest and check the carry bit:

vmovdqu64   (%rsi), %zmm0
xorl    %eax, %eax
vpcmpeqd    (%rdi), %zmm0, %k0
kortestw    %k0, %k0
setc    %al
vzeroupper

Not sure if it's better or not.

[Bug target/104610] memcmp () == 0 can be optimized better for avx512f

2023-06-28 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104610

--- Comment #21 from Uroš Bizjak  ---
Just before the patch from Comment #20, the compiler creates (-O2 -mavx):

--cut here--
vmovdqa .LC1(%rip), %xmm0
vmovdqa %xmm0, -24(%rsp)
vmovdqu (%rdi), %xmm0
vpxor   .LC0(%rip), %xmm0, %xmm0
vptest  %xmm0, %xmm0
je  .L5
.L2:
movl    $1, %eax
testl   %eax, %eax
sete    %al
ret
.L5:
vmovdqu 16(%rdi), %xmm0
vpxor   -24(%rsp), %xmm0, %xmm0
vptest  %xmm0, %xmm0
jne .L2
xorl    %eax, %eax
testl   %eax, %eax
sete    %al
ret
--cut here--

Please note the creative way of returning 0 and 1 ... :

movl    $1, %eax
testl   %eax, %eax
sete    %al
ret

Even the new code (From comment #20) is unnecessarily convoluted:

.L2:    movl    $1, %eax
        xorl    $1, %eax
        ret
.L5:    xorl    %eax, %eax
        xorl    $1, %eax
        ret

[Bug target/104610] memcmp () == 0 can be optimized better for avx512f

2023-06-28 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104610

--- Comment #20 from CVS Commits  ---
The master branch has been updated by Roger Sayle :

https://gcc.gnu.org/g:4afbebcdc5780d28e52b7d65643e462c7c3882ce

commit r14-2159-g4afbebcdc5780d28e52b7d65643e462c7c3882ce
Author: Roger Sayle 
Date:   Wed Jun 28 11:11:34 2023 +0100

i386: Add cbranchti4 pattern to i386.md (for -m32 compare_by_pieces).

This patch fixes some very odd (unanticipated) code generation by
compare_by_pieces with -m32 -mavx, since the recent addition of the
cbranchoi4 pattern.  The issue is that cbranchoi4 is available with
TARGET_AVX, but cbranchti4 is currently conditional on TARGET_64BIT
which results in the odd behaviour (thanks to OPTAB_WIDEN) that with
-m32 -mavx, compare_by_pieces ends up (inefficiently) widening 128-bit
comparisons to 256-bits before performing PTEST.

This patch fixes this by providing a cbranchti4 pattern that's available
with either TARGET_64BIT or TARGET_SSE4_1.

For the test case below (again from PR 104610):

int foo(char *a)
{
  static const char t[] = "0123456789012345678901234567890";
  return __builtin_memcmp(a, &t[0], sizeof(t)) == 0;
}

GCC with -m32 -O2 -mavx currently produces the bonkers:

foo:    pushl   %ebp
        movl    %esp, %ebp
        andl    $-32, %esp
        subl    $64, %esp
        movl    8(%ebp), %eax
        vmovdqa .LC0, %xmm4
        movl    $0, 48(%esp)
        vmovdqu (%eax), %xmm2
        movl    $0, 52(%esp)
        movl    $0, 56(%esp)
        movl    $0, 60(%esp)
        movl    $0, 16(%esp)
        movl    $0, 20(%esp)
        movl    $0, 24(%esp)
        movl    $0, 28(%esp)
        vmovdqa %xmm2, 32(%esp)
        vmovdqa %xmm4, (%esp)
        vmovdqa (%esp), %ymm5
        vpxor   32(%esp), %ymm5, %ymm0
        vptest  %ymm0, %ymm0
        jne     .L2
        vmovdqu 16(%eax), %xmm7
        movl    $0, 48(%esp)
        movl    $0, 52(%esp)
        vmovdqa %xmm7, 32(%esp)
        vmovdqa .LC1, %xmm7
        movl    $0, 56(%esp)
        movl    $0, 60(%esp)
        movl    $0, 16(%esp)
        movl    $0, 20(%esp)
        movl    $0, 24(%esp)
        movl    $0, 28(%esp)
        vmovdqa %xmm7, (%esp)
        vmovdqa (%esp), %ymm1
        vpxor   32(%esp), %ymm1, %ymm0
        vptest  %ymm0, %ymm0
        je      .L6
.L2:    movl    $1, %eax
        xorl    $1, %eax
        vzeroupper
        leave
        ret
.L6:    xorl    %eax, %eax
        xorl    $1, %eax
        vzeroupper
        leave
        ret

with this patch, we now generate the (slightly) more sensible:

foo:    vmovdqa .LC0, %xmm0
        movl    4(%esp), %eax
        vpxor   (%eax), %xmm0, %xmm0
        vptest  %xmm0, %xmm0
        jne     .L2
        vmovdqa .LC1, %xmm0
        vpxor   16(%eax), %xmm0, %xmm0
        vptest  %xmm0, %xmm0
        je      .L5
.L2:    movl    $1, %eax
        xorl    $1, %eax
        ret
.L5:    xorl    %eax, %eax
        xorl    $1, %eax
        ret

2023-06-28  Roger Sayle  

gcc/ChangeLog
* config/i386/i386-expand.cc (ix86_expand_branch): Also use ptest
for TImode comparisons on 32-bit architectures.
* config/i386/i386.md (cbranch<mode>4): Change from SDWIM to
SWIM1248x to exclude/avoid TImode being conditional on -m64.
(cbranchti4): New define_expand for TImode on both TARGET_64BIT
and/or with TARGET_SSE4_1.
* config/i386/predicates.md (ix86_timode_comparison_operator):
New predicate that depends upon TARGET_64BIT.
(ix86_timode_comparison_operand): Likewise.

gcc/testsuite/ChangeLog
* gcc.target/i386/pieces-memcmp-2.c: New test case.
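
As a rough illustration of the shape compare_by_pieces aims for after this patch (two 128-bit pieces with an early exit, no widening to 256 bits), here is a minimal C sketch. It is only a model, not GCC code: piece16_differs, memeq32_by_pieces and foo_model are hypothetical names, and the XOR-then-test-for-zero step stands in for vpxor + vptest.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical model of one 16-byte piece: XOR the operands and test
   whether the result is zero, as vpxor + vptest does.  */
static int piece16_differs(const unsigned char *p, const unsigned char *q)
{
    uint64_t a[2], b[2];
    memcpy(a, p, 16);
    memcpy(b, q, 16);
    return ((a[0] ^ b[0]) | (a[1] ^ b[1])) != 0;
}

/* Equality of 32 bytes as two 16-byte pieces; the second piece is only
   examined when the first one matched (the jne .L2 in the listing).  */
static int memeq32_by_pieces(const void *p, const void *q)
{
    const unsigned char *a = p, *b = q;
    if (piece16_differs(a, b))
        return 0;
    return !piece16_differs(a + 16, b + 16);
}

/* Same test string as the PR test case: 31 characters plus the NUL.  */
static int foo_model(const char *a)
{
    static const char t[] = "0123456789012345678901234567890";
    _Static_assert(sizeof(t) == 32, "two 16-byte pieces");
    return memeq32_by_pieces(a, &t[0]);
}
```

Each piece stays at its natural 16-byte width, which is the point of making cbranchti4 available without -m64.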

[Bug target/104610] memcmp () == 0 can be optimized better for avx512f

2022-06-16 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104610

--- Comment #19 from Hongtao.liu  ---
I'm wondering whether targetm.overlap_op_by_pieces_p would help here.

[Bug target/104610] memcmp () == 0 can be optimized better for avx512f

2022-05-17 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104610

--- Comment #18 from Hongtao.liu  ---
Fixed in GCC13.

[Bug target/104610] memcmp () == 0 can be optimized better for avx512f

2022-05-17 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104610

--- Comment #17 from CVS Commits  ---
The master branch has been updated by hongtao Liu :

https://gcc.gnu.org/g:850a13d754497faae91afabc6958780f1d63a574

commit r13-580-g850a13d754497faae91afabc6958780f1d63a574
Author: liuhongt 
Date:   Tue Mar 1 13:41:52 2022 +0800

Expand __builtin_memcmp_eq with ptest for OImode.

gcc/ChangeLog:

PR target/104610
* config/i386/i386-expand.cc (ix86_expand_branch): Use ptest
for OImode when code is EQ or NE.
* config/i386/i386.md (cbranchoi4): New expander.

gcc/testsuite/ChangeLog:

* gcc.target/i386/pr104610.c: New test.

[Bug target/104610] memcmp () == 0 can be optimized better for avx512f

2022-03-27 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104610

Hongtao.liu  changed:

   What|Removed |Added

  Attachment #52495|0   |1
is obsolete||

--- Comment #16 from Hongtao.liu  ---
Created attachment 52692
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52692&action=edit
Patch pending for GCC13

[Bug target/104610] memcmp () == 0 can be optimized better for avx512f

2022-03-27 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104610

--- Comment #15 from Hongtao.liu  ---
Could someone help mark this as blocking PR105073? The patch is ready and waiting
for GCC13.

[Bug target/104610] memcmp () == 0 can be optimized better for avx512f

2022-03-03 Thread hjl.tools at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104610
Bug 104610 depends on bug 104704, which changed state.

Bug 104704 Summary: [12 Regression] ix86_gen_scratch_sse_rtx doesn't work with 
explicit XMM7/XMM15/XMM31 usage
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104704

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

[Bug target/104610] memcmp () == 0 can be optimized better for avx512f

2022-02-27 Thread hjl.tools at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104610

--- Comment #14 from H.J. Lu  ---
(In reply to H.J. Lu from comment #13)
> (In reply to Hongtao.liu from comment #8)
> > Created attachment 52495 [details]
> > untested patch.
> 
> I see these regressions with -m32:
> 
> FAIL: gcc.dg/lower-subreg-1.c scan-rtl-dump subreg1 "Splitting reg"
> FAIL: gcc.target/i386/iamcu/test_basic_64bit_returning.c execution,  -O0 
> FAIL: gcc.target/i386/iamcu/test_basic_64bit_returning.c execution,  -O1 
> FAIL: gcc.target/i386/iamcu/test_basic_64bit_returning.c execution,  -O2 
> FAIL: gcc.target/i386/iamcu/test_basic_64bit_returning.c execution,  -O3 -g 
> FAIL: gcc.target/i386/iamcu/test_basic_64bit_returning.c execution,  -Os 
> FAIL: gcc.target/i386/iamcu/test_struct_returning.c execution,  -O0 
> FAIL: gcc.target/i386/iamcu/test_struct_returning.c execution,  -O1 
> FAIL: gcc.target/i386/iamcu/test_struct_returning.c execution,  -Og -g

-m64 regression:

FAIL: gcc.target/i386/pr82580.c scan-assembler-not \\mmovzb

[Bug target/104610] memcmp () == 0 can be optimized better for avx512f

2022-02-27 Thread hjl.tools at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104610

--- Comment #13 from H.J. Lu  ---
(In reply to Hongtao.liu from comment #8)
> Created attachment 52495 [details]
> untested patch.

I see these regressions with -m32:

FAIL: gcc.dg/lower-subreg-1.c scan-rtl-dump subreg1 "Splitting reg"
FAIL: gcc.target/i386/iamcu/test_basic_64bit_returning.c execution,  -O0 
FAIL: gcc.target/i386/iamcu/test_basic_64bit_returning.c execution,  -O1 
FAIL: gcc.target/i386/iamcu/test_basic_64bit_returning.c execution,  -O2 
FAIL: gcc.target/i386/iamcu/test_basic_64bit_returning.c execution,  -O3 -g 
FAIL: gcc.target/i386/iamcu/test_basic_64bit_returning.c execution,  -Os 
FAIL: gcc.target/i386/iamcu/test_struct_returning.c execution,  -O0 
FAIL: gcc.target/i386/iamcu/test_struct_returning.c execution,  -O1 
FAIL: gcc.target/i386/iamcu/test_struct_returning.c execution,  -Og -g

[Bug target/104610] memcmp () == 0 can be optimized better for avx512f

2022-02-26 Thread hjl.tools at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104610

H.J. Lu  changed:

   What|Removed |Added

 Depends on||104704

--- Comment #12 from H.J. Lu  ---
(In reply to Hongtao.liu from comment #8)
> Created attachment 52495 [details]
> untested patch.
> 
> With the patch, it exposes one potential issue related to dse(or
> ix86_gen_scratch_sse_rtx usage). in dse1, it try to replace load insn with
> equivalent value, but the inserted new insns(insn 45, insn 44, insn 46) will
> set xmm31, but dse is not aware of that, and xmm31 is alive and will be used
> by insn 10 which is exactly after new added insns, and it breaks data flow.
> 

I opened PR 104704.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104704
[Bug 104704] [12 Regression] ix86_gen_scratch_sse_rtx doesn't work with
explicit XMM7/XMM15/XMM31 usage

[Bug target/104610] memcmp () == 0 can be optimized better for avx512f

2022-02-23 Thread hjl.tools at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104610

--- Comment #11 from H.J. Lu  ---
Don't worry about vzeroupper.
It's ok to have vzeroupper.

[Bug target/104610] memcmp () == 0 can be optimized better for avx512f

2022-02-23 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104610

--- Comment #10 from Hongtao.liu  ---
(In reply to H.J. Lu from comment #9)
> ix86_gen_scratch_sse_rtx was added to prevent combine from changing
> store of vector registers with constant value to store of constant
> value.  You can change ix86_gen_scratch_sse_rtx to return a pseudo
> register and watch the regressions in GCC testsuite.  If we can fix
> these regressions, ix86_gen_scratch_sse_rtx isn't needed.

It regresses. I'm thinking of adding a peephole2 to split mov mem into mov + movd +
shufd, which can prevent the regressions for pr100865; for vzeroupper, I don't
have a good way to avoid those regressions.

gcc.target/i386/pr100865-11b.c scan-assembler-times vmovdqa64[\\t ]%xmm[0-9]+, 16
gcc.target/i386/pr100865-12b.c scan-assembler-times vmovdqa64[\\t ]%xmm[0-9]+, 16
gcc.target/i386/pr100865-8a.c scan-assembler-times (?:vpbroadcastd|vpshufd)[\\t ]+[^\n]*, %xmm[0-9]+ 1
gcc.target/i386/pr100865-8b.c scan-assembler-times vmovdqa64[\\t ]%xmm[0-9]+, 16
gcc.target/i386/pr100865-8c.c scan-assembler-times vpshufd[\\t ]+[^\n]*, %xmm[0-9]+ 1
gcc.target/i386/pr100865-9b.c scan-assembler-times vmovdqa64[\\t ]%xmm[0-9]+, 16
gcc.target/i386/pr100865-9c.c scan-assembler-times vpshufd[\\t ]+[^\n]*, %xmm[0-9]+ 1
gcc.target/i386/pr82941-1.c scan-assembler-not vzeroupper
gcc.target/i386/pr82942-1.c scan-assembler-not vzeroupper
gcc.target/i386/pr82990-1.c scan-assembler-not vzeroupper
gcc.target/i386/pr82990-3.c scan-assembler-not vzeroupper
gcc.target/i386/pr82990-5.c scan-assembler-not vzeroupper

[Bug target/104610] memcmp () == 0 can be optimized better for avx512f

2022-02-23 Thread hjl.tools at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104610

H.J. Lu  changed:

   What|Removed |Added

   Last reconfirmed||2022-02-23
 Status|UNCONFIRMED |NEW
 Ever confirmed|0   |1

--- Comment #9 from H.J. Lu  ---
ix86_gen_scratch_sse_rtx was added to prevent combine from changing
store of vector registers with constant value to store of constant
value.  You can change ix86_gen_scratch_sse_rtx to return a pseudo
register and watch the regressions in GCC testsuite.  If we can fix
these regressions, ix86_gen_scratch_sse_rtx isn't needed.

[Bug target/104610] memcmp () == 0 can be optimized better for avx512f

2022-02-22 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104610

--- Comment #8 from Hongtao.liu  ---
Created attachment 52495
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52495&action=edit
untested patch.

With the patch, one potential issue is exposed, related to dse (or to the use of
ix86_gen_scratch_sse_rtx). In dse1, it tries to replace a load insn with an
equivalent value, but the newly inserted insns (insn 45, insn 44, insn 46) set
xmm31. dse is not aware of that: xmm31 is live and is used by insn 10, which
comes right after the newly added insns, so the data flow is broken.

And I think, for the i386 part, maybe we shouldn't use ix86_gen_scratch_sse_rtx in
ix86_expand_vector_move, which is called by emit_move_insn and used in many
pre-reload passes; it may break data flow if other explicit hard registers are
in use.

dump before vs after dse

+(insn 45 8 44 2 (set (reg:DI 91)
+(const_int 4855531112742205610 [0x43624fd242db38aa]))
"gcc/testsuite/gcc.target/i386/avx512f-typecast-1.c":48:8 80 {*movdi_internal}
+ (nil))
+(insn 44 45 46 2 (set (reg:V4DI 67 xmm31)
+(vec_duplicate:V4DI (reg:DI 91)))
"gcc/testsuite/gcc.target/i386/avx512f-typecast-1.c":48:8 7768
{*avx512vl_vec_dup_gprv4di}
+ (expr_list:REG_DEAD (reg:DI 91)
+(nil)))
+(insn 46 44 10 2 (set (reg:OI 90)
+(reg:OI 67 xmm31))
"gcc/testsuite/gcc.target/i386/avx512f-typecast-1.c":48:8 78
{*movoi_internal_avx}
+ (expr_list:REG_EQUAL (const_wide_int
0x43624fd242db38aa43624fd242db38aa43624fd242db38aa43624fd242db38aa)
 (nil)))
(insn 10 46 13 2 (set (mem/j/c:V16SF (plus:DI (reg/f:DI 19 frame)
 (const_int -128 [0xff80])) [4 bd.x+0 S64 A512])
 (reg:V16SF 67 xmm31))
"gcc/testsuite/gcc.target/i386/avx512f-typecast-1.c":48:8 1707
{movv16sf_internal}
  (expr_list:REG_DEAD (reg:V16SF 67 xmm31)
 (nil)))
 (insn 13 10 14 2 (set (reg:OI 86 [ MEM  [(void *)] ])
-(mem/c:OI (plus:DI (reg/f:DI 19 frame)
-(const_int -128 [0xff80])) [0 MEM 
[(void *)]+0 S32 A512]))
"gcc/testsuite/gcc.target/i386/avx512f-typecast-1.c":49:7 78
{*movoi_internal_avx}
- (nil))
+(reg:OI 90)) "gcc/testsuite/gcc.target/i386/avx512f-typecast-1.c":49:7
78 {*movoi_internal_avx}
+ (expr_list:REG_DEAD (reg:OI 90)
+(nil)))

[Bug target/104610] memcmp () == 0 can be optimized better for avx512f

2022-02-22 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104610

--- Comment #7 from Hongtao.liu  ---
(In reply to Hongtao.liu from comment #6)
> (In reply to Hongtao.liu from comment #5)
> > (In reply to Hongtao.liu from comment #4)
> > > (In reply to Hongtao.liu from comment #3)
> > > > (In reply to Hongtao.liu from comment #2)
> > > > > in Gimple, there're
> > > > > 
> > > > >   _1 = __builtin_memcmp_eq (a_5(D), [0], 32);
> > > > >   _2 = _1 == 0;
> > > > >   _6 = (int) _2;
> > > > > 
> > > > > 
> > > > > So it's related to codegen optimization with vectorized codes for
> > > > > __builtin_memcmp_eq, guess we can start with size multiple of 16 
> > > > > bytes?
> > > > > 
> > > > There's no optab or target_hook for backend to participate in 
> > > > optimization
> > But there's cbranch_optab check in can_compare_p, and i386 supports
> > V8SI/V4DI/V4SI/V2DI, but not for OI/TI, adding support for them?
> > 
> > 25899(define_expand "cbranch4"
> > 25900  [(set (reg:CC FLAGS_REG)
> > 25901(compare:CC (match_operand:VI48_AVX 1 "register_operand")
> > 25902(match_operand:VI48_AVX 2 "nonimmediate_operand")))
> > 25903   (set (pc) (if_then_else
> > 25904   (match_operator 0 "bt_comparison_operator"
> > 25905[(reg:CC FLAGS_REG) (const_int 0)])
> > 25906   (label_ref (match_operand 3))
> 
> After supporting cbranchoi4, gcc generates
> 
> _Z1fPc:
> .LFB0:
> .cfi_startproc
> vmovdqa .LC1(%rip), %ymm0
> vpxor   (%rdi), %ymm0, %ymm0
> vptest  %ymm0, %ymm0
> sete%al
> vzeroupper
> 
> which is optimal as clang/llvm does.

Also extend cbranchti4 to use ptest when TARGET_SSE4_1 and CODE is NE or EQ,
so gcc generates

movdqu  (%rdi), %xmm0
movdqa  .LC1(%rip), %xmm1
pxor    %xmm1, %xmm0
ptest   %xmm0, %xmm0
sete    %al

for 

bool f128(char *a)
{
  char t[] = "012345678901234";
  return __builtin_memcmp(a, &t[0], sizeof(t)) == 0;
}

the original codegen is

movabsq $14692989455579448, %rax
xorq    8(%rdi), %rax
movabsq $3978425819141910832, %rdx
xorq    (%rdi), %rdx
orq %rdx, %rax
sete    %al
ret
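
To make the scalar sequence above easier to read, here is a minimal C sketch of what it computes. It is only an illustration: load_le64 and f128_scalar are hypothetical names. The two immediates are simply the little-endian encodings of the string halves "01234567" and "8901234" (plus the terminating NUL), and the result is the XOR/OR reduction followed by sete.

```c
#include <stdint.h>
#include <string.h>

/* Assemble 8 bytes into a little-endian 64-bit value (x86 byte order),
   written with shifts so the sketch does not depend on the host.  */
static uint64_t load_le64(const unsigned char *p)
{
    uint64_t v = 0;
    for (int i = 7; i >= 0; i--)
        v = (v << 8) | p[i];
    return v;
}

/* Hypothetical C rendering of the scalar sequence: two XORs against the
   immediates, one OR, then "sete" on the combined result.  */
static int f128_scalar(const char *a)
{
    const unsigned char *p = (const unsigned char *)a;
    uint64_t lo = load_le64(p) ^ UINT64_C(3978425819141910832);   /* "01234567" */
    uint64_t hi = load_le64(p + 8) ^ UINT64_C(14692989455579448); /* "8901234\0" */
    return (lo | hi) == 0;
}
```

The ptest version performs the same reduction on one 128-bit register instead of two 64-bit ones.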

[Bug target/104610] memcmp () == 0 can be optimized better for avx512f

2022-02-21 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104610

--- Comment #6 from Hongtao.liu  ---
(In reply to Hongtao.liu from comment #5)
> (In reply to Hongtao.liu from comment #4)
> > (In reply to Hongtao.liu from comment #3)
> > > (In reply to Hongtao.liu from comment #2)
> > > > in Gimple, there're
> > > > 
> > > >   _1 = __builtin_memcmp_eq (a_5(D), [0], 32);
> > > >   _2 = _1 == 0;
> > > >   _6 = (int) _2;
> > > > 
> > > > 
> > > > So it's related to codegen optimization with vectorized codes for
> > > > __builtin_memcmp_eq, guess we can start with size multiple of 16 bytes?
> > > > 
> > > There's no optab or target_hook for backend to participate in optimization
> But there's cbranch_optab check in can_compare_p, and i386 supports
> V8SI/V4DI/V4SI/V2DI, but not for OI/TI, adding support for them?
> 
> 25899(define_expand "cbranch4"
> 25900  [(set (reg:CC FLAGS_REG)
> 25901(compare:CC (match_operand:VI48_AVX 1 "register_operand")
> 25902(match_operand:VI48_AVX 2 "nonimmediate_operand")))
> 25903   (set (pc) (if_then_else
> 25904   (match_operator 0 "bt_comparison_operator"
> 25905[(reg:CC FLAGS_REG) (const_int 0)])
> 25906   (label_ref (match_operand 3))

After supporting cbranchoi4, gcc generates

_Z1fPc:
.LFB0:
.cfi_startproc
vmovdqa .LC1(%rip), %ymm0
vpxor   (%rdi), %ymm0, %ymm0
vptest  %ymm0, %ymm0
sete    %al
vzeroupper

which is optimal, matching what clang/llvm generates.

[Bug target/104610] memcmp () == 0 can be optimized better for avx512f

2022-02-21 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104610

--- Comment #5 from Hongtao.liu  ---
(In reply to Hongtao.liu from comment #4)
> (In reply to Hongtao.liu from comment #3)
> > (In reply to Hongtao.liu from comment #2)
> > > in Gimple, there're
> > > 
> > >   _1 = __builtin_memcmp_eq (a_5(D), [0], 32);
> > >   _2 = _1 == 0;
> > >   _6 = (int) _2;
> > > 
> > > 
> > > So it's related to codegen optimization with vectorized codes for
> > > __builtin_memcmp_eq, guess we can start with size multiple of 16 bytes?
> > > 
> > There's no optab or target_hook for backend to participate in optimization
But there's cbranch_optab check in can_compare_p, and i386 supports
V8SI/V4DI/V4SI/V2DI, but not for OI/TI, adding support for them?

(define_expand "cbranch<mode>4"
  [(set (reg:CC FLAGS_REG)
        (compare:CC (match_operand:VI48_AVX 1 "register_operand")
                    (match_operand:VI48_AVX 2 "nonimmediate_operand")))
   (set (pc) (if_then_else
               (match_operator 0 "bt_comparison_operator"
                [(reg:CC FLAGS_REG) (const_int 0)])
               (label_ref (match_operand 3))

[Bug target/104610] memcmp () == 0 can be optimized better for avx512f

2022-02-21 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104610

--- Comment #4 from Hongtao.liu  ---
(In reply to Hongtao.liu from comment #3)
> (In reply to Hongtao.liu from comment #2)
> > in Gimple, there're
> > 
> >   _1 = __builtin_memcmp_eq (a_5(D), [0], 32);
> >   _2 = _1 == 0;
> >   _6 = (int) _2;
> > 
> > 
> > So it's related to codegen optimization with vectorized codes for
> > __builtin_memcmp_eq, guess we can start with size multiple of 16 bytes?
> > 
> There's no optab or target_hook for backend to participate in optimization
> of Participation in optimization.
Typo: the last "optimization" should be compare_by_pieces.

[Bug target/104610] memcmp () == 0 can be optimized better for avx512f

2022-02-21 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104610

--- Comment #3 from Hongtao.liu  ---
(In reply to Hongtao.liu from comment #2)
> in Gimple, there're
> 
>   _1 = __builtin_memcmp_eq (a_5(D), [0], 32);
>   _2 = _1 == 0;
>   _6 = (int) _2;
> 
> 
> So it's related to codegen optimization with vectorized codes for
> __builtin_memcmp_eq, guess we can start with size multiple of 16 bytes?
> 
There's no optab or target_hook for backend to participate in optimization of
Participation in optimization.

[Bug target/104610] memcmp () == 0 can be optimized better for avx512f

2022-02-21 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104610

--- Comment #2 from Hongtao.liu  ---
In GIMPLE, we have:

  _1 = __builtin_memcmp_eq (a_5(D), [0], 32);
  _2 = _1 == 0;
  _6 = (int) _2;


So this is about generating vectorized code for __builtin_memcmp_eq;
I guess we can start with sizes that are multiples of 16 bytes?

Also, I saw that when the size is 9, LLVM generates

f(char*): # @f(char*)
movabs  rcx, 3979270244072042800
xor rcx, qword ptr [rdi]
movzx   edx, byte ptr [rdi + 8]
xor eax, eax
or  rdx, rcx
setne   al
ret


while GCC generates

f(char*):
movabsq $3979270244072042800, %rax
cmpq    %rax, (%rdi)
je  .L5
.L2:
movl    $1, %eax
ret
.L5:
cmpb    $0, 8(%rdi)
jne .L2
xorl    %eax, %eax
ret
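
The difference between the two 9-byte sequences can be sketched in C as below. This is only a model under stated assumptions: differs9_branchy and differs9_branchless are hypothetical names, and the reference string is passed in as a parameter instead of being folded into immediates as both compilers do.

```c
#include <stdint.h>
#include <string.h>

/* Branchy form, like the GCC output: compare the first 8 bytes, and only
   look at the ninth byte when those matched.  */
static int differs9_branchy(const char *a, const char *t)
{
    uint64_t x, y;
    memcpy(&x, a, 8);
    memcpy(&y, t, 8);
    if (x != y)
        return 1;
    return a[8] != t[8];
}

/* Branchless form, like the LLVM output: XOR both pieces, OR the
   differences together, and test once (xor / or / setne).  */
static int differs9_branchless(const char *a, const char *t)
{
    uint64_t x, y;
    memcpy(&x, a, 8);
    memcpy(&y, t, 8);
    uint64_t tail = (uint64_t)((unsigned char)a[8] ^ (unsigned char)t[8]);
    return ((x ^ y) | tail) != 0;
}
```

Both functions return nonzero when the 9-byte buffers differ; the second one always performs both loads but has no conditional jump.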

[Bug target/104610] memcmp () == 0 can be optimized better for avx512f

2022-02-21 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104610

--- Comment #1 from Andrew Pinski  ---
Note even without avx512f, LLVM does:

movdqu  (%rdi), %xmm0
movdqu  16(%rdi), %xmm1
pcmpeqb .LCPI0_0(%rip), %xmm1
pcmpeqb .LCPI0_1(%rip), %xmm0
pand    %xmm1, %xmm0
pmovmskb    %xmm0, %eax
cmpl    $65535, %eax            # imm = 0xFFFF
sete    %al
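
For comparison, the LLVM sequence above corresponds roughly to the following C with SSE2 intrinsics. This is a minimal sketch, not LLVM's or GCC's implementation: memeq32_sse2 is a hypothetical name, both operands are loaded from memory instead of using constant-pool entries, and a plain memcmp fallback is used when SSE2 is not available at compile time.

```c
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: returns 1 when the two 32-byte buffers are equal.  */
static int memeq32_sse2(const void *p, const void *q)
{
#if defined(__SSE2__)
    const unsigned char *a = p, *b = q;
    /* pcmpeqb: each byte becomes 0xFF where the inputs match.  */
    __m128i lo = _mm_cmpeq_epi8(_mm_loadu_si128((const __m128i *)a),
                                _mm_loadu_si128((const __m128i *)b));
    __m128i hi = _mm_cmpeq_epi8(_mm_loadu_si128((const __m128i *)(a + 16)),
                                _mm_loadu_si128((const __m128i *)(b + 16)));
    /* pand + pmovmskb: all 16 mask bits are set only if every byte of
       both halves matched, hence the compare against 65535 (0xFFFF).  */
    return _mm_movemask_epi8(_mm_and_si128(lo, hi)) == 0xFFFF;
#else
    /* Fallback for targets without SSE2.  */
    return memcmp(p, q, 32) == 0;
#endif
}
```

Unlike the ptest variants discussed later in the thread, this form needs only SSE2, at the cost of the extra pmovmskb and cmpl.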