[Bug target/88809] do not use rep-scasb for inline strlen/memchr

2021-04-07 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88809

Andrew Pinski  changed:

   What|Removed |Added

 CC||novemberizing at gmail dot com

--- Comment #9 from Andrew Pinski  ---
*** Bug 99953 has been marked as a duplicate of this bug. ***

[Bug target/88809] do not use rep-scasb for inline strlen/memchr

2019-05-03 Thread dominiq at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88809

--- Comment #8 from dominiq at gcc dot gnu.org ---
Author: dominiq
Date: Fri May  3 10:00:27 2019
New Revision: 270843

URL: https://gcc.gnu.org/viewcvs?rev=270843&root=gcc&view=rev
Log:
2019-05-03  Dominique d'Humieres  

PR target/88809
* gcc.target/i386/pr88809.c: Adjust for darwin.
* gcc.target/i386/pr88809-2.c: Adjust for i386 and darwin.


Modified:
trunk/gcc/testsuite/ChangeLog
trunk/gcc/testsuite/gcc.target/i386/pr88809-2.c
trunk/gcc/testsuite/gcc.target/i386/pr88809.c

[Bug target/88809] do not use rep-scasb for inline strlen/memchr

2019-05-02 Thread marxin at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88809

Martin Liška  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #7 from Martin Liška  ---
Fixed, closing as I'm not planning to backport that.

[Bug target/88809] do not use rep-scasb for inline strlen/memchr

2019-05-02 Thread marxin at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88809

--- Comment #6 from Martin Liška  ---
Author: marxin
Date: Thu May  2 07:57:38 2019
New Revision: 270787

URL: https://gcc.gnu.org/viewcvs?rev=270787&root=gcc&view=rev
Log:
Prefer to use strlen call instead of inline expansion (PR target/88809).

2019-05-02  Martin Liska  

PR target/88809
* config/i386/i386.c (ix86_expand_strlen): Use strlen call.
With -minline-all-stringops use inline expansion using 4B loop.
* doc/invoke.texi: Document the change of
-minline-all-stringops.
2019-05-02  Martin Liska  

PR target/88809
* gcc.target/i386/pr88809.c: New test.
* gcc.target/i386/pr88809-2.c: New test.

Added:
trunk/gcc/testsuite/gcc.target/i386/pr88809-2.c
trunk/gcc/testsuite/gcc.target/i386/pr88809.c
Modified:
trunk/gcc/ChangeLog
trunk/gcc/config/i386/i386.c
trunk/gcc/doc/invoke.texi
trunk/gcc/testsuite/ChangeLog

[Bug target/88809] do not use rep-scasb for inline strlen/memchr

2019-04-25 Thread marxin at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88809

Martin Liška  changed:

   What|Removed |Added

   Target Milestone|--- |10.0

[Bug target/88809] do not use rep-scasb for inline strlen/memchr

2019-04-16 Thread marxin at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88809

Martin Liška  changed:

   What|Removed |Added

 Status|NEW |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |marxin at gcc dot gnu.org

[Bug target/88809] do not use rep-scasb for inline strlen/memchr

2019-04-16 Thread marxin at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88809

Martin Liška  changed:

   What|Removed |Added

 Status|UNCONFIRMED |NEW
   Last reconfirmed||2019-04-16
 Ever confirmed|0   |1

--- Comment #5 from Martin Liška  ---
I would like to work on this problem. I've read Peter's very detailed analysis
on Stack Overflow, and I have a first couple of questions and observations:

1) I would suggest removing the use of 'rep scasb' entirely; even for -Os the
price paid is quite high.
2) I've added strlen instrumentation for -fprofile-{generate,use} and collected
SPEC CPU2006 statistics for train runs:

Benchmark        strlen calls  calls executed  total executions  avg. strlen
400.perlbench             102              39           2358804        10.97
401.bzip2                   0               0                 0
403.gcc                   144              21              4081         9.3
410.bwaves                  0               0                 0
416.gamess                  0               0                 0
429.mcf                     0               0                 0
433.milc                    3               0                 0
434.zeusmp                  0               0                 0
435.gromacs                 8               6               792        12.46
436.cactusADM             110              46             61788        10.61
437.leslie3d                0               0                 0
444.namd                    0               0                 0
445.gobmk                  41               7             75196         2.01
447.dealII                  3               0                 0
450.soplex                 86              11             61517        25.59
453.povray                 67              25             54584        33.25
454.calculix               54               0                 0
456.hmmer                   9               3              1052         15.1
458.sjeng                   0               0                 0
459.GemsFDTD                0               0                 0
462.libquantum              0               0                 0
464.h264ref                12               1                 1      14274.0
465.tonto                   0               0                 0
470.lbm                     0               0                 0
471.omnetpp                50              15          24291732         9.79
473.astar                   0               0                 0
481.wrf                    42              15             20490         9.41
482.sphinx3                23              11            402963         1.61
483.xalancbmk             273              16                 0        13.04

Columns: benchmark name, number of strlen calls in the benchmark, number of
those calls that were executed during the train run, total number of strlen
executions, and average string length.

Note: the 14274.0 value for 464.h264ref is correct:

  content_76 = GetConfigFileContent (filename_53);
  _7 = strlen (content_76);

Based on the numbers, the average string for which strlen is called is quite
short (<32B).

3) The assumption that most strlen arguments have a known 16B alignment is
quite optimistic. As mentioned, {c,}alloc returns memory aligned to that, but
strlen is most commonly called on a generic character pointer for which we
can't prove the alignment.

4) Peter's suggested asm expansion assumes such alignment. I expect somewhat
more complex code would be needed for the general-alignment case?

5) A strlen call has the advantage that, even when the code is compiled with
-O2 -march=x86-64 (typical distribution options), glibc can use an ifunc to
dispatch to an optimized implementation.

6) The decision code in ix86_expand_strlen looks strange to me:

bool
ix86_expand_strlen (rtx out, rtx src, rtx eoschar, rtx align)
{
  rtx addr, scratch1, scratch2, scratch3, scratch4;

  /* The generic case of strlen expander is long.  Avoid it's
 expanding unless TARGET_INLINE_ALL_STRINGOPS.  */

  if (TARGET_UNROLL_STRLEN && eoschar == const0_rtx && optimize > 1
  && !TARGET_INLINE_ALL_STRINGOPS
  && !optimize_insn_for_size_p ()
  && (!CONST_INT_P (align) || INTVAL (align) < 4))
return false;

That explains why we generate 'rep scasb' for -O1.


[Bug target/88809] do not use rep-scasb for inline strlen/memchr

2019-04-09 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88809

Peter Cordes  changed:

   What|Removed |Added

 CC||peter at cordes dot ca

--- Comment #4 from Peter Cordes  ---
Yes, rep scasb is abysmal, and gcc -O3's 4-byte-at-a-time scalar loop is not
very good either.

With 16-byte alignment (which we have from calloc on x86-64 System V), we can
inline a *much* better SSE2 loop. See
https://stackoverflow.com/a/55589634/224132 for more details and
microbenchmarks.

On Skylake it's about 4 to 5x faster than the current 4-byte loop for large
strings, 3x faster for short strings.  For short strings (strlen=33), it's
about 1.5x faster than calling strlen.  For very large strings (too big for L2
cache), it's ~1.7x slower than glibc's AVX2 strlen.

The lack of VEX encoding for pxor and pmovmskb is just me being lazy; let gcc
emit them all with VEX if AVX is enabled.

   # at this point gcc has `s` in RDX, `i` in ECX

pxor   %xmm0, %xmm0 # zeroed vector to compare against
.p2align 4
.Lstrlen16: # do {
#ifdef __AVX__
vpcmpeqb   (%rdx), %xmm0, %xmm1
#else
movdqa (%rdx), %xmm1
pcmpeqb%xmm0, %xmm1   # xmm1 = -1 where there was a 0 in memory
#endif

add $16, %rdx # ptr++
pmovmskb  %xmm1, %eax # extract high bit of each byte to a 16-bit mask
test   %eax, %eax
jz.Lstrlen16# }while(mask==0);
# RDX points at the 16-byte chunk *after* the one containing the terminator
# EAX = bit-mask of the 0 bytes, and is known to be non-zero
bsf%eax, %eax   # EAX = bit-index of the lowest set bit

# terminator is at rdx+rax - 16
#  movb   $'A', -16(%rdx, %rax)  // for a microbench that used s[strlen(s)]='A'
sub%rbp, %rdx   # p -= start
lea   -16(%rdx, %rax), %rax   # result = p + byte_within_vector - 16

We should actually use REP BSF (i.e. TZCNT) because that's faster on AMD, and
the same speed on Intel.
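For reference, the loop above can be sketched in C with SSE2 intrinsics. The
function name is mine, and it keeps the same assumption as the asm: the input
pointer must be 16-byte aligned (movdqa / _mm_load_si128 faults otherwise).

```c
#include <emmintrin.h>
#include <stddef.h>

/* Sketch of the 16-byte-at-a-time SSE2 strlen loop; `s` must be
   16-byte aligned (e.g. memory from calloc on x86-64 System V). */
static size_t strlen_sse2_aligned(const char *s)
{
    const __m128i zero = _mm_setzero_si128();  /* pxor %xmm0, %xmm0 */
    const char *p = s;
    for (;;) {
        __m128i chunk = _mm_load_si128((const __m128i *)p); /* movdqa */
        __m128i eq    = _mm_cmpeq_epi8(chunk, zero);        /* pcmpeqb */
        int mask = _mm_movemask_epi8(eq);                   /* pmovmskb */
        if (mask)
            /* bsf/tzcnt: bit index of the first 0 byte in this chunk */
            return (size_t)(p - s) + (size_t)__builtin_ctz(mask);
        p += 16;
    }
}
```

With -O2 and AVX enabled, GCC emits the VEX forms (vpcmpeqb etc.) of these
intrinsics automatically, matching the #ifdef __AVX__ branch above.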


There is also an inline-asm implementation with a microbenchmark adapted from
the SO question (compile with -DUSE_ASM -DREAD_ONLY to benchmark a fixed
length repeatedly):
https://godbolt.org/z/9tuVE5

It uses clock() for timing, which I didn't bother updating.  I made it possible
to run it for lots of iterations for consistent timing.  (And so the real work
portion dominates the runtime so we can use perf stat to measure it.)




If we only have 4-byte alignment, maybe check the first 4B, then do (p+4) & ~7
to either overlap that 4B again or not when we start 8B chunks.  But probably
it's good to get to 16-byte alignment and do whole SSE2 vectors, because
repeating an aligned 16-byte test that overlaps an 8-byte test costs the same
as doing another 8-byte test.  (Except on CPUs like Bobcat that split 128-bit
vectors into 64-bit halves).  The extra AND to round down to an alignment
boundary is all it takes, plus the code-size cost of peeling 1 iteration each
of 4B and 8B before a 16-byte loop.

We can use 4B / 8B with movd / movq instead of movdqa.  For pmovmskb, we can
ignore the compare-true results for the upper 8 bytes by testing the result
with `test %al,%al`, or in general with `test $0x0F, %al` to check only the low
4 bits of EAX for the 4-byte case.
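The round-down idea can be sketched in C as follows (a hypothetical helper,
not the proposed expansion): start at the aligned 16-byte chunk containing the
string and mask off compare bits for bytes that precede it, analogous to the
`test $0x0F, %al` trick for narrower loads. Reads within an aligned chunk
cannot cross a page, so touching bytes before `s` is safe at the machine level.

```c
#include <emmintrin.h>
#include <stddef.h>
#include <stdint.h>

/* SSE2 strlen for arbitrary alignment: round the pointer down to a
   16-byte boundary, then ignore 0-matches before the string start. */
static size_t strlen_sse2(const char *s)
{
    const __m128i zero = _mm_setzero_si128();
    uintptr_t addr = (uintptr_t)s;
    const char *p = (const char *)(addr & ~(uintptr_t)15); /* round down */
    unsigned skip = (unsigned)(addr & 15); /* bytes before the string */
    __m128i chunk = _mm_load_si128((const __m128i *)p);
    unsigned mask = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(chunk, zero))
                    & (~0u << skip);       /* drop matches before `s` */
    while (!mask) {
        p += 16;
        chunk = _mm_load_si128((const __m128i *)p);
        mask = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(chunk, zero));
    }
    return (size_t)(p + __builtin_ctz(mask) - s);
}
```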



The scalar bithack version can use BSF instead of CMOV binary search for the
byte with a set high bit.  That should be a win if we ever wanted to do scalar
on some x86 target especially with 8-byte registers, or on AArch64.  AArch64
can rbit / clz to emulate bsf and find the position of the first set bit.
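As a sketch (scalar, little-endian, assuming 8-byte-aligned input so that
whole-word loads cannot cross a page), the bithack-plus-BSF idea looks like:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Scalar 8-bytes-at-a-time strlen: the classic SWAR expression
   (v - 0x01..01) & ~v & 0x80..80 sets the high bit of each zero byte;
   ctz (bsf/tzcnt, or rbit+clz on AArch64) then locates the first one,
   replacing a CMOV binary search over the bytes. */
static size_t strlen_swar(const char *s)
{
    const uint64_t ones  = 0x0101010101010101ull;
    const uint64_t highs = 0x8080808080808080ull;
    const char *p = s;
    for (;;) {
        uint64_t v;
        memcpy(&v, p, 8);                 /* compiles to one aligned load */
        uint64_t zeros = (v - ones) & ~v & highs;
        if (zeros)
            /* bit index of the first set high bit, / 8 = byte index */
            return (size_t)(p - s) + (size_t)(__builtin_ctzll(zeros) >> 3);
        p += 8;
    }
}
```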

(Without efficient SIMD compare result -> integer_mask, or efficient SIMD ->
integer at all on some ARM / AArch64 chips, SIMD compares for search loops
aren't always (ever?) a win.  IIRC, glibc strlen and memchr don't use vectors
on ARM / AArch64, just scalar bithacks.)

[Bug target/88809] do not use rep-scasb for inline strlen/memchr

2019-04-07 Thread glisse at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88809

--- Comment #3 from Marc Glisse  ---
(In reply to Alexander Monakov from comment #0)
> Therefore I suggest we don't use rep-scasb for inline strlen anymore by
> default (we currently do at -Os).

According to https://stackoverflow.com/q/55563598/1918193 , we also do at -O1.

[Bug target/88809] do not use rep-scasb for inline strlen/memchr

2019-01-11 Thread pinskia at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88809

--- Comment #2 from Andrew Pinski  ---
>(although to be fair, a call to strlen prevents use of redzone and clobbers 
>more registers)

And causes more register pressure ...

[Bug target/88809] do not use rep-scasb for inline strlen/memchr

2019-01-11 Thread pinskia at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88809

--- Comment #1 from Andrew Pinski  ---
(In reply to Alexander Monakov from comment #0)
> Therefore I suggest we don't use rep-scasb for inline strlen anymore by
> default (we currently do at -Os). This is in part motivated by PR 88793 and
> the Redhat bug referenced from there.

Is it smaller to call a function or inline it?  -Os really is truly
optimize-for-size no matter what.  I know non-embedded folks don't like that,
and it is also the reason why Apple added -Oz (a similar thing to this -Os
issue, but on PowerPC where the string instructions are used).