[Bug target/88809] do not use rep-scasb for inline strlen/memchr
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88809

Andrew Pinski changed:
           What    |Removed |Added
----------------------------------------------------------------------------
                 CC|        |novemberizing at gmail dot com

--- Comment #9 from Andrew Pinski ---
*** Bug 99953 has been marked as a duplicate of this bug. ***
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88809

--- Comment #8 from dominiq at gcc dot gnu.org ---
Author: dominiq
Date: Fri May  3 10:00:27 2019
New Revision: 270843

URL: https://gcc.gnu.org/viewcvs?rev=270843&root=gcc&view=rev
Log:
2019-05-03  Dominique d'Humieres

	PR target/88809
	* gcc.target/i386/pr88809.c: Adjust for darwin.
	* gcc.target/i386/pr88809-2.c: Adjust for i386 and darwin.

Modified:
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/gcc.target/i386/pr88809-2.c
    trunk/gcc/testsuite/gcc.target/i386/pr88809.c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88809

Martin Liška changed:
           What    |Removed |Added
----------------------------------------------------------------------------
             Status|ASSIGNED|RESOLVED
         Resolution|---     |FIXED

--- Comment #7 from Martin Liška ---
Fixed. Closing, as I'm not planning to backport it.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88809

--- Comment #6 from Martin Liška ---
Author: marxin
Date: Thu May  2 07:57:38 2019
New Revision: 270787

URL: https://gcc.gnu.org/viewcvs?rev=270787&root=gcc&view=rev
Log:
Prefer to use strlen call instead of inline expansion (PR target/88809).

2019-05-02  Martin Liska

	PR target/88809
	* config/i386/i386.c (ix86_expand_strlen): Use strlen call.
	With -minline-all-stringops use inline expansion using 4B loop.
	* doc/invoke.texi: Document the change of -minline-all-stringops.

2019-05-02  Martin Liska

	PR target/88809
	* gcc.target/i386/pr88809.c: New test.
	* gcc.target/i386/pr88809-2.c: New test.

Added:
    trunk/gcc/testsuite/gcc.target/i386/pr88809-2.c
    trunk/gcc/testsuite/gcc.target/i386/pr88809.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386.c
    trunk/gcc/doc/invoke.texi
    trunk/gcc/testsuite/ChangeLog
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88809

Martin Liška changed:
           What            |Removed |Added
----------------------------------------------------------------------------
   Target Milestone        |---     |10.0
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88809

Martin Liška changed:
           What    |Removed                       |Added
----------------------------------------------------------------------------
             Status|NEW                           |ASSIGNED
           Assignee|unassigned at gcc dot gnu.org |marxin at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88809

Martin Liška changed:
           What            |Removed    |Added
----------------------------------------------------------------------------
             Status        |UNCONFIRMED|NEW
   Last reconfirmed        |           |2019-04-16
     Ever confirmed        |0          |1

--- Comment #5 from Martin Liška ---
I would like to work on this problem. I've read Peter's very detailed
analysis on Stack Overflow, and I have a first couple of questions and
observations:

1) I suggest removing the use of 'rep scasb' entirely; even at -Os the price
paid is quite high.

2) I added strlen instrumentation for -fprofile-{generate,use} and collected
SPEC CPU2006 statistics for the train runs:

Benchmark       strlen calls  calls executed  total executions  avg. strlen
400.perlbench   102           39              2358804           10.97
401.bzip2       0             0               0
403.gcc         144           21              4081              9.3
410.bwaves      0             0               0
416.gamess      0             0               0
429.mcf         0             0               0
433.milc        3             0               0
434.zeusmp      0             0               0
435.gromacs     86792                                           12.46
436.cactusADM   110           46              61788             10.61
437.leslie3d    0             0               0
444.namd        0             0               0
445.gobmk       41            7               75196             2.01
447.dealII      3             0               0
450.soplex      8             6               1161517           25.59
453.povray      67            25              54584             33.25
454.calculix    54            0               0
456.hmmer       9             3               1052              15.1
458.sjeng       0             0               0
459.GemsFDTD    0             0               0
462.libquantum  0             0               0
464.h264ref     12            1               1                 14274.0
465.tonto       0             0               0
470.lbm         0             0               0
471.omnetpp     50            15              24291732          9.79
473.astar       0             0               0
481.wrf         42            15              20490             9.41
482.sphinx3     23            11402963                          1.61
483.xalancbmk   27            3               160               13.04

Columns: Benchmark name, # of strlen calls in the benchmark, # of strlen
calls that were reached during the train run, total number of strlen
executions, average strlen result.

Note: the 14274.0 value for 464.h264ref is correct:

  content_76 = GetConfigFileContent (filename_53);
  _7 = strlen (content_76);

Based on these numbers, the average string for which strlen is called is
quite short (<32B).

3) The assumption that most strlen arguments have a known 16B alignment is
quite optimistic. As mentioned, {c,m}alloc returns memory aligned to that,
but strlen is most commonly called on a generic character pointer for which
we can't prove any alignment.

4) Peter's suggested asm expansion assumes such alignment.
I expect somewhat more complex code for the general-alignment situation.

5) A strlen call has the advantage that, even when a program is compiled with
-O2 -march=x86-64 (typical distribution options), glibc can use an ifunc to
dispatch to an optimized implementation.

6) The decision code in ix86_expand_strlen looks strange to me:

bool
ix86_expand_strlen (rtx out, rtx src, rtx eoschar, rtx align)
{
  rtx addr, scratch1, scratch2, scratch3, scratch4;

  /* The generic case of strlen expander is long.  Avoid it's
     expanding unless TARGET_INLINE_ALL_STRINGOPS.  */
  if (TARGET_UNROLL_STRLEN
      && eoschar == const0_rtx
      && optimize > 1
      && !TARGET_INLINE_ALL_STRINGOPS
      && !optimize_insn_for_size_p ()
      && (!CONST_INT_P (align) || INTVAL (align) < 4))
    return false;

That explains why we generate 'rep scasb' for -O1. My
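As an illustration of the 4-byte-at-a-time inline expansion discussed in this thread, here is a hedged C sketch using the classic zero-byte bithack. The function name, the alignment prologue, and the structure are mine, not GCC's generated code; it assumes a little-endian target (such as x86) and a compiler providing __builtin_ctz.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative 4-byte-at-a-time strlen (not GCC's actual expansion).
   Scans byte-by-byte until the pointer is 4-byte aligned, then checks
   one aligned 32-bit word per iteration.  */
static size_t strlen4 (const char *s)
{
  const char *p = s;

  /* Scalar prologue: reach 4-byte alignment without crossing a word.  */
  while ((uintptr_t) p & 3)
    {
      if (*p == '\0')
	return (size_t) (p - s);
      p++;
    }

  for (;;)
    {
      uint32_t w;
      memcpy (&w, p, 4);	/* aligned 4-byte load */
      /* Classic bithack: sets the 0x80 bit of each byte of W that was
	 zero; exact because of the & ~w term.  */
      uint32_t hit = (w - 0x01010101u) & ~w & 0x80808080u;
      if (hit)
	/* Lowest set bit marks the first zero byte (little-endian).  */
	return (size_t) (p - s) + (size_t) (__builtin_ctz (hit) >> 3);
      p += 4;
    }
}
```

The aligned word load may read a few bytes past the terminator, but never past the containing 4-byte word, so it cannot fault on a page boundary.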
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88809

Peter Cordes changed:
           What    |Removed |Added
----------------------------------------------------------------------------
                 CC|        |peter at cordes dot ca

--- Comment #4 from Peter Cordes ---
Yes, rep scasb is abysmal, and gcc -O3's 4-byte-at-a-time scalar loop is not
very good either.

With 16-byte alignment (which we have from calloc on x86-64 System V), we can
inline a *much* better SSE2 loop. See
https://stackoverflow.com/a/55589634/224132 for more details and
microbenchmarks. On Skylake it's about 4 to 5x faster than the current 4-byte
loop for large strings, 3x faster for short strings.

For short strings (strlen=33), it's about 1.5x faster than calling strlen.
For very large strings (too big for L2 cache), it's ~1.7x slower than glibc's
AVX2 strlen.

The lack of VEX encoding for pxor and pmovmskb is just me being lazy; let gcc
emit them all with VEX if AVX is enabled.

    # at this point gcc has `s` in RDX, `i` in ECX
    pxor     %xmm0, %xmm0        # zeroed vector to compare against
    .p2align 4
.Lstrlen16:                      # do {
#ifdef __AVX__
    vpcmpeqb (%rdx), %xmm0, %xmm1
#else
    movdqa   (%rdx), %xmm1
    pcmpeqb  %xmm0, %xmm1        # xmm1 = -1 where there was a 0 in memory
#endif
    add      $16, %rdx           # ptr++
    pmovmskb %xmm1, %eax         # extract high bit of each byte to a 16-bit mask
    test     %eax, %eax
    jz       .Lstrlen16          # } while (mask == 0);
    # RDX points at the 16-byte chunk *after* the one containing the terminator
    # EAX = bit-mask of the 0 bytes, and is known to be non-zero
    bsf      %eax, %eax          # EAX = bit-index of the lowest set bit

    # terminator is at rdx+rax - 16
    # movb  $'A', -16(%rdx, %rax)  # for a microbench that used s[strlen(s)]='A'
    sub      %rbp, %rdx           # p -= start
    lea      -16(%rdx, %rax), %rdx  # p += byte_within_vector - 16

We should actually use REP BSF, because that's faster on AMD (tzcnt) and the
same speed on Intel.

There's also an inline-asm implementation of it, with a microbenchmark
adapted from the SO question (compile with -DUSE_ASM -DREAD_ONLY to benchmark
a fixed length repeatedly): https://godbolt.org/z/9tuVE5

It uses clock() for timing, which I didn't bother updating.
I made it possible to run for lots of iterations for consistent timing (and
so the real-work portion dominates the runtime, letting us use perf stat to
measure it).

If we only have 4-byte alignment, maybe check the first 4B, then do
(p+4) & ~7 to either overlap that 4B again or not when we start 8B chunks.
But it's probably good to get to 16-byte alignment and do whole SSE2 vectors,
because repeating an aligned 16-byte test that overlaps an 8-byte test costs
the same as doing another 8-byte test (except on CPUs like Bobcat that split
128-bit vectors into 64-bit halves). The extra AND to round down to an
alignment boundary is all it takes, plus the code-size cost of peeling one
iteration each of 4B and 8B before a 16-byte loop.

We can do the 4B / 8B steps with movd / movq instead of movdqa. For pmovmskb,
we can ignore the compare-true results for the upper 8 bytes by testing the
result with `test %al,%al`, or in general with `test $0x0F, %al` to check
only the low 4 bits of EAX for the 4-byte case.

The scalar bithack version can use BSF instead of a CMOV binary search for
the byte with a set high bit. That should be a win if we ever wanted to do
scalar on some x86 target, especially with 8-byte registers, or on AArch64.
AArch64 can rbit / clz to emulate bsf and find the position of the first set
bit. (Without efficient SIMD compare-result -> integer_mask, or efficient
SIMD -> integer at all on some ARM / AArch64 chips, SIMD compares for search
loops aren't always (ever?) a win. IIRC, glibc strlen and memchr don't use
vectors on ARM / AArch64, just scalar bithacks.)
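The SSE2 asm loop above can be expressed with intrinsics. Here is a minimal C sketch; the function name is mine, and like the asm it assumes the string starts at a 16-byte-aligned address (e.g. fresh from calloc on x86-64 System V):

```c
#include <emmintrin.h>	/* SSE2 intrinsics */
#include <stddef.h>

/* Illustrative SSE2 strlen; S must be 16-byte aligned.  Mirrors the asm
   loop: compare a whole 16-byte chunk against zero, extract a byte mask,
   and locate the first zero byte with a bit scan.  */
static size_t strlen_sse2_aligned (const char *s)
{
  const __m128i zero = _mm_setzero_si128 ();
  const char *p = s;
  for (;;)
    {
      /* Aligned load; never crosses into a new page before the
	 terminator's chunk, because the start is 16-byte aligned.  */
      __m128i chunk = _mm_load_si128 ((const __m128i *) p);
      __m128i eq = _mm_cmpeq_epi8 (chunk, zero);   /* -1 where byte == 0 */
      unsigned mask = (unsigned) _mm_movemask_epi8 (eq);  /* like pmovmskb */
      if (mask)	/* at least one zero byte in this chunk */
	return (size_t) (p - s) + (size_t) __builtin_ctz (mask);  /* like bsf */
      p += 16;
    }
}
```

It may read up to 15 bytes past the terminator, but always within the terminator's own aligned 16-byte chunk, which is the property that makes the over-read safe.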
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88809

--- Comment #3 from Marc Glisse ---
(In reply to Alexander Monakov from comment #0)
> Therefore I suggest we don't use rep-scasb for inline strlen anymore by
> default (we currently do at -Os).

According to https://stackoverflow.com/q/55563598/1918193, we also do at -O1.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88809

--- Comment #2 from Andrew Pinski ---
> (although to be fair, a call to strlen prevents use of redzone and clobbers
> more registers)

And causes more register pressure ...
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88809

--- Comment #1 from Andrew Pinski ---
(In reply to Alexander Monakov from comment #0)
> Therefore I suggest we don't use rep-scasb for inline strlen anymore by
> default (we currently do at -Os). This is in part motivated by PR 88793 and
> the Redhat bug referenced from there.

Is it smaller to call a function or to inline it? -Os really truly means
optimize for size no matter what. I know non-embedded folks don't like that,
and it is also the reason why Apple added -Oz (a response to a similar issue
with -Os, but on PowerPC, where the string instructions were used).