https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69908
Bug ID: 69908
Summary: recognizing idioms that check for a buffer of all-zeros could
         make *much* better code
Product: gcc
Version: 5.3.0
Status: UNCONFIRMED
Keywords: missed-optimization, ssemmx
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---

Checking a block of memory to see if it's all-zero, or to find the first
non-zero byte, seems to be a not-uncommon problem.  It's really hard to get
gcc to emit code that's even half-way efficient for it.

The most recent stackoverflow question about this (with links to previous
ones) is
http://stackoverflow.com/questions/35450237/fastest-way-to-check-mass-data-if-null-in-c

Summary:

* gcc would benefit a lot from recognizing zero-checking idioms (a suggested
  asm loop for x86 is below).
* one zero-checking function compiles to bad x86 code in multiple ways.
* even a simple byte-at-a-time loop on a fixed-size buffer compiles to
  byte-at-a-time asm.
* gcc is bad at horizontal reductions, esp. with AVX2.

I'm using x86 asm to show whether gcc auto-vectorizes or not, but the
underlying problem is architecture-independent.

---------

Ideally we'd like the main loop in these functions to test 64B at a time (a
whole cache line on all modern x86 microarchitectures), something like:

    ... some intro stuff ...
    pxor      %xmm5, %xmm5
.Lvec_loop:
    movdqa     (%rsi), %xmm0
    por      16(%rsi), %xmm0
    por      32(%rsi), %xmm0
    por      48(%rsi), %xmm0

    # ptest    %xmm0, %xmm0        # SSE4.1
    # jnz      .Lnonzero_found

    pcmpeqb   %xmm5, %xmm0
    pmovmskb  %xmm0, %eax
    cmp       $0xffff, %eax        # check that all the bytes compared equal to zero
    jne       .Lnonzero_found

    add       $64, %rsi
    cmp       pointer, end
    jb        .Lvec_loop
    # Intel: 9 fused-domain uops in the loop:
    # one too many to issue in 2 cycles and saturate 2 loads per cycle.

    # epilogue for the final partial cache line.
    # We can test some bytes again,
    # e.g. using 16B unaligned loads that end at the correct place:
    movdqu  -16(end), %xmm0
    movdqu  -32(end), %xmm1
    por      %xmm1, %xmm0
    movdqu  -48(end), %xmm1
    por      %xmm1, %xmm0
    movdqu  -64(end), %xmm3
    por      %xmm3, %xmm0
    # ptest or pcmpeq / pmovmskb / cmp

We'd have the intro code handle inputs smaller than 64B, so the epilogue
can't access memory from before the start of the buffer.

pcmpeq / pmovmsk / cmp is better than pshufd / por / movq / test, esp. for
32-bit, where another round of horizontal reduction would be needed.

It might be better to use two accumulators to make better use of two load
ports hitting a single cache line, but hopefully the loads will dispatch
mostly in program order, so there won't be many cache-bank conflicts on
SnB/IvB when multiple iterations are in flight at once.  The POR dep chain
is not loop-carried, and out-of-order execution should hide it nicely.

I have no idea how to write C (without intrinsics) that would auto-vectorize
to anything like that, or even to something acceptable.  It would be nice if
there were some kind of idiom that compilers could recognize and make good
code for, without needing custom code for every platform where we want
non-terrible output.
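For what it's worth, here's a sketch of one alternative C spelling of the
check that a compiler could in principle pattern-match.  The function name
is just for illustration and I'm not claiming gcc special-cases this today:
memcmp returns 0 exactly when the buffers are byte-identical, so comparing
against a known-all-zero block is the same test, at the cost of nominally
touching the zero block as well.

    #include <string.h>

    static const unsigned char zeros[4096];    /* zero-initialized */

    /* hypothetical example; fixed 4096-byte input for simplicity */
    int dataisnull_memcmp(const void *data)
    {
        /* equal to the all-zero block  <=>  every byte of data is zero */
        return memcmp(data, zeros, sizeof zeros) == 0;
    }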
----------

The most solid attempt on that SO question ORs together eight size_t
elements in the main loop, then uses a byte cleanup loop.  gcc makes a mess
of it.

Summary of problems with gcc 5.3.0 -O3 -march=haswell for this function
(I can report separate bugs for the separate problems; other than
recognizing zero-checking idioms, most of them could probably be fixed
separately):

* gcc doesn't realize that we're ultimately testing for all-zero, and just
  treats the OR like any other associative operation.
* even a simple byte loop over a fixed-size buffer doesn't get optimized at
  all (different function, see below).
* the main loop isn't auto-vectorized.
* the word-at-a-time and byte-at-a-time cleanup loops are compiled as full
  loops: gcc doesn't realize they're just cleanup that will only ever handle
  less than one vector of data.
* the word-at-a-time cleanup loop gets a bloated fully-unrolled scalar intro
  (which is all that will ever run).
* the auto-vectorized byte cleanup loop unpacks vectors of bytes to longs
  before ORing, with a big chain of vextracti128 / vmovzx.
* without AVX2, gcc fully unrolls the unaligned epilogue of the
  auto-vectorized byte cleanup loop.

Because of the logic of the function, the bad auto-vectorized cleanup-loop
code will never run; only the scalar intro loops will.  Presumably gcc would
generate the nasty pmovzx byte-unpacking code in situations where it would
actually run.  The byte cleanup loop has a byte-at-a-time scalar intro loop
(not unrolled), which is ok I guess.

// See on godbolt: http://goo.gl/DHKUIE
int dataisnull_orig_evenworse(const void *data, size_t length)
{
    /* assuming data was returned by malloc, thus is properly aligned */
    size_t i = 0, n = length / sizeof(size_t);
    const size_t *pw = data;
    const unsigned char *pb = data;

#define UNROLL_FACTOR  8
    size_t n1 = n - n % UNROLL_FACTOR;
    for (; i < n1; i += UNROLL_FACTOR) {
        size_t val = pw[i + 0] | pw[i + 1] | pw[i + 2] | pw[i + 3]
                   | pw[i + 4] | pw[i + 5] | pw[i + 6] | pw[i + 7];
        if (val)
            return 0;
    }

    size_t val = 0;
    // gcc fully unrolls this cleanup loop
    for (; i < n; i++) {
        val |= pw[i];
    }

    i = n * sizeof(size_t);
    // Without AVX2, gcc fully unrolls the unaligned epilogue of the
    // auto-vectorization of this byte cleanup loop
    for (; i < length; i++) {
        val |= pb[i];
    }
    return val == 0;
}

The main loop doesn't get auto-vectorized at all, so it's just a sequence of
straightforward scalar ORs and then a conditional branch.  The cleanup code
is massively bloated at -O3, though.  The byte cleanup loop gets
auto-vectorized, and gcc doesn't figure out that it can only ever process at
most 7 bytes.  Worst of all, it unpacks bytes to longs before ORing, with a
chain of many vextracti128, 2x vpmovzxbw, 4x vpmovzxwd, and 8x vpmovzxdq.

-----

Even the sequence to horizontally OR a single vector down to a long is
clunky:

    vpxor       %xmm1, %xmm1, %xmm1
    vperm2i128  $33, %ymm1, %ymm0, %ymm2
    vpor        %ymm2, %ymm0, %ymm0
    vperm2i128  $33, %ymm1, %ymm0, %ymm1
    vpalignr    $8, %ymm0, %ymm1, %ymm1
    vpor        %ymm1, %ymm0, %ymm0
    vmovq       %xmm0, %rdx

With only -mavx, it only uses 16B vectors, and does the horizontal OR with:

    vpsrldq  $8, %xmm2, %xmm0
    vpor     %xmm0, %xmm2, %xmm2
    addq     %r12, %rcx
    vmovq    %xmm2, %rdx

The optimal AVX2 sequence would use vextracti128, not vperm2i128.  Shuffling
with a zeroed vector is perverse, and 256b lane-crossing shuffles are slow.
Reducing to a 128b vector ASAP is better for power and latency, especially
on a hypothetical CPU that implements AVX2 the way the Bulldozer family does
AVX (with 128b execution units).

    # suggested optimal horizontal-OR sequence
    vextracti128  $1, %ymm0, %xmm1
    vpor          %xmm1, %xmm0, %xmm0
    vpshufd       $0b01001110, %xmm0, %xmm1   # swap upper/lower halves
    vpor          %xmm1, %xmm0, %xmm0
    vmovq         %xmm0, %rdx
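Written with intrinsics, that reduction would look something like the sketch
below (requires -mavx2 and a 64-bit target; this just shows the intended
shape, not code gcc produces today):

    #include <stdint.h>
    #include <immintrin.h>

    /* horizontal OR of a 256b vector down to a uint64_t:
       narrow to 128b first, then swap the 64b halves with a shuffle */
    static inline uint64_t hor_or_epi64(__m256i v)
    {
        __m128i lo = _mm256_castsi256_si128(v);
        __m128i hi = _mm256_extracti128_si256(v, 1);   /* vextracti128 */
        __m128i x  = _mm_or_si128(lo, hi);             /* vpor, 128b from here on */
        __m128i sw = _mm_shuffle_epi32(x, 0x4e);       /* vpshufd: swap 64b halves */
        x = _mm_or_si128(x, sw);                       /* vpor */
        return (uint64_t)_mm_cvtsi128_si64(x);         /* vmovq */
    }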
Using pshufd works great in the non-AVX case, saving a movdqa because it
isn't an in-place (destructive) shuffle the way psrldq and unpckhqdq are.
pshufd is also slightly cheaper than psrldq on Pentium M, but apparently not
on first-gen Core2 (according to Agner Fog's tables).  It's a tiny
difference: avoiding the movdqa with pshufd is the way to go for the benefit
of all the other CPUs in builds without -mavx.

With -mtune=core2, movdqa / unpckhqdq would be optimal, because shuffles
with 64b elements are very cheap on Merom.

movhlps %xmm0, %xmm1 would be good, except for a potential bypass delay
(int->float and back) on some CPUs (probably Nehalem).  We have a scratch
register (xmm1) that's part of the same dep chain, so it has to be ready
already anyway.

palignr $8, %xmm0, %xmm1 has possibilities (since we have a scratch reg we
don't mind a data dependency on): it's slightly faster than pshufd on
first-gen Core2 (Merom/Conroe), but the same everywhere else.  It has a
longer instruction encoding than pshufd, though, so it's actually worse than
pshufd everywhere else (and much harder to use, because we need to identify
a scratch reg that we already have a data dependency on).

What gcc actually does with -march=core2 is movdqa / psrldq / por down to
one byte, then a movaps store, then a movzbl load into an integer register.
Obviously movd / movzbl %dl, %edx would be better, or gcc could recognize
the zero-checking and not bother zeroing the upper bytes at all: if the low
byte is zero, the upper bytes will be too.

And if we're optimizing for first-gen Core2 (not 45nm Penryn, which has a
full 128b shuffle unit), then palignr or movdqa / unpckhqdq would be
optimal.  Or movq to an integer register and finish with mov/shr/or in
integer registers: those instructions are shorter, so you get less of a
decode bottleneck (definitely an issue for CPUs without a uop cache, given
how long SSE instruction encodings are).

------------------

Even the most straightforward function with a fixed size doesn't get
optimized at all.  Declaring the parameter as char data [static 4096]
promises that it points to at least 4096 readable bytes, guaranteeing that
it's safe to read the full object instead of having to avoid reading data
that the C source wouldn't.

int allzero_fixedsize(const char data [static 4096])
{
    size_t len = 4096;
    while (len--)
        if (*(data++))
            return 0;
    return 1;
}

gcc 5.3 -O3 -march=haswell:

        lea     rax, [rdi+4096]
        jmp     .L10
.L15:
        cmp     rdi, rax
        je      .L14
.L10:
        add     rdi, 1
        cmp     BYTE PTR [rdi-1], 0
        je      .L15
        xor     eax, eax
        ret
.L14:
        mov     eax, 1

clang doesn't do any better (it's actually worse, including a mov eax, 1 in
the hot loop).
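For reference, here's a hand-written SSE2 sketch of roughly what we'd like
the fixed-size case to turn into (64 bytes per iteration, as in the loop at
the top of this report; the function name and the alignment/size assumptions
are just for the example):

    #include <stddef.h>
    #include <emmintrin.h>

    /* data must be 16B-aligned and len a multiple of 64 for this sketch */
    int allzero_fixedsize_sse2(const char *data, size_t len)
    {
        const __m128i zero = _mm_setzero_si128();
        for (size_t i = 0; i < len; i += 64) {
            __m128i v = _mm_load_si128((const __m128i *)(data + i));
            v = _mm_or_si128(v, _mm_load_si128((const __m128i *)(data + i + 16)));
            v = _mm_or_si128(v, _mm_load_si128((const __m128i *)(data + i + 32)));
            v = _mm_or_si128(v, _mm_load_si128((const __m128i *)(data + i + 48)));
            /* all 16 bytes compared equal to zero  <=>  mask == 0xffff */
            if (_mm_movemask_epi8(_mm_cmpeq_epi8(v, zero)) != 0xffff)
                return 0;
        }
        return 1;
    }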