https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69908
Bug ID: 69908
Summary: recognizing idioms that check for a buffer of all-zeros could
         make *much* better code
Product: gcc
Version: 5.3.0
Status: UNCONFIRMED
Keywords: missed-optimization, ssemmx
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---

Checking a block of memory to see if it's all-zero, or to find the first
non-zero byte, seems to be a not-uncommon problem.  It's really hard to get
gcc to emit code that's even half-way efficient for it.

The most recent stackoverflow question about this (with links to previous
ones) is
http://stackoverflow.com/questions/35450237/fastest-way-to-check-mass-data-if-null-in-c

Summary:

* gcc would benefit a lot from recognizing zero-checking idioms (a suggested
  asm loop for x86 is below).
* one zero-checking function compiles to bad x86 code in multiple ways.
* even a simple byte-at-a-time loop on a fixed-size buffer compiles to
  byte-at-a-time asm.
* gcc is bad at horizontal reductions, esp. with AVX2.

I'm using x86 asm to show whether gcc auto-vectorizes or not, but the
underlying problem is architecture-independent.

---------

Ideally we'd like the main loop in these functions to test 64B at a time (a
whole cache line on all modern x86 microarchitectures), something like:

    ... some intro stuff ...
    pxor      %xmm5, %xmm5
.Lvec_loop:
    movdqa     (%rsi), %xmm0
    por      16(%rsi), %xmm0
    por      32(%rsi), %xmm0
    por      48(%rsi), %xmm0

    # ptest    %xmm0, %xmm0        # SSE4.1
    # jnz      .Lnonzero_found

    pcmpeqb   %xmm5, %xmm0
    pmovmskb  %xmm0, %eax
    cmp       $0xffff, %eax        # check that all the bytes compared equal to zero
    jne       .Lnonzero_found

    add       $64, %rsi
    cmp       pointer, end
    jb        .Lvec_loop
    # Intel: 9 fused-domain uops in the loop:
    # one too many to issue in 2 cycles and saturate 2 loads per cycle.

    # epilogue for the final partial cache line.
    # We can test some bytes again,
    # e.g. using 16B unaligned loads that end at the correct place:
    movdqu  -16(end), %xmm0
    movdqu  -32(end), %xmm1
    por      %xmm1, %xmm0
    movdqu  -48(end), %xmm1
    por      %xmm1, %xmm0
    movdqu  -64(end), %xmm3
    por      %xmm3, %xmm0
    # ptest or pcmpeq / pmovmskb / cmp

We'd have the intro code handle inputs smaller than 64B, so the epilogue
can't access memory from before the start of the buffer.

pcmpeq / pmovmsk / cmp is better than pshufd / por / movq / test, esp. for
32-bit, where another round of horizontal reduction would be needed.

It might be better to use two accumulators to make better use of two load
ports hitting a single cache line, but hopefully the loads will dispatch
mostly in program order, so there won't be many cache-bank conflicts on
SnB/IvB when multiple iterations are in flight at once.  The POR dep chain
is not loop-carried, and out-of-order execution should hide it nicely.

I have no idea how to write C (without intrinsics) that would auto-vectorize
to anything like that, or even to something acceptable.  It would be nice if
there were some kind of idiom that compilers could recognize and make good
code for, without needing custom code for every platform where we want
non-terrible output.
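For what it's worth, here's a sketch of one alternative C spelling of the
check that a compiler could in principle pattern-match.  The function name
is just for illustration and I'm not claiming gcc special-cases this today:
memcmp returns 0 exactly when the buffers are byte-identical, so comparing
against a known-all-zero block is the same test, at the cost of nominally
touching the zero block as well.

    #include <string.h>

    static const unsigned char zeros[4096];    /* zero-initialized */

    /* hypothetical example; fixed 4096-byte input for simplicity */
    int dataisnull_memcmp(const void *data)
    {
        /* equal to the all-zero block  <=>  every byte of data is zero */
        return memcmp(data, zeros, sizeof zeros) == 0;
    }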
----------

The most solid attempt on that SO question ORs together eight size_t
elements in the main loop, then uses a byte cleanup loop.  gcc makes a mess
of it.

Summary of problems with gcc 5.3.0 -O3 -march=haswell for this function
(I can report separate bugs for the separate problems; other than
recognizing zero-checking idioms, most of them could probably be fixed
separately):

* gcc doesn't realize that we're ultimately testing for all-zero, and just
  treats the OR like any other associative operation.
* even a simple byte loop over a fixed-size buffer doesn't get optimized at
  all (different function, see below).
* the main loop isn't auto-vectorized.
* the word-at-a-time and byte-at-a-time cleanup loops are compiled as full
  loops: gcc doesn't realize they're just cleanup that will only ever handle
  less than one vector of data.
* the word-at-a-time cleanup loop gets a bloated fully-unrolled scalar intro
  (which is all that will ever run).
* the auto-vectorized byte cleanup loop unpacks vectors of bytes to longs
  before ORing, with a big chain of vextracti128 / vmovzx.
* without AVX2, gcc fully unrolls the unaligned epilogue of the
  auto-vectorized byte cleanup loop.

Because of the logic of the function, the bad auto-vectorized cleanup-loop
code will never run; only the scalar intro loops will.  Presumably gcc would
generate the nasty pmovzx byte-unpacking code in situations where it would
actually run.  The byte cleanup loop has a byte-at-a-time scalar intro loop
(not unrolled), which is ok I guess.

// See on godbolt: http://goo.gl/DHKUIE
int dataisnull_orig_evenworse(const void *data, size_t length)
{
    /* assuming data was returned by malloc, thus is properly aligned */
    size_t i = 0, n = length / sizeof(size_t);
    const size_t *pw = data;
    const unsigned char *pb = data;

#define UNROLL_FACTOR  8
    size_t n1 = n - n % UNROLL_FACTOR;
    for (; i < n1; i += UNROLL_FACTOR) {
        size_t val = pw[i + 0] | pw[i + 1] | pw[i + 2] | pw[i + 3]
                   | pw[i + 4] | pw[i + 5] | pw[i + 6] | pw[i + 7];
        if (val)
            return 0;
    }

    size_t val = 0;
    // gcc fully unrolls this cleanup loop
    for (; i < n; i++) {
        val |= pw[i];
    }

    i = n * sizeof(size_t);
    // Without AVX2, gcc fully unrolls the unaligned epilogue of the
    // auto-vectorization of this byte cleanup loop
    for (; i < length; i++) {
        val |= pb[i];
    }
    return val == 0;
}

The main loop doesn't get auto-vectorized at all, so it's just a sequence of
straightforward scalar ORs and then a conditional branch.  The cleanup code
is massively bloated at -O3, though.  The byte cleanup loop gets
auto-vectorized, and gcc doesn't figure out that it can only ever process at
most 7 bytes.  Worst of all, it unpacks bytes to longs before ORing, with a
chain of many vextracti128, 2x vpmovzxbw, 4x vpmovzxwd, and 8x vpmovzxdq.

-----

Even the sequence to horizontally OR a single vector down to a long is
clunky:

    vpxor       %xmm1, %xmm1, %xmm1
    vperm2i128  $33, %ymm1, %ymm0, %ymm2
    vpor        %ymm2, %ymm0, %ymm0
    vperm2i128  $33, %ymm1, %ymm0, %ymm1
    vpalignr    $8, %ymm0, %ymm1, %ymm1
    vpor        %ymm1, %ymm0, %ymm0
    vmovq       %xmm0, %rdx

With only -mavx, it only uses 16B vectors, and does the horizontal OR with:

    vpsrldq  $8, %xmm2, %xmm0
    vpor     %xmm0, %xmm2, %xmm2
    addq     %r12, %rcx
    vmovq    %xmm2, %rdx

The optimal AVX2 sequence would use vextracti128, not vperm2i128.  Shuffling
with a zeroed vector is perverse, and 256b lane-crossing shuffles are slow.
Reducing to a 128b vector ASAP is better for power and latency, especially
on a hypothetical CPU that implements AVX2 the way the Bulldozer family does
AVX (with 128b execution units).

    # suggested optimal horizontal-OR sequence
    vextracti128  $1, %ymm0, %xmm1
    vpor          %xmm1, %xmm0, %xmm0
    vpshufd       $0b01001110, %xmm0, %xmm1   # swap upper/lower halves
    vpor          %xmm1, %xmm0, %xmm0
    vmovq         %xmm0, %rdx
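Written with intrinsics, that reduction would look something like the sketch
below (requires -mavx2 and a 64-bit target; this just shows the intended
shape, not code gcc produces today):

    #include <stdint.h>
    #include <immintrin.h>

    /* horizontal OR of a 256b vector down to a uint64_t:
       narrow to 128b first, then swap the 64b halves with a shuffle */
    static inline uint64_t hor_or_epi64(__m256i v)
    {
        __m128i lo = _mm256_castsi256_si128(v);
        __m128i hi = _mm256_extracti128_si256(v, 1);   /* vextracti128 */
        __m128i x  = _mm_or_si128(lo, hi);             /* vpor, 128b from here on */
        __m128i sw = _mm_shuffle_epi32(x, 0x4e);       /* vpshufd: swap 64b halves */
        x = _mm_or_si128(x, sw);                       /* vpor */
        return (uint64_t)_mm_cvtsi128_si64(x);         /* vmovq */
    }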
Using pshufd works great in the non-AVX case, saving a movdqa because it
isn't an in-place (destructive) shuffle the way psrldq and unpckhqdq are.
pshufd is also slightly cheaper than psrldq on Pentium M, but apparently not
on first-gen Core2 (according to Agner Fog's tables).  It's a tiny
difference: avoiding the movdqa with pshufd is the way to go for the benefit
of all the other CPUs in builds without -mavx.

With -mtune=core2, movdqa / unpckhqdq would be optimal, because shuffles
with 64b elements are very cheap on Merom.

movhlps %xmm0, %xmm1 would be good, except for a potential bypass delay
(int->float and back) on some CPUs (probably Nehalem).  We have a scratch
register (xmm1) that's part of the same dep chain, so it has to be ready
already anyway.

palignr $8, %xmm0, %xmm1 has possibilities (since we have a scratch reg we
don't mind a data dependency on): it's slightly faster than pshufd on
first-gen Core2 (Merom/Conroe), but the same everywhere else.  It has a
longer instruction encoding than pshufd, though, so it's actually worse than
pshufd everywhere else (and much harder to use, because we need to identify
a scratch reg that we already have a data dependency on).

What gcc actually does with -march=core2 is movdqa / psrldq / por down to
one byte, then a movaps store, then a movzbl load into an integer register.
Obviously movd / movzbl %dl, %edx would be better, or gcc could recognize
the zero-checking and not bother zeroing the upper bytes at all: if the low
byte is zero, the upper bytes will be too.

And if we're optimizing for first-gen Core2 (not 45nm Penryn, which has a
full 128b shuffle unit), then palignr or movdqa / unpckhqdq would be
optimal.  Or movq to an integer register and finish with mov/shr/or in
integer registers: those instructions are shorter, so you get less of a
decode bottleneck (definitely an issue for CPUs without a uop cache, given
how long SSE instruction encodings are).

------------------

Even the most straightforward function with a fixed size doesn't get
optimized at all.  Declaring the parameter as char data [static 4096]
promises that it points to at least 4096 readable bytes, guaranteeing that
it's safe to read the full object instead of having to avoid reading data
that the C source wouldn't.

int allzero_fixedsize(const char data [static 4096])
{
    size_t len = 4096;
    while (len--)
        if (*(data++))
            return 0;
    return 1;
}

gcc 5.3 -O3 -march=haswell:

        lea     rax, [rdi+4096]
        jmp     .L10
.L15:
        cmp     rdi, rax
        je      .L14
.L10:
        add     rdi, 1
        cmp     BYTE PTR [rdi-1], 0
        je      .L15
        xor     eax, eax
        ret
.L14:
        mov     eax, 1

clang doesn't do any better (it's actually worse, including a mov eax, 1 in
the hot loop).
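For reference, here's a hand-written SSE2 sketch of roughly what we'd like
the fixed-size case to turn into (64 bytes per iteration, as in the loop at
the top of this report; the function name and the alignment/size assumptions
are just for the example):

    #include <stddef.h>
    #include <emmintrin.h>

    /* data must be 16B-aligned and len a multiple of 64 for this sketch */
    int allzero_fixedsize_sse2(const char *data, size_t len)
    {
        const __m128i zero = _mm_setzero_si128();
        for (size_t i = 0; i < len; i += 64) {
            __m128i v = _mm_load_si128((const __m128i *)(data + i));
            v = _mm_or_si128(v, _mm_load_si128((const __m128i *)(data + i + 16)));
            v = _mm_or_si128(v, _mm_load_si128((const __m128i *)(data + i + 32)));
            v = _mm_or_si128(v, _mm_load_si128((const __m128i *)(data + i + 48)));
            /* all 16 bytes compared equal to zero  <=>  mask == 0xffff */
            if (_mm_movemask_epi8(_mm_cmpeq_epi8(v, zero)) != 0xffff)
                return 0;
        }
        return 1;
    }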