https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84719
Bug ID: 84719
Summary: gcc's __builtin_memcpy performance with certain number of bytes is terrible compared to clang's
Product: gcc
Version: 7.3.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: bootstrap
Assignee: unassigned at gcc dot gnu.org
Reporter: gpnuma at centaurean dot com
Target Milestone: ---

I am filing this bug report as a follow-up to my post here:
https://stackoverflow.com/questions/49098453/

To reproduce: create a file (test.c), compile it (gcc -O3 test.c) and run it (time ./a.out) with this simple code:

    #include <sys/stat.h>
    #include <sys/types.h>
    #include <stdio.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <stdbool.h>
    #include <string.h>

    int main(int argc, char *argv[]) {
        const uint64_t size = 1000000000;
        const size_t alloc_mem = size * sizeof(uint8_t);
        uint8_t *mem = (uint8_t*)malloc(alloc_mem);
        for (uint_fast64_t i = 0; i < size; i++)
            mem[i] = (uint8_t) (i >> 7);

        uint8_t block = 0;
        uint_fast64_t counter = 0;
        uint64_t total = 0x123456789abcdefllu;
        uint64_t receiver = 0;

        for(block = 1; block <= 8; block ++) {
            printf("%u ...\n", block);
            counter = 0;
            while (counter < size - 8) {
                __builtin_memcpy(&receiver, &mem[counter], block);
                receiver &= (0xffffffffffffffffllu >> (64 - ((block) << 3)));
                total += ((receiver * 0x321654987cbafedllu) >> 48);
                counter += block;
            }
        }

        printf("=> %llu\n", total);
        return EXIT_SUCCESS;
    }

The gcc-compiled code runs almost 3x slower than the clang-compiled code.

As a side note, loop unrolling does not seem to be handled well here: forcing an unroll in gcc 8 improves performance, but the result is still no better than clang's. Even with complete manual unrolling, the resulting gcc-compiled code is still 3x slower than clang's.

After further testing, it appears that the problem is caused by specific byte counts passed to __builtin_memcpy. In particular, the performance of __builtin_memcpy(,,3) is very poor.

My platform compiler info:

gcc-7 -v
Using built-in specs.
COLLECT_GCC=gcc-7
COLLECT_LTO_WRAPPER=/usr/local/Cellar/gcc/7.3.0/libexec/gcc/x86_64-apple-darwin17.4.0/7.3.0/lto-wrapper
Target: x86_64-apple-darwin17.4.0
Configured with: ../configure --build=x86_64-apple-darwin17.4.0 --prefix=/usr/local/Cellar/gcc/7.3.0 --libdir=/usr/local/Cellar/gcc/7.3.0/lib/gcc/7 --enable-languages=c,c++,objc,obj-c++,fortran --program-suffix=-7 --with-gmp=/usr/local/opt/gmp --with-mpfr=/usr/local/opt/mpfr --with-mpc=/usr/local/opt/libmpc --with-isl=/usr/local/opt/isl --with-system-zlib --enable-checking=release --with-pkgversion='Homebrew GCC 7.3.0' --with-bugurl=https://github.com/Homebrew/homebrew-core/issues --disable-nls
Thread model: posix
gcc version 7.3.0 (Homebrew GCC 7.3.0)

cc -v
Apple LLVM version 9.0.0 (clang-900.0.39.2)
Target: x86_64-apple-darwin17.4.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bi
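
To make it easier to look at the 3-byte case on its own, here is a minimal sketch that is not part of the original test case. It isolates the __builtin_memcpy(,,3) call next to a hand-written equivalent that uses one 2-byte copy and one 1-byte load. The helper names (load3_memcpy, load3_split, load3_selfcheck) are hypothetical, and the equivalence assumes a little-endian target such as the x86_64 platform above. Whether the split form avoids the slowdown on a given gcc version has not been measured here; the sketch only shows that the two forms compute the same value, so their generated code can be compared.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: the 3-byte copy from the test case, isolated.
   The receiver is zeroed first, so no masking is needed afterwards. */
static inline uint64_t load3_memcpy(const uint8_t *p)
{
    uint64_t receiver = 0;
    __builtin_memcpy(&receiver, p, 3);
    return receiver;
}

/* Hypothetical alternative: one 2-byte copy plus one 1-byte load.
   Assumption: little-endian byte order, so p[2] lands in bits 16..23. */
static inline uint64_t load3_split(const uint8_t *p)
{
    uint16_t lo;
    __builtin_memcpy(&lo, p, 2);
    return (uint64_t)lo | ((uint64_t)p[2] << 16);
}

/* Checks that both forms agree on a small sample buffer. */
static inline void load3_selfcheck(void)
{
    const uint8_t buf[4] = { 0x01, 0x02, 0x03, 0xff };
    assert(load3_memcpy(buf) == load3_split(buf));
}
```

Compiling both helpers with gcc -O3 -S and with clang -O3 -S, then comparing the assembly emitted for each, is one way to check whether the 3-byte copy is expanded differently by the two compilers.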