https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114449

            Bug ID: 114449
           Summary: bswap64 not optimized
           Product: gcc
           Version: 13.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: middle-end
          Assignee: unassigned at gcc dot gnu.org
          Reporter: pali at kernel dot org
  Target Milestone: ---

https://godbolt.org/z/dc3br9dYT

gcc 13.2 with -O3 does not recognize a straightforward loop implementation
of bswap64. It generates unoptimized code for it.

    uint64_t bswap64_1(uint64_t num) {
        uint64_t ret = 0;
        for (size_t i = 0; i < sizeof(num); i++) {
            ret |= ((num >> (8*(sizeof(num)-1-i))) & 0xff) << (8*i);
        }
        return ret;
    }


Rewriting the code to unroll the loop manually makes gcc produce optimized
code with a single "bswap" instruction on x86-64.

    uint64_t bswap64_2(uint64_t num) {
        uint64_t ret = 0;
        ret |= (((num >> 56) & 0xff) <<  0);
        ret |= (((num >> 48) & 0xff) <<  8);
        ret |= (((num >> 40) & 0xff) << 16);
        ret |= (((num >> 32) & 0xff) << 24);
        ret |= (((num >> 24) & 0xff) << 32);
        ret |= (((num >> 16) & 0xff) << 40);
        ret |= (((num >>  8) & 0xff) << 48);
        ret |= (((num >>  0) & 0xff) << 56);
        return ret;
    }


Adding the -funroll-all-loops option when compiling the first example does not
help; gcc still produces unoptimized code.
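
For convenience, here is a self-contained sketch that puts both forms from
this report into one translation unit, so the generated assembly can be
compared side by side and the two forms can be checked against each other.
The names bswap64_ref and bswap64_forms_agree are hypothetical helpers added
only for this illustration; __builtin_bswap64 is the GCC builtin, used here
as the reference result.

```c
#include <stddef.h>
#include <stdint.h>

/* Loop form from the report (the one reported as not optimized). */
uint64_t bswap64_1(uint64_t num) {
    uint64_t ret = 0;
    for (size_t i = 0; i < sizeof(num); i++) {
        ret |= ((num >> (8 * (sizeof(num) - 1 - i))) & 0xff) << (8 * i);
    }
    return ret;
}

/* Manually unrolled form from the report. */
uint64_t bswap64_2(uint64_t num) {
    uint64_t ret = 0;
    ret |= (((num >> 56) & 0xff) <<  0);
    ret |= (((num >> 48) & 0xff) <<  8);
    ret |= (((num >> 40) & 0xff) << 16);
    ret |= (((num >> 32) & 0xff) << 24);
    ret |= (((num >> 24) & 0xff) << 32);
    ret |= (((num >> 16) & 0xff) << 40);
    ret |= (((num >>  8) & 0xff) << 48);
    ret |= (((num >>  0) & 0xff) << 56);
    return ret;
}

/* Hypothetical reference wrapper around the GCC builtin. */
uint64_t bswap64_ref(uint64_t num) {
    return __builtin_bswap64(num);
}

/* Hypothetical helper: returns 1 if all three forms agree for num. */
int bswap64_forms_agree(uint64_t num) {
    uint64_t expected = bswap64_ref(num);
    return bswap64_1(num) == expected && bswap64_2(num) == expected;
}
```

The functions are deliberately not static, so each one keeps its own symbol
in the assembly output and can be inspected separately.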
