https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120920
--- Comment #3 from Dusan Stojkovic <dusan.stojko...@rt-rk.com> --- Based on Jeff's review, I updated the patch: https://patchwork.sourceware.org/project/gcc/patch/pr3pr08mb5738402789a50779af3ae0abbe...@pr3pr08mb5738.eurprd08.prod.outlook.com/ >> A possibility for improvement for rv64 could be: >> ``` >> bswap8: >> rev8 t0, a0 >> srli t1, a0, 16 >> srai t0, t0, 48 >> slli t1, t1, 16 >> or a0, t1, t0 >> ret >> ``` >> I suspect you'll need to look at a simplify-rtx simplification of some kind. I ended up implementing a peephole pattern. It turned out that creating a pattern which matches the entire bswap8 sequence here would be quite large. So the resulting assembly ended up being: bswap8: rev8 a5,a0 li a4,-65536 srli a5,a5,48 and a0,a0,a4 or a0,a0,a5 sext.w a0,a0 ret -------------------------------------------------------------------------------- Interestingly, even though 0..7 and 8..15 are swapped, in the bswap pass it is treated as a bswap32: 32 bit bswap implementation found at: _10 = _3 | _6; unsigned int bswap8 (unsigned int n) { unsigned int _1; unsigned int _2; unsigned int _3; unsigned int bswapdst_4; unsigned int _5; unsigned int _6; unsigned int _8; unsigned int _10; unsigned int bswapmaskdst_11; <bb 2> [local count: 1073741824]: _1 = n_7(D) & 4294901760; _2 = n_7(D) >> 8; _3 = _2 & 255; _5 = n_7(D) << 8; _6 = _5 & 65535; bswapdst_4 = __builtin_bswap32 (n_7(D)); bswapmaskdst_11 = bswapdst_4 & 4294901760; _10 = bswapmaskdst_11 r<< 16; _8 = _1 | _10; return _8; } I narrowed it down to the attached diff. The resulting function with the applied patch becomes: bswap8: rev8 a5,a0 li a4,-65536 srli a5,a5,48 and a0,a0,a4 or a0,a5,a0 ret With the GIMPLE in the bswap pass becoming: ... 16 bit bswap implementation found at: _10 = _3 | _6; unsigned int bswap8 (unsigned int val) { unsigned int _1; unsigned int _2; unsigned int _3; short unsigned int bswapsrc_4; unsigned int _5; unsigned int _6; unsigned int _8; unsigned int _10; short unsigned int bswapdst_11; <bb 2> [local count: 1073741824]: _1 = val_7(D) & 4294901760; _2 = val_7(D) >> 8; _3 = _2 & 255; _5 = val_7(D) << 8; _6 = _5 & 65535; bswapsrc_4 = (short unsigned int) val_7(D); bswapdst_11 = bswapsrc_4 r>> 8; _10 = (unsigned int) bswapdst_11; _8 = _1 | _10; return _8; } I'm not sure how to approach this further in the bswap pass, though, since the attached diff produces regressions on x86 (see PR 115102)... I just wanted to document it here for future reference...