[Bug middle-end/104151] [9/10/11/12 Regression] x86: excessive code generated for 128-bit byteswap
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104151

--- Comment #12 from Uroš Bizjak ---
(In reply to Uroš Bizjak from comment #10)
> (In reply to Hongtao.liu from comment #4)
> > Also there's a separate issue, codegen for the below is not optimal
> > gimple:
> > _11 = VIEW_CONVERT_EXPR(a_3(D))
> > asm:
> >         mov     QWORD PTR [rsp-24], rdi
> >         mov     QWORD PTR [rsp-16], rsi
> >         movdqa  xmm0, XMMWORD PTR [rsp-24]
> >
> > I think this issue has been recorded in several existing PRs.
>
> Maybe this can be solved with secondary_reload when GPR and XMM regs are
> involved.

PR104306 is an experiment.
[Bug middle-end/104151] [9/10/11/12 Regression] x86: excessive code generated for 128-bit byteswap
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104151

Jakub Jelinek changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jakub at gcc dot gnu.org

--- Comment #11 from Jakub Jelinek ---
With -O3 it regresses with r7-2009-g8d4fc2d3d0c8f87bb3e182be1a618a511f8f9465

__uint128_t bswap(__uint128_t a) { return __builtin_bswap128 (a); }

emits the optimal code, but __builtin_bswap128 is only available in GCC 11.1
and later.

One fix for this might be to handle
  _8 = BIT_FIELD_REF ;
  _1 = __builtin_bswap64 (_8);
  y[0] = _1;
  _10 = BIT_FIELD_REF ;
  _2 = __builtin_bswap64 (_10);
  y[1] = _2;
  _7 = MEM [(char * {ref-all})];
in bswap or store merging.  Though I'm afraid the current bswap infrastructure
limits it to 64-bit size, because it tracks the bytes in uint64_t vars and
uses 8 bits to determine which byte it is (0 for a known-zero byte, 1-8 for a
byte index, and 0xff for unknown).  While that is 10 different values right
now, if we handled uint128_t we'd need 18 different values times 16 byte
slots.

Note, even:

unsigned long long bswap (unsigned long long a)
{
  unsigned int x[2];
  __builtin_memcpy (x, &a, 8);
  unsigned int y[2];
  y[0] = __builtin_bswap32 (x[1]);
  y[1] = __builtin_bswap32 (x[0]);
  __builtin_memcpy (&a, y, 8);
  return a;
}

unsigned long long bswap2 (unsigned long long a)
{
  return __builtin_bswap64 (a);
}

emits better code for the latter function than for the former - store-merging
isn't able to handle even that.  So if we want to handle it in store-merging,
we should start with handling
  _8 = BIT_FIELD_REF ;
  _1 = __builtin_bswap32 (_8);
  _10 = (unsigned int) a_3(D);
  _2 = __builtin_bswap32 (_10);
  _11 = {_1, _2};
  _5 = VIEW_CONVERT_EXPR(_11);
and
  _8 = BIT_FIELD_REF ;
  _1 = __builtin_bswap32 (_8);
  y[0] = _1;
  _10 = (unsigned int) a_3(D);
  _2 = __builtin_bswap32 (_10);
  y[1] = _2;
  _7 = MEM [(char * {ref-all})];
and only once that is handled try
  _8 = BIT_FIELD_REF ;
  _1 = __builtin_bswap64 (_8);
  _10 = (long long unsigned int) a_3(D);
  _2 = __builtin_bswap64 (_10);
  _11 = {_1, _2};
  _5 = VIEW_CONVERT_EXPR(_11);

Doesn't look like stage4 material though, so in the meantime perhaps some
other improvements.
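[Editor's note: to illustrate the 64-bit limitation Jakub describes (bytes
tracked in a uint64_t, 8 bits of state per tracked byte), here is a minimal
C sketch of that scheme, written for illustration only; it is not GCC's
actual bswap/store-merging implementation.]

/* Sketch (not GCC code): the bswap pass models the value being rebuilt as a
   64-bit "symbolic number", one byte of state per tracked byte.  Marker 1..8
   means "byte i of the source", 0 means known zero, 0xff means unknown.
   Because the whole map lives in a uint64_t, at most 8 bytes (64 bits) can
   be tracked; a 128-bit value would need 16 slots and 18 distinct markers.  */
#include <stdint.h>
#include <stdio.h>

#define MARKER_UNKNOWN 0xff

static uint64_t
symbolic_identity (void)
{
  uint64_t n = 0;
  for (int i = 8; i >= 1; i--)
    n = (n << 8) | (uint64_t) i;   /* 0x0807060504030201 */
  return n;
}

int
main (void)
{
  uint64_t n = symbolic_identity ();
  /* A __builtin_bswap64 of the source simply reverses the byte slots; if
     the final map matches this reversed pattern, one bswap can be emitted.  */
  uint64_t swapped = __builtin_bswap64 (n);
  printf ("%016llx -> %016llx\n",
          (unsigned long long) n, (unsigned long long) swapped);
  return 0;
}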
[Bug middle-end/104151] [9/10/11/12 Regression] x86: excessive code generated for 128-bit byteswap
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104151

--- Comment #10 from Uroš Bizjak ---
(In reply to Hongtao.liu from comment #4)
> Also there's a separate issue, codegen for the below is not optimal
> gimple:
> _11 = VIEW_CONVERT_EXPR(a_3(D))
> asm:
>         mov     QWORD PTR [rsp-24], rdi
>         mov     QWORD PTR [rsp-16], rsi
>         movdqa  xmm0, XMMWORD PTR [rsp-24]
>
> I think this issue has been recorded in several existing PRs.

Maybe this can be solved with secondary_reload when GPR and XMM regs are
involved.
[Bug middle-end/104151] [9/10/11/12 Regression] x86: excessive code generated for 128-bit byteswap
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104151

--- Comment #9 from Richard Biener ---
(In reply to Richard Biener from comment #8)
> (In reply to rsandifo@gcc.gnu.org from comment #7)
> > (In reply to Richard Biener from comment #6)
> > > Richard - I'm sure we can construct a similar case for aarch64 where
> > > argument passing and vector mode use cause spilling?
> > >
> > > On x86 the simplest testcase showing this is
> > >
> > > typedef unsigned long long v2di __attribute__((vector_size(16)));
> > > v2di bswap(__uint128_t a)
> > > {
> > >   return *(v2di *)&a;
> > > }
> > >
> > > that produces
> > >
> > > bswap:
> > > .LFB0:
> > >         .cfi_startproc
> > >         sub     sp, sp, #16
> > >         .cfi_def_cfa_offset 16
> > >         stp     x0, x1, [sp]
> > >         ldr     q0, [sp]
> > >         add     sp, sp, 16
> > >         .cfi_def_cfa_offset 0
> > >         ret
> > >
> > > on arm for me.  Maybe the stp x0, x1 store can forward to the ldr load
> > > though and I'm not sure there's another way to move x0/x1 to q0.
> > It looks like this is a deliberate choice for aarch64.  The generic
> > costing has:
> >
> >   /* Avoid the use of slow int<->fp moves for spilling by setting
> >      their cost higher than memmov_cost.  */
> >   5, /* GP2FP */
> >
> > So in cases like the above, we're telling IRA that spilling to
> > memory and reloading is cheaper than moving between registers.
> > For -mtune=thunderx we generate:
> >
> >         fmov    d0, x0
> >         ins     v0.d[1], x1
> >         ret
> >
> > instead.
>
> Ah, interesting.  On x86 we disallow/pessimize GPR<->XMM moves with
> some tunings as well, still there a sequence like
>
>         movq    %rdi, -24(%rsp)
>         movq    %rsi, -16(%rsp)
>         movq    -24(%rsp), %xmm0
>         movq    -16(%rsp), %xmm1
>         unpckhpd %xmm0, %xmm1  (fixme - that's wrong, but you get the idea)
>
> instead of
>
>         movq    %rdi, -24(%rsp)
>         movq    %rsi, -16(%rsp)
>         movdqa  -24(%rsp), %xmm0
>
> would likely be faster.  Not sure if one can get LRA to produce this
> two-staged reload with just appropriate costing though.  As said, the key
> cost of the bad sequence is the failing store forwarding, so it's special
> to spilling a two-GPR TImode and reloading it as a single FPR V2DImode.

And a speciality of aarch64 seems to be that it has the argument passed in
(reg:TI x0), which supposedly is a register pair.  On x86 there are no TImode
register-pair hard registers I think; instead the __int128 is passed as two
8-byte values in regular GPRs.  So on aarch64 we have the simpler

(insn 13 3 10 2 (set (reg:TI 95)
        (reg:TI 0 x0 [ a ])) "t.ii":3:2 58 {*movti_aarch64}
     (expr_list:REG_DEAD (reg:TI 0 x0 [ a ])
        (nil)))
(insn 10 13 11 2 (set (reg/i:V2DI 32 v0)
        (subreg:V2DI (reg:TI 95) 0)) "t.ii":5:2 1173 {*aarch64_simd_movv2di}
     (expr_list:REG_DEAD (reg:TI 95)
        (nil)))

before RA.
[Bug middle-end/104151] [9/10/11/12 Regression] x86: excessive code generated for 128-bit byteswap
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104151

--- Comment #8 from Richard Biener ---
(In reply to rsandifo@gcc.gnu.org from comment #7)
> (In reply to Richard Biener from comment #6)
> > Richard - I'm sure we can construct a similar case for aarch64 where
> > argument passing and vector mode use cause spilling?
> >
> > On x86 the simplest testcase showing this is
> >
> > typedef unsigned long long v2di __attribute__((vector_size(16)));
> > v2di bswap(__uint128_t a)
> > {
> >   return *(v2di *)&a;
> > }
> >
> > that produces
> >
> > bswap:
> > .LFB0:
> >         .cfi_startproc
> >         sub     sp, sp, #16
> >         .cfi_def_cfa_offset 16
> >         stp     x0, x1, [sp]
> >         ldr     q0, [sp]
> >         add     sp, sp, 16
> >         .cfi_def_cfa_offset 0
> >         ret
> >
> > on arm for me.  Maybe the stp x0, x1 store can forward to the ldr load
> > though and I'm not sure there's another way to move x0/x1 to q0.
> It looks like this is a deliberate choice for aarch64.  The generic
> costing has:
>
>   /* Avoid the use of slow int<->fp moves for spilling by setting
>      their cost higher than memmov_cost.  */
>   5, /* GP2FP */
>
> So in cases like the above, we're telling IRA that spilling to
> memory and reloading is cheaper than moving between registers.
> For -mtune=thunderx we generate:
>
>         fmov    d0, x0
>         ins     v0.d[1], x1
>         ret
>
> instead.

Ah, interesting.  On x86 we disallow/pessimize GPR<->XMM moves with
some tunings as well, still there a sequence like

        movq    %rdi, -24(%rsp)
        movq    %rsi, -16(%rsp)
        movq    -24(%rsp), %xmm0
        movq    -16(%rsp), %xmm1
        unpckhpd %xmm0, %xmm1  (fixme - that's wrong, but you get the idea)

instead of

        movq    %rdi, -24(%rsp)
        movq    %rsi, -16(%rsp)
        movdqa  -24(%rsp), %xmm0

would likely be faster.  Not sure if one can get LRA to produce this
two-staged reload with just appropriate costing though.  As said, the key
cost of the bad sequence is the failing store forwarding, so it's special
to spilling a two-GPR TImode and reloading it as a single FPR V2DImode.
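[Editor's illustration: at the source level, the register-to-register
alternative Richard sketches above (two GPR->XMM moves plus an unpack, no
stack round trip and therefore no store-forwarding stall) corresponds roughly
to the SSE2 intrinsics below.  This is only a sketch of the desired
instruction pattern, not a proposed GCC change; the function name is made up.]

#include <emmintrin.h>
#include <stdint.h>

/* Illustrative only: combine two 64-bit GPR values into one XMM register
   with direct moves instead of two 8-byte stores and a 16-byte reload.  */
static inline __m128i
combine_u64 (uint64_t lo, uint64_t hi)
{
  __m128i vlo = _mm_cvtsi64_si128 ((long long) lo);  /* movq GPR -> xmm */
  __m128i vhi = _mm_cvtsi64_si128 ((long long) hi);  /* movq GPR -> xmm */
  return _mm_unpacklo_epi64 (vlo, vhi);              /* punpcklqdq      */
}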
[Bug middle-end/104151] [9/10/11/12 Regression] x86: excessive code generated for 128-bit byteswap
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104151

--- Comment #7 from rsandifo at gcc dot gnu.org ---
(In reply to Richard Biener from comment #6)
> Richard - I'm sure we can construct a similar case for aarch64 where
> argument passing and vector mode use cause spilling?
>
> On x86 the simplest testcase showing this is
>
> typedef unsigned long long v2di __attribute__((vector_size(16)));
> v2di bswap(__uint128_t a)
> {
>   return *(v2di *)&a;
> }
>
> that produces
>
> bswap:
> .LFB0:
>         .cfi_startproc
>         sub     sp, sp, #16
>         .cfi_def_cfa_offset 16
>         stp     x0, x1, [sp]
>         ldr     q0, [sp]
>         add     sp, sp, 16
>         .cfi_def_cfa_offset 0
>         ret
>
> on arm for me.  Maybe the stp x0, x1 store can forward to the ldr load
> though and I'm not sure there's another way to move x0/x1 to q0.
It looks like this is a deliberate choice for aarch64.  The generic
costing has:

  /* Avoid the use of slow int<->fp moves for spilling by setting
     their cost higher than memmov_cost.  */
  5, /* GP2FP */

So in cases like the above, we're telling IRA that spilling to
memory and reloading is cheaper than moving between registers.
For -mtune=thunderx we generate:

        fmov    d0, x0
        ins     v0.d[1], x1
        ret

instead.
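[Editor's illustration: the thunderx-style sequence above ("fmov d0, x0" then
"ins v0.d[1], x1", with no stack traffic) is roughly what the following NEON
intrinsics express at the source level.  This is a sketch for readers, not
code from the PR; the function name is made up.]

#include <arm_neon.h>
#include <stdint.h>

/* Illustrative only: build a 128-bit vector register directly from the two
   GPR halves, matching the fmov + ins sequence shown above.  */
static inline uint64x2_t
combine_u64_neon (uint64_t x0, uint64_t x1)
{
  return vcombine_u64 (vcreate_u64 (x0), vcreate_u64 (x1));
}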
[Bug middle-end/104151] [9/10/11/12 Regression] x86: excessive code generated for 128-bit byteswap
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104151

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rsandifo at gcc dot gnu.org,
                   |                            |vmakarov at gcc dot gnu.org
           Priority|P3                          |P2
             Target|                            |x86_64-*-*

--- Comment #6 from Richard Biener ---
With just SSE2 we get only the store vectorized:

bswap:
.LFB0:
        .cfi_startproc
        bswap   %rsi
        bswap   %rdi
        movq    %rsi, %xmm0
        movq    %rdi, %xmm1
        punpcklqdq      %xmm1, %xmm0
        movaps  %xmm0, -24(%rsp)
        movq    -24(%rsp), %rax
        movq    -16(%rsp), %rdx
        ret

the

  1 times vec_perm costs 4 in body
  BIT_FIELD_REF  1 times scalar_stmt costs 4 in body
  BIT_FIELD_REF  1 times scalar_stmt costs 4 in body

costs are what we cost for building the initial vector from the __int128
compared to splitting that into a low/high part.

  [local count: 1073741824]:
  _8 = BIT_FIELD_REF ;
  _11 = VIEW_CONVERT_EXPR(a_3(D));
  _13 = VIEW_CONVERT_EXPR(a_3(D));
  _12 = VEC_PERM_EXPR <_11, _13, { 1, 0 }>;
  _14 = VIEW_CONVERT_EXPR(_12);
  _15 = VEC_PERM_EXPR <_14, _14, { 7, 6, 5, 4, 3, 2, 1, 0, 15, 14, 13, 12, 11, 10, 9, 8 }>;
  _16 = VIEW_CONVERT_EXPR(_15);
  _1 = __builtin_bswap64 (_8);
  _10 = BIT_FIELD_REF ;
  _2 = __builtin_bswap64 (_10);
  MEM [(long long unsigned int *)] = _16;
  _7 = MEM [(char * {ref-all})];

This doesn't realize that it could maybe move the hi/lo swap across the two
permutes to before the store; otherwise it looks as expected.  Yes, the
vectorizer doesn't account for ABI details on the function boundary, but it's
very hard to do that in a sensible way.

Practically the worst part of the generated code is

        movq    %rdi, -24(%rsp)
        movq    %rsi, -16(%rsp)
        movdqa  -24(%rsp), %xmm0

because the store will fail to forward, causing a huge performance issue.
I wonder why we fail to merge those.  We face

(insn 26 25 27 2 (set (subreg:DI (reg/v:TI 87 [ a ]) 0)
        (reg:DI 97)) "t.c":2:1 80 {*movdi_internal}
     (expr_list:REG_DEAD (reg:DI 97)
        (nil)))
(insn 27 26 7 2 (set (subreg:DI (reg/v:TI 87 [ a ]) 8)
        (reg:DI 98)) "t.c":2:1 80 {*movdi_internal}
     (expr_list:REG_DEAD (reg:DI 98)
        (nil)))
(note 7 27 12 2 NOTE_INSN_FUNCTION_BEG)
(insn 12 7 14 2 (set (reg:V2DI 91)
        (vec_select:V2DI (subreg:V2DI (reg/v:TI 87 [ a ]) 0)
            (parallel [
                    (const_int 1 [0x1])
                    (const_int 0 [0])
                ]))) "t.c":6:12 7927 {*ssse3_palignrv2di_perm}
     (expr_list:REG_DEAD (reg/v:TI 87 [ a ])
        (nil)))

where

Trying 26, 27 -> 12:
   26: r87:TI#0=r97:DI
      REG_DEAD r97:DI
   27: r87:TI#8=r98:DI
      REG_DEAD r98:DI
   12: r91:V2DI=vec_select(r87:TI#0,parallel)
      REG_DEAD r87:TI
Can't combine i2 into i3

possibly because 27 is a partial def of r87.

We expand to

(insn 4 3 5 2 (set (reg:TI 88)
        (subreg:TI (reg:DI 89) 0)) "t.c":2:1 -1
     (nil))
(insn 5 4 6 2 (set (subreg:DI (reg:TI 88) 8)
        (reg:DI 90)) "t.c":2:1 -1
     (nil))
(insn 6 5 7 2 (set (reg/v:TI 87 [ a ])
        (reg:TI 88)) "t.c":2:1 -1
     (nil))
(note 7 6 10 2 NOTE_INSN_FUNCTION_BEG)
(insn 10 7 12 2 (set (reg:V2DI 82 [ _11 ])
        (subreg:V2DI (reg/v:TI 87 [ a ]) 0)) -1
     (nil))
(insn 12 10 13 2 (set (reg:V2DI 91)
        (vec_select:V2DI (reg:V2DI 82 [ _11 ])
            (parallel [
                    (const_int 1 [0x1])
                    (const_int 0 [0])
                ]))) "t.c":6:12 -1
     (nil))

initially from

  _11 = VIEW_CONVERT_EXPR(a_3(D));

and fwprop1 still sees

(insn 26 25 27 2 (set (subreg:DI (reg/v:TI 87 [ a ]) 0)
        (reg:DI 89 [ a ])) "t.c":2:1 80 {*movdi_internal}
     (expr_list:REG_DEAD (reg:DI 95 [ a ])
        (nil)))
(insn 27 26 7 2 (set (subreg:DI (reg/v:TI 87 [ a ]) 8)
        (reg:DI 90 [ a+8 ])) "t.c":2:1 80 {*movdi_internal}
     (expr_list:REG_DEAD (reg:DI 96 [+8 ])
        (nil)))
(note 7 27 10 2 NOTE_INSN_FUNCTION_BEG)
(insn 10 7 12 2 (set (reg:V2DI 82 [ _11 ])
        (subreg:V2DI (reg/v:TI 87 [ a ]) 0)) 1700 {movv2di_internal}
     (expr_list:REG_DEAD (reg/v:TI 87 [ a ])
        (nil)))

so that would be the best place to fix this up, realizing reg 87 dies after
insn 10.

Richard - I'm sure we can construct a similar case for aarch64 where
argument passing and vector mode use cause spilling?

On x86 the simplest testcase showing this is

typedef unsigned long long v2di __attribute__((vector_size(16)));
v2di bswap(__uint128_t a)
{
  return *(v2di *)&a;
}

that produces

bswap:
.LFB0:
        .cfi_startproc
        sub     sp, sp, #16
        .cfi_def_cfa_offset 16
        stp     x0, x1, [sp]
        ldr     q0, [sp]
        add     sp, sp, 16
        .cfi_def_cfa_offset 0
        ret

on arm for me.  Maybe the stp x0, x1 store can forward to the ldr load though
and I'm not sure there's another way to move x0/x1 to q0.
[Bug middle-end/104151] [9/10/11/12 Regression] x86: excessive code generated for 128-bit byteswap
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104151

--- Comment #5 from Hongtao.liu ---
> 1. According to ABI, uint128 is passed by 2 gpr, and there should be extra
> cost for _11 = VIEW_CONVERT_EXPR(a_3(D));

And no scalar_cost is needed for

  BIT_FIELD_REF  1 times scalar_stmt costs 4 in body
  BIT_FIELD_REF  1 times scalar_stmt costs 4 in body
[Bug middle-end/104151] [9/10/11/12 Regression] x86: excessive code generated for 128-bit byteswap
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104151

--- Comment #4 from Hongtao.liu ---
Also there's a separate issue, codegen for the below is not optimal

gimple:
  _11 = VIEW_CONVERT_EXPR(a_3(D))

asm:
        mov     QWORD PTR [rsp-24], rdi
        mov     QWORD PTR [rsp-16], rsi
        movdqa  xmm0, XMMWORD PTR [rsp-24]

I think this issue has been recorded in several existing PRs.
[Bug middle-end/104151] [9/10/11/12 Regression] x86: excessive code generated for 128-bit byteswap
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104151

--- Comment #3 from Hongtao.liu ---
172  _1 1 times scalar_store costs 12 in body
173  _2 1 times scalar_store costs 12 in body
174  __builtin_bswap64 (_8) 1 times scalar_stmt costs 4 in body
175  __builtin_bswap64 (_10) 1 times scalar_stmt costs 4 in body
176  BIT_FIELD_REF  1 times scalar_stmt costs 4 in body
177  BIT_FIELD_REF  1 times scalar_stmt costs 4 in body
178  1 1 times vec_perm costs 4 in body
179  __builtin_bswap64 (_8) 1 times vector_stmt costs 4 in prologue
180  __builtin_bswap64 (_8) 1 times vec_perm costs 4 in body
181  _1 1 times vector_store costs 16 in body
182  test.cc:9:10: note: Cost model analysis for part in loop 0:
183    Vector cost: 28
184    Scalar cost: 40
...
243    [local count: 1073741824]:
244    _8 = BIT_FIELD_REF ;
245    _11 = VIEW_CONVERT_EXPR(a_3(D));
246    _13 = VIEW_CONVERT_EXPR(a_3(D));
247    _12 = VEC_PERM_EXPR <_11, _13, { 1, 0 }>;
248    _14 = VIEW_CONVERT_EXPR(_12);
249    _15 = VEC_PERM_EXPR <_14, _14, { 7, 6, 5, 4, 3, 2, 1, 0, 15, 14, 13, 12, 11, 10, 9, 8 }>;
250    _16 = VIEW_CONVERT_EXPR(_15);
251    _1 = __builtin_bswap64 (_8);
252    _10 = BIT_FIELD_REF ;
253    _2 = __builtin_bswap64 (_10);
254    MEM [(long unsigned int *)] = _16;
255    _7 = MEM [(char * {ref-all})];

1. According to ABI, uint128 is passed by 2 gpr, and there should be extra
cost for _11 = VIEW_CONVERT_EXPR(a_3(D));

2. Why is there "178  1 1 times vec_perm costs 4 in body"?  Should it be 2
times vec_perm costs?
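[Editor's note: a source-level testcase along the following lines plausibly
produces the BIT_FIELD_REF / __builtin_bswap64 / VEC_PERM_EXPR gimple quoted
above.  The PR's actual testcase is not reproduced in this excerpt, so treat
the exact shape as an assumption.]

#include <string.h>

/* Assumed reproducer: byteswap a 128-bit value via two 64-bit halves.
   With -O2, the GCC 12 SLP vectorizer turns this into the vectorized
   store shown in the cost dump above.  */
__uint128_t bswap128 (__uint128_t a)
{
  unsigned long long x[2], y[2];
  memcpy (x, &a, 16);
  y[0] = __builtin_bswap64 (x[1]);
  y[1] = __builtin_bswap64 (x[0]);
  memcpy (&a, y, 16);
  return a;
}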
[Bug middle-end/104151] [9/10/11/12 Regression] x86: excessive code generated for 128-bit byteswap
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104151

--- Comment #2 from Hongtao.liu ---
With -fno-tree-vectorize, gcc also produces the optimal code:

        mov     rax, rsi
        mov     rdx, rdi
        bswap   rax
        bswap   rdx
        ret

Guess it's related to the vectorizer cost model.
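[Editor's note: the Intel-syntax output shown in this thread corresponds to
compiling with something like "gcc -O2 -fno-tree-vectorize -masm=intel -S";
the exact flags used by the reporter are assumed from context, not stated in
the PR.]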
[Bug middle-end/104151] [9/10/11/12 Regression] x86: excessive code generated for 128-bit byteswap
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104151

Andrew Pinski changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|                            |2022-01-20
             Status|UNCONFIRMED                 |NEW
            Summary|x86: excessive code         |[9/10/11/12 Regression]
                   |generated for 128-bit       |x86: excessive code
                   |byteswap                    |generated for 128-bit
                   |                            |byteswap
     Ever confirmed|0                           |1
             Blocks|                            |101926
   Target Milestone|---                         |12.0
      Known to work|                            |6.1.0
          Component|target                      |middle-end
           Keywords|                            |missed-optimization

--- Comment #1 from Andrew Pinski ---
GCC 11 and before does this at -O2 (GCC 6 and before could do it for -O3 too):

        mov     rax, rsi
        mov     rdx, rdi
        bswap   rax
        bswap   rdx

The reason is that the SLP vectorizer is turned on for -O2 in GCC 12.  We get:

  _11 = {_1, _2};
  _5 = VIEW_CONVERT_EXPR(_11);

The expansion of this could be done using move instructions.

I notice for aarch64, SLP kicks in even more and does the following:

        fmov    d0, x0
        fmov    v0.d[1], x1
        ext     v0.16b, v0.16b, v0.16b, #8
        rev64   v0.16b, v0.16b
        umov    x0, v0.d[0]
        umov    x1, v0.d[1]

This is even true for -O2 -mavx too:

        mov     QWORD PTR [rsp-24], rdi
        mov     QWORD PTR [rsp-16], rsi
        vmovdqa xmm1, XMMWORD PTR [rsp-24]
        vpalignr        xmm0, xmm1, xmm1, 8
        vpshufb xmm2, xmm0, XMMWORD PTR .LC0[rip]
        vmovdqa XMMWORD PTR [rsp-24], xmm2
        mov     rax, QWORD PTR [rsp-24]
        mov     rdx, QWORD PTR [rsp-16]

There are so many different little regressions when handling this code, it
seems.  But I think it all comes down to the modeling of arguments and return
values on the gimple level, which breaks this.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101926
[Bug 101926] [meta-bug] struct/complex argument passing and return should be
improved
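[Editor's illustration of the gimple "_11 = {_1, _2}; _5 =
VIEW_CONVERT_EXPR(_11);" quoted in comment #1: at the source level it
corresponds roughly to building a two-element vector from the swapped halves
and reinterpreting its bytes as __uint128_t, as in the sketch below.  This is
illustrative only; the names are made up.]

typedef unsigned long long v2di __attribute__((vector_size(16)));

/* Illustrative only: construct the vector {_1, _2} and view-convert it to
   __uint128_t, which is what the SLP-generated gimple expresses.  */
static inline __uint128_t
join_halves (unsigned long long lo, unsigned long long hi)
{
  v2di v = { lo, hi };                       /* _11 = {_1, _2};          */
  union { v2di v; __uint128_t u; } u;        /* VIEW_CONVERT_EXPR        */
  u.v = v;
  return u.u;
}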