https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109093
Jakub Jelinek <jakub at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |hjl.tools at gmail dot com
--- Comment #5 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
My patch just caused far more .DEFERRED_INITs to be optimized away for dead
variables (though, as the #c0 testcase shows, apparently not all of them).
What I see on the #c0 testcase looks like an x86 backend bug to me.
In func_2.constprop.0.isra.0 the optimized dump has the
uint64_t * * * * const * l_2254[6];
variable and the IL mentions it just in
l_2254 = .DEFERRED_INIT (48, 2, &"l_2254"[0]);
and
l_2254 ={v} {CLOBBER(eol)};
(the latter in 2 spots) statements. Why the .DEFERRED_INIT hasn't been DSEd is
certainly a question.
Anyway, l_2254 has 128-bit alignment (presumably due to ix86_local_alignment
and psABI requirements).
Expansion expands that .DEFERRED_INIT into:
(insn 23 22 24 5 (parallel [
(set (reg:DI 162)
(plus:DI (reg/f:DI 19 frame)
(const_int -48 [0xffffffffffffffd0])))
(clobber (reg:CC 17 flags))
]) "runData/keep/in.16651.c":199:34 247 {*adddi_1}
(nil))
(insn 24 23 25 5 (set (reg:V32QI 163)
(const_vector:V32QI [
(const_int 0 [0]) repeated x32
])) "runData/keep/in.16651.c":199:34 1823 {movv32qi_internal}
(nil))
(insn 25 24 26 5 (set (mem/c:V16QI (reg:DI 162) [0 MEM <char[1:48]> [(void *)_157]+0 S16 A128])
(vec_select:V16QI (reg:V32QI 163)
(parallel [
(const_int 0 [0])
(const_int 1 [0x1])
(const_int 2 [0x2])
(const_int 3 [0x3])
(const_int 4 [0x4])
(const_int 5 [0x5])
(const_int 6 [0x6])
(const_int 7 [0x7])
(const_int 8 [0x8])
(const_int 9 [0x9])
(const_int 10 [0xa])
(const_int 11 [0xb])
(const_int 12 [0xc])
(const_int 13 [0xd])
(const_int 14 [0xe])
(const_int 15 [0xf])
]))) "runData/keep/in.16651.c":199:34 4383
{vec_extract_lo_v32qi}
(nil))
(insn 26 25 27 5 (set (mem/c:V16QI (plus:DI (reg:DI 162)
(const_int 16 [0x10])) [0 MEM <char[1:48]> [(void *)_157]+16
S16 A128])
(vec_select:V16QI (reg:V32QI 163)
(parallel [
(const_int 16 [0x10])
(const_int 17 [0x11])
(const_int 18 [0x12])
(const_int 19 [0x13])
(const_int 20 [0x14])
(const_int 21 [0x15])
(const_int 22 [0x16])
(const_int 23 [0x17])
(const_int 24 [0x18])
(const_int 25 [0x19])
(const_int 26 [0x1a])
(const_int 27 [0x1b])
(const_int 28 [0x1c])
(const_int 29 [0x1d])
(const_int 30 [0x1e])
(const_int 31 [0x1f])
]))) "runData/keep/in.16651.c":199:34 4384
{vec_extract_hi_v32qi}
(nil))
(insn 27 26 28 5 (set (mem/c:V16QI (plus:DI (reg:DI 162)
(const_int 32 [0x20])) [0 MEM <char[1:48]> [(void *)_157]+32
S16 A128])
(subreg:V16QI (reg:V32QI 163) 0)) "runData/keep/in.16651.c":199:34 1824
{movv16qi_internal}
(nil))
The cmpelim dump still has:
(insn 279 6 25 4 (set (reg/f:DI 38 r10 [215])
(plus:DI (reg/f:DI 7 sp)
(const_int -48 [0xffffffffffffffd0]))) 241 {*leadi}
(expr_list:REG_EQUIV (plus:DI (reg/f:DI 19 frame)
(const_int -48 [0xffffffffffffffd0]))
(nil)))
(insn 25 279 26 4 (set (reg:V16QI 21 xmm1 [orig:218 MEM <char[1:48]> [(void *)_157] ] [218])
(const_vector:V16QI [
(const_int 0 [0]) repeated x16
])) "runData/keep/in.16651.c":199:34 1824 {movv16qi_internal}
(expr_list:REG_EQUIV (const_vector:V16QI [
(const_int 0 [0]) repeated x16
])
(nil)))
(insn 26 25 34 4 (set (reg:V16QI 20 xmm0 [orig:219 MEM <char[1:48]> [(void *)_157]+16 ] [219])
(reg:V16QI 21 xmm1)) "runData/keep/in.16651.c":199:34 1824
{movv16qi_internal}
(expr_list:REG_EQUIV (const_vector:V16QI [
(const_int 0 [0]) repeated x16
])
(nil)))
before the loop and
(insn 289 22 290 5 (set (mem/c:V16QI (reg/f:DI 38 r10 [215]) [0 MEM
<char[1:48]> [(void *)_157]+0 S16 A128])
(reg:V16QI 21 xmm1 [orig:218 MEM <char[1:48]> [(void *)_157] ] [218]))
"runData/keep/in.16651.c":199:34 1824 {movv16qi_internal}
(nil))
(insn 290 289 291 5 (set (mem/c:V16QI (plus:DI (reg/f:DI 38 r10 [215])
(const_int 16 [0x10])) [0 MEM <char[1:48]> [(void *)_157]+16
S16 A128])
(reg:V16QI 20 xmm0 [orig:219 MEM <char[1:48]> [(void *)_157]+16 ]
[219])) "runData/keep/in.16651.c":199:34 1824 {movv16qi_internal}
(nil))
(insn 291 290 29 5 (set (mem/c:V16QI (plus:DI (reg/f:DI 38 r10 [215])
(const_int 32 [0x20])) [0 MEM <char[1:48]> [(void *)_157]+32
S16 A128])
(reg:V16QI 20 xmm0 [orig:219 MEM <char[1:48]> [(void *)_157]+16 ]
[219])) "runData/keep/in.16651.c":199:34 1824 {movv16qi_internal}
(nil))
in the loop. stack_alignment_needed is 128, but then pro_and_epilogue decides
to do:
(insn/f 337 315 338 2 (set (mem:DI (pre_dec:DI (reg/f:DI 7 sp)) [0 S8 A8])
(reg/f:DI 6 bp)) "runData/keep/in.16651.c":157:16 -1
(nil))
(insn/f 338 337 339 2 (set (reg/f:DI 6 bp)
(reg/f:DI 7 sp)) "runData/keep/in.16651.c":157:16 -1
(nil))
(insn/f 339 338 340 2 (set (mem:DI (pre_dec:DI (reg/f:DI 7 sp)) [0 S8 A8])
(reg:DI 41 r13)) "runData/keep/in.16651.c":157:16 -1
(nil))
(insn/f 340 339 341 2 (set (mem:DI (pre_dec:DI (reg/f:DI 7 sp)) [0 S8 A8])
(reg:DI 40 r12)) "runData/keep/in.16651.c":157:16 -1
(nil))
(insn/f 341 340 342 2 (set (mem:DI (pre_dec:DI (reg/f:DI 7 sp)) [0 S8 A8])
(reg:DI 3 bx)) "runData/keep/in.16651.c":157:16 -1
(nil))
(insn 342 341 343 2 (set (mem/v:BLK (scratch:DI) [0 A8])
(unspec:BLK [
(mem/v:BLK (scratch:DI) [0 A8])
] UNSPEC_MEMORY_BLOCKAGE)) "runData/keep/in.16651.c":157:16 -1
(nil))
...
(insn 279 6 25 3 (set (reg/f:DI 38 r10 [215])
(plus:DI (reg/f:DI 7 sp)
(const_int -48 [0xffffffffffffffd0]))) 241 {*leadi}
(expr_list:REG_EQUIV (plus:DI (reg/f:DI 19 frame)
(const_int -48 [0xffffffffffffffd0]))
(nil)))
which ends up:
pushq %rbp
.LCFI5:
movq %rsp, %rbp
.LCFI6:
pushq %r13
pushq %r12
pushq %rbx
...
leaq -48(%rsp), %r10
...
vmovdqa %xmm1, (%r10)
vmovdqa %xmm0, 16(%r10)
movl $8, %esi
vmovdqa %xmm0, 32(%r10)
But this results in unaligned stores. On entry to an x86_64 function
(%rsp & 15) == 8, so after pushq %rbp the stack pointer, and hence %rbp, is
16-byte aligned. The prologue then does 3 more pushes (24 bytes) and l_2254 is
placed 48 bytes below that, i.e. at %rbp - 72, so all 3 vector stores are
unaligned ((%r10 & 15) == 8).
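
To double-check the arithmetic, here is a small C model of the prologue quoted
above. It is only a sketch: the function names l_2254_address and
misaligned_for_vmovdqa are made up for illustration, and the model assumes
nothing else adjusts %rsp between the pushes and the leaq.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the quoted prologue: given %rsp at function
   entry, return the address that ends up in %r10 (start of l_2254).  */
uint64_t
l_2254_address (uint64_t rsp_on_entry)
{
  uint64_t rsp = rsp_on_entry;
  rsp -= 8;                 /* pushq %rbp */
  uint64_t rbp = rsp;       /* movq %rsp, %rbp */
  rsp -= 3 * 8;             /* pushq %r13; pushq %r12; pushq %rbx */
  uint64_t r10 = rsp - 48;  /* leaq -48(%rsp), %r10 */
  assert (r10 == rbp - 72); /* 8 * 3 + 48 bytes below %rbp */
  return r10;
}

/* Nonzero if a 16-byte vmovdqa store to ADDR would be misaligned.  */
int
misaligned_for_vmovdqa (uint64_t addr)
{
  return (addr & 15) != 0;
}
```

For any entry %rsp with (%rsp & 15) == 8, the returned address has low four
bits equal to 8, and the same holds at offsets 16 and 32, matching the three
vmovdqa stores.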