https://gcc.gnu.org/bugzilla/show_bug.cgi?id=123997
Bug ID: 123997
Summary: Missing patterns for masked vector multiplication with
memory operand
Product: gcc
Version: 16.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: rguenth at gcc dot gnu.org
Target Milestone: ---
Trying 48 -> 49:
48: r134:V8DF=vec_merge(unspec[[r129:DI*0x8+r121:DI]]
178,const_vector,r118:QI)
REG_DEAD r129:DI
REG_DEAD r121:DI
49: r133:V8DF=vec_merge(r110:V8DF*r134:V8DF,const_vector,r118:QI)
REG_DEAD r134:V8DF
REG_DEAD r110:V8DF
Failed to match this instruction:
(set (reg:V8DF 133)
(vec_merge:V8DF (mult:V8DF (vec_merge:V8DF (unspec:V8DF [
(mem:V8DF (plus:DI (mult:DI (reg:DI 129 [ _66 ])
(const_int 8 [0x8]))
(reg/v/f:DI 121 [ in1 ])) [1 S64 A64])
] UNSPEC_MASKLOAD)
(const_vector:V8DF [
(const_double:DF 0.0 [0x0.0p+0]) repeated x8
])
(reg:QI 118 [ _89 ]))
(reg:V8DF 110 [ vect__23.21 ]))
(const_vector:V8DF [
(const_double:DF 0.0 [0x0.0p+0]) repeated x8
])
(reg:QI 118 [ _89 ])))
or, with -Ofast, where the multiplication is originally unmasked but the
load is masked:
Trying 48 -> 49:
48: r133:V8DF=vec_merge(unspec[[r129:DI*0x8+r121:DI]]
178,const_vector,r118:QI)
REG_DEAD r129:DI
REG_DEAD r121:DI
49: r134:V8DF=r133:V8DF*r110:V8DF
REG_DEAD r133:V8DF
REG_DEAD r110:V8DF
Failed to match this instruction:
(set (reg:V8DF 134 [ vect__9.25_78 ])
(mult:V8DF (vec_merge:V8DF (unspec:V8DF [
(mem:V8DF (plus:DI (mult:DI (reg:DI 129 [ _66 ])
(const_int 8 [0x8]))
(reg/v/f:DI 121 [ in1 ])) [1 S64 A64])
] UNSPEC_MASKLOAD)
(const_vector:V8DF [
(const_double:DF 0.0 [0x0.0p+0]) repeated x8
])
(reg:QI 118 [ _89 ]))
(reg:V8DF 110 [ vect__23.21 ])))
Testcase, compile with -O{3,fast} -march=x86-64-v4 --param
vect-partial-vector-usage=1
void foo(double * restrict out,
double *in0,
double *in1,
int N) {
for ( int i = 0 ; i < N ; i++ ) {
out[i] = in0[i] * in1[i];
}
}
and you'll get masked epilogue assembly like
subl %eax, %ecx
vpbroadcastd %ecx, %ymm0
vpcmpud $6, .LC0(%rip), %ymm0, %k1
vmovupd (%r9,%rax,8), %zmm2{%k1}{z}
vmovupd (%r8,%rax,8), %zmm1{%k1}{z}
vmulpd %zmm1, %zmm2, %zmm0{%k1}{z} <----
vmovupd %zmm0, (%rdi,%rax,8){%k1}
where the indicated multiplication could use a memory operand. With -Ofast
the multiplication is instead
vmovupd (%r9,%rax,8), %zmm1{%k1}{z}
vmovupd (%r8,%rax,8), %zmm0{%k1}{z}
vmulpd %zmm1, %zmm0, %zmm0
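Semantically, the failed combined insn is a zero-masked load feeding a
zero-masked multiply under the same mask: lanes where the mask bit is clear
are zero both in the loaded vector and in the result, so folding the load
into the multiply preserves the result (and the load's fault suppression,
since masked-off lanes are never touched). A portable scalar model of that
semantics, for illustration only (the function name is made up):

```c
#include <stddef.h>

/* Scalar model of the vec_merge(mult(vec_merge(maskload, 0, k), b), 0, k)
   RTL above: lane i is a[i] * b[i] if mask bit i is set, else 0.0.  */
void maskz_mul_model(double *out, const double *a, const double *b,
                     unsigned char mask, size_t n /* n <= 8 */)
{
    for (size_t i = 0; i < n; i++)
        out[i] = (mask >> i & 1) ? a[i] * b[i] : 0.0;
}
```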
I suspect quite an explosion of patterns if we want to handle this memory
forwarding via combine, for this and for the other operations that can do
fault suppression. Unsure whether there is another, better way to achieve
such forwarding, perhaps with some md-reorg pass?