https://gcc.gnu.org/bugzilla/show_bug.cgi?id=121451
Bug ID: 121451
Summary: RISC-V: zero-stride load broadcast vs. vector-scalar
Product: gcc
Version: 15.1.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: parras at gcc dot gnu.org
Target Milestone: ---
https://godbolt.org/z/brW6sG7KM
Reduced from 538.imagick topblock #0 (11 insns, 36.75%)
We get the following assembly:
fld fa5,0(a1)
vfmv.v.f v3,fa5
vfmacc.vv v1,v3,v2
But since PR119100 we should get:
fld fa5,0(a4)
vfmacc.vf v1,fa5,v2
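For illustration, a loop of roughly the following shape is the kind of source in which a scalar loaded from memory is broadcast into a floating-point multiply-accumulate. This is only a sketch, not the reduced testcase behind the Compiler Explorer link above: the function name, the parameter layout and the restrict qualifiers are assumptions, and the exact code generated depends on the -march/-O options used.

```c
#include <stddef.h>

/* Hypothetical example (not the testcase from the report).
   *g is loop-invariant, so a vectorizer can load it once and
   broadcast it; each iteration is a multiply-accumulate of the
   form out[i] += scalar * in[i].  */
void
scaled_accumulate (double *restrict out, const double *restrict in,
                   const double *restrict g, size_t n)
{
  for (size_t i = 0; i < n; i++)
    out[i] += *g * in[i];
}
```

In a loop like this the broadcast value comes straight from memory, which is the situation the vec_duplicate of a mem operand describes.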
What is preventing the combination here is the vec_duplicate operand being a
mem:
(set (reg:RVVM1DF 157 [ _20 ])
    (vec_duplicate:RVVM1DF (mem:DF (reg/v/f:DI 153 [ g ]) [1 *g_16(D)+0 S8 A64])))
OTOH this seems to be a candidate for a zero-stride load broadcast:
vlse64.v v3,0(a1),zero
vfmacc.vv v1,v3,v2
However, since r16-2452-gf796f819c35cc0 this case is explicitly handled as a
regular broadcast (implying the vfmv). Is there a reason to prefer
unconditionally forcing the memory operand into a register (fld + vfmv) over a
zero-stride load (vlse)?
bool
can_be_broadcast_p (rtx op)
{
  ...
  if (FLOAT_MODE_P (mode)
      && (memory_operand (op, mode) || CONSTANT_P (op))
      && can_create_pseudo_p ())
    return true;
I also noticed the tunable discussed in PR118734, but the decision made here
does not involve it.