James Greenhalgh <james.greenha...@arm.com> writes:
> Hi,
>
> For this code:
>
>   struct y {
>     float x[4];
>   };
>
>   float
>   bar3 (struct y x)
>   {
>     return x.x[3];
>   }
>
> GCC generates:
>
>   bar3:
>       fmov    x1, d2
>       mov     x0, 0
>       bfi     x0, x1, 0, 32
>       fmov    x1, d3
>       bfi     x0, x1, 32, 32
>       sbfx    x0, x0, 32, 32
>       fmov    s0, w0
>       ret
>
> If you can wrap your head around that, you'll spot that it could be
> simplified to:
>
>   bar3:
>       fmov    s0, s3
>       ret

I get the second version with current trunk, at -O and above.  Does this
only happen with some other modifications, or for certain subtargets?
Or maybe I'm testing the wrong thing.

> Looking at it, I think the issue is the mode that we assign to the
> PARALLEL we build for an HFA in registers. When we get in to
> aarch64_layout_arg with a composite, MODE is set to the smallest integer
> mode that would contain the size of the composite type. That is to say, in
> the example above, MODE will be TImode.
>
> Looking at the expansion path through assign_parms, we're going to go:
>
> assign_parms
>   assign_parm_setup_reg
>     assign_parm_remove_parallels
>       emit_group_store
>
> assign_parm_remove_parallels is going to try to create a REG in MODE,
> then construct that REG using the values in the HFA PARALLEL we created. So,
> for the example above, we're going to try to create a packed TImode value
> built up from each of the four "S" registers we've assigned for the
> arguments. Using one of the struct elements is then a 32-bit extract from
> the TImode value (then a move back to FP/SIMD registers). This explains
> the code-gen in the example. Note that an extract from the TImode value
> makes the whole TImode value live, so we can't optimize away the
> construction in registers.
>
> If instead we make the PARALLEL that we create in aarch64_layout_arg BLKmode
> then our expansion path is through:
>
> assign_parms
>   assign_parm_setup_block
>
> Which handles creating a stack slot of the right size for our HFA, and
> copying it to there. We could then trust the usual optimisers to deal with
> the object construction and eliminate it where possible.
>
> However, we can't just return a BLKmode Parallel, as the mid-end was explictly
> asking us to return in MODE, and will eventually ICE given the inconsistency.
> One other way we can force these structures to be given BLKmode is through
> TARGET_MEMBER_TYPE_FORCES_BLK. Which is what we do in this patch.
>
> We're going to tell the mid-end that any structure of more than one element
> which contains either floating-point or vector data should be set out in
> BLKmode rather than a large-enough integer mode. In doing so, we implicitly
> fix the issue with HFA layout above. But at what cost!

The patch only seems to handle scalar floats, not vectors.

Wasn't replying to this to nitpick though :-).  I just wondered whether
there would be any benefit to having an array_mode hook that returns
V4SF for float[4].  (We needed that hook to handle load/store lanes
for SVE, so see git branch linaro-dev/sve if you want to try it.)
Maybe alignment would be a problem though?

Thanks,
Richard

Reply via email to