Dhruv Chawla <dhr...@nvidia.com> writes:
> This patch modifies Advanced SIMD assembly generation to emit an LDR
> instruction when a vector is created using a load to the first element,
> with the other elements being zero.
>
> This is similar to what *aarch64_combinez<mode> already does.
>
> Example:
>
> uint8x16_t foo(uint8_t *x) {
>    uint8x16_t r = vdupq_n_u8(0);
>    r = vsetq_lane_u8(*x, r, 0);
>    return r;
> }
>
> Currently, this generates:
>
> foo:
>       movi    v0.4s, 0
>       ld1     {v0.b}[0], [x0]
>       ret
>
> After applying the patch, this generates:
>
> foo:
>       ldr     b0, [x0]
>       ret
>
> Bootstrapped and regtested on aarch64-linux-gnu. Tested on
> aarch64_be-unknown-linux-gnu as well.
>
> Signed-off-by: Dhruv Chawla <dhr...@nvidia.com>
>
> gcc/ChangeLog:
>
>       * config/aarch64/aarch64-simd.md
>       (*aarch64_simd_vec_set_low<mode>): New pattern.
>
> gcc/testsuite/ChangeLog:
>
>       * gcc.target/aarch64/simd/ldr_first_le.c: New test.
>       * gcc.target/aarch64/simd/ldr_first_be.c: Likewise.
> ---
>   gcc/config/aarch64/aarch64-simd.md            |  12 ++
>   .../gcc.target/aarch64/simd/ldr_first_be.c    | 140 ++++++++++++++++++
>   .../gcc.target/aarch64/simd/ldr_first_le.c    | 139 +++++++++++++++++
>   3 files changed, 291 insertions(+)
>   create mode 100644 gcc/testsuite/gcc.target/aarch64/simd/ldr_first_be.c
>   create mode 100644 gcc/testsuite/gcc.target/aarch64/simd/ldr_first_le.c
>
> diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
> index e2afe87e513..7be1c685fcf 100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -1164,6 +1164,18 @@
>     [(set_attr "type" "neon_logic<q>")]
>   )
>   
> +(define_insn "*aarch64_simd_vec_set_low<mode>"
> +  [(set (match_operand:VALL_F16 0 "register_operand" "=w")
> +     (vec_merge:VALL_F16
> +         (vec_duplicate:VALL_F16
> +             (match_operand:<VEL> 1 "aarch64_simd_nonimmediate_operand" "m"))

The constraint should be "Utv" rather than "m", since the operand doesn't
accept all addresses that are valid for <VEL>.  E.g. a normal SImode
memory would allow [reg, #imm], whereas this address doesn't.

> +         (match_operand:VALL_F16 3 "aarch64_simd_imm_zero" "i")
> +         (match_operand:SI 2 "immediate_operand" "i")))]

I think we should drop the two "i"s here, since the pattern doesn't
accept all immediates.  The predicate on the final operand should be
const_int_operand rather than immediate_operand.
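For the avoidance of doubt, the head of the pattern with those changes
applied would look something like the below (just a sketch; the rest of
the pattern stays as in your patch):

(define_insn "*aarch64_simd_vec_set_low<mode>"
  [(set (match_operand:VALL_F16 0 "register_operand" "=w")
	(vec_merge:VALL_F16
	  (vec_duplicate:VALL_F16
	    (match_operand:<VEL> 1 "aarch64_simd_nonimmediate_operand" "Utv"))
	  (match_operand:VALL_F16 3 "aarch64_simd_imm_zero")
	  (match_operand:SI 2 "const_int_operand")))]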

Otherwise it looks good.  But I think we should think about how we
plan to integrate the related optimisation for register inputs.  E.g.:

int32x4_t foo(int32_t x) {
    return vsetq_lane_s32(x, vdupq_n_s32(0), 0);
}

generates:

foo:
        movi    v0.4s, 0
        ins     v0.s[0], w0
        ret

rather than a single FMOV.  Same idea when the input is in an FPR rather
than a GPR, in which case an FPR-to-FPR FMOV would do.
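That is, for the GPR case the ideal output would be something like the
following, relying on the fact that a scalar FMOV zeroes the upper bits
of the destination vector register:

foo:
        fmov    s0, w0
        ret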

Conventionally, the register and memory forms should be listed as
alternatives in a single pattern, but that's somewhat complex because of
the different instruction availability for 64-bit+32-bit, 16-bit, and
8-bit register operations.
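Concretely (and glossing over the availability problem), the combined
pattern would presumably just grow extra alternatives on operand 1,
along the lines of (sketch only):

  (match_operand:<VEL> 1 "aarch64_simd_nonimmediate_operand" "r, w, Utv")

with FMOV used for the register alternatives and LDR for the memory one.
The per-mode availability of the FMOV forms is what makes the
single-pattern approach awkward here.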

My worry is that if we handle the register case as an entirely separate
patch, it would have to rewrite this one.

The register case is somewhat related to Pengxuan's work on permutations.

Thanks,
Richard
