Richard, I reworked the patch using an assert as you suggested. Bootstrapped and retested. Okay for trunk?
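
For reference, the reworked aarch64_split_simd_combine now has roughly the
following shape; this is only a sketch of the requested change (the
initialization of src_mode/dst_mode and the per-mode dispatch are elided or
assumed from the surrounding code), and the actual hunk is in the attached
pr7057v5.patch:

/* Sketch only, not the literal patch contents: the register_operand check
   becomes an assertion and the split code below it runs unconditionally.  */
void
aarch64_split_simd_combine (rtx dst, rtx src1, rtx src2)
{
  machine_mode src_mode = GET_MODE (src1);
  machine_mode dst_mode = GET_MODE (dst);

  gcc_assert (VECTOR_MODE_P (dst_mode));
  gcc_assert (register_operand (dst, dst_mode)
              && register_operand (src1, src_mode)
              && register_operand (src2, src_mode));

  rtx (*gen) (rtx, rtx, rtx);

  /* ... existing per-mode selection of the aarch64_simd_combine<mode>
     generator, followed by emit_insn (gen (dst, src1, src2)), unchanged.  */
}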
-----Original Message-----
From: Richard Earnshaw (lists) [mailto:richard.earns...@arm.com]
Sent: Friday, June 23, 2017 2:09 AM
To: Michael Collison <michael.colli...@arm.com>; GCC Patches <gcc-patches@gcc.gnu.org>
Cc: nd <n...@arm.com>
Subject: Re: [Neon intrinsics] Literal vector construction through vcombine is poor

On 23/06/17 00:10, Michael Collison wrote:
> Richard,
>
> I reworked the patch and retested on big endian as well as little.  The
> original code was performing two swaps in the big endian case which works out
> to no swaps at all.
>
> I also updated the ChangeLog per your comments.  Okay for trunk?
>
> 2017-06-19  Michael Collison  <michael.colli...@arm.com>
>
>         * config/aarch64/aarch64-simd.md (aarch64_combine<mode>): Directly
>         call aarch64_split_simd_combine.
>         * (aarch64_combine_internal<mode>): Delete pattern.
>         * config/aarch64/aarch64.c (aarch64_split_simd_combine):
>         Allow register and subreg operands.
>
> -----Original Message-----
> From: Richard Earnshaw (lists) [mailto:richard.earns...@arm.com]
> Sent: Monday, June 19, 2017 6:37 AM
> To: Michael Collison <michael.colli...@arm.com>; GCC Patches
> <gcc-patches@gcc.gnu.org>
> Cc: nd <n...@arm.com>
> Subject: Re: [Neon intrinsics] Literal vector construction through
> vcombine is poor
>
> On 16/06/17 22:08, Michael Collison wrote:
>> This patch improves code generation for literal vector construction by
>> expanding and exposing the pattern to rtl optimization earlier.  The current
>> implementation delays splitting the pattern until after reload, which results
>> in poor code generation for the following code:
>>
>> #include "arm_neon.h"
>>
>> int16x8_t
>> foo ()
>> {
>>   return vcombine_s16 (vdup_n_s16 (0), vdup_n_s16 (8));
>> }
>>
>> Trunk generates:
>>
>> foo:
>>         movi    v1.2s, 0
>>         movi    v0.4h, 0x8
>>         dup     d2, v1.d[0]
>>         ins     v2.d[1], v0.d[0]
>>         orr     v0.16b, v2.16b, v2.16b
>>         ret
>>
>> With the patch we now generate:
>>
>> foo:
>>         movi    v1.4h, 0x8
>>         movi    v0.4s, 0
>>         ins     v0.d[1], v1.d[0]
>>         ret
>>
>> Bootstrapped and tested on aarch64-linux-gnu.  Okay for trunk?
>>
>> 2017-06-15  Michael Collison  <michael.colli...@arm.com>
>>
>>         * config/aarch64/aarch64-simd.md(aarch64_combine_internal<mode>):
>>         Convert from define_insn_and_split into define_expand.
>>         * config/aarch64/aarch64.c(aarch64_split_simd_combine):
>>         Allow register and subreg operands.
>>
>
> Your changelog entry is confusing.  You've deleted the
> aarch64_combine_internal<mode> pattern entirely, having merged some of its
> functionality directly into its caller (aarch64_combine<mode>).
>
> So I think it should read:
>
>         * config/aarch64/aarch64-simd.md (aarch64_combine<mode>): Directly call
>         aarch64_split_simd_combine.
>         (aarch64_combine_internal<mode>): Delete pattern.
>         * ...
>
> Note also there should be a space between the file name and the open bracket
> for the first function name.
>
> Why don't you need the big-endian code path any more?
>
> R.
>
>>
>> pr7057.patch
>>
>> diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
>> index c462164..4a253a9 100644
>> --- a/gcc/config/aarch64/aarch64-simd.md
>> +++ b/gcc/config/aarch64/aarch64-simd.md
>> @@ -2807,27 +2807,11 @@
>>        op1 = operands[1];
>>        op2 = operands[2];
>>      }
>> -  emit_insn (gen_aarch64_combine_internal<mode> (operands[0], op1, op2));
>> -  DONE;
>> -}
>> -)
>>
>> -(define_insn_and_split "aarch64_combine_internal<mode>"
>> -  [(set (match_operand:<VDBL> 0 "register_operand" "=&w")
>> -        (vec_concat:<VDBL> (match_operand:VDC 1 "register_operand" "w")
>> -                           (match_operand:VDC 2 "register_operand" "w")))]
>> -  "TARGET_SIMD"
>> -  "#"
>> -  "&& reload_completed"
>> -  [(const_int 0)]
>> -{
>> -  if (BYTES_BIG_ENDIAN)
>> -    aarch64_split_simd_combine (operands[0], operands[2], operands[1]);
>> -  else
>> -    aarch64_split_simd_combine (operands[0], operands[1], operands[2]);
>> +  aarch64_split_simd_combine (operands[0], op1, op2);
>> +
>>    DONE;
>>  }
>> -[(set_attr "type" "multiple")]
>>  )
>>
>>  (define_expand "aarch64_simd_combine<mode>"
>> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
>> index 2e385c4..46bd78b 100644
>> --- a/gcc/config/aarch64/aarch64.c
>> +++ b/gcc/config/aarch64/aarch64.c
>> @@ -1650,7 +1650,8 @@ aarch64_split_simd_combine (rtx dst, rtx src1, rtx src2)
>>
>>    gcc_assert (VECTOR_MODE_P (dst_mode));
>>
>> -  if (REG_P (dst) && REG_P (src1) && REG_P (src2))
>> +  if (register_operand (dst, dst_mode) && register_operand (src1, src_mode)
>> +      && register_operand (src2, src_mode))
>>      {
>>        rtx (*gen) (rtx, rtx, rtx);
>>
>
>
> pr7057v4.patch
>
> diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
> index c462164..3043f81 100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -2796,38 +2796,10 @@
>                       (match_operand:VDC 2 "register_operand")]
>    "TARGET_SIMD"
>  {
> -  rtx op1, op2;
> -  if (BYTES_BIG_ENDIAN)
> -    {
> -      op1 = operands[2];
> -      op2 = operands[1];
> -    }
> -  else
> -    {
> -      op1 = operands[1];
> -      op2 = operands[2];
> -    }
> -  emit_insn (gen_aarch64_combine_internal<mode> (operands[0], op1, op2));
> -  DONE;
> -}
> -)
> +  aarch64_split_simd_combine (operands[0], operands[1], operands[2]);
>
> -(define_insn_and_split "aarch64_combine_internal<mode>"
> -  [(set (match_operand:<VDBL> 0 "register_operand" "=&w")
> -        (vec_concat:<VDBL> (match_operand:VDC 1 "register_operand" "w")
> -                           (match_operand:VDC 2 "register_operand" "w")))]
> -  "TARGET_SIMD"
> -  "#"
> -  "&& reload_completed"
> -  [(const_int 0)]
> -{
> -  if (BYTES_BIG_ENDIAN)
> -    aarch64_split_simd_combine (operands[0], operands[2], operands[1]);
> -  else
> -    aarch64_split_simd_combine (operands[0], operands[1], operands[2]);
>    DONE;
>  }
> -[(set_attr "type" "multiple")]
>  )
>
>  (define_expand "aarch64_simd_combine<mode>"
> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> index 2e385c4..46bd78b 100644
> --- a/gcc/config/aarch64/aarch64.c
> +++ b/gcc/config/aarch64/aarch64.c
> @@ -1650,7 +1650,8 @@ aarch64_split_simd_combine (rtx dst, rtx src1, rtx src2)
>
>    gcc_assert (VECTOR_MODE_P (dst_mode));
>
> -  if (REG_P (dst) && REG_P (src1) && REG_P (src2))
> +  if (register_operand (dst, dst_mode) && register_operand (src1, src_mode)
> +      && register_operand (src2, src_mode))
>      {
>        rtx (*gen) (rtx, rtx, rtx);
>
>

As far as I can see aarch64_split_simd_combine is only called from this one
expand pattern and the predicates on the pattern enforce all the
operands being registers.  Furthermore, aarch64_split_simd_combine does
nothing if they aren't all registers, which would obviously result in wrong
code generation.  So convert

> +  if (register_operand (dst, dst_mode) && register_operand (src1, src_mode)
> +      && register_operand (src2, src_mode))

into an assertion and make the following code unconditional.

OK with that change.

R.
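
The codegen improvement can still be sanity-checked with the reduced test
case from the start of the thread; the cross-compiler name and flags below
are just an illustrative invocation, not something the patch requires:

/* Build with e.g. "aarch64-linux-gnu-gcc -O2 -S pr7057.c" (example command
   only) and inspect foo: with the patch applied the vcombine lowers to two
   movi instructions plus a single ins, with no extra dup/orr moves.  */
#include "arm_neon.h"

int16x8_t
foo (void)
{
  return vcombine_s16 (vdup_n_s16 (0), vdup_n_s16 (8));
}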
pr7057v5.patch