On Tue, Jun 19, 2018 at 09:09:27AM -0500, Tamar Christina wrote:
> Hi All,
>
> This changes the movmem code in AArch64 that copies data of between 4 and
> 7 bytes to use the smallest possible mode capable of copying the
> remaining bytes.
>
> This means that if we're copying 5 bytes we would issue an SImode and
> QImode load instead of two SImode loads.
>
> This does smaller memory accesses but also gives the mid-end a chance to
> realise that it can CSE the loads in certain circumstances, e.g. when you
> have something like
>
> return foo;
>
> where foo is a struct. This would be transformed by the mid-end into SSA
> form as
>
> D.XXXX = foo;
>
> return D.XXXX;
>
> This movmem routine will handle the first copy, but it's usually not
> needed: the mid-end would do SImode and QImode stores into X0 for the
> 5-byte example, but without the first copies being in the same modes it
> doesn't know it doesn't need the stores at all.
>
> This will generate
>
> fun5:
>         sub     sp, sp, #16
>         adrp    x0, .LANCHOR0
>         add     x1, x0, :lo12:.LANCHOR0
>         ldr     w0, [x0, #:lo12:.LANCHOR0]
>         str     w0, [sp, 8]
>         ldrb    w0, [x1, 4]
>         strb    w0, [sp, 12]
>         add     sp, sp, 16
>         ret
>
> instead of
>
> fun5:
>         sub     sp, sp, #16
>         adrp    x0, .LANCHOR0
>         add     x1, x0, :lo12:.LANCHOR0
>         ldr     w0, [x0, #:lo12:.LANCHOR0]
>         str     w0, [sp, 8]
>         ldr     w0, [x1, 1]
>         str     w0, [sp, 9]
>         add     sp, sp, 16
>         ret
>
> for a 5-byte copy.
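
For concreteness, the kind of function under discussion is something like
the sketch below (my example, not necessarily what the struct_cpy.c
testcase in the patch contains):

  struct five { char bytes[5]; };
  struct five a, b;

  void
  fun5 (void)
  {
    /* 5-byte block copy: with the patch this expands to one SImode plus
       one QImode load/store pair rather than two overlapping SImode
       pairs.  */
    a = b;
  }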
So...

Given that I wrote the code to do the overlapping memmove back in 2014
https://gcc.gnu.org/ml/gcc-patches/2014-06/msg00585.html I'm confused about
why we want to achieve this transformation.

I can see that we come out about even for a 5 or 6 byte copy, but your
patch would also force it for a 7 byte copy, adding an extra load
instruction. That said, the 5 and 6 cases probably weren't thought through
that closely.

I'd like to see a testcase which shows the advantages for CSE you describe
in your introduction to the patch; I'm not sure your testcase below gives
that.

> Regtested on aarch64-none-elf and .. issues.

Do I draw a zero on the line on my screen with a sharpie or what ;-)

> Ok for trunk?

Justification and performance impact would be appreciated. Right now I'm
not convinced this is forward progress.

> gcc/
> 2018-06-19  Tamar Christina  <tamar.christ...@arm.com>
>
> 	* config/aarch64/aarch64.c (aarch64_expand_movmem): Fix mode size.
>
> gcc/testsuite/
> 2018-06-19  Tamar Christina  <tamar.christ...@arm.com>
>
> 	* gcc.target/aarch64/struct_cpy.c: New.
>
> --
> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> index bd0ac2f04d8f43fd54b58ff9581645493b8d0cd1..ed5409403741fa634d977ba15cd22741bb9d1064 100644
> --- a/gcc/config/aarch64/aarch64.c
> +++ b/gcc/config/aarch64/aarch64.c
> @@ -16180,26 +16180,29 @@ aarch64_copy_one_block_and_progress_pointers (rtx *src, rtx *dst,
>  bool
>  aarch64_expand_movmem (rtx *operands)
>  {
> -  unsigned int n;
> +  int n, mode_bits;
>    rtx dst = operands[0];
>    rtx src = operands[1];
>    rtx base;
> +  machine_mode cur_mode = BLKmode, next_mode;
>    bool speed_p = !optimize_function_for_size_p (cfun);
>
>    /* When optimizing for size, give a better estimate of the length of a
> -     memcpy call, but use the default otherwise.  */
> -  unsigned int max_instructions = (speed_p ? 15 : AARCH64_CALL_RATIO) / 2;
> +     memcpy call, but use the default otherwise.  Moves larger than 8 bytes
> +     will always require an even number of instructions to do now.  And each
> +     operation requires both a load+store, so divide the max number by 2.  */
> +  int max_num_moves = (speed_p ? 16 : AARCH64_CALL_RATIO) / 2;

Why did we go from unsigned to signed int?

Thanks,
James
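
P.S. To make the 7-byte arithmetic explicit, the sequences I have in mind
are roughly these (illustrative sketches, not actual compiler output):

  // Overlapping SImode accesses (current behaviour): 4 instructions.
  ldr     w2, [x1]        // bytes 0-3
  ldr     w3, [x1, 3]     // bytes 3-6, overlapping byte 3
  str     w2, [x0]
  str     w3, [x0, 3]

  // Smallest-mode accesses (with the patch): 6 instructions.
  ldr     w2, [x1]        // bytes 0-3 (SImode)
  ldrh    w3, [x1, 4]     // bytes 4-5 (HImode)
  ldrb    w4, [x1, 6]     // byte 6 (QImode)
  str     w2, [x0]
  strh    w3, [x0, 4]
  strb    w4, [x0, 6]

That's two loads and two stores against three of each.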