Re: [PATCH][GCC][AArch64] Simplify movmem code by always doing overlapping copies when larger than 8 bytes.

2018-07-06 Thread Christophe Lyon
Hi Tamar,

On Tue, 3 Jul 2018 at 19:13, James Greenhalgh  wrote:
>
> On Tue, Jun 19, 2018 at 09:09:27AM -0500, Tamar Christina wrote:
> > Hi All,
>
> 
>
> OK.
>
> Thanks,
> James
>
> > Thanks,
> > Tamar
> >
> > gcc/
> > 2018-06-19  Tamar Christina  
> >
> >   * config/aarch64/aarch64.c (aarch64_expand_movmem): Fix mode size.
> >
> > gcc/testsuite/
> > 2018-06-19  Tamar Christina  
> >
> >   * gcc.target/aarch64/struct_cpy.c: New.
> >

The new test has a typo making it fail to compile:
/gcc/testsuite/gcc.target/aarch64/struct_cpy.c:49:7: error: ISO C does not allow extra ';' outside of a function [-Wpedantic]
/gcc/testsuite/gcc.target/aarch64/struct_cpy.c:50:7: error: ISO C does not allow extra ';' outside of a function [-Wpedantic]
/gcc/testsuite/gcc.target/aarch64/struct_cpy.c:51:7: error: ISO C does not allow extra ';' outside of a function [-Wpedantic]
/gcc/testsuite/gcc.target/aarch64/struct_cpy.c:52:7: error: ISO C does not allow extra ';' outside of a function [-Wpedantic]
/gcc/testsuite/gcc.target/aarch64/struct_cpy.c:53:7: error: ISO C does not allow extra ';' outside of a function [-Wpedantic]
/gcc/testsuite/gcc.target/aarch64/struct_cpy.c:54:7: error: ISO C does not allow extra ';' outside of a function [-Wpedantic]
/gcc/testsuite/gcc.target/aarch64/struct_cpy.c:55:7: error: ISO C does not allow extra ';' outside of a function [-Wpedantic]
/gcc/testsuite/gcc.target/aarch64/struct_cpy.c:56:7: error: ISO C does not allow extra ';' outside of a function [-Wpedantic]
/gcc/testsuite/gcc.target/aarch64/struct_cpy.c:57:7: error: ISO C does not allow extra ';' outside of a function [-Wpedantic]
/gcc/testsuite/gcc.target/aarch64/struct_cpy.c:58:8: error: ISO C does not allow extra ';' outside of a function [-Wpedantic]
/gcc/testsuite/gcc.target/aarch64/struct_cpy.c:59:8: error: ISO C does not allow extra ';' outside of a function [-Wpedantic]
/gcc/testsuite/gcc.target/aarch64/struct_cpy.c:60:8: error: ISO C does not allow extra ';' outside of a function [-Wpedantic]
/gcc/testsuite/gcc.target/aarch64/struct_cpy.c:61:8: error: ISO C does not allow extra ';' outside of a function [-Wpedantic]
/gcc/testsuite/gcc.target/aarch64/struct_cpy.c:62:8: error: ISO C does not allow extra ';' outside of a function [-Wpedantic]
/gcc/testsuite/gcc.target/aarch64/struct_cpy.c:63:8: error: ISO C does not allow extra ';' outside of a function [-Wpedantic]
/gcc/testsuite/gcc.target/aarch64/struct_cpy.c:64:8: error: ISO C does not allow extra ';' outside of a function [-Wpedantic]
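
For context, that -Wpedantic diagnostic is what GCC emits for a stray ';'
after a file-scope definition.  A hypothetical reproducer (not the actual
testcase content) would be:

/* An extra ';' at file scope triggers the pedwarn.  */
struct s { char c[5]; };
struct s a = { { 0 } };;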

Would you mind fixing it?

Thanks,

Christophe

> > --


Re: [PATCH][GCC][AArch64] Simplify movmem code by always doing overlapping copies when larger than 8 bytes.

2018-07-03 Thread James Greenhalgh
On Tue, Jun 19, 2018 at 09:09:27AM -0500, Tamar Christina wrote:
> Hi All,



OK.

Thanks,
James

> Thanks,
> Tamar
> 
> gcc/
> 2018-06-19  Tamar Christina  
> 
>   * config/aarch64/aarch64.c (aarch64_expand_movmem): Fix mode size.
> 
> gcc/testsuite/
> 2018-06-19  Tamar Christina  
> 
>   * gcc.target/aarch64/struct_cpy.c: New.
> 
> -- 


Re: [PATCH][GCC][AArch64] Simplify movmem code by always doing overlapping copies when larger than 8 bytes.

2018-06-20 Thread Tamar Christina
Hi James,

Many thanks for the review!

The 06/19/2018 22:23, James Greenhalgh wrote:
> On Tue, Jun 19, 2018 at 09:09:27AM -0500, Tamar Christina wrote:
> > Hi All,
> > 
> > This changes the movmem code in AArch64 that copies data between 4 and 7
> > bytes to use the smallest possible mode capable of copying the remaining
> > bytes.
> > 
> > for a 5 byte copy.
> 
> So... Given that I wrote the code to do the overlapping memmove back in 2014
> https://gcc.gnu.org/ml/gcc-patches/2014-06/msg00585.html I'm confused about
> why we want to achieve this transformation. I can see that we come out
> about even for a 5 or 6 byte copy, but your patch would also force it for a
> 7 byte copy, adding an extra load instruction. That said, the 5 and 6 cases
> probably weren't thought through that closely.
> 

Actually, the 7 byte case stays the same.  This patch will generate, just as before:

ldr w1, [x0, 48]
ldr w0, [x0, 51]
str w1, [sp, 8]
str w0, [sp, 11]

This patch should never generate more loads or stores than it did before; in
fact it strictly generates fewer or the same number of loads/stores, just in
less code.  This is a consequence of how the algorithm works.

This can be proved by induction.  My hypothesis is that this algorithm never
issues more than 2 loads and 2 stores per copy of the remaining n % 32 bytes.

For the first n - (n % 32) bytes, the code will, as before, satisfy these
using TImode loads and stores, issuing as many as required to get to n < 31.

Which leaves the cases of n < 31 bytes.

The cases n > 24 and n < 31 will be done using overlapping TImode operations
now, whereas before they would issue more operations.  E.g. n = 25 will now
overlap where before it would have done 16 + 8 + 1.  Another example: n == 27
before would have done 16 + 8 + 4 (where the 4 overlaps); now it will do
16 + 16 where the second 16 overlaps.

etc.

It does this by always doing the first copy using the largest mode that
doesn't over-read.  So for your 7 byte example before, this would be SImode.
In the loop it selects the smallest mode that is larger than or equal to the
remainder.  SImode is the smallest mode larger than 3 bytes, so it issues the
same two SImode operations as before.

I had described this in the comments but if this wasn't clear I can expand on 
it.
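
To make the shape of the algorithm concrete, here is a rough standalone C
model of the size selection described above.  To be clear, this is my own
illustrative sketch -- the function names and structure are invented and it
is not the code in aarch64.c:

/* Illustrative model only -- not the actual aarch64.c implementation.
   Print the chunk sizes and offsets an n-byte copy would use: the first
   and intermediate chunks use the largest power-of-two size (up to 16
   bytes, i.e. TImode) that does not over-read, while a trailing remainder
   that is not a power of two is done as one larger chunk shifted back so
   it overlaps the previous copy.  */
#include <stdio.h>

static int
largest_size_le (int n)   /* Largest of 16, 8, 4, 2, 1 that is <= n.  */
{
  int size = 16;
  while (size > n)
    size /= 2;
  return size;
}

static int
smallest_size_ge (int n)  /* Smallest of 1, 2, 4, 8, 16 that is >= n,
                             capped at 16 bytes (TImode).  */
{
  int size = 1;
  while (size < n && size < 16)
    size *= 2;
  return size;
}

static void
show_copy_plan (int n)
{
  int offset = 0;
  printf ("%2d bytes:", n);
  while (n > 0)
    {
      int size = largest_size_le (n);
      /* Do the trailing remainder as one overlapping chunk rather than
         several small ones, provided something has been copied already.  */
      if (offset > 0 && smallest_size_ge (n) > n)
        {
          size = smallest_size_ge (n);
          offset -= size - n;  /* Back up so the chunk overlaps.  */
          n = size;
        }
      printf (" [%d bytes at offset %d]", size, offset);
      offset += size;
      n -= size;
    }
  printf ("\n");
}

int
main (void)
{
  show_copy_plan (5);   /* 4 + 1.  */
  show_copy_plan (7);   /* 4 + 4, the second overlapping.  */
  show_copy_plan (25);  /* 16 + 16, the second overlapping.  */
  show_copy_plan (27);  /* 16 + 16, the second overlapping.  */
  return 0;
}

Running this reproduces the cases discussed above: 5 -> 4 + 1, 7 -> two
overlapping SImode copies, and 25 or 27 -> two overlapping TImode copies.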

> I'd like to see a testcase which shows the advantages for CSE you describe
> in your introduction to the patch; I'm not sure your testcase below gives
> that.

This CSE relies on a mid-end change also being accepted.  This would be
https://gcc.gnu.org/ml/gcc-patches/2017-11/msg01088.html which was reverted
due to it not handling corner cases correctly
https://gcc.gnu.org/ml/gcc-patches/2018-04/msg00239.html
I am performing regression testing on the new version and will submit it
soon.

If we take a more useful example of a program than the ones in the testcase and 
cover letter:

struct struct5  fun5()
{
  return foo5;
}

where foo5 is a simple 5 byte struct.

after SSA this becomes

struct struct5  fun5()
{
  D. = foo5
  return D.;
}

Which means two copies will be done: the first one into D. and the second
one by the return.  One is handled by the mid-end and one by this back-end
code.
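
For reference, a minimal self-contained version of this example (the struct
layout here is an assumption for illustration; the actual testcase may use a
different definition):

/* Assumed 5 byte struct; the testcase's layout may differ.  */
struct struct5 { char bytes[5]; };

struct struct5 foo5 = { { 'a', 'b', 'c', 'd', 'e' } };

struct struct5
fun5 (void)
{
  return foo5;
}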

Before this patch we generated 16 insns at -O3:

fun5:
        adrp    x1, .LANCHOR0
        add     x1, x1, :lo12:.LANCHOR0
        mov     x0, 0
        sub     sp, sp, #16
        ldrb    w2, [x1, 32]
        ldr     w1, [x1, 33]
        add     sp, sp, 16
        bfi     x0, x2, 0, 8
        ubfx    x3, x1, 8, 8
        bfi     x0, x1, 8, 8
        ubfx    x2, x1, 16, 8
        lsr     w1, w1, 24
        bfi     x0, x3, 16, 8
        bfi     x0, x2, 24, 8
        bfi     x0, x1, 32, 8
        ret

With this patch alone it is still 16 insns: no regression, but no improvement either:

fun5:
        adrp    x2, .LANCHOR0
        add     x2, x2, :lo12:.LANCHOR0
        mov     x0, 0
        sub     sp, sp, #16
        ldr     w1, [x2, 32]
        ldrb    w2, [x2, 36]
        add     sp, sp, 16
        ubfx    x4, x1, 8, 8
        bfi     x0, x1, 0, 8
        ubfx    x3, x1, 16, 8
        lsr     w1, w1, 24
        bfi     x0, x4, 8, 8
        bfi     x0, x3, 16, 8
        bfi     x0, x1, 24, 8
        bfi     x0, x2, 32, 8
        ret

And with this patch and the mid-end patch this becomes 10 insns:

fun5:
        adrp    x1, .LANCHOR0
        add     x1, x1, :lo12:.LANCHOR0
        mov     x0, 0
        sub     sp, sp, #16
        ldr     w2, [x1, 32]
        ldrb    w1, [x1, 36]
        add     sp, sp, 16
        bfi     x0, x2, 0, 32
        bfi     x0, x1, 32, 8
        ret

With the mid-end patch alone and the existing implementation it would be 14 insns:

fun5:
        adrp    x1, .LANCHOR0
        add     x1, x1, :lo12:.LANCHOR0
        sub     sp, sp, #16
        mov     x0, 0
        ldr     w2, [x1, 32]
        ldr     w1, [x1, 33]
        str     w2, [sp, 8]
        str     w1, [sp, 9]
        ldr

Re: [PATCH][GCC][AArch64] Simplify movmem code by always doing overlapping copies when larger than 8 bytes.

2018-06-19 Thread James Greenhalgh
On Tue, Jun 19, 2018 at 09:09:27AM -0500, Tamar Christina wrote:
> Hi All,
> 
> This changes the movmem code in AArch64 that copies data between 4 and 7
> bytes to use the smallest possible mode capable of copying the remaining
> bytes.
> 
> This means that if we're copying 5 bytes we would issue an SImode and QImode
> load instead of two SImode loads.
> 
> This does smaller memory accesses but also gives the mid-end a chance to
> realise that it can CSE the loads in certain circumstances, e.g. when you
> have something like
> 
> return foo;
> 
> where foo is a struct.  This would be transformed by the mid-end into SSA
> form as
> 
> D. = foo;
> 
> return D.;
> 
> This movmem routine will handle the first copy, but it's usually not needed;
> the mid-end would do SImode and QImode stores into X0 for the 5 byte example,
> but without the first copies being in the same mode, it doesn't know it
> doesn't need the stores at all.
> 
> This will generate
> 
> fun5:
>         sub     sp, sp, #16
>         adrp    x0, .LANCHOR0
>         add     x1, x0, :lo12:.LANCHOR0
>         ldr     w0, [x0, #:lo12:.LANCHOR0]
>         str     w0, [sp, 8]
>         ldrb    w0, [x1, 4]
>         strb    w0, [sp, 12]
>         add     sp, sp, 16
>         ret
> 
> instead of
> 
> fun5:
>         sub     sp, sp, #16
>         adrp    x0, .LANCHOR0
>         add     x1, x0, :lo12:.LANCHOR0
>         ldr     w0, [x0, #:lo12:.LANCHOR0]
>         str     w0, [sp, 8]
>         ldr     w0, [x1, 1]
>         str     w0, [sp, 9]
>         add     sp, sp, 16
>         ret
> 
> for a 5 byte copy.

So... Given that I wrote the code to do the overlapping memmove back in 2014
https://gcc.gnu.org/ml/gcc-patches/2014-06/msg00585.html I'm confused about
why we want to achieve this transformation. I can see that we come out
about even for a 5 or 6 byte copy, but your patch would also force it for a
7 byte copy, adding an extra load instruction. That said, the 5 and 6 cases
probably weren't thought through that closely.

I'd like to see a testcase which shows the advantages for CSE you describe
in your introduction to the patch; I'm not sure your testcase below gives
that.

> Regtested on aarch64-none-elf and .. issues.

Do I draw a zero on the line on my screen with a sharpie or what ;-)

> Ok for trunk?

Justification and performance impact would be appreciated. Right now I'm not
convinced this is forward progress.

> gcc/
> 2018-06-19  Tamar Christina  
> 
>   * config/aarch64/aarch64.c (aarch64_expand_movmem): Fix mode size.
> 
> gcc/testsuite/
> 2018-06-19  Tamar Christina  
> 
>   * gcc.target/aarch64/struct_cpy.c: New.
> 
> -- 

> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> index bd0ac2f04d8f43fd54b58ff9581645493b8d0cd1..ed5409403741fa634d977ba15cd22741bb9d1064 100644
> --- a/gcc/config/aarch64/aarch64.c
> +++ b/gcc/config/aarch64/aarch64.c
> @@ -16180,26 +16180,29 @@ aarch64_copy_one_block_and_progress_pointers (rtx *src, rtx *dst,
>  bool
>  aarch64_expand_movmem (rtx *operands)
>  {
> -  unsigned int n;
> +  int n, mode_bits;
>    rtx dst = operands[0];
>    rtx src = operands[1];
>    rtx base;
> +  machine_mode cur_mode = BLKmode, next_mode;
>    bool speed_p = !optimize_function_for_size_p (cfun);
>  
>    /* When optimizing for size, give a better estimate of the length of a
> -     memcpy call, but use the default otherwise.  */
> -  unsigned int max_instructions = (speed_p ? 15 : AARCH64_CALL_RATIO) / 2;
> +     memcpy call, but use the default otherwise.  Moves larger than 8 bytes
> +     will always require an even number of instructions to do now.  And each
> +     operation requires both a load+store, so divide the max number by 2.  */
> +  int max_num_moves = (speed_p ? 16 : AARCH64_CALL_RATIO) / 2;

Why did we go from unsigned to signed int?

Thanks,
James