Re: [PATCH][GCC][AArch64] Simplify movmem code by always doing overlapping copies when larger than 8 bytes.
Hi Tamar,

On Tue, 3 Jul 2018 at 19:13, James Greenhalgh wrote:
>
> On Tue, Jun 19, 2018 at 09:09:27AM -0500, Tamar Christina wrote:
> > Hi All,
>
> OK.
>
> Thanks,
> James
>
> > Thanks,
> > Tamar
> >
> > gcc/
> > 2018-06-19  Tamar Christina
> >
> >     * config/aarch64/aarch64.c (aarch64_expand_movmem): Fix mode size.
> >
> > gcc/testsuite/
> > 2018-06-19  Tamar Christina
> >
> >     * gcc.target/aarch64/struct_cpy.c: New.

The new test has a typo making it fail to compile:

/gcc/testsuite/gcc.target/aarch64/struct_cpy.c:49:7: error: ISO C does not allow extra ';' outside of a function [-Wpedantic]
/gcc/testsuite/gcc.target/aarch64/struct_cpy.c:50:7: error: ISO C does not allow extra ';' outside of a function [-Wpedantic]
/gcc/testsuite/gcc.target/aarch64/struct_cpy.c:51:7: error: ISO C does not allow extra ';' outside of a function [-Wpedantic]
/gcc/testsuite/gcc.target/aarch64/struct_cpy.c:52:7: error: ISO C does not allow extra ';' outside of a function [-Wpedantic]
/gcc/testsuite/gcc.target/aarch64/struct_cpy.c:53:7: error: ISO C does not allow extra ';' outside of a function [-Wpedantic]
/gcc/testsuite/gcc.target/aarch64/struct_cpy.c:54:7: error: ISO C does not allow extra ';' outside of a function [-Wpedantic]
/gcc/testsuite/gcc.target/aarch64/struct_cpy.c:55:7: error: ISO C does not allow extra ';' outside of a function [-Wpedantic]
/gcc/testsuite/gcc.target/aarch64/struct_cpy.c:56:7: error: ISO C does not allow extra ';' outside of a function [-Wpedantic]
/gcc/testsuite/gcc.target/aarch64/struct_cpy.c:57:7: error: ISO C does not allow extra ';' outside of a function [-Wpedantic]
/gcc/testsuite/gcc.target/aarch64/struct_cpy.c:58:8: error: ISO C does not allow extra ';' outside of a function [-Wpedantic]
/gcc/testsuite/gcc.target/aarch64/struct_cpy.c:59:8: error: ISO C does not allow extra ';' outside of a function [-Wpedantic]
/gcc/testsuite/gcc.target/aarch64/struct_cpy.c:60:8: error: ISO C does not allow extra ';' outside of a function [-Wpedantic]
/gcc/testsuite/gcc.target/aarch64/struct_cpy.c:61:8: error: ISO C does not allow extra ';' outside of a function [-Wpedantic]
/gcc/testsuite/gcc.target/aarch64/struct_cpy.c:62:8: error: ISO C does not allow extra ';' outside of a function [-Wpedantic]
/gcc/testsuite/gcc.target/aarch64/struct_cpy.c:63:8: error: ISO C does not allow extra ';' outside of a function [-Wpedantic]
/gcc/testsuite/gcc.target/aarch64/struct_cpy.c:64:8: error: ISO C does not allow extra ';' outside of a function [-Wpedantic]

Would you mind fixing it?

Thanks,

Christophe
Re: [PATCH][GCC][AArch64] Simplify movmem code by always doing overlapping copies when larger than 8 bytes.
On Tue, Jun 19, 2018 at 09:09:27AM -0500, Tamar Christina wrote:
> Hi All,

OK.

Thanks,
James

> Thanks,
> Tamar
>
> gcc/
> 2018-06-19  Tamar Christina
>
>     * config/aarch64/aarch64.c (aarch64_expand_movmem): Fix mode size.
>
> gcc/testsuite/
> 2018-06-19  Tamar Christina
>
>     * gcc.target/aarch64/struct_cpy.c: New.
>
> --
Re: [PATCH][GCC][AArch64] Simplify movmem code by always doing overlapping copies when larger than 8 bytes.
Hi James,

Many thanks for the review!

The 06/19/2018 22:23, James Greenhalgh wrote:
> On Tue, Jun 19, 2018 at 09:09:27AM -0500, Tamar Christina wrote:
> > Hi All,
> >
> > This changes the movmem code in AArch64 that does copy for data between 4 and 7
> > bytes to use the smallest possible mode capable of copying the remaining bytes.
> >
> > for a 5 byte copy.
>
> So... Given that I wrote the code to do the overlapping memmove back in 2014
> https://gcc.gnu.org/ml/gcc-patches/2014-06/msg00585.html I'm confused about
> why we want to achieve this transformation. I can see that we come out
> about even for a 5 or 6 byte copy, but your patch would also force it for a
> 7 byte copy, adding an extra load instruction. That said, the 5 and 6 cases
> probably weren't thought through that closely.

Actually, the 7 case stays the same. This patch will generate just as before:

    ldr     w1, [x0, 48]
    ldr     w0, [x0, 51]
    str     w1, [sp, 8]
    str     w0, [sp, 11]

This patch should never generate more loads or stores than it did before; in
fact it generates strictly fewer or the same number of loads/stores, just in
less code. The reason is how the algorithm works, and this can be proved by
induction. My hypothesis is that this algorithm never issues more than 2 loads
and 2 stores for the final n % 32 bytes of a copy.

For the first n - (n % 32) bytes, the code will, as before, satisfy these
using loads and stores of TImode, issuing as many as required to get to
n < 31. Which leaves the cases of n < 31 bytes.

The cases 24 < n < 31 will now be done using overlapping TImode operations,
whereas before they would issue more operations. E.g. n = 25 will now overlap,
where before it would have done 16 + 8 + 1. Another example: n = 27 before
would have done 16 + 8 + 4 (where the 4 overlaps); now it will do 16 + 16,
where the second 16 overlaps. Etc.

It does this by always doing the first copy using the largest mode that
doesn't over-read.
So for your 7 bytes example before, this would be SImode. In the loop it
selects the smallest mode that is larger than or equal to the remainder.
SImode is the smallest mode larger than 3 bytes, so it issues the same two
SImode operations as before. I had described this in the comments, but if
this wasn't clear I can expand on it.

> I'd like to see a testcase which shows the advantages for CSE you describe
> in your introduction to the patch; I'm not sure your testcase below gives
> that.

This CSE relies on a mid-end change also being accepted. This would be
https://gcc.gnu.org/ml/gcc-patches/2017-11/msg01088.html
which was reverted due to it not handling corner cases correctly:
https://gcc.gnu.org/ml/gcc-patches/2018-04/msg00239.html
I am performing regression testing on the new version and will submit it soon.

If we take a more useful example of a program than the ones in the testcase
and cover letter:

struct struct5 fun5()
{
  return foo5;
}

where foo5 is a simple 5 byte struct, then after SSA this becomes

struct struct5 fun5()
{
  D. = foo5;
  return D.;
}

Which means two copies will be done: the first into D. and the second by the
return. One is handled by the mid-end and one by this back-end code.
Before this patch we generated at -O3 16 insn:

fun5:
    adrp    x1, .LANCHOR0
    add     x1, x1, :lo12:.LANCHOR0
    mov     x0, 0
    sub     sp, sp, #16
    ldrb    w2, [x1, 32]
    ldr     w1, [x1, 33]
    add     sp, sp, 16
    bfi     x0, x2, 0, 8
    ubfx    x3, x1, 8, 8
    bfi     x0, x1, 8, 8
    ubfx    x2, x1, 16, 8
    lsr     w1, w1, 24
    bfi     x0, x3, 16, 8
    bfi     x0, x2, 24, 8
    bfi     x0, x1, 32, 8
    ret

With this patch alone, 16 insn: no regression, but no improvement either:

fun5:
    adrp    x2, .LANCHOR0
    add     x2, x2, :lo12:.LANCHOR0
    mov     x0, 0
    sub     sp, sp, #16
    ldr     w1, [x2, 32]
    ldrb    w2, [x2, 36]
    add     sp, sp, 16
    ubfx    x4, x1, 8, 8
    bfi     x0, x1, 0, 8
    ubfx    x3, x1, 16, 8
    lsr     w1, w1, 24
    bfi     x0, x4, 8, 8
    bfi     x0, x3, 16, 8
    bfi     x0, x1, 24, 8
    bfi     x0, x2, 32, 8
    ret

And with this patch and the mid-end patch this becomes 10 insn:

fun5:
    adrp    x1, .LANCHOR0
    add     x1, x1, :lo12:.LANCHOR0
    mov     x0, 0
    sub     sp, sp, #16
    ldr     w2, [x1, 32]
    ldrb    w1, [x1, 36]
    add     sp, sp, 16
    bfi     x0, x2, 0, 32
    bfi     x0, x1, 32, 8
    ret

With the mid-end patch alone and the existing implementation it would be 14 insn:

fun5:
    adrp    x1, .LANCHOR0
    add     x1, x1, :lo12:.LANCHOR0
    sub     sp, sp, #16
    mov     x0, 0
    ldr     w2, [x1, 32]
    ldr     w1, [x1, 33]
    str     w2, [sp, 8]
    str     w1, [sp, 9]
    ldr
Re: [PATCH][GCC][AArch64] Simplify movmem code by always doing overlapping copies when larger than 8 bytes.
On Tue, Jun 19, 2018 at 09:09:27AM -0500, Tamar Christina wrote:
> Hi All,
>
> This changes the movmem code in AArch64 that does copy for data between 4 and 7
> bytes to use the smallest possible mode capable of copying the remaining bytes.
>
> This means that if we're copying 5 bytes we would issue an SImode and QImode
> load instead of two SImode loads.
>
> This does smaller memory accesses but also gives the mid-end a chance to
> realise that it can CSE the loads in certain circumstances. e.g. when you
> have something like
>
> return foo;
>
> where foo is a struct. This would be transformed by the mid-end into SSA
> form as
>
> D. = foo;
>
> return D.;
>
> This movmem routine will handle the first copy, but it's usually not needed.
> The mid-end would do SImode and QImode stores into X0 for the 5 bytes
> example, but without the first copies being in the same mode, it doesn't
> know it doesn't need the stores at all.
>
> This will generate
>
> fun5:
>     sub     sp, sp, #16
>     adrp    x0, .LANCHOR0
>     add     x1, x0, :lo12:.LANCHOR0
>     ldr     w0, [x0, #:lo12:.LANCHOR0]
>     str     w0, [sp, 8]
>     ldrb    w0, [x1, 4]
>     strb    w0, [sp, 12]
>     add     sp, sp, 16
>     ret
>
> instead of
>
> fun5:
>     sub     sp, sp, #16
>     adrp    x0, .LANCHOR0
>     add     x1, x0, :lo12:.LANCHOR0
>     ldr     w0, [x0, #:lo12:.LANCHOR0]
>     str     w0, [sp, 8]
>     ldr     w0, [x1, 1]
>     str     w0, [sp, 9]
>     add     sp, sp, 16
>     ret
>
> for a 5 byte copy.

So... Given that I wrote the code to do the overlapping memmove back in 2014
https://gcc.gnu.org/ml/gcc-patches/2014-06/msg00585.html I'm confused about
why we want to achieve this transformation. I can see that we come out about
even for a 5 or 6 byte copy, but your patch would also force it for a 7 byte
copy, adding an extra load instruction. That said, the 5 and 6 cases probably
weren't thought through that closely.

I'd like to see a testcase which shows the advantages for CSE you describe in
your introduction to the patch; I'm not sure your testcase below gives that.

> Regtested on aarch64-none-elf and .. issues.

Do I draw a zero on the line on my screen with a sharpie or what ;-)

> Ok for trunk?

Justification and performance impact would be appreciated. Right now I'm not
convinced this is forward progress.

> gcc/
> 2018-06-19  Tamar Christina
>
>     * config/aarch64/aarch64.c (aarch64_expand_movmem): Fix mode size.
>
> gcc/testsuite/
> 2018-06-19  Tamar Christina
>
>     * gcc.target/aarch64/struct_cpy.c: New.
>
> --
> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> index bd0ac2f04d8f43fd54b58ff9581645493b8d0cd1..ed5409403741fa634d977ba15cd22741bb9d1064 100644
> --- a/gcc/config/aarch64/aarch64.c
> +++ b/gcc/config/aarch64/aarch64.c
> @@ -16180,26 +16180,29 @@ aarch64_copy_one_block_and_progress_pointers (rtx *src, rtx *dst,
>  bool
>  aarch64_expand_movmem (rtx *operands)
>  {
> -  unsigned int n;
> +  int n, mode_bits;
>    rtx dst = operands[0];
>    rtx src = operands[1];
>    rtx base;
> +  machine_mode cur_mode = BLKmode, next_mode;
>    bool speed_p = !optimize_function_for_size_p (cfun);
>
>    /* When optimizing for size, give a better estimate of the length of a
> -     memcpy call, but use the default otherwise.  */
> -  unsigned int max_instructions = (speed_p ? 15 : AARCH64_CALL_RATIO) / 2;
> +     memcpy call, but use the default otherwise.  Moves larger than 8 bytes
> +     will always require an even number of instructions to do now.  And each
> +     operation requires both a load+store, so devide the max number by 2.  */
> +  int max_num_moves = (speed_p ? 16 : AARCH64_CALL_RATIO) / 2;

Why did we go from unsigned to signed int?

Thanks,
James