On Tue, Jun 19, 2018 at 09:09:27AM -0500, Tamar Christina wrote:
> Hi All,
>
> This changes the movmem code in AArch64 that copies data of between 4 and
> 7 bytes to use the smallest possible mode capable of copying the
> remaining bytes.
>
> This means that if we're copying 5 bytes we would issue an SImode and
> QImode load instead of two SImode loads.
>
> This does smaller memory accesses but also gives the mid-end a chance to
> realise that it can CSE the loads in certain circumstances, e.g. when you
> have something like
>
> return foo;
>
> where foo is a struct. This would be transformed by the mid-end into SSA
> form as
>
> D.XXXX = foo;
>
> return D.XXXX;
>
> This movmem routine will handle the first copy, but it's usually not
> needed: the mid-end would do SImode and QImode stores into X0 for the
> 5-byte example, but without the first copies being in the same modes it
> doesn't know it doesn't need the stores at all.
>
> This will generate
>
> fun5:
>         sub     sp, sp, #16
>         adrp    x0, .LANCHOR0
>         add     x1, x0, :lo12:.LANCHOR0
>         ldr     w0, [x0, #:lo12:.LANCHOR0]
>         str     w0, [sp, 8]
>         ldrb    w0, [x1, 4]
>         strb    w0, [sp, 12]
>         add     sp, sp, 16
>         ret
>
> instead of
>
> fun5:
>         sub     sp, sp, #16
>         adrp    x0, .LANCHOR0
>         add     x1, x0, :lo12:.LANCHOR0
>         ldr     w0, [x0, #:lo12:.LANCHOR0]
>         str     w0, [sp, 8]
>         ldr     w0, [x1, 1]
>         str     w0, [sp, 9]
>         add     sp, sp, 16
>         ret
>
> for a 5-byte copy.
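
For concreteness, the kind of function under discussion is something like
the sketch below (my example, not necessarily what the struct_cpy.c
testcase in the patch contains):

  struct five { char bytes[5]; };
  struct five a, b;

  void
  fun5 (void)
  {
    /* 5-byte block copy: with the patch this expands to one SImode plus
       one QImode load/store pair rather than two overlapping SImode
       pairs.  */
    a = b;
  }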
So...

Given that I wrote the code to do the overlapping memmove back in 2014
https://gcc.gnu.org/ml/gcc-patches/2014-06/msg00585.html I'm confused about
why we want to achieve this transformation.

I can see that we come out about even for a 5 or 6 byte copy, but your
patch would also force it for a 7 byte copy, adding an extra load
instruction. That said, the 5 and 6 cases probably weren't thought through
that closely.

I'd like to see a testcase which shows the advantages for CSE you describe
in your introduction to the patch; I'm not sure your testcase below gives
that.

> Regtested on aarch64-none-elf and .. issues.

Do I draw a zero on the line on my screen with a sharpie or what ;-)

> Ok for trunk?

Justification and performance impact would be appreciated. Right now I'm
not convinced this is forward progress.

> gcc/
> 2018-06-19  Tamar Christina  <tamar.christ...@arm.com>
>
> 	* config/aarch64/aarch64.c (aarch64_expand_movmem): Fix mode size.
>
> gcc/testsuite/
> 2018-06-19  Tamar Christina  <tamar.christ...@arm.com>
>
> 	* gcc.target/aarch64/struct_cpy.c: New.
>
> --
> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> index bd0ac2f04d8f43fd54b58ff9581645493b8d0cd1..ed5409403741fa634d977ba15cd22741bb9d1064 100644
> --- a/gcc/config/aarch64/aarch64.c
> +++ b/gcc/config/aarch64/aarch64.c
> @@ -16180,26 +16180,29 @@ aarch64_copy_one_block_and_progress_pointers (rtx *src, rtx *dst,
>  bool
>  aarch64_expand_movmem (rtx *operands)
>  {
> -  unsigned int n;
> +  int n, mode_bits;
>    rtx dst = operands[0];
>    rtx src = operands[1];
>    rtx base;
> +  machine_mode cur_mode = BLKmode, next_mode;
>    bool speed_p = !optimize_function_for_size_p (cfun);
>
>    /* When optimizing for size, give a better estimate of the length of a
> -     memcpy call, but use the default otherwise.  */
> -  unsigned int max_instructions = (speed_p ? 15 : AARCH64_CALL_RATIO) / 2;
> +     memcpy call, but use the default otherwise.  Moves larger than 8 bytes
> +     will always require an even number of instructions to do now.  And each
> +     operation requires both a load+store, so divide the max number by 2.  */
> +  int max_num_moves = (speed_p ? 16 : AARCH64_CALL_RATIO) / 2;

Why did we go from unsigned to signed int?

Thanks,
James
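
P.S. To make the 7-byte arithmetic explicit, the sequences I have in mind
are roughly these (illustrative sketches, not actual compiler output):

  // Overlapping SImode accesses (current behaviour): 4 instructions.
  ldr     w2, [x1]        // bytes 0-3
  ldr     w3, [x1, 3]     // bytes 3-6, overlapping byte 3
  str     w2, [x0]
  str     w3, [x0, 3]

  // Smallest-mode accesses (with the patch): 6 instructions.
  ldr     w2, [x1]        // bytes 0-3 (SImode)
  ldrh    w3, [x1, 4]     // bytes 4-5 (HImode)
  ldrb    w4, [x1, 6]     // byte 6 (QImode)
  str     w2, [x0]
  strh    w3, [x0, 4]
  strb    w4, [x0, 6]

That's two loads and two stores against three of each.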