Re: [PATCH, ARM] Unaligned accesses for builtin memcpy [2/2]
Hi Julian,

There are a couple of minor formatting nits.

+static int
+arm_movmemqi_unaligned (rtx *operands)

+  /* Inlined memcpy using ldr/str/ldrh/strh can be quite big: try to limit
+     size of code if optimizing for size.  We'll use ldm/stm if src_aligned
+     or dst_aligned though: allow more interleaving in those cases since the
+     resulting code can be smaller.  */

Watch out for the `or' being misaligned compared to the other text.

+  /* Note that the loop created by arm_block_move_unaligned_loop may be
+     subject to loop unrolling, which makes tuning this condition a little
+     redundant.  */

Same with `redundant'.

On 13 October 2011 18:53, Julian Brown jul...@codesourcery.com wrote:
> On Wed, 28 Sep 2011 14:33:17 +0100
>
>> No, sorry, I don't have any benchmark results available at present. I
>> think we'd have to have terrifically bad luck for it to be a
>> performance degradation, though...

Hmmm, OK - but please watch out for any bug reports or any test-suite
fallout this week with multilibs other than what you might have tested.

> I re-tested the patch for good measure, in case of bitrot (and the new
> tests pass with the patch applied, of course). OK to apply now?

OK by me with those changes.

cheers
Ramana
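As an aside, the trade-off described in the first quoted comment can be sketched as a small decision function. This is a hypothetical Python model of the heuristic, not code from the patch: the function name `choose_interleave_factor` and the concrete factors 1 and 4 are illustrative assumptions (the patch only bounds the factor to the range 1..4).

```python
# Hypothetical model of the code-size heuristic in the quoted comment --
# not GCC source.  Plain ldr/str copies get big, so use minimal
# interleaving at -Os unless ldm/stm (aligned src or dst) can keep the
# interleaved copy small.
def choose_interleave_factor(optimize_size, src_aligned, dst_aligned):
    """Words loaded per group before storing (bounded to 1..4 in the patch)."""
    if optimize_size and not (src_aligned or dst_aligned):
        # Interleaved ldr/str sequences bloat the code: keep it minimal.
        return 1
    # An aligned side lets ldm/stm cover the whole group, so more
    # interleaving can make the code both faster and smaller.
    return 4

# Size-critical, fully unaligned copy: no interleaving.
print(choose_interleave_factor(True, False, False))
# Aligned destination: interleave freely.
print(choose_interleave_factor(True, False, True))
```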
Re: [PATCH, ARM] Unaligned accesses for builtin memcpy [2/2]
On Fri, Oct 14, 2011 at 6:53 AM, Julian Brown jul...@codesourcery.com wrote:
> On Wed, 28 Sep 2011 14:33:17 +0100
> Ramana Radhakrishnan ramana.radhakrish...@linaro.org wrote:
>
>> On 6 May 2011 14:13, Julian Brown jul...@codesourcery.com wrote:
>>> Hi,
>>>
>>> This is the second of two patches to add unaligned-access support to
>>> the ARM backend. It builds on the first patch to provide support for
>>> unaligned accesses when expanding block moves (i.e. for builtin
>>> memcpy operations). It makes some effort to use load/store multiple
>>> instructions where appropriate (when accessing sufficiently-aligned
>>> source or destination addresses), and also makes some effort to
>>> generate fast code (for -O1/2/3) or small code (for -Os), though
>>> some of the heuristics may need tweaking still.
>>
>> Sorry it's taken me a while to get around to this one. Do you know
>> what difference this makes to performance on some standard benchmarks
>> on, let's say, an A9 and an M4, as I see that this gets triggered
>> only when we have less than 64 bytes to copy?
>
> No, sorry, I don't have any benchmark results available at present. I
> think we'd have to have terrifically bad luck for it to be a
> performance degradation, though...

I've backported the unaligned struct and memcpy patches to our 4.6
based compilers and benchmarked them. The worst is a 0.84 % drop in
performance, the best a 7.17 % improvement, and a geomean of 0.18 %.
This was in a Cortex-A9 NEON -O3 configuration. The results are
accurate to less than 0.1 %. I'll send you and Ramana the raw results
privately.

-- Michael
Re: [PATCH, ARM] Unaligned accesses for builtin memcpy [2/2]
On Wed, 28 Sep 2011 14:33:17 +0100
Ramana Radhakrishnan ramana.radhakrish...@linaro.org wrote:

> On 6 May 2011 14:13, Julian Brown jul...@codesourcery.com wrote:
>> Hi,
>>
>> This is the second of two patches to add unaligned-access support to
>> the ARM backend. It builds on the first patch to provide support for
>> unaligned accesses when expanding block moves (i.e. for builtin
>> memcpy operations). It makes some effort to use load/store multiple
>> instructions where appropriate (when accessing sufficiently-aligned
>> source or destination addresses), and also makes some effort to
>> generate fast code (for -O1/2/3) or small code (for -Os), though some
>> of the heuristics may need tweaking still.
>
> Sorry it's taken me a while to get around to this one. Do you know
> what difference this makes to performance on some standard benchmarks
> on, let's say, an A9 and an M4, as I see that this gets triggered only
> when we have less than 64 bytes to copy?

No, sorry, I don't have any benchmark results available at present. I
think we'd have to have terrifically bad luck for it to be a
performance degradation, though...

> Please add a few testcases from the examples that you've shown here to
> be sure that ldm's are being generated in the right cases.

I've added test cases which cover copies with combinations of
aligned/unaligned sources/destinations, gated on a new
require-effective-target so the tests only run when suitable support is
available.

I re-tested the patch for good measure, in case of bitrot (and the new
tests pass with the patch applied, of course). OK to apply now?

Thanks,

Julian

ChangeLog

    gcc/
    * config/arm/arm.c (arm_block_move_unaligned_straight)
    (arm_adjust_block_mem, arm_block_move_unaligned_loop)
    (arm_movmemqi_unaligned): New.
    (arm_gen_movmemqi): Support unaligned block copies.

    gcc/testsuite/
    * lib/target-supports.exp (check_effective_target_arm_unaligned): New.
    * gcc.target/arm/unaligned-memcpy-1.c: New.
    * gcc.target/arm/unaligned-memcpy-2.c: New.
    * gcc.target/arm/unaligned-memcpy-3.c: New.
    * gcc.target/arm/unaligned-memcpy-4.c: New.

Index: gcc/testsuite/gcc.target/arm/unaligned-memcpy-3.c
===================================================================
--- gcc/testsuite/gcc.target/arm/unaligned-memcpy-3.c	(revision 0)
+++ gcc/testsuite/gcc.target/arm/unaligned-memcpy-3.c	(revision 0)
@@ -0,0 +1,21 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target arm_unaligned } */
+/* { dg-options "-O2" } */
+
+#include <string.h>
+
+char src[16];
+
+void aligned_src (char *dest)
+{
+  memcpy (dest, src, 15);
+}
+
+/* Expect a multi-word load for the main part of the copy, but subword
+   loads/stores for the remainder.  */
+
+/* { dg-final { scan-assembler-times "ldmia" 1 } } */
+/* { dg-final { scan-assembler-times "ldrh" 1 } } */
+/* { dg-final { scan-assembler-times "strh" 1 } } */
+/* { dg-final { scan-assembler-times "ldrb" 1 } } */
+/* { dg-final { scan-assembler-times "strb" 1 } } */
Index: gcc/testsuite/gcc.target/arm/unaligned-memcpy-4.c
===================================================================
--- gcc/testsuite/gcc.target/arm/unaligned-memcpy-4.c	(revision 0)
+++ gcc/testsuite/gcc.target/arm/unaligned-memcpy-4.c	(revision 0)
@@ -0,0 +1,18 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target arm_unaligned } */
+/* { dg-options "-O2" } */
+
+#include <string.h>
+
+char src[16];
+char dest[16];
+
+void aligned_both (void)
+{
+  memcpy (dest, src, 15);
+}
+
+/* We know both src and dest to be aligned: expect multiword
+   loads/stores.  */
+
+/* { dg-final { scan-assembler-times "ldmia" 1 } } */
+/* { dg-final { scan-assembler-times "stmia" 1 } } */
Index: gcc/testsuite/gcc.target/arm/unaligned-memcpy-1.c
===================================================================
--- gcc/testsuite/gcc.target/arm/unaligned-memcpy-1.c	(revision 0)
+++ gcc/testsuite/gcc.target/arm/unaligned-memcpy-1.c	(revision 0)
@@ -0,0 +1,19 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target arm_unaligned } */
+/* { dg-options "-O2" } */
+
+#include <string.h>
+
+void unknown_alignment (char *dest, char *src)
+{
+  memcpy (dest, src, 15);
+}
+
+/* We should see three unaligned word loads and store pairs, one unaligned
+   ldrh/strh pair, and an ldrb/strb pair.  Sanity check that.  */
+
+/* { dg-final { scan-assembler-times "@ unaligned" 8 } } */
+/* { dg-final { scan-assembler-times "ldrh" 1 } } */
+/* { dg-final { scan-assembler-times "strh" 1 } } */
+/* { dg-final { scan-assembler-times "ldrb" 1 } } */
+/* { dg-final { scan-assembler-times "strb" 1 } } */
Index: gcc/testsuite/gcc.target/arm/unaligned-memcpy-2.c
===================================================================
--- gcc/testsuite/gcc.target/arm/unaligned-memcpy-2.c	(revision 0)
+++ gcc/testsuite/gcc.target/arm/unaligned-memcpy-2.c	(revision 0)
@@ -0,0 +1,21 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target arm_unaligned } */
+/* { dg-options "-O2" } */
+
+#include <string.h>
+
+char dest[16];
+
+void
Re: [PATCH, ARM] Unaligned accesses for builtin memcpy [2/2]
On 6 May 2011 14:13, Julian Brown jul...@codesourcery.com wrote:
> Hi,
>
> This is the second of two patches to add unaligned-access support to
> the ARM backend. It builds on the first patch to provide support for
> unaligned accesses when expanding block moves (i.e. for builtin memcpy
> operations). It makes some effort to use load/store multiple
> instructions where appropriate (when accessing sufficiently-aligned
> source or destination addresses), and also makes some effort to
> generate fast code (for -O1/2/3) or small code (for -Os), though some
> of the heuristics may need tweaking still.

Sorry it's taken me a while to get around to this one. Do you know what
difference this makes to performance on some standard benchmarks on,
let's say, an A9 and an M4, as I see that this gets triggered only when
we have less than 64 bytes to copy?

Please add a few testcases from the examples that you've shown here to
be sure that ldm's are being generated in the right cases.

cheers
Ramana
Re: [PATCH, ARM] Unaligned accesses for builtin memcpy [2/2]
On Fri, 6 May 2011 14:13:32 +0100
Julian Brown jul...@codesourcery.com wrote:

> Hi,
>
> This is the second of two patches to add unaligned-access support to
> the ARM backend. It builds on the first patch to provide support for
> unaligned accesses when expanding block moves (i.e. for builtin memcpy
> operations). It makes some effort to use load/store multiple
> instructions where appropriate (when accessing sufficiently-aligned
> source or destination addresses), and also makes some effort to
> generate fast code (for -O1/2/3) or small code (for -Os), though some
> of the heuristics may need tweaking still.

The preceding patch to this one has now been approved & applied:

http://gcc.gnu.org/ml/gcc-patches/2011-09/msg00575.html

FAOD, is this one OK too?

Thanks,

Julian
[PATCH, ARM] Unaligned accesses for builtin memcpy [2/2]
Hi,

This is the second of two patches to add unaligned-access support to
the ARM backend. It builds on the first patch to provide support for
unaligned accesses when expanding block moves (i.e. for builtin memcpy
operations). It makes some effort to use load/store multiple
instructions where appropriate (when accessing sufficiently-aligned
source or destination addresses), and also makes some effort to
generate fast code (for -O1/2/3) or small code (for -Os), though some
of the heuristics may need tweaking still.

Examples:

#include <string.h>

void foo (char *dest, char *src)
{
  memcpy (dest, src, AMOUNT);
}

char known[64];

void dst_aligned (char *src)
{
  memcpy (known, src, AMOUNT);
}

void src_aligned (char *dst)
{
  memcpy (dst, known, AMOUNT);
}

For -mcpu=cortex-m4 -mthumb -O2 -DAMOUNT=15 we get:

foo:
	ldr	r2, [r1, #4]	@ unaligned
	ldr	r3, [r1, #8]	@ unaligned
	push	{r4}
	ldr	r4, [r1, #0]	@ unaligned
	str	r2, [r0, #4]	@ unaligned
	str	r4, [r0, #0]	@ unaligned
	str	r3, [r0, #8]	@ unaligned
	ldrh	r2, [r1, #12]	@ unaligned
	ldrb	r3, [r1, #14]	@ zero_extendqisi2
	strh	r2, [r0, #12]	@ unaligned
	strb	r3, [r0, #14]
	pop	{r4}
	bx	lr

dst_aligned:
	push	{r4}
	mov	r4, r0
	movw	r3, #:lower16:known
	ldr	r1, [r4, #4]	@ unaligned
	ldr	r2, [r4, #8]	@ unaligned
	ldr	r0, [r0, #0]	@ unaligned
	movt	r3, #:upper16:known
	stmia	r3!, {r0, r1, r2}
	ldrh	r1, [r4, #12]	@ unaligned
	ldrb	r2, [r4, #14]	@ zero_extendqisi2
	strh	r1, [r3, #0]	@ unaligned
	strb	r2, [r3, #2]
	pop	{r4}
	bx	lr

src_aligned:
	push	{r4}
	movw	r3, #:lower16:known
	movt	r3, #:upper16:known
	mov	r4, r0
	ldmia	r3!, {r0, r1, r2}
	str	r0, [r4, #0]	@ unaligned
	str	r1, [r4, #4]	@ unaligned
	str	r2, [r4, #8]	@ unaligned
	ldrh	r2, [r3, #0]	@ unaligned
	ldrb	r3, [r3, #2]	@ zero_extendqisi2
	strh	r2, [r4, #12]	@ unaligned
	strb	r3, [r4, #14]
	pop	{r4}
	bx	lr

Whereas for -mcpu=cortex-m4 -mthumb -Os -DAMOUNT=15, e.g.:

foo:
	add	r3, r1, #12
.L2:
	ldr	r2, [r1], #4	@ unaligned
	cmp	r1, r3
	str	r2, [r0], #4	@ unaligned
	bne	.L2
	ldrh	r3, [r1, #0]	@ unaligned
	strh	r3, [r0, #0]	@ unaligned
	ldrb	r3, [r1, #2]	@ zero_extendqisi2
	strb	r3, [r0, #2]
	bx	lr

Tested (alongside the first patch) with cross to ARM Linux. OK to apply?

Thanks,

Julian

ChangeLog

    gcc/
    * config/arm/arm.c (arm_block_move_unaligned_straight)
    (arm_adjust_block_mem, arm_block_move_unaligned_loop)
    (arm_movmemqi_unaligned): New.
    (arm_gen_movmemqi): Support unaligned block copies.

commit 16973f69fce37a2b347ea7daffd6f593aba843d5
Author: Julian Brown jul...@henry7.codesourcery.com
Date:   Wed May 4 11:26:01 2011 -0700

    Optimize block moves when unaligned accesses are permitted.

diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
index a18aea6..b6df0d3 100644
--- a/gcc/config/arm/arm.c
+++ b/gcc/config/arm/arm.c
@@ -10362,6 +10362,335 @@ gen_const_stm_seq (rtx *operands, int nops)
   return true;
 }
 
+/* Copy a block of memory using plain ldr/str/ldrh/strh instructions, to
+   permit unaligned copies on processors which support unaligned semantics
+   for those instructions.  INTERLEAVE_FACTOR can be used to attempt to
+   hide load latency (using more registers) by doing e.g.
+   load/load/store/store for a factor of 2.  An interleave factor of 1
+   (the minimum) will perform no interleaving.  Load/store multiple are
+   used for aligned addresses where possible.  */
+
+static void
+arm_block_move_unaligned_straight (rtx dstbase, rtx srcbase,
+				   HOST_WIDE_INT length,
+				   unsigned int interleave_factor)
+{
+  rtx *regs = XALLOCAVEC (rtx, interleave_factor);
+  int *regnos = XALLOCAVEC (int, interleave_factor);
+  HOST_WIDE_INT block_size_bytes = interleave_factor * UNITS_PER_WORD;
+  HOST_WIDE_INT i, j;
+  HOST_WIDE_INT remaining = length, words;
+  rtx halfword_tmp = NULL, byte_tmp = NULL;
+  rtx dst, src;
+  bool src_aligned = MEM_ALIGN (srcbase) >= BITS_PER_WORD;
+  bool dst_aligned = MEM_ALIGN (dstbase) >= BITS_PER_WORD;
+  HOST_WIDE_INT srcoffset, dstoffset;
+  HOST_WIDE_INT src_autoinc, dst_autoinc;
+  rtx mem, addr;
+
+  gcc_assert (1 <= interleave_factor && interleave_factor <= 4);
+
+  /* Use hard registers if we have aligned source or destination so we can use
+     load/store multiple with contiguous registers.  */
+  if (dst_aligned || src_aligned)
+    for (i = 0; i < interleave_factor; i++)
+      regs[i] = gen_rtx_REG (SImode, i);
+  else
+    for (i =
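To make the copy order concrete, here is a small executable model (plain Python, not the GCC RTL expander) of the straight-line scheme the comment above describes: load `interleave_factor` words, then store them all, then mop up leftover words and the halfword/byte tail, mirroring the ldr/str/ldrh/strh/ldrb/strb sequence in the generated code. The helper name `block_move_straight` is my own.

```python
# Model of arm_block_move_unaligned_straight's copy order, operating on
# Python lists of bytes instead of RTL.  Hypothetical helper, for
# illustration only.
def block_move_straight(dst, src, length, interleave_factor):
    """Copy length bytes from src into dst, in groups of interleave_factor words."""
    word = 4  # UNITS_PER_WORD on ARM
    offset = 0
    remaining = length
    while remaining >= interleave_factor * word:
        # Load every word of the group first (this is what hides load
        # latency: load/load/store/store for a factor of 2)...
        regs = [src[offset + i * word : offset + (i + 1) * word]
                for i in range(interleave_factor)]
        # ...then store them all back.
        for i, r in enumerate(regs):
            dst[offset + i * word : offset + (i + 1) * word] = r
        offset += interleave_factor * word
        remaining -= interleave_factor * word
    # Leftover whole words, one ldr/str pair at a time.
    while remaining >= word:
        dst[offset : offset + word] = src[offset : offset + word]
        offset += word
        remaining -= word
    # Halfword remainder (ldrh/strh).
    if remaining >= 2:
        dst[offset : offset + 2] = src[offset : offset + 2]
        offset += 2
        remaining -= 2
    # Byte remainder (ldrb/strb).
    if remaining:
        dst[offset] = src[offset]

# 15-byte copy with an interleave factor of 2, as in the -O2 example above.
src = list(range(15))
dst = [0] * 15
block_move_straight(dst, src, 15, 2)
print(dst == src)
```

Running this for a 15-byte copy performs one 2-word group, one leftover word, a halfword, and a byte, matching the three word accesses plus ldrh/ldrb tail seen in the `foo` assembly.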