Re: [PATCH, ARM] Unaligned accesses for builtin memcpy [2/2]

2011-10-17 Thread Ramana Radhakrishnan
Hi Julian,


There are a couple of minor formatting nits.

+static int
+arm_movmemqi_unaligned (rtx *operands)



+  /* Inlined memcpy using ldr/str/ldrh/strh can be quite big: try to limit
+ size of code if optimizing for size.  We'll use ldm/stm if src_aligned
+   or dst_aligned though: allow more interleaving in those cases since the
+   resulting code can be smaller.  */

Watch out for the `or' being misaligned compared to the other text.

+  /* Note that the loop created by arm_block_move_unaligned_loop may be
+ subject to loop unrolling, which makes tuning this condition a little
+   redundant.  */

Same with `redundant'

On 13 October 2011 18:53, Julian Brown jul...@codesourcery.com wrote:
 On Wed, 28 Sep 2011 14:33:17 +0100

 No, sorry, I don't have any benchmark results available at present. I
 think we'd have to have terrifically bad luck for it to be a
 performance degradation, though...

Hmmm, OK - but please watch out for any bug reports or test-suite fallout
this week with multilibs other than the ones you might have tested.


 I re-tested the patch for good measure, in case of bitrot (and the new
 tests pass with the patch applied, of course). OK to apply now?

OK by me with those changes.

cheers
Ramana


Re: [PATCH, ARM] Unaligned accesses for builtin memcpy [2/2]

2011-10-17 Thread Michael Hope
On Fri, Oct 14, 2011 at 6:53 AM, Julian Brown jul...@codesourcery.com wrote:
 On Wed, 28 Sep 2011 14:33:17 +0100
 Ramana Radhakrishnan ramana.radhakrish...@linaro.org wrote:

 On 6 May 2011 14:13, Julian Brown jul...@codesourcery.com wrote:
  Hi,
 
  This is the second of two patches to add unaligned-access support to
  the ARM backend. It builds on the first patch to provide support for
  unaligned accesses when expanding block moves (i.e. for builtin
  memcpy operations). It makes some effort to use load/store multiple
  instructions where appropriate (when accessing sufficiently-aligned
  source or destination addresses), and also makes some effort to
  generate fast code (for -O1/2/3) or small code (for -Os), though
  some of the heuristics may need tweaking still.

 Sorry it's taken me a while to get around to this one. Do you know
 what difference this makes to performance on some standard benchmarks
 on, let's say, an A9 and an M4, as I see that this gets triggered only
 when we have less than 64 bytes to copy?

 No, sorry, I don't have any benchmark results available at present. I
 think we'd have to have terrifically bad luck for it to be a
 performance degradation, though...

I've backported the unaligned struct and memcpy patches to our 4.6-based
compilers and benchmarked them.  The worst result is a 0.84 % drop in
performance, the best a 7.17 % improvement, and the geomean is a 0.18 %
improvement.  This was in a Cortex-A9 NEON -O3 configuration.  The results
are accurate to within 0.1 %.
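
For what it's worth, a geomean improvement like the 0.18 % above can be
reproduced from per-benchmark speedup ratios as the exponential of the mean
of their logs.  The sketch below uses invented ratios purely to show the
arithmetic; it is not the real data:

/* Hypothetical sketch: computes a geometric-mean improvement from
   per-benchmark speedup ratios (old_time / new_time, so values above
   1.0 are improvements).  The numbers are made up for illustration.  */
#include <math.h>
#include <stdio.h>

int main (void)
{
  double ratios[] = { 0.9916, 1.0717, 1.0005, 0.9990 };
  size_t n = sizeof ratios / sizeof ratios[0];
  double log_sum = 0.0;

  for (size_t i = 0; i < n; i++)
    log_sum += log (ratios[i]);

  double geomean = exp (log_sum / n);
  printf ("geomean improvement: %.2f %%\n", (geomean - 1.0) * 100.0);
  return 0;
}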

I'll send you and Ramana the raw results privately.

-- Michael


Re: [PATCH, ARM] Unaligned accesses for builtin memcpy [2/2]

2011-10-13 Thread Julian Brown
On Wed, 28 Sep 2011 14:33:17 +0100
Ramana Radhakrishnan ramana.radhakrish...@linaro.org wrote:

 On 6 May 2011 14:13, Julian Brown jul...@codesourcery.com wrote:
  Hi,
 
  This is the second of two patches to add unaligned-access support to
  the ARM backend. It builds on the first patch to provide support for
  unaligned accesses when expanding block moves (i.e. for builtin
  memcpy operations). It makes some effort to use load/store multiple
  instructions where appropriate (when accessing sufficiently-aligned
  source or destination addresses), and also makes some effort to
  generate fast code (for -O1/2/3) or small code (for -Os), though
  some of the heuristics may need tweaking still.
 
 Sorry it's taken me a while to get around to this one. Do you know
 what difference this makes to performance on some standard benchmarks
 on, let's say, an A9 and an M4, as I see that this gets triggered only
 when we have less than 64 bytes to copy?

No, sorry, I don't have any benchmark results available at present. I
think we'd have to have terrifically bad luck for it to be a
performance degradation, though...

 Please add a few testcases from the examples that you've shown here to
 be sure that ldm's are being generated in the right cases.

I've added test cases which cover copies with combinations of
aligned/unaligned sources/destinations, gated on a new
require-effective-target so the tests only run when suitable support is
available.
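
As a rough sketch of what that gating amounts to (an assumption on my part,
not a quote from the patch), such an effective-target check only needs a
translation unit that fails to compile unless the selected multilib
advertises unaligned support, e.g. via the __ARM_FEATURE_UNALIGNED
predefine:

/* Hypothetical probe for an arm_unaligned effective-target check: the
   compile fails unless unaligned accesses are enabled for this multilib.  */
#ifndef __ARM_FEATURE_UNALIGNED
#error unaligned accesses are not enabled
#endif
int arm_unaligned_probe;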

I re-tested the patch for good measure, in case of bitrot (and the new
tests pass with the patch applied, of course). OK to apply now?

Thanks,

Julian

ChangeLog

gcc/
* config/arm/arm.c (arm_block_move_unaligned_straight)
(arm_adjust_block_mem, arm_block_move_unaligned_loop)
(arm_movmemqi_unaligned): New.
(arm_gen_movmemqi): Support unaligned block copies.

gcc/testsuite/
* lib/target-supports.exp (check_effective_target_arm_unaligned):
New.
* gcc.target/arm/unaligned-memcpy-1.c: New.
* gcc.target/arm/unaligned-memcpy-2.c: New.
* gcc.target/arm/unaligned-memcpy-3.c: New.
* gcc.target/arm/unaligned-memcpy-4.c: New.
Index: gcc/testsuite/gcc.target/arm/unaligned-memcpy-3.c
===
--- gcc/testsuite/gcc.target/arm/unaligned-memcpy-3.c	(revision 0)
+++ gcc/testsuite/gcc.target/arm/unaligned-memcpy-3.c	(revision 0)
@@ -0,0 +1,21 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target arm_unaligned } */
+/* { dg-options "-O2" } */
+
+#include <string.h>
+
+char src[16];
+
+void aligned_src (char *dest)
+{
+  memcpy (dest, src, 15);
+}
+
+/* Expect a multi-word load for the main part of the copy, but subword
+   loads/stores for the remainder.  */
+
+/* { dg-final { scan-assembler-times "ldmia" 1 } } */
+/* { dg-final { scan-assembler-times "ldrh" 1 } } */
+/* { dg-final { scan-assembler-times "strh" 1 } } */
+/* { dg-final { scan-assembler-times "ldrb" 1 } } */
+/* { dg-final { scan-assembler-times "strb" 1 } } */
Index: gcc/testsuite/gcc.target/arm/unaligned-memcpy-4.c
===
--- gcc/testsuite/gcc.target/arm/unaligned-memcpy-4.c	(revision 0)
+++ gcc/testsuite/gcc.target/arm/unaligned-memcpy-4.c	(revision 0)
@@ -0,0 +1,18 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target arm_unaligned } */
+/* { dg-options "-O2" } */
+
+#include <string.h>
+
+char src[16];
+char dest[16];
+
+void aligned_both (void)
+{
+  memcpy (dest, src, 15);
+}
+
+/* We know both src and dest to be aligned: expect multiword loads/stores.  */
+
+/* { dg-final { scan-assembler-times "ldmia" 1 } } */
+/* { dg-final { scan-assembler-times "stmia" 1 } } */
Index: gcc/testsuite/gcc.target/arm/unaligned-memcpy-1.c
===
--- gcc/testsuite/gcc.target/arm/unaligned-memcpy-1.c	(revision 0)
+++ gcc/testsuite/gcc.target/arm/unaligned-memcpy-1.c	(revision 0)
@@ -0,0 +1,19 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target arm_unaligned } */
+/* { dg-options "-O2" } */
+
+#include <string.h>
+
+void unknown_alignment (char *dest, char *src)
+{
+  memcpy (dest, src, 15);
+}
+
+/* We should see three unaligned word loads and store pairs, one unaligned
+   ldrh/strh pair, and an ldrb/strb pair.  Sanity check that.  */
+
+/* { dg-final { scan-assembler-times "@ unaligned" 8 } } */
+/* { dg-final { scan-assembler-times "ldrh" 1 } } */
+/* { dg-final { scan-assembler-times "strh" 1 } } */
+/* { dg-final { scan-assembler-times "ldrb" 1 } } */
+/* { dg-final { scan-assembler-times "strb" 1 } } */
Index: gcc/testsuite/gcc.target/arm/unaligned-memcpy-2.c
===
--- gcc/testsuite/gcc.target/arm/unaligned-memcpy-2.c	(revision 0)
+++ gcc/testsuite/gcc.target/arm/unaligned-memcpy-2.c	(revision 0)
@@ -0,0 +1,21 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target arm_unaligned } */
+/* { dg-options "-O2" } */
+
+#include <string.h>
+
+char dest[16];
+
+void 

Re: [PATCH, ARM] Unaligned accesses for builtin memcpy [2/2]

2011-09-28 Thread Ramana Radhakrishnan
On 6 May 2011 14:13, Julian Brown jul...@codesourcery.com wrote:
 Hi,

 This is the second of two patches to add unaligned-access support to
 the ARM backend. It builds on the first patch to provide support for
 unaligned accesses when expanding block moves (i.e. for builtin memcpy
 operations). It makes some effort to use load/store multiple
 instructions where appropriate (when accessing sufficiently-aligned
 source or destination addresses), and also makes some effort to
 generate fast code (for -O1/2/3) or small code (for -Os), though some
 of the heuristics may need tweaking still.

Sorry it's taken me a while to get around to this one. Do you know
what difference this makes to performance on some standard benchmarks
on, let's say, an A9 and an M4, as I see that this gets triggered only
when we have less than 64 bytes to copy?
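
To make that cutoff concrete, here is a small illustrative example (the
function names and sizes are hypothetical, not taken from the patch): with
a 64-byte limit on inline expansion, the first copy below is a candidate
for the new inline code, while the second is expected to remain a call to
the library memcpy.

#include <string.h>

/* Hypothetical example: sizes chosen to sit on either side of the
   64-byte inline-expansion cutoff discussed above.  */

void small_copy (char *dest, const char *src)
{
  memcpy (dest, src, 48);   /* below the cutoff: may be expanded inline */
}

void large_copy (char *dest, const char *src)
{
  memcpy (dest, src, 256);  /* above the cutoff: likely stays a libcall */
}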

Please add a few testcases from the examples that you've shown here to
be sure that ldm's are being generated in the right cases.

cheers
Ramana


Re: [PATCH, ARM] Unaligned accesses for builtin memcpy [2/2]

2011-09-14 Thread Julian Brown
On Fri, 6 May 2011 14:13:32 +0100
Julian Brown jul...@codesourcery.com wrote:

 Hi,
 
 This is the second of two patches to add unaligned-access support to
 the ARM backend. It builds on the first patch to provide support for
 unaligned accesses when expanding block moves (i.e. for builtin memcpy
 operations). It makes some effort to use load/store multiple
 instructions where appropriate (when accessing sufficiently-aligned
 source or destination addresses), and also makes some effort to
 generate fast code (for -O1/2/3) or small code (for -Os), though some
 of the heuristics may need tweaking still.

The preceding patch to this one has now been approved & applied:

http://gcc.gnu.org/ml/gcc-patches/2011-09/msg00575.html

FAOD, is this one OK too?

Thanks,

Julian


[PATCH, ARM] Unaligned accesses for builtin memcpy [2/2]

2011-05-06 Thread Julian Brown
Hi,

This is the second of two patches to add unaligned-access support to
the ARM backend. It builds on the first patch to provide support for
unaligned accesses when expanding block moves (i.e. for builtin memcpy
operations). It makes some effort to use load/store multiple
instructions where appropriate (when accessing sufficiently-aligned
source or destination addresses), and also makes some effort to
generate fast code (for -O1/2/3) or small code (for -Os), though some
of the heuristics may need tweaking still.

Examples:

#include <string.h>

void foo (char *dest, char *src)
{
  memcpy (dest, src, AMOUNT);
}

char known[64];

void dst_aligned (char *src)
{
  memcpy (known, src, AMOUNT);
}

void src_aligned (char *dst)
{
  memcpy (dst, known, AMOUNT);
}

For -mcpu=cortex-m4 -mthumb -O2 -DAMOUNT=15 we get:

foo:
        ldr     r2, [r1, #4]    @ unaligned
        ldr     r3, [r1, #8]    @ unaligned
        push    {r4}
        ldr     r4, [r1, #0]    @ unaligned
        str     r2, [r0, #4]    @ unaligned
        str     r4, [r0, #0]    @ unaligned
        str     r3, [r0, #8]    @ unaligned
        ldrh    r2, [r1, #12]   @ unaligned
        ldrb    r3, [r1, #14]   @ zero_extendqisi2
        strh    r2, [r0, #12]   @ unaligned
        strb    r3, [r0, #14]
        pop     {r4}
        bx      lr

dst_aligned:
        push    {r4}
        mov     r4, r0
        movw    r3, #:lower16:known
        ldr     r1, [r4, #4]    @ unaligned
        ldr     r2, [r4, #8]    @ unaligned
        ldr     r0, [r0, #0]    @ unaligned
        movt    r3, #:upper16:known
        stmia   r3!, {r0, r1, r2}
        ldrh    r1, [r4, #12]   @ unaligned
        ldrb    r2, [r4, #14]   @ zero_extendqisi2
        strh    r1, [r3, #0]    @ unaligned
        strb    r2, [r3, #2]
        pop     {r4}
        bx      lr

src_aligned:
        push    {r4}
        movw    r3, #:lower16:known
        movt    r3, #:upper16:known
        mov     r4, r0
        ldmia   r3!, {r0, r1, r2}
        str     r0, [r4, #0]    @ unaligned
        str     r1, [r4, #4]    @ unaligned
        str     r2, [r4, #8]    @ unaligned
        ldrh    r2, [r3, #0]    @ unaligned
        ldrb    r3, [r3, #2]    @ zero_extendqisi2
        strh    r2, [r4, #12]   @ unaligned
        strb    r3, [r4, #14]
        pop     {r4}
        bx      lr

Whereas for -mcpu=cortex-m4 -mthumb -Os -DAMOUNT=15, e.g.:

foo:
        add     r3, r1, #12
.L2:
        ldr     r2, [r1], #4    @ unaligned
        cmp     r1, r3
        str     r2, [r0], #4    @ unaligned
        bne     .L2
        ldrh    r3, [r1, #0]    @ unaligned
        strh    r3, [r0, #0]    @ unaligned
        ldrb    r3, [r1, #2]    @ zero_extendqisi2
        strb    r3, [r0, #2]
        bx      lr

Tested (alongside the first patch) with cross to ARM Linux. OK to apply?

Thanks,

Julian

ChangeLog

gcc/
* config/arm/arm.c (arm_block_move_unaligned_straight)
(arm_adjust_block_mem, arm_block_move_unaligned_loop)
(arm_movmemqi_unaligned): New.
(arm_gen_movmemqi): Support unaligned block copies.
commit 16973f69fce37a2b347ea7daffd6f593aba843d5
Author: Julian Brown jul...@henry7.codesourcery.com
Date:   Wed May 4 11:26:01 2011 -0700

Optimize block moves when unaligned accesses are permitted.

diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
index a18aea6..b6df0d3 100644
--- a/gcc/config/arm/arm.c
+++ b/gcc/config/arm/arm.c
@@ -10362,6 +10362,335 @@ gen_const_stm_seq (rtx *operands, int nops)
   return true;
 }
 
+/* Copy a block of memory using plain ldr/str/ldrh/strh instructions, to permit
+   unaligned copies on processors which support unaligned semantics for those
+   instructions.  INTERLEAVE_FACTOR can be used to attempt to hide load latency
+   (using more registers) by doing e.g. load/load/store/store for a factor of 2.
+   An interleave factor of 1 (the minimum) will perform no interleaving. 
+   Load/store multiple are used for aligned addresses where possible.  */
+
+static void
+arm_block_move_unaligned_straight (rtx dstbase, rtx srcbase,
+   HOST_WIDE_INT length,
+   unsigned int interleave_factor)
+{
+  rtx *regs = XALLOCAVEC (rtx, interleave_factor);
+  int *regnos = XALLOCAVEC (int, interleave_factor);
+  HOST_WIDE_INT block_size_bytes = interleave_factor * UNITS_PER_WORD;
+  HOST_WIDE_INT i, j;
+  HOST_WIDE_INT remaining = length, words;
+  rtx halfword_tmp = NULL, byte_tmp = NULL;
+  rtx dst, src;
+  bool src_aligned = MEM_ALIGN (srcbase) >= BITS_PER_WORD;
+  bool dst_aligned = MEM_ALIGN (dstbase) >= BITS_PER_WORD;
+  HOST_WIDE_INT srcoffset, dstoffset;
+  HOST_WIDE_INT src_autoinc, dst_autoinc;
+  rtx mem, addr;
+  
+  gcc_assert (1 <= interleave_factor && interleave_factor <= 4);
+  
+  /* Use hard registers if we have aligned source or destination so we can use
+ load/store multiple with contiguous registers.  */
+  if (dst_aligned || src_aligned)
+    for (i = 0; i < interleave_factor; i++)
+      regs[i] = gen_rtx_REG (SImode, i);
+  else
+    for (i =