[PATCH, ARM] Unaligned accesses for builtin memcpy [2/2]

2011-05-06 Thread Julian Brown
Hi,

This is the second of two patches to add unaligned-access support to
the ARM backend. It builds on the first patch to provide support for
unaligned accesses when expanding block moves (i.e. for builtin memcpy
operations). It makes some effort to use load/store multiple
instructions where appropriate (when accessing sufficiently-aligned
source or destination addresses), and also makes some effort to
generate fast code (for -O1/2/3) or small code (for -Os), though some
of the heuristics may need tweaking still.

Examples:

#include <string.h>

void foo (char *dest, char *src)
{
  memcpy (dest, src, AMOUNT);
}

char known[64];

void dst_aligned (char *src)
{
  memcpy (known, src, AMOUNT);
}

void src_aligned (char *dst)
{
  memcpy (dst, known, AMOUNT);
}

For -mcpu=cortex-m4 -mthumb -O2 -DAMOUNT=15 we get:

foo:
        ldr     r2, [r1, #4]    @ unaligned
        ldr     r3, [r1, #8]    @ unaligned
        push    {r4}
        ldr     r4, [r1, #0]    @ unaligned
        str     r2, [r0, #4]    @ unaligned
        str     r4, [r0, #0]    @ unaligned
        str     r3, [r0, #8]    @ unaligned
        ldrh    r2, [r1, #12]   @ unaligned
        ldrb    r3, [r1, #14]   @ zero_extendqisi2
        strh    r2, [r0, #12]   @ unaligned
        strb    r3, [r0, #14]
        pop     {r4}
        bx      lr

dst_aligned:
        push    {r4}
        mov     r4, r0
        movw    r3, #:lower16:known
        ldr     r1, [r4, #4]    @ unaligned
        ldr     r2, [r4, #8]    @ unaligned
        ldr     r0, [r0, #0]    @ unaligned
        movt    r3, #:upper16:known
        stmia   r3!, {r0, r1, r2}
        ldrh    r1, [r4, #12]   @ unaligned
        ldrb    r2, [r4, #14]   @ zero_extendqisi2
        strh    r1, [r3, #0]    @ unaligned
        strb    r2, [r3, #2]
        pop     {r4}
        bx      lr

src_aligned:
        push    {r4}
        movw    r3, #:lower16:known
        movt    r3, #:upper16:known
        mov     r4, r0
        ldmia   r3!, {r0, r1, r2}
        str     r0, [r4, #0]    @ unaligned
        str     r1, [r4, #4]    @ unaligned
        str     r2, [r4, #8]    @ unaligned
        ldrh    r2, [r3, #0]    @ unaligned
        ldrb    r3, [r3, #2]    @ zero_extendqisi2
        strh    r2, [r4, #12]   @ unaligned
        strb    r3, [r4, #14]
        pop     {r4}
        bx      lr

Whereas for -mcpu=cortex-m4 -mthumb -Os -DAMOUNT=15, e.g.:

foo:
        add     r3, r1, #12
.L2:
        ldr     r2, [r1], #4    @ unaligned
        cmp     r1, r3
        str     r2, [r0], #4    @ unaligned
        bne     .L2
        ldrh    r3, [r1, #0]    @ unaligned
        strh    r3, [r0, #0]    @ unaligned
        ldrb    r3, [r1, #2]    @ zero_extendqisi2
        strb    r3, [r0, #2]
        bx      lr
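
As a rough illustration (not part of the patch), the access pattern in the listings above can be modelled in portable C: whole words first, then at most one halfword, then at most one byte. The function names copy_words_then_tail and count_accesses are hypothetical, and each fixed-size memcpy merely stands in for a single, possibly unaligned, load or store.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of the straight-line expansion: copy LEN bytes as
   4-byte words, then at most one 2-byte halfword, then at most one byte.
   Each fixed-size memcpy stands in for one ldr/str, ldrh/strh pair.  */
void
copy_words_then_tail (unsigned char *dst, const unsigned char *src,
                      size_t len)
{
  size_t off = 0;

  assert (dst != NULL && src != NULL);

  while (len - off >= 4)
    {
      uint32_t w;
      memcpy (&w, src + off, 4);
      memcpy (dst + off, &w, 4);
      off += 4;
    }
  if (len - off >= 2)
    {
      uint16_t h;
      memcpy (&h, src + off, 2);
      memcpy (dst + off, &h, 2);
      off += 2;
    }
  if (len - off >= 1)
    dst[off] = src[off];
}

/* Number of word, halfword and byte copies the model uses for LEN bytes.  */
void
count_accesses (size_t len, size_t *words, size_t *halfwords, size_t *bytes)
{
  *words = len / 4;
  *halfwords = (len % 4) / 2;
  *bytes = len % 2;
}
```

For AMOUNT=15 this model gives three word copies, one halfword copy and one byte copy, which matches the ldr/ldrh/ldrb counts in the -O2 listing for foo.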

Tested (alongside the first patch) with cross to ARM Linux. OK to apply?

Thanks,

Julian

ChangeLog

gcc/
* config/arm/arm.c (arm_block_move_unaligned_straight)
(arm_adjust_block_mem, arm_block_move_unaligned_loop)
(arm_movmemqi_unaligned): New.
(arm_gen_movmemqi): Support unaligned block copies.
commit 16973f69fce37a2b347ea7daffd6f593aba843d5
Author: Julian Brown jul...@henry7.codesourcery.com
Date:   Wed May 4 11:26:01 2011 -0700

Optimize block moves when unaligned accesses are permitted.

diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
index a18aea6..b6df0d3 100644
--- a/gcc/config/arm/arm.c
+++ b/gcc/config/arm/arm.c
@@ -10362,6 +10362,335 @@ gen_const_stm_seq (rtx *operands, int nops)
   return true;
 }
 
+/* Copy a block of memory using plain ldr/str/ldrh/strh instructions, to permit
+   unaligned copies on processors which support unaligned semantics for those
+   instructions.  INTERLEAVE_FACTOR can be used to attempt to hide load latency
+   (using more registers) by doing e.g. load/load/store/store for a factor of 2.
+   An interleave factor of 1 (the minimum) will perform no interleaving. 
+   Load/store multiple are used for aligned addresses where possible.  */
+
+static void
+arm_block_move_unaligned_straight (rtx dstbase, rtx srcbase,
+   HOST_WIDE_INT length,
+   unsigned int interleave_factor)
+{
+  rtx *regs = XALLOCAVEC (rtx, interleave_factor);
+  int *regnos = XALLOCAVEC (int, interleave_factor);
+  HOST_WIDE_INT block_size_bytes = interleave_factor * UNITS_PER_WORD;
+  HOST_WIDE_INT i, j;
+  HOST_WIDE_INT remaining = length, words;
+  rtx halfword_tmp = NULL, byte_tmp = NULL;
+  rtx dst, src;
+  bool src_aligned = MEM_ALIGN (srcbase) >= BITS_PER_WORD;
+  bool dst_aligned = MEM_ALIGN (dstbase) >= BITS_PER_WORD;
+  HOST_WIDE_INT srcoffset, dstoffset;
+  HOST_WIDE_INT src_autoinc, dst_autoinc;
+  rtx mem, addr;
+  
+  gcc_assert (1 <= interleave_factor && interleave_factor <= 4);
+  
+  /* Use hard registers if we have aligned source or destination so we can use
+     load/store multiple with contiguous registers.  */
+  if (dst_aligned || src_aligned)
+    for (i = 0; i < interleave_factor; i++)
+      regs[i] = gen_rtx_REG (SImode, i);
+  else
+for (i

[PATCH, ARM] Fix NEON vset_lane for D registers

2011-05-03 Thread Julian Brown
Hi,

This patch fixes vset_lane intrinsic variants for D-register sized
variables. A typo meant that the wrong lane would be set in many
circumstances.

Tested manually only. OK to apply?

Thanks,

Julian

ChangeLog

gcc/
* config/arm/neon.md (vec_set<mode>_internal): Fix misplaced
parenthesis in D-register case.

Index: gcc/config/arm/neon.md
===
--- gcc/config/arm/neon.md	(revision 173299)
+++ gcc/config/arm/neon.md	(working copy)
@@ -426,7 +426,7 @@
   (match_operand:SI 2 "immediate_operand" "i")))]
   "TARGET_NEON"
 {
-  int elt = ffs ((int) INTVAL (operands[2]) - 1);
+  int elt = ffs ((int) INTVAL (operands[2])) - 1;
   if (BYTES_BIG_ENDIAN)
     elt = GET_MODE_NUNITS (<MODE>mode) - 1 - elt;
   operands[2] = GEN_INT (elt);
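
To see why the parenthesis placement matters, here is a small stand-alone sketch of the two expressions. It assumes (as the use of ffs suggests) that operands[2] holds a one-hot mask, 1 << lane, at this point; the function names lane_index_buggy and lane_index_fixed are hypothetical.

```c
#include <assert.h>
#include <strings.h>   /* ffs (POSIX) */

/* Lane index as computed before the patch: the "- 1" is applied to the
   mask before ffs sees it.  */
int
lane_index_buggy (int mask)
{
  return ffs (mask - 1);
}

/* Lane index as computed after the patch: zero-based position of the
   lowest set bit in MASK.  */
int
lane_index_fixed (int mask)
{
  assert (mask != 0);
  return ffs (mask) - 1;
}
```

Under that assumption the old expression happens to return the right index for lanes 0 and 1 (ffs (0) is 0 and ffs (1) is 1), but for any higher lane mask - 1 has its lowest bit set, so the result is always 1.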


Re: [PATCH, ARM] Avoid element-order-dependent operations for quad-word vectors in big-endian mode for NEON

2011-04-05 Thread Julian Brown
On Wed, 9 Feb 2011 12:11:35 +
Julian Brown jul...@codesourcery.com wrote:

> On Wed, 12 Jan 2011 17:38:22 +
> Julian Brown jul...@codesourcery.com wrote:
>
> > This version of the patch tweaks target-supports.exp to say that
> > various operations are not available in big-endian mode (removing
> > some of the FAILs from the previous version -- though in big-endian
> > mode without -mvectorize-with-neon-quad, some tests have
> > transitioned from PASS to XPASS. I'm not sure that's worth worrying
> > about).
> >
> > The main part of the patch remains unchanged.
>
> Ping?

Ping?

(Patch: http://gcc.gnu.org/ml/gcc-patches/2011-01/msg00768.html)

Thanks,

Julian


Re: [patch] Fix PR48183, NEON ICE in emit-rtl.c:immed_double_const() under -g

2011-03-24 Thread Julian Brown
On Thu, 24 Mar 2011 10:57:06 +
Richard Sandiford richard.sandif...@linaro.org wrote:

> Chung-Lin Tang clt...@codesourcery.com writes:
> > PR48183 is a case where ARM NEON intrinsics, under -O -g, produce
> > debug insns that try to expand OImode (32-byte integer) zero
> > constants, much too large to represent as two HOST_WIDE_INTs; as
> > the internals manual indicates, such large constants are not
> > supported in general, and GCC ICEs on the GET_MODE_BITSIZE(mode) ==
> > 2*HOST_BITS_PER_WIDE_INT assertion.
> >
> > This patch allows the cases where the large integer constant is
> > still representable using a single CONST_INT, such as zero(0).
> > Bootstrapped and tested on i686 and x86_64, cross-tested on ARM,
> > all without regressions. Okay for trunk?
> >
> > Thanks,
> > Chung-Lin
> >
> > 2011-03-20  Chung-Lin Tang  clt...@codesourcery.com
> >
> > * emit-rtl.c (immed_double_const): Allow wider than
> > 2*HOST_BITS_PER_WIDE_INT mode constants when they are
> > representable as a single const_int RTX.
>
> I realise this might be seen as a good expedient fix, but it makes
> me a bit uneasy.  Not a very constructive rationale, sorry.

FWIW I also had a fix for this issue, which is equivalent to
Chung-Lin's patch apart from only allowing constant-zero (attached).
That's not really a vote from me for this approach, but maybe limiting
the extent to which we pretend to support wide-integer constants like
this is sensible, if we do go that way.

Julian

--- gcc/expr.c	(revision 314639)
+++ gcc/expr.c	(working copy)
@@ -8458,6 +8458,18 @@ expand_expr_real_1 (tree exp, rtx target
   return decl_rtl;
 
 case INTEGER_CST:
+  if (GET_MODE_BITSIZE (mode) > 2 * HOST_BITS_PER_WIDE_INT)
+	{
+	  /* FIXME: We can't generally represent wide integer constants,
+	 but GCC sometimes tries to initialise wide integer values (such
+	 as used by the ARM NEON support) with zero.  Handle that as a
+	 special case here.  */
+	  if (initializer_zerop (exp))
+	return CONST0_RTX (mode);
+
+	  gcc_unreachable ();
+	}
+
   temp = immed_double_const (TREE_INT_CST_LOW (exp),
  TREE_INT_CST_HIGH (exp), mode);
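
The difference between the two approaches can be sketched with a toy model. This is only an illustration under stated assumptions: a wide constant is modelled as an array of 64-bit limbs (least significant first), which is not GCC's representation, and "representable as a single const_int" is read here as "equal to the sign-extension of the lowest limb". The names fits_single_limb and is_wide_zero are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for a wide integer constant: N 64-bit limbs, least
   significant first.  Not GCC's representation.  */

/* Broader policy (assumed reading of the emit-rtl.c change): accept a
   wide constant if it equals the sign-extension of its lowest limb.  */
bool
fits_single_limb (const int64_t *limbs, size_t n)
{
  assert (n >= 1);
  int64_t ext = limbs[0] < 0 ? -1 : 0;
  for (size_t i = 1; i < n; i++)
    if (limbs[i] != ext)
      return false;
  return true;
}

/* Narrower policy of the expr.c patch: accept only constant zero.  */
bool
is_wide_zero (const int64_t *limbs, size_t n)
{
  for (size_t i = 0; i < n; i++)
    if (limbs[i] != 0)
      return false;
  return true;
}
```

In this model a 256-bit value such as 5 or -1 passes the broader check but not the zero-only one, which is the sense in which the second patch limits how far wide constants are supported.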
 


Re: [ARM] Neon / Ocaml question

2010-01-11 Thread Julian Brown
On Mon, 11 Jan 2010 09:52:59 +
Ramana Radhakrishnan ramana.radhakrish...@arm.com wrote:

> cam-bc3-b12:ramrad01 68 > ocamlc -c neon-schedgen.ml
> File "neon-schedgen.ml", line 51, characters 0-10:
> Unbound module Utils
>
> It sounds like a configuration issue but given my rather rusty ocaml
> skills - I'm not sure where to look. Googling around doesn't show me
> anything obvious. I see this both with v. 3.09.3 and v 3.11 (on
> karmic).

This is apparently due to a missing source file, utils.ml, but
unfortunately I have no idea what has happened to it (I'm not the
original author). Luckily it only seems to be used for a single
function definition (find_with_result), so we can just re-implement
that.

Another thing which might bite you is that recent OCaml versions don't
like hyphens in filenames: replacing them with underscores works OK
though (i.e. neon_schedgen.ml).

Compiling neon-schedgen.ml with the attached patch, the parts of
cortex-a8-neon.md below the line:

;; The remainder of this file is auto-generated by neon-schedgen.

are generated identically.

Would you like to try this out and see how you get on with it?
Followups set to gcc-patches.

Thanks,

Julian

ChangeLog

gcc/
* config/arm/neon-schedgen.ml (Utils): Don't try to open missing
module.
(find_with_result): New.
Index: neon-schedgen.ml
===
--- neon-schedgen.ml	(revision 155808)
+++ neon-schedgen.ml	(working copy)
@@ -48,7 +48,14 @@
  and at present we do not emit specific guards.)
 *)
 
-open Utils
+let find_with_result fn lst =
+  let rec scan = function
+    [] -> raise Not_found
+  | l::ls ->
+      match fn l with
+        Some result -> result
+      | _ -> scan ls in
+  scan lst
 
 let n1 = 1 and n2 = 2 and n3 = 3 and n4 = 4 and n5 = 5 and n6 = 6
 and n7 = 7 and n8 = 8 and n9 = 9


Re: Using a umulhisi3

2009-06-03 Thread Julian Brown
On Wed, 3 Jun 2009 21:39:34 +1200
Michael Hope micha...@juju.net.nz wrote:

> How does the combine stage work?  It looks like it could get multiple
> potential matches for a set of RTLs.  Does it use some type of costing
> function to pick between them?  Can I tell combine that a umulhisi3 is
> cheaper than a mulsi3?

You could try defining TARGET_RTX_COSTS, if you haven't already.

Julian


Re: __builtin_return_address for ARM

2009-02-27 Thread Julian Brown
On Thu, 26 Feb 2009 15:54:14 +
Andrew Haley a...@redhat.com wrote:

> Paul Brook wrote:
> > > Well, but wouldn't it still be nice if
> > > __builtin_return_address(N) was implemented for N>0 by libcalling
> > > into the unwinder for you?  Obviously this would still have to
> > > return NULL at runtime when you're running on a DW2 target without
> > > any EH frame data present in memory (and I guess it wouldn't work
> > > on SjLj targets either), but wouldn't it still be a nice
> > > convenience feature for users?
> >
> > There are sufficiently many caveats and system specific bits of
> > weirdness that you probably just have to know what you're doing (or
> > rely on backtrace(3) to do it for you).
> >
> > IMHO builtins are for things that you can't do in normal C. So
> > __builtin_return_address(0) makes a lot of sense. Having it start
> > guessing how to do N>0 much less so.
>
> I suggest we could contribute a version of backtrace.c for ARM to
> glibc. An example to follow is libc/sysdeps/ia64/backtrace.c.

GLIBC already knows how to do backtracing if the ARM-specific unwind
tables are present (.ARM.exidx, etc.), using _Unwind_Backtrace.

Unfortunately backtraces don't currently terminate cleanly if code
without unwind data is reached: CodeSourcery are currently working on
fixing the linker so that non-unwindable regions are marked properly,
which we consider essential to making this feature usable.

Of course, you'll need to compile all your code with -funwind-tables for
this to work. We haven't measured the size impact of this yet: we're
planning on optimising the unwind tables by merging duplicate entries
whenever possible, so hopefully it won't be too bad.
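
For readers unfamiliar with the interface, here is a minimal sketch of collecting return addresses with _Unwind_Backtrace from <unwind.h>. It is not the GLIBC patch mentioned above; the names trace_state, record_frame and collect_backtrace are hypothetical, and the sketch assumes the code and its callers have unwind tables available.

```c
#include <assert.h>
#include <stdint.h>
#include <unwind.h>

/* Hypothetical state shared with the callback.  */
struct trace_state
{
  uintptr_t *pcs;   /* where to store program counters */
  int max;          /* capacity of PCS */
  int count;        /* frames recorded so far */
};

/* Called by the unwinder once per frame, innermost first.  */
static _Unwind_Reason_Code
record_frame (struct _Unwind_Context *ctx, void *arg)
{
  struct trace_state *st = arg;

  if (st->count >= st->max)
    return _URC_END_OF_STACK;       /* buffer full: stop the walk */
  st->pcs[st->count++] = (uintptr_t) _Unwind_GetIP (ctx);
  return _URC_NO_REASON;            /* continue to the caller's frame */
}

/* Store up to MAX program counters in PCS and return how many were
   recorded.  */
int
collect_backtrace (uintptr_t *pcs, int max)
{
  struct trace_state st = { pcs, max, 0 };

  assert (pcs != 0 && max > 0);
  _Unwind_Backtrace (record_frame, &st);
  return st.count;
}
```

On ARM the walk only proceeds through frames that have unwind data, which is why -funwind-tables matters here.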

Just a heads-up to avoid duplicate effort!

Cheers,

Julian


Re: __builtin_return_address for ARM

2009-02-27 Thread Julian Brown
On Fri, 27 Feb 2009 13:32:11 +
Julian Brown jul...@codesourcery.com wrote:

> GLIBC already knows how to do backtracing if the ARM-specific unwind
> tables are present (.ARM.exidx, etc.), using _Unwind_Backtrace.

I'm told this probably isn't true for upstream GLIBC -- but we
definitely have a patch somewhere to make GLIBC backtrace use
_Unwind_Backtrace, which we'll submit upstream in due course. Sorry for
the misinformation!

Cheers,

Julian


Re: How to implement conditional execution

2008-06-27 Thread Julian Brown
On Fri, 27 Jun 2008 15:52:22 +0530
Mohamed Shafi [EMAIL PROTECTED] wrote:

> If the condition in the 'if' instruction is satisfied the processor
> will execute the next instruction; otherwise it will replace it with
> a nop. So this means that I can have instructions similar to:
>
> if eq Rx, Ry
>   add Rx, Ry
> add Rx, 2
>
> Will it be possible to implement this in the GCC backend?
> Do any other targets have similar instructions?

This is very much like (a simpler version of) the ARM Thumb-2 IT
instruction. Look at how config/arm/thumb2.md handles that. I think the
basic idea is to define conditional instruction patterns which emit
assembly for both instructions simultaneously, e.g.
(excuse my pseudocode):

  (define_insn ...
    [(...)]
    "if eq Rx, Ry\;add Rx, Ry")

then there's no possibility for scheduling or other optimisations to
split the second instruction away from the first.

Julian


Re: core changes for mep port

2007-03-28 Thread Julian Brown

Steven Bosscher wrote:

> All of this feels (to me anyway) like adding a lot of code to the
> middle end to support MEP specific arch features.  I understand it is
> in the mission statement that more ports is a goal for GCC, but I
> wonder if this set of changes is worth the maintenance burden...


FWIW, it sounds to me like this feature may also be useful for current 
iterations of the ARM NEON extension (which we're planning to submit 
support for quite soon). NEON supports various operations on DImode 
quantities, but we don't use them for normal code at present because 
moving values from NEON back to ARM core registers is relatively slow, 
so we want to avoid doing that as far as possible.


So, if there was a way of specifying that a particular value should be 
kept in a NEON register, that'd be a good thing, I think.


Cheers,

Julian


Re: core changes for mep port

2007-03-28 Thread Julian Brown

Steven Bosscher wrote:

> On 3/28/07, Julian Brown [EMAIL PROTECTED] wrote:
> > Steven Bosscher wrote:
> > > All of this feels (to me anyway) like adding a lot of code to the
> > > middle end to support MEP specific arch features.  I understand it is
> > > in the mission statement that more ports is a goal for GCC, but I
> > > wonder if this set of changes is worth the maintenance burden...
> >
> > FWIW, it sounds to me like this feature may also be useful for current
> > iterations of the ARM NEON extension (which we're planning to submit
> > support for quite soon). NEON supports various operations on DImode
> > quantities, but we don't use them for normal code at present because
> > moving values from NEON back to ARM core registers is relatively slow,
> > so we want to avoid doing that as far as possible.
> >
> > So, if there was a way of specifying that a particular value should be
> > kept in a NEON register, that'd be a good thing, I think.
>
> And if you use this coprocessor hackery, it will be exactly what Ian
> opposed in his first reply: As far as I can see you're using new
> modes to drive register class preferences.


Quite possibly. I don't really know enough about how any of this works 
to say much useful, it just seemed like another potential use for the 
feature (albeit a rather esoteric one) if it does go in.


Cheers,

Julian


Re: GCC 4.0 RC2 Available

2005-04-18 Thread Julian Brown
On 2005-04-18, Mark Mitchell [EMAIL PROTECTED] wrote:

> RC2 is available here:
>
>   ftp://gcc.gnu.org/pub/gcc/prerelease-4.0.0-20050417/
>
> As before, I'd very much appreciate it if people would test these bits
> on primary and secondary platforms, post test results with the
> contrib/test_summary script, and send me a message saying whether or
> not there are any regressions, together with a pointer to the results.

Results for arm-none-elf, cross-compiled from i686-pc-linux-gnu (Debian)
for C and C++ are here:

http://gcc.gnu.org/ml/gcc-testresults/2005-04/msg01301.html

Relative to RC1, there are several new tests which pass, and:

g++.dg/warn/Wdtor1.C (test for excess errors)

works whereas it didn't before.

Julian



Re: GCC 4.0 RC1 Available

2005-04-11 Thread Julian Brown
On 2005-04-11, Julian Brown [EMAIL PROTECTED] wrote:
> On 2005-04-10, Mark Mitchell [EMAIL PROTECTED] wrote:
>
> >  * The DejaGNU testsuite has been run, and compared with a run of
> > the testsuite on the previous release of GCC, and no regressions are
> > observed.
> >
> > If you are willing to help, please download the release candidate, build
> > it on appropriate platforms, and post testresults by using
> > contrib/test_summary.  Please use the release candidate itself, *not*
> > the CVS 4.0 release branch, as part of the goal is to ensure that the
> > packaging scripts are working.
>
> For arm-none-elf (cross from i686-pc-linux-gnu), with binutils and newlib
> from CVS:
>
>   http://gcc.gnu.org/ml/gcc-testresults/2005-04/msg00800.html
>
> And, for comparison, 3.4.3 tests:
>
>   http://gcc.gnu.org/ml/gcc-testresults/2005-04/msg00799.html
>
> Quite a few of the 4.0 RC1 tests FAIL, though I'm not sure how many of
> these are regressions, and how many are just new tests which fail.

In more detail, for gcc.sum:

Tests that now fail, but worked before:

gcc.c-torture/execute/bitfld-1.c execution,  -O0
gcc.c-torture/execute/bitfld-1.c execution,  -O1
gcc.c-torture/execute/bitfld-1.c execution,  -O2
gcc.c-torture/execute/bitfld-1.c execution,  -O3 -fomit-frame-pointer
gcc.c-torture/execute/bitfld-1.c execution,  -O3 -g
gcc.c-torture/execute/bitfld-1.c execution,  -Os
gcc.c-torture/execute/builtin-constant.c execution,  -O1
gcc.dg/array-5.c bad vla handling (test for bogus messages, line 40)
gcc.dg/bitfld-2.c  (test for warnings, line 14)
gcc.dg/bitfld-2.c  (test for warnings, line 15)
gcc.dg/bitfld-2.c  (test for warnings, line 20)
gcc.dg/bitfld-2.c  (test for warnings, line 21)
gcc.dg/builtins-18.c (test for excess errors)
gcc.dg/builtins-20.c (test for excess errors)
gcc.dg/const-elim-1.c scan-assembler-not L\\$?C[^A-Z]
gcc.dg/cpp/trad/include.c (test for excess errors)
gcc.dg/redecl-1.c  (test for errors, line 67)
gcc.dg/sequence-pt-1.c sequence point warning (test for warnings, line 59)
gcc.dg/uninit-1.c uninitialized variable warning (test for bogus messages, line 16)
gcc.dg/uninit-2.c uninitialized variable warning (test for bogus messages, line 28)
gcc.dg/uninit-3.c uninitialized variable warning (test for bogus messages, line 11)
gcc.dg/uninit-8.c uninitialized variable warning (test for bogus messages, line 14)
gcc.dg/Wunreachable-1.c (test for excess errors)


For g++.sum:

Tests that now fail, but worked before:

g++.dg/other/error8.C duplicate error messages (test for bogus messages, line 8)
g++.dg/other/error8.C duplicate error messages (test for bogus messages, line 9)
g++.dg/rtti/tinfo1.C scan-assembler-not .section[^\n\r]*_ZTIP9CTemplateIhE[^\n\r]*
g++.dg/template/nested3.C  (test for errors, line 12)
g++.dg/template/nested3.C  (test for errors, line 14)
g++.dg/template/nested3.C  (test for errors, line 25)
g++.dg/template/nested3.C  (test for errors, line 8)
g++.old-deja/g++.jason/cond.C  (test for errors, line 20)
g++.old-deja/g++.jason/cond.C  (test for errors, line 22)
g++.old-deja/g++.jason/cond.C  (test for errors, line 25)
g++.old-deja/g++.jason/cond.C  (test for errors, line 27)
g++.old-deja/g++.oliva/expr2.C execution test
g++.old-deja/g++.oliva/template10.C  (test for errors, line 22)
g++.old-deja/g++.other/decl5.C  (test for warnings, line 55)
g++.old-deja/g++.other/decl5.C  (test for warnings, line 56)


For libstdc++.sum:

Tests that now fail, but worked before:

27_io/basic_filebuf/open/char/9507.cc (test for excess errors)


That's a total of about 39 regressions, I think. I also got quite a few
"Old tests that passed, that have disappeared" results. Is that expected?

Julian



Re: Different sized data and code pointers

2005-03-03 Thread Julian Brown
On 2005-03-02, Thomas Gill [EMAIL PROTECTED] wrote:
> Paul Schlie wrote:
>
> > With the arguable exception of function pointers (which need not be literal
> > address) all pointers are presumed to point to data, not code; therefore
> > may be simplest to define pointers as being 16-bits, and call functions
> > indirectly through a lookup table constructed at link time from program
> > memory, assuming it's readable via some mechanism; as the call penalty
> > incurred would likely be insignificant relative to the potential complexity
> > of attempting to support 24-bit code pointers in the rare circumstances
> > they're typically used, on an otherwise native 16-bit machine.
>
> Thanks for the response.
>
> Suppose we don't have enough space to burn on a layer of indirection for
> every function pointer. Do I take it that there's really not a clean way
> to make GCC treat function pointers as 24 bit while still treating data
> pointers as 16 bits?

FWIW, a port I did used indirection for all function pointers, albeit
for a different reason, and I can report that it seems to work OK in
practice with a little linker magic. It wasn't really production-quality
code though, I admit.

Perhaps the indirection table can safely hold only those functions whose
address is taken? (Or maybe that was assumed anyway?)
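
The indirection idea can be sketched in ordinary C. This is an illustration only, not the port described above: the table fn_table is written by hand here, whereas in the scheme under discussion the linker would construct it, and all names (fn_handle, call_indirect, add_one, twice) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef int (*fn_t) (int);

int add_one (int x) { return x + 1; }
int twice (int x) { return 2 * x; }

/* Hypothetical table holding only the functions whose address is taken.
   Written by hand here; a real scheme would have the linker build it.  */
static const fn_t fn_table[] = { add_one, twice };

enum { FN_ADD_ONE = 0, FN_TWICE = 1, FN_COUNT = 2 };

/* A "function pointer" is a 16-bit index into FN_TABLE, so it is the
   same size as a 16-bit data pointer.  */
typedef uint16_t fn_handle;

/* Call through the table: one extra load per indirect call.  */
int
call_indirect (fn_handle h, int arg)
{
  assert (h < FN_COUNT);
  return fn_table[h] (arg);
}
```

Direct calls are unaffected in this model; only calls through a handle pay for the table lookup, and the table needs one entry per address-taken function.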

Julian

-- 
Julian Brown
CodeSourcery, LLC


