[PATCH, ARM] Unaligned accesses for builtin memcpy [2/2]
Hi,

This is the second of two patches to add unaligned-access support to the ARM backend. It builds on the first patch to provide support for unaligned accesses when expanding block moves (i.e. for builtin memcpy operations). It makes some effort to use load/store multiple instructions where appropriate (when accessing sufficiently-aligned source or destination addresses), and also makes some effort to generate fast code (for -O1/2/3) or small code (for -Os), though some of the heuristics may need tweaking still.

Examples:

  #include <string.h>

  void foo (char *dest, char *src)
  {
    memcpy (dest, src, AMOUNT);
  }

  char known[64];

  void dst_aligned (char *src)
  {
    memcpy (known, src, AMOUNT);
  }

  void src_aligned (char *dst)
  {
    memcpy (dst, known, AMOUNT);
  }

For "-mcpu=cortex-m4 -mthumb -O2 -DAMOUNT=15" we get:

foo:
        ldr     r2, [r1, #4]    @ unaligned
        ldr     r3, [r1, #8]    @ unaligned
        push    {r4}
        ldr     r4, [r1, #0]    @ unaligned
        str     r2, [r0, #4]    @ unaligned
        str     r4, [r0, #0]    @ unaligned
        str     r3, [r0, #8]    @ unaligned
        ldrh    r2, [r1, #12]   @ unaligned
        ldrb    r3, [r1, #14]   @ zero_extendqisi2
        strh    r2, [r0, #12]   @ unaligned
        strb    r3, [r0, #14]
        pop     {r4}
        bx      lr

dst_aligned:
        push    {r4}
        mov     r4, r0
        movw    r3, #:lower16:known
        ldr     r1, [r4, #4]    @ unaligned
        ldr     r2, [r4, #8]    @ unaligned
        ldr     r0, [r0, #0]    @ unaligned
        movt    r3, #:upper16:known
        stmia   r3!, {r0, r1, r2}
        ldrh    r1, [r4, #12]   @ unaligned
        ldrb    r2, [r4, #14]   @ zero_extendqisi2
        strh    r1, [r3, #0]    @ unaligned
        strb    r2, [r3, #2]
        pop     {r4}
        bx      lr

src_aligned:
        push    {r4}
        movw    r3, #:lower16:known
        movt    r3, #:upper16:known
        mov     r4, r0
        ldmia   r3!, {r0, r1, r2}
        str     r0, [r4, #0]    @ unaligned
        str     r1, [r4, #4]    @ unaligned
        str     r2, [r4, #8]    @ unaligned
        ldrh    r2, [r3, #0]    @ unaligned
        ldrb    r3, [r3, #2]    @ zero_extendqisi2
        strh    r2, [r4, #12]   @ unaligned
        strb    r3, [r4, #14]
        pop     {r4}
        bx      lr

Whereas for "-mcpu=cortex-m4 -mthumb -Os -DAMOUNT=15", e.g.:

foo:
        add     r3, r1, #12
.L2:
        ldr     r2, [r1], #4    @ unaligned
        cmp     r1, r3
        str     r2, [r0], #4    @ unaligned
        bne     .L2
        ldrh    r3, [r1, #0]    @ unaligned
        strh    r3, [r0, #0]    @ unaligned
        ldrb    r3, [r1, #2]    @ zero_extendqisi2
        strb    r3, [r0, #2]
        bx      lr

Tested (alongside the first patch) with cross to ARM Linux. OK to apply?

Thanks,

Julian

ChangeLog

    gcc/
    * config/arm/arm.c (arm_block_move_unaligned_straight)
    (arm_adjust_block_mem, arm_block_move_unaligned_loop)
    (arm_movmemqi_unaligned): New.
    (arm_gen_movmemqi): Support unaligned block copies.

commit 16973f69fce37a2b347ea7daffd6f593aba843d5
Author: Julian Brown jul...@henry7.codesourcery.com
Date:   Wed May 4 11:26:01 2011 -0700

    Optimize block moves when unaligned accesses are permitted.

diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
index a18aea6..b6df0d3 100644
--- a/gcc/config/arm/arm.c
+++ b/gcc/config/arm/arm.c
@@ -10362,6 +10362,335 @@ gen_const_stm_seq (rtx *operands, int nops)
   return true;
 }
 
+/* Copy a block of memory using plain ldr/str/ldrh/strh instructions, to permit
+   unaligned copies on processors which support unaligned semantics for those
+   instructions.  INTERLEAVE_FACTOR can be used to attempt to hide load latency
+   (using more registers) by doing e.g. load/load/store/store for a factor of 2.
+   An interleave factor of 1 (the minimum) will perform no interleaving.
+   Load/store multiple are used for aligned addresses where possible.  */
+
+static void
+arm_block_move_unaligned_straight (rtx dstbase, rtx srcbase,
+                                   HOST_WIDE_INT length,
+                                   unsigned int interleave_factor)
+{
+  rtx *regs = XALLOCAVEC (rtx, interleave_factor);
+  int *regnos = XALLOCAVEC (int, interleave_factor);
+  HOST_WIDE_INT block_size_bytes = interleave_factor * UNITS_PER_WORD;
+  HOST_WIDE_INT i, j;
+  HOST_WIDE_INT remaining = length, words;
+  rtx halfword_tmp = NULL, byte_tmp = NULL;
+  rtx dst, src;
+  bool src_aligned = MEM_ALIGN (srcbase) >= BITS_PER_WORD;
+  bool dst_aligned = MEM_ALIGN (dstbase) >= BITS_PER_WORD;
+  HOST_WIDE_INT srcoffset, dstoffset;
+  HOST_WIDE_INT src_autoinc, dst_autoinc;
+  rtx mem, addr;
+
+  gcc_assert (1 <= interleave_factor && interleave_factor <= 4);
+
+  /* Use hard registers if we have aligned source or destination so we can use
+     load/store multiple with contiguous registers.  */
+  if (dst_aligned || src_aligned)
+    for (i = 0; i < interleave_factor; i++)
+      regs[i] = gen_rtx_REG (SImode, i);
+  else
+    for (i
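As a rough illustration of the copying strategy the comment above describes (interleaved word loads and stores, then a halfword and a byte for the tail), here is a minimal C sketch. It is not taken from the patch: copy_unaligned_sketch is a hypothetical name, and fixed-size memcpy calls stand in for the unaligned ldr/str/ldrh/strh instructions the backend would emit.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not part of the patch.  A memcpy of 4 or 2 bytes
   stands in for an unaligned ldr/str or ldrh/strh.  */
static void
copy_unaligned_sketch (unsigned char *dst, const unsigned char *src,
                       size_t length, unsigned int interleave_factor)
{
  uint32_t regs[4];   /* stands in for up to four scratch registers */
  size_t block, off = 0;
  unsigned int i;

  /* The patch asserts 1 <= factor <= 4; this sketch falls back to 1.  */
  if (interleave_factor < 1 || interleave_factor > 4)
    interleave_factor = 1;
  block = interleave_factor * sizeof (uint32_t);

  /* Full blocks: all loads first, then all stores
     (load/load/store/store for a factor of 2).  */
  while (length - off >= block)
    {
      for (i = 0; i < interleave_factor; i++)
        memcpy (&regs[i], src + off + i * 4, 4);
      for (i = 0; i < interleave_factor; i++)
        memcpy (dst + off + i * 4, &regs[i], 4);
      off += block;
    }

  /* Remaining whole words, one at a time.  */
  while (length - off >= 4)
    {
      memcpy (&regs[0], src + off, 4);
      memcpy (dst + off, &regs[0], 4);
      off += 4;
    }

  /* Tail: at most one halfword followed by at most one byte.  */
  if (length - off >= 2)
    {
      uint16_t half;
      memcpy (&half, src + off, 2);
      memcpy (dst + off, &half, 2);
      off += 2;
    }
  if (length - off >= 1)
    dst[off] = src[off];
}
```

For a 15-byte copy this splits the work into three words, one halfword and one byte, which is the same shape as the -O2 output for foo shown in the message.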
[PATCH, ARM] Fix NEON vset_lane for D registers
Hi,

This patch fixes vset_lane intrinsic variants for D-register sized variables. A typo meant that the wrong lane would be set in many circumstances.

Tested manually only. OK to apply?

Thanks,

Julian

ChangeLog

    gcc/
    * config/arm/neon.md (vec_set<mode>_internal): Fix misplaced
    parenthesis in D-register case.

Index: gcc/config/arm/neon.md
===================================================================
--- gcc/config/arm/neon.md (revision 173299)
+++ gcc/config/arm/neon.md (working copy)
@@ -426,7 +426,7 @@
           (match_operand:SI 2 "immediate_operand" "i")))]
   "TARGET_NEON"
 {
-  int elt = ffs ((int) INTVAL (operands[2]) - 1);
+  int elt = ffs ((int) INTVAL (operands[2])) - 1;
   if (BYTES_BIG_ENDIAN)
     elt = GET_MODE_NUNITS (<MODE>mode) - 1 - elt;
   operands[2] = GEN_INT (elt);
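To see why the position of the parenthesis matters, the two expressions can be compared in isolation. This is only a sketch: lane_buggy and lane_fixed are hypothetical names, and it assumes the operand is a one-hot lane mask (1 << lane), which the message itself does not state.

```c
#include <strings.h>   /* ffs (POSIX) */

/* Before the fix: the "- 1" is applied to the mask, inside the ffs call.  */
static int
lane_buggy (int mask)
{
  return ffs (mask - 1);
}

/* After the fix: ffs gives a 1-based bit position, so subtract 1 outside.  */
static int
lane_fixed (int mask)
{
  return ffs (mask) - 1;
}
```

Under that assumption the old expression happens to give the right answer for lanes 0 and 1, but returns 1 for every higher lane, since mask - 1 then always has bit 0 set.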
Re: [PATCH, ARM] Avoid element-order-dependent operations for quad-word vectors in big-endian mode for NEON
On Wed, 9 Feb 2011 12:11:35 + Julian Brown jul...@codesourcery.com wrote:

> On Wed, 12 Jan 2011 17:38:22 + Julian Brown jul...@codesourcery.com wrote:
>
> > This version of the patch tweaks target-supports.exp to say that various operations are not available in big-endian mode (removing some of the FAILs from the previous version -- though in big-endian mode without -mvectorize-with-neon-quad, some tests have transitioned from PASS to XPASS. I'm not sure that's worth worrying about). The main part of the patch remains unchanged.
>
> Ping?

Ping? (Patch: http://gcc.gnu.org/ml/gcc-patches/2011-01/msg00768.html)

Thanks,

Julian
Re: [patch] Fix PR48183, NEON ICE in emit-rtl.c:immed_double_const() under -g
On Thu, 24 Mar 2011 10:57:06 + Richard Sandiford richard.sandif...@linaro.org wrote:

> Chung-Lin Tang clt...@codesourcery.com writes:
> > PR48183 is a case where ARM NEON intrinsics, under -O -g, produce debug insns that try to expand OImode (32-byte integer) zero constants, much too large to represent as two HOST_WIDE_INTs; as the internals manual indicates, such large constants are not supported in general, and ICEs on the GET_MODE_BITSIZE(mode) == 2*HOST_BITS_PER_WIDE_INT assertion.
> >
> > This patch allows the cases where the large integer constant is still representable using a single CONST_INT, such as zero(0). Bootstrapped and tested on i686 and x86_64, cross-tested on ARM, all without regressions. Okay for trunk?
> >
> > Thanks,
> > Chung-Lin
> >
> > 2011-03-20 Chung-Lin Tang clt...@codesourcery.com
> >
> > * emit-rtl.c (immed_double_const): Allow wider than 2*HOST_BITS_PER_WIDE_INT mode constants when they are representable as a single const_int RTX.
>
> I realise this might be seen as a good expedient fix, but it makes me a bit uneasy. Not a very constructive rationale, sorry.

FWIW I also had a fix for this issue, which is equivalent to Chung-Lin's patch apart from only allowing constant-zero (attached). That's not really a vote from me for this approach, but maybe limiting the extent to which we pretend to support wide-integer constants like this is sensible, if we do go that way.

Julian

--- gcc/expr.c (revision 314639)
+++ gcc/expr.c (working copy)
@@ -8458,6 +8458,18 @@ expand_expr_real_1 (tree exp, rtx target
       return decl_rtl;
 
     case INTEGER_CST:
+      if (GET_MODE_BITSIZE (mode) > 2 * HOST_BITS_PER_WIDE_INT)
+        {
+          /* FIXME: We can't generally represent wide integer constants,
+             but GCC sometimes tries to initialise wide integer values (such
+             as used by the ARM NEON support) with zero.  Handle that as a
+             special case here.  */
+          if (initializer_zerop (exp))
+            return CONST0_RTX (mode);
+
+          gcc_unreachable ();
+        }
+
       temp = immed_double_const (TREE_INT_CST_LOW (exp),
                                  TREE_INT_CST_HIGH (exp), mode);
Re: [ARM] Neon / Ocaml question
On Mon, 11 Jan 2010 09:52:59 + Ramana Radhakrishnan ramana.radhakrish...@arm.com wrote:

> cam-bc3-b12:ramrad01 68 ocamlc -c neon-schedgen.ml
> File "neon-schedgen.ml", line 51, characters 0-10:
> Unbound module Utils
>
> It sounds like a configuration issue but given my rather rusty ocaml skills - I'm not sure where to look. Googling around doesn't show me anything obvious. I see this both with v. 3.09.3 and v 3.11 (on karmic).

This is apparently due to a missing source file, utils.ml, but unfortunately I have no idea what has happened to it (I'm not the original author). Luckily it only seems to be used for a single function definition (find_with_result), so we can just re-implement that.

Another thing which might bite you is that recent OCaml versions don't like hyphens in filenames: replacing them with underscores works OK though (i.e. neon_schedgen.ml).

Compiling neon-schedgen.ml with the attached patch, the parts of cortex-a8-neon.md below the line:

  ;; The remainder of this file is auto-generated by neon-schedgen.

are generated identically. Would you like to try this out and see how you get on with it? Followups set to gcc-patches.

Thanks,

Julian

ChangeLog

    gcc/
    * config/arm/neon-schedgen.ml (Utils): Don't try to open missing
    module.
    (find_with_result): New.

Index: neon-schedgen.ml
===================================================================
--- neon-schedgen.ml (revision 155808)
+++ neon-schedgen.ml (working copy)
@@ -48,7 +48,14 @@
    and at present we do not emit specific guards.) *)
 
-open Utils
+let find_with_result fn lst =
+  let rec scan = function
+    [] -> raise Not_found
+  | l::ls ->
+      match fn l with
+        Some result -> result
+      | _ -> scan ls in
+    scan lst
 
 let n1 = 1 and n2 = 2 and n3 = 3 and n4 = 4 and n5 = 5 and n6 = 6 and n7 = 7 and n8 = 8 and n9 = 9
Re: Using a umulhisi3
On Wed, 3 Jun 2009 21:39:34 +1200 Michael Hope micha...@juju.net.nz wrote:

> How does the combine stage work? It looks like it could get multiple potential matches for a set of RTLs. Does it use some type of costing function to pick between them? Can I tell combine that a umulhisi3 is cheaper than a mulsi3?

You could try defining TARGET_RTX_COSTS, if you haven't already.

Julian
Re: __builtin_return_address for ARM
On Thu, 26 Feb 2009 15:54:14 + Andrew Haley a...@redhat.com wrote:

> Paul Brook wrote:
> >> Well, but wouldn't it still be nice if __builtin_return_address(N) was implemented for N>0 by libcalling into the unwinder for you? Obviously this would still have to return NULL at runtime when you're running on a DW2 target without any EH frame data present in memory (and I guess it wouldn't work on SjLj targets either), but wouldn't it still be a nice convenience feature for users?
> >
> > There are sufficiently many caveats and system specific bits of weirdness that you probably just have to know what you're doing (or rely on backtrace(3) to do it for you). IMHO builtins are for things that you can't do in normal C. So __builtin_return_address(0) makes a lot of sense. Having it start guessing how to do N>0 much less so.
>
> I suggest we could contribute a version of backtrace.c for ARM to glibc. An example to follow is libc/sysdeps/ia64/backtrace.c.

GLIBC already knows how to do backtracing if the ARM-specific unwind tables are present (.ARM.exidx, etc.), using _Unwind_Backtrace. Unfortunately backtraces don't currently terminate cleanly if code without unwind data is reached: CodeSourcery are currently working on fixing the linker so that non-unwindable regions are marked properly, which we consider essential to making this feature usable.

Of course, you'll need to compile all your code with -funwind-tables for this to work. We haven't measured the size impact of this yet: we're planning on optimising the unwind tables by merging duplicate entries whenever possible, so hopefully it won't be too bad.

Just a heads-up to avoid duplicate effort!

Cheers,

Julian
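For readers comparing the two approaches discussed in this thread, a minimal sketch follows. It assumes a glibc-style C library that provides backtrace(3) in <execinfo.h>; my_return_address and collect_frames are hypothetical names used only for illustration, and whether deeper frames can actually be walked depends on the unwind information available on the target.

```c
#include <execinfo.h>   /* backtrace (glibc extension) */

/* Level 0: the compiler can answer this directly.  noinline keeps the
   function in its own frame so the result is the caller's address.  */
__attribute__ ((noinline)) static void *
my_return_address (void)
{
  return __builtin_return_address (0);
}

/* Deeper frames: leave the stack walk to the C library.  Returns the number
   of return addresses stored in FRAMES (at most MAX).  */
static int
collect_frames (void **frames, int max)
{
  return backtrace (frames, max);
}
```

A caller would pass a small array, e.g. void *frames[16], and only read the first collect_frames (frames, 16) entries.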
Re: __builtin_return_address for ARM
On Fri, 27 Feb 2009 13:32:11 + Julian Brown jul...@codesourcery.com wrote:

> GLIBC already knows how to do backtracing if the ARM-specific unwind tables are present (.ARM.exidx, etc.), using _Unwind_Backtrace.

I'm told this probably isn't true for upstream GLIBC -- but we definitely have a patch somewhere to make GLIBC backtrace use _Unwind_Backtrace, which we'll submit upstream in due course. Sorry for the misinformation!

Cheers,

Julian
Re: How to implement conditional execution
On Fri, 27 Jun 2008 15:52:22 +0530 Mohamed Shafi [EMAIL PROTECTED] wrote:

> If the condition in the 'if' instruction is satisfied the processor will execute the next instruction or it will replace with a nop. So this means that i can instructions similar to:
>
>   if eq Rx, Ry
>   add Rx, Ry
>   add Rx, 2
>
> Will it be possible to implement this in the Gcc backend ? Does any other targets have similar instructions?

This is very much like (a simpler version of) the ARM Thumb-2 IT instruction. Look at how config/arm/thumb2.md handles that. I think the basic idea should be that you define conditional instruction patterns which emit assembly for both instructions simultaneously, e.g. (excuse my pseudocode):

  (define_insn "..."
    [(...)]
    "if eq Rx, Ry\;add Rx, Ry")

then there's no possibility for scheduling or other optimisations to split the second instruction away from the first.

Julian
Re: core changes for mep port
Steven Bosscher wrote:

> All of this feels (to me anyway) like adding a lot of code to the middle end to support MEP specific arch features. I understand it is in the mission statement that more ports is a goal for GCC, but I wonder if this set of changes is worth the maintenance burden...

FWIW, it sounds to me like this feature may also be useful for current iterations of the ARM NEON extension (which we're planning to submit support for quite soon). NEON supports various operations on DImode quantities, but we don't use them for normal code at present because moving values from NEON back to ARM core registers is relatively slow, so we want to avoid doing that as far as possible.

So, if there was a way of specifying that a particular value should be kept in a NEON register, that'd be a good thing, I think.

Cheers,

Julian
Re: core changes for mep port
Steven Bosscher wrote:

> On 3/28/07, Julian Brown [EMAIL PROTECTED] wrote:
> > Steven Bosscher wrote:
> > > All of this feels (to me anyway) like adding a lot of code to the middle end to support MEP specific arch features. I understand it is in the mission statement that more ports is a goal for GCC, but I wonder if this set of changes is worth the maintenance burden...
> >
> > FWIW, it sounds to me like this feature may also be useful for current iterations of the ARM NEON extension (which we're planning to submit support for quite soon). NEON supports various operations on DImode quantities, but we don't use them for normal code at present because moving values from NEON back to ARM core registers is relatively slow, so we want to avoid doing that as far as possible. So, if there was a way of specifying that a particular value should be kept in a NEON register, that'd be a good thing, I think.
>
> And if you use this coprocessor hackery, it will be exactly what Ian opposed in his first reply: As far as I can see you're using new modes to drive register class preferences.

Quite possibly. I don't really know enough about how any of this works to say much useful, it just seemed like another potential use for the feature (albeit a rather esoteric one) if it does go in.

Cheers,

Julian
Re: GCC 4.0 RC2 Available
On 2005-04-18, Mark Mitchell [EMAIL PROTECTED] wrote:

> RC2 is available here:
>
> ftp://gcc.gnu.org/pub/gcc/prerelease-4.0.0-20050417/
>
> As before, I'd very much appreciate it if people would test these bits on primary and secondary platforms, post test results with the contrib/test_summary script, and send me a message saying whether or not there are any regressions, together with a pointer to the results.

Results for arm-none-elf, cross-compiled from i686-pc-linux-gnu (Debian) for C and C++ are here:

http://gcc.gnu.org/ml/gcc-testresults/2005-04/msg01301.html

Relative to RC1, there are several new tests which pass, and:

  g++.dg/warn/Wdtor1.C (test for excess errors)

works whereas it didn't before.

Julian
Re: GCC 4.0 RC1 Available
On 2005-04-11, Julian Brown [EMAIL PROTECTED] wrote:

> On 2005-04-10, Mark Mitchell [EMAIL PROTECTED] wrote:
> > * The DejaGNU testsuite has been run, and compared with a run of the testsuite on the previous release of GCC, and no regressions are observed.
> >
> > If you are willing to help, please download the release candidate, build it on appropriate platforms, and post testresults by using contrib/test_summary. Please use the release candidate itself, *not* the CVS 4.0 release branch, as part of the goal is to ensure that the packaging scripts are working.
>
> For arm-none-elf (cross from i686-pc-linux-gnu), with binutils and newlib from CVS:
>
> http://gcc.gnu.org/ml/gcc-testresults/2005-04/msg00800.html

And, for comparison, 3.4.3 tests:

http://gcc.gnu.org/ml/gcc-testresults/2005-04/msg00799.html

Quite a few of the 4.0 RC1 tests FAIL, though I'm not sure how many of these are regressions, and how many are just new tests which fail. In more detail, for gcc.sum:

Tests that now fail, but worked before:

  gcc.c-torture/execute/bitfld-1.c execution, -O0
  gcc.c-torture/execute/bitfld-1.c execution, -O1
  gcc.c-torture/execute/bitfld-1.c execution, -O2
  gcc.c-torture/execute/bitfld-1.c execution, -O3 -fomit-frame-pointer
  gcc.c-torture/execute/bitfld-1.c execution, -O3 -g
  gcc.c-torture/execute/bitfld-1.c execution, -Os
  gcc.c-torture/execute/builtin-constant.c execution, -O1
  gcc.dg/array-5.c bad vla handling (test for bogus messages, line 40)
  gcc.dg/bitfld-2.c (test for warnings, line 14)
  gcc.dg/bitfld-2.c (test for warnings, line 15)
  gcc.dg/bitfld-2.c (test for warnings, line 20)
  gcc.dg/bitfld-2.c (test for warnings, line 21)
  gcc.dg/builtins-18.c (test for excess errors)
  gcc.dg/builtins-20.c (test for excess errors)
  gcc.dg/const-elim-1.c scan-assembler-not L\\$?C[^A-Z]
  gcc.dg/cpp/trad/include.c (test for excess errors)
  gcc.dg/redecl-1.c (test for errors, line 67)
  gcc.dg/sequence-pt-1.c sequence point warning (test for warnings, line 59)
  gcc.dg/uninit-1.c uninitialized variable warning (test for bogus messages, line 16)
  gcc.dg/uninit-2.c uninitialized variable warning (test for bogus messages, line 28)
  gcc.dg/uninit-3.c uninitialized variable warning (test for bogus messages, line 11)
  gcc.dg/uninit-8.c uninitialized variable warning (test for bogus messages, line 14)
  gcc.dg/Wunreachable-1.c (test for excess errors)

For g++.sum:

Tests that now fail, but worked before:

  g++.dg/other/error8.C duplicate error messages (test for bogus messages, line 8)
  g++.dg/other/error8.C duplicate error messages (test for bogus messages, line 9)
  g++.dg/rtti/tinfo1.C scan-assembler-not .section[^\n\r]*_ZTIP9CTemplateIhE[^\n\r ]*
  g++.dg/template/nested3.C (test for errors, line 12)
  g++.dg/template/nested3.C (test for errors, line 14)
  g++.dg/template/nested3.C (test for errors, line 25)
  g++.dg/template/nested3.C (test for errors, line 8)
  g++.old-deja/g++.jason/cond.C (test for errors, line 20)
  g++.old-deja/g++.jason/cond.C (test for errors, line 22)
  g++.old-deja/g++.jason/cond.C (test for errors, line 25)
  g++.old-deja/g++.jason/cond.C (test for errors, line 27)
  g++.old-deja/g++.oliva/expr2.C execution test
  g++.old-deja/g++.oliva/template10.C (test for errors, line 22)
  g++.old-deja/g++.other/decl5.C (test for warnings, line 55)
  g++.old-deja/g++.other/decl5.C (test for warnings, line 56)

For libstdc++.sum:

Tests that now fail, but worked before:

  27_io/basic_filebuf/open/char/9507.cc (test for excess errors)

That's a total of about 39 regressions, I think. I also got quite a few "Old tests that passed, that have disappeared" results. Is that expected?

Julian
Re: Different sized data and code pointers
On 2005-03-02, Thomas Gill [EMAIL PROTECTED] wrote:

> Paul Schlie wrote:
> > With the arguable exception of function pointers (which need not be literal address) all pointers are presumed to point to data, not code; therefore may be simplest to define pointers as being 16-bits, and call functions indirectly through a lookup table constructed at link time from program memory, assuming it's readable via some mechanism; as the call penalty incurred would likely be insignificant relative to the potential complexity of attempting to support 24-bit code pointers in the rare circumstances they're typically used, on an otherwise native 16-bit machine.
>
> Thanks for the response.
>
> Suppose we don't have enough space to burn on a layer of indirection for every function pointer. Do I take it that there's really not a clean way to make GCC treat function pointers as 24 bit while still treating data pointers as 16 bits?

FWIW, a port I did used indirection for all function pointers, albeit for a different reason, and I can report that it seems to work OK in practice with a little linker magic. It wasn't really production-quality code though, I admit.

Perhaps the indirection table can safely hold only those functions whose address is taken? (Or maybe that was assumed anyway?)

Julian

-- 
Julian Brown
CodeSourcery, LLC
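The indirection-table idea discussed above can be sketched in plain C. This is only an illustration under stated assumptions: in the scheme described in the thread the table would be built by the linker, whereas here it is a hand-written array, and fn_handle, fn_table, call_via_table and the two sample functions are all hypothetical names.

```c
#include <stdint.h>

typedef int (*handler_fn) (int);

/* Two sample functions whose "address is taken".  */
static int add_one (int x) { return x + 1; }
static int twice (int x) { return x * 2; }

/* A 16-bit handle is what the program stores and passes around in place of
   a wide code pointer.  */
typedef uint16_t fn_handle;

enum { FN_ADD_ONE = 0, FN_TWICE = 1, FN_COUNT = 2 };

/* The table holds the full-width code addresses; only address-taken
   functions need an entry.  */
static const handler_fn fn_table[FN_COUNT] = { add_one, twice };

/* An indirect call becomes a table lookup followed by a call.  The handle is
   assumed to be in range; a real port would have to guarantee that.  */
static int
call_via_table (fn_handle h, int arg)
{
  return fn_table[h] (arg);
}
```

Each indirect call pays one extra load, and the table grows by one entry per address-taken function, which is the space cost raised in the quoted question.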