[Bug target/102125] (ARM Cortex-M3 and newer) missed optimization. memcpy not needed operations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102125

--- Comment #11 from CVS Commits ---
The master branch has been updated by Richard Biener:

https://gcc.gnu.org/g:d9792f8d227cdd409c2b082ef0685b47ccfaa334

commit r12-7786-gd9792f8d227cdd409c2b082ef0685b47ccfaa334
Author: Richard Biener
Date:   Wed Mar 23 14:53:49 2022 +0100

    target/102125 - alternative memcpy folding improvement

    The following extends the heuristic memcpy folding path with the
    ability to use misaligned accesses on strict-alignment targets, just
    like the size-based path does.  That avoids regressing the following
    testcase on arm

    uint64_t bar64 (const uint8_t *rData1)
    {
      uint64_t buffer;
      memcpy (&buffer, rData1, sizeof (buffer));
      return buffer;
    }

    when r12-3482-g5f6a6c91d7c592 is reverted.

    2022-03-23  Richard Biener

            PR target/102125
            * gimple-fold.cc (gimple_fold_builtin_memory_op): Allow the
            use of movmisalign when either the source or destination
            decl is properly aligned.
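For illustration only (a hypothetical harness, not part of the commit): the
shape this folding must handle is an aligned destination decl fed from a
possibly misaligned source pointer, as in

  #include <stdint.h>
  #include <string.h>
  #include <stdio.h>

  static uint64_t bar64 (const uint8_t *rData1)
  {
    uint64_t buffer;                            /* aligned destination decl */
    memcpy (&buffer, rData1, sizeof (buffer));  /* copy that should fold */
    return buffer;
  }

  int main (void)
  {
    uint8_t raw[9] = { 0, 1, 2, 3, 4, 5, 6, 7, 8 };
    /* raw + 1 is misaligned for uint64_t; on a strict-alignment target the
       folded code must go through movmisalign (or equivalent) to load it.  */
    printf ("%016llx\n", (unsigned long long) bar64 (raw + 1));
    return 0;
  }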
Richard Earnshaw changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |FIXED
             Status|NEW                         |RESOLVED

--- Comment #10 from Richard Earnshaw ---
Fixed on master branch.
--- Comment #9 from CVS Commits ---
The master branch has been updated by Richard Earnshaw:

https://gcc.gnu.org/g:5f6a6c91d7c592cb49f7c519f289777eac09bb74

commit r12-3482-g5f6a6c91d7c592cb49f7c519f289777eac09bb74
Author: Richard Earnshaw
Date:   Fri Sep 3 17:06:15 2021 +0100

    gimple: allow more folding of memcpy [PR102125]

    The current restriction on folding memcpy to a single element of size
    MOVE_MAX is excessively cautious on most machines and limits some
    significant further optimizations.  So relax the restriction provided
    the copy size does not exceed MOVE_MAX * MOVE_RATIO and that a SET
    insn exists for moving the value into machine registers.

    Note that there were already checks in place for having misaligned
    move operations when one or more of the operands were unaligned.

    On Arm this now permits optimizing

    uint64_t bar64 (const uint8_t *rData1)
    {
      uint64_t buffer;
      memcpy (&buffer, rData1, sizeof (buffer));
      return buffer;
    }

    from

            ldr     r2, [r0]        @ unaligned
            sub     sp, sp, #8
            ldr     r3, [r0, #4]    @ unaligned
            strd    r2, [sp]
            ldrd    r0, [sp]
            add     sp, sp, #8

    to

            mov     r3, r0
            ldr     r0, [r0]        @ unaligned
            ldr     r1, [r3, #4]    @ unaligned

    PR target/102125 - (ARM Cortex-M3 and newer) missed optimization.
    memcpy not needed operations

    gcc/ChangeLog:

            PR target/102125
            * gimple-fold.c (gimple_fold_builtin_memory_op): Allow
            folding memcpy if the size is not more than
            MOVE_MAX * MOVE_RATIO.
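The relaxed size gate can be sketched in C (a sketch with illustrative
values, not the actual gimple-fold.c source; the real MOVE_MAX and
MOVE_RATIO come from the target macros):

  #include <stdbool.h>

  #define MOVE_MAX   4  /* example: bytes per integer-register move on arm */
  #define MOVE_RATIO 2  /* example: number of such moves worth inlining */

  /* Old gate: len <= MOVE_MAX.  New gate: len <= MOVE_MAX * MOVE_RATIO,
     provided a SET insn exists for the wider mode (not modelled here).  */
  bool fold_memcpy_ok (unsigned long len)
  {
    return len <= (unsigned long) (MOVE_MAX * MOVE_RATIO);
  }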
--- Comment #8 from CVS Commits ---
The master branch has been updated by Richard Earnshaw:

https://gcc.gnu.org/g:f0cfd070b68772eaaa19a3b711fbd9e85b244240

commit r12-3481-gf0cfd070b68772eaaa19a3b711fbd9e85b244240
Author: Richard Earnshaw
Date:   Fri Sep 3 16:53:13 2021 +0100

    arm: expand handling of movmisalign for DImode [PR102125]

    DImode is currently handled only for machines with vector modes
    enabled, but this is unduly restrictive and is generally better done
    in core registers.

    gcc/ChangeLog:

            PR target/102125
            * config/arm/arm.md (movmisaligndi): New define_expand.
            * config/arm/vec-common.md (movmisalign<mode>): Iterate
            over VDQ mode.
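Conceptually, handling a misaligned DImode move in core registers amounts to
two misaligned SImode moves; a C picture of that split (an illustration of
the idea, not the .md expander itself):

  #include <stdint.h>
  #include <string.h>

  /* Move 8 possibly misaligned bytes as two 4-byte core-register moves.  */
  void movmisalign_di (void *dst, const void *src)
  {
    uint32_t lo, hi;
    memcpy (&lo, (const char *) src, 4);
    memcpy (&hi, (const char *) src + 4, 4);
    memcpy ((char *) dst, &lo, 4);
    memcpy ((char *) dst + 4, &hi, 4);
  }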
--- Comment #7 from CVS Commits ---
The master branch has been updated by Richard Earnshaw:

https://gcc.gnu.org/g:408e8b906632f215f6652b8851bba612cde07c25

commit r12-3480-g408e8b906632f215f6652b8851bba612cde07c25
Author: Richard Earnshaw
Date:   Thu Sep 9 10:56:01 2021 +0100

    rtl: directly handle MEM in gen_highpart [PR102125]

    gen_lowpart_general handles forming a lowpart of a MEM by using
    adjust_address to rework and validate a new version of the MEM.
    Do the same for gen_highpart rather than calling simplify_gen_subreg
    for this case.

    gcc/ChangeLog:

            PR target/102125
            * emit-rtl.c (gen_highpart): Use adjust_address to handle
            MEM rather than calling simplify_gen_subreg.
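The effect on a memory operand can be pictured in C (illustrative only;
gen_highpart operates on RTL, not on pointers): adjusting the address of a
DImode MEM by 4 bytes yields its SImode high part on a little-endian 32-bit
target.

  #include <stdint.h>
  #include <string.h>

  /* C picture of gen_highpart on a DImode MEM via address adjustment
     (little-endian: the high SImode half sits at byte offset 4).  */
  uint32_t di_mem_highpart (const void *mem)
  {
    uint32_t hi;
    memcpy (&hi, (const char *) mem + 4, sizeof hi);
    return hi;
  }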
--- Comment #6 from Richard Earnshaw ---
(In reply to Richard Biener from comment #2)
> One common source of missed optimizations is gimple_fold_builtin_memory_op
> which has [...]

Yes, this is the source of the problem.  I wonder if this should be scaled
by something like MOVE_RATIO to get a more acceptable limit, especially at
higher optimization levels.
--- Comment #5 from Richard Earnshaw ---
Testcase was not quite complete.  Extending it to:

typedef unsigned long long uint64_t;
typedef unsigned long uint32_t;
typedef unsigned char uint8_t;

uint64_t bar64 (const uint8_t *rData1)
{
  uint64_t buffer;
  __builtin_memcpy (&buffer, rData1, sizeof (buffer));
  return buffer;
}

uint32_t bar32 (const uint8_t *rData1)
{
  uint32_t buffer;
  __builtin_memcpy (&buffer, rData1, sizeof (buffer));
  return buffer;
}

and then looking at the optimized tree output we see:

;; Function bar64 (bar64, funcdef_no=0, decl_uid=4196, cgraph_uid=1,
symbol_order=0)

uint64_t bar64 (const uint8_t * rData1)
{
  uint64_t buffer;
  uint64_t _4;

  <bb 2> [local count: 1073741824]:
  __builtin_memcpy (&buffer, rData1_2(D), 8);
  _4 = buffer;
  buffer ={v} {CLOBBER};
  return _4;
}

;; Function bar32 (bar32, funcdef_no=1, decl_uid=4200, cgraph_uid=2,
symbol_order=1)

uint32_t bar32 (const uint8_t * rData1)
{
  unsigned int _3;

  <bb 2> [local count: 1073741824]:
  _3 = MEM <unsigned int> [(char * {ref-all})rData1_2(D)];
  return _3;
}

So in the 32-bit case we've eliminated the memcpy at the tree level, but
failed to do that for 64-bit objects.  We probably need to add 64-bit
support to the movmisalign pattern.
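(For reference: dumps of this kind come from -fdump-tree-optimized.
Assuming an arm cross-toolchain is installed, something like

  arm-none-eabi-gcc -O2 -fdump-tree-optimized -c test.c

reproduces them; the exact driver name depends on the installed toolchain.)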
--- Comment #4 from Piotr ---

        mov     r3, r0
        ldr     r0, [r0]        @ unaligned
        ldr     r1, [r3, #4]    @ unaligned
        bx      lr

can be optimized even more:

        ldr     r1, [r0, #4]    @ unaligned
        ldr     r0, [r0]        @ unaligned
        bx      lr

(Loading r1 from [r0, #4] first leaves r0 free to be overwritten by its own
load, so the mov is unnecessary.)
Andrew Pinski changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           See Also|                            |https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91674

--- Comment #3 from Andrew Pinski ---
I suspect PR 91674 is the same.
Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|c                           |target
   Last reconfirmed|                            |2021-08-30
             Target|                            |arm
           Keywords|                            |missed-optimization
     Ever confirmed|0                           |1
             Status|UNCONFIRMED                 |NEW

--- Comment #2 from Richard Biener ---
One common source of missed optimizations is gimple_fold_builtin_memory_op
which has

      /* If we can perform the copy efficiently with first doing all loads
         and then all stores inline it that way.  Currently efficiently
         means that we can load all the memory into a single integer
         register which is what MOVE_MAX gives us.  */
      src_align = get_pointer_alignment (src);
      dest_align = get_pointer_alignment (dest);
      if (tree_fits_uhwi_p (len)
          && compare_tree_int (len, MOVE_MAX) <= 0
...
          /* If the destination pointer is not aligned we must be able
             to emit an unaligned store.  */
          && (dest_align >= GET_MODE_ALIGNMENT (mode)
              || !targetm.slow_unaligned_access (mode, dest_align)
              || (optab_handler (movmisalign_optab, mode)
                  != CODE_FOR_nothing)))

where here likely the MOVE_MAX limit applies (it is 4).  Since we actually
do need to perform two loads, the code seems to do what is intended (but
that's of course "bad" for 64-bit copies on 32-bit archs and likewise for
128-bit copies on 64-bit archs).

It's usually too late for RTL memcpy expansion to fully elide stack storage.
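A minimal pair exercising the quoted gate (assuming MOVE_MAX is 4, as
stated above for arm; the functions are illustrative, not from the bug
report):

  #include <stdint.h>
  #include <string.h>

  /* len == 4 <= MOVE_MAX: folded to a single (possibly misaligned) load.  */
  uint32_t load32 (const uint8_t *p)
  {
    uint32_t v;
    memcpy (&v, p, sizeof v);
    return v;
  }

  /* len == 8 > MOVE_MAX: the gate rejects it, so the memcpy survives to
     RTL expansion and the stack temporary is not elided.  */
  uint64_t load64 (const uint8_t *p)
  {
    uint64_t v;
    memcpy (&v, p, sizeof v);
    return v;
  }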