[Bug target/102125] (ARM Cortex-M3 and newer) missed optimization. memcpy not needed operations

2022-03-23 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102125

--- Comment #11 from CVS Commits ---
The master branch has been updated by Richard Biener:

https://gcc.gnu.org/g:d9792f8d227cdd409c2b082ef0685b47ccfaa334

commit r12-7786-gd9792f8d227cdd409c2b082ef0685b47ccfaa334
Author: Richard Biener 
Date:   Wed Mar 23 14:53:49 2022 +0100

target/102125 - alternative memcpy folding improvement

The following extends the heuristical memcpy folding path with the
ability to use misaligned accesses on strict-alignment targets just
like the size-based path does.  That avoids regressing the following
testcase on arm

uint64_t bar64(const uint8_t *rData1)
{
    uint64_t buffer;
    memcpy(&buffer, rData1, sizeof(buffer));
    return buffer;
}

when r12-3482-g5f6a6c91d7c592 is reverted.

2022-03-23  Richard Biener  

PR target/102125
* gimple-fold.cc (gimple_fold_builtin_memory_op): Allow the
use of movmisalign when either the source or destination
decl is properly aligned.
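
To illustrate the "either the source or destination decl is properly aligned"
condition above, a hypothetical companion testcase (not from the bug report)
covering the other direction, where the source decl is the aligned side and
the destination pointer may be misaligned, could look like this:

#include <stdint.h>
#include <string.h>

/* Sketch only: buffer is a properly aligned local (source decl), while
   out may be unaligned, so the folding path described above can use a
   misaligned store instead of keeping the memcpy call.  */
void put64(uint8_t *out, uint64_t value)
{
    uint64_t buffer = value;
    memcpy(out, &buffer, sizeof(buffer));
}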

[Bug target/102125] (ARM Cortex-M3 and newer) missed optimization. memcpy not needed operations

2021-09-13 Thread rearnsha at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102125

Richard Earnshaw changed:

 What               |Removed      |Added
--------------------+-------------+---------------------------
 Resolution         |---          |FIXED
 Status             |NEW          |RESOLVED

--- Comment #10 from Richard Earnshaw ---
Fixed on master branch.

[Bug target/102125] (ARM Cortex-M3 and newer) missed optimization. memcpy not needed operations

2021-09-13 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102125

--- Comment #9 from CVS Commits ---
The master branch has been updated by Richard Earnshaw:

https://gcc.gnu.org/g:5f6a6c91d7c592cb49f7c519f289777eac09bb74

commit r12-3482-g5f6a6c91d7c592cb49f7c519f289777eac09bb74
Author: Richard Earnshaw 
Date:   Fri Sep 3 17:06:15 2021 +0100

gimple: allow more folding of memcpy [PR102125]

The current restriction on folding memcpy to a single element of size
MOVE_MAX is excessively cautious on most machines and limits some
significant further optimizations.  So relax the restriction provided
the copy size does not exceed MOVE_MAX * MOVE_RATIO and that a SET
insn exists for moving the value into machine registers.

Note that there were already checks in place for having misaligned
move operations when one or more of the operands were unaligned.

On Arm this now permits optimizing

uint64_t bar64(const uint8_t *rData1)
{
    uint64_t buffer;
    memcpy(&buffer, rData1, sizeof(buffer));
    return buffer;
}

from

        ldr     r2, [r0]        @ unaligned
        sub     sp, sp, #8
        ldr     r3, [r0, #4]    @ unaligned
        strd    r2, [sp]
        ldrd    r0, [sp]
        add     sp, sp, #8

to

        mov     r3, r0
        ldr     r0, [r0]        @ unaligned
        ldr     r1, [r3, #4]    @ unaligned

PR target/102125 - (ARM Cortex-M3 and newer) missed optimization. memcpy
not needed operations

gcc/ChangeLog:

PR target/102125
* gimple-fold.c (gimple_fold_builtin_memory_op): Allow folding
memcpy if the size is not more than MOVE_MAX * MOVE_RATIO.
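
As a rough, self-contained restatement of the relaxed size test described in
this commit (a sketch only: MOVE_MAX and MOVE_RATIO below are stand-in
constants chosen for illustration, not the real per-target GCC macros, and
the required SET insn check is not modelled):

#include <stdbool.h>
#include <stddef.h>

#define MOVE_MAX   4   /* widest single integer-register move on 32-bit Arm */
#define MOVE_RATIO 4   /* assumed value for illustration                    */

/* Old heuristic: fold only when len <= MOVE_MAX.
   New heuristic: fold when len <= MOVE_MAX * MOVE_RATIO.  */
static bool memcpy_size_allows_folding(size_t len)
{
    return len <= (size_t)MOVE_MAX * MOVE_RATIO;
}

With these stand-in values the 8-byte copy in bar64 (twice MOVE_MAX) now
falls within the limit, which is what enables the improved code above.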

[Bug target/102125] (ARM Cortex-M3 and newer) missed optimization. memcpy not needed operations

2021-09-13 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102125

--- Comment #8 from CVS Commits ---
The master branch has been updated by Richard Earnshaw:

https://gcc.gnu.org/g:f0cfd070b68772eaaa19a3b711fbd9e85b244240

commit r12-3481-gf0cfd070b68772eaaa19a3b711fbd9e85b244240
Author: Richard Earnshaw 
Date:   Fri Sep 3 16:53:13 2021 +0100

arm: expand handling of movmisalign for DImode [PR102125]

DImode is currently handled only for machines with vector modes
enabled, but this is unduly restrictive and is generally better done
in core registers.

gcc/ChangeLog:

PR target/102125
* config/arm/arm.md (movmisaligndi): New define_expand.
* config/arm/vec-common.md (movmisalign): Iterate over VDQ
mode.
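
A minimal user-level way to reach DImode misaligned moves (a sketch, not
taken from the bug report; struct rec and get_value are illustrative names)
is an under-aligned 64-bit field in a packed struct, which can now be
expanded through the new movmisaligndi pattern instead of requiring vector
modes:

#include <stdint.h>

struct __attribute__((packed)) rec {
    uint8_t  tag;
    uint64_t value;   /* only byte-aligned inside the packed struct */
};

uint64_t get_value(const struct rec *r)
{
    return r->value;  /* misaligned DImode load */
}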

[Bug target/102125] (ARM Cortex-M3 and newer) missed optimization. memcpy not needed operations

2021-09-13 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102125

--- Comment #7 from CVS Commits ---
The master branch has been updated by Richard Earnshaw:

https://gcc.gnu.org/g:408e8b906632f215f6652b8851bba612cde07c25

commit r12-3480-g408e8b906632f215f6652b8851bba612cde07c25
Author: Richard Earnshaw 
Date:   Thu Sep 9 10:56:01 2021 +0100

rtl: directly handle MEM in gen_highpart [PR102125]

gen_lowpart_general handles forming a lowpart of a MEM by using
adjust_address to rework and validate a new version of the MEM.
Do the same for gen_highpart rather than calling simplify_gen_subreg
for this case.

gcc/ChangeLog:

PR target/102125
* emit-rtl.c (gen_highpart): Use adjust_address to handle
MEM rather than calling simplify_gen_subreg.
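
As a rough user-level analogue of what this change means for a MEM operand
(a sketch assuming a little-endian 32-bit target; highpart32_le is an
illustrative name): the high part of a 64-bit object in memory is simply a
32-bit access at a byte offset, which adjust_address can express directly,
rather than a subreg of a register.

#include <stdint.h>
#include <string.h>

uint32_t highpart32_le(const void *p64)
{
    uint32_t hi;
    /* Upper word lives at byte offset 4 on a little-endian 32-bit target.  */
    memcpy(&hi, (const uint8_t *)p64 + 4, sizeof(hi));
    return hi;
}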

[Bug target/102125] (ARM Cortex-M3 and newer) missed optimization. memcpy not needed operations

2021-08-31 Thread rearnsha at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102125

--- Comment #6 from Richard Earnshaw ---
(In reply to Richard Biener from comment #2)
> One common source of missed optimizations is gimple_fold_builtin_memory_op
> which has [...]

Yes, this is the source of the problem.  I wonder if this should be scaled by
something like MOVE_RATIO to get a more acceptable limit, especially at higher
optimization levels.

[Bug target/102125] (ARM Cortex-M3 and newer) missed optimization. memcpy not needed operations

2021-08-31 Thread rearnsha at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102125

--- Comment #5 from Richard Earnshaw ---
Testcase was not quite complete.  Extending it to:

typedef unsigned long long uint64_t;
typedef unsigned long uint32_t;
typedef unsigned char uint8_t;

uint64_t bar64(const uint8_t *rData1)
{
    uint64_t buffer;
    __builtin_memcpy(&buffer, rData1, sizeof(buffer));
    return buffer;
}

uint32_t bar32(const uint8_t *rData1)
{
    uint32_t buffer;
    __builtin_memcpy(&buffer, rData1, sizeof(buffer));
    return buffer;
}

and then looking at the optimized tree output we see:


;; Function bar64 (bar64, funcdef_no=0, decl_uid=4196, cgraph_uid=1,
symbol_order=0)

uint64_t bar64 (const uint8_t * rData1)
{
  uint64_t buffer;
  uint64_t _4;

  <bb 2> [local count: 1073741824]:
  __builtin_memcpy (&buffer, rData1_2(D), 8);
  _4 = buffer;
  buffer ={v} {CLOBBER};
  return _4;

}



;; Function bar32 (bar32, funcdef_no=1, decl_uid=4200, cgraph_uid=2,
symbol_order=1)

uint32_t bar32 (const uint8_t * rData1)
{
  unsigned int _3;

  <bb 2> [local count: 1073741824]:
  _3 = MEM <unsigned int> [(char * {ref-all})rData1_2(D)];
  return _3;

}

So in the 32-bit case we've eliminated the memcpy at the tree level, but failed
to do that for 64-bit objects.

We probably need to add 64-bit support to the movmisalign pattern.
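
For comparison, the folded form we are after for the 64-bit case corresponds
at the source level to a single under-aligned load, roughly (a sketch; the
packed, may_alias wrapper is illustrative and not part of the testcase):

#include <stdint.h>

struct __attribute__((packed, may_alias)) u64_unaligned { uint64_t v; };

uint64_t bar64_folded(const uint8_t *rData1)
{
    /* One byte-aligned 64-bit load; this is what the memcpy in bar64
       should become once DImode movmisalign support is added.  */
    return ((const struct u64_unaligned *)rData1)->v;
}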

[Bug target/102125] (ARM Cortex-M3 and newer) missed optimization. memcpy not needed operations

2021-08-30 Thread jankowski938 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102125

--- Comment #4 from Piotr ---
mov r3, r0
ldr r0, [r0]  @ unaligned
ldr r1, [r3, #4]  @ unaligned
bx  lr

can be optimized even more:

ldr r1, [r0, #4]  @ unaligned
ldr r0, [r0]  @ unaligned
bx  lr

[Bug target/102125] (ARM Cortex-M3 and newer) missed optimization. memcpy not needed operations

2021-08-30 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102125

Andrew Pinski changed:

 What               |Removed      |Added
--------------------+-------------+---------------------------
 See Also           |             |https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91674

--- Comment #3 from Andrew Pinski ---
I suspect PR 91674 is the same.

[Bug target/102125] (ARM Cortex-M3 and newer) missed optimization. memcpy not needed operations

2021-08-30 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102125

Richard Biener changed:

 What               |Removed      |Added
--------------------+-------------+---------------------------
 Component          |c            |target
 Last reconfirmed   |             |2021-08-30
 Target             |             |arm
 Keywords           |             |missed-optimization
 Ever confirmed     |0            |1
 Status             |UNCONFIRMED  |NEW

--- Comment #2 from Richard Biener ---
One common source of missed optimizations is gimple_fold_builtin_memory_op
which has

  /* If we can perform the copy efficiently with first doing all loads
 and then all stores inline it that way.  Currently efficiently
 means that we can load all the memory into a single integer
 register which is what MOVE_MAX gives us.  */
  src_align = get_pointer_alignment (src);
  dest_align = get_pointer_alignment (dest);
  if (tree_fits_uhwi_p (len)
  && compare_tree_int (len, MOVE_MAX) <= 0
...
  /* If the destination pointer is not aligned we must be able
 to emit an unaligned store.  */
  && (dest_align >= GET_MODE_ALIGNMENT (mode)
  || !targetm.slow_unaligned_access (mode, dest_align)
  || (optab_handler (movmisalign_optab, mode)
  != CODE_FOR_nothing)))

where the MOVE_MAX limit (it is 4 on arm) is likely what applies here.  Since
we actually do need to perform two loads, the code seems to do what is
intended (but that is of course "bad" for 64-bit copies on 32-bit targets and
likewise for 128-bit copies on 64-bit targets).

It's usually too late for RTL memcpy expansion to fully elide stack storage.