[Bug target/69460] ARM Cortex M0 produces suboptimal code vs Cortex M3

2017-05-04 Thread strntydog at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69460

--- Comment #6 from strntydog at gmail dot com ---
I have built GCC 7.1.0 and tested this optimisation bug against it.  It
persists.  Further, the new cortex-m23 target is affected by the bug in exactly
the same way as Cortex M0/M0+ and M1.

The new cortex-m33 target behaves the same as the cortex-m3, in that it
produces legal code for the cortex-m23/m0/m0+/m1 but it is much better
optimised.

[Bug target/69460] ARM Cortex M0 produces suboptimal code vs Cortex M3

2017-05-01 Thread strntydog at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69460

--- Comment #5 from strntydog at gmail dot com ---
I also just calculated the number of cycles each function takes:

Test 1 - 50% More CPU Cycles
Test 2 - 25% More CPU Cycles
Test 3 - 5% More CPU Cycles
Test 4 - 39% More CPU Cycles
Test 5 - 6% More CPU Cycles
Test 6 - 46% More CPU Cycles

These figures assume zero-wait-state access to memory; any wait states will make
the differences worse, as the excess cycles are the result of extra flash accesses.

So even Tests 3 and 5, which have the same code size, will run ~5% slower than
they should, which is significant.  The worst cases will be dramatically slower.

This bug leads not only to slower execution, but that has a direct impact on
Power Efficiency and battery life in battery powered devices (which are a
target market for M0/M0+ processors).

These are extremely common and simple memory access patterns, and every single
M0/M0+ program will be negatively affected by this.

[Bug target/69460] ARM Cortex M0 produces suboptimal code vs Cortex M3

2017-04-30 Thread strntydog at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69460

strntydog at gmail dot com changed:

   What|Removed |Added

Version|5.2.0   |6.3.1

--- Comment #4 from strntydog at gmail dot com ---
OK, so I just tested whether this problem with Cortex M0/M0+ code generation
persists in GCC 6.3.1, which is the latest GCC binary distributed by the Arm
Embedded folks.  It does.

To put the Optimisation failure into perspective, this is the difference
between the 6 tests in the test case:

Test 1 - Code size is 40% bigger for M0, and the function is 114% bigger.
Test 2 - Code size is 20% bigger for M0, and the function is 44% bigger.
Test 3 - Code size is the same between M0 and M3, but the function is 43% bigger.
Test 4 - Code size is 40% bigger for M0, and the function is 86% bigger.
Test 5 - Code size is the same between M0 and M3, but the function is 14% bigger.
Test 6 - Code size is 38% bigger for M0, and the function is 100% bigger.

These are HUGE.  

This means that, on average, these functions will run about 22% slower than they
should and consume 67% more flash space than they should.  The worst cases from
my tests are over twice as large as they need to be and use 40% more
instructions to achieve the same thing.

This problem is easily shown to occur when accessing memory locations at known
addresses, something which microcontroller programs do all the time. It affects
every single M0 application compiled with GCC, wasting flash and running slower.

Note: Code size refers to the number of instructions in the function; function
size is the code size plus the function's literal data.  Code size is a measure
of performance on the M0, because more instructions mean more cycles to
execute, and function size is a measure of flash wastage.

[Bug target/69460] ARM Cortex M0 produces suboptimal code vs Cortex M3

2016-01-25 Thread rearnsha at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69460

Richard Earnshaw  changed:

   What|Removed |Added

   Keywords||missed-optimization
 Target||arm
 Status|UNCONFIRMED |NEW
   Last reconfirmed||2016-01-25
  Component|rtl-optimization|target
 Ever confirmed|0   |1

--- Comment #3 from Richard Earnshaw  ---
Confirmed.  This is a costing issue in the thumb1 code generation path.  I
think it's done this way to try to avoid creating long-lived expressions which
can harm register allocation.  In this case it means that the post-register-
allocation pass fails to spot that the address expressions can be simplified.

One possible way of addressing this might be to reflect the true costs of
expressions once register allocation has completed.  By then we know we can't
create any new register allocation issues.