[Bug rtl-optimization/78664] LRA must honor REG_ALLOC_ORDER to pick reload registers

2024-05-17 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78664

--- Comment #2 from Vladimir Makarov  ---
During the register assignment subpass, LRA processes hard regs in the order
given by ira_class_hard_regs.  Under otherwise equal conditions (e.g. costs),
LRA chooses the reg processed first.

ira_class_hard_regs contains regs ordered according to REG_ALLOC_ORDER.
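
The tie-breaking rule described above can be illustrated with a minimal Python sketch.  It is not LRA code: pick_hard_reg, the order list and the cost table are hypothetical names and values chosen only to show that, with a strict comparison, the reg seen first wins when costs are equal.

```python
def pick_hard_reg(class_hard_regs, cost, is_free):
    """Return the cheapest free hard reg; on a cost tie the earliest one in
    class_hard_regs wins.  Hypothetical helper, not an LRA function."""
    best = None
    for regno in class_hard_regs:
        if not is_free(regno):
            continue
        # Strict '<' keeps the reg processed first when costs are equal.
        if best is None or cost[regno] < cost[best]:
            best = regno
    return best


# Hypothetical allocation order and costs (assumed values).
order = [3, 1, 2, 0]
cost = {0: 5, 1: 5, 2: 7, 3: 5}

all_free = pick_hard_reg(order, cost, lambda r: True)
reg3_busy = pick_hard_reg(order, cost, lambda r: r != 3)
```

With all regs free the first reg in the order is chosen; when it is busy, the next reg with the same cost is taken.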

LRA has code for balanced use of hard regs (controlled by the hook
register_usage_leveling_p), but it is now switched off by default.  Probably
the issue occurred when the code was switched on.

To be sure I tried the test case and I was not able to reproduce the problem.

So I think the problem has been solved.

[Bug rtl-optimization/115013] [15 Regression] LRA: PR114810 fix result in ICE in the RISC-V Vector

2024-05-10 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115013

Vladimir Makarov  changed:

   What|Removed |Added

 CC||vmakarov at gcc dot gnu.org

--- Comment #2 from Vladimir Makarov  ---
Sorry for the trouble.  I've started to work on this PR.  The ETA for the fix
is Monday.

[Bug rtl-optimization/114415] [13 Regression] wrong code with -Oz -fno-dce -fno-forward-propagate -flive-range-shrinkage -fweb since r13-1826

2024-05-09 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114415

--- Comment #11 from Vladimir Makarov  ---
(In reply to Jakub Jelinek from comment #10)
> Vlad, do you plan to backport this to 13.3?  One of the 2 release blockers
> we have for that release.

Ok, I'll port it to the releases/gcc-13 branch today.  The patch should be safe.

[Bug target/114942] [14/15 Regression] ICE on valid code at -O1 with "-fno-tree-sra -fno-guess-branch-probability": in extract_constrain_insn, at recog.cc:2713

2024-05-08 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114942

--- Comment #5 from Vladimir Makarov  ---
I've started to work on this PR.  I hope a patch will be ready this week or
next.

[Bug rtl-optimization/114766] ^ constraint modifier unexpectedly affects register class selection.

2024-04-24 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114766

--- Comment #3 from Vladimir Makarov  ---
(In reply to Tamar Christina from comment #2)
> (In reply to Vladimir Makarov from comment #1)
> > (In reply to Tamar Christina from comment #0)
> > > The documentation for ^ states:
> >
> > If it works for you, we could try to use the patch (although it needs some
> > investigation how other targets uses the hint).  In any case, the
> > documentation should be modified or made more clear depending on applying or
> > not applying the patch.
> 
> Yeah, using the patch gives us the behavior we expected, we added a
> workaround for now so we can investigate what other targets do in GCC 15.
> 
> But while looking at this we also got some unexpected behavior with using ?
>
>   r103 costs: W8_W11_REGS:2000 W12_W15_REGS:2000 TAILCALL_ADDR_REGS:2000
> STUB_REGS:2000 GENERAL_REGS:2000 FP_LO8_REGS:0 FP_LO_REGS:0 FP_REGS:0
> POINTER_AND_FP_REGS:7000 MEM:9000
> 
> In this particular pattern the ? isn't needed so we're removing it, but the
> behavior is still unexpected.

'?' is a very old hint (unlike ^ and @).  It has been used by all targets for
many years.  The IRA cost calculation uses exactly the same algorithm as the
one in the now-removed regclass.c file.  Changing code related to processing
'?' would have very unpredictable consequences for many targets.  After many
years of working on RA, I am still surprised how fragile the code calculating
costs and reg classes is and how insignificant changes can result in a cascade
of GCC test failures.

There are many factors which can still result in zero cost even when '?' is
used.  You can try to use more than one '?' and see what happens.  If the cost
is still zero, I could look at what is going on in the cost calculation.

[Bug rtl-optimization/114810] [14 Regression] internal compiler error: in lra_split_hard_reg_for, at lra-assigns.cc:1868 (unable to find a register to spill) {*andndi3_doubleword_bmi} with -m32 -msta

2024-04-22 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114810

--- Comment #9 from Vladimir Makarov  ---
(In reply to Uroš Bizjak from comment #7)
>
> 
> Please note that the insn is defined as:
> 
> (define_insn_and_split "*andn3_doubleword_bmi"
>   [(set (match_operand: 0 "register_operand" "=,r,r")
>   (and:
> (not: (match_operand: 1 "register_operand" "r,0,r"))
> (match_operand: 2 "nonimmediate_operand" "ro,ro,0")))
>(clobber (reg:CC FLAGS_REG))]
> 
> where the problematic alternative (=,r,ro) allows a memory input in its
> operand 2 constraint. The allocator could spill a DImode value to a stack in
> advance and reload the value from the memory in this particular alternative.

That is not how LRA (and the old reload) works.  If an operand already matches
the constraint (r in ro), LRA does not change its location (i.e. it generates
no reloads).

In general, it is possible to implement reloads for operands already matched to
a constraint, but this would significantly complicate the already too
complicated code.  And heuristics based on reload costs would probably reject
such reloads anyway.

I probably could implement reg starvation recognition in process_alt_operand
and penalize the alternative, and most probably it would not affect other
targets.  Still, it is not easy because of the different possible class subsets
or intersections.

Still I think Jakub's solution is reasonable at this stage.  If I implement my
proposed solution we could commit it after the release.

[Bug rtl-optimization/114810] [14 Regression] internal compiler error: in lra_split_hard_reg_for, at lra-assigns.cc:1868 (unable to find a register to spill) {*andndi3_doubleword_bmi} with -m32 -msta

2024-04-22 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114810

--- Comment #6 from Vladimir Makarov  ---
(In reply to Uroš Bizjak from comment #4)
> An interesting observation, when the insn is defined only with problematic
> alternative:
> 
> (define_insn_and_split "*andn3_doubleword_bmi"
>   [(set (match_operand: 0 "register_operand" "=")
>   (and:
> (not: (match_operand: 1 "register_operand" "r"))
> (match_operand: 2 "nonimmediate_operand" "ro")))
>(clobber (reg:CC FLAGS_REG))]
> 
> the compilation succeeds, and a spill to memory is emitted:
> 
> 
> (insn 1170 65 1177 7 (set (mem/c:DI (plus:SI (reg/f:SI 6 bp)
> (const_int -168 [0xff58])) [71 %sfp+-144 S8 A64])
> (reg:DI 0 ax [orig:217 _13 ] [217])) "pr114810.C":296:36 84
> {*movdi_internal}
>  (nil))
> 
> ...
> 
> (insn 987 1154  7 (parallel [
> (set (reg:DI 3 bx [453])
> (and:DI (not:DI (reg:DI 0 ax [452]))
> (mem/c:DI (plus:SI (reg/f:SI 6 bp)
> (const_int -168 [0xff58])) [71
> %sfp+-144 S8 A64])))
> (clobber (reg:CC 17 flags))
> ]) "pr114810.C":296:6 703 {*andndi3_doubleword_bmi}
>  (nil))

The problem is that the alternative assumes 3 DI values live simultaneously.
This means 6 regs and we have only 6 available ones.  One input operand is
assigned to reg 0 and another one to reg 3.  So we have [01]2[34]5, where regs
in brackets are taken by the operands.  Although 2 regs are still free, they
cannot be used as they are not adjacent.
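
The [01]2[34]5 situation can be reproduced with a minimal Python sketch.  It is only a model: free_adjacent_pair is a hypothetical helper, and it assumes that a DI value needs any two consecutive hard regs, ignoring alignment or class restrictions a real target may add.

```python
def free_adjacent_pair(n_regs, taken):
    """Return the first (r, r + 1) with both regs free, or None if no two
    free regs are adjacent.  Hypothetical helper for illustration only."""
    for r in range(n_regs - 1):
        if r not in taken and r + 1 not in taken:
            return (r, r + 1)
    return None


# Regs 0-1 and 3-4 hold the two input DI operands: [01]2[34]5.
taken = {0, 1, 3, 4}
free = sorted(set(range(6)) - taken)    # regs that are still unused
pair = free_adjacent_pair(6, taken)     # a pair for the third DI value

# If the inputs had been packed as [01][23]45, a pair would remain.
packed_pair = free_adjacent_pair(6, {0, 1, 2, 3})
```

Two regs are free in both layouts, but only the packed layout leaves them adjacent.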

One solution is to somehow penalize the chosen alternative by changing the
alternative heuristics in lra-constraints.cc.  But it definitely can affect
other targets in some unpredictable way.  So the solution is too risky,
especially at this stage.  Also, it might be possible that there is no
alternative with fewer than 3 live pseudos for some different insn case.

I don't see a non-risky solution right now.  I'll think about how to fix this
better.

[Bug rtl-optimization/114766] ^ constraint modifier unexpectedly affects register class selection.

2024-04-19 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114766

--- Comment #1 from Vladimir Makarov  ---
(In reply to Tamar Christina from comment #0)
> The documentation for ^ states:
> 
> "This constraint is analogous to ‘?’ but it disparages slightly the
> alternative only if the operand with the ‘^’ needs a reload."
> 
> 
> The penalty here seems incorrect, and removing it seems to get the
> constraint to work properly.
> So the question is, is it a bug, or are we using it incorrectly? or a
> documentation bug?

The current behavior of '^' is how it was originally planned.

From this point of view I would say that it is a documentation ambiguity.  The
documentation of '?' also does not clearly describe its effect on the cost
calculation and, as a consequence, on the choice of register class.

On the other hand, I don't know what is really needed.  If you need what you
expected, please try the following patch:

diff --git a/gcc/ira-costs.cc b/gcc/ira-costs.cc
index c86c5a16563..04d2f21b023 100644
--- a/gcc/ira-costs.cc
+++ b/gcc/ira-costs.cc
@@ -771,10 +771,6 @@ record_reg_classes (int n_alts, int n_ops, rtx *ops,
 	  c = *++p;
 	  break;

-	case '^':
-	  alt_cost += 2;
-	  break;
-
 	case '?':
 	  alt_cost += 2;
 	  break;

If it works for you, we could try to use the patch (although it needs some
investigation of how other targets use the hint).  In any case, the
documentation should be modified or made clearer, depending on whether the
patch is applied or not.
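
The effect of the patch on the alternative cost can be modelled with a minimal Python sketch.  It is not the code of record_reg_classes: alt_penalty and its count_caret flag are hypothetical, and only the two hints shown in the diff are modelled (each adding 2 to the alternative cost).

```python
def alt_penalty(constraint, count_caret=True):
    """Sum the cost added by '?' and '^' hints in a constraint string.

    count_caret=True models the current behavior shown in the diff;
    count_caret=False models the behavior with the patch applied.
    Hypothetical helper, not a GCC function."""
    penalty = 0
    for c in constraint:
        if c == '?' or (c == '^' and count_caret):
            penalty += 2
    return penalty


current = alt_penalty("^r")                      # '^' costs 2 today
patched = alt_penalty("^r", count_caret=False)   # no cost with the patch
doubled = alt_penalty("??r")                     # each '?' adds 2
```

In this model the patch changes only constraints containing '^'; the handling of '?' is the same in both modes.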

[Bug target/114415] [13/14 Regression] wrong code with -Oz -fno-dce -fno-forward-propagate -flive-range-shrinkage -fweb since r13-1826

2024-04-04 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114415

--- Comment #5 from Vladimir Makarov  ---
After some consideration, I've decided to fix it in the scheduler.

Such an approach solves the problem for all targets and schedulers, while still
permitting live range shrinkage (important for space optimizations) and
scheduling for all option cases.

The only downside I see is the additional dependencies for stack pointer
modifications.  But such insns have very small latency and lengthen critical
paths only slightly.

I'll push the patch today or tomorrow.

[Bug target/114415] [13/14 Regression] wrong code with -Oz -fno-dce -fno-forward-propagate -flive-range-shrinkage -fweb since r13-1826

2024-04-03 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114415

--- Comment #4 from Vladimir Makarov  ---
(In reply to Jakub Jelinek from comment #3)
> BTW, with additional -mno-red-zone there is still movement of these insns,
> 

The problem is even bigger.  Live range splitting uses a standard insn
dependency calculation of the scheduler.  The scheduler can not recognize that
there should be dependencies between insns 60/66 and 65/71:

   60: {sp:DI=sp:DI-0x40;clobber flags:CC;}
   65: {r262:DI=0;r161:DI=r132:DI<<0x2+r129:DI;r162:DI=r132:DI<<0x2+r140:DI;[r129:DI]=[r140:DI];use r132:DI;}
   66: {sp:DI=r129:DI-0x40;clobber flags:CC;}
   71: {r263:DI=0;r165:DI=r132:DI<<0x2+r136:DI;r166:DI=r132:DI<<0x2+r128:DI;[r136:DI]=[r128:DI];use r132:DI;}

Therefore it moves insns 65 and 71 before insn 60 during live range shrinkage. 
It is wrong when there is no red zone but, even worse, split3 inserts the
following

  446: [--sp:DI]=0x10
  447: cx:DI=[sp:DI++]

between insns 65 and 71 before insn 60 which makes code wrong (rewriting memory
updated by insn 65) even if there is a red zone.

The analogous problem could occur in sched1 and/or sched2 when we don't use
live range shrinkage.

There is no way for the scheduler to find and create the specific deps 60->65,
66->65 (too much analysis is required to find that insn 65 or 71 works with the
stack).

I see two possible solutions:
  1. prohibit sched1, sched2, and live range shrinkage when accumulating args
is used
  2. create deps between any stack modification insns and memory modification
insns

The first one is easier and affects only one target (although the same problem
can occur on other targets).  Still, the same should probably be done for the
selective scheduler.

The second one is the safest approach solving problems on all targets but may
affect performance of other targets.  The fix can require more time to
implement.
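
The second solution can be illustrated with a minimal Python sketch.  It is not scheduler code: the Insn record, its two flags and stack_mem_deps are hypothetical, and the quadratic pairwise loop is chosen for clarity, not efficiency.  An ordering dependency is added whenever one insn of a pair modifies the stack pointer and the other modifies memory.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Insn:
    """Hypothetical insn summary: only what the sketch needs."""
    uid: int
    modifies_sp: bool = False
    modifies_mem: bool = False


def stack_mem_deps(insns):
    """Return (earlier, later) uid pairs that must keep their order because
    one insn modifies the stack pointer and the other modifies memory."""
    deps = []
    for i, earlier in enumerate(insns):
        for later in insns[i + 1:]:
            if ((earlier.modifies_sp and later.modifies_mem)
                    or (earlier.modifies_mem and later.modifies_sp)):
                deps.append((earlier.uid, later.uid))
    return deps


# The four insns from the dump above, in their original order.
insns = [
    Insn(60, modifies_sp=True),
    Insn(65, modifies_mem=True),
    Insn(66, modifies_sp=True),
    Insn(71, modifies_mem=True),
]
deps = stack_mem_deps(insns)
```

With these dependencies insn 65 can no longer move before insn 60, at the price of extra edges such as 65->66 that constrain scheduling on every target.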

I'll think a bit about the possible fixes and inform you tomorrow.

[Bug c++/114480] g++: internal compiler error: Segmentation fault signal terminated program cc1plus

2024-03-27 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114480

--- Comment #11 from Vladimir Makarov  ---
My finding is that RA is not a problem for GCC speed with -O1 and up.

RA in -O0 really does consume a big portion of GCC compile time.  The
biggest part of RA in -O0 is actually spent in life analysis.  It is
difficult to implement a modest RA w/o life analysis as it would
result in huge stack slot generation (not knowing pseudo lives
basically means allocating a stack slot for each pseudo).
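
Why life analysis reduces stack slot generation can be shown with a minimal Python sketch.  It is not RA code: the live ranges are made-up (start, end) program points, and the slot sharing is a simple greedy interval colouring chosen for illustration.

```python
def slots_without_liveness(live_ranges):
    """Without life information every pseudo needs its own stack slot."""
    return len(live_ranges)


def slots_with_liveness(live_ranges):
    """Greedy interval colouring: reuse a slot once the pseudo that
    occupied it is dead.  Ranges are inclusive (start, end) points."""
    busy_until = []  # for each slot, the last point at which it is in use
    for start, end in sorted(live_ranges):
        for i, last in enumerate(busy_until):
            if last < start:          # previous occupant is already dead
                busy_until[i] = end
                break
        else:
            busy_until.append(end)    # no free slot: allocate a new one
    return len(busy_until)


# Hypothetical live ranges of five spilled pseudos.
ranges = [(0, 2), (1, 3), (4, 6), (5, 7), (8, 9)]
naive = slots_without_liveness(ranges)
shared = slots_with_liveness(ranges)
```

In this made-up case at most two pseudos are live at the same time, so two slots suffice once liveness is known, against five without it.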

The problem with the test is a huge number of pseudos (or IRA
objects).  This results in a big sparse set (which can hardly
fit in L3 cache) and bad cache behaviour.

I tried to use a bitmap instead of the sparse set, but GCC crashed after
allocating 48GB of memory.  Sbitmap works better and improves IRA time by
12%.  But it works worse for other, more frequent use cases.
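
For readers unfamiliar with the data structure, here is a minimal Python sketch of a sparse set in the classic two-array form.  It is an illustration only, not GCC's implementation: the class and its method names are assumptions.  It shows why the footprint grows with the size of the universe (here, the number of pseudos or IRA objects) rather than with the number of members.

```python
class SparseSet:
    """Two-array sparse set over the universe 0..universe-1.

    Both arrays have one entry per possible element, so memory use is
    proportional to the universe size even when the set is nearly empty.
    """

    def __init__(self, universe):
        self.sparse = [0] * universe   # element -> index into dense
        self.dense = [0] * universe    # members in insertion order
        self.members = 0

    def __contains__(self, x):
        i = self.sparse[x]
        return i < self.members and self.dense[i] == x

    def add(self, x):
        if x not in self:
            self.sparse[x] = self.members
            self.dense[self.members] = x
            self.members += 1

    def clear(self):
        self.members = 0               # constant time, arrays untouched

    def __iter__(self):
        return iter(self.dense[:self.members])


s = SparseSet(10)
for x in (7, 2, 7):
    s.add(x)
contents = list(s)
```

Membership tests touch two unrelated array positions, which is cheap for small universes but cache-unfriendly once the arrays no longer fit in cache.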

So I don't think that RA behaviour can be improved for this case.

[Bug target/99829] MVE: ICE in lra_assign at -O3

2024-03-13 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99829

--- Comment #7 from Vladimir Makarov  ---
(In reply to Maxim Kuvyrkov from comment #5)
> 
> Where did you see the timeouts, btw?

Sorry, I glanced at the C logs and interpreted them wrongly.  Please discard my
previous comment.

I should have been more careful reading the PR.  I tried the C compiler instead
of the C++ one.  Therefore I did not reproduce the bug.  But the bug is really
present for the C++ compiler.

I'll work on this PR and try to fix it this week or next.

[Bug target/99829] MVE: ICE in lra_assign at -O3

2024-03-12 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99829

--- Comment #4 from Vladimir Makarov  ---
(In reply to Maxim Kuvyrkov from comment #3)
> Hi Vladimir,
> 
> Could you take a look at this, please?

I already got a message from the automatic Linaro tester yesterday about the
new test failures and looked at them.

I was not able to reproduce them, but after looking at the provided log files
I see that the tests failed because of a timeout.

My recent patch resulted in LRA doing a bit more work and therefore the tests
(all with -O3) failed because of the timeout.

I'd recommend increasing the timeout threshold for the tester.

[Bug target/113790] [14 Regression][riscv64] ICE in curr_insn_transform, at lra-constraints.cc:4294 since r14-4944-gf55cdce3f8dd85

2024-03-08 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113790

Vladimir Makarov  changed:

   What|Removed |Added

 CC||vmakarov at gcc dot gnu.org

--- Comment #1 from Vladimir Makarov  ---
I've reproduced it and started to work on it.  I hope to fix it today or on
Monday.

[Bug target/113510] [14 Regression] [ARM Thumb] ICE in extract_constrain_insn with CPU cortex-m23

2024-01-23 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113510

Vladimir Makarov  changed:

   What|Removed |Added

 CC||vmakarov at gcc dot gnu.org

--- Comment #4 from Vladimir Makarov  ---
This is not a RA bug.  I believe it is a bug in peephole optimization.

Right after LRA and before peephole2 we have:

(insn 28 13 15 2 (set (reg:SI 12 ip [127])
        (const_int 8 [0x8])) "../../gcc/gcc/testsuite/gcc.c-torture/compile/nested-3.c":17:10 959 {*thumb1_movsi_insn}
     (nil))
(insn 15 28 16 2 (set (reg:SI 12 ip [127])
        (plus:SI (reg:SI 12 ip [127])
            (reg/f:SI 13 sp))) "../../gcc/gcc/testsuite/gcc.c-torture/compile/nested-3.c":17:10 935 {*thumb1_addsi3}
     (nil))

and peephole2 combines these two insns into

(insn 39 13 16 2 (set (reg:SI 12 ip [127])
        (plus:SI (reg/f:SI 13 sp)
            (const_int 8 [0x8]))) "../../gcc/gcc/testsuite/gcc.c-torture/compile/nested-3.c":17:10 -1
     (nil))

which is wrong, as the output and the 1st operand of *thumb1_addsi3 should be
low registers but r12 is not a low register.  If peephole2 took this into
account, it would not have combined the 2 insns and we would not have this PR.
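
The missing check can be pictured with a minimal Python sketch.  It is not peephole2 or arm backend code: is_low_reg and may_use_as_add_dest are hypothetical helpers, and the only target fact used is that the Thumb-1 low registers are r0-r7.

```python
# Thumb-1 low registers are r0..r7; ip is r12 and sp is r13.
LOW_REGS = range(0, 8)


def is_low_reg(regno):
    """True if regno is a Thumb-1 low register."""
    return regno in LOW_REGS


def may_use_as_add_dest(dest_regno):
    """Hypothetical guard for the combined insn: accept the destination
    only if it satisfies the low-register requirement stated above."""
    return is_low_reg(dest_regno)


combine_into_ip = may_use_as_add_dest(12)   # the case from this PR
combine_into_r3 = may_use_as_add_dest(3)    # a low register would be fine
```

A guard of this kind would reject the combination for r12 while still allowing it when the destination is a low register.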

[Bug rtl-optimization/113048] [13/14 Regression] ICE: in lra_split_hard_reg_for, at lra-assigns.cc:1862 (unable to find a register to spill) {*andndi3_doubleword_bmi} with -march=cascadelake since r13

2024-01-15 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113048

--- Comment #7 from Vladimir Makarov  ---
I believe this PR was recently fixed by
https://gcc.gnu.org/git/gitweb.cgi?p=gcc.git;h=a729b6e002fe76208f33fdcdee49d6a310a1940e

[Bug middle-end/113354] Regression/14: unable to find a register to spill on mips

2024-01-12 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113354

Vladimir Makarov  changed:

   What|Removed |Added

 CC||vmakarov at gcc dot gnu.org

--- Comment #2 from Vladimir Makarov  ---
Thank you for reporting this.  The issue is not in the patch itself.  The patch
simply triggered a hidden bug.

The insn in question looks like

1657: {r3001:SI=r291:SI*r294:SI+r3002:SI;clobber r2788:SI;clobber r2390:SI;}

On the 1st subpass we choose alternative with the following constraints

(0) l  (1) d  (2) d  (3) l  (4) X  (5) X {*mul_acc_si}

On the second subpass we choose alternative

(0) l  (1) d  (2) d  (3) l  (4) X  (5) X {*mul_acc_si}

p2788 happened to get MD0, and this prevents p3001 from getting MD0 too.  p2788
can be in any location for this alternative, but the LRA assignment subpass
does not take this into account.

I'll try to fix this hidden bug at the beginning of next week.

[Bug rtl-optimization/112918] [m68k] [LRA] ICE: maximum number of generated reload insns per insn achieved (90)

2023-12-21 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112918

--- Comment #15 from Vladimir Makarov  ---
The patch resulted in 2 new PRs about ICE when building glibc.  So I reverted
the patch.

I'll continue work on this PR right after the winter holidays.

[Bug rtl-optimization/113098] [14 Regression] LRA ICE building glibc for mips

2023-12-21 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113098

--- Comment #1 from Vladimir Makarov  ---
The patch causing this was reverted.

[Bug rtl-optimization/113097] [14 Regression] LRA ICE building glibc for arc

2023-12-21 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113097

--- Comment #1 from Vladimir Makarov  ---
Joseph, thank you for reporting this.  I've just reverted the patch causing
this.

I'll use this report for work on another version of the patch.

[Bug target/112918] [m68k] [LRA] ICE: maximum number of generated reload insns per insn achieved (90)

2023-12-15 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112918

--- Comment #12 from Vladimir Makarov  ---
I've been working on the PR this week.  The problem in this case is that, for a
subreg reload, LRA cannot narrow the reg class further, from ALL_REGS to
GENERAL_REGS and then to data regs or address regs.

The patch will be ready today but I am going to test it well and submit it on
Monday as it changes a sensitive part of LRA and might be risky.

[Bug rtl-optimization/112875] [14 Regression] ICE: in lra_eliminate_regs_1, at lra-eliminations.cc:670 with -Oz -frounding-math -fno-dce -fno-trapping-math -fno-tree-dce -fno-tree-dse -g

2023-12-08 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112875

--- Comment #3 from Vladimir Makarov  ---
(In reply to Jakub Jelinek from comment #2)
> Started with r14-53-g675b1a7f113adb1d737adaf78b4fd90be7a0ed1a

I reproduced it and hope to fix it today.

[Bug target/112445] [14 Regression] ICE: in lra_split_hard_reg_for, at lra-assigns.cc:1861 unable to find a register to spill: {*umulditi3_1} with -O -march=cascadelake -fwrapv since r14-4968-g89e5d90

2023-11-22 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112445

--- Comment #7 from Vladimir Makarov  ---
(In reply to Jakub Jelinek from comment #5)
> Just changing
> --- i386.md.xx	2023-11-22 09:47:22.746637132 +0100
> +++ i386.md   2023-11-22 20:38:07.216218697 +0100
> @@ -9984,7 +9984,7 @@
>[(set (match_operand: 0 "register_operand" "=r,A")
>   (mult:
> (zero_extend:
> - (match_operand:DWIH 1 "register_operand" "%d,a"))
> + (match_operand:DWIH 1 "register_operand" "%d,0"))
> (zero_extend:
>   (match_operand:DWIH 2 "nonimmediate_operand" "rm,rm"
> (clobber (reg:CC FLAGS_REG))]
> makes the testcase pass.  A question is how RA treats 0 constraint when the
> two operands have different modes, if it is basically the same as a in that

LRA treats it the same way as the reload pass.  It is the same hard reg for LE
targets.  For BE targets they are different if they require a different number
of hard regs.


> case, meaning that the first input operand will never be in %rdx even when
> the A constraint contains %rax and %rdx registers (but the double-word mode
> implies it must be low part in %rax high part in $rdx).

I looked at the testcase.  It seems it can be fixed by a different placement of
the splitting insns.  So I believe the bug will stay and can become latent if
we fix the PR in some other way.

I'll start to work on this bug on Monday as I will be absent the next two days.

[Bug middle-end/111497] [11/12/13/14 Regression] ICE building mariadb on i686 since r8-470

2023-11-13 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111497

--- Comment #7 from Vladimir Makarov  ---
(In reply to Jakub Jelinek from comment #6)
> Is this backportable to release branches or too risky?

I don't think it is risky.  LRA was designed to have unshared rtl.  So copying
rtl in LRA is not risky.

[Bug target/112337] arm: ICE in arm_effective_regno when compiling for MVE

2023-11-08 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112337

Vladimir Makarov  changed:

   What|Removed |Added

 CC||vmakarov at gcc dot gnu.org

--- Comment #7 from Vladimir Makarov  ---
(In reply to Alex Coplan from comment #6)
> Confirmed. Here's a slightly cleaned up reproducer that doesn't warn:
> 
> #pragma GCC arm "arm_mve_types.h"
> int32x4_t h(void *p) { return __builtin_mve_vldrwq_sv4si(p); }
> void g(int32x4_t);
> void f(int, int, int, short, int *p) {
>   int *bias = p;
>   for (;;) {
> int32x4_t d = h(bias);
> bias += 4;
> g(d);
>   }
> }
> 
> ICEs with -O2 -march=armv8.1-m.main+mve -mfloat-abi=hard on the trunk.

Looking at the dump, I guess the INC/DEC operand is not a reg after a temporary
IRA transformation.  It can be fixed in arm.cc by checking that the operand is
a reg instead of using the assert, but that could be wrong because the
documentation says the operand should be a reg.  Also, such a solution would
not work for a possible problem on other targets.

Could you provide me with a preprocessed test file?  I'll try to find a
solution as soon as possible.

[Bug rtl-optimization/109035] meaningless memory store on RISC-V and LoongArch

2023-11-02 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109035

--- Comment #7 from Vladimir Makarov  ---
Over the last 2 weeks I pushed several patches for better dealing with
equivalences in RA.

It seems the patches solve the current PR.  I checked the test code generation
for loongarch and aarch64 and did not find the spilled pseudos which are
reported here.

I think the PR should be closed as fixed.

[Bug rtl-optimization/112107] [14 Regression] bootstrap failure on i686-linux: gcc/ira-build.o differs

2023-10-27 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112107

--- Comment #9 from Vladimir Makarov  ---
(In reply to Sergei Trofimovich from comment #8)
> bootstrap with default options did not fail for me either. I had to use
> --enable-checking=release to trigger the failure. I wonder if it exposes the
> failure for you as well.

Yes, with --enable-checking=release I managed to reproduce a failure on clean
bootstrap.

BTW, thank you for the reproducer.  It was easier to start from it.

[Bug rtl-optimization/112107] [14 Regression] bootstrap failure on i686-linux: gcc/ira-build.o differs

2023-10-27 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112107

--- Comment #7 from Vladimir Makarov  ---
Sorry for the inconvenience caused by my patch.

I reproduced the bug with the reproducer using stage1 gcc although strangely
the standard bootstrap works ok for me on i686 debian.

I think I know the reason for this bug.  I'll fix it today.

[Bug rtl-optimization/111971] [12/13/14 regression] ICE: maximum number of generated reload insns per insn achieved (90) since r12-6803-g85419ac59724b7

2023-10-26 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111971

--- Comment #6 from Vladimir Makarov  ---
(In reply to Andrew Pinski from comment #4)
> But r1 is the argument register.

It is even worse: r1 is a stack pointer.  Still, the compilation should not
end in an LRA failure.

I've just started to work on this problem.  I hope a patch fixing this will be
committed this week or at the beginning of next week.

[Bug testsuite/111427] [14 regression] gfortran.dg/vect/pr60510.f fails after r14-3999-g3c834d85f2ec42

2023-09-29 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111427

--- Comment #3 from Vladimir Makarov  ---
Sorry for the inconvenience caused by the patch. I reverted this patch
yesterday.

[Bug middle-end/111497] [11/12/13/14 Regression] ICE building mariadb on i686 since r8-470

2023-09-22 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111497

--- Comment #4 from Vladimir Makarov  ---
I've reproduced the bug.  The problem is the combination of pseudo live range
splitting and rtl sharing.

I hope to fix this next Monday or Tuesday.

[Bug middle-end/111427] [14 regression] gfortran.dg/vect/pr60510.f fails after r14-3999-g3c834d85f2ec42

2023-09-22 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111427

--- Comment #1 from Vladimir Makarov  ---
Unfortunately, I did not manage to reproduce the bug.

[Bug target/111225] ICE in curr_insn_transform, unable to generate reloads for xor, since r14-2447-g13c556d6ae84be

2023-08-30 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111225

--- Comment #3 from Vladimir Makarov  ---
I've reproduced the bug.

Just removing `else if (spilled_pseudo_p (op))` for CT_SPECIAL_MEMORY would
break a lot of targets, but it is right that this code is the reason for the
bug.

I have ideas on how to fix it and I'll fix it next week.

[Bug rtl-optimization/110093] [12/13/14 Regression][avr] Move frenzy leading to code bloat

2023-08-30 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110093

--- Comment #5 from Vladimir Makarov  ---
(In reply to Georg-Johann Lay from comment #4)
>
> 
> So are you saying that the bug is actually in lower-subreg.cc ?

No.  Lower subreg is fine.

Sorry for being unclear.  To generate better code for the current test case (or
analogous cases) we need live analysis on the sub-register level.  Currently it
is done only on the whole pseudo-register level.

  First of all, DFA (the data flow analysis framework) should be modified.  As
I showed, DFA wrongly calculates that pseudo r44 lives at the start of BB2,
although it does not (the r44 value is not used before insn #37).  It is a big
job.  The problem is also that active development of DFA stopped a long time
ago and its developers do not work on gcc anymore.

  Secondly, after the DFA modification, RA (and maybe other optimizations)
should be modified to work with this information on the BB level.  It is a
medium-size project for me and would probably take 2-3 months of my work time.

So, looking at this situation, I would suggest making -fno-split-wide-types the
default for the AVR target to solve this and analogous PRs.  Maybe
-fsplit-wide-types is not necessary for good performance of real avr
applications.  I am not an AVR user and cannot say how severe this problem is
for real applications.

[Bug rtl-optimization/110093] [12/13/14 Regression][avr] Move frenzy leading to code bloat

2023-08-29 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110093

--- Comment #3 from Vladimir Makarov  ---
I worked on the avr issues for quite some time, and here are my findings.
Before IRA, we have at the start of BB2:

;; lr  in  14 [r14] 15 [r15] 16 [r16] 17 [r17] 18 [r18] 19 [r19] 20 [r20] 21 [r21] 22 [r22] 23 [r23] 24 [r24] 25 [r25] 28 [r28] 32 [__SP_L__] 34 [argL] 44 45 46

   33: r51:QI=r22:QI
      REG_DEAD r22:QI
   34: r52:QI=r23:QI
      REG_DEAD r23:QI
   35: r53:QI=r24:QI
      REG_DEAD r24:QI
   36: r54:QI=r25:QI
      REG_DEAD r25:QI
   37: r44:SI#0=r51:QI
      REG_DEAD r51:QI
   38: r44:SI#1=r52:QI
      REG_DEAD r52:QI
   39: r44:SI#2=r53:QI
      REG_DEAD r53:QI
   40: r44:SI#3=r54:QI
      REG_DEAD r54:QI

According to GCC, pseudo r44 conflicts with r51, r52, ...  In reality it
does not.  I could modify BB live analysis in IRA, although it is a lot of
work.

But there is a bigger problem.  A lot of passes, including IRA, use the
data-flow analysis framework for global life analysis, and it does not
work on the subreg level.  You can see that r44 still lives (lr in) at the
beginning of BB2.  DFA is not my responsibility, but I can say that
modifying DFA this way is a huge project as it will affect a lot of
targets.
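
The difference between the two granularities can be reproduced with a minimal Python sketch over the eight insns shown above.  It is not the data-flow framework: both functions are hypothetical, and the model assumes that, at whole-pseudo level, a subreg store is a partial definition that does not kill the pseudo.

```python
# (dest, dest_byte, src): dest_byte is None for a full QI definition and
# the byte number for a store into one byte of r44:SI (insns 33-40).
INSNS = [
    ("r51", None, "r22"), ("r52", None, "r23"),
    ("r53", None, "r24"), ("r54", None, "r25"),
    ("r44", 0, "r51"), ("r44", 1, "r52"),
    ("r44", 2, "r53"), ("r44", 3, "r54"),
]


def live_in_whole(insns, live_out):
    """Backward liveness on whole pseudos: a partial definition does not
    kill the pseudo, so r44 stays live through insns 37-40."""
    live = set(live_out)
    for dest, byte, src in reversed(insns):
        if byte is None:
            live.discard(dest)
        live.add(src)
    return live


def live_in_bytes(insns, live_out):
    """Backward liveness on (pseudo, byte) pairs: each byte of r44 is
    killed by the insn that defines it."""
    live = set(live_out)
    for dest, byte, src in reversed(insns):
        live.discard((dest, 0 if byte is None else byte))
        live.add((src, 0))
    return live


whole = live_in_whole(INSNS, {"r44"})
per_byte = live_in_bytes(INSNS, {("r44", b) for b in range(4)})
```

At whole-pseudo level r44 is reported live at the start of the block, as in the lr in line; with byte-level tracking only the incoming hard regs r22-r25 are live there.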

Instead, as AVR regs are very small, I propose to avoid the above RTL
code by switching off subreg3 pass (or -fsplit-wide-types) for AVR by
default as it was for gcc-8.

There is still one minor problem: an additional reg-reg move generation for the
test case in comparison with gcc-8.  I'll try to fix it.

[Bug rtl-optimization/110034] The first popped allcono doesn't take precedence over later popped in ira coloring

2023-08-24 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110034

--- Comment #4 from Vladimir Makarov  ---
Thank you for providing the test case.

To be honest, I don't see why assigning hr3 to r134 is better.
Currently we have the following assignments:

hr9->r134; hr3->r173; hr3->r124

and the related preferences:

  cp11:a18(r134)<->a29(r173)@125:shuffle
  pref3:a29(r173)<-hr3@2000
  pref4:a0(r124)<-hr3@125

This removes cost 2000 (pref3) and cost 125 (pref4) and adds cost 125
(cp11).  The profit is 2000

If we started with r173, we would have the following assignments:

hr3->r173; hr3->r134; ->r124

This would remove cost 2000 (pref3) and cost 125 (cp11) and add cost
125 (pref).  The profit would be the same 2000.
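
The arithmetic behind both cases can be checked with a minimal Python sketch.  The cost values are the ones listed above; profit is a hypothetical helper, and reading the unnamed "pref" in the second case as pref4 is an assumption.

```python
# Costs of the copy and the preferences quoted above.
COSTS = {"cp11": 125, "pref3": 2000, "pref4": 125}


def profit(satisfied, broken):
    """Cost removed by satisfied copies/preferences minus cost added by
    the ones left unsatisfied.  Hypothetical helper for illustration."""
    return (sum(COSTS[name] for name in satisfied)
            - sum(COSTS[name] for name in broken))


# Current assignment: pref3 and pref4 satisfied, cp11 not.
current = profit(["pref3", "pref4"], ["cp11"])

# Starting with r173: pref3 and cp11 satisfied, pref4 (assumed) not.
alternative = profit(["pref3", "cp11"], ["pref4"])
```

Both orders give the same profit because cp11 and pref4 carry the same cost of 125.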

Choice of heuristics is very time consuming.  I spent a lot of time
trying and benchmarking numerous ones.  I clearly remember that the
introduction of pseudo threads for the colorable bucket gave a visible
performance improvement.  Currently we assign pseudos from the thread with
the biggest frequency first (r173 and r134) and, within the same thread, the
pseudo (r134) with the biggest frequency first.  I think it is logical.

Also, it is always possible to find a test (not this case) where
heuristics give some undesirable results.  RA is an NP-complete task even
in the simplest formulation.  We cannot get the optimal solution in
reasonable time.

Still I am open to change any heuristic if somebody can show that it
improves performance for some credible benchmark (I prefer SPEC2007)
on major GCC targets.

[Bug rtl-optimization/51041] register allocation of SSE register in loop with across eh edges

2023-07-07 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51041

--- Comment #4 from Vladimir Makarov  ---
I believe it is the same problem as PR110215 which was solved recently by
checking whether pseudo values are used in the exception handler and the
handler does not return control flow back to the function code.

So I guess this problem was solved too.

[Bug rtl-optimization/110215] RA fails to allocate register when loop invariant lives across calls and eh

2023-06-14 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110215

--- Comment #4 from Vladimir Makarov  ---
(In reply to Richard Biener from comment #3)
> 
> 
> We don't have any pass after reload that would perform loop invariant motion,
> I'm not sure how this situation is handled in general in RA - is a post-RA
> pass optimizing the spill/reload placement "globally" usually done?

LRA does not do placement of reload insns.  Global RA is supposed to do this
when it forms regions for the allocation.

I've been working on this issue.  I hope the fix will be ready this week.

[Bug target/109541] [12/13/14 regression] ICE in extract_constrain_insn on when building rhash-1.4.3

2023-06-05 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109541

--- Comment #16 from Vladimir Makarov  ---
Sam, thank you for your help.  I've reproduced the problem on your machine.

The fix most probably will be ready this week.

[Bug target/108703] insn does not satisfy its constraints: movhi_insn at -O1

2023-05-31 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108703

Vladimir Makarov  changed:

   What|Removed |Added

 CC||vmakarov at gcc dot gnu.org

--- Comment #3 from Vladimir Makarov  ---
Here is my analysis of the problem.

Before IRA we already have:

(insn 10 7 11 2 (set (reg:HI 33 %f1)
(reg:HI 35 %f3)) "/home/vmakarov/testcase.c":8:3 114 {*movhi_insn}
 (expr_list:REG_EQUAL (const_int 13107 [0x])
(nil)))

LRA considers the insn correct and does not check its constraints, as it
is a simple move and its cost is 2.  This has been the standard convention
for ignoring constraints since the very early versions of the reload pass.
And as I remember, it is described somewhere in the GCC documentation.

I think we should avoid generating such an insn from the start, because
ignoring the reload convention will result in many unexpected
consequences, of which an LRA slowdown would probably be a minor one.

[Bug target/109541] [12/13/14 regression] ICE in extract_constrain_insn on when building rhash-1.4.3

2023-05-31 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109541

--- Comment #9 from Vladimir Makarov  ---
(In reply to Eric Botcazou from comment #7)
> The problem is that LRA assigns a floating-point register to the PIC
> pseudo-register (pic_offset_table_rtx) and the SPARC back-end is not
> prepared for it.
> 
> Vladimir, would it be feasible to prevent this from happening?

Sorry, I can not reproduce it on gcc-11, gcc-12, and master using -O1
-mcpu=niagara4 -fpic -c a-sha512.i.

Fortunately, I can reproduce
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108703.  So I'll start on PR108703
first.

[Bug target/109541] [12/13/14 regression] ICE in extract_constrain_insn on when building rhash-1.4.3

2023-05-12 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109541

--- Comment #8 from Vladimir Makarov  ---
(In reply to Eric Botcazou from comment #7)
> The problem is that LRA assigns a floating-point register to the PIC
> pseudo-register (pic_offset_table_rtx) and the SPARC back-end is not
> prepared for it.
> 
> Vladimir, would it be feasible to prevent this from happening?

Sure.  I'll work on this after my vacation (in one week).

[Bug rtl-optimization/90706] [10/11/12/13 Regression] Useless code generated for stack / register operations on AVR

2023-03-31 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90706

--- Comment #21 from Vladimir Makarov  ---
(In reply to CVS Commits from comment #20)
> The releases/gcc-12 branch has been updated by Vladimir Makarov
> :
> 
> https://gcc.gnu.org/g:88792f04e5c63025506244b9ac7186a3cc10c25a
> 
> 

The trunk with the patch behaved well for a few weeks.  So I backported it to
the gcc-12 branch.  The gcc-12 branch with the patch was successfully
bootstrapped and tested on x86-64.

[Bug target/109137] [12 regression] Compiling ffmpeg with -m32 on x86_64-pc-linux-gnu hangs on libavcodec/h264_cabac.c since r12-9086-g489c81db7d4f75

2023-03-24 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109137

--- Comment #20 from Vladimir Makarov  ---
(In reply to CVS Commits from comment #19)
> The master branch has been updated by Jakub Jelinek :
> 
> https://gcc.gnu.org/g:0d9e52675c009139a14182d92ddb446ba2feabce
> 
> commit r13-6846-g0d9e52675c009139a14182d92ddb446ba2feabce
> Author: Jakub Jelinek 
> Date:   Fri Mar 24 09:42:18 2023 +0100
> 
> testsuite: Fix up gcc.target/i386/pr109137.c testcase [PR109137]
> 
> The testcase has a couple of small problems:
> 1) had -m32 in dg-options, that should never be done, instead the test
>should be guarded on ia32
> 2) adds -fPIC unconditionally (that should be guarded on fpic effective
>target)
> 3) using #include  for a RA test seems unnecessary,
> __builtin_memset
>handles it without the header

Thank you for the test correction, Jakub.

[Bug target/109137] [12/13 regression] Compiling ffmpeg with -m32 on x86_64-pc-linux-gnu hangs on libavcodec/h264_cabac.c since r12-9086-g489c81db7d4f75

2023-03-21 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109137

--- Comment #15 from Vladimir Makarov  ---
I've reproduced the hang, but only for the particular commit.  I also reproduced
an internal compiler error on the current master.

I'll try to fix both problems this week.

[Bug rtl-optimization/109179] [13 Regression] addkf3-sw.c:51:1: internal compiler error: RTL check: expected elt 3 type 'e' or 'u', have '0'

2023-03-17 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109179

--- Comment #21 from Vladimir Makarov  ---
(In reply to Jakub Jelinek from comment #20)
> That LGTM, but Vlad is the maintainer here...

It looks ok to me too.

[Bug rtl-optimization/109179] [13 Regression] addkf3-sw.c:51:1: internal compiler error: RTL check: expected elt 3 type 'e' or 'u', have '0'

2023-03-17 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109179

--- Comment #14 from Vladimir Makarov  ---
(In reply to Peter Bergner from comment #13)
> (In reply to Peter Bergner from comment #12)
> > I'll try moving the test up earlier and testing with that.
> 
> So this fixes the ICEs on the two test cases above.  I'll try a full
> bootstrap with it.
> 
> --- a/gcc/lra-constraints.cc
> +++ b/gcc/lra-constraints.cc
> @@ -5014,6 +5014,10 @@ combine_reload_insn (rtx_insn *from, rtx_insn *to)
>enum reg_class to_class, from_class;
>int n, nop;
>signed char changed_nops[MAX_RECOG_OPERANDS + 1];
> +
> +  if (!NONDEBUG_INSN_P (to))
> +return false;
> +
>lra_insn_recog_data_t id = lra_get_insn_recog_data (to);
>struct lra_static_insn_data *static_id = id->insn_static_data;

Peter, sorry for the trouble and thank you for working on it.  The patch is ok
with me.  Could you commit the patch if the bootstrap is ok?

[Bug rtl-optimization/109179] [13 Regression] addkf3-sw.c:51:1: internal compiler error: RTL check: expected elt 3 type 'e' or 'u', have '0'

2023-03-17 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109179

--- Comment #9 from Vladimir Makarov  ---
(In reply to Jakub Jelinek from comment #7)
> So perhaps:
> --- gcc/lra-constraints.cc.jj 2023-03-17 16:09:09.162136438 +0100
> +++ gcc/lra-constraints.cc2023-03-17 21:37:04.799285670 +0100
> @@ -5020,7 +5020,9 @@ combine_reload_insn (rtx_insn *from, rtx
>/* Check conditions for second memory reload and original insn:  */
>if ((targetm.secondary_memory_needed
> == hook_bool_mode_reg_class_t_reg_class_t_false)
> -  || NEXT_INSN (from) != to || CALL_P (to)
> +  || NEXT_INSN (from) != to
> +  || !NONDEBUG_INSN_P (to)
> +  || CALL_P (to)
>|| id->used_insn_alternative == LRA_UNKNOWN_ALT
>|| (set = single_set (from)) == NULL_RTX)
>  return false;
> ?

Yes, that is what I am trying to do.  The only question for me is why LRA
works on notes there.  LRA pushes only real insns to the work stack.

[Bug rtl-optimization/109179] addkf3-sw.c:51:1: internal compiler error: RTL check: expected elt 3 type 'e' or 'u', have '0'

2023-03-17 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109179

--- Comment #6 from Vladimir Makarov  ---
Peter, thank you for reporting.  I'll try to fix it today or revert it.

[Bug rtl-optimization/109052] Unnecessary reload with -mfpmath=both

2023-03-17 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109052

--- Comment #6 from Vladimir Makarov  ---
(In reply to Uroš Bizjak from comment #5)
> (In reply to Vladimir Makarov from comment #4)
> 
> > So I think the current patch is probably an adequate solution.
> 
> Perhaps the compiler should also try to swap input operands to fit the
> combined insn when commutative operands are allowed. This would solve the
> testcase from Comment #2:
> 

Yes.  I agree.  The base code can be improved further.
Another improvement could be combining secondary memory reload for output.

I'd like to watch what the effect of the current patch would be first.
Although I tested the patch on many targets, as usual for LRA the patch might
cause some trouble on some targets.  But I hope nothing bad will happen.

[Bug rtl-optimization/109052] Unnecessary reload with -mfpmath=both

2023-03-17 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109052

--- Comment #4 from Vladimir Makarov  ---
The complete solution would be running the combine pass also after LRA.  I am
not sure how frequently the 2nd pass would improve the code.  Also it would
probably create some problems whose fix would require another LRA pass.  The
most general solution would be an approach of combined optimizations
(integrated insn scheduling, RA, and code selection), but in practice it makes
the integrated optimization too complicated.

A less complicated solution could be combining secondary memory reload insns in
the postreload pass, but implementing this in LRA is better because we increase
the possibility of assigning hard regs to other pseudos, as we don't need to
allocate a hard register to a pseudo which goes away.

So I think the current patch is probably an adequate solution.

[Bug target/108141] [13 Regression] gcc.target/i386/pr64110.c FAIL since r13-4727 on ia32

2023-03-03 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108141

--- Comment #7 from Vladimir Makarov  ---
(In reply to Jakub Jelinek from comment #6)
> The change has been reverted, so this is no longer a regression.

Just for the record: the patch I reverted resulted in a wrong calculation of
pressure classes (there was a single pressure class, ALL_REGS).  This affected
the register pressure calculation and, as a consequence, IRA used one region
only.  W/o the patch IRA uses regional register allocation for the loops in the
test.

I pushed another patch for PR90706.  I hope it will not create such problems as
the previous patch.

[Bug rtl-optimization/108999] Maybe LRA produce inaccurate hardware register occupancy information for subreg operand

2023-03-03 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108999

Vladimir Makarov  changed:

   What|Removed |Added

 CC||vmakarov at gcc dot gnu.org

--- Comment #1 from Vladimir Makarov  ---
Thank you for filing this PR.

I am going to fix this next week.

[Bug target/108145] [13 regression] ICE in from_reg_br_prob_base, at profile-count.h:259

2023-02-23 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108145

--- Comment #6 from Vladimir Makarov  ---
FYI, I think my patch did not cause this problem.

I've just checked fresh trunk (w/o my patch) and the compilation still fails.

So the PR probably should be still open.

[Bug rtl-optimization/108774] [13 Regression] ICE: in get_equiv, at lra-constraints.cc:534 with -Os -ftrapv -mcmodel=large

2023-02-13 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108774

--- Comment #1 from Vladimir Makarov  ---
Thank you for reporting this.  I'll try to fix it as soon as possible, today or
tomorrow.

[Bug middle-end/108754] [13 Regression] multiple testsuite errors with r13-5761-g10827a92f1a8c3

2023-02-10 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108754

--- Comment #9 from Vladimir Makarov  ---
(In reply to Hans-Peter Nilsson from comment #8)
> My test-run with the suggested change on top of r13-5761-g10827a92f1a8c3
> came out clean (all regressions resolved, no new ones added) so I'll close
> this issue.  Thanks for promptly taking care of this!

Thank you for your help.  And sorry for the inconvenience caused by my patch.
It is hard to make changes in RA, as they might affect different targets in
some unexpected way.

[Bug middle-end/108754] [13 Regression] multiple testsuite errors with r13-5761-g10827a92f1a8c3

2023-02-10 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108754

--- Comment #4 from Vladimir Makarov  ---
(In reply to Hans-Peter Nilsson from comment #3)
> (In reply to Vladimir Makarov from comment #1)
> > I think the problem is that cris uses the old reload pass.  Could you check
> > the following patch:
> 
> Will do, thanks!

OK.  I'll submit the patch then.

[Bug middle-end/108754] [13 Regression] multiple testsuite errors with r13-5761-g10827a92f1a8c3

2023-02-10 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108754

--- Comment #1 from Vladimir Makarov  ---
I think the problem is that cris uses the old reload pass.  Could you check the
following patch:

diff --git a/gcc/ira.cc b/gcc/ira.cc
index d0b6ea062e8..9f9af808f63 100644
--- a/gcc/ira.cc
+++ b/gcc/ira.cc
@@ -3773,7 +3773,7 @@ update_equiv_regs (void)
{
  note = set_unique_reg_note (insn, REG_EQUIV,
replacement);
}
- else
+ else if (ira_use_lra_p)
{
  /* We still can use this equivalence for caller save
 optimization in LRA.  Mark this.  */

[Bug tree-optimization/108500] [11/12 Regression] -O -finline-small-functions results in "internal compiler error: Segmentation fault" on a very large program (700k function calls)

2023-02-10 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108500

--- Comment #20 from Vladimir Makarov  ---
(In reply to Richard Biener from comment #14)
> Thanks for the new testcase.  With -O0 (and a --enable-checking=release
> built compiler) this builds in ~11 minutes (on a Ryzen 9 7900X) with
> 
>  integrated RA              :  38.96 (  6%)   1.94 ( 20%)  42.00 (  6%)  3392M ( 23%)
>  LRA non-specific           :  18.93 (  3%)   1.24 ( 13%)  23.78 (  4%)   450M (  3%)
>  LRA virtuals elimination   :   5.67 (  1%)   0.05 (  1%)   5.75 (  1%)   457M (  3%)
>  LRA reload inheritance     : 318.25 ( 49%)   0.24 (  2%) 318.51 ( 48%)     0  (  0%)
>  LRA create live ranges     : 199.24 ( 31%)   0.12 (  1%) 199.38 ( 30%)   228M (  2%)
> 645.67user 10.29system 11:04.42elapsed 98%CPU (0avgtext+0avgdata
> 30577844maxresident)k
> 3936200inputs+1091808outputs (122053major+10664929minor)pagefaults 0swaps
>

I've tried test-1M.i with -O0 for clang-14.  It took about 12 hours on an
E5-2697 v3 vs about 30 min for GCC.  Most of clang's time (99%) is spent in the
"fast register allocator":

  Total Execution Time: 42103.9395 seconds (42243.9819 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
  41533.7657 ( 99.5%)  269.5347 ( 78.6%)  41803.3005 ( 99.3%)  41942.4177 ( 99.3%)  Fast Register Allocator
  139.1669 (  0.3%)  16.4785 (  4.8%)  155.6454 (  0.4%)  156.3196 (  0.4%)  X86 DAG->DAG Instruction Selection

I've tried the same for -O1.  Again GCC took about 30 min, and I stopped clang
(which uses a different RA algorithm there) after 120 hours.

So the situation with RA is not so bad for GCC.  But in any case I'll try to
improve the speed for this case.
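
The clang figures quoted above can be cross-checked with a few lines of
arithmetic.  This is only a sanity check of the numbers in the report; the
variable names are made up, and the 30 minutes for GCC is the rounded figure
from the comment, not a measured value.

```python
# Numbers copied from the clang -ftime-report excerpt above.
total_s = 42103.9395    # total execution time (user + system), in seconds
fast_ra_s = 41803.3005  # user + system time of the fast register allocator
gcc_s = 30 * 60         # "about 30 min" for GCC, taken at face value

hours = total_s / 3600         # total clang time in hours
share = fast_ra_s / total_s    # fraction spent in the register allocator
ratio = total_s / gcc_s        # how many times longer than GCC

print(f"{hours:.1f} h, {share:.1%} in RA, about {ratio:.0f}x the GCC time")
```

So "about 12 hours" and "99%" are consistent with the report, and clang at -O0
is a bit more than 20 times slower than GCC on this test.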

> so register allocation taking all of the time.  There's maybe the possibility
> to gate some of its features on the # of BBs or insns (or whatever the actual
> "bad" thing is - I didn't look closer yet).
> 
> It also seems to use 30GB of peak memory at -O0 ...
> 

I see only 3GB.  Improving this is a hard task.  IRA at -O0 uses a very simple
algorithm that needs very few resources.  We could use an even simpler method
(assigning memory to all pseudos), but I think it is not worth doing, as the
generated code would be much bigger and probably 1.5-2 times slower.

[Bug rtl-optimization/103541] unnecessary spills around const functions calls

2023-02-03 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103541

Vladimir Makarov  changed:

   What|Removed |Added

 CC||vmakarov at gcc dot gnu.org

--- Comment #4 from Vladimir Makarov  ---
Honza, thank you for reporting this.  Fixing just the following code will not
solve the problem, as LRA uses only equiv expressions valid for the whole
function.

>   ret = valid_combine;
>   if (!MEM_READONLY_P (memref)
>   && !RTL_CONST_OR_PURE_CALL_P (insn))
> return valid_none;
> 

By the way, the old reload pass still works on the test and produces the same
code as LRA currently does, also reserving a stack slot and using it around the
call instead of reloading from a.

I've been working on this problem and I hope the fix will be ready next
week.

[Bug tree-optimization/108552] Linux i386 kernel 5.14 memory corruption for pre_compound_page() when gcov is enabled

2023-01-27 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108552

--- Comment #35 from Vladimir Makarov  ---
(In reply to Jakub Jelinek from comment #34)
> Seems right now DECL_NONALIASED is only used on these coverage vars and on
> Fortran caf tokens, so perhaps a quick workaround would be on the LRA side
> never reread stuff from MEMs with VAR_P && DECL_NONALIASED MEM_EXPRs.  CCing
> Vlad on that.

The following patch can do this:

diff --git a/gcc/lra-constraints.cc b/gcc/lra-constraints.cc
index 7bffbc07ee2..d80a6a9f41d 100644   
--- a/gcc/lra-constraints.cc
+++ b/gcc/lra-constraints.cc
@@ -515,6 +515,7 @@ get_equiv (rtx x)   
 {  
   int regno;   
   rtx res; 
+  tree expr;   

   if (! REG_P (x) || (regno = REGNO (x)) < FIRST_PSEUDO_REGISTER   
   || ! ira_reg_equiv[regno].defined_p  
@@ -525,6 +526,10 @@ get_equiv (rtx x)  
 {  
   if (targetm.cannot_substitute_mem_equiv_p (res)) 
return x;   
+  if ((expr = MEM_EXPR (res)) != NULL  
+ && (expr = get_base_address (expr)) != NULL   
+ && VAR_P (expr) && DECL_NONALIASED (expr))
+   return x;   
   return res;  
 }  
   if ((res = ira_reg_equiv[regno].constant) != NULL_RTX)
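
The decision added by the patch above can be modeled roughly as follows.  This
is a toy sketch, not LRA code: `Decl`, `Mem` and `choose_operand` are
hypothetical stand-ins, and `base_decl` plays the role of
get_base_address (MEM_EXPR (res)).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Decl:
    is_var: bool       # stands in for VAR_P (expr)
    nonaliased: bool   # stands in for DECL_NONALIASED (expr)


@dataclass
class Mem:
    # Stand-in for get_base_address (MEM_EXPR (res)); None if unknown.
    base_decl: Optional[Decl]


def choose_operand(pseudo, equiv_mem):
    """Keep the pseudo when its memory equivalence is a non-aliased variable.

    Otherwise return the memory equivalence, as before the patch.
    """
    decl = equiv_mem.base_decl
    if decl is not None and decl.is_var and decl.nonaliased:
        return pseudo
    return equiv_mem


counter = Mem(Decl(is_var=True, nonaliased=True))   # e.g. a coverage counter
ordinary = Mem(Decl(is_var=True, nonaliased=False))
print(choose_operand("r100", counter), choose_operand("r100", ordinary))
```

In this model a value is never reread from memory that belongs to a
non-aliased variable, which is the workaround Jakub suggested.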

[Bug rtl-optimization/108388] LRA generates RTL that violates constraints

2023-01-20 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108388

--- Comment #1 from Vladimir Makarov  ---
Thank you for reporting this.  I've been working on this PR.  I believe the PR
reveals a problem that is not specific to PDP11.  I guess the same can happen
for some other targets.

I hope the patch will be ready next week, as it requires good testing on
several major targets.  Unfortunately, practically any change in LRA might have
an unexpected effect on other targets.

[Bug rtl-optimization/90706] [10/11/12/13 Regression] Useless code generated for stack / register operations on AVR

2022-12-16 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90706

--- Comment #17 from Vladimir Makarov  ---
I've reverted my patch as it resulted in two new PRs.  I'll do more work on
this PR and I'll start this job in Jan.

[Bug rtl-optimization/90706] [10/11/12/13 Regression] Useless code generated for stack / register operations on AVR

2022-12-13 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90706

Vladimir Makarov  changed:

   What|Removed |Added

 CC||vmakarov at gcc dot gnu.org

--- Comment #14 from Vladimir Makarov  ---
What I see is that the input to RA has changed significantly since gcc-8 (see
insns marked by !).  A lot of subregs are generated now and there is no
promotion of (argument) hard regs (insns 44-47) because of
https://gcc.gnu.org/legacy-ml/gcc-patches/2018-10/msg01356.html.


    1: NOTE_INSN_DELETED                      1: NOTE_INSN_DELETED
    4: NOTE_INSN_BASIC_BLOCK 2                4: NOTE_INSN_BASIC_BLOCK 2
    2: r44:SF=r22:SF                         44: r56:QI=r22:QI
        REG_DEAD r22:SF                           REG_DEAD r22:QI
    3: NOTE_INSN_FUNCTION_BEG                45: r57:QI=r23:QI
    6: r45:QI=0x1                                 REG_DEAD r23:QI
        REG_EQUAL 0x1                        46: r58:QI=r24:QI
    7: r18:SF=0.0                                 REG_DEAD r24:QI
!   8: r22:SF=r44:SF                         47: r59:QI=r25:QI
        REG_DEAD r44:SF                           REG_DEAD r25:QI
    9: r24:QI=call [`__gtsf2'] argc:0        48: r52:QI=r56:QI
        REG_DEAD r25:QI                           REG_DEAD r56:QI
        REG_DEAD r23:QI                      49: r53:QI=r57:QI
        REG_DEAD r22:QI                           REG_DEAD r57:QI
        REG_DEAD r18:SF                      50: r54:QI=r58:QI
        REG_CALL_DECL `__gtsf2'                   REG_DEAD r58:QI
        REG_EH_REGION 0x8000                 51: r55:QI=r59:QI
   10: NOTE_INSN_DELETED                          REG_DEAD r59:QI
   11: cc0=cmp(r24:QI,0)                      3: NOTE_INSN_FUNCTION_BEG
        REG_DEAD r24:QI                       6: r46:QI=0x1
   12: pc={(cc0>0)?L14:pc}                        REG_EQUAL 0x1
        REG_BR_PROB 633507684                 7: r18:SF=0.0
   22: NOTE_INSN_BASIC_BLOCK 3            !  52: clobber r60:SI
   13: r45:QI=0                           !  53: r60:SI#0=r52:QI
        REG_EQUAL 0                               REG_DEAD r52:QI
   14: L14:                               !  54: r60:SI#1=r53:QI
   23: NOTE_INSN_BASIC_BLOCK 4                    REG_DEAD r53:QI
   19: r24:QI=r45:QI                      !  55: r60:SI#2=r54:QI
        REG_DEAD r45:QI                           REG_DEAD r54:QI
   20: use r24:QI                         !  56: r60:SI#3=r55:QI
                                                  REG_DEAD r55:QI
                                          !  57: r22:SF=r60:SI#0
                                                  REG_DEAD r60:SI
                                              9: r24:QI=call [`__gtsf2'] argc:0
                                                  REG_DEAD r25:QI
                                                  REG_DEAD r23:QI
                                                  REG_DEAD r22:QI
                                                  REG_DEAD r18:SF
                                                  REG_CALL_DECL `__gtsf2'
                                                  REG_EH_REGION 0x8000
                                             34: r50:QI=r24:QI
                                                  REG_DEAD r24:QI
                                             10: NOTE_INSN_DELETED
                                             11: pc={(r50:QI>0)?L13:pc}
                                                  REG_DEAD r50:QI
                                                  REG_BR_PROB 633507684
                                             21: NOTE_INSN_BASIC_BLOCK 3
                                             12: r46:QI=0
                                                  REG_EQUAL 0
                                             13: L13:
                                             22: NOTE_INSN_BASIC_BLOCK 4
                                             18: r24:QI=r46:QI
                                                  REG_DEAD r46:QI
                                             19: use r24:QI

Currently, GCC generates the following AVR code:

check:
push r28
push r29
rcall .
rcall .
push __tmp_reg__
in r28,__SP_L__
in r29,__SP_H__
/* prologue: function */
/* frame size = 5 */
/* stack size = 7 */
.L__stack_usage = 7
ldi r18,lo8(1)
std Y+5,r18
ldi r18,0
ldi r19,0
ldi r20,0
ldi r21,0
!   std Y+1,r22
!   std Y+2,r23
!   std Y+3,r24
!   std Y+4,r25
!   ldd r

[Bug target/106462] LRA on mips64el: unable to reload (subreg:SI (reg:DI)) constrained by "f"

2022-11-18 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106462

--- Comment #2 from Vladimir Makarov  ---
I built mips64el-linux-gnuabi64 but using -mabi=64 -msingle-float for it gives

cc1: error: unsupported combination: -mgp64 -mno-odd-spreg

Did I miss something?

[Bug rtl-optimization/104637] [10/11 Regression] ICE: maximum number of LRA assignment passes is achieved (30) with -Og -fno-forward-propagate -mavx since r9-5221-gd8fcab689435a29d

2022-06-14 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104637

--- Comment #14 from Vladimir Makarov  ---
I've just ported the two patches to gcc-10 and gcc-11 release branches.

gcc-10 required additional work besides just cherry-picking.

The patches were successfully bootstrapped and tested on x86-64.

[Bug target/105136] [11/12 regression] Missed optimization regression with 32-bit adds and shifts

2022-04-20 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105136

--- Comment #4 from Vladimir Makarov  ---
I am just stating the obvious here: RA is an NP-complete task and there is
no optimal solution for all tests.  For GCC it is even more complicated, as RA
solves code selection tasks too.  Basically we have for this test

p91=di
p92=si
...
p89=p92+p87 (dead p92)
p97=p91>>const (dead p91)
p83=flags?p87:p89 (dead p87, p89)
ax=p83

RA creates the following relations (to propagate assignment costs) for pseudos

p83(ax preferred)---p87---p91(di preferred)
\
 \--p89---p92(si preferred)

Only assigning ax to p89 can create the desired code.  The relation costs of
p87--p91 and p89--p92, or of p83--p87 and p83--p89, are the same even if we use
--param ira-consider-dup-in-all-alts=1.
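
The symmetry described above can be illustrated with a toy propagation over
the relation graph.  This is not how IRA computes costs: the `propagate`
helper, the base weight of 100, and the halving per hop are all assumptions
chosen only to show that p87 and p89 receive the same pull toward ax.

```python
from collections import deque

# Relation graph and preferences copied from the diagram above.
EDGES = [("p83", "p87"), ("p87", "p91"), ("p83", "p89"), ("p89", "p92")]
PREFERRED = {"p83": "ax", "p91": "di", "p92": "si"}


def propagate(edges, preferred, base=100):
    """Spread each preference over the graph, halving it on every hop."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    pull = {node: {} for node in adj}
    for source, reg in preferred.items():
        # Breadth-first search gives the hop distance from the source pseudo.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            pull[node][reg] = base // (2 ** dist[node])
            for neighbour in adj[node]:
                if neighbour not in dist:
                    dist[neighbour] = dist[node] + 1
                    queue.append(neighbour)
    return pull


pull = propagate(EDGES, PREFERRED)
print(pull["p87"], pull["p89"])
```

In this model p87 is pulled equally toward ax and di, and p89 equally toward ax
and si, so nothing favors giving ax to p89 rather than to p87.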

To get a guaranteed right solution we need some greedy algorithm which would
take a lot of time to run and would check results not only at the end of IRA
but also at the end of LRA.

I can revert meaningful changes of the patch which resulted in this
degradation.  But as I can see this creates 3 new test failures for tests
avx512fp16-conjugation-1.c and avx512fp16vl-conjugation-1.c.  Also I can not
guarantee that such change will not result in more serious benchmark (e.g.
SPEC) degradation.

But in any case I can try to do this, although I am not sure that it is worth
doing at this stage of the gcc-12 release work.

Richard and Jakub, what are your thoughts about reverting my patch in question?

[Bug middle-end/105032] Compiling inline ASM x86 causing GCC stuck in an endless loop with 100% CPU usage

2022-03-30 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105032

--- Comment #12 from Vladimir Makarov  ---
The gcc-11 branch needs a slightly different patch.  I'll commit a modified
patch to the gcc-11 branch on Friday.

[Bug middle-end/105032] Compiling inline ASM x86 causing GCC stuck in an endless loop with 100% CPU usage

2022-03-30 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105032

--- Comment #10 from Vladimir Makarov  ---
I've reproduced the bug also on the trunk.  The loop in question assumes a
specific order for reload insns.  In this case the order of insns involving the
reload pseudo is violated because the pseudo is also used for inheritance.

We can change the loop condition to guarantee that it finishes independently of
the reload insn order.  It might result in a failure of hard reg live range
splitting for the pseudo.  Permitting hard reg splitting for a reload pseudo
involved in inheritance is questionable with respect to both correct LRA work
and generated code efficiency.  So it makes no sense to me to do this.

The patch will be pushed to trunk right after finishing testing.

[Bug middle-end/105032] Compiling inline ASM x86 causing GCC stuck in an endless loop with 100% CPU usage

2022-03-29 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105032

--- Comment #9 from Vladimir Makarov  ---
Cycling is the worst thing that can happen to a compiler (even a crash is
better).  This is the highest priority PR for me right now.  I cannot say why
the cycle does not finish.  It should, as it works only on reload pseudos.  I'll
investigate it more.

In any case I hope to fix it this week.  Sorry for the inconvenience.

[Bug rtl-optimization/104961] [9/10/11/12 Regression] compilation never (?) finishes at -Og

2022-03-17 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104961

--- Comment #2 from Vladimir Makarov  ---
I've reproduced the bug.  The mentioned patch is not the cause but a trigger.
The origin of the problem is actually the removal of hard reg propagation before
RA, which happened about a year ago.

I hope the fix will be ready on Friday or Monday.

[Bug rtl-optimization/104637] [9/10/11/12 Regression] ICE: maximum number of LRA assignment passes is achieved (30) with -Og -fno-forward-propagate -mavx since r9-5221-gd8fcab689435a29d

2022-03-02 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104637

--- Comment #6 from Vladimir Makarov  ---
(In reply to CVS Commits from comment #5)
> The master branch has been updated by Jakub Jelinek :
> 
> https://gcc.gnu.org/g:d7b4c8feee11ea04b83f9996654c96b130588570
> 
> commit r12-7449-gd7b4c8feee11ea04b83f9996654c96b130588570
> Author: Jakub Jelinek 
> Date:   Wed Mar 2 11:04:35 2022 +0100
> 
> testsuite: Fix up pr104637 testcase [PR104637]
> 
> This testcase FAILs everywhere for 3 reasons:
> 1) the testcase can't work on ia32, where sizeof (long double) == 12
>and as it is not a power of 2, we disallow creating vectors with such
>elements, -mx32 and -m64 are fine
> 2) the testcase emits a lot of -Wdiv-by-zero warnings, I've just added
>-Wno-div-by-zero to dg-options
> 3) my fault, when tweaking the testcase I've missed 33 initializers of
>a 32 element vector which didn't change anything on the ICE, but is
>still reported
> 
> This patch fixes all of it, tested with
> RUNTESTFLAGS='--target_board=unix\{-m32,-m64\} i386.exp=pr104637.c'
> both without the LRA fix where it ICEs and with it where it passes
> everywhere.
> 
> 2022-03-02  Jakub Jelinek  
> 
> PR rtl-optimization/104637
> * gcc.target/i386/pr104637.c: Don't run on ia32.  Add
> -Wno-div-by-zero
> to dg-options.
> (foo): Remove extraneous initializer.

Sorry, I should have been more careful with using the original test.

And thank you for fixing this, Jakub.

[Bug target/104686] [12 Regression] Huge compile-time regression building SPEC 2017 538.imagick_r with -march=skylake

2022-03-01 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104686

--- Comment #19 from Vladimir Makarov  ---
(In reply to Richard Biener from comment #16)
> it doesn't make a difference for this testcase but profiling shows that
> allocnos_conflict_p is quite expensive so it's best to do it after the other
> continue checks like the following.  I also notice that the comment of
> allocnos_conflict_p says
> 
> /* Return TRUE if allocnos A1 and A2 conflicts. Here we are
>interesting only in conflicts of allocnos with intersected allocno
>classes. */
> 
> so doing it after the ira_reg_classes_intersect_p check makes even more
> sense(?)
> 
> diff --git a/gcc/ira-color.cc b/gcc/ira-color.cc
> index 8b6db1bb417..a5fd79484eb 100644
> --- a/gcc/ira-color.cc
> +++ b/gcc/ira-color.cc
> @@ -1572,15 +1572,14 @@ update_conflict_hard_regno_costs (int *costs, enum
> reg_class aclass,
> else
>   gcc_unreachable ();
>  
> +   another_aclass = ALLOCNO_CLASS (another_allocno);
> if (another_allocno == from
> +   || ALLOCNO_ASSIGNED_P (another_allocno)
> +   || ALLOCNO_COLOR_DATA (another_allocno)->may_be_spilled_p
> +   || ! ira_reg_classes_intersect_p[aclass][another_aclass]
> || allocnos_conflict_p (another_allocno, start))
>   continue;
>  
> -   another_aclass = ALLOCNO_CLASS (another_allocno);
> -   if (! ira_reg_classes_intersect_p[aclass][another_aclass]
> -   || ALLOCNO_ASSIGNED_P (another_allocno)
> -   || ALLOCNO_COLOR_DATA (another_allocno)->may_be_spilled_p)
> - continue;
> class_size = ira_class_hard_regs_num[another_aclass];
> ira_allocate_and_copy_costs
>   (&ALLOCNO_UPDATED_CONFLICT_HARD_REG_COSTS (another_allocno),
> 
> 

If allocnos_conflict_p takes significant time, this change definitely makes
sense.  By my estimate it will decrease the number of allocnos_conflict_p calls
by about 4 times (assuming fp and int reg classes and half of the allocnos
already assigned).

In any case, the above change is profitable as allocnos_conflict_p practically
always takes more time than the condition tests moved up.
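
To illustrate the estimate, here is a minimal sketch (not GCC code) of putting
the cheap tests before the expensive one so that short-circuit evaluation skips
the expensive call.  The type toy_allocno and all toy_* functions are
hypothetical stand-ins; with two reg classes and half of the allocnos already
assigned, only one candidate in four reaches the expensive test.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an IRA allocno; not GCC's real type.  */
struct toy_allocno
{
  bool assigned_p;       /* models ALLOCNO_ASSIGNED_P */
  bool may_be_spilled_p; /* models ALLOCNO_COLOR_DATA (a)->may_be_spilled_p */
  int aclass;            /* 0 = int regs, 1 = fp regs */
};

/* Number of calls of the expensive test so far.  */
static int toy_expensive_calls;

/* Stand-in for allocnos_conflict_p; it only counts its calls.  */
static bool
toy_conflict_p (const struct toy_allocno *a, const struct toy_allocno *b)
{
  (void) a;
  (void) b;
  toy_expensive_calls++;
  return false;
}

/* Return true if OTHER should be skipped.  The cheap tests come first,
   so the expensive test runs only when all of them are false.  */
static bool
toy_skip_p (const struct toy_allocno *other,
            const struct toy_allocno *start, int aclass)
{
  return (other->assigned_p
          || other->may_be_spilled_p
          || other->aclass != aclass
          || toy_conflict_p (other, start));
}

/* Filter four candidates (two classes, half of them assigned) and
   return how many of them reached the expensive test.  */
static int
toy_count_expensive_calls (void)
{
  struct toy_allocno start = { false, false, 0 };
  struct toy_allocno others[4] = {
    { false, false, 0 }, { true, false, 0 },
    { false, false, 1 }, { true, false, 1 }
  };

  toy_expensive_calls = 0;
  for (size_t i = 0; i < 4; i++)
    (void) toy_skip_p (&others[i], &start, 0);
  return toy_expensive_calls;
}
```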

> Now, what's more odd is that we sometimes have a nice bitmap representation
> for the conflicts but we always iterate.  So it _seems_ we should be able
> to do sth like
> 
> diff --git a/gcc/ira-color.cc b/gcc/ira-color.cc
> index 8b6db1bb417..682d1ef7562 100644
> --- a/gcc/ira-color.cc
> +++ b/gcc/ira-color.cc
> @@ -1352,9 +1352,23 @@ allocnos_conflict_p (ira_allocno_t a1, ira_allocno_t
> a2)
>  {
>obj = ALLOCNO_OBJECT (a1, word);
>/* Take preferences of conflicting allocnos into account.  */
> -  FOR_EACH_OBJECT_CONFLICT (obj, conflict_obj, oci)
> -   if (OBJECT_ALLOCNO (conflict_obj) == a2)
> - return true;
> +  if  (!OBJECT_CONFLICT_VEC_P (obj))
> +   {
> + for (int w2 = 0; w2 < ALLOCNO_NUM_OBJECTS (a2); w2++)
> +   {
> + ira_object_t obj2 = ALLOCNO_OBJECT (a2, w2);
> + if (OBJECT_CONFLICT_ID (obj2) >= OBJECT_MIN (obj)
> + && OBJECT_CONFLICT_ID (obj2) <= OBJECT_MAX (obj)
> + && TEST_MINMAX_SET_BIT (OBJECT_CONFLICT_BITVEC (obj),
> + OBJECT_CONFLICT_ID (obj2),
> + OBJECT_MIN (obj), OBJECT_MAX
> (obj)))
> +   return true;
> +   }
> +   }
> +  else
> +   FOR_EACH_OBJECT_CONFLICT (obj, conflict_obj, oci)
> + if (OBJECT_ALLOCNO (conflict_obj) == a2)
> +   return true;
>  }
>return false;
>  }  
> 
> which reduces compile-time from 10s to 1s for me ... the above should
> be split out so we can "optimally" use the bit test for
> object vs. allocno when possible.
> 
> Vlad - any thoughts about the above two things?  Shall I try to polish and
> optimize the bit test or would you be willing to pick those two speedups up?

This change also makes sense.  Usually for big functions conflict sets are very
sparse and bit vectors are not used.  But it seems this is not the case for
this PR.
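
For readers not familiar with the representation, a minimal sketch of a
min/max-bounded bit vector membership test follows.  It is only an
illustration of the idea behind the proposed change, not IRA's code: the type
toy_conflicts and the function toy_conflicts_with_p are hypothetical names.

```c
#include <limits.h>
#include <stdbool.h>

typedef unsigned int toy_word;
#define TOY_WORD_BITS ((int) (sizeof (toy_word) * CHAR_BIT))

/* Conflicts of one object as a bit vector.  Bit (ID - MIN) is set when
   the object conflicts with the object whose conflict id is ID.  Only
   ids in the range [MIN, MAX] are representable.  */
struct toy_conflicts
{
  int min, max;
  const toy_word *bits;
};

/* Return true if C records a conflict with ID.  This is a range check
   plus one bit test, instead of iterating over all conflicts.  */
static bool
toy_conflicts_with_p (const struct toy_conflicts *c, int id)
{
  if (id < c->min || id > c->max)
    return false;
  int bit = id - c->min;
  return ((c->bits[bit / TOY_WORD_BITS] >> (bit % TOY_WORD_BITS)) & 1u) != 0;
}
```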

Please polish and optimize the change as you proposed and I will approve the
final version promptly.

Thank you for working on this PR, Richard.

[Bug target/104637] [9/10/11/12 Regression] ICE: maximum number of LRA assignment passes is achieved (30) with -Og -fno-forward-propagate -mavx since r9-5221-gd8fcab689435a29d

2022-02-25 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104637

--- Comment #3 from Vladimir Makarov  ---
(In reply to Jakub Jelinek from comment #2)
> If I change the testcase to following (so that it doesn't rely on
> __builtin_convertvector), it started ICEing with
> r0-122162-gb7aa4e9afcd3da4f09d6f982a663ea2094b1f2cf
> typedef short __attribute__((__vector_size__ (64))) U;
> typedef unsigned long long __attribute__((__vector_size__ (32))) V;
> typedef long double __attribute__((__vector_size__ (64))) F;
> 
> int i;
> U u;
> F f;
> 
> void
> foo (char a, char b, _Complex char c, V v)
> {
>   u = (U) { u[0] / 0, u[1] / 0, u[2] / 0, u[3] / 0, u[4] / 0, u[5] / 0, u[6]
> / 0, u[7] / 0,
>   u[8] / 0, u[0] / 0, u[9] / 0, u[10] / 0, u[11] / 0, u[12] / 0, 
> u[13] /
> 0, u[14] / 0, u[15] / 0,
>   u[16] / 0, u[17] / 0, u[18] / 0, u[19] / 0, u[20] / 0, u[21] / 0, 
> u[22]
> / 0, u[23] / 0,
>   u[24] / 0, u[25] / 0, u[26] / 0, u[27] / 0, u[28] / 0, u[29] / 0, 
> u[30]
> / 0, u[31] / 0 };
>   c += i;
>   f = (F) { v[0], v[1], v[2], v[3] };
>   i = (char) (__imag__ c + i);
> }
> 
> In any case, I don't see anything wrong on the GIMPLE side and it isn't
> clear on reloading which insn it is ICEing.

It is a pitfall of the LRA hard reg split subpass.  It is a small subpass used
as the last resort for LRA when it cannot assign a hard reg to a reload pseudo
in other ways (e.g. by spilling non-reload pseudos).  For simplicity the
subpass works on a one-split basis (as each split changes pseudo live range
info).  To solve the problem the subpass should make as many splits as
possible.  This requires checking for overlapping hard reg splits.

In other words, the subpass should be considerably modified.  I hope to commit
the patch next week.

[Bug rtl-optimization/104400] [12 Regression] v850e lra/reload failure after recent change

2022-02-10 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104400

--- Comment #3 from Vladimir Makarov  ---
(In reply to Jeffrey A. Law from comment #2)
> NP on the timing.  My biggest concern (as always) is whether or not this is
> a generic issue or a bug in the v850 target files.  The former is obviously
> much more important.
> 
> If it starts to look like a target issue, then feel free to punt it to me. 
> While I don't know the v850 fp bits, I have retained a fair amount of
> generic v850 knowledge over the decades :-)

It is a pitfall of my patch for the very unusual v850 insn constraint 'e!r',
where e is the even general regs, a subset of r.

I have a patch to fix this and after testing it I'll commit it today or
tomorrow.

[Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22

2022-02-10 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178

--- Comment #30 from Vladimir Makarov  ---
(In reply to Richard Biener from comment #29)
> (In reply to Vladimir Makarov from comment #28)
> > Could somebody benchmark the following patch on zen2 470.lbm.
> 
> Code generation changes quite a bit, with the patch the offending function
> is 16 bytes larger.  I see no large immediate moves to GPRs anymore but
> there is still a lot of spilling of XMMs to GPRs.  Performance is
> unchanged by the patch:
> 
> 470.lbm   13740    128    107 S     13740    128    107 S
> 470.lbm   13740    128    107 *     13740    128    107 S
> 470.lbm   13740    128    107 S     13740    128    107 *
> 
> 

Thank you very much for testing the patch, Richard.  To me the results mean the
patch is a no-go.

> Without knowing much of the code I wonder if we can check whether the move
> will be to a reg in GENERAL_REGS?  That is, do we know whether there are
> (besides some special constants like zero), immediate moves to the
> destination register class?
>

There is no such info from the target code.  Ideally we need to have the cost
of loading a *particular* immediate value into a register class on the same
cost basis as load/store.  Still, to use this info efficiently, choosing
alternatives should be based on costs and not on the hints and some machine
independent general heuristics (as now).


> That said, given the result on LBM I'd not change this at this point.
> 
> Honza wanted to look at the move pattern to try to mitigate the
> GPR spilling of XMMs.
> 
> I do think that we need to take costs into account at some point and get
> rid of the reload style hand-waving with !?* in the move patterns.

In general I agree with the direction but it will be quite hard to do.  I
know it well from my experience of changing the register class cost calculation
algorithm in IRA (the experimental code can be found on the branch ira-select).
I expect a huge number of test failures and some benchmark performance
degradation for practically any target, and a big involvement of target
maintainers to fix them.  Although it is possible to try to do this for one
target at a time.

[Bug target/104117] [9,10,11,12 Regression] Darwin ppc64 uses invalid non-PIC address to access constants (in PIC code).

2022-02-09 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104117

--- Comment #21 from Vladimir Makarov  ---
(In reply to Iain Sandoe from comment #20)
> (In reply to Iain Sandoe from comment #15)
> > (In reply to Vladimir Makarov from comment #13)
> > > I think there are two code spots whose pitfalls resulted in the PR.
> 
> > > --- a/gcc/config/rs6000/rs6000.c
> > > +++ b/gcc/config/rs6000/rs6000.c
> > > @@ -8202,7 +8202,7 @@ legitimate_lo_sum_address_p (machine_mode mode, rtx 
> > > x,
> > > int strict)
> > >  {
> > >bool large_toc_ok;
> > > 
> > > -  if (DEFAULT_ABI == ABI_V4 && flag_pic)
> > > +  if ((DEFAULT_ABI == ABI_V4 || DEFAULT_ABI == ABI_DARWIN) && 
> > > flag_pic)
> > > return false;
> 
> On testing, this is not sufficient - one ends up with ICEs when we reject a
> valid (UNSPEC-wrapped) address here.  So I think that the slightly more
> elaborate target changes are required - but the LRA change seems fine!
> 
> ... reg-straps on this old h/w take > 1 day .. so some more time will be
> needed for a complete answer.

Iain, you have my approval in advance for committing the LRA changes into
master and the branches when the overall patch is ready.  Thank you for
working on the machine-dependent parts of the patch.

[Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22

2022-02-09 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178

--- Comment #28 from Vladimir Makarov  ---
Could somebody benchmark the following patch on zen2 470.lbm?

diff --git a/gcc/lra-constraints.cc b/gcc/lra-constraints.cc
index 9cee17479ba..76619aca8eb 100644
--- a/gcc/lra-constraints.cc
+++ b/gcc/lra-constraints.cc
@@ -5084,7 +5089,9 @@ lra_constraints (bool first_p)
 (x, lra_get_allocno_class (i)) == NO_REGS))
|| contains_symbol_ref_p (x)))
  ira_reg_equiv[i].defined_p = false;
-   if (contains_reg_p (x, false, true))
+   if (contains_reg_p (x, false, true)
+   || (CONST_DOUBLE_P (x)
+   && maybe_ge (GET_MODE_SIZE (GET_MODE (x)), 8)))
  ira_reg_equiv[i].profitable_p = false;
if (get_equiv (reg) != reg)
  bitmap_ior_into (equiv_insn_bitmap,
&lra_reg_info[i].insn_bitmap);

If it improves the performance, I'll commit this patch.

The expander unconditionally uses the memory pool for double constants.  I
think analogous treatment could be done for equiv double constants in LRA.

As far as I know, only x86_64 permits 64-bit constants as immediates for moving
them into general regs.  As double fp operations are not done in general regs
in most cases, the constants should be moved into fp regs and this is costly as
Jan wrote.  So it makes sense to prohibit using equiv double constant values in
LRA unconditionally.  If in the future we have a target which can move a double
immediate into fp regs, we can introduce some target hooks to deal with equiv
double constants.  But right now I think there is no need for a new hook.

[Bug rtl-optimization/104400] [12 Regression] v850e lra/reload failure after recent change

2022-02-09 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104400

--- Comment #1 from Vladimir Makarov  ---
Thank you for reporting this, Jeff.

I've reproduced the bug.  I hope to fix it this week.

[Bug target/104117] [9,10,11,12 Regression] Darwin ppc64 uses invalid non-PIC address to access constants (in PIC code).

2022-02-04 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104117

--- Comment #13 from Vladimir Makarov  ---
I think there are two code spots whose pitfalls resulted in the PR.

The first one is in rs6000.cc::legitimate_lo_sum_address_p, which permits a
wrong pic low-sum address.

Another one is in lra-constraints.cc::process_address_1, which permits putting
a wrong low-sum address in a reg and using the reg in memory.

The following patch solves the problem:

diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index 5404fb18755..306f67f26c4 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -8202,7 +8202,7 @@ legitimate_lo_sum_address_p (machine_mode mode, rtx x,
int strict)
 {
   bool large_toc_ok;

-  if (DEFAULT_ABI == ABI_V4 && flag_pic)
+  if ((DEFAULT_ABI == ABI_V4 || DEFAULT_ABI == ABI_DARWIN) && flag_pic)
return false;
   /* LRA doesn't use LEGITIMIZE_RELOAD_ADDRESS as it usually calls
 push_reload from reload pass code.  LEGITIMIZE_RELOAD_ADDRESS
diff --git a/gcc/lra-constraints.c b/gcc/lra-constraints.c
index 30d088afbca..998e82be54f 100644
--- a/gcc/lra-constraints.c
+++ b/gcc/lra-constraints.c
@@ -3517,21 +3517,8 @@ process_address_1 (int nop, bool check_only_p,
  *ad.inner = gen_rtx_LO_SUM (Pmode, new_reg, addr);
  if (!valid_address_p (op, &ad, cn))
{
- /* Try to put lo_sum into register.  */
- insn = emit_insn (gen_rtx_SET
-   (new_reg,
-gen_rtx_LO_SUM (Pmode, new_reg,
addr)));
- code = recog_memoized (insn);
- if (code >= 0)
-   {
- *ad.inner = new_reg;
- if (!valid_address_p (op, &ad, cn))
-   {
- *ad.inner = addr;
- code = -1;
-   }
-   }
-
+ *ad.inner = addr;
+ code = -1;
}
}
  if (code < 0)

The patch was successfully tested on x86-64/ppc64 under Linux.

[Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22

2022-01-28 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178

--- Comment #27 from Vladimir Makarov  ---
(In reply to Richard Biener from comment #17)
> So in .reload we have (with unpatched trunk)
> 
>   401: NOTE_INSN_BASIC_BLOCK 6
>   462: ax:DF=[`*.LC0']
>   REG_EQUAL 9.8506899724167309977929107844829559326171875e-1
>   407: xmm2:DF=ax:DF
>   463: ax:DF=[`*.LC0']
>   REG_EQUAL 9.8506899724167309977929107844829559326171875e-1
>   408: xmm4:DF=ax:DF
> 
> why??!  We can load .LC0 into xmm4 directly.  IRA sees
> 
>   401: NOTE_INSN_BASIC_BLOCK 6
>   407: r118:DF=r482:DF
>   408: r119:DF=r482:DF
> 
> now I cannot really decipher IRA or LRA dumps but my guess would be that
> inheritance (causing us to load from LC0) interferes badly with register
> class assignment?
> 
> Changing pseudo 482 in operand 1 of insn 407 on equiv
> 9.8506899724167309977929107844829559326171875e-1
> ...
>   alt=21,overall=9,losers=1,rld_nregs=1
>  Choosing alt 21 in insn 407:  (0) v  (1) r {*movdf_internal}
>   Creating newreg=525, assigning class GENERAL_REGS to r525
>   407: r118:DF=r525:DF
> Inserting insn reload before:
>   462: r525:DF=[`*.LC0']
>   REG_EQUAL 9.8506899724167309977929107844829559326171875e-1
> 
> we should have preferred alt 14 I think (0) v (1) m, but that has
> 
>   alt=14,overall=13,losers=1,rld_nregs=0
> 0 Spill pseudo into memory: reject+=3
> Using memory insn operand 0: reject+=3
> 0 Non input pseudo reload: reject++
> 1 Non-pseudo reload: reject+=2
> 1 Non input pseudo reload: reject++
> alt=15,overall=28,losers=3 -- refuse
> 0 Costly set: reject++
> alt=16: Bad operand -- refuse
> 0 Costly set: reject++
> 1 Costly loser: reject++
> 1 Non-pseudo reload: reject+=2
> 1 Non input pseudo reload: reject++
> alt=17,overall=17,losers=2 -- refuse
> 0 Costly set: reject++
> 1 Spill Non-pseudo into memory: reject+=3
> Using memory insn operand 1: reject+=3
> 1 Non input pseudo reload: reject++
> alt=18,overall=14,losers=1 -- refuse
> 0 Spill pseudo into memory: reject+=3
> Using memory insn operand 0: reject+=3
> 0 Non input pseudo reload: reject++
> 1 Costly loser: reject++
> 1 Non-pseudo reload: reject+=2
> 1 Non input pseudo reload: reject++
> alt=19,overall=29,losers=3 -- refuse
> 0 Non-prefered reload: reject+=600
> 0 Non input pseudo reload: reject++
> alt=20,overall=607,losers=1 -- refuse
> 1 Non-pseudo reload: reject+=2
> 1 Non input pseudo reload: reject++
> 
> I'm not sure I can decipher the reasoning but I don't understand how it
> doesn't seem to anticipate the cost of reloading the GPR in the alternative
> it chooses?
> 
> Vlad?

All these diagnostics are just a description of the voodoo from the old reload
pass.  LRA chooses the alternative the same way as the old reload pass (I doubt
that any other approach would avoid breaking all existing targets).  The old
reload pass simply does not report its decisions in the dump.

The LRA code (lra-constraints.cc::process_alt_operands) choosing the insn
alternatives (like the old reload pass) does not use any memory or register
move costs.  Instead, the alternative is chosen by heuristics and insn
constraint hints (like ? !).  The only case where these costs are used is when
we have reg:=reg and the register move cost for this is 2.  In this case
LRA (reload) does not bother to check the insn constraints.
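
As an illustration of how the numbers in the dump quoted above fit together,
here is a minimal sketch, not the real process_alt_operands.  It assumes that
overall = losers * 6 + reject, a factor which is consistent with every
alt=...,overall=... line in that dump.  The names toy_alt, toy_overall and
toy_best_alt are hypothetical, and the real code applies further checks
(e.g. refusing alternatives) which are not modelled here.

```c
/* Weight of one loser (operand needing a reload).  The value 6 is an
   assumption inferred from the dump, e.g. alt 14: 1 * 6 + 7 == 13.  */
enum { TOY_LOSER_COST_FACTOR = 6 };

/* One insn alternative: its number, the number of operands needing
   reloads, and the sum of the reject increments from the hints.  */
struct toy_alt
{
  int number;
  int losers;
  int reject;
};

/* Return the heuristic overall badness of alternative A.  No memory or
   register move costs are involved.  */
static int
toy_overall (const struct toy_alt *a)
{
  return a->losers * TOY_LOSER_COST_FACTOR + a->reject;
}

/* Return the number of the alternative with the smallest overall value
   among the N alternatives ALTS, or -1 if N is zero.  */
static int
toy_best_alt (const struct toy_alt *alts, int n)
{
  int best = -1;
  int best_overall = 0;

  for (int i = 0; i < n; i++)
    {
      int overall = toy_overall (&alts[i]);
      if (best < 0 || overall < best_overall)
        {
          best = alts[i].number;
          best_overall = overall;
        }
    }
  return best;
}
```

With the reject sums from the dump (7 for alt 14, 3 for alt 21), alt 21 gets
overall 9 against 13 for alt 14, so the heuristics pick the GENERAL_REGS
alternative although the reload through a GPR is more expensive in practice.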

[Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22

2022-01-28 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178

--- Comment #26 from Vladimir Makarov  ---
(In reply to Richard Biener from comment #7)
> make costs in a way that IRA/LRA prefer re-materialization of constants
> from the constant pool over spilling to GPRs (if that's possible at all -
> Vlad?)

LRA rematerialization cannot rematerialize a constant value from the memory
pool.  It can rematerialize only the value of an expression consisting of other
pseudos (currently assigned to hard regs) and constants.

I guess the rematerialization pass can be extended to work for constants from
the constant memory pool.  It is a pretty doable project, as opposed to
rematerialization of any memory, which would require a lot of analysis
including aliasing and complicated calculation of costs and benefits.  Maybe
somebody could pick this project up.

[Bug middle-end/103616] [9/10/11/12 Regression] ICE on ceph with systemtap macro since r8-5608

2022-01-28 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103616

--- Comment #1 from Vladimir Makarov  ---
I cannot reproduce the ICE on this week's GCC.  Probably it was fixed (or
switched off) by some recent RA patch.

As for the second issue (code generation for function foo), I thought for some
time about how it could be fixed.  It seemed that the LRA inheritance sub-pass
could be extended to work on memory too besides regs.  But I came to the
conclusion that it would complicate the already complicated LRA (inheritance
subpass) even more as we would need to add sophisticated analysis (including
aliasing) for memory.

I guess there is a simpler alternative solution.  The problem would disappear
if the double constant were in the asm insn before LRA.  I think some pass
before RA could do this.  It could be driven by a target, for example to
promote double constants for x86-64.

Also the problem might be solved if we had a pseudo<-double insn instead of a
mem<-double insn before LRA: LRA code dealing with equiv could promote the
double into the asm insn (although I am not 100% sure about this but, if it is
not the case, probably the code dealing with equiv could be tweaked to do
this).

So my proposal is to solve the problem somehow outside RA.

[Bug rtl-optimization/104049] [12 Regression] vec_select to subreg lowering causes superfluous moves

2022-01-18 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104049

--- Comment #3 from Vladimir Makarov  ---
(In reply to Richard Biener from comment #2)
> We need to understand the issue at least.

I think that it is not an RA problem.

IRA assigns quite reasonable registers.  LRA just generates 2 reloads for this
test, one for insn *add_lsr_si which has only one alternative and one for insn
andsi3 which needs reload insns for any alternative and LRA in this case
chooses the best one.

I guess the cause of the code generation regression is some recent change
in the combiner or, most probably, in aarch64 machine dependent code directing
the combiner (as Tamar wrote).

It would be nice if somebody bisected and found which commit resulted in the
regression.

As for the double transfer of the value, it could be removed by inheritance in
LRA but that is impossible here as an input reload pseudo got the same hard
register (in the LRA assignment subpass) as one of the insn output pseudos (the
assignment was done in IRA) and the reloaded value is still used in a
subsequent insn.  Unfortunately it can happen as RA cannot make allocation and
code selection optimally in the general case.

Some coordination between the LRA-assignment subpass and the LRA-inheritance
subpass could help to avoid the double transfer but right now I have no idea
how to do this.  It is also dangerous to implement such coordination at this
stage as the LRA-inheritance sub-pass is very complicated.

[Bug target/103676] [10/11/12 Regression] internal compiler error: in extract_constrain_insn, at recog.c:2671

2022-01-18 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103676

--- Comment #23 from Vladimir Makarov  ---
(In reply to Jakub Jelinek from comment #22)
> If we consider such an inline asm invalid, we could error on it, ICE is not
> the right thing.  But what exactly should we error on?  Alternative

I think it is better to fix it in LRA than to describe the semantics.  I am
starting to work on it and will see how the fix goes.  If it is too
complicated, we could try another solution (describing the current
semantics).

In any case, I think it is not worth fixing the same existing problem in the
old reload pass.

> containing multiple register classes for multi-word operands is still
> something used quite commonly in real-world, the problem is when the RA
> assigns it a reg spanning across those.  Or do most backends restrict
> multi-word regs to start at a reg number divisible by the number of words
> they need?

[Bug target/103676] [10/11/12 Regression] internal compiler error: in extract_constrain_insn, at recog.c:2671

2022-01-17 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103676

--- Comment #21 from Vladimir Makarov  ---
(In reply to Jakub Jelinek from comment #19)
> r10-3981-gf6ff841bc8dd87ce364deb217dc6d1ec5dc31de8 still doesn't ICE,
> r10-3984-g22060d0e575e7754eb1355763d22bbe37c3caa13 already ICEs.
> 
> I guess there is a disagreement between LRA and recog on how exactly they
> treat register constraints.
> "=lh" for TARGET_THUMB means LO_REGS or HI_REGS classes for the output, bet
> LRA sees that LO_REGS or HI_REGS is together GENERAL_REGS and picks a
> GENERAL_REGS
> (reg:DI 7 r7 [orig:119 tmp ] [119]).  But that one has one half in LO_REGS
> and another half in HI_REGS and so extract_constrain_insn ->
> constrain_operands
> doesn't consider it as matching.

Interesting case.  To find the required (reload) register class, LRA (like the
old reload pass) makes a union of the register classes in one alternative,
i.e. a class which contains all or part of the registers of those classes (in
this case it is the general reg class).  The problem can be solved w/o fixing
LRA (and the reload pass) by using

 asm volatile(
  "ldrd %Q[r], %R[r], %[p]\n"
  : [r]"=l,h"(tmp)
  : [p]"m,m"(*p64)
  : "memory"
 );

The problem can be solved in LRA by a more complex representation of required
reg classes (reload would still need the same fix too).  I guess it will
complicate LRA and reload code a lot.
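
A minimal sketch of why the union is not equivalent to the two separate
classes for a multi-word value follows.  The register numbering (r0..r7 as low
regs, r8..r14 as high regs) and all toy_* names are assumptions made for the
illustration, not the ARM port's definitions.

```c
#include <stdbool.h>

/* Toy register file: r0..r7 are "low" regs, r8..r14 are "high" regs.  */
enum { TOY_LAST_LO = 7, TOY_FIRST_HI = 8, TOY_LAST_HI = 14 };

static bool
toy_in_lo_p (int regno)
{
  return regno >= 0 && regno <= TOY_LAST_LO;
}

static bool
toy_in_hi_p (int regno)
{
  return regno >= TOY_FIRST_HI && regno <= TOY_LAST_HI;
}

/* Membership in the union of the two classes, i.e. what a single
   "general regs" class describes.  */
static bool
toy_in_union_p (int regno)
{
  return toy_in_lo_p (regno) || toy_in_hi_p (regno);
}

/* Return true if all NREGS consecutive hard regs starting at REGNO
   satisfy IN_CLASS_P.  */
static bool
toy_span_in_class_p (int regno, int nregs, bool (*in_class_p) (int))
{
  for (int i = 0; i < nregs; i++)
    if (!in_class_p (regno + i))
      return false;
  return true;
}
```

A two-word value starting at r7 occupies r7 and r8: it is entirely inside the
union but inside neither class alone, which is the situation recog rejects
for "=lh" and which the "=l,h" form with two alternatives avoids.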

We could also provide a clearer description of the semantics of constraints
currently used by LRA/reload.  In this case we still need to output a more
meaningful error for LRA/reload instead of just an internal compiler error.

[Bug target/103722] [12 Regression] ICE in extract_constrain_insn building glibc for SH4

2021-12-15 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103722

--- Comment #1 from Vladimir Makarov  ---
(In reply to Joseph S. Myers from comment #0)
> Created attachment 52003 [details]
> preprocessed source
> 
> Build the attached code (from glibc) with -O2 for sh4-linux-gnu.  This
> produces an ICE:
> 
> malloc-debug.c: In function '__debug_realloc':
> malloc-debug.c:267:1: error: insn does not satisfy its constraints:
> (insn 955 1863 2 2 (set (reg:SI 76 fr12 [314])
> (reg:SI 146 pr)) 189 {movsi_ie}
>  (nil))
> during RTL pass: postreload
> malloc-debug.c:267:1: internal compiler error: in extract_constrain_insn, at
> recog.c:2670
> 0x5eec04 _fatal_insn(char const*, rtx_def const*, char const*, int, char
> const*)
> /scratch/jmyers/glibc/many12/src/gcc/gcc/rtl-error.c:108
> 0x5eec2a _fatal_insn_not_found(rtx_def const*, char const*, int, char const*)
> /scratch/jmyers/glibc/many12/src/gcc/gcc/rtl-error.c:119
> 0xcab367 extract_constrain_insn(rtx_insn*)
> /scratch/jmyers/glibc/many12/src/gcc/gcc/recog.c:2670
> 0xc71acd reload_cse_simplify_operands
> /scratch/jmyers/glibc/many12/src/gcc/gcc/postreload.c:407
> 0xc732bc reload_cse_simplify
> /scratch/jmyers/glibc/many12/src/gcc/gcc/postreload.c:190
> 0xc732bc reload_cse_regs_1
> /scratch/jmyers/glibc/many12/src/gcc/gcc/postreload.c:238
> 0xc7584b reload_cse_regs
> /scratch/jmyers/glibc/many12/src/gcc/gcc/postreload.c:66
> 0xc7584b execute
> /scratch/jmyers/glibc/many12/src/gcc/gcc/postreload.c:2355
> Please submit a full bug report,
> with preprocessed source if appropriate.
> Please include the complete backtrace with any bug report.
> See  for instructions.
> 
> This was introduced (exposed?) by:
> 
> commit a7acb6dca941db2b1c135107dac3a34a20650d5c
> Author: Vladimir N. Makarov 
> Date:   Mon Dec 13 13:48:12 2021 -0500
> 
> [PR99531] Modify pseudo class cost calculation when processing move
> involving the pseudo and a hard register

I am confirming that it was triggered by my patch.

But it is not an IRA bug.  The old reload pass (used by the SH port) fails to
change the insn although the insn constraints are not satisfied.  The insn in
question is the move

fpreg = prreg

The old reload is misled by the cost of moving prreg to fpreg.  The SH machine
code provides cost 2 for this.  In this case the old reload pass skips checking
the constraints of the move.

The following patch solves the problem:

diff --git a/gcc/config/sh/sh.c b/gcc/config/sh/sh.c
index 0628f059ca2..e7c8e5f84b7 100644
--- a/gcc/config/sh/sh.c
+++ b/gcc/config/sh/sh.c
@@ -10762,6 +10762,12 @@ sh_register_move_cost (machine_mode mode,
   && ! REGCLASS_HAS_GENERAL_REG (dstclass))
 return 2 * ((GET_MODE_SIZE (mode) + 7) / 8U);

+  if (((dstclass == FP_REGS || dstclass == DF_REGS)
+   && (srcclass == PR_REGS))
+  || ((srcclass == FP_REGS || srcclass == DF_REGS)
+ && (dstclass == PR_REGS)))
+return 7;
+
   return 2 * ((GET_MODE_SIZE (mode) + 3) / 4U);
 }

The patch also makes IRA allocate a general reg instead of an fpreg, which is
more costly after applying the patch.
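
The interaction between the cost value and the constraint check can be
sketched as follows.  This is a simplified model with hypothetical names
(toy_class, toy_register_move_cost, toy_move_constraints_checked_p); the real
sh_register_move_cost distinguishes many more cases, which are collapsed here
into a default cost of 2.

```c
#include <stdbool.h>

/* Hypothetical, reduced set of register classes.  */
enum toy_class { TOY_GENERAL_REGS, TOY_FP_REGS, TOY_PR_REGS };

/* Simplified model of the patched cost hook: a move between fp regs and
   PR costs 7 (the value used in the patch).  Every other move is given
   cost 2 here, which is a simplification for the sketch.  */
static int
toy_register_move_cost (enum toy_class src, enum toy_class dst)
{
  bool fp_and_pr = ((src == TOY_FP_REGS && dst == TOY_PR_REGS)
                    || (src == TOY_PR_REGS && dst == TOY_FP_REGS));

  return fp_and_pr ? 7 : 2;
}

/* Model of the rule described above: for a reg:=reg move with cost 2
   the insn constraints are not checked.  */
static bool
toy_move_constraints_checked_p (int cost)
{
  return cost != 2;
}
```

In this model the fpreg = prreg move no longer has cost 2, so its constraints
are checked and the unsatisfied insn is not silently accepted.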

[Bug target/99531] [9/10/11/12 Regression] Performance regression since gcc 9 (argument passing / register allocation)

2021-12-07 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99531

--- Comment #4 from Vladimir Makarov  ---
Thank you for reporting this.  It is true my patch caused this.

I've reproduced the bug on master too.  I will be working on this PR.  I
think a fix will be ready next week at best as the fix will touch cost
calculations and will require a lot of testing on different targets.

[Bug rtl-optimization/103437] gcc/ira-color.c:2813:5: runtime error: signed integer overflow: 15 * 147462000 cannot be represented in type 'int'

2021-11-29 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103437

--- Comment #5 from Vladimir Makarov  ---
Thank you for reporting this.  The problem does not seem that important as it
is only about heuristic costs and might result only in worse code generation
(or maybe in better code -- it is hard to say).

Still it is better not to remove this warning.  I'll look into this.
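
For reference, 15 * 147462000 is 2211930000, which exceeds INT_MAX
(2147483647) for a 32-bit int.  One possible way to keep such a heuristic cost
defined is to multiply in a wider type and saturate.  The sketch below only
illustrates that idea and is not necessarily the fix used in ira-color.c; it
assumes int is narrower than 64 bits, and toy_saturating_mul is a hypothetical
name.

```c
#include <limits.h>
#include <stdint.h>

/* Return A * B clamped to the range of int.  The product is computed in
   64 bits, so there is no signed overflow as long as int is narrower
   than 64 bits.  */
static int
toy_saturating_mul (int a, int b)
{
  int64_t wide = (int64_t) a * (int64_t) b;

  if (wide > INT_MAX)
    return INT_MAX;
  if (wide < INT_MIN)
    return INT_MIN;
  return (int) wide;
}
```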

[Bug rtl-optimization/102842] [10 Regression] ICE in cselib_record_set at -O2 or greater

2021-10-21 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102842

--- Comment #12 from Vladimir Makarov  ---
The patch just hid the bug.  I believe the bug is still present on the trunk
too.

The insn in question is

(insn 26 64 109 3 (parallel [
(set (reg:SI 134 [ _12 ])
(plus:SI (mult:SI (reg:SI 117 [ _8 ])
(reg:SI 128))
(reg:SI 138)))
(set (reg:SI 135 [ _12+4 ])
(plus:SI (truncate:SI (lshiftrt:DI (plus:DI (mult:DI
(zero_extend:DI (reg:SI 117 [ _8 ]))
(zero_extend:DI (reg:SI 128)))
(zero_extend:DI (reg:SI 138)))
(const_int 32 [0x20])))
(reg:SI 138)))
]) "a.cpp":15:32 70 {umlal}
 (expr_list:REG_DEAD (reg:SI 138)
(expr_list:REG_DEAD (reg:SI 128)
(nil))))

And its definition is 

(define_insn "<US>mlal"
  [(set (match_operand:SI 0 "s_register_operand" "=r,&r")
        (plus:SI
         (mult:SI
          (match_operand:SI 4 "s_register_operand" "%r,r")
          (match_operand:SI 5 "s_register_operand" "r,r"))
         (match_operand:SI 1 "s_register_operand" "0,0")))
   (set (match_operand:SI 2 "s_register_operand" "=r,&r")
        (plus:SI
         (truncate:SI
          (lshiftrt:DI
           (plus:DI
            (mult:DI (SE:DI (match_dup 4)) (SE:DI (match_dup 5)))
            (zero_extend:DI (match_dup 1)))
           (const_int 32)))
         (match_operand:SI 3 "s_register_operand" "2,2")))]
  "TARGET_32BIT"
  "<US>mlal%?\\t%0, %2, %4, %5"
  [(set_attr "type" "umlal")
   (set_attr "predicable" "yes")
   (set_attr "arch" "v6,nov6")]
)

After a couple of LRA constraint and assignment sub-passes, the two output
operands get the same hard reg.  This results in a cse abort in the
post-reload pass.

The issue is that the reload pseudos for pseudos 134 and 135 get the same value
as they are both matched with different occurrences of pseudo 138 in the insn.

The bug is in a very sensitive LRA code area and fixing it will take some time.
But I hope I'll have a fix at the end of next week.

[Bug rtl-optimization/102627] [11 Regression] wrong code with "-O1"

2021-10-14 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102627

--- Comment #8 from Vladimir Makarov  ---
I've committed the patch to the gcc-11 branch too as nobody complained about
the patch on the trunk.  I've also successfully bootstrapped and tested the
patch on the branch.

[Bug rtl-optimization/102627] [11/12 Regression] wrong code with "-O1"

2021-10-07 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102627

--- Comment #4 from Vladimir Makarov  ---
(In reply to Jakub Jelinek from comment #3)
> The assembly difference r11-8007 to r11-8008 is:
> --- pr102627.s2021-10-06 06:32:46.0 -0400
> +++ pr102627.s2021-10-06 06:33:00.0 -0400
> @@ -77,10 +77,10 @@ main:
>   movq%rdx, %rcx
>   movq%rax, %rdx
>   movqe(%rip), %rax
> - movq%rcx, 8(%rsp)
> + movl%ecx, 12(%rsp)
>   movzbl  f(%rip), %ecx
>   salq%cl, %rax
> - movq8(%rsp), %rcx
> + movl12(%rsp), %ecx
>   movq%rax, %rsi
>   movl$0, %edi
>   callw
> I believe y returns the 128-bit struct g return value in %rdx:%rax pair,
> right before the above instructions, and the above change means that instead
> of spilling the whole 64-bits of %rcx that holds at that point u.j and u.k
> members (u.k in the upper 32 bits of %rcx) it spills just 32-bits of %ecx
> and fills it back in, effectively setting u.k to 0.  The w call then takes
> %rdi, %rsi arguments it doesn't use and the TImode in %rcx:%rdx pair, but
> with the high 32 bits of the TImode value lost.  The reason for the spill is
> clear, the shift instruction needs that register...

Jakub, thank you for the analysis.  I believe the patch in question just
triggered a bug in hard reg live range splitting.

I am working on the PR.  I hope to fix it this week or at the beginning of
next week.

[Bug rtl-optimization/102147] IRA dependent on 32-bit vs 64-bit pointer size

2021-09-22 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102147

--- Comment #7 from Vladimir Makarov  ---
I've been thinking about ways to fix this problem but only came up with the
following patch.  With the patch, RA works mostly the same as before for 64-bit
targets and differently for 32-bit targets.  In any case the profitability is
only an estimation so I think the patch is ok.  Avoiding a 4 stage bootstrap is
more important than a bit slower RA on 32-bit targets (which is questionable)
in a few border cases.

I am going to commit the patch this Friday.

--- a/gcc/ira-build.c
+++ b/gcc/ira-build.c
@@ -629,7 +629,7 @@ ior_hard_reg_conflicts (ira_allocno_t a, const_hard_reg_set
set)
 bool
 ira_conflict_vector_profitable_p (ira_object_t obj, int num)
 {
-  int nw;
+  int nbytes;
   int max = OBJECT_MAX (obj);
   int min = OBJECT_MIN (obj);

@@ -638,9 +638,14 @@ ira_conflict_vector_profitable_p (ira_object_t obj, int
num)
in allocation.  */
 return false;

-  nw = (max - min + IRA_INT_BITS) / IRA_INT_BITS;
-  return (2 * sizeof (ira_object_t) * (num + 1)
-          < 3 * nw * sizeof (IRA_INT_TYPE));
+  nbytes = (max - min) / 8 + 1;
+  STATIC_ASSERT (sizeof (ira_object_t) <= 8);
+  /* Don't use sizeof (ira_object_t), use constant 8.  Size of ira_object_t (a
+     pointer) is different on 32-bit and 64-bit targets.  Usage sizeof
+     (ira_object_t) can result in different code generation by GCC built as 32-
+     and 64-bit program.  In any case the profitability is just an estimation
+     and border cases are rare.  */
+  return (2 * 8 /* sizeof (ira_object_t) */ * (num + 1) < 3 * nbytes);
 }

 /* Allocates and initialize the conflict vector of OBJ for NUM
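
To see why the old test could give different answers depending on how GCC
itself was built, here is a small stand-alone sketch of both formulas.  It is
not the GCC source: the function names are hypothetical, ptr_size stands for
sizeof (ira_object_t), and word_size stands for sizeof (IRA_INT_TYPE), which
is assumed here to be 8 bytes on both kinds of build.

```c
#include <assert.h>
#include <stdbool.h>

/* Old test, with the build-dependent sizes passed in explicitly.  */
static bool
old_profitable (int min, int max, int num, int ptr_size, int word_size)
{
  assert (ptr_size > 0 && word_size > 0);
  if (max < min)
    return false;
  int word_bits = word_size * 8;
  int nw = (max - min + word_bits) / word_bits;
  return 2 * ptr_size * (num + 1) < 3 * nw * word_size;
}

/* New test from the patch: the pointer size is fixed at 8 and the
   bit-vector size is counted in bytes, so nothing depends on the build.  */
static bool
new_profitable (int min, int max, int num)
{
  if (max < min)
    return false;
  int nbytes = (max - min) / 8 + 1;
  return 2 * 8 * (num + 1) < 3 * nbytes;
}
```

With min = 0, max = 255, num = 5 and an 8-byte word, the old test returns true
for 4-byte pointers (48 < 96) and false for 8-byte pointers (96 < 96), while
the new test returns false (96 < 96) in both builds.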

[Bug rtl-optimization/102356] [11/12 Regression] compile-time explosion at -O3 -g in var-tracking since r11-209-g74dc179a6da33cd0

2021-09-22 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102356

--- Comment #3 from Vladimir Makarov  ---
(In reply to Martin Liška from comment #2)
> If I see correctly, it started with r11-209-g74dc179a6da33cd0.

Yes, I confirm that my patch triggered the slowdown.  But the actual
problem is not in RA; it is in the scalability of the var-tracking pass.

I'll investigate further whether I can fix it in RA and whether it is worth
fixing there.

[Bug rtl-optimization/102147] IRA dependent on 32-bit vs 64-bit pointer size

2021-09-01 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102147

--- Comment #6 from Vladimir Makarov  ---
(In reply to David Edelsohn from comment #5)
> Vlad privately commented: "I suspect order of processing conflicts might
> depend on their representation."
> 
> The two representations may produce different results and the heuristics to
> choose the representation depend on the pointer size.

Yes.

I'll be working on the PR.  It is an interesting type of problem.  I think GCC
output should be the same regardless of which type of compiler (64-bit or
32-bit) we use for building GCC.
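
As a generic illustration of the suspicion quoted above (this is not IRA code,
and all names are hypothetical), the same set of conflicts can be visited in a
different order depending on whether it is stored as an array or as a bit set:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Array representation: conflicts are visited in insertion order.  */
static size_t
visit_vector (const int *ids, size_t n, int *out)
{
  for (size_t i = 0; i < n; i++)
    out[i] = ids[i];
  return n;
}

/* Build a 64-bit set from ids in the range 0..63.  */
static uint64_t
to_bitset (const int *ids, size_t n)
{
  uint64_t bits = 0;
  for (size_t i = 0; i < n; i++)
    {
      assert (ids[i] >= 0 && ids[i] < 64);
      bits |= UINT64_C (1) << ids[i];
    }
  return bits;
}

/* Bit-set representation: conflicts are visited in ascending id order.  */
static size_t
visit_bitset (uint64_t bits, int *out)
{
  size_t n = 0;
  for (int id = 0; id < 64; id++)
    if (bits & (UINT64_C (1) << id))
      out[n++] = id;
  return n;
}
```

For ids inserted as 7, 2, 5 the array walk yields 7, 2, 5 and the bit-set walk
yields 2, 5, 7.  Any later step that is sensitive to visiting order could then
produce different results for the two representations.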

[Bug rtl-optimization/100328] IRA doesn't model matching constraint well

2021-06-23 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100328

--- Comment #2 from Vladimir Makarov  ---
(In reply to Kewen Lin from comment #1)
> Created attachment 50715 [details]
> ira:consider matching cstr in all alternatives
> 
> With little understanding on ira, I am not quite sure this patch is on the
> reasonable direction. It aims to check the matching constraint in all
> alternatives, if there is one alternative with matching constraint and
> matches the current preferred regclass, it will record the output operand
> number and further create one copy for it. Normally it can get the priority
> against shuffle copies and the matching constraint will get satisfied with
> higher possibility, reload doesn't create extra copies to meet the matching
> constraint or the desirable register class when it has to.
> 
> For FMA A,B,C,D, I think ideally copies A/B, A/C, A/D can firstly stay as
> shuffle copies, and later any of A,B,C,D gets assigned by one hardware
> register which is a VSX register but not a FP register, which means it has
> to pay costs once we can NOT go with VSX alternatives, so at that time we
> can increase the freq for the remaining copies related to this, once the
> matching constraint gets satisfied further, there aren't any extra costs to
> pay. This idea seems a bit complicated in the current framework, so the
> proposed patch aggressively emphasizes the matching constraint at the time
> of creating copies.
> 
> FWIW bootstrapped/regtested on powerpc64le-linux-gnu P9. The evaluation with
> Power9 SPEC2017 all run shows 505.mcf_r +2.98%, 508.namd_r +3.37%, 519.lbm_r
> +2.51%, no remarkable degradation is observed.

Thank you for working on this issue.

The current implementation of ira_get_dup_out_num was specifically tuned for
better register allocation for x86-64 div insns.

Your patch definitely improves code for power9 and I would say significantly
(congratulations!).  The patch you proposed makes me think that it might work
for major targets as well.

I would prefer to avoid introducing a new parameter because there are too many
of them already and its description is cryptic.

It would be nice if you benchmarked the patch on x86-64 too.  If there is no
overall degradation with the new behaviour, we could remove the parameter and
make the new behaviour the default.  If there is, well, we will keep the
parameter.

As for the patch itself, I don't like some variable names.  Sorry.  Could you
use op_regno, out_regno, and present_alt instead of op_no, out_no, tot?
Please, in general, use longer variable names reflecting their purpose, as GCC
developers read code many times more often than they write it.
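
The idea of looking for a matching constraint in every alternative can be
sketched on a bare constraint string.  This is a simplified illustration and
not ira_get_dup_out_num: the helper names are hypothetical, alternatives are
assumed to be comma-separated, and any decimal digit is treated as a matching
constraint naming an operand (real constraint strings have more syntax).

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Return the operand number matched in alternative ALT of CONSTRAINTS,
   or -1 if that alternative has no matching constraint.  */
static int
matching_operand_in_alt (const char *constraints, int alt)
{
  assert (constraints != NULL && alt >= 0);
  int cur = 0;
  for (const char *p = constraints; *p != '\0'; p++)
    {
      if (*p == ',')
        {
          if (++cur > alt)
            break;
          continue;
        }
      if (cur == alt && isdigit ((unsigned char) *p))
        return *p - '0';
    }
  return -1;
}

/* Scan all N_ALTS alternatives.  Return the first alternative that has a
   matching constraint and store the matched operand in *MATCHED_OP, or
   return -1 if no alternative has one.  */
static int
first_matching_alt (const char *constraints, int n_alts, int *matched_op)
{
  for (int alt = 0; alt < n_alts; alt++)
    {
      int op = matching_operand_in_alt (constraints, alt);
      if (op >= 0)
        {
          *matched_op = op;
          return alt;
        }
    }
  return -1;
}
```

For the string "v,0" the first alternative has no matching constraint and the
second one matches operand 0, so a scan limited to one alternative could miss
it while a scan over all alternatives finds it.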

[Bug rtl-optimization/100066] [11 Regression] ICE in lra_assign, at lra-assigns.c:1649

2021-04-13 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100066

--- Comment #2 from Vladimir Makarov  ---
Thank you for reporting this.  I've reproduced this bug.  It seems something
is wrong with hard reg live range splitting.  This code is complicated, so I
cannot say when it will be fixed, but I'll do my best to fix it as soon as
possible.

[Bug rtl-optimization/96264] [10 Regression] wrong code with -Os -fno-forward-propagate -fschedule-insns -fno-tree-ter

2021-03-31 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96264

--- Comment #22 from Vladimir Makarov  ---
I've committed the patch to gcc-10 branch.

I also committed a patch modifying the test -- see PR99233.

[Bug rtl-optimization/96264] [10 Regression] wrong code with -Os -fno-forward-propagate -fschedule-insns -fno-tree-ter

2021-03-31 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96264

--- Comment #19 from Vladimir Makarov  ---
(In reply to Richard Biener from comment #18)
> Please somebody do it quick then (not omitting necessary testing, of course).

I am working on it.  It is my highest priority work.  The patch is ready.  If
the testing is ok (arm64 machines are a bottleneck for me), I'll commit it
today.

[Bug rtl-optimization/96264] [10 Regression] wrong code with -Os -fno-forward-propagate -fschedule-insns -fno-tree-ter

2021-03-30 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96264

--- Comment #17 from Vladimir Makarov  ---
(In reply to Peter Bergner from comment #16)
> (In reply to seurer from comment #15)
> > It still fails on gcc 10, though
> 
> Vlad, can we get this backported to GCC 10?  Maybe in time for GCC 10.3?

Nobody has complained about this patch since it was committed.  So I believe
we can backport it and the patch should be safe for the GCC 10 branch.
