[Bug rtl-optimization/42575] arm-eabi-gcc 64-bit multiply weirdness
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=42575

Wilco changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|REOPENED                    |RESOLVED
         Resolution|---                         |FIXED

--- Comment #21 from Wilco ---
Fixed by r276887 and other recent multiply improvements. The only CPU that can
still trigger stack pushes is cortex-a8 with -O2/O3 both for Arm and Thumb:

        mov     ip, r0
        mul     r3, ip, r3
        mla     r1, r2, r1, r3
        push    {lr}
        umull   r0, lr, r0, r2
        add     r1, r1, lr
        ldr     pc, [sp], #4

All other CPUs and optimization options generate the optimal:

        muls    r3, r0, r3
        mla     r1, r2, r1, r3
        umull   r0, r2, r0, r2
        add     r1, r1, r2
        bx      lr

So I consider this fixed.
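
For readers following the register usage, the optimal four-instruction
sequence in comment #21 maps onto the usual word-wise decomposition of a
64-bit product. The C below is a minimal sketch, not GCC's implementation. It
assumes the first operand arrives in r0 (low word) and r1 (high word) and the
second in r2 (low) and r3 (high); the names mul64_by_words and umull32 are
made up for illustration.

```c
#include <stdint.h>

/* 32x32 -> 64 unsigned multiply; plays the role of umull in this sketch. */
static uint64_t umull32(uint32_t a, uint32_t b)
{
    return (uint64_t)a * b;
}

/* Low 64 bits of x * y built from 32-bit pieces.  The comments show the
   instruction each step is assumed to correspond to. */
uint64_t mul64_by_words(uint64_t x, uint64_t y)
{
    uint32_t x_lo = (uint32_t)x, x_hi = (uint32_t)(x >> 32);   /* r0, r1 */
    uint32_t y_lo = (uint32_t)y, y_hi = (uint32_t)(y >> 32);   /* r2, r3 */

    uint32_t t = x_lo * y_hi;              /* mul   r3, r0, r3     */
    t = y_lo * x_hi + t;                   /* mla   r1, r2, r1, r3 */
    uint64_t p = umull32(x_lo, y_lo);      /* umull r0, r2, r0, r2 */
    uint32_t hi = t + (uint32_t)(p >> 32); /* add   r1, r1, r2     */

    return ((uint64_t)hi << 32) | (uint32_t)p;
}
```

Only the low 64 bits of the product are kept, so the same sequence serves
signed and unsigned operands. The cross terms need just their low 32 bits,
which is why plain mul and mla suffice and only one widening multiply appears.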
[Bug rtl-optimization/42575] arm-eabi-gcc 64-bit multiply weirdness
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=42575

--- Comment #20 from Wilco ---
(In reply to Wilco from comment #19)
> (In reply to Christophe Lyon from comment #18)
> > This is still wrong with current trunk.
>
> I don't see it happening since expansion of DImode instructions improved.
> The only case that uses an extra register is -mcpu=cortex-a9/-mcpu=cortex-a5
> with -O2 -mthumb:
>
>         mul     r3, r0, r3
>         push    {r4}
>         mov     r4, r1
>         umull   r0, r1, r0, r2
>         mla     r2, r2, r4, r3
>         ldr     r4, [sp], #4
>         add     r1, r1, r2
>         bx      lr
>
> I don't think we should expect perfect register allocation in severely
> constrained cases like this - scheduling can increase register pressure.

Interestingly this will be fixed by
https://gcc.gnu.org/ml/gcc-patches/2019-09/msg00576.html:

        mul     r3, r0, r3
        mov     ip, r1
        umull   r0, r1, r0, r2
        mla     ip, r2, ip, r3
        add     r1, r1, ip
        bx      lr

With r12 as an extra temporary, r4 no longer needs to be saved/restored.
[Bug rtl-optimization/42575] arm-eabi-gcc 64-bit multiply weirdness
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=42575

Wilco changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |wilco at gcc dot gnu.org

--- Comment #19 from Wilco ---
(In reply to Christophe Lyon from comment #18)
> This is still wrong with current trunk.

I don't see it happening since expansion of DImode instructions improved.
The only case that uses an extra register is -mcpu=cortex-a9/-mcpu=cortex-a5
with -O2 -mthumb:

        mul     r3, r0, r3
        push    {r4}
        mov     r4, r1
        umull   r0, r1, r0, r2
        mla     r2, r2, r4, r3
        ldr     r4, [sp], #4
        add     r1, r1, r2
        bx      lr

I don't think we should expect perfect register allocation in severely
constrained cases like this - scheduling can increase register pressure.
[Bug rtl-optimization/42575] arm-eabi-gcc 64-bit multiply weirdness
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=42575

Christophe Lyon changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |clyon at gcc dot gnu.org

--- Comment #18 from Christophe Lyon ---
This is still wrong with current trunk.
[Bug rtl-optimization/42575] arm-eabi-gcc 64-bit multiply weirdness
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=42575

--- Comment #17 from ktkachov at gcc dot gnu.org ---
As mentioned in the PR, sched1 exposes this problem.
[Bug rtl-optimization/42575] arm-eabi-gcc 64-bit multiply weirdness
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=42575

ktkachov at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |janis at gcc dot gnu.org

--- Comment #16 from ktkachov at gcc dot gnu.org ---
*** Bug 49678 has been marked as a duplicate of this bug. ***
[Bug rtl-optimization/42575] arm-eabi-gcc 64-bit multiply weirdness
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=42575

ktkachov at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Version|4.2.1                       |5.0

--- Comment #15 from ktkachov at gcc dot gnu.org ---
Updating version as this still affects trunk.
[Bug rtl-optimization/42575] arm-eabi-gcc 64-bit multiply weirdness
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=42575

ktkachov at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |vmakarov at redhat dot com

--- Comment #14 from ktkachov at gcc dot gnu.org ---
Vlad, do you have any insight on this?
The difference in scheduling is only the order between a mult and an add, but
the register allocation looks like the underlying cause.
[Bug rtl-optimization/42575] arm-eabi-gcc 64-bit multiply weirdness
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=42575

ktkachov at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|RESOLVED                    |REOPENED
         Resolution|FIXED                       |---

--- Comment #13 from ktkachov at gcc dot gnu.org ---
So I see this regression still, but only for some -mcpu options.
For example for -mcpu=cortex-a15 we get:

        mul     r3, r0, r3
        strd    r4, [sp, #-8]!
        umull   r4, r5, r0, r2
        mla     r1, r2, r1, r3
        mov     r0, r4
        add     r5, r1, r5
        mov     r1, r5
        ldrd    r4, [sp]
        add     sp, sp, #8

whereas for cortex-a7 we get:

        mul     r3, r0, r3
        mla     r3, r2, r1, r3
        umull   r0, r1, r0, r2
        add     r1, r3, r1

I think the problem here is reload.
If I look at the dump of postreload, for the 'bad' RTL I see:

        r0(SI) := r0(SI)
        r3(SI) := r0(SI) * r3(SI)
        r4(DI) := r0(SI) * r2(SI) //with zero extension
        r1(SI) := r2(SI) * r1(SI) + r3(SI)
        r5(SI) := r1(SI) + r5(SI)
        r0(DI) := r4(DI)

whereas for the good one I see:

        r0(SI) := r0(SI)
        r3(SI) := r0(SI) * r3(SI)
        r3(SI) := r2(SI) * r1(SI) + r3(SI)
        r0(DI) := r0(SI) * r2(SI) //with zero extension
        r1(SI) := r3(SI) + r1(SI)
        r0(DI) := r0(DI)

In the good one the final insn is eliminated due to being dead, whereas in the
bad one the final DImode move is split into two moves.

Sched1 changed the order of the mult and mult-accumulate, but it's the
register allocator that causes the bad codegen.
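
The effect of the instruction order on register pressure can be made concrete
with a small backward liveness walk over the two orders quoted in comment #13.
This is a rough sketch under stated assumptions, not how IRA or reload model
the problem: every 32-bit value counts as one register, and the value names,
the struct insn type and max_live are all invented for the example.

```c
#include <assert.h>
#include <stdint.h>

/* One bit per 32-bit value; the names are invented for this sketch. */
enum {
    X_LO = 1u << 0, X_HI = 1u << 1,   /* first 64-bit argument  */
    Y_LO = 1u << 2, Y_HI = 1u << 3,   /* second 64-bit argument */
    T1   = 1u << 4,                   /* mul result             */
    T2   = 1u << 5,                   /* mla result             */
    P_LO = 1u << 6, P_HI = 1u << 7,   /* umull result, 2 words  */
    R_HI = 1u << 8                    /* final high word        */
};

struct insn {
    uint32_t defs;   /* values written by the instruction */
    uint32_t uses;   /* values read by the instruction    */
};

#define INSN_MUL   { T1,          X_LO | Y_HI      }
#define INSN_MLA   { T2,          Y_LO | X_HI | T1 }
#define INSN_UMULL { P_LO | P_HI, X_LO | Y_LO      }
#define INSN_ADD   { R_HI,        T2 | P_HI        }

/* Order seen for cortex-a7: mul, mla, umull, add. */
const struct insn good_order[4] = { INSN_MUL, INSN_MLA, INSN_UMULL, INSN_ADD };
/* Order seen for cortex-a15: mul, umull, mla, add. */
const struct insn bad_order[4]  = { INSN_MUL, INSN_UMULL, INSN_MLA, INSN_ADD };

static int count_bits(uint32_t v)
{
    int n = 0;
    for (; v != 0; v &= v - 1)   /* clear the lowest set bit each round */
        n++;
    return n;
}

/* Walk the sequence backwards with live_in = (live_out & ~defs) | uses and
   return the largest number of values live at any program point. */
int max_live(const struct insn *seq, int n, uint32_t live_out)
{
    uint32_t live = live_out;
    int peak = count_bits(live);

    assert(n >= 0);
    for (int i = n - 1; i >= 0; i--) {
        live = (live & ~seq[i].defs) | seq[i].uses;
        if (count_bits(live) > peak)
            peak = count_bits(live);
    }
    return peak;
}
```

With the result words P_LO and R_HI live on exit, max_live(good_order, 4,
P_LO | R_HI) gives 4 and max_live(bad_order, 4, P_LO | R_HI) gives 5. Four
values fit in r0-r3; the fifth is consistent with the extra registers being
saved and restored in the 'bad' listing.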
[Bug rtl-optimization/42575] arm-eabi-gcc 64-bit multiply weirdness
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=42575

Bernd Edlinger bernd.edlinger at hotmail dot de changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |bernd.edlinger at hotmail dot de

--- Comment #11 from Bernd Edlinger bernd.edlinger at hotmail dot de ---
The test case fails on current trunk:

longfunc:
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        mul     r3, r0, r3
        push    {r4, r5}
        umull   r4, r5, r0, r2
        mla     r1, r2, r1, r3
        mov     r0, r4
        add     r5, r5, r1
        mov     r1, r5
        pop     {r4, r5}
        bx      lr
        .size   longfunc, .-longfunc
        .ident  "GCC: (GNU) 4.9.0 20140209 (experimental)"
[Bug rtl-optimization/42575] arm-eabi-gcc 64-bit multiply weirdness
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=42575

--- Comment #12 from Bernd Edlinger bernd.edlinger at hotmail dot de ---
$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/home/ed/gnu/arm-linux-gnueabihf/libexec/gcc/armv7l-unknown-linux-gnueabihf/4.9.0/lto-wrapper
Target: armv7l-unknown-linux-gnueabihf
Configured with: ../gcc-4.9-20140209/configure --prefix=/home/ed/gnu/arm-linux-gnueabihf --enable-languages=c,c++,objc,obj-c++,fortran,ada,go --with-arch=armv7-a --with-tune=cortex-a9 --with-fpu=vfpv3-d16 --with-float=hard
Thread model: posix
gcc version 4.9.0 20140209 (experimental) (GCC)
[Bug rtl-optimization/42575] arm-eabi-gcc 64-bit multiply weirdness
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=42575

ktkachov at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|REOPENED                    |RESOLVED
                 CC|                            |ktkachov at gcc dot gnu.org
         Resolution|---                         |FIXED

--- Comment #10 from ktkachov at gcc dot gnu.org ---
(In reply to jules from comment #9)
> This appears to have regressed on mainline. I now get the following assembly
> output for the test case added by Maxim:
>
> longfunc:
>         @ args = 0, pretend = 0, frame = 0
>         @ frame_needed = 0, uses_anonymous_args = 0
>         @ link register save eliminated.
>         stmfd   sp!, {r4, r5}
>         umull   r4, r5, r0, r2
>         mul     r3, r0, r3
>         mla     r1, r2, r1, r3
>         mov     r0, r4
>         add     r1, r1, r5
>         ldmfd   sp!, {r4, r5}
>         bx      lr

Current trunk (r199375) gives the following, so I think this can be closed:

longfunc:
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        mul     r3, r0, r3
        mla     r3, r2, r1, r3
        umull   r0, r1, r0, r2
        add     r1, r3, r1
        bx      lr
[Bug rtl-optimization/42575] arm-eabi-gcc 64-bit multiply weirdness
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=42575

jules at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|RESOLVED                    |REOPENED
                 CC|                            |jules at gcc dot gnu.org
         Resolution|FIXED                       |

--- Comment #9 from jules at gcc dot gnu.org 2011-09-20 19:03:43 UTC ---
This appears to have regressed on mainline. I now get the following assembly
output for the test case added by Maxim:

longfunc:
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        stmfd   sp!, {r4, r5}
        umull   r4, r5, r0, r2
        mul     r3, r0, r3
        mla     r1, r2, r1, r3
        mov     r0, r4
        add     r1, r1, r5
        ldmfd   sp!, {r4, r5}
        bx      lr
[Bug rtl-optimization/42575] arm-eabi-gcc 64-bit multiply weirdness
--- Comment #7 from mkuvyrkov at gcc dot gnu dot org 2010-08-18 10:34 ---
Subject: Bug 42575

Author: mkuvyrkov
Date: Wed Aug 18 10:34:02 2010
New Revision: 163334

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=163334
Log:
        gcc/
        PR rtl-optimization/42575
        * optabs.c (expand_doubleword_mult): Generate new pseudos to shorten
        live ranges.

        gcc/testsuite/
        PR rtl-optimization/42575
        * gcc.target/pr42575.c: New test.

Added:
    trunk/gcc/testsuite/gcc.target/arm/pr42575.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/optabs.c
    trunk/gcc/testsuite/ChangeLog

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=42575
[Bug rtl-optimization/42575] arm-eabi-gcc 64-bit multiply weirdness
--- Comment #8 from mkuvyrkov at gcc dot gnu dot org 2010-08-18 10:43 ---
Bernd did all the heavy lifting for this patch. The above patch fixes the
last piece of the problem -- extra move when compiling for ARMv7-A.

--
mkuvyrkov at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=42575
[Bug rtl-optimization/42575] arm-eabi-gcc 64-bit multiply weirdness
--- Comment #6 from bernds at gcc dot gnu dot org 2010-07-29 12:40 ---
Subject: Bug 42575

Author: bernds
Date: Thu Jul 29 12:39:57 2010
New Revision: 162678

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=162678
Log:
        PR rtl-optimization/42575
        * dce.c (word_dce_process_block): Renamed from byte_dce_process_block.
        Argument AU removed. All callers changed. Ignore artificial refs.
        Use return value of df_word_lr_simulate_defs to decide whether an insn
        is necessary.
        (fast_dce): Rename arg to WORD_LEVEL.
        (run_word_dce): Renamed from rest_of_handle_fast_byte_dce. No longer
        static.
        (pass_fast_rtl_byte_dce): Delete.
        * dce.h (run_word_dce): Declare.
        * df-core.c (df_print_word_regset): Renamed from df_print_byteregset.
        All callers changed. Simplify code to only deal with two-word regs.
        * df.h (DF_WORD_LR): Renamed from DF_BYTE_LR.
        (DF_WORD_LR_BB_INFO): Renamed from DF_BYTE_LR_BB_INFO.
        (DF_WORD_LR_IN): Renamed from DF_BYTE_LR_IN.
        (DF_WORD_LR_OUT): Renamed from DF_BYTE_LR_OUT.
        (struct df_word_lr_bb_info): Renamed from df_byte_lr_bb_info.
        (df_word_lr_mark_ref): Declare.
        (df_word_lr_add_problem, df_word_lr_mark_ref, df_word_lr_simulate_defs,
        df_word_lr_simulate_uses): Declare or rename from byte variants.
        (df_byte_lr_simulate_artificial_refs_at_top,
        df_byte_lr_simulate_artificial_refs_at_end, df_byte_lr_get_regno_start,
        df_byte_lr_get_regno_len, df_compute_accessed_bytes): Delete
        declarations.
        (df_word_lr_get_bb_info): Rename from df_byte_lr_get_bb_info.
        (enum df_mm): Delete.
        * df-byte-scan.c: Delete file.
        * df-problems.c (df_word_lr_problem_data): Renamed from
        df_byte_lr_problem_data, all members deleted except for
        WORD_LR_BITMAPS, which is renamed from BYTE_LR_BITMAPS. Uses changed.
        (df_word_lr_expand_bitmap, df_byte_lr_simulate_artificial_refs_at_top,
        df_byte_lr_simulate_artificial_refs_at_end, df_byte_lr_get_regno_start,
        df_byte_lr_get_regno_len, df_byte_lr_check_regs,
        df_byte_lr_confluence_0): Delete functions.
        (df_word_lr_free_bb_info): Renamed from df_byte_lr_free_bb_info; all
        callers changed.
        (df_word_lr_alloc): Renamed from df_byte_lr_alloc; all callers changed.
        Don't initialize members that were deleted, don't try to discover data
        about registers. Ignore hard regs.
        (df_word_lr_reset): Renamed from df_byte_lr_reset; all callers changed.
        (df_word_lr_mark_ref): New function.
        (df_word_lr_bb_local_compute): Renamed from
        df_byte_bb_lr_local_compute; all callers changed. Use
        df_word_lr_mark_ref. Assert that artificial refs don't include
        pseudos. Ignore hard registers.
        (df_word_lr_local_compute): Renamed from df_byte_lr_local_compute.
        Assert that exit block uses don't contain pseudos.
        (df_word_lr_init): Renamed from df_byte_lr_init; all callers changed.
        (df_word_lr_confluence_n): Renamed from df_byte_lr_confluence_n; all
        callers changed. Ignore hard regs.
        (df_word_lr_transfer_function): Renamed from
        df_byte_lr_transfer_function; all callers changed.
        (df_word_lr_free): Renamed from df_byte_lr_free; all callers changed.
        (df_word_lr_top_dump): Renamed from df_byte_lr_top_dump; all callers
        changed.
        (df_word_lr_bottom_dump): Renamed from df_byte_lr_bottom_dump; all
        callers changed.
        (problem_WORD_LR): Renamed from problem_BYTE_LR; uses changed;
        confluence operator 0 set to NULL.
        (df_word_lr_add_problem): Renamed from df_byte_lr_add_problem; all
        callers changed.
        (df_word_lr_simulate_defs): Renamed from df_byte_lr_simulate_defs.
        Return bool, true if bitmap changed or insn otherwise necessary.
        All callers changed. Simplify using df_word_lr_mark_ref.
        (df_word_lr_simulate_uses): Renamed from df_byte_lr_simulate_uses;
        all callers changed. Simplify using df_word_lr_mark_ref.
        * lower-subreg.c: Include dce.h
        (decompose_multiword_subregs): Call run_word_dce if df available.
        * Makefile.in (lower-subreg.o): Adjust dependencies.
        (df-byte-scan.o): Delete.
        * timevar.def (TV_DF_WORD_LR): Renamed from TV_DF_BYTE_LR.

Removed:
    trunk/gcc/df-byte-scan.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/Makefile.in
    trunk/gcc/dce.c
    trunk/gcc/dce.h
    trunk/gcc/df-core.c
    trunk/gcc/df-problems.c
    trunk/gcc/df.h
    trunk/gcc/lower-subreg.c
    trunk/gcc/timevar.def

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=42575
[Bug rtl-optimization/42575] arm-eabi-gcc 64-bit multiply weirdness
--- Comment #5 from drow at gcc dot gnu dot org 2010-02-22 21:06 ---
(In reply to comment #3)
> * What is the purpose of insn 12 here? It looks to me like this is dead code,
> since r5 is restored in insn 38 (although, not knowing ARM so well, I may be
> wrong).

I couldn't figure this out either. Where did it come from - was it so late
that we never DCE'd it, or does something bizarre claim to be dependent on the
value?

> Note how the sched1 pass has switched the two insns around. The register
> allocator now decides to use two new registers here, because r0 and r3 are
> both live. After RA, sched2 switches insn 9 and insn 10 again, and r2 and r3
> become available in insn 10 -- but this is too late. Question for the ARM
> maintainer now is: Why does sched1 want to swap insns 9 and 10, when sched2
> wants to swap them back again?

I'm guessing, but presumably we want to separate the mul from the mla because
they're dependent; the umull isn't. But I don't know what would swap them
back again and that's probably the crux.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=42575
[Bug rtl-optimization/42575] arm-eabi-gcc 64-bit multiply weirdness
--- Comment #3 from steven at gcc dot gnu dot org 2010-02-08 10:47 ---
Trunk today produces this (with -dAP hacked to print slim RTL):

        .file   "t.c"
        .text
        .align  2
        .global longfunc
        .type   longfunc, %function
longfunc:
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        @ basic block 2
        @ 8 ip:SI=r2:SI*r1:SI
        @ REG_DEAD: r1:SI
        mul     ip, r2, r1      @ 8 *arm_mulsi3/2 [length = 4]
        @ 35 {[--sp:SI]=unspec[r4:SI] 2;use r5:SI;}
        @ REG_DEAD: r5:SI
        @ REG_DEAD: r4:SI
        @ REG_FRAME_RELATED_EXPR: sequence
        stmfd   sp!, {r4, r5}   @ 35 *push_multi [length = 4]
        @ 9 r1:SI=r0:SI*r3:SI+ip:SI
        @ REG_DEAD: ip:SI
        @ REG_DEAD: r3:SI
        @ REG_DEAD: r0:SI
        mla     r1, r0, r3, ip  @ 9 *mulsi3addsi/2 [length = 4]
        @ 10 r4:DI=zero_extend(r2:SI)*zero_extend(r0:SI)
        @ REG_DEAD: r2:SI
        umull   r4, r5, r2, r0  @ 10 *umulsidi3_nov6 [length = 4]
        @ 11 r1:SI=r1:SI+r5:SI
        @ REG_DEAD: r5:SI
        add     r1, r1, r5      @ 11 *arm_addsi3/1 [length = 4]
        @ 12 r5:SI=r1:SI
        mov     r5, r1          @ 12 *arm_movsi_insn/1 [length = 4]
        @ 31 r0:SI=r4:SI
        mov     r0, r4          @ 31 *arm_movsi_insn/1 [length = 4]
        @ 38 unspec/v{return;}
        ldmfd   sp!, {r4, r5}
        bx      lr
        .size   longfunc, .-longfunc
        .ident  "GCC: (GNU) 4.5.0 20100208 (experimental) [trunk revision 156595]"

Questions for those who know ARM:

* What is the purpose of insn 12 here? It looks to me like this is dead code,
since r5 is restored in insn 38 (although, not knowing ARM so well, I may be
wrong).

* After combine we have these two insns:

        9 r138:SI=r142:SI*r3:SI+r139:SI
          REG_DEAD: r3:SI
          REG_DEAD: r139:SI
       10 r137:DI=zero_extend(r144:SI)*zero_extend(r142:SI)
          REG_DEAD: r144:SI
          REG_DEAD: r142:SI

which translate to the mla insn and to the umull insn that uses r4 and r5:

        @ 10 r4:DI=zero_extend(r2:SI)*zero_extend(r0:SI)
        @ REG_DEAD: r2:SI
        umull   r4, r5, r2, r0  @ 10 *umulsidi3_nov6 [length = 4]
        @ 9 r1:SI=r0:SI*r3:SI+ip:SI
        @ REG_DEAD: ip:SI
        @ REG_DEAD: r3:SI
        @ REG_DEAD: r0:SI
        mla     r1, r0, r3, ip  @ 9 *mulsi3addsi/2 [length = 4]

Note how the sched1 pass has switched the two insns around. The register
allocator now decides to use two new registers here, because r0 and r3 are
both live. After RA, sched2 switches insn 9 and insn 10 again, and r2 and r3
become available in insn 10 -- but this is too late. Question for the ARM
maintainer now is: Why does sched1 want to swap insns 9 and 10, when sched2
wants to swap them back again?

(Note, btw, how wrong the REG_DEAD notes are: r0 dies in insn 9 and is used in
insn 10, because the sched2 pass fails to update the notes when it moves insn 9
before insn 10. But that's a separate issue...)

* If I compile with -fno-schedule-insns, I still don't get the optimal code:

        mul     ip, r2, r1
        str     r4, [sp, #-4]!
        mla     r1, r0, r3, ip
        umull   r3, r4, r2, r0
        add     r1, r1, r4
        mov     r4, r1
        mov     r0, r3
        ldmfd   sp!, {r4}
        bx      lr

This time the compiler chooses to use r3:DI in the umull, instead of r2:DI
(that is r2 and r3). I am guessing this may be a target REG_ALLOC_ORDER issue,
where r3 comes before r2. That's another thing for a target maintainer to look
into. If IRA would select r2:DI, you would also lose the save/restore of r4
and get the perfect code of comment #2.

So two issues:
1. Why does the sched1 pass schedule insn 10 before insn 9?
2. With -fno-schedule-insns, why does IRA prefer (r3,r4) over (r2,r3)?

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=42575
[Bug rtl-optimization/42575] arm-eabi-gcc 64-bit multiply weirdness
--- Comment #4 from steven at gcc dot gnu dot org 2010-02-08 10:51 ---
Add an ARM guy to the CC:

--
steven at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |ramana at gcc dot gnu dot org

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=42575
[Bug rtl-optimization/42575] arm-eabi-gcc 64-bit multiply weirdness
--- Comment #2 from ramana at gcc dot gnu dot org 2010-01-04 10:54 ---
Confirmed with trunk. I get:

longfunc:
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        mul     r1, r2, r1
        mla     r1, r0, r3, r1
        stmfd   sp!, {r4, r5}
        umull   r4, r5, r2, r0
        add     r1, r1, r5
        mov     r0, r4
        mov     r5, r1
        ldmfd   sp!, {r4, r5}
        bx      lr

r4 and r5 need not be used here - you could do with just r2 and r3 instead of
r4 and r5, i.e.

        mul     r1, r2, r1
        mla     r1, r0, r3, r1
        umull   r2, r3, r2, r0
        add     r1, r1, r3
        mov     r0, r2
        bx      lr

--
ramana at gcc dot gnu dot org changed:

           What         |Removed                                      |Added
------------------------------------------------------------------------------------------------------------
                  Status|UNCONFIRMED                                  |NEW
               Component|target                                       |rtl-optimization
          Ever Confirmed|0                                            |1
                Keywords|                                             |missed-optimization, ra
    Last reconfirmed date|-00-00 00:00:00                              |2010-01-04 10:54:28
                 Summary|arm-eabi-gcc 4.2.1 64-bit multiply weirdness |arm-eabi-gcc 64-bit multiply weirdness

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=42575
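
The source of the test case is not quoted anywhere in this thread. Judging
from the function name in the listings, the four argument registers in use
and the bug title, a reproducer along the following lines would be expected.
This is a hypothetical sketch: only the name longfunc comes from the thread,
and the signature and body are assumptions.

```c
/* Hypothetical reproducer: a plain 64-bit multiply of two 64-bit
   arguments.  The body is assumed, not taken from the thread. */
long long longfunc(long long x, long long y)
{
    return x * y;
}
```

Compiling it to assembly with an ARM cross compiler at -O2, optionally with
one of the -mcpu values mentioned in the later comments, should show whether
a register is saved and restored around the umull.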