[Bug rtl-optimization/81025] [8 Regression] gcc ICE while building glibc for MIPS soft-float multi-lib variant
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81025 --- Comment #9 from Doug Gilmore --- > I bet this is a bug in reorg.c. It is the least used code (major > target usage: MIPS and sparc only) and also one of the more buggy > code. You're right, compiling with -fno-delayed-branch doesn't tickle the bug. Thanks!
[Bug tree-optimization/81025] [8 Regression] gcc ICE while building glibc for MIPS soft-float multi-lib variant
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81025

Doug Gilmore changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|[MIPS] soft-float glibc     |[8 Regression] gcc ICE
                   |build fails at r248863      |while building glibc for
                   |                            |MIPS soft-float multi-lib
                   |                            |variant

--- Comment #6 from Doug Gilmore ---
We are back to having our MIPS nightly ToT toolchain builds all working with r247049 reverted. Given that r247049 exposes another PRE issue, see bug 80620, does it make sense to back out until we resolve the problems at hand?
[Bug tree-optimization/81025] [MIPS] soft-float glibc build fails at r248863
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81025

Doug Gilmore changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rguenth at gcc dot gnu.org

--- Comment #4 from Doug Gilmore ---
Created attachment 41513
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41513&action=edit
cut down example via delta

Sorry, the attachment for the last comment was dropped.

I bisected the failure to r247049 using the cut down example, compiled via:

  $dir/xgcc -B$dir -O2 -msoft-float -mabi=32 delta_1.i -c -std=gnu11 -fgnu89-inline -O2 -fmerge-all-constants -fno-stack-protector -frounding-math -g

For this bisect I configured with --disable-multilib. I'll look into this more tomorrow.
[Bug tree-optimization/81025] [MIPS] soft-float glibc build fails at r248863
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81025 --- Comment #3 from Doug Gilmore --- It appears that r248863 just tickles the bug. With the attached example produced by delta, the failure mode is exposed by r248862. With luck, I may be able to bisect the problem to an earlier commit.
[Bug tree-optimization/81025] New: [MIPS] soft-float glibc build fails at r248863
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81025

Bug ID: 81025
Summary: [MIPS] soft-float glibc build fails at r248863
Product: gcc
Version: 8.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: doug.gilmore at imgtec dot com
Target Milestone: ---

Created attachment 41509
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41509&action=edit
CPP output file

Our ToT GLIBC soft-float builds are failing; I bisected the problem to r248863.

To reproduce the problem with minimum effort, configure and build via:

  /configure --prefix=.../install-mips-mti-linux-gnu --disable-libssp --disable-libmudflap --disable-decimal-float --with-mips-plt --target=mips-mti-linux-gnu --enable-languages=c --without-headers --disable-shared --disable-threads --disable-libquadmath --disable-libatomic --with-sysroot=.../install-mips-mti-linux-gnu/sysroot
  make maybe-all-gcc

I attached two patches. One restricts the number of multi-lib variants, which probably isn't needed for maybe-all-gcc, but will speed up a full gcc build. The other is a cherry pick of r248879, which is needed to build r248863 for MIPS.

Build CPP file:

  /gcc/xgcc -B/gcc -O2 -msoft-float -mabi=32 s_fmaf.i -c -std=gnu11 -fgnu89-inline -O2 -Wall -Werror -Wundef -Wwrite-strings -fmerge-all-constants -fno-stack-protector -frounding-math -g -Wstrict-prototypes -Wold-style-definition

The CPP file compiles cleanly at r248862, but at r248863 with the patch for r248879 applied, the compile fails with:

  during RTL pass: dwarf2
  In file included from ../sysdeps/mips/ieee754/s_fmaf.c:4:0:
  ../soft-fp/fmasf4.c: In function '__fmaf':
  ../soft-fp/fmasf4.c:62:1: internal compiler error: in maybe_record_trace_start, at dwarf2cfi.c:2330
  0x74ab9f maybe_record_trace_start
          /scratch/dgilmore/sgcc-pp5/src/gcc/gcc/dwarf2cfi.c:2330
  0x74af2f create_trace_edges
          /scratch/dgilmore/sgcc-pp5/src/gcc/gcc/dwarf2cfi.c:2426
  0x74b0af scan_trace
          /scratch/dgilmore/sgcc-pp5/src/gcc/gcc/dwarf2cfi.c:2640
  0x74bd16 create_cfi_notes
          /scratch/dgilmore/sgcc-pp5/src/gcc/gcc/dwarf2cfi.c:2666
  0x74bd16 execute_dwarf2_frame
          /scratch/dgilmore/sgcc-pp5/src/gcc/gcc/dwarf2cfi.c:3024
  0x74bd16 execute
          /scratch/dgilmore/sgcc-pp5/src/gcc/gcc/dwarf2cfi.c:3504
  Please submit a full bug report,
  with preprocessed source if appropriate.
  Please include the complete backtrace with any bug report.
  See <https://gcc.gnu.org/bugs/> for instructions.

After adding -fdump-rtl-dwarf2 to the compilation line, the associated dump file contains:

  Inconsistent CFI state!
  SHOULD have:
          .cfi_def_cfa 29, 0
  DO have:
          .cfi_def_cfa 29, 8
          .cfi_offset 16, -4

The CPP file is quite complicated; I am investigating whether a cut down example will reproduce the failure.
[Bug tree-optimization/81025] [MIPS] soft-float glibc build fails at r248863
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81025 --- Comment #2 from Doug Gilmore --- Created attachment 41511 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41511&action=edit patch needed to build r248863 for MIPS
[Bug tree-optimization/81025] [MIPS] soft-float glibc build fails at r248863
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81025 --- Comment #1 from Doug Gilmore --- Created attachment 41510 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41510&action=edit Patch to constrain the number of multi-lib variants
[Bug tree-optimization/79955] New: GLIBC build fails after r245840
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79955

Bug ID: 79955
Summary: GLIBC build fails after r245840
Product: gcc
Version: 7.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: doug.gilmore at imgtec dot com
Target Milestone: ---

Created attachment 40920
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=40920&action=edit
CPP output file for mips-mti-linux-gnu target

See: https://sourceware.org/ml/libc-alpha/2017-03/msg00052.html

We are working around the issue by disabling -Werror in the build. I'll upload an X86_64 .i file tomorrow.

  $ mips-mti-linux-gnu-gcc -mabi=32 fnmatch.i -c -std=gnu11 -fgnu89-inline -O2 -Wall -Wundef -Wwrite-strings -fmerge-all-constants -fno-stack-protector -frounding-math \
    -g -Wstrict-prototypes -Wold-style-definition -ftls-model=initial-exec
  In file included from fnmatch.c:250:0:
  fnmatch_loop.c: In function 'internal_fnwmatch':
  ../locale/weightwc.h:103:28: warning: '*((void *)+4)' may be used uninitialized in this function [-Wmaybe-uninitialized]
[Bug tree-optimization/79291] r244897 introduces IV related performance issues for daxpy on MIPS by enabling peeling for alignment
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79291

--- Comment #6 from Doug Gilmore ---
> It also looks like mips lacks implementation of any of the
> vectorizer cost hooks and thus defaults to
> default_builtin_vectorization_cost which means that unaligned
> loads/stores have double cost.

Removing the double cost for unaligned memory OPs didn't have any effect: peeling still occurred and the alias problem is exposed on MIPS.

So it looks like we need to come up with a fix for bug 69710, one that hopefully also fixes bug 68030, to address this issue.
[Bug tree-optimization/79291] r244897 introduces IV related performance issues for daxpy on MIPS by enabling peeling for alignment
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79291 --- Comment #5 from Doug Gilmore --- > Bin: I suspect this is also now broken on ARM, can > you check? Oops, sorry I forgot that this problem is not exposed on the original ARM/Neon for DP. Sorry for the noise.
[Bug tree-optimization/79291] r244897 introduces IV related performance issues for daxpy on MIPS by enabling peeling for alignment
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79291 --- Comment #4 from Doug Gilmore --- > It also looks like mips lacks implementation of any of the > vectorizer cost hooks and thus defaults to > default_builtin_vectorization_cost which means that unaligned > loads/stores have double cost. I have investigated that in the past and that costing is needed in some cases. I'll start looking into that again.
[Bug target/78176] [MIPS] miscompiles ldxc1 with large pointers on 32-bits
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78176 --- Comment #20 from Doug Gilmore --- I'll collect more tracing data on the costing problem. Hopefully I'll post an update in the next few days.
[Bug target/78176] [MIPS] miscompiles ldxc1 with large pointers on 32-bits
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78176

Doug Gilmore changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |law at redhat dot com,
                   |                            |rguenth at gcc dot gnu.org,
                   |                            |zqchen at gcc dot gnu.org

--- Comment #18 from Doug Gilmore ---
CC author and reviewers of r216501.
[Bug target/78176] [MIPS] miscompiles ldxc1 with large pointers on 32-bits
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78176

--- Comment #17 from Doug Gilmore ---
> This really throws off the costing of substituting different IVs on
> MIPS.

I forgot to mention that for MIPS the net effect of r216501 is that we do not produce indexed memory OPs in simple examples where we should. But we also produce problematic indexed memory OPs in situations where address generation costing is a bit complicated (the original issue associated with this bug report).

Applying the two patches I just attached fixes the problem of generating indexed memory OPs in simple examples. It also causes IVOPTS to select IVs similar to those chosen in the past, which avoids the problem of executing indexed memory OPs in O32 binaries on 64-bit MIPS processors running current Linux kernels.

One issue still needs to be addressed: recognizing that rewriting a "use" to use a different IV can expose a problem with indexed memory OPs on 64-bit MIPS processors, in which case an infinite cost should be associated with that rewrite. Thus the need for the flag to turn off the generation of indexed memory OPs until this issue is addressed.
[Bug target/78176] [MIPS] miscompiles ldxc1 with large pointers on 32-bits
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78176 --- Comment #16 from Doug Gilmore --- Created attachment 40632 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=40632&action=edit Tweak to adjust_setup_cost (r220473). Second patch associated with previous comment.
[Bug target/78176] [MIPS] miscompiles ldxc1 with large pointers on 32-bits
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78176

--- Comment #15 from Doug Gilmore ---
Created attachment 40631
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=40631&action=edit
Prototype change to back out r216501.

> Bisected the problem to commit r216501:

The review discussion of r216501 starts with this message:

  https://gcc.gnu.org/ml/gcc-patches/2014-10/msg00758.html

which contains:

  There are two implementations of seq_cost. The function bodies are exactly the same. The patch removes one of them and makes the other global.

So it seems the patch was a cleanup that shouldn't introduce a functional change. However, the implementations of seq_cost are different, per the final version of the patch:

  https://gcc.gnu.org/ml/gcc-patches/2014-10/msg00896.html

  cfgloopanal.c:
  -      cost += set_rtx_cost (set, speed);
  rtlanal.c:
  +      cost += set_rtx_cost (set, speed);
  tree-ssa-loop-ivopts.c:
  -      cost += set_src_cost (SET_SRC (set), speed);

In general, when computing the cost of a sequence of N INSNs, this increases the cost of the sequence by N*4. This really throws off the costing of substituting different IVs on MIPS.

The first patch attached (just a prototype) basically reverts this change.

The second fixes a problem with r220473, a fix for PR62631 from Eric Botcazou:

  https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62631#c17

  This looks a generic problem in get_shiftadd_cost to me, it ought to mimic the algorithms in expmed.c, something like ...

This change can lower the cost of a sequence of instructions. However, there are situations where scaling this (lower) cost by an estimated iteration count causes the adjusted cost to become zero. For the example attached to the second patch, the IV replacement algorithm then determines that the cost of using separate IVs for each load is less than the cost of one IV for all loads.

Thus, in the second patch we detect that a non-zero cost being scaled to zero should be represented by one instead, which gets us back to IVOPTS generating just one IV that is used for all loads.
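A minimal sketch of the two costing effects described in this comment, written as a toy C model and not as GCC code. The type toy_insn and the functions seq_cost_src_only, seq_cost_whole_set and scale_setup_cost are hypothetical names introduced for illustration; the only detail taken from GCC is that COSTS_N_INSNS (N) expands to N * 4. The model assumes that costing the whole SET adds about one instruction's worth of cost per insn on top of the source cost, and that the clamp keeps a non-zero cost from being scaled down to zero.

```c
/* Same definition as in GCC's rtl.h: one insn costs 4 units. */
#define COSTS_N_INSNS(n) ((n) * 4)

/* Hypothetical stand-in for an insn: only the cost of the SET source. */
struct toy_insn {
    int src_cost;
};

/* Models the old ivopts variant: sum the cost of each SET_SRC only. */
int seq_cost_src_only(const struct toy_insn *seq, int n)
{
    int cost = 0;
    for (int i = 0; i < n; i++)
        cost += seq[i].src_cost;
    return cost;
}

/* Models costing the whole SET: assumed here to add one insn per SET. */
int seq_cost_whole_set(const struct toy_insn *seq, int n)
{
    int cost = 0;
    for (int i = 0; i < n; i++)
        cost += seq[i].src_cost + COSTS_N_INSNS(1);
    return cost;
}

/* Scale a setup cost by an estimated iteration count (avg_iters > 0),
   but never let a non-zero cost round down to zero. */
int scale_setup_cost(int cost, int avg_iters)
{
    int scaled = cost / avg_iters;
    if (cost != 0 && scaled == 0)
        scaled = 1;
    return scaled;
}
```

In this model a sequence of three insns differs by 3 * 4 = 12 between the two sums, and a setup cost of 3 spread over 10 iterations stays at 1 instead of dropping to 0.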
[Bug tree-optimization/79291] New: r244897 introduces alias related performance issues for daxpy on MIPS
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79291

Bug ID: 79291
Summary: r244897 introduces alias related performance issues for daxpy on MIPS
Product: gcc
Version: 7.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: doug.gilmore at imgtec dot com
Target Milestone: ---

It appears that r244897 introduces peeling for DP daxpy, which, per bug 69710, introduces a performance degradation due to alias issues.

After IVOPTS before r244897 (use daxpy example from bug 69710):

  ivtmp.20_36 = ivtmp.20_35 + 1;
  ivtmp.21_24 = ivtmp.21_9 + 16;
  ivtmp.24_3 = ivtmp.24_2 + 16;

After IVOPTS after r244897:

  ivtmp.23_56 = ivtmp.23_24 + 1;
  ivtmp.24_11 = ivtmp.24_9 + 16;
  ivtmp.27_87 = ivtmp.27_86 + 16;
  ivtmp.29_90 = ivtmp.29_89 + 16;

Thus after r244897 we have a problem in DP daxpy that we were only seeing for SP daxpy (or saxpy), as shown in bug 69710.

BTW: I have been investigating another IVOPTS related regression on MIPS32R2 that is related to the generation of indexed memory OPs:

  https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78176#c12

I'll be updating that report with more information on how to fix the regression and how it relates to this issue.

Bin: I suspect this is also now broken on ARM, can you check?

Thanks,

Doug
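For readers unfamiliar with peeling for alignment, here is a minimal C sketch of the idea, not the vectorizer's actual code. The function peel_count is a hypothetical helper: it computes how many scalar iterations would run before a pointer reaches the vector alignment, assuming the alignment is a power of two and the pointer is already aligned to the element size.

```c
#include <stddef.h>
#include <stdint.h>

/* Number of leading scalar iterations needed before `p` reaches an
   `align`-byte boundary.  Assumes `align` is a power of two and `p`
   is a multiple of `elem_size`. */
size_t peel_count(const void *p, size_t elem_size, size_t align)
{
    uintptr_t addr = (uintptr_t)p;
    size_t misalign = (size_t)(addr & (align - 1));

    if (misalign == 0)
        return 0;                       /* already aligned, nothing to peel */
    return (align - misalign) / elem_size;
}
```

For a 16-byte vector of doubles, a pointer that is 8 bytes past a 16-byte boundary needs one peeled iteration; the peeled prologue is the extra code whose induction variables show up in the second IVOPTS dump.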
[Bug target/78176] [MIPS] miscompiles ldxc1 with large pointers on 32-bits
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78176

Doug Gilmore changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |doug.gilmore at imgtec dot com

--- Comment #12 from Doug Gilmore ---
Bisected the problem to commit r216501:

  commit 9a416363e99c9f2d48fa810e220bc2f7904f1788
  Author: zqchen <zqchen@138bc75d-0d04-0410-961f-82ee72b054a4>
  Date:   Tue Oct 21 03:38:37 2014 +

      2014-10-21  Zhenqiang Chen  <zhenqiang.c...@arm.com>

          * cfgloopanal.c (seq_cost): Delete.
          * rtl.h (seq_cost): New prototype.
          * rtlanal.c (seq_cost): New function.
          * tree-ssa-loop-ivopts.c (seq_cost): Delete.

      git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@216501 138bc75d-0d04-0410-961f-82ee72b054a4

More analysis to follow. Given the short time until the release, we plan to submit a patch to provide a target flag and build option to avoid the bug.
[Bug tree-optimization/77808] [7 Regression] ICE in duplicate_ssa_name_ptr_info, at tree-ssanames.c:630 starting with r240439
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77808

Doug Gilmore changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |clyon at gcc dot gnu.org

--- Comment #2 from Doug Gilmore ---
Christophe: Can we close this bug?
[Bug testsuite/72850] [7 Regression] FAIL: gcc.dg/tree-ssa/pr69270-3.c scan-tree-dump-times uncprop1 ", 1" 4
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=72850

Doug Gilmore changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |doug.gilmore at imgtec dot com

--- Comment #5 from Doug Gilmore ---
Thanks Uri for the test case!

In case it wasn't clear, the switch statement should be removed at higher levels of optimization:

  $ for i in 0 1 2 3 ; do ( set -x ; mips-mti-linux-gnu-gcc test.c -c -O$i -fdump-tree-optimized ; egrep ";; Function|switch" test.c.169t.optimized )
  done
  + mips-mti-linux-gnu-gcc test.c -c -O0 -fdump-tree-optimized
  + egrep ';; Function|switch' test.c.169t.optimized
  ;; Function is_digit (is_digit, funcdef_no=0, decl_uid=1406, symbol_order=0)
  ;; Function FMS (FMS, funcdef_no=1, decl_uid=1410, symbol_order=1)
    switch (state_11) , case 0: , case 2: , case 3: , case 4: , case 5: , case 6: , case 7: >
  + mips-mti-linux-gnu-gcc test.c -c -O1 -fdump-tree-optimized
  + egrep ';; Function|switch' test.c.169t.optimized
  ;; Function FMS (FMS, funcdef_no=1, decl_uid=1410, symbol_order=1)
    switch (state_98) , case 0: , case 2: , case 3: , case 4: , case 5: , case 6: , case 7: >
  + mips-mti-linux-gnu-gcc test.c -c -O2 -fdump-tree-optimized
  + egrep ';; Function|switch' test.c.169t.optimized
  ;; Function FMS (FMS, funcdef_no=1, decl_uid=1410, symbol_order=1)
  + mips-mti-linux-gnu-gcc test.c -c -O3 -fdump-tree-optimized
  + egrep ';; Function|switch' test.c.169t.optimized
  ;; Function FMS (FMS, funcdef_no=1, decl_uid=1410, symbol_order=1)
[Bug tree-optimization/77808] New: [7 Regression] ICE in duplicate_ssa_name_ptr_info, at tree-ssanames.c:630 starting with r240439
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77808

Bug ID: 77808
Summary: [7 Regression] ICE in duplicate_ssa_name_ptr_info, at tree-ssanames.c:630 starting with r240439
Product: gcc
Version: 7.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: doug.gilmore at imgtec dot com
Target Milestone: ---

Reported in:

  https://gcc.gnu.org/ml/gcc-patches/2016-09/msg02285.html

This issue was not found during regression testing for commit r240439 since -fprefetch-loop-arrays needs to be set by default.

Will send a fix and test case to gcc-patches.
[Bug tree-optimization/77654] restrict pointer attribute not preserved with -fprefetch-loop-arrays
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77654 --- Comment #2 from Doug Gilmore --- Created attachment 39652 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=39652&action=edit Prototype fix for bug.
[Bug tree-optimization/77654] restrict pointer attribute not preserved with -fprefetch-loop-arrays
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77654 --- Comment #1 from Doug Gilmore --- Created attachment 39651 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=39651&action=edit Additional tracing used to identify problem.
[Bug tree-optimization/77654] New: restrict pointer attribute not preserved with -fprefetch-loop-arrays
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77654

Bug ID: 77654
Summary: restrict pointer attribute not preserved with -fprefetch-loop-arrays
Product: gcc
Version: 7.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: doug.gilmore at imgtec dot com
Target Milestone: ---

Compiling the test example:

  void daxpy(int n, double da, double * __restrict dx, double * __restrict dy)
  {
    int i;

    for (i = 0;i < n; i++) {
      dy[i] = dy[i] + da*dx[i];
    }
  }

via:

  mips-img-linux-gnu-gcc -fprefetch-loop-arrays daxpy.c -c -O2 -save-temps -fsched-verbose=9 -fdump-rtl-sched2

The following code is generated for the main loop:

  $L4:
          ldc1    $f2,0($5)
          pref    6,0($2)
          ldc1    $f8,-120($2)
          addiu   $2,$2,32
          ldc1    $f6,-144($2)
          addiu   $5,$5,32
          ldc1    $f4,-136($2)
          addiu   $3,$3,4
          maddf.d $f8,$f2,$f0
          ldc1    $f2,-128($2)
          sdc1    $f8,-152($2)
          ldc1    $f1,-24($5)
          maddf.d $f6,$f0,$f1
          sdc1    $f6,-144($2)
          ldc1    $f1,-16($5)
          maddf.d $f4,$f0,$f1
          sdc1    $f4,-136($2)
          ldc1    $f1,-8($5)
          maddf.d $f2,$f0,$f1
          bne     $3,$8,$L4
          sdc1    $f2,-128($2)

Due to the __restrict attributes on the pointer declarations, after scheduling we should see the loads through $5 move above the stores through $2. However, during the transformation done by the phase that is enabled by -fprefetch-loop-arrays, the points-to information is lost. This prevents the loads from moving above the stores during scheduling.

The attached patch uses logic borrowed from the IVS phase:

  0002-Ensure-points-to-information-is-maintained-for-prefe.patch

After applying the patch, the points-to information is maintained, which results in good code being generated after scheduling (which is very important when running on in-order processors):

  $L4:
          addiu   $5,$5,32
          ldc1    $f8,-120($2)
          ldc1    $f6,-112($2)
          pref    6,0($2)
          ldc1    $f4,-104($2)
          addiu   $3,$3,4
          ldc1    $f2,-96($2)
          addiu   $2,$2,32
          ldc1    $f7,-32($5)
          ldc1    $f5,-24($5)
          ldc1    $f3,-16($5)
          ldc1    $f1,-8($5)
          maddf.d $f8,$f7,$f0
          maddf.d $f6,$f0,$f5
          maddf.d $f4,$f0,$f3
          maddf.d $f2,$f0,$f1
          sdc1    $f8,-152($2)
          sdc1    $f6,-144($2)
          sdc1    $f4,-136($2)
          sdc1    $f2,-128($2)
          bnec    $3,$8,$L4

I am not sure what to do about a test case. One possibility is to commit some of the tracing in the debugging patch:

  0001-Add-more-tracing-for-missing-points-to-information.patch

and we could scan for the RE "pi. is NULL" in the dump file created by -fdump-rtl-sched2.
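The reordering that the restrict qualifiers permit can be pictured at source level. The following is a hand-written C sketch, not compiler output: daxpy_ref is the loop from the report (spelled with C99 restrict), and daxpy_unrolled is a hypothetical hand-unrolled variant that issues every load of a four-element group before any store, which is only legal because dx and dy are promised not to overlap.

```c
/* Reference loop, as in the report. */
void daxpy_ref(int n, double da,
               const double *restrict dx, double *restrict dy)
{
    for (int i = 0; i < n; i++)
        dy[i] = dy[i] + da * dx[i];
}

/* Unrolled by four with all loads hoisted above all stores.  Without
   restrict, a store to dy[i] could change a later dx[i + k], so this
   ordering would not be a valid transformation of the loop above. */
void daxpy_unrolled(int n, double da,
                    const double *restrict dx, double *restrict dy)
{
    int i = 0;

    for (; i + 4 <= n; i += 4) {
        /* Loads first ... */
        double y0 = dy[i], y1 = dy[i + 1], y2 = dy[i + 2], y3 = dy[i + 3];
        double x0 = dx[i], x1 = dx[i + 1], x2 = dx[i + 2], x3 = dx[i + 3];

        /* ... then the stores. */
        dy[i]     = y0 + da * x0;
        dy[i + 1] = y1 + da * x1;
        dy[i + 2] = y2 + da * x2;
        dy[i + 3] = y3 + da * x3;
    }

    /* Remainder iterations. */
    for (; i < n; i++)
        dy[i] = dy[i] + da * dx[i];
}
```

The second assembly listing in the report has this shape: four loads through each pointer, then the four maddf.d operations, then the four stores.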
[Bug rtl-optimization/69710] performance issue with SP Linpack with Autovectorization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69710

--- Comment #15 from Doug Gilmore ---
> I had a patch too, will send it for review in GCC7 if it's still needed.

Sorry, I got sidetracked last week and didn't make much progress.

Please go ahead and submit if you have something you feel comfortable with; I'll assist in testing.

Thanks,
[Bug rtl-optimization/69710] performance issue with SP Linpack with Autovectorization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69710 --- Comment #13 from Doug Gilmore --- I think this should be fairly straightforward to fix in the autovectorization pass. Hopefully I should be able to post a patch in the next few days.
[Bug rtl-optimization/69710] performance issue with SP Linpack with Autovectorization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69710

--- Comment #12 from Doug Gilmore ---
> Yes, I proposed some cleanup passes after vectorization but richi
> thinks it's generally expensive. So what's implementation complexity
> of pass_dominator?

One thing we might consider is to enable it only when vectorization is run on architectures where cleanup is needed. I plan to send an RFC comment for my patch to see what objections there are to that approach, though beforehand I'd like to investigate what could be done to the vectorizer so that it doesn't generate code that contains false dependencies.
[Bug rtl-optimization/69710] performance issue with SP Linpack with Autovectorization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69710

--- Comment #10 from Doug Gilmore ---
Created attachment 37681
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=37681&action=edit
prototype fix

> 1) we failed recognize that use 0 and 2 are identical to each other.
> This is because vectorizer generates redundant setup code in loop
> pre-header. There are two possible fixes here. One is to make
> expand_simple_operations more aggressive in expanding (used by
> ivopts) in tree-ssa-loop-niter.c. But I don't think this is a good
> idea in all cases, because expanded complicated expression makes ivo
> transform and niter analysis harder.

Or something along the lines of the attached patch, tested only on the problem at hand. As it stands it is probably too heavy-handed to consider as a possible review candidate.

> The other is to fix vectorizer
> to generate clean code. Richard's suggestion is to use gimple_build
> for that.

ISTM to be the reasonable approach, but I haven't yet investigated what's involved.

> Also the problem exists only for arm because it doesn't support
> [base+index] addressing mode for vect load/store. I guess mips
> doesn't either.
>

Right, MIPS MSA doesn't support [base+index] mode.

BTW, the reason why IVOPTS works for DP but not SP on MIPS MSA is that the code in the pre-header is simpler for DP:

  :
  vect_cst__52 = {da_6(D), da_6(D)};

  :
  # vectp_dy.8_46 = PHI
  # vectp_dx.11_49 = PHI
  # vectp_dy.16_55 = PHI
  # ivtmp_58 = PHI <0(6), ivtmp_59(12)>
  ...

which IVOPTS can handle.
[Bug tree-optimization/69710] performance issue with SP Linpack with Autovectorization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69710

--- Comment #1 from Doug Gilmore ---
Created attachment 37615
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=37615&action=edit
daxpy for DP (previous was for SP)

Compilation example:

  arm-linux-gnueabihf-gcc -O3 -save-temps daxpy.c saxpy.c -c -mfpu=neon -c -fdump-tree-{vect,ivopts}-{verbose,details} -fdump-tree-{slp1,optimized} -fsched-verbose=9 \
    -fdump-rtl-sched{1,2} -marm -funsafe-math-optimizations -funroll-all-loops

Note that Neon does not support DP, thus daxpy.s won't contain autovectorized code. I haven't built a ToT compiler for aarch64-linux-gnu, but I suspect that you will see autovectorized code in daxpy.s in which reasonable schedules are being produced (loads are being moved above stores).
[Bug rtl-optimization/69710] performance issue with SP Linpack with Autovectorization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69710

--- Comment #5 from Doug Gilmore ---
Thanks for checking on AArch64, Andrew. BTW, I formed my (incorrect) hunch by running a test on gcc113, where the installed 4.8 compiler showed problems for both DP and SP. (I assumed that the problem was addressed on DP since we don't see it for DP on MIPS at ToT with the MSA patch applied.)

For Neon after ivopts I see:

  :
  # vectp_dy.20_96 = PHI
  # ivtmp.22_78 = PHI <0(13), ivtmp.22_77(21)>
  # ivtmp.26_112 = PHI
  # ivtmp.31_153 = PHI
  vectp_dx.15_88 = (vector(4) float *) ivtmp.26_112;
  _156 = (void *) ivtmp.31_153;
  vect__12.14_85 = MEM[base: _156, offset: 0B];
  ivtmp.31_154 = ivtmp.31_153 + 16;
  vect__15.17_90 = MEM[(float *)vectp_dx.15_88];
  vect__16.18_92 = vect_cst__91 * vect__15.17_90;
  vect__17.19_93 = vect__12.14_85 + vect__16.18_92;
  MEM[base: vectp_dy.20_96, offset: 0B] = vect__17.19_93;
  vectp_dy.20_97 = vectp_dy.20_96 + 16;
  ivtmp.22_77 = ivtmp.22_78 + 1;
  ivtmp.26_111 = ivtmp.26_112 + 16;
  if (ivtmp.22_77 < bnd.9_53)
    goto ;
  else
    goto ;
  ...
  :
  goto ;

So the problem is indeed exposed on Neon.
[Bug tree-optimization/69710] New: performance issue with SP Linpack with Autovectorization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69710

Bug ID: 69710
Summary: performance issue with SP Linpack with Autovectorization
Product: gcc
Version: unknown
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: doug.gilmore at imgtec dot com
Target Milestone: ---

Created attachment 37614
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=37614&action=edit
extracted daxpy example

We've noticed a performance problem in single precision Linpack with the MSA patch applied:

  https://gcc.gnu.org/ml/gcc-patches/2016-01/msg00177.html

which I have been able to reproduce with ARM Neon.

The problem is that autovectorization is generating extra induction variables for memory references in daxpy (this is an issue on all architectures). That is, when the statement:

  dy[i] = dy[i] + da*dx[i];

is vectorized, the vector load associated with the load of dy[i] uses a different Induction Variable (IV) from the one used by the subsequent vector store to dy[i].

For example, for ARM Neon after vect we see:

  :
  # i_26 = PHI <i_44(11), i_19(20)>
  # vectp_dy.12_83 = PHI <vectp_dy.13_81(11), vectp_dy.12_84(20)>
  # vectp_dx.15_88 = PHI <vectp_dx.16_86(11), vectp_dx.15_89(20)>
  # vectp_dy.20_96 = PHI <vectp_dy.21_94(11), vectp_dy.20_97(20)>
  # ivtmp_99 = PHI <0(11), ivtmp_100(20)>
  i.0_7 = (unsigned int) i_26;
  _8 = i.0_7 * 4;
  _10 = dy_9(D) + _8;
  vect__12.14_85 = MEM[(float *)vectp_dy.12_83];
  _12 = *_10;
  _14 = dx_13(D) + _8;
  vect__15.17_90 = MEM[(float *)vectp_dx.15_88];
  _15 = *_14;
  vect__16.18_92 = vect_cst__91 * vect__15.17_90;
  _16 = da_6(D) * _15;
  vect__17.19_93 = vect__12.14_85 + vect__16.18_92;
  _17 = _12 + _16;
  MEM[(float *)vectp_dy.20_96] = vect__17.19_93;
  i_19 = i_26 + 1;
  vectp_dy.12_84 = vectp_dy.12_83 + 16;
  vectp_dx.15_89 = vectp_dx.15_88 + 16;
  vectp_dy.20_97 = vectp_dy.20_96 + 16;
  ivtmp_100 = ivtmp_99 + 1;
  if (ivtmp_100 < bnd.9_53)
    goto ;
  else
    goto ;
  ...
  :
  goto ;

Note that the use of a separate IV for the load and the store off of dy can introduce a false memory dependency, which causes poor scheduling after unrolling.

From what I have seen so far, for double precision the ivopts phase is able to clean up the induction variables so that the false memory dependency is removed. However, the cleanup does not happen for single precision.

Attached is a simple example for single precision, more to follow.
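The shape of the problem can be sketched in scalar C. This is an illustration only, with hypothetical function names; it is not the vectorizer's output, and an optimizing compiler may well turn both versions into the same code. saxpy_split_ivs mirrors the dump above by advancing separate pointers for the load from dy and the store to dy, while saxpy_shared_iv uses one pointer for both, which is the form the ivopts cleanup is expected to reach.

```c
/* Separate induction variables for the dy load and the dy store, as in
   the vect dump (vectp_dy.12 for the load, vectp_dy.20 for the store). */
void saxpy_split_ivs(int n, float da, const float *dx, float *dy)
{
    const float *dy_load = dy;    /* IV used only for loading dy */
    const float *dx_load = dx;    /* IV for dx */
    float *dy_store = dy;         /* IV used only for storing dy */

    for (int i = 0; i < n; i++) {
        *dy_store = *dy_load + da * *dx_load;
        dy_load++;
        dx_load++;
        dy_store++;
    }
}

/* One induction variable shared by the dy load and the dy store, so it
   is evident that both accesses touch the same address. */
void saxpy_shared_iv(int n, float da, const float *dx, float *dy)
{
    const float *x = dx;
    float *y = dy;

    for (int i = 0; i < n; i++) {
        *y = *y + da * *x;
        x++;
        y++;
    }
}
```

In the split form, later passes have to prove that dy_load and dy_store always hold the same address; when that proof is missing, the store in one unrolled copy looks as if it might feed the load in the next.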
[Bug target/66747] [6 Regression] The commit r225260 broke the builds of the mips-{mti,img}-linux-gnu tool chains.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66747 --- Comment #9 from Doug Gilmore <doug.gilmore at imgtec dot com> --- Our nightly builds are now clean with this patch. Thanks!
[Bug middle-end/66747] [6 Regression] The commit r225260 broke the builds of the mips-{mti,img}-linux-gnu tool chains.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66747

Doug Gilmore <doug.gilmore at imgtec dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |matthew.fortune at imgtec dot com

--- Comment #5 from Doug Gilmore <doug.gilmore at imgtec dot com> ---
The build succeeded and the regression test run showed no regressions.

Bernd: could you send the patch to the list for approval?

Thanks!
[Bug middle-end/66747] [6 Regression] The commit r225260 broke the builds of the mips-{mti,img}-linux-gnu tool chains.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66747 --- Comment #4 from Doug Gilmore <doug.gilmore at imgtec dot com> --- Thanks! I started up a build with the patch and it got through the initial_gcc build so that is a good sign. I'll send an update once the build is done.
[Bug c/66747] New: The commit r225260 broke the builds of the mips-{mti,img}-linux-gnu tool chains.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66747

Bug ID: 66747
Summary: The commit r225260 broke the builds of the mips-{mti,img}-linux-gnu tool chains.
Product: gcc
Version: 5.2.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
Assignee: unassigned at gcc dot gnu.org
Reporter: doug.gilmore at imgtec dot com
Target Milestone: ---

The commit r225260 broke the builds of the mips-{mti,img}-linux-gnu tool chains.

To reproduce the problem, configure the binutils build from the directory /scratch/d/obj-mips-img-linux-gnu/binutils-gdb:

  /scratch/d/src/binutils-gdb/configure --prefix=/scratch/d/install-mips-img-linux-gnu --target=mips-img-linux-gnu --with-sysroot=/scratch/d/install-mips-img-linux-gnu/sysroot

then run make and make install.

Then configure the gcc build from the directory /scratch/d/obj-mips-img-linux-gnu/initial_gcc:

  /scratch/d/src/gcc/configure --prefix=/scratch/d/install-mips-img-linux-gnu --disable-libssp --disable-libgomp --disable-libmudflap --disable-decimal-float --with-mips-plt --target=mips-img-linux-gnu --enable-languages=c --without-headers --disable-shared --disable-threads --disable-libquadmath --disable-libatomic

Running make fails with:

  /scratch/d/obj-mips-img-linux-gnu/initial_gcc/./gcc/xgcc -B/scratch/d/obj-mips-img-linux-gnu/initial_gcc/./gcc/ -B/scratch/d/install-mips-img-linux-gnu/mips-img-linux-gnu/bin/ -B/scratch/d/install-mips-img-linux-gnu/mips-img-linux-gnu/lib/ -isystem /scratch/d/install-mips-img-linux-gnu/mips-img-linux-gnu/include -isystem /scratch/d/install-mips-img-linux-gnu/mips-img-linux-gnu/sys-include -g -O2 -minterlink-mips16 -mips64r6 -O2 -g -O2 -minterlink-mips16 -DIN_GCC -DCROSS_DIRECTORY_STRUCTURE -W -Wall -Wwrite-strings -Wcast-qual -Wstrict-prototypes -Wmissing-prototypes -Wold-style-definition -isystem ./include -I. -I. -I../../../.././gcc -I/scratch/d/src/gcc/libgcc -I/scratch/d/src/gcc/libgcc/. -I/scratch/d/src/gcc/libgcc/../gcc -I/scratch/d/src/gcc/libgcc/../include -g0 -finhibit-size-directive -fno-inline -fno-exceptions -fno-zero-initialized-in-bss -fno-toplevel-reorder -fno-tree-vectorize -fbuilding-libgcc -fno-stack-protector -Dinhibit_libc -I. -I. -I../../../.././gcc -I/scratch/d/src/gcc/libgcc -I/scratch/d/src/gcc/libgcc/. -I/scratch/d/src/gcc/libgcc/../gcc -I/scratch/d/src/gcc/libgcc/../include -o crtbeginT.o -MT crtbeginT.o -MD -MP -MF crtbeginT.dep -c /scratch/d/src/gcc/libgcc/crtstuff.c -DCRT_BEGIN -DCRTSTUFFT_O
  /scratch/d/src/gcc/libgcc/crtstuff.c: In function 'frame_dummy':
  /scratch/d/src/gcc/libgcc/crtstuff.c:490:1: error: unrecognizable insn:
   }
   ^
  (insn 82 67 8 (sequence [
              (jump_insn 7 67 66 (set (pc)
                      (if_then_else (eq (reg/f:SI 2 $2 [197])
                              (const_int 0 [0]))
                          (label_ref:SI 15)
                          (pc))) /scratch/d/src/gcc/libgcc/crtstuff.c:470 466 {*branch_equalitysi}
                   (expr_list:REG_DEAD (reg/f:SI 2 $2 [197])
                      (int_list:REG_BR_PROB 3017 (nil)))
               -> 15)
              (insn/f 66 7 8 (set (mem/c:DI (plus:SI (reg/f:SI 29 $sp)
                              (const_int 8 [0x8])) [5 S8 A64])
                      (reg:DI 31 $31)) 302 {*movdi_64bit}
                   (expr_list:REG_FRAME_RELATED_EXPR (set/f (mem/c:DI (plus:SI (reg/f:SI 29 $sp)
                                  (const_int 8 [0x8])) [5 S8 A64])
                          (reg:DI 31 $31))
                      (nil)))
          ]) /scratch/d/src/gcc/libgcc/crtstuff.c:470 -1
       (nil))

We are working around the issue by reverting r225260.
[Bug c++/63412] New: aliasing issue exposed by inlining
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63412

            Bug ID: 63412
           Summary: aliasing issue exposed by inlining
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: doug.gilmore at imgtec dot com

Created attachment 33616
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=33616&action=edit
test program

The attached test program fails with 4.7 up to ToT at -O2 on both x86 (I built x86_64 with the -m32 multi-lib variant) and MIPS.

  $ g++ -Wall -g -m32 -std=gnu++11 -O2 -fno-exceptions bad_i5.c -static -o la -save-temps && ./la
  Aborted (core dumped)
  $ g++ -Wall -g -m32 -std=gnu++11 -O0 -fno-exceptions bad_i5.c -static -o la -save-temps && ./la
  $

Note that simplifying one of the expressions makes the program work:

  $ g++ -Wall -g -DNO_VOL -m32 -std=gnu++11 -O2 -fno-exceptions bad_i5.c -static -o la -save-temps && ./la
  $

The generated code has the store below the implicit load in the compare:

        cmpl    %ebx, 4(%esp,%edx,4)
        movl    %eax, 4(%esp)
        jne     .L5

which is incorrect. It should be:

        movl    %eax, 4(%esp)
        cmpl    %ebx, 4(%esp,%edx,4)
        jne     .L5

We have an internal debate about what the issue is. Some are of the opinion that the casting breaks the aliasing rules, so the behavior of the program is undefined and changes along the following lines are needed:

  $ diff bad_i5{,_mod}.c
  48c48
  <     return reference_->AsMirrorPtr();
  ---
  >     return static_cast<T*>(reference_->AsMirrorPtr());
  50c50
  <   ObjectReference<T>* reference_;
  ---
  >   ObjectReference<Object>* reference_;
  52,53c52,53
  <     : reference_(reinterpret_cast<ObjectReference<T>*>(reference)) {
  <   }
  ---
  >     : reference_((reference)) {
  >   }
  $ g++ -g -m32 -std=gnu++11 -O2 -fno-exceptions bad_i5_mod.c -static -o la -save-temps && ./la
  $

If there is a strict aliasing issue, shouldn't -Wall be warning about it?
My take is that the casting is not a concern here: the returns from (and entries to) the inlined routines effectively sequence the problematic store above the problematic load, so this should be considered a bug in GCC.
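For readers without the attachment, the shape of the modified variant in the diff can be sketched as follows. This is a minimal illustration, not the attached test: `Object`, `Derived`, `Handle`, `Get`, `Assign` and `read_back_through_handle` are hypothetical names, and `ObjectReference` is only a stand-in for the template of the same name in the report. The point it shows is that the handle keeps an `ObjectReference<Object>*` and narrows the result with `static_cast`, so no pointer to the reference object itself is ever reinterpreted.

```cpp
#include <cassert>

struct Object { int id; };
struct Derived : Object { int extra; };

// Hypothetical stand-in for the report's ObjectReference template.
template <class T>
struct ObjectReference {
    T* ptr;
    T* AsMirrorPtr() const { return ptr; }
};

// Hypothetical handle in the style of the modified test: it stores the
// reference with its real type and casts only the returned object pointer.
template <class T>
class Handle {
public:
    explicit Handle(ObjectReference<Object>* reference)
        : reference_(reference) {}

    T* Get() const { return static_cast<T*>(reference_->AsMirrorPtr()); }
    void Assign(T* object) { reference_->ptr = object; }

private:
    ObjectReference<Object>* reference_;
};

// Store through the handle, then load through it again.
int read_back_through_handle() {
    Derived d;
    d.id = 1;
    d.extra = 7;
    ObjectReference<Object> slot = { nullptr };
    Handle<Derived> handle(&slot);
    handle.Assign(&d);
    assert(handle.Get() == &d);
    return handle.Get()->extra;
}
```

Calling `read_back_through_handle()` stores a `Derived*` and reads it back through the same `ObjectReference<Object>` object, which is the store-then-load order the report expects the compiler to preserve.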
[Bug c++/63412] aliasing issue exposed by inlining
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63412

--- Comment #1 from Doug Gilmore <doug.gilmore at imgtec dot com> ---
Created attachment 33617
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=33617&action=edit
Modified version where type casts are modified.
[Bug tree-optimization/63148] [4.8/4.9 Regression] r187042 causes auto-vectorization failure for X86 for -m32.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63148

Doug Gilmore <doug.gilmore at imgtec dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|---                         |FIXED

--- Comment #10 from Doug Gilmore <doug.gilmore at imgtec dot com> ---
Verified that my test examples are now working on both X86 -m32 and MIPS32 -mmsa (patch is under review).

Thanks!

Doug
[Bug tree-optimization/63148] [4.8/4.9/5 Regression] r187042 causes auto-vectorization failure for X86 for -m32.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63148

--- Comment #6 from Doug Gilmore <doug.gilmore at imgtec dot com> ---
The input to the vectorizer is already bogus:

  _12 = i.0_5 + 536870911;
  _13 = global_data.b[_12];

Note that the gimple output generated by the front end is already problematic:

Before r187042:

  D.1747 = i.0 + -1;

With r187042:

  D.1747 = i.0 + 536870911;

Any idea what the intent is of the changes in r187042 that transform signed constants to unsigned ones? To me, that is the problematic issue.
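One possible reading of the constant, offered as an assumption rather than something the report states: 536870911 is 2^29 - 1, which is what an index of -1 becomes if it is scaled to a byte offset with 8-byte elements, wrapped to 32 bits, and divided back by the element size. The sketch below only checks that arithmetic; `wrapped_index` and `check_report_constant` are hypothetical helpers, and the 8-byte element size is assumed.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: scale a signed index to a byte offset in 32-bit
// unsigned arithmetic (which wraps modulo 2^32), then divide the wrapped
// offset back by the element size.
constexpr std::uint32_t wrapped_index(std::int32_t index,
                                      std::uint32_t elem_size) {
    return (static_cast<std::uint32_t>(index) * elem_size) / elem_size;
}

// Assumption: 8-byte elements. Index -1 then maps to the constant in the dump.
static_assert(wrapped_index(-1, 8) == 536870911u,
              "-1 scaled by 8, wrapped to 32 bits, unscaled");

// Both indices address the same byte offset modulo 2^32.
static_assert(536870911u * 8u == static_cast<std::uint32_t>(-8),
              "same 32-bit byte offset as index -1");

void check_report_constant() {
    assert(wrapped_index(-1, 8) == 536870911u);
}
```

Under that reading the two forms name the same address on a 32-bit target, but as index values they are far apart, which would be consistent with the dependence analysis treating them differently.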
[Bug tree-optimization/63148] r187042 causes auto-vectorization failure for X86 for -m32.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63148

Doug Gilmore <doug.gilmore at imgtec dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rguenther at suse dot de

--- Comment #2 from Doug Gilmore <doug.gilmore at imgtec dot com> ---
I still see the test failure at -m32 using the tip of gcc-4_8-branch and ToT.

Richard: when you have the chance, could you double check your test results?
[Bug c/63148] New: r187042 causes auto-vectorization failure for X86 for -m32.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63148

            Bug ID: 63148
           Summary: r187042 causes auto-vectorization failure for X86 for
                    -m32.
           Product: gcc
           Version: 4.8.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: doug.gilmore at imgtec dot com

Created attachment 33440
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=33440&action=edit
test example

I noticed that MultiSource/Benchmarks/TSVC/LoopRestructuring-{flt,dbl} from the LLVM test-suite fail on X86 -m32, and I was able to bisect the failure to commit r187042. I attached a stripped-down example.

Before the revision, if we compile with -fdump-tree-vect-details, we see that a loop-carried dependency is recorded:

  (compute_affine_dependence
    stmt_a: D.1748_9 = global_data.b[D.1747_8];
    stmt_b: global_data.b[i.0_2] = D.1750_11;
    (subscript_dependence_tester
      (analyze_overlapping_iterations
        (chrec_a = {0, +, 1}_5)
        (chrec_b = {1, +, 1}_5)
        (analyze_siv_subscript
          (analyze_subscript_affine_affine
            (overlaps_a = [1 + 1 * x_1] )
            (overlaps_b = [0 + 1 * x_1] )
          )
        )
        (overlap_iterations_a = [1 + 1 * x_1] )
        (overlap_iterations_b = [0 + 1 * x_1] )
      )
      (analyze_overlapping_iterations
        (chrec_a = 2816)
        (chrec_b = 2816)
        (overlap_iterations_a = [0] )
        (overlap_iterations_b = [0] )
      )
      (build_classic_dist_vector
        dist_vector = ( 1 )
      )
    )
  )

which results in the loop not being vectorized because of the memory recurrence.

After the change the dependency is not recorded:

  (compute_affine_dependence
    stmt_a: D.1748_9 = global_data.b[D.1747_8];
    stmt_b: global_data.b[i.0_2] = D.1750_11;
    (subscript_dependence_tester
      (analyze_overlapping_iterations
        (chrec_a = {536870912, +, 1}_5)
        (chrec_b = {1, +, 1}_5)
        (analyze_siv_subscript
          (analyze_subscript_affine_affine
            (overlaps_a = no dependence )
            (overlaps_b = no dependence )
          )
        )
        (overlap_iterations_a = no dependence )
        (overlap_iterations_b = no dependence )
      )
      (dependence classified: scev_known)
  )

causing the loop to be incorrectly vectorized.
Note that when compiled with -m64 the loop is actually vectorized, but it is determined that versioning is needed:

  45: dependence distance == 0 between global_data.a[D.1767_2] and global_data.a[D.1767_2]
  45: versioning for alias required: can't determine dependence between global_data.a[D.1767_2] and *D.1776_10
  ...
  58: LOOP VECTORIZED.
  s221_extract.c:40: note: vectorized 5 loops in function.
  Merging blocks 2 and 41
  Removing basic block 5
  ...

and the incorrectly vectorized code is removed.
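To illustrate why a dependence distance of 1 blocks vectorization, here is a minimal sketch with a hypothetical loop of the same shape as the dump suggests (a store to b[i] and a load from b[i-1]); it is not the attached test. `run_scalar` is the reference semantics, and `run_blockwise` models a vectorization that ignores the dependence by loading a whole block of inputs before storing any results. All names are illustrative.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Scalar reference: each iteration reads the value the previous one wrote.
std::vector<int> run_scalar(std::vector<int> b) {
    for (std::size_t i = 1; i < b.size(); ++i)
        b[i] = b[i - 1] + 1;
    return b;
}

// Model of a vectorization that ignores the dependence: each block of
// `width` iterations (width must be at least 1) loads all of its inputs
// before storing any result, so later lanes see stale values.
std::vector<int> run_blockwise(std::vector<int> b, std::size_t width) {
    for (std::size_t i = 1; i < b.size(); i += width) {
        const std::size_t n = std::min(width, b.size() - i);
        std::vector<int> loaded(n);
        for (std::size_t lane = 0; lane < n; ++lane)
            loaded[lane] = b[i - 1 + lane];
        for (std::size_t lane = 0; lane < n; ++lane)
            b[i + lane] = loaded[lane] + 1;
    }
    return b;
}

// The two versions disagree as soon as a block spans more than one iteration.
void check_recurrence_breaks() {
    const std::vector<int> zeros(9, 0);
    assert(run_scalar(zeros) != run_blockwise(zeros, 4));
}
```

With a block width of 1 the two functions agree, which corresponds to the unvectorized loop; with a width of 4 the results diverge from the second element onward, which is the kind of wrong answer the missing dependence would allow.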