[Bug rtl-optimization/81025] [8 Regression] gcc ICE while building glibc for MIPS soft-float multi-lib variant

2017-06-10 Thread doug.gilmore at imgtec dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81025

--- Comment #9 from Doug Gilmore  ---
> I bet this is a bug in reorg.c.  It is the least used code (major
> target usage: MIPS and sparc only) and also one of the more buggy
> code.
You're right, compiling with -fno-delayed-branch doesn't tickle the bug.

Thanks!

[Bug tree-optimization/81025] [8 Regression] gcc ICE while building glibc for MIPS soft-float multi-lib variant

2017-06-10 Thread doug.gilmore at imgtec dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81025

Doug Gilmore  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|[MIPS] soft-float glibc     |[8 Regression] gcc ICE
                   |build fails at r248863      |while building glibc for
                   |                            |MIPS soft-float multi-lib
                   |                            |variant

--- Comment #6 from Doug Gilmore  ---
We are back to having our MIPS nightly ToT toolchain builds all
working with r247049 reverted.

Given that r247049 exposes another PRE issue, see bug 80620,
does it make sense to back out until we resolve the problems
at hand?

[Bug tree-optimization/81025] [MIPS] soft-float glibc build fails at r248863

2017-06-09 Thread doug.gilmore at imgtec dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81025

Doug Gilmore  changed:

   What|Removed |Added

 CC||rguenth at gcc dot gnu.org

--- Comment #4 from Doug Gilmore  ---
Created attachment 41513
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41513&action=edit
cut down example via delta

Sorry, the attachment for the last comment was dropped.

I bisected the failure to r247049 using the cut down
example, compiled via:

$dir/xgcc -B$dir -O2 -msoft-float -mabi=32 delta_1.i -c -std=gnu11
-fgnu89-inline  -O2 -fmerge-all-constants -fno-stack-protector -frounding-math
-g

For this bisect I configured with --disable-multilib.

I'll look into this more tomorrow.

[Bug tree-optimization/81025] [MIPS] soft-float glibc build fails at r248863

2017-06-08 Thread doug.gilmore at imgtec dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81025

--- Comment #3 from Doug Gilmore  ---
It appears that r248863 just tickles the bug.  With
the attached example produced by delta the failure mode
is exposed by r248862.  With luck, I may be able to
bisect the problem to an earlier commit.

[Bug tree-optimization/81025] New: [MIPS] soft-float glibc build fails at r248863

2017-06-08 Thread doug.gilmore at imgtec dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81025

Bug ID: 81025
   Summary: [MIPS] soft-float glibc build fails at r248863
   Product: gcc
   Version: 8.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: doug.gilmore at imgtec dot com
  Target Milestone: ---

Created attachment 41509
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41509&action=edit
CPP output file

Our ToT GLIBC soft-float builds are failing; I bisected the
problem to r248863.

To reproduce the problem with minimum effort, configure and build via:

/configure --prefix=.../install-mips-mti-linux-gnu
--disable-libssp --disable-libmudflap --disable-decimal-float --with-mips-plt
--target=mips-mti-linux-gnu --enable-languages=c --without-headers
--disable-shared --disable-threads --disable-libquadmath --disable-libatomic
--with-sysroot=.../install-mips-mti-linux-gnu/sysroot
make maybe-all-gcc

I attached two patches:

One to restrict the number of multi-lib variants, which probably isn't
needed for maybe-all-gcc, but will speed up a full gcc build.

The other patch is a cherry pick of r248879 which is needed to build
r248863 for MIPS.


Build CPP file:

/gcc/xgcc -B/gcc -O2 -msoft-float -mabi=32 s_fmaf.i -c
-std=gnu11 -fgnu89-inline  -O2 -Wall -Werror -Wundef -Wwrite-strings
-fmerge-all-constants -fno-stack-protector -frounding-math -g
-Wstrict-prototypes -Wold-style-definition


The CPP file compiles cleanly at r248862, but at r248863 with patch
for r248879 applied, the compile fails with:

during RTL pass: dwarf2
In file included from ../sysdeps/mips/ieee754/s_fmaf.c:4:0:
../soft-fp/fmasf4.c: In function '__fmaf':
../soft-fp/fmasf4.c:62:1: internal compiler error: in maybe_record_trace_start,
at dwarf2cfi.c:2330
0x74ab9f maybe_record_trace_start
/scratch/dgilmore/sgcc-pp5/src/gcc/gcc/dwarf2cfi.c:2330
0x74af2f create_trace_edges
/scratch/dgilmore/sgcc-pp5/src/gcc/gcc/dwarf2cfi.c:2426
0x74b0af scan_trace
/scratch/dgilmore/sgcc-pp5/src/gcc/gcc/dwarf2cfi.c:2640
0x74bd16 create_cfi_notes
/scratch/dgilmore/sgcc-pp5/src/gcc/gcc/dwarf2cfi.c:2666
0x74bd16 execute_dwarf2_frame
/scratch/dgilmore/sgcc-pp5/src/gcc/gcc/dwarf2cfi.c:3024
0x74bd16 execute
/scratch/dgilmore/sgcc-pp5/src/gcc/gcc/dwarf2cfi.c:3504
Please submit a full bug report,
with preprocessed source if appropriate.
Please include the complete backtrace with any bug report.
See <https://gcc.gnu.org/bugs/> for instructions.

After adding -fdump-rtl-dwarf2 to the compilation line, the associated
dump file contains:

Inconsistent CFI state!
SHOULD have:
.cfi_def_cfa 29, 0
DO have:
.cfi_def_cfa 29, 8
.cfi_offset 16, -4

The CPP file is quite complicated; I am investigating whether
a cut-down example will reproduce the failure.

[Bug tree-optimization/81025] [MIPS] soft-float glibc build fails at r248863

2017-06-08 Thread doug.gilmore at imgtec dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81025

--- Comment #2 from Doug Gilmore  ---
Created attachment 41511
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41511&action=edit
patch needed to build r248863 for MIPS

[Bug tree-optimization/81025] [MIPS] soft-float glibc build fails at r248863

2017-06-08 Thread doug.gilmore at imgtec dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81025

--- Comment #1 from Doug Gilmore  ---
Created attachment 41510
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41510&action=edit
Patch to constrain the number of multi-lib variants

[Bug tree-optimization/79955] New: GLIBC build fails after r245840

2017-03-07 Thread doug.gilmore at imgtec dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79955

Bug ID: 79955
   Summary: GLIBC build fails after r245840
   Product: gcc
   Version: 7.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: doug.gilmore at imgtec dot com
  Target Milestone: ---

Created attachment 40920
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=40920&action=edit
CPP output file for mips-mti-linux-gnu target

See:

https://sourceware.org/ml/libc-alpha/2017-03/msg00052.html

We are working around the issue by disabling -Werror in the build.

I'll upload an X86_64 .i file tomorrow.


$ mips-mti-linux-gnu-gcc -mabi=32 fnmatch.i -c -std=gnu11 -fgnu89-inline  -O2
-Wall -Wundef -Wwrite-strings -fmerge-all-constants -fno-stack-protector
-frounding-math \
-g -Wstrict-prototypes -Wold-style-definition -ftls-model=initial-exec
In file included from fnmatch.c:250:0:
fnmatch_loop.c: In function 'internal_fnwmatch':
../locale/weightwc.h:103:28: warning: '*((void *)+4)' may be used
uninitialized in this function [-Wmaybe-uninitialized]

[Bug tree-optimization/79291] r244897 introduces IV related performance issues for daxpy on MIPS by enabling peeling for alignment

2017-02-20 Thread doug.gilmore at imgtec dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79291

--- Comment #6 from Doug Gilmore  ---
> It also looks like mips lacks implementation of any of the
> vectorizer cost hooks and thus defaults to
> default_builtin_vectorization_cost which means that unaligned
> loads/stores have double cost.
Removing the double cost for unaligned memory OPs didn't have
any effect; peeling still occurred and the alias problem is
exposed on MIPS.

So it looks like we need to come up with a fix for bug 69710,
that hopefully also fixes bug 68030, to address this issue.

[Bug tree-optimization/79291] r244897 introduces IV related performance issues for daxpy on MIPS by enabling peeling for alignment

2017-02-01 Thread doug.gilmore at imgtec dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79291

--- Comment #5 from Doug Gilmore  ---
> Bin:  I suspect this is also now broken on ARM, can
> you check?
Oops, sorry I forgot that this problem is not exposed
on the original ARM/Neon for DP.  Sorry for the noise.

[Bug tree-optimization/79291] r244897 introduces IV related performance issues for daxpy on MIPS by enabling peeling for alignment

2017-02-01 Thread doug.gilmore at imgtec dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79291

--- Comment #4 from Doug Gilmore  ---
> It also looks like mips lacks implementation of any of the
> vectorizer cost hooks and thus defaults to
> default_builtin_vectorization_cost which means that unaligned
> loads/stores have double cost.
I have investigated that in the past and that costing is needed
in some cases.  I'll start looking into that again.

[Bug target/78176] [MIPS] miscompiles ldxc1 with large pointers on 32-bits

2017-02-01 Thread doug.gilmore at imgtec dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78176

--- Comment #20 from Doug Gilmore  ---
I'll collect more tracing data on the costing problem.

Hopefully I'll post an update in the next few days.

[Bug target/78176] [MIPS] miscompiles ldxc1 with large pointers on 32-bits

2017-01-30 Thread doug.gilmore at imgtec dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78176

Doug Gilmore  changed:

   What|Removed |Added

 CC||law at redhat dot com,
   ||rguenth at gcc dot gnu.org,
   ||zqchen at gcc dot gnu.org

--- Comment #18 from Doug Gilmore  ---
CC author and reviewers of r216501.

[Bug target/78176] [MIPS] miscompiles ldxc1 with large pointers on 32-bits

2017-01-30 Thread doug.gilmore at imgtec dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78176

--- Comment #17 from Doug Gilmore  ---
> This really throws off the costing of substituting different IVs on
> MIPS.
I forgot to mention that for MIPS the net effect of r216501 is to not
produce indexed memory OPs in simple examples where we should.  But
we also will produce problematic indexed memory OPs in situations
where address generation costing is a bit complicated (the original
issue associated with this bug report).

Applying the two patches I just attached restores the generation of
indexed memory OPs in simple examples, and also causes IVOPTS to
select IVs similar to those chosen in the past, which avoids the
problem of executing indexed memory OPs in O32 binaries on 64-bit
MIPS processors running current Linux kernels.

One issue still needs to be addressed: recognizing that rewriting a
"use" to use a different IV can expose a problem with indexed memory
OPs on 64-bit MIPS processors, in which case an infinite cost should
be assigned.  Until that is done, we need the flag to turn off the
generation of indexed memory OPs.

[Bug target/78176] [MIPS] miscompiles ldxc1 with large pointers on 32-bits

2017-01-30 Thread doug.gilmore at imgtec dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78176

--- Comment #16 from Doug Gilmore  ---
Created attachment 40632
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=40632&action=edit
Tweak to adjust_setup_cost (r220473).

Second patch associated with previous comment.

[Bug target/78176] [MIPS] miscompiles ldxc1 with large pointers on 32-bits

2017-01-30 Thread doug.gilmore at imgtec dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78176

--- Comment #15 from Doug Gilmore  ---
Created attachment 40631
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=40631&action=edit
Prototype change to backout r216501.

> Bisected the problem to commit r216501:

The review discussion of r216501 starts with message:

https://gcc.gnu.org/ml/gcc-patches/2014-10/msg00758.html

Which contains:

The are two implementations of seq_cost. The function bodies are
exactly the same. The patch removes one of them and make the other
global.

It seems the patch was a cleanup that shouldn't introduce a
functional change.

However, the implementations of seq_cost are different, per the
final version of the patch:

https://gcc.gnu.org/ml/gcc-patches/2014-10/msg00896.html

cfgloopanal.c:
-   cost += set_rtx_cost (set, speed);


rtlanal.c:
+cost += set_rtx_cost (set, speed);

tree-ssa-loop-ivopts.c:
-   cost += set_src_cost (SET_SRC (set), speed);

In general, when computing the cost of a sequence of N INSNs this
increases the cost of the sequence by N*4.  This really throws
off the costing of substituting different IVs on MIPS.
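
To make the N*4 figure concrete, here is a toy model in C.  It is not
the GCC source: the struct and the two toy_seq_cost_* functions are
hypothetical, and the model assumes every insn in the sequence is a
single SET whose SET-level cost adds one COSTS_N_INSNS (1) unit on top
of the cost of its source.  Only the scaling of COSTS_N_INSNS by 4
mirrors GCC.

```c
#include <stddef.h>

/* GCC scales insn costs so that one "typical" insn costs 4 units.  */
#define COSTS_N_INSNS(N) ((N) * 4)

/* Hypothetical stand-in for an insn: only the cost of its SET_SRC.  */
struct toy_insn { int src_cost; };

/* Models the old ivopts variant: sum only the source costs.  */
int toy_seq_cost_src(const struct toy_insn *seq, size_t n)
{
    int cost = 0;
    for (size_t i = 0; i < n; i++)
        cost += seq[i].src_cost;
    return cost;
}

/* Models the unified variant under the assumption stated above:
   each SET contributes one extra COSTS_N_INSNS (1).  */
int toy_seq_cost_set(const struct toy_insn *seq, size_t n)
{
    int cost = 0;
    for (size_t i = 0; i < n; i++)
        cost += COSTS_N_INSNS(1) + seq[i].src_cost;
    return cost;
}
```

Under these assumptions the two functions differ by exactly 4 per
insn, which is the kind of offset that can change which IV candidate
looks cheapest.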

The first patch attached (just a prototype) basically reverts
this change.  The second fixes a problem with r220473, a fix
for PR62631 from Eric Botcazou:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62631#c17

This looks a generic problem in get_shiftadd_cost to me, it ought
to mimic the algorithms in expmed.c, something like ...

This change can lower the cost of a sequence of instructions.  However,
there are situations where this (lower) cost, once scaled by an estimated
iteration count, becomes zero.  For the example attached to the second
patch, the IV replacement algorithm will determine that the cost of using
separate IVs for each load is less than the cost of one IV for all loads.

Thus, in the second patch we detect that a non-zero cost being scaled
to zero should be represented by one instead, which gets us back to
IVOPTS generating just one IV that will be used for all loads.
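
The clamping idea can be sketched in a few lines of C.  This is only an
illustration of the rule described above, not the attached patch: the
function name toy_adjust_setup_cost and its parameters are hypothetical,
and the scaling is modeled as a plain integer division by the estimated
iteration count.

```c
/* Hypothetical helper illustrating the clamp; names are not GCC's.
   COST is a setup cost, AVG_NITER the estimated iteration count.  */
unsigned toy_adjust_setup_cost(unsigned cost, unsigned avg_niter)
{
    if (avg_niter == 0)
        return cost;            /* no estimate: leave the cost alone */

    unsigned scaled = cost / avg_niter;

    /* A non-zero cost must not be scaled down to "free".  */
    if (cost != 0 && scaled == 0)
        scaled = 1;

    return scaled;
}
```

With the clamp, several cheap setup sequences no longer look free when
compared with a single shared one.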

[Bug tree-optimization/79291] New: r244397 introduces alias related performance issues for daxpy on MIPS

2017-01-30 Thread doug.gilmore at imgtec dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79291

Bug ID: 79291
   Summary: r244397 introduces alias related performance issues
for daxpy on MIPS
   Product: gcc
   Version: 7.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: doug.gilmore at imgtec dot com
  Target Milestone: ---

It appears that r244897 introduces peeling for DP daxpy, which, per
bug 69710, introduces a performance degradation due to alias issues.

After IVOPTS before r244897 (use daxpy example from bug 69710):

  ivtmp.20_36 = ivtmp.20_35 + 1;
  ivtmp.21_24 = ivtmp.21_9 + 16;
  ivtmp.24_3 = ivtmp.24_2 + 16;

After IVOPTS after r244897:  
  ivtmp.23_56 = ivtmp.23_24 + 1;
  ivtmp.24_11 = ivtmp.24_9 + 16;
  ivtmp.27_87 = ivtmp.27_86 + 16;
  ivtmp.29_90 = ivtmp.29_89 + 16;

Thus after r244897 we have a problem in DP daxpy that we were
only seeing for SP daxpy (or saxpy), as shown in bug 69710.
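
For readers unfamiliar with peeling for alignment, the following C
function gives a source-level picture of the transformation.  It is a
hand-written sketch, not vectorizer output: the name daxpy_peeled, the
16-byte VEC_BYTES width, and the use of a two-element scalar body as a
stand-in for one vector operation are all assumptions made for
illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define VEC_BYTES 16  /* assumed vector width: two doubles */

void daxpy_peeled(size_t n, double da, const double *dx, double *dy)
{
    size_t i = 0;

    /* Prologue: peel scalar iterations until dy + i is 16-byte aligned.  */
    while (i < n && ((uintptr_t)(dy + i) % VEC_BYTES) != 0) {
        dy[i] += da * dx[i];
        i++;
    }

    /* Main loop: two doubles per iteration, standing in for one
       aligned vector load/multiply-add/store.  */
    for (; i + 2 <= n; i += 2) {
        dy[i]     += da * dx[i];
        dy[i + 1] += da * dx[i + 1];
    }

    /* Epilogue: leftover element, if any.  */
    for (; i < n; i++)
        dy[i] += da * dx[i];
}
```

The prologue and epilogue add setup code around the main loop, which is
where additional address computations for IVOPTS to reconcile can come
from.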

BTW: I have been investigating another IVOPTS related regression
on MIPS32R2 that is related to the generation of indexed
memory OPs:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78176#c12

I'll be updating that report with more information on how to
fix the regression and how it relates to this issue.

Bin:  I suspect this is also now broken on ARM, can
you check?

Thanks,

Doug

[Bug target/78176] [MIPS] miscompiles ldxc1 with large pointers on 32-bits

2017-01-13 Thread doug.gilmore at imgtec dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78176

Doug Gilmore  changed:

   What|Removed |Added

 CC||doug.gilmore at imgtec dot com

--- Comment #12 from Doug Gilmore  ---
Bisected the problem to commit r216501:

commit 9a416363e99c9f2d48fa810e220bc2f7904f1788
Author: zqchen <zqchen@138bc75d-0d04-0410-961f-82ee72b054a4>
Date:   Tue Oct 21 03:38:37 2014 +

2014-10-21  Zhenqiang Chen  <zhenqiang.c...@arm.com>

* cfgloopanal.c (seq_cost): Delete.
* rtl.h (seq_cost): New prototype.
* rtlanal.c (seq_cost): New function.
* tree-ssa-loop-ivopts.c (seq_cost): Delete.


git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@216501
138bc75d-0d04-0410-961f-82ee72b054a4

More analysis to follow.

Given the short time until the release, we plan to submit a patch to
provide a target flag and build option to avoid the bug.

[Bug tree-optimization/77808] [7 Regression] ICE in duplicate_ssa_name_ptr_info, at tree-ssanames.c:630 starting with r240439

2016-10-05 Thread doug.gilmore at imgtec dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77808

Doug Gilmore  changed:

   What|Removed |Added

 CC||clyon at gcc dot gnu.org

--- Comment #2 from Doug Gilmore  ---
Christophe:  Can we close this bug?

[Bug testsuite/72850] [7 Regression] FAIL: gcc.dg/tree-ssa/pr69270-3.c scan-tree-dump-times uncprop1 ", 1" 4

2016-10-03 Thread doug.gilmore at imgtec dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=72850

Doug Gilmore  changed:

   What|Removed |Added

 CC||doug.gilmore at imgtec dot com

--- Comment #5 from Doug Gilmore  ---
Thanks Uri for the test case!

In case it wasn't clear, the switch statement should be removed at higher
levels of optimization:

$ for i in 0 1 2 3 ; do ( set -x ; mips-mti-linux-gnu-gcc test.c -c -O$i
-fdump-tree-optimized ; egrep ";; Function|switch" test.c.169t.optimized  )
done
+ mips-mti-linux-gnu-gcc test.c -c -O0 -fdump-tree-optimized
+ egrep ';; Function|switch' test.c.169t.optimized
;; Function is_digit (is_digit, funcdef_no=0, decl_uid=1406, symbol_order=0)
;; Function FMS (FMS, funcdef_no=1, decl_uid=1410, symbol_order=1)
  switch (state_11) , case 0: , case 2: , case 3:
, case 4: , case 5: , case 6: , case 7: >
+ mips-mti-linux-gnu-gcc test.c -c -O1 -fdump-tree-optimized
+ egrep ';; Function|switch' test.c.169t.optimized
;; Function FMS (FMS, funcdef_no=1, decl_uid=1410, symbol_order=1)
  switch (state_98) , case 0: , case 2: , case 3:
, case 4: , case 5: , case 6: , case 7: >
+ mips-mti-linux-gnu-gcc test.c -c -O2 -fdump-tree-optimized
+ egrep ';; Function|switch' test.c.169t.optimized
;; Function FMS (FMS, funcdef_no=1, decl_uid=1410, symbol_order=1)
+ mips-mti-linux-gnu-gcc test.c -c -O3 -fdump-tree-optimized
+ egrep ';; Function|switch' test.c.169t.optimized
;; Function FMS (FMS, funcdef_no=1, decl_uid=1410, symbol_order=1)

[Bug tree-optimization/77808] New: [7 Regression] ICE in duplicate_ssa_name_ptr_info, at tree-ssanames.c:630 starting with r240439

2016-09-30 Thread doug.gilmore at imgtec dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77808

Bug ID: 77808
   Summary: [7 Regression] ICE in duplicate_ssa_name_ptr_info, at
tree-ssanames.c:630 starting with r240439
   Product: gcc
   Version: 7.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: doug.gilmore at imgtec dot com
  Target Milestone: ---

Reported in:

https://gcc.gnu.org/ml/gcc-patches/2016-09/msg02285.html

This issue was not found during regression testing for
commit r240439 since the failure only shows up when
-fprefetch-loop-arrays is enabled by default.

Will send a fix and test case to gcc-patches.

[Bug tree-optimization/77654] restrict pointer attribute not preserved with -fprefetch-loop-arrays

2016-09-19 Thread doug.gilmore at imgtec dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77654

--- Comment #2 from Doug Gilmore  ---
Created attachment 39652
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=39652&action=edit
Prototype fix for bug.

[Bug tree-optimization/77654] restrict pointer attribute not preserved with -fprefetch-loop-arrays

2016-09-19 Thread doug.gilmore at imgtec dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77654

--- Comment #1 from Doug Gilmore  ---
Created attachment 39651
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=39651&action=edit
Additional tracing used to identify problem.

[Bug tree-optimization/77654] New: restrict pointer attribute not preserved with -fprefetch-loop-arrays

2016-09-19 Thread doug.gilmore at imgtec dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77654

Bug ID: 77654
   Summary: restrict pointer attribute not preserved with
-fprefetch-loop-arrays
   Product: gcc
   Version: 7.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: doug.gilmore at imgtec dot com
  Target Milestone: ---

Compiling the test example:

void daxpy(int n, double da, double * __restrict dx, double * __restrict dy)
{
    int i;

    for (i = 0; i < n; i++) {
        dy[i] = dy[i] + da*dx[i];
    }
}

via:

mips-img-linux-gnu-gcc -fprefetch-loop-arrays daxpy.c -c -O2 -save-temps
-fsched-verbose=9 -fdump-rtl-sched2

The following code is generated for the main loop:

$L4:
ldc1    $f2,0($5)
pref    6,0($2)
ldc1    $f8,-120($2)
addiu   $2,$2,32
ldc1    $f6,-144($2)
addiu   $5,$5,32
ldc1    $f4,-136($2)
addiu   $3,$3,4
maddf.d $f8,$f2,$f0
ldc1    $f2,-128($2)
sdc1    $f8,-152($2)
ldc1    $f1,-24($5)
maddf.d $f6,$f0,$f1
sdc1    $f6,-144($2)
ldc1    $f1,-16($5)
maddf.d $f4,$f0,$f1
sdc1    $f4,-136($2)
ldc1    $f1,-8($5)
maddf.d $f2,$f0,$f1
bne     $3,$8,$L4
sdc1    $f2,-128($2)

Due to the __restrict attributes on the pointer declarations, after
scheduling we should see the loads through $5 move above the
stores through $2.  However, during the transformation done by the
phase that is enabled by -fprefetch-loop-arrays, the points-to
information is lost.  This prevents the loads from moving above the
stores during scheduling.
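
The reordering that __restrict permits can be written out by hand.  The
function below is a sketch, not compiler output: daxpy_unrolled4 is a
hypothetical name, and the 4x unrolling is chosen only to resemble the
loop bodies shown here.  Because dx and dy are declared not to overlap,
all loads can legally be issued before any store.

```c
/* Hand-scheduled daxpy: loads first, stores last.  Hoisting the dx
   loads above the dy stores is only valid because __restrict promises
   that dx and dy do not alias.  */
void daxpy_unrolled4(int n, double da,
                     const double * __restrict dx, double * __restrict dy)
{
    int i = 0;

    for (; i + 4 <= n; i += 4) {
        /* All loads.  */
        double y0 = dy[i],     y1 = dy[i + 1];
        double y2 = dy[i + 2], y3 = dy[i + 3];
        double x0 = dx[i],     x1 = dx[i + 1];
        double x2 = dx[i + 2], x3 = dx[i + 3];

        /* All stores.  */
        dy[i]     = y0 + da * x0;
        dy[i + 1] = y1 + da * x1;
        dy[i + 2] = y2 + da * x2;
        dy[i + 3] = y3 + da * x3;
    }

    /* Remainder iterations.  */
    for (; i < n; i++)
        dy[i] = dy[i] + da * dx[i];
}
```

If the points-to information is dropped, the scheduler has to assume
each store to dy might change a later dx load, and it keeps the
interleaved order instead.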

The attached patch uses logic borrowed from the IVS phase:

0002-Ensure-points-to-information-is-maintained-for-prefe.patch

After applying the patch, the points-to information is maintained,
which results in good code being generated after scheduling (which is
very important when running on in-order processors):

$L4:
addiu   $5,$5,32
ldc1    $f8,-120($2)
ldc1    $f6,-112($2)
pref    6,0($2)
ldc1    $f4,-104($2)
addiu   $3,$3,4
ldc1    $f2,-96($2)
addiu   $2,$2,32
ldc1    $f7,-32($5)
ldc1    $f5,-24($5)
ldc1    $f3,-16($5)
ldc1    $f1,-8($5)
maddf.d $f8,$f7,$f0
maddf.d $f6,$f0,$f5
maddf.d $f4,$f0,$f3
maddf.d $f2,$f0,$f1
sdc1    $f8,-152($2)
sdc1    $f6,-144($2)
sdc1    $f4,-136($2)
sdc1    $f2,-128($2)
bnec    $3,$8,$L4

I am not sure what to do about a test case.  One possibility is to
commit some of the tracing in the debugging patch:

0001-Add-more-tracing-for-missing-points-to-information.patch

and we could scan for the RE "pi. is NULL", in the dump file
created by -fdump-rtl-sched2.

[Bug rtl-optimization/69710] performance issue with SP Linpack with Autovectorization

2016-03-07 Thread doug.gilmore at imgtec dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69710

--- Comment #15 from Doug Gilmore  ---
> I had a patch too, will send it for review in GCC7 if it's still needed.
Sorry, I got sidetracked last week and didn't make much progress.

Please go ahead and submit if you have something you feel comfortable with,
I'll assist in testing.

Thanks,

[Bug rtl-optimization/69710] performance issue with SP Linpack with Autovectorization

2016-02-24 Thread doug.gilmore at imgtec dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69710

--- Comment #13 from Doug Gilmore  ---
I think this should be fairly straightforward to fix in the
autovectorization pass.  Hopefully I should be able to post a patch
in the next few days.

[Bug rtl-optimization/69710] performance issue with SP Linpack with Autovectorization

2016-02-16 Thread doug.gilmore at imgtec dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69710

--- Comment #12 from Doug Gilmore  ---
> Yes, I proposed some cleanup passess after vectorization but richi
> thinks it's genrally expensive.  So what's implmentation complexity
> of pass_dominator?
One thing we might consider is only enable it when vectorization is
run on architectures where cleanup is needed.

I plan to send an RFC comment for my patch to see what objections
there are to that approach, though beforehand I'd like to investigate
what could be done to the vectorizer so that it doesn't generate code
that contains false dependencies.

[Bug rtl-optimization/69710] performance issue with SP Linpack with Autovectorization

2016-02-13 Thread doug.gilmore at imgtec dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69710

--- Comment #10 from Doug Gilmore  ---
Created attachment 37681
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=37681&action=edit
prototype fix

> 1) we failed recognize that use 0 and 2 are identical to each other.
> This is because vectorizer generates redundant setup code in loop
> pre-header.  There are two possible fixes here.  One is to make
> expand_simple_operations more aggressive in expanding (used by
> ivopts) in tree-ssa-loop-niter.c.  But I don't think this is a good
> idea in all cases, because expanded complicated expression makes ivo
> transform and niter analysis harder.
Or something along the lines of the attached patch, tested only on
the problem at hand.  As it stands it is probably too heavy-handed
to consider as a possible review candidate.
> The other is to fix vectorizer
> to generate clean code.  Richard's suggestion is to use gimple_build
> for that.
ISTM to be the reasonable approach but I haven't yet investigated
what's involved.
> Also the problem exists only for arm because it doesn't support
> [base+index] addressing mode for vect load/store.  I guess mips
> doesn't either.
> 
Right, MIPS MSA doesn't support [base+index] mode.

BTW, the reason why IVOPTS works for DP but not SP on MIPS MSA is
that the code in the pre-header is simpler for DP:

  :
  vect_cst__52 = {da_6(D), da_6(D)};

  :
  # vectp_dy.8_46 = PHI 
  # vectp_dx.11_49 = PHI 
  # vectp_dy.16_55 = PHI 
  # ivtmp_58 = PHI <0(6), ivtmp_59(12)>
...
which IVOPTS can handle.

[Bug tree-optimization/69710] performance issue with SP Linpack with Autovectorization

2016-02-06 Thread doug.gilmore at imgtec dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69710

--- Comment #1 from Doug Gilmore  ---
Created attachment 37615
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=37615&action=edit
daxpy for DP (previous was for SP)

Compilation example:

arm-linux-gnueabihf-gcc -O3 -save-temps daxpy.c saxpy.c -c -mfpu=neon  -c
-fdump-tree-{vect,ivopts}-{verbose,details} -fdump-tree-{slp1,optimized}
-fsched-verbose=9 \
-fdump-rtl-sched{1,2} -marm  -funsafe-math-optimizations -funroll-all-loops

Note that Neon does not support DP, thus daxpy.s won't contain
autovectorized code.

I haven't built a ToT compiler for aarch64-linux-gnu, but I suspect
that you will see autovectorized code in daxpy.s in which reasonable
schedules are being produced (loads are being moved above stores).

[Bug rtl-optimization/69710] performance issue with SP Linpack with Autovectorization

2016-02-06 Thread doug.gilmore at imgtec dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69710

--- Comment #5 from Doug Gilmore  ---
Thanks for checking on AArch64 Andrew.

BTW, I formed my (incorrect) hunch by running a test on gcc113, where
the installed 4.8 compiler showed problems for both DP and SP.  (I
assumed that the problem was addressed on DP since we don't see it on
MIPS for DP at ToT with the MSA patch applied.)

For Neon after ivopts I see:

  :
  # vectp_dy.20_96 = PHI 
  # ivtmp.22_78 = PHI <0(13), ivtmp.22_77(21)>
  # ivtmp.26_112 = PHI 
  # ivtmp.31_153 = PHI 
  vectp_dx.15_88 = (vector(4) float *) ivtmp.26_112;
  _156 = (void *) ivtmp.31_153;
  vect__12.14_85 = MEM[base: _156, offset: 0B];
  ivtmp.31_154 = ivtmp.31_153 + 16;
  vect__15.17_90 = MEM[(float *)vectp_dx.15_88];
  vect__16.18_92 = vect_cst__91 * vect__15.17_90;
  vect__17.19_93 = vect__12.14_85 + vect__16.18_92;
  MEM[base: vectp_dy.20_96, offset: 0B] = vect__17.19_93;
  vectp_dy.20_97 = vectp_dy.20_96 + 16;
  ivtmp.22_77 = ivtmp.22_78 + 1;
  ivtmp.26_111 = ivtmp.26_112 + 16;
  if (ivtmp.22_77 < bnd.9_53)
goto ;
  else
goto ;
...
  :
  goto ;

So the problem is indeed exposed on Neon.

[Bug tree-optimization/69710] New: performance issue with SP Linpack with Autovectorization

2016-02-06 Thread doug.gilmore at imgtec dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69710

Bug ID: 69710
   Summary: performance issue with SP Linpack with
Autovectorization
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: doug.gilmore at imgtec dot com
  Target Milestone: ---

Created attachment 37614
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=37614&action=edit
extracted daxpy example

We've noticed a performance problem in single precision
Linpack with the MSA patch applied:

https://gcc.gnu.org/ml/gcc-patches/2016-01/msg00177.html

which I have been able to reproduce with ARM Neon.

The problem is that autovectorization is generating more induction
variables than needed for memory references in daxpy (this is an issue
on all architectures).  That is, when the statement:

  dy[i] = dy[i] + da*dx[i];

is vectorized, the vector load associated with the load of dy[i] uses
a different Induction Variable (IV) than the subsequent vector store
to dy[i].  For example, for ARM Neon after vect we see:

  :
  # i_26 = PHI <i_44(11), i_19(20)>
  # vectp_dy.12_83 = PHI <vectp_dy.13_81(11), vectp_dy.12_84(20)>
  # vectp_dx.15_88 = PHI <vectp_dx.16_86(11), vectp_dx.15_89(20)>
  # vectp_dy.20_96 = PHI <vectp_dy.21_94(11), vectp_dy.20_97(20)>
  # ivtmp_99 = PHI <0(11), ivtmp_100(20)>
  i.0_7 = (unsigned int) i_26;
  _8 = i.0_7 * 4;
  _10 = dy_9(D) + _8;
  vect__12.14_85 = MEM[(float *)vectp_dy.12_83];
  _12 = *_10;
  _14 = dx_13(D) + _8;
  vect__15.17_90 = MEM[(float *)vectp_dx.15_88];
  _15 = *_14;
  vect__16.18_92 = vect_cst__91 * vect__15.17_90;
  _16 = da_6(D) * _15;
  vect__17.19_93 = vect__12.14_85 + vect__16.18_92;
  _17 = _12 + _16;
  MEM[(float *)vectp_dy.20_96] = vect__17.19_93;
  i_19 = i_26 + 1;
  vectp_dy.12_84 = vectp_dy.12_83 + 16;
  vectp_dx.15_89 = vectp_dx.15_88 + 16;
  vectp_dy.20_97 = vectp_dy.20_96 + 16;
  ivtmp_100 = ivtmp_99 + 1;
  if (ivtmp_100 < bnd.9_53)
goto ;
  else
goto ;
...
  :
  goto ;

Note that the use of a separate IV for the load and store off of dy
can introduce a false memory dependency, which causes poor scheduling
after unrolling.  From what I have seen so far, for double precision
the ivopts phase is able to clean up the induction variables so the
false memory dependency is removed.  However, the cleanup does not
happen for single precision.
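
The shape of the problem can be mimicked in scalar C.  This is a sketch
only, with hypothetical function names, and it uses scalar floats where
the dump above uses vector accesses.  The first function walks dy with
two separate pointers, one for the load and one for the store, as the
vectorizer output does; the second uses a single index for both.

```c
#include <stddef.h>

/* Load and store of dy go through different induction variables, so
   later passes must prove that dy_load and dy_store stay equal.  */
void saxpy_two_ivs(size_t n, float da, const float *dx, float *dy)
{
    const float *dx_load = dx;
    const float *dy_load = dy;   /* IV used for the load of dy[i]  */
    float *dy_store = dy;        /* separate IV used for the store */

    for (size_t i = 0; i < n; i++) {
        float x = *dx_load;
        float y = *dy_load;
        *dy_store = y + da * x;
        dx_load++;
        dy_load++;
        dy_store++;
    }
}

/* Load and store of dy share one induction variable, so it is evident
   that both touch the same address in each iteration.  */
void saxpy_one_iv(size_t n, float da, const float *dx, float *dy)
{
    for (size_t i = 0; i < n; i++)
        dy[i] = dy[i] + da * dx[i];
}
```

Both functions compute the same result; the difference is only in how
much work is left for ivopts and the scheduler to see that the store in
one iteration cannot conflict with the dy load of the next.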

Attached simple example for single precision, more to follow.

[Bug target/66747] [6 Regression] The commit r225260 broke the builds of the mips-{mti,img}-linux-gnu tool chains.

2015-07-06 Thread doug.gilmore at imgtec dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66747

--- Comment #9 from Doug Gilmore doug.gilmore at imgtec dot com ---
Our nightly builds are now clean with this patch.

Thanks!


[Bug middle-end/66747] [6 Regression] The commit r225260 broke the builds of the mips-{mti,img}-linux-gnu tool chains.

2015-07-03 Thread doug.gilmore at imgtec dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66747

Doug Gilmore doug.gilmore at imgtec dot com changed:

   What|Removed |Added

 CC||matthew.fortune at imgtec dot com

--- Comment #5 from Doug Gilmore doug.gilmore at imgtec dot com ---
The build succeeded and the regression test run
showed no regressions.

Bernd: could you send the patch to the list for approval?

Thanks!


[Bug middle-end/66747] [6 Regression] The commit r225260 broke the builds of the mips-{mti,img}-linux-gnu tool chains.

2015-07-03 Thread doug.gilmore at imgtec dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66747

--- Comment #4 from Doug Gilmore doug.gilmore at imgtec dot com ---
Thanks!

I started up a build with the patch and it got through
the initial_gcc build so that is a good sign.

I'll send an update once the build is done.


[Bug c/66747] New: The commit r225260 broke the builds of the mips-{mti,img}-linux-gnu tool chains.

2015-07-02 Thread doug.gilmore at imgtec dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66747

Bug ID: 66747
   Summary: The commit r225260 broke the builds of the
mips-{mti,img}-linux-gnu tool chains.
   Product: gcc
   Version: 5.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: doug.gilmore at imgtec dot com
  Target Milestone: ---

The commit r225260 broke the builds of the mips-{mti,img}-linux-gnu tool
chains.

To reproduce the problem, configure the binutils build from the directory
/scratch/d/obj-mips-img-linux-gnu/binutils-gdb:


/scratch/d/src/binutils-gdb/configure
--prefix=/scratch/d/install-mips-img-linux-gnu --target=mips-img-linux-gnu
--with-sysroot=/scratch/d/install-mips-img-linux-gnu/sysroot
Then run make and make install.

Then configure the gcc build from the directory
/scratch/d/obj-mips-img-linux-gnu/initial_gcc:
/scratch/d/src/gcc/configure --prefix=/scratch/d/install-mips-img-linux-gnu
--disable-libssp --disable-libgomp --disable-libmudflap --disable-decimal-float
--with-mips-plt --target=mips-img-linux-gnu --enable-languages=c
--without-headers --disable-shared --disable-threads --disable-libquadmath
--disable-libatomic
Running make fails with:

/scratch/d/obj-mips-img-linux-gnu/initial_gcc/./gcc/xgcc
-B/scratch/d/obj-mips-img-linux-gnu/initial_gcc/./gcc/
-B/scratch/d/install-mips-img-linux-gnu/mips-img-linux-gnu/bin/
-B/scratch/d/install-mips-img-linux-gnu/mips-img-linux-gnu/lib/ -isystem
/scratch/d/install-mips-img-linux-gnu/mips-img-linux-gnu/include -isystem
/scratch/d/install-mips-img-linux-gnu/mips-img-linux-gnu/sys-include -g -O2
-minterlink-mips16 -mips64r6 -O2 -g -O2 -minterlink-mips16 -DIN_GCC 
-DCROSS_DIRECTORY_STRUCTURE  -W -Wall -Wwrite-strings -Wcast-qual
-Wstrict-prototypes -Wmissing-prototypes -Wold-style-definition  -isystem
./include  -I. -I. -I../../../.././gcc -I/scratch/d/src/gcc/libgcc
-I/scratch/d/src/gcc/libgcc/. -I/scratch/d/src/gcc/libgcc/../gcc
-I/scratch/d/src/gcc/libgcc/../include   -g0  -finhibit-size-directive
-fno-inline -fno-exceptions -fno-zero-initialized-in-bss -fno-toplevel-reorder
-fno-tree-vectorize -fbuilding-libgcc -fno-stack-protector  -Dinhibit_libc -I.
-I. -I../../../.././gcc -I/scratch/d/src/gcc/libgcc
-I/scratch/d/src/gcc/libgcc/. -I/scratch/d/src/gcc/libgcc/../gcc
-I/scratch/d/src/gcc/libgcc/../include  -o crtbeginT.o -MT crtbeginT.o -MD -MP
-MF crtbeginT.dep  -c /scratch/d/src/gcc/libgcc/crtstuff.c -DCRT_BEGIN
-DCRTSTUFFT_O
/scratch/d/src/gcc/libgcc/crtstuff.c: In function 'frame_dummy':
/scratch/d/src/gcc/libgcc/crtstuff.c:490:1: error: unrecognizable insn:
 }
 ^
(insn 82 67 8 (sequence [
(jump_insn 7 67 66 (set (pc)
(if_then_else (eq (reg/f:SI 2 $2 [197])
(const_int 0 [0]))
(label_ref:SI 15)
(pc))) /scratch/d/src/gcc/libgcc/crtstuff.c:470 466
{*branch_equalitysi}
 (expr_list:REG_DEAD (reg/f:SI 2 $2 [197])
(int_list:REG_BR_PROB 3017 (nil)))
 -> 15)
(insn/f 66 7 8 (set (mem/c:DI (plus:SI (reg/f:SI 29 $sp)
(const_int 8 [0x8])) [5  S8 A64])
(reg:DI 31 $31)) 302 {*movdi_64bit}
 (expr_list:REG_FRAME_RELATED_EXPR (set/f (mem/c:DI (plus:SI
(reg/f:SI 29 $sp)
(const_int 8 [0x8])) [5  S8 A64])
(reg:DI 31 $31))
(nil)))
]) /scratch/d/src/gcc/libgcc/crtstuff.c:470 -1
 (nil))

We are working around the issue by reverting r225260.


[Bug c++/63412] New: aliasing issue exposed by inlining

2014-09-29 Thread doug.gilmore at imgtec dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63412

Bug ID: 63412
   Summary: aliasing issue exposed by inlining
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: doug.gilmore at imgtec dot com

Created attachment 33616
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=33616&action=edit
test program

The attached test program fails with 4.7 up to ToT at -O2 on both x86
(I built x86_64 with the -m32 multi-lib variant) and MIPS.

$ g++ -Wall -g -m32 -std=gnu++11 -O2 -fno-exceptions bad_i5.c -static -o la
-save-temps && ./la
Aborted (core dumped)
$ g++ -Wall -g -m32 -std=gnu++11 -O0 -fno-exceptions bad_i5.c -static -o la
-save-temps && ./la
$ 
Note that simplifying one of the expressions makes the program work:
$ g++ -Wall -g -DNO_VOL -m32 -std=gnu++11 -O2 -fno-exceptions bad_i5.c -static
-o la -save-temps && ./la
$ 

The generated code has the store below the implicit
load in the compare:

cmpl    %ebx, 4(%esp,%edx,4)
movl    %eax, 4(%esp)
jne     .L5

which is incorrect.  It should be:

movl    %eax, 4(%esp)
cmpl    %ebx, 4(%esp,%edx,4)
jne     .L5

We have an internal debate on what the issue is.

Some are of the opinion that casting is breaking alias rules and
thus the behavior of the program is undefined.

Thus something along the lines of the following changes is needed.

$ diff bad_i5{,_mod}.c
48c48
< return reference_->AsMirrorPtr();
---
> return static_cast<T*>(reference_->AsMirrorPtr());
50c50
<   ObjectReference<T>* reference_;
---
>   ObjectReference<Object>* reference_;
52,53c52,53
< : reference_(reinterpret_cast<ObjectReference<T>*>(reference))
< { }
---
> : reference_((reference))
>   { }
$ g++ -g -m32 -std=gnu++11 -O2 -fno-exceptions bad_i5_mod.c -static -o la
-save-temps && ./la
$

If there is a strict aliasing issue, shouldn't -Wall be warning about
it?

My take is that the casting is not a concern here, since the returns
from (and entries to) the inlined routines effectively sequence the
problematic store above the problematic load, and thus this should
be considered a bug in GCC.
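
For readers unfamiliar with the rule under debate, the following is a minimal C sketch of how type-based alias analysis can license moving a load across a store. It is not the attached test program; `store_then_load` and `bits_of` are hypothetical names, and the snippet only illustrates the general rule, not a diagnosis of this report.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* With strict aliasing enabled (GCC turns it on at -O2), the compiler may
   assume that an int * and a float * never refer to the same object.  It is
   therefore allowed to reorder the store through q relative to the accesses
   through p, or to fold the final load to the constant 1. */
int store_then_load(int *p, float *q)
{
    *p = 1;
    *q = 2.0f;
    return *p;
}

/* A well-defined way to reinterpret the bytes of an object: copy them.
   Assumes float and uint32_t have the same size on the target. */
uint32_t bits_of(float f)
{
    uint32_t u;
    assert(sizeof u == sizeof f);
    memcpy(&u, &f, sizeof u);
    return u;
}
```

Calling `store_then_load` with pointers to two distinct objects is always well defined. Calling it with both pointers aimed at the same storage through a cast is the kind of code for which the "undefined behavior" reading above would apply.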


[Bug c++/63412] aliasing issue exposed by inlining

2014-09-29 Thread doug.gilmore at imgtec dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63412

--- Comment #1 from Doug Gilmore <doug.gilmore at imgtec dot com> ---
Created attachment 33617
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=33617&action=edit
Modified version where type casts are modified.


[Bug tree-optimization/63148] [4.8/4.9 Regression] r187042 causes auto-vectorization failure for X86 for -m32.

2014-09-05 Thread doug.gilmore at imgtec dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63148

Doug Gilmore <doug.gilmore at imgtec dot com> changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #10 from Doug Gilmore <doug.gilmore at imgtec dot com> ---
Verified that my test examples now work on both X86 -m32
and MIPS32 -mmsa (patch is under review).

Thanks!

Doug


[Bug tree-optimization/63148] [4.8/4.9/5 Regression] r187042 causes auto-vectorization failure for X86 for -m32.

2014-09-04 Thread doug.gilmore at imgtec dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63148

--- Comment #6 from Doug Gilmore <doug.gilmore at imgtec dot com> ---
> The input to the vectorizer is already bogus:
>
>    _12 = i.0_5 + 536870911;
>    _13 = global_data.b[_12];

Note that the GIMPLE output generated by the front end
is already problematic:

Before r187042:
  D.1747 = i.0 + -1;
With r187042:
  D.1747 = i.0 + 536870911;
Any idea what the intent is of the changes in r187042 that transform
signed constants into unsigned ones?  To me, that is the problematic issue.
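
One possible reading of where 536870911 comes from, sketched in C below: if the array elements are 8 bytes wide and byte offsets are computed in 32-bit unsigned arithmetic, then `i + 536870911` and `i - 1` scale to the same byte offset, because 536870911 * 8 = 2^32 - 8. Both the element size and the 32-bit width are assumptions made for illustration, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Byte offset of element `index` in an array of 8-byte elements, computed
   in 32-bit unsigned arithmetic (assumed here to stand in for the offset
   type on an -m32 target). */
static uint32_t byte_offset(uint32_t index)
{
    return index * UINT32_C(8);
}

/* Returns 1 when i + 536870911 and i - 1 yield the same byte offset. */
int same_offset(uint32_t i)
{
    uint32_t with_minus_one = byte_offset(i - UINT32_C(1));
    uint32_t with_big_const = byte_offset(i + UINT32_C(536870911));
    return with_minus_one == with_big_const;
}

/* Sanity check over a range of small indices. */
void check_small_indices(void)
{
    for (uint32_t i = 1; i < 1000; i++)
        assert(same_offset(i));
}
```

The two forms agree only after scaling and wrap-around. As plain index values, `i - 1` and `i + 536870911` are far apart, which is how a dependence test working on the unscaled subscripts could conclude that the two accesses never overlap.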


[Bug tree-optimization/63148] r187042 causes auto-vectorization failure for X86 for -m32.

2014-09-03 Thread doug.gilmore at imgtec dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63148

Doug Gilmore <doug.gilmore at imgtec dot com> changed:

   What|Removed |Added

 CC||rguenther at suse dot de

--- Comment #2 from Doug Gilmore <doug.gilmore at imgtec dot com> ---
I still see the test failure at -m32 using the TIP of gcc-4_8-branch and ToT.

Richard: when you have the chance, could you double-check your test results?


[Bug c/63148] New: r187042 causes auto-vectorization failure for X86 for -m32.

2014-09-02 Thread doug.gilmore at imgtec dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63148

Bug ID: 63148
   Summary: r187042 causes auto-vectorization failure for X86 for
-m32.
   Product: gcc
   Version: 4.8.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: doug.gilmore at imgtec dot com

Created attachment 33440
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=33440&action=edit
test example

I noticed that MultiSource/Benchmarks/TSVC/LoopRestructuring-{flt,dbl}
from LLVM test-suite fail on X86 -m32 and I was able to bisect the
failure to commit r187042.

I attached a stripped-down example.

Before the revision, if we compile with -fdump-tree-vect-details,
we see that a loop-carried dependency is recorded:

(compute_affine_dependence
  stmt_a: D.1748_9 = global_data.b[D.1747_8];
  stmt_b: global_data.b[i.0_2] = D.1750_11;
(subscript_dependence_tester 
(analyze_overlapping_iterations 
  (chrec_a = {0, +, 1}_5)
  (chrec_b = {1, +, 1}_5)
(analyze_siv_subscript 
(analyze_subscript_affine_affine 
  (overlaps_a = [1 + 1 * x_1]
)
  (overlaps_b = [0 + 1 * x_1]
)
)
)
  (overlap_iterations_a = [1 + 1 * x_1]
)
  (overlap_iterations_b = [0 + 1 * x_1]
)
)
(analyze_overlapping_iterations 
  (chrec_a = 2816)
  (chrec_b = 2816)
  (overlap_iterations_a = [0]
)
  (overlap_iterations_b = [0]
)
)
(build_classic_dist_vector
  dist_vector = (  1 
  )
)
)
)

which results in the loop not being vectorized because of the memory
recurrence.

After the change the dependency is not recorded:

(compute_affine_dependence
  stmt_a: D.1748_9 = global_data.b[D.1747_8];
  stmt_b: global_data.b[i.0_2] = D.1750_11;
(subscript_dependence_tester 
(analyze_overlapping_iterations 
  (chrec_a = {536870912, +, 1}_5)
  (chrec_b = {1, +, 1}_5)
(analyze_siv_subscript 
(analyze_subscript_affine_affine 
  (overlaps_a = no dependence
)
  (overlaps_b = no dependence
)
)
)
  (overlap_iterations_a = no dependence
)
  (overlap_iterations_b = no dependence
)
)
(dependence classified: scev_known)
)

This causes the loop to be incorrectly vectorized.

Note that when compiled with -m64 the loop is actually vectorized,
but it is determined that versioning is needed:

45: dependence distance == 0 between global_data.a[D.1767_2] and
global_data.a[D.1767_2]
45: versioning for alias required: can't determine dependence between
global_data.a[D.1767_2] and *D.1776_10
...
58: LOOP VECTORIZED.
s221_extract.c:40: note: vectorized 5 loops in function.
Merging blocks 2 and 41
Removing basic block 5
...

and the incorrectly vectorized code is removed.
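
To illustrate the kind of memory recurrence the dumps above describe (a read of `b[i-1]` followed by a write of `b[i]`), here is a small C sketch. It is not the attached test case: the array size, the vectorization factor `VF`, and all function names are hypothetical. The second function emulates by hand what happens when a block of old values is loaded before any of the new values are stored.

```c
#include <assert.h>
#include <string.h>

#define N  16
#define VF 4   /* hypothetical vectorization factor */

/* Scalar loop: each b[i] uses the b[i-1] written by the previous iteration. */
void recurrence_scalar(double *b, const double *a)
{
    for (int i = 1; i < N; i++)
        b[i] = b[i - 1] + a[i];
}

/* Emulates a vectorized loop that ignores the dependence: VF values of
   b[i-1] are loaded up front, so all but the first of them are stale. */
void recurrence_wrongly_vectorized(double *b, const double *a)
{
    int i = 1;
    for (; i + VF <= N; i += VF) {
        double prev[VF];
        memcpy(prev, &b[i - 1], sizeof prev);   /* one "vector load" */
        for (int k = 0; k < VF; k++)
            b[i + k] = prev[k] + a[i + k];      /* one "vector store" */
    }
    for (; i < N; i++)                          /* scalar epilogue */
        b[i] = b[i - 1] + a[i];
}

/* Returns 1 if both versions produce identical results, 0 otherwise. */
int results_match(void)
{
    double a[N], b1[N], b2[N];
    for (int i = 0; i < N; i++) {
        a[i] = 1.0;
        b1[i] = 0.0;
        b2[i] = 0.0;
    }
    recurrence_scalar(b1, a);
    assert(b1[N - 1] == (double)(N - 1));       /* running sum of ones */
    recurrence_wrongly_vectorized(b2, a);
    return memcmp(b1, b2, sizeof b1) == 0;
}
```

With these inputs the two versions disagree, which is why a recorded distance-1 dependence has to block vectorization of such a loop.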