[Bug tree-optimization/102131] [12 Regression] wrong code at -O1 and above on x86_64-linux-gnu

2021-08-31 Thread amker at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102131

--- Comment #4 from bin cheng  ---
(In reply to Jiu Fu Guo from comment #3)
> The issue may come from 'iv0 cmp iv1' transform:
> 
>if (c -->if (c>=b) in-loop
> -->if (b<=c) in-loop
> 
>   c: {4, +, 3}
>   b: {1, +, 1}
> 
>   if ({1, +, 1} <= {4, +, 3})
>   ==> if ({1,+,-2} <= {4,+,0})  here, error occur
>   ==> if ({1,+,-2} < {5,+,0}) le-->lt

So is this a duplicate of PR100740?  Thanks
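
To illustrate why moving the step from one side of 'iv0 cmp iv1' to the other
can go wrong, here is a minimal sketch.  It is not a reproduction of the PR:
the 8-bit wrapping type and the helper names (iv_at, cond_original,
cond_rewritten) are assumptions chosen to keep the example small.  It only
shows that {1, +, 1} <= {4, +, 3} and {1, +, -2} <= {4, +, 0} stop agreeing
once the rewritten IV wraps.

```c
#include <assert.h>
#include <stdint.h>

/* Value of the affine IV {base, +, step} at iteration i, evaluated in a
   wrapping 8-bit unsigned type (hypothetical width, for illustration). */
uint8_t iv_at(uint8_t base, uint8_t step, unsigned i)
{
    return (uint8_t)(base + step * i);
}

/* Original condition: {1, +, 1} <= {4, +, 3}. */
int cond_original(unsigned i)
{
    return iv_at(1, 1, i) <= iv_at(4, 3, i);
}

/* After subtracting the right-hand step: {1, +, -2} <= {4, +, 0}. */
int cond_rewritten(unsigned i)
{
    return iv_at(1, (uint8_t)-2, i) <= iv_at(4, 0, i);
}

/* The two forms agree at i = 0 and disagree at i = 1, where the
   rewritten left-hand side wraps from 1 to 255. */
void check_transform(void)
{
    assert(cond_original(0) && cond_rewritten(0));
    assert(cond_original(1));      /* 2 <= 7 */
    assert(!cond_rewritten(1));    /* 255 <= 4 is false */
}
```

With non-wrapping arithmetic the two conditions would be equivalent, so the
rewrite seems to need a no-overflow argument for the combined step.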

[Bug tree-optimization/101145] niter analysis fails for until-wrap condition

2021-06-25 Thread amker at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101145

--- Comment #7 from bin cheng  ---
(In reply to Jiu Fu Guo from comment #5)
> (In reply to bin cheng from comment #4)
> > (In reply to Jiu Fu Guo from comment #3)
> > > Yes, while the code in adjust_cond_for_loop_until_wrap seems somehow 
> > > tricky:
> > > 
> > >   /* Only support simple cases for the moment.  */
> > >   if (TREE_CODE (iv0->base) != INTEGER_CST
> > >       || TREE_CODE (iv1->base) != INTEGER_CST)
> > >     return false;
> > > 
> > > This code requires both sides are constant.
> > Actually it requires an IV with constant base.
> 
> I also feel that the intention of this function may only require one side
> constant for IV0 CODE IV1.
> As tests, for below loop, adjust_cond_for_loop_until_wrap return false:
> 
> foo (int *__restrict__ a, int *__restrict__ b, unsigned i)
> {
>   while (++i > 100)
>     *a++ = *b++ + 1;
> }
> 
> For below code, adjust_cond_for_loop_until_wrap returns true:
>   i = UINT_MAX - 200;
>   while (++i > 100)
>     *a++ = *b++ + 1;

Oh, sorry for being misleading.  When I mentioned that it requires something
(...), I was describing the current behavior, not saying that the conditions
are necessary.  Feel free to improve such cases.  Looking into niter analysis,
such cases (trade-offs) are not rare.

Thanks
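
For reference, the iteration count of an until-wrap loop such as
'while (++i > 100)' can be written in closed form.  The sketch below is a
derivation for illustration, not the code in adjust_cond_for_loop_until_wrap;
the function names are hypothetical.

```c
#include <assert.h>
#include <limits.h>

/* Count body executions of `while (++i > bound) body;` by brute force.
   Only cheap when i starts close to UINT_MAX. */
unsigned count_until_wrap(unsigned i, unsigned bound)
{
    unsigned n = 0;
    while (++i > bound)
        n++;
    return n;
}

/* Closed form for the same count: if the first test passes, the IV
   walks i+1 .. UINT_MAX and the loop exits when it wraps to 0. */
unsigned niter_until_wrap(unsigned i, unsigned bound)
{
    if (i + 1u <= bound)    /* first test fails; also covers i == UINT_MAX */
        return 0;
    return UINT_MAX - i;
}

void check_until_wrap(void)
{
    assert(count_until_wrap(UINT_MAX - 200, 100) == 200);
    assert(niter_until_wrap(UINT_MAX - 200, 100) == 200);
    assert(count_until_wrap(5, 100) == niter_until_wrap(5, 100));
    assert(count_until_wrap(UINT_MAX, 100) == niter_until_wrap(UINT_MAX, 100));
}
```

The closed form only needs the initial value, which may be why a constant
base is the easy case; with a symbolic base the same formula would hold under
the assumption base + 1 > bound.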

[Bug tree-optimization/101145] niter analysis fails for until-wrap condition

2021-06-24 Thread amker at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101145

--- Comment #4 from bin cheng  ---
(In reply to Jiu Fu Guo from comment #3)
> Yes, while the code in adjust_cond_for_loop_until_wrap seems somehow tricky:
> 
>   /* Only support simple cases for the moment.  */
>   if (TREE_CODE (iv0->base) != INTEGER_CST
>       || TREE_CODE (iv1->base) != INTEGER_CST)
>     return false;
> 
> This code requires both sides are constant.
Actually it requires an IV with constant base.

[Bug tree-optimization/101145] niter analysis fails for until-wrap condition

2021-06-24 Thread amker at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101145

--- Comment #2 from bin cheng  ---
(In reply to Richard Biener from comment #1)
> This comes up with a pending patch to split loops like
> 
> void
> foo (int *a, int *b, unsigned l, unsigned n)
> {
>   while (++l != n)
>     a[l] = b[l] + 1;
> }
> 
> into
> 
>   while (++l > n)
>     a[l] = b[l] + 1;
>   while (++l < n)
>     a[l] = b[l] + 1;
> 
> since for the second loop (the "usual" case involving no wrapping of the IV)
> this results in affine IVs and thus analyzable data dependence.

Special cases like "i++ > constant" are handled in the function
adjust_cond_for_loop_until_wrap; however, it only handles a constant invariant
on the other side right now.

Will see how to cover simple cases as reported here.
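
The split of a '!=' loop into an until-wrap phase and a plain counting phase
can be checked exhaustively on a narrow type.  This is a sketch under
assumptions: it uses an 8-bit IV so all inputs can be enumerated, the function
names are hypothetical, and the second phase reuses the already-incremented
IV rather than incrementing again.

```c
#include <assert.h>
#include <stdint.h>

/* Order-sensitive hash of the IV values visited by
   `while (++l != n) body(l);` with an 8-bit IV. */
unsigned visit_ne(uint8_t l, uint8_t n)
{
    unsigned sum = 0;
    while (++l != n)
        sum = sum * 31u + l;
    return sum;
}

/* The same traversal, split into two phases. */
unsigned visit_split(uint8_t l, uint8_t n)
{
    unsigned sum = 0;
    /* Phase 1: the IV runs up to the type maximum and wraps to 0. */
    while (++l > n)
        sum = sum * 31u + l;
    /* Phase 2: l is already incremented; count up to n without wrapping. */
    for (; l < n; ++l)
        sum = sum * 31u + l;
    return sum;
}

/* Compare both versions for every start value and bound. */
void check_split(void)
{
    for (unsigned l = 0; l < 256; l++)
        for (unsigned n = 0; n < 256; n++)
            assert(visit_ne((uint8_t)l, (uint8_t)n)
                   == visit_split((uint8_t)l, (uint8_t)n));
}
```

Only the first phase involves wrapping; the second phase has an affine IV
with an ordinary '<' exit test.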

[Bug tree-optimization/101173] [9/10/11/12 Regression] wrong code at -O3 on x86_64-linux-gnu

2021-06-23 Thread amker at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101173

--- Comment #5 from bin cheng  ---
(In reply to Richard Biener from comment #3)
> So we're exchanging the inner two loops
> 
>   a[1][3] = 8;
>   for (int b = 1; b <= 5; b++)
>     for (int d = 0; d <= 5; d++)
>       for (c = 0; c <= 5; c++)
>         a[b][c] = a[b][c + 2] & 216;
> 
> to
> 
>   a[1][3] = 8;
>   for (int b = 1; b <= 5; b++)
>     for (c = 0; c <= 5; c++)
>       for (int d = 0; d <= 5; d++)
>         a[b][c] = a[b][c + 2] & 216;
> 
> but that looks wrong from a dependence analysis perspective.  We have
> 
> (compute_affine_dependence
>   ref_a: a[b_33][_1], stmt_a: _2 = a[b_33][_1];
>   ref_b: a[b_33][c.3_32], stmt_b: a[b_33][c.3_32] = _3;
> (analyze_overlapping_iterations
>   (chrec_a = {2, +, 1}_5)
>   (chrec_b = {0, +, 1}_5)
> (analyze_siv_subscript
> (analyze_subscript_affine_affine
>   (overlaps_a = [0 + 1 * x_1])
>   (overlaps_b = [2 + 1 * x_1]))
> )
>   (overlap_iterations_a = [0 + 1 * x_1])
>   (overlap_iterations_b = [2 + 1 * x_1]))
> (analyze_overlapping_iterations
>   (chrec_a = {1, +, 1}_1)
>   (chrec_b = {1, +, 1}_1)
>   (overlap_iterations_a = [0])
>   (overlap_iterations_b = [0]))
> (analyze_overlapping_iterations
>   (chrec_a = {0, +, 1}_5)
>   (chrec_b = {2, +, 1}_5)
> (analyze_siv_subscript
> (analyze_subscript_affine_affine
>   (overlaps_a = [2 + 1 * x_1])
>   (overlaps_b = [0 + 1 * x_1]))
> )
>   (overlap_iterations_a = [2 + 1 * x_1])
>   (overlap_iterations_b = [0 + 1 * x_1]))
> (analyze_overlapping_iterations
>   (chrec_a = {1, +, 1}_1)
>   (chrec_b = {1, +, 1}_1)
>   (overlap_iterations_a = [0])
>   (overlap_iterations_b = [0]))
> (build_classic_dist_vector
>   dist_vector = (  0   0   2
>   )
> )
> )
> 
> I don't see anything wrong with that at a first glance so the bug must be in
> tree_loop_interchange::valid_data_dependences it checks
> 
>   /* Be conservative, skip case if either direction at i_idx/o_idx
>      levels is not '=' or '<'.  */
>   if (dist_vect[i_idx] < 0 || dist_vect[o_idx] < 0)
>     return false;
> 
> dist_vect is [0 0 2], i_idx 2 and o_idx 1 but I think that dist_vect[o_idx]
> should exclude zero, thus
> 
> diff --git a/gcc/gimple-loop-interchange.cc b/gcc/gimple-loop-interchange.cc
> index f45b9364644..265e36c48d4 100644
> --- a/gcc/gimple-loop-interchange.cc
> +++ b/gcc/gimple-loop-interchange.cc
> @@ -1043,8 +1043,8 @@ tree_loop_interchange::valid_data_dependences
> (unsigned i_idx, unsigned o_idx,
> continue;
>  
>   /* Be conservative, skip case if either direction at i_idx/o_idx
> -levels is not '=' or '<'.  */
> - if (dist_vect[i_idx] < 0 || dist_vect[o_idx] < 0)
> +levels is not '=' (for the inner loop) or '<'.  */
> + if (dist_vect[i_idx] < 0 || dist_vect[o_idx] <= 0)
> return false;
> }
>  }
> 
> Bin - does this analysis look sound?

Hi Richard,
Thanks very much for helping on this.  Sorry I would need a bit more time to
answer this question.  Thanks again.
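
That the two loop orders quoted above really compute different values can be
confirmed by running both.  A small sketch, assuming array bounds of 6 x 8
(the real testcase's declaration is not shown in this thread) and a local
'c'; the function names are hypothetical.

```c
#include <assert.h>
#include <string.h>

enum { ROWS = 6, COLS = 8 };   /* hypothetical bounds, large enough for c + 2 */

/* Loop order as written in the comment: b, then d, then c. */
int run_original(void)
{
    int a[ROWS][COLS];
    memset(a, 0, sizeof a);
    a[1][3] = 8;
    for (int b = 1; b <= 5; b++)
        for (int d = 0; d <= 5; d++)
            for (int c = 0; c <= 5; c++)
                a[b][c] = a[b][c + 2] & 216;
    return a[1][1];
}

/* Inner two loops exchanged: b, then c, then d. */
int run_interchanged(void)
{
    int a[ROWS][COLS];
    memset(a, 0, sizeof a);
    a[1][3] = 8;
    for (int b = 1; b <= 5; b++)
        for (int c = 0; c <= 5; c++)
            for (int d = 0; d <= 5; d++)
                a[b][c] = a[b][c + 2] & 216;
    return a[1][1];
}

void check_interchange(void)
{
    /* Original: a[1][3] is cleared during d == 0, so the d == 1 pass
       overwrites a[1][1] with 0. */
    assert(run_original() == 0);
    /* Interchanged: every store to a[1][1] happens before a[1][3] is
       cleared, so the value 8 survives. */
    assert(run_interchanged() == 8);
}
```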

[Bug tree-optimization/100740] [9/10/11/12 Regression] wrong code at -O1 and above on x86_64-linux-gnu since r9-4145

2021-05-24 Thread amker at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100740

bin cheng  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |amker at gcc dot gnu.org

--- Comment #2 from bin cheng  ---
mine.  Sorry for the breakage.

[Bug tree-optimization/100499] Different results with -fpeel-loops -ftree-loop-vectorize options

2021-05-19 Thread amker at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100499

--- Comment #19 from bin cheng  ---
(In reply to bin cheng from comment #18)
> Did some experiments, there are two fallouts after explicitly returning
> false for unsigned/wrapping types in MULT_EXPR/MINUS_EXPR/PLUS_EXPR.  One is
> the mentioned use of multiple_of_p in number_of_iterations_ne, the other is
> for alignment warning in stor-layout.c.  As pointed out, the latter case is
> known not to overflow/wrap.
> 
> So I am thinking of introducing an additional parameter indicating that the
> caller knows "top" doesn't overflow/wrap, otherwise, try to get rid of the
> undocumented assumption.  We can always improve the accuracy using ranger or
> other tools.  Not sure if this is the right way to do it.
> 
> As for MULT_NO_OVERFLOW/PLUS_NO_OVERFLOW, IMHO, it's not that simple?  For
> example, unsigned_num(multiple of 4, and larger than 0) + 0xfffc is
> multiple of 4, but it's overflow behavior on which we rely here.

Hmm, 4 is special and not a correct example.  Consider:
  n (unsigned, multiple of 3, and > 0) + 0xfffd
It's a multiple of 3, but we need to rely on wrapping to get the answer.
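
A quick check of the addition example, taking the constant as written and
assuming a 16-bit wrapping type so that it actually wraps (the width and the
helper names are assumptions for illustration):

```c
#include <assert.h>
#include <stdint.h>

/* n + k in a wrapping 16-bit unsigned type. */
uint16_t add_wrap(uint16_t n, uint16_t k)
{
    return (uint16_t)(n + k);
}

void check_add(void)
{
    /* The constant itself is not a multiple of 3 ... */
    assert(0xfffdu % 3u == 1);
    for (unsigned n = 3; n <= 0xfffcu; n += 3) {
        /* ... so without wrapping the sum is not a multiple of 3 either, */
        assert((n + 0xfffdu) % 3u == 1);
        /* but the wrapped sum equals n - 3, which is. */
        assert(add_wrap((uint16_t)n, 0xfffd) == n - 3);
        assert(add_wrap((uint16_t)n, 0xfffd) % 3 == 0);
    }
    /* "4 is special": the modulus 65536 is itself a multiple of 4, so
       wrapping never changes divisibility by 4. */
    assert(0xfffcu % 4u == 0 && 65536u % 4u == 0);
}
```

So a flag that merely says "this PLUS_EXPR does not overflow" would lose this
case, which is the point being made above.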

[Bug tree-optimization/100499] Different results with -fpeel-loops -ftree-loop-vectorize options

2021-05-19 Thread amker at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100499

--- Comment #18 from bin cheng  ---
Did some experiments, there are two fallouts after explicitly returning false
for unsigned/wrapping types in MULT_EXPR/MINUS_EXPR/PLUS_EXPR.  One is the
mentioned use of multiple_of_p in number_of_iterations_ne, the other is for
alignment warning in stor-layout.c.  As pointed out, the latter case is known
not to overflow/wrap.

So I am thinking of introducing an additional parameter indicating that the
caller knows "top" doesn't overflow/wrap, otherwise, try to get rid of the
undocumented assumption.  We can always improve the accuracy using ranger or
other tools.  Not sure if this is the right way to do it.

As for MULT_NO_OVERFLOW/PLUS_NO_OVERFLOW, IMHO, it's not that simple?  For
example, unsigned_num(multiple of 4, and larger than 0) + 0xfffc is
multiple of 4, but it's overflow behavior on which we rely here.

[Bug tree-optimization/100499] Different results with -fpeel-loops -ftree-loop-vectorize options

2021-05-18 Thread amker at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100499

--- Comment #14 from bin cheng  ---
(In reply to Richard Biener from comment #12)
> So in number_of_iterations_ne it looks like the step 's' is always constant
> which makes me wonder if we can somehow use ranger to tell multiple_of_p
> (type, c, s)
> or at least whether, if c is x * s, the multiplication could have overflowed?

Yeah, I am looking at whether "multiple of" can feasibly be checked in niter
analysis, with the help of some basic information from multiple_of_p.

BTW, I have not been following the changes in "ranger".  How should I use it in
the analysis?  Is it similar to value range info?

Thanks

[Bug tree-optimization/100499] Different results with -fpeel-loops -ftree-loop-vectorize options

2021-05-18 Thread amker at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100499

--- Comment #13 from bin cheng  ---
(In reply to Richard Biener from comment #10)
> (In reply to bin cheng from comment #9)
> > Seems we have a long standing bug in fold-const.c:multiple_of_p in case of
> > wrapping types.  Take unsigned int as an example:
> >   (0xfffc * 0x3) % 0x3 = 0x1
> > But multiple_of_p returns true here.
> > 
> > The same issue also stands for MINUS_EXPR and PLUS_EXPR.  Given
> > multiple_of_p is used elsewhere, the fix might break existing optimizations.
> > Especially, number of loop iterations is computed in unsigned types
> 
> multiple_of_p is mostly used in contexts where overflow "cannot happen"
> (in TYPE/DECL_SIZE computation context), and in niter analysis it seems to
> be guarded similarly.  This restriction of multiple_of_p seems undocumented,
Oh, I was not aware of this.  Actually, my previous change to it seems to have
broken this assumption already.  Will see how to fix or revert the change.

> so fixing that might be good.
> 
> Now, you don't say what's the chain of events that lead to a multiple_of_p
> call
> eventually leading to the wrong answer, but I guess it's the code added
> under the
> 
> +  if (!niter->control.no_overflow
> +  && (integer_onep (s) || multiple_of_p (type, c, s)))
> 
> check as !niter->control.no_overflow seems to suggest that the multiple_of_p
> check is not properly guarded?

[Bug tree-optimization/90078] [9 Regression] ICE with deep templates caused by overflow

2021-05-17 Thread amker at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90078

--- Comment #19 from bin cheng  ---
I will check if the latter fix can be easily backported to GCC-9.

[Bug tree-optimization/100499] Different results with -fpeel-loops -ftree-loop-vectorize options

2021-05-16 Thread amker at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100499

--- Comment #9 from bin cheng  ---
Seems we have a long standing bug in fold-const.c:multiple_of_p in case of
wrapping types.  Take unsigned int as an example:
  (0xfffc * 0x3) % 0x3 = 0x1
But multiple_of_p returns true here.

The same issue also applies to MINUS_EXPR and PLUS_EXPR.  Given multiple_of_p
is used elsewhere, the fix might break existing optimizations.  In particular,
the number of loop iterations is computed in unsigned types.
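
The multiplication example can be checked directly.  The sketch below takes
the constant as printed and assumes a 16-bit wrapping type so that the
product wraps; naive_product_multiple_of is a toy stand-in for a purely
structural check and is NOT GCC's multiple_of_p.

```c
#include <assert.h>
#include <stdint.h>

/* a * b in a wrapping 16-bit unsigned type. */
uint16_t mul_wrap(uint16_t a, uint16_t b)
{
    return (uint16_t)((unsigned)a * b);
}

/* Toy structural rule (hypothetical): a product is declared a multiple
   of m as soon as one factor is.  Sound only if the product cannot wrap. */
int naive_product_multiple_of(uint16_t a, uint16_t b, uint16_t m)
{
    return a % m == 0 || b % m == 0;
}

void check_mul(void)
{
    uint16_t p = mul_wrap(0xfffc, 0x3);
    assert(p == 0xfff4);                 /* 0x2fff4 truncated to 16 bits */
    assert(p % 3 == 1);                  /* not a multiple of 3 any more */
    assert(naive_product_multiple_of(0xfffc, 0x3, 3) == 1);  /* rule says yes */
}
```

The structural rule and the actual wrapped value disagree, which is the kind
of wrong answer described above.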

[Bug tree-optimization/100499] Different results with -fpeel-loops -ftree-loop-vectorize options

2021-05-11 Thread amker at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100499

bin cheng  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |amker at gcc dot gnu.org

--- Comment #7 from bin cheng  ---
(In reply to Martin Liška from comment #4)
> (In reply to Martin Liška from comment #3)
> > But expected result is end g_2823 = 32768, right?
> > Clang returns the same result 32768.
> 
> Which regresses since r7-2373-g69b806f6a60efcf1.

Hmm, that was a fix long ago.  Will investigate this.  Sorry for the breakage.

[Bug tree-optimization/98736] [10/11 Regression] Wrong partition order generated in loop distribution pass since r10-619-g5879ab5fafedc8f6

2021-04-06 Thread amker at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98736

--- Comment #6 from bin cheng  ---
Shall this be backported to 10/11 later? Thanks.

[Bug tree-optimization/95638] [10 Regression] Legit-looking code doesn't work with -O2

2021-03-26 Thread amker at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95638

bin cheng  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
  Known to work||10.2.0
 Resolution|--- |FIXED

--- Comment #15 from bin cheng  ---
Confirmed fixed for 10.2.0 also.  Closing.

[Bug tree-optimization/98736] [10/11 Regression] Wrong partition order generated in loop distribution pass since r10-619-g5879ab5fafedc8f6

2021-03-20 Thread amker at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98736

--- Comment #4 from bin cheng  ---
(In reply to bin cheng from comment #3)
> hmm, seems topological order isn't enough for distributing a loop nest, we
> need topological order plus inner loop depth-first.

Well, not really.  In this case, the problem is that the rev-post order
algorithm puts "a[c] = d[3];" before the inner loop, which violates the
original program order.

It seems this can be fixed by using inner-loop depth-first order w.r.t. how we
distribute the inner loop, but I am not sure whether this always preserves
program order, because the loop has been transformed by various optimizers.

[Bug tree-optimization/99067] Missed optimization for induction variable elimination

2021-02-17 Thread amker at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99067

--- Comment #3 from bin cheng  ---
Though I am not sure whether the underlying root causes are the same, I think
these are two different issues; at least, they are handled by different parts
of the code in IVOPTs.
For the first one, it's a known issue in GCC: IV elimination is complicated yet
has been quite conservative for a long time.  For the second one, we indeed
don't know whether "i*N+j" wraps or not, though we might be able to improve
IVOPTs under the condition of wrapping behavior.

[Bug tree-optimization/98736] [10/11 Regression] Wrong partition order generated in loop distribution pass since r10-619-g5879ab5fafedc8f6

2021-02-17 Thread amker at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98736

--- Comment #3 from bin cheng  ---
hmm, seems topological order isn't enough for distributing a loop nest, we need
topological order plus inner loop depth-first.

[Bug tree-optimization/99067] Missed optimization for induction variable elimination

2021-02-16 Thread amker at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99067

bin cheng  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |amker at gcc dot gnu.org

--- Comment #2 from bin cheng  ---
Mine, will have a look.  Thanks for reporting.

[Bug c++/97627] [9/10/11 Regression] loop end condition missing - endless loop with -fPIC

2021-01-25 Thread amker at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97627

--- Comment #12 from bin cheng  ---
a. Why is the loop considered infinite?
b. Do we need to skip fake exit edges in niter analysis?

[Bug c++/97627] [9/10/11 Regression] loop end condition missing - endless loop with -fPIC

2021-01-25 Thread amker at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97627

--- Comment #11 from bin cheng  ---
(In reply to bin cheng from comment #10)
> hmm,
> For below basic block:
> ;;   basic block 4, loop depth 2, maybe hot
> ;;    prev block 3, next block 9, flags: (NEW, VISITED)
> ;;    pred:       3 (FALLTHRU,EXECUTABLE)
> ;;                7 (FALLTHRU,DFS_BACK,EXECUTABLE)
>   # RANGE [0, 2147483647] NONZERO 2147483647
>   # c_5 = PHI <0(3), c_17(7)>
>   # .MEM_8 = PHI <.MEM_7(3), .MEM_9(7)>
>   if (_2 < c_5)
>     goto ; [INV]
>   else
>     goto ; [INV]
> ;;    succ:       8 (TRUE_VALUE,EXECUTABLE)
> ;;                9 (FALSE_VALUE,EXECUTABLE)
> 
> Code in :
>   basic_block *body = get_loop_body (loop);
>   exits = get_loop_exit_edges (loop, body);
>   likely_exit = single_likely_exit (loop, exits);
>   FOR_EACH_VEC_ELT (exits, i, ex)
>     {
>       if (ex == likely_exit)
>         {
>           gimple *stmt = last_stmt (ex->src);
>           if (stmt != NULL)
>             {
> 
> gets three exit edges, one of which is  bb1>, as a result, 0 niter is
> computed for this exit in function number_of_iterations_exit_assumptions. 
> This seems strange, is it a fake edge added for some reason?
> 
> Thanks

Right, it's added by connect_infinite_loops_to_exit.

[Bug c++/97627] [9/10/11 Regression] loop end condition missing - endless loop with -fPIC

2021-01-25 Thread amker at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97627

--- Comment #10 from bin cheng  ---
hmm,
For below basic block:
;;   basic block 4, loop depth 2, maybe hot
;;    prev block 3, next block 9, flags: (NEW, VISITED)
;;    pred:       3 (FALLTHRU,EXECUTABLE)
;;                7 (FALLTHRU,DFS_BACK,EXECUTABLE)
  # RANGE [0, 2147483647] NONZERO 2147483647
  # c_5 = PHI <0(3), c_17(7)>
  # .MEM_8 = PHI <.MEM_7(3), .MEM_9(7)>
  if (_2 < c_5)
    goto ; [INV]
  else
    goto ; [INV]
;;    succ:       8 (TRUE_VALUE,EXECUTABLE)
;;                9 (FALSE_VALUE,EXECUTABLE)

Code in :
  basic_block *body = get_loop_body (loop);
  exits = get_loop_exit_edges (loop, body);
  likely_exit = single_likely_exit (loop, exits);
  FOR_EACH_VEC_ELT (exits, i, ex)
    {
      if (ex == likely_exit)
        {
          gimple *stmt = last_stmt (ex->src);
          if (stmt != NULL)
            {

gets three exit edges, one of which is  bb1>, as a result, 0 niter is
computed for this exit in function number_of_iterations_exit_assumptions.  This
seems strange, is it a fake edge added for some reason?

Thanks

[Bug c++/97627] [9/10/11 Regression] loop end condition missing - endless loop with -fPIC

2021-01-21 Thread amker at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97627

--- Comment #9 from bin cheng  ---
(In reply to Jakub Jelinek from comment #8)
> Still broken on current 10 branch, as written works fine on the trunk due to
> the C++ FE loop changes.
> Bin, did you have time to look into this yet?

I am very sorry, it seems I have two correctness PRs now.  Will try to
investigate these this weekend.

[Bug tree-optimization/98598] Missed opportunity to optimize dependent loads in loops

2021-01-08 Thread amker at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98598

--- Comment #4 from bin cheng  ---
Didn't go deep into the case.

For the simple cases taken as examples, it's possible to interchange the two
loops, thus enabling loop invariant code motion.  Though loop interchange may
fail because of complicated data dependences, we may take some useful points
from it, for example, the cost model checking new loop invariants w.r.t. the
outer loop.

[Bug c++/97627] [9/10/11 Regression] loop end condition missing - endless loop with -fPIC

2020-10-29 Thread amker at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97627

--- Comment #5 from bin cheng  ---
(In reply to Jakub Jelinek from comment #3)
> Started with r9-4145-ga81e2c6240655f60a49c16e0d8bbfd2ba40bba51

Sorry for the breakage.  Will fix this.

[Bug tree-optimization/78427] missed optimization of loop condition

2020-09-27 Thread amker at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78427

bin cheng  changed:

   What|Removed |Added

 CC||amker at gcc dot gnu.org

--- Comment #5 from bin cheng  ---
(In reply to Antony Polukhin from comment #4)
> Any progress?

Oh, I missed this one.  Will try to find time later.

Thanks

[Bug target/96201] x86 movsd/movsq string instructions and alignment inference

2020-09-15 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96201

bin cheng  changed:

   What|Removed |Added

 CC||amker at gcc dot gnu.org

--- Comment #2 from bin cheng  ---
The reason is that memory references in f3 are not identified as address type
IV uses.  I don't remember the details, but it's intended by the commit below:
commit 653a4b32fe72e33bfd4cdd4c25493049524a3805
Author: Bin Cheng 
Date:   Thu Mar 2 11:25:11 2017 +

re PR tree-optimization/66768 (address space gets lost on literal pointer)

PR tree-optimization/66768
* tree-ssa-loop-ivopts.c (find_interesting_uses_address): Skip addr
iv_use if base object can't be determined.

gcc/testsuite
* gcc.target/i386/pr66768.c: New test.

From-SVN: r245837

For f1/f2, IVOPTs fails to identify the base object because the pointers are
converted from integers.  We need to tell the difference better.

For f3, __builtin_assume_aligned is optimized away by GCC-10 before IVOPTs.

[Bug tree-optimization/95638] [10 Regression] Legit-looking code doesn't work with -O2

2020-07-23 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95638

--- Comment #14 from bin cheng  ---
(In reply to Richard Biener from comment #13)
> GCC 10.2 is released, adjusting target milestone.

Hmm, this should be fixed on GCC10/GCC9.  I backported PR95638/PR95804
separately using cherry-pick, so the backport information for the latter PR is
not reflected here.

[Bug rtl-optimization/96031] suboptimal codegen for store low 16-bits value

2020-07-19 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96031

--- Comment #2 from bin cheng  ---
Interesting case, I see two issues in generated asm.  One is the unnecessary
bitwise and, the other is allocating different registers for induction variable
and the base address.  However, looks like neither issue is caused by ivopts. 
Check the dump:

[local count: 105119324]:
  _12 = (short unsigned int) step_8(D);
  ivtmp.10_11 = (unsigned long) 
  _18 = len_7(D) + 4294967294;
  _19 = (unsigned long) _18;
  _20 = _19 * 2;
  _21 = (unsigned long) 
  _22 = _21 + 2;
  _23 = _20 + _22;

[local count: 955630224]:
  # ivtmp.8_15 = PHI <_12(4), ivtmp.8_5(6)>
  # ivtmp.10_16 = PHI 
  _3 = ivtmp.8_15;
  _2 = (void *) ivtmp.10_16;
  MEM[base: _2, offset: 2B] = _3;
  ivtmp.8_5 = ivtmp.8_15 + _12;
  ivtmp.10_4 = ivtmp.10_16 + 2;
  if (ivtmp.10_4 != _23)
    goto ; [89.00%]
  else
    goto ; [11.00%]

[local count: 105119324]:
  goto ; [100.00%]

[local count: 850510900]:
  goto ; [100.00%]

As far as I can tell, it's optimal.

The register allocation issue is introduced by RTL PRE; apparently we should
not save the add 2 instruction in the last iteration at the cost of a false
dependence, which is more harmful.

As for IVOPTs, I can see a minor improvement from replacing the != exit
condition with <=, thus saving the add 2 instruction computing _22, which
happens to "disable" the wrong PRE transformation.

Ah, I see it's already classified as rtl-optimization.

Thanks

[Bug tree-optimization/95804] [11 Regression] ICE in generate_code_for_partition, at tree-loop-distribution.c:1323 since r11-1565-g2c0069fafb53ccb7

2020-07-09 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95804

--- Comment #11 from bin cheng  ---
(In reply to Richard Biener from comment #8)
> Fixed - note it needs to be backported when the PR95638 fix is backported.

I backported PR95638/PR95804 to the GCC-10/GCC-9 branches.  However, it is
unnecessary to backport to GCC-8, because the starting issue (pr94125) is not
exposed on it.

[Bug tree-optimization/95804] [11 Regression] ICE in generate_code_for_partition, at tree-loop-distribution.c:1323 since r11-1565-g2c0069fafb53ccb7

2020-07-08 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95804

--- Comment #6 from bin cheng  ---
(In reply to Martin Liška from comment #5)
> @Bin: Any news about this?

Patch is approved, will apply soon.  Thanks

[Bug tree-optimization/95638] [10/11 Regression] Legit-looking code doesn't work with -O2

2020-06-29 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95638

--- Comment #9 from bin cheng  ---
(In reply to Jakub Jelinek from comment #8)
> So fixed on the trunk, waiting for 10 backport?

Sorry, https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95804 is also in this part
which I believe is related to this fix.  Will backport the full patch after
fixing 95804.

Thanks

[Bug tree-optimization/95804] ice in generate_code_for_partition, at tree-loop-distribution.c:1323

2020-06-22 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95804

--- Comment #2 from bin cheng  ---
(In reply to Richard Biener from comment #1)
> Confirmed.  We seem to end up with a reduction partition not in the last
> position thus miss some required partition merging.
Sorry for the breakage.

Whew, this part IS a can of worms.  Will investigate it.

[Bug tree-optimization/94969] [8/10 Regression] Invalid loop distribution since r8-2390-gdfbddbeb1ca912c9

2020-06-17 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94969

--- Comment #16 from bin cheng  ---
(In reply to Richard Biener from comment #15)
> I don't see the commit on the GCC 10 branch nor the GCC 8 branch.  Master
> and GCC 9 are fixed though.

Will backport to 10 and 8, thanks for the reminder.

[Bug c++/95638] [10/11 Regression] Legit-looking code doesn't work with -O2

2020-06-13 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95638

--- Comment #6 from bin cheng  ---
We call graphds_scc twice to break alias dependence, with alias dependence
edges skipped in the second call.  The code (both before and after
r10-7184-ge4e9a59105a81cdd6c1328b0a5ed9fe4cc82840e) tries to rectify post order
information after the second call, however it never gets it right.  Actually I
don't think it can be easily rectified (if possible).

Will test another patch which records/restores post order information for the
second call.

[Bug c++/95638] Legit-looking code doesn't work with -O2

2020-06-11 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95638

--- Comment #5 from bin cheng  ---
(In reply to Jakub Jelinek from comment #1)
> All I can say is that bisection shows (at least when preprocessed with g++
> 8.3.1 first) that this changed behavior in
> r10-7184-ge4e9a59105a81cdd6c1328b0a5ed9fe4cc82840e
> No time to analyze if it is a bug in the code or on the GCC side.
> CCing patch author.

Thanks for ccing, I will look into it this weekend.

[Bug tree-optimization/95199] Remove extra variable created for memory reference in loop vectorization.

2020-05-22 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95199

--- Comment #7 from bin cheng  ---
(In reply to rguent...@suse.de from comment #6)
> On Thu, 21 May 2020, zhoukaipeng3 at huawei dot com wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95199
> > 
> > --- Comment #4 from Kaipeng Zhou  ---
> > Sorry for not expressing clearly.
> > 
> > I have debugged the testcase you provided.  Not eliminating them is not 
> > caused
> > by IFN.  The relevant code is in the "get_computation_aff_1" function.
> > 
> > In IVOPTs the IV_STEPs must be checked by function "constant_multiple_of"
> > before using an IV variable to eliminate the other.  But if the tree_code of
> > input IV_STEP is SSA_NAME, the function will return false.  In your 
> > testcase,
> > the tree_code of IV_STEP is MULT_EXPR, so it return true.
> > 
> > Gimple for my testcase:
> >[local count: 8589933]:
> >   _83 = (sizetype) inc_y_22(D);
> >   _84 = _83 * POLY_INT_CST [16, 16];
> >   _85 = (long unsigned int) inc_y_22(D);
> >   _86 = _85 * 8;
> >   _87 = (ssizetype) _86;
> >   _88 = _87 /[ex] 8;
> >   _89 = (long unsigned int) _88;
> >   _90 = VEC_SERIES_EXPR <0, _89>;
> >   vect_cst__95 = [vec_duplicate_expr] m_17(D);
> >   _97 = (sizetype) inc_x_20(D);
> >   _98 = _97 * POLY_INT_CST [16, 16];
> >   _99 = (long unsigned int) inc_x_20(D);
> >   _100 = _99 * 8;
> >   _101 = (ssizetype) _100;
> >   _102 = _101 /[ex] 8;
> >   _103 = (long unsigned int) _102;
> >   _104 = VEC_SERIES_EXPR <0, _103>;
> >   _109 = (sizetype) inc_x_20(D);
> >   _110 = _109 * POLY_INT_CST [16, 16];
> >   _111 = (long unsigned int) inc_x_20(D);
> 
> The issue is you have two copies of
> (sizetype) inc_x_20(D) * POLY_INT_CST [16, 16];
> and IVOPTs does not perform CSE.  vinfo->ivexpr_map is supposed to
> catch those "IV base and/or step expressions".  So look where
> they are inserted and check the CSE map is used.  Alternatively
> fixup hashing/comparing to handle POLY_INT_CST [16, 16] if that
> is the reason for the missed CSE.
> 
Yes, it's because cse_and_gimplify_to_preheader is not called for
gathering/scattering.  It should be easily fixed by the following patch:

diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
index e7822c44951..ba9ee5c4996 100644
--- a/gcc/tree-vect-stmts.c
+++ b/gcc/tree-vect-stmts.c
@@ -2961,6 +2961,7 @@ vect_get_strided_load_store_ops (stmt_vec_info stmt_info,
   tree bump = size_binop (MULT_EXPR,
  fold_convert (sizetype, unshare_expr (DR_STEP (dr))),
  size_int (TYPE_VECTOR_SUBPARTS (vectype)));
+  bump = cse_and_gimplify_to_preheader (loop_vinfo, bump);
   *dataref_bump = force_gimple_operand (bump, &stmts, true, NULL_TREE);
   if (stmts)
 gsi_insert_seq_on_edge_immediate (loop_preheader_edge (loop), stmts);

[Bug tree-optimization/95199] Remove extra variable created for memory reference in loop vectorization.

2020-05-21 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95199

--- Comment #5 from bin cheng  ---
(In reply to Richard Biener from comment #1)
> But IVOPTs is supposed to know how to eliminate equal IVs.  Maybe it's
> confused
> about the IFN uses?

It's a known issue that IVOPTs has difficulty recognizing equal BASEs.  For
now it tries to identify/eliminate them with limited expansion work, which
isn't enough for complicated cases.  I sent a patch to do IVOPTs a favor in
vectorization, but didn't follow up.

Without digging into the code, I am not sure if this is a similar issue.  Will
have a look this weekend.

Thanks

[Bug tree-optimization/94969] [8/9/10/11 Regression] Invalid loop distribution since r8-2390-gdfbddbeb1ca912c9

2020-05-17 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94969

--- Comment #10 from bin cheng  ---
Hi, should I backport this and PR95110 to the branches?  Thanks

[Bug tree-optimization/95019] Optimizer produces suboptimal code related to -ftree-ivopts

2020-05-13 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95019

--- Comment #3 from bin cheng  ---
(In reply to zhongyu...@tom.com from comment #2)
> It is a generic issue for all targets, such as x86, it also don't enpand
Yes, as said, it's because SCEV currently doesn't model this, so it's not
target specific.

> IVOPTs as index is not used for DEST and Src directly. we may need expand
Yes, extending IVOPTs to handle this case (and cases from other PRs) seems
promising.
Anyway, a patch is welcome, and I can do the review.

Thanks,
> IVOPTs, then different targets can select different one according their Cost
> model.
> Now, it seems ok for x86 as it have load/store insns folded the lshift
> operand, so it doesn't need separate lshift operand in loop body .
> 
> == base on the ARM gcc 9.2.1 on https://gcc.godbolt.org, You'll get
> separate lshift operand lsl in loop kernel, and ARM64 gcc 8.2 will use ldr  
> x3, [x1, x4, lsl 3] to avoid the separate lshift operand. so we can see all
> target dont select an IV with Step 8. 
> C0ADA(unsigned long long, long long*, long long*):
>         push    {r4, r5, r6, r7, lr}    @
>         mov     r4, r0                  @ len, tmp135
>         mov     r5, r1                  @ len, tmp136
>         orrs    r1, r4, r5              @ tmp137, len
>         beq     .L1                     @,
>         mov     r1, #0                  @ C05A1,
> .L3:
>         lsl     r0, r1, #3              @ _2, C05A1,
>         add     ip, r2, r1, lsl #3      @ tmp120, Src, C05A1,
>         ldr     lr, [r2, r0]            @ _4, *_3
>         ldr     ip, [ip, #4]            @ _4, *_3
>         umull   r6, r7, lr, lr          @ tmp125, _4, _4
>         mul     ip, lr, ip              @ tmp122, _4, tmp122
>         adds    r1, r1, r4              @ C05A1, C05A1, len
>         subs    r4, r4, #1              @ len, len,
>         sbc     r5, r5, #0              @ len, len,
>         add     r0, r3, r0              @ tmp121, Dest, _2
>         add     r7, r7, ip, lsl #1      @,, tmp122,
>         orrs    lr, r4, r5              @ tmp138, len
>         stm     r0, {r6-r7}             @ *_5, tmp125
>         bne     .L3                     @,
> .L1:
>         pop     {r4, r5, r6, r7, lr}    @
>         bx      lr                      @
> 
> Thanks for your notice.

[Bug tree-optimization/95019] Optimizer produces suboptimal code related to -ftree-ivopts

2020-05-12 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95019

--- Comment #1 from bin cheng  ---
Please provide the exact configuration/compilation command lines in the bug
report next time; that saves others time when reproducing, considering I
haven't touched mips for years.

As for this specific issue, note that right now SCEV can't model C05A1, and
thus can't model DEST[C05A1] and Src[C05A1] either, so there is not much IVOPTs
can do with the code in its current shape.

We did discuss extending the pass to handle non-SCEV memory references in
other PRs, but unless that is implemented, I see no easy fix here.

Thanks

[Bug tree-optimization/94969] [8/9/10/11 Regression] Invalid loop distribution since r8-2390-gdfbddbeb1ca912c9

2020-05-10 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94969

--- Comment #8 from bin cheng  ---
The root cause is in build_classic_dist_vector -> constant_access_functions,
which adds a unit distance vector only in the case of constant access functions.
It should cover invariant cases too.  Testing a patch.  Thanks

[Bug tree-optimization/94969] [8/9/10/11 Regression] Invalid loop distribution since r8-2390-gdfbddbeb1ca912c9

2020-05-10 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94969

--- Comment #7 from bin cheng  ---
(In reply to Richard Biener from comment #5)
> So I think the issue is not dependence testing but loop distribution
> accepting a
> zero dependence distance as OK.  Of course dependence analysis is quite
> useless
> here since the accesses are to the same location in every iteration.
> 
> Bin, maybe you can share your thoughts on this issue?
> 
> The testcase doesn't need bitfields - those just disable the cost model
> which otherwise prevents the distribution.
> 
> diff --git a/gcc/tree-loop-distribution.c b/gcc/tree-loop-distribution.c
> index 44423215332..ac272d63c3d 100644
> --- a/gcc/tree-loop-distribution.c
> +++ b/gcc/tree-loop-distribution.c
> @@ -2852,6 +2852,7 @@ loop_distribution::finalize_partitions (class loop
> *loop,
>/* Don't distribute current loop into too many loops given we don't have
>   memory stream cost model.  Be even more conservative in case of loop
>   nest distribution.  */
> +#if 0
>if ((same_type_p && num_builtin == 0
> && (loop->inner == NULL || num_normal != 2 || num_partial_memset !=
> 1))
>|| (loop->inner != NULL
> @@ -2867,6 +2868,7 @@ loop_distribution::finalize_partitions (class loop
> *loop,
> }
>partitions->truncate (1);
>  }
> +#endif
>  
>/* Fuse memset builtins if possible.  */
>if (partitions->length () > 1)
> 
> 
> makes the testcase miscompiled even with the : 7 and : 2 commented, so plain
> 
> struct S {
>   signed m;
>   signed e;
> };

I think there is something wrong in data dependence analysis; Richard's
change just exposed it.
Given the loop and data refs below:

for (...) {
  array[loop_invariant] = x;  // ref1
  array[loop_invariant] ^= 1; // ref2
}
There are two output dependences: ref2 (iteration i) -> ref1 (iteration i +
1), and ref1 (iteration i) -> ref2 (iteration i).

It seems to me that the first one is currently missing.  Will dig deeper.
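
To make the two dependences concrete, here is a minimal sketch (not the PR's
testcase) that runs the loop above once in its original form and once split
into two loops, which is what loop distribution would do.  The function names
run_fused and run_distributed, and the one-element array standing in for
array[loop_invariant], are assumptions made for illustration only.

```c
/* Original loop: ref1 and ref2 execute in every iteration, so ref1 of
   iteration i + 1 overwrites what ref2 of iteration i stored.  */
int
run_fused (int n, int x)
{
  int array[1] = { 0 };
  for (int i = 0; i < n; i++)
    {
      array[0] = x;   /* ref1 */
      array[0] ^= 1;  /* ref2 */
    }
  return array[0];
}

/* Distributed form: all ref1 stores run first, then all ref2 updates.
   This ignores the ref2 (iteration i) -> ref1 (iteration i + 1) dependence.  */
int
run_distributed (int n, int x)
{
  int array[1] = { 0 };
  for (int i = 0; i < n; i++)
    array[0] = x;     /* ref1 */
  for (int i = 0; i < n; i++)
    array[0] ^= 1;    /* ref2 */
  return array[0];
}
```

With n = 2 and x = 4, run_fused returns 5 while run_distributed returns 4, so
the split is only valid if the cross-iteration dependence is known to be
absent.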

[Bug tree-optimization/93674] [8/9 Regression] GCC eliminates conditions it should not, when strict-enums is on

2020-04-20 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93674

--- Comment #18 from bin cheng  ---
(In reply to Richard Earnshaw from comment #17)
> Has not been backported yet.

Will do it.  Thanks

[Bug tree-optimization/94125] [9 Regression] wrong code at -O3 on x86_64-linux-gnu

2020-03-18 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94125

--- Comment #11 from bin cheng  ---
(In reply to Richard Biener from comment #10)
> Thanks Bin, fixed on trunk sofar.

Hmm, if it's fine, I will backport this to GCC9.

Thanks

[Bug tree-optimization/94125] [9/10 Regression] wrong code at -O3 on x86_64-linux-gnu

2020-03-15 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94125

--- Comment #7 from bin cheng  ---
Patch at https://gcc.gnu.org/pipermail/gcc-patches/2020-March/542038.html
It's a latent bug exposed by the mentioned alias analysis change, however:


unsigned char b, f;
short d[1][8][1], *g = &d[0][3][0];

int
main ()
{
  int k[] = { 0, 0, 0, 4, 0, 0 };
  for (int c = 2; c >= 0; c--)
    {
      b = f;
      *g = k[c + 3];
      k[c + 1] = 0;
    }
  for (int i = 0; i < 8; i++)
    if (d[0][i][0] != 0)
      __builtin_abort ();
  return 0;
}

We can't tell no-alias info for pairs  and .  Is this expected or
should be improved?

Thanks

[Bug tree-optimization/94125] [9/10 Regression] wrong code at -O3 on x86_64-linux-gnu

2020-03-11 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94125

--- Comment #5 from bin cheng  ---
Thanks for CCing, I will have a look this weekend.

[Bug tree-optimization/93674] [8/9/10 Regression] GCC eliminates conditions it should not, when strict-enums is on

2020-02-27 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93674

bin cheng  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |amker at gcc dot gnu.org

--- Comment #13 from bin cheng  ---
Sorry for missing this.

[Bug tree-optimization/92244] vectorized loop updating 2 copies of the same pointer (for in-place reversal cross in the middle)

2020-01-30 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92244

--- Comment #5 from bin cheng  ---
The vectorizer generates the following address bases:
  _79 = (sizetype) len_6(D);
  _80 = _79 + 18446744073709551600;
  vectp.14_78 = head_7(D) + _80;
  _89 = (sizetype) len_6(D);
  _90 = _89 + 18446744073709551600;
  vectp.20_88 = head_7(D) + _90;
IVOPTS only does limited expansion of the base by calling
expand_simple_operations, which is not enough for this case.  Let me experiment
with aggressive expansion using tree_to_aff_combination_expand.  It should be
able to fix this issue; however, aggressive expansion itself might cause
regressions.
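
As a worked check of the dump above, the sketch below assumes sizetype is 64
bits wide.  Under that assumption the constant 18446744073709551600 is
2^64 - 16, so both bases reduce to head + len - 16.  The names DUMP_OFFSET,
vect_base and vect_base_signed are made up for this illustration.

```c
#include <stdint.h>

/* The offset from the dump, interpreted as a 64-bit unsigned value.  */
#define DUMP_OFFSET UINT64_C (18446744073709551600)

/* Base as written in the dump: head + (len + DUMP_OFFSET), modulo 2^64.  */
uint64_t
vect_base (uint64_t head, uint64_t len)
{
  return head + (len + DUMP_OFFSET);
}

/* The same address written with the offset of -16 that the constant
   encodes in modular arithmetic.  */
uint64_t
vect_base_signed (uint64_t head, uint64_t len)
{
  return head + len - 16;
}
```

Both functions agree for every input, which is the equality IVOPTs would need
to see in order to treat vectp.14_78 and vectp.20_88 as one base.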

[Bug tree-optimization/93334] -O3 generates useless code checking for overlapping memset ?

2020-01-21 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93334

--- Comment #2 from bin cheng  ---
(In reply to Richard Biener from comment #1)
> Confirmed.  The issue is that the overlap would be an issue if the stores
> were using different values like
> 
> void test_simple_code(long l, double* mem, long ofs2) {
> for (long k=0; k<l; k++) {
>   mem[k] = 0.0;
>   mem[ofs2 +k] = 1.0;
> }
> }
> 
> and we're simply not optimizing the case where the write-after-write
> dependence can be ignored because the stored value is always the same.
> I'm also not sure whether that's easy to do ... Bin?

I will check if it can be handled as a special case.  Thanks.

[Bug c++/93143] [10 Regression] Multiple calls to static constexpr member function gives wrong code

2020-01-08 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93143

--- Comment #7 from bin cheng  ---
(In reply to bin cheng from comment #6)
> (In reply to bin cheng from comment #5)
> > (In reply to Martin Sebor from comment #4)
> > > *** Bug 92926 has been marked as a duplicate of this bug. ***
> > 
> > I sent a patch fixing this a
> > https://gcc.gnu.org/ml/gcc-patches/2019-12/msg00920.html
> > The only question is if this one has already fixed by PR93033.
> 
> Sorry, wrong comment.

Hmm, it seems my original comment was not wrong, and this issue still exists.  I
will update the patch.

[Bug c++/93143] [10 Regression] Multiple calls to static constexpr member function gives wrong code

2020-01-08 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93143

--- Comment #6 from bin cheng  ---
(In reply to bin cheng from comment #5)
> (In reply to Martin Sebor from comment #4)
> > *** Bug 92926 has been marked as a duplicate of this bug. ***
> 
> I sent a patch fixing this a
> https://gcc.gnu.org/ml/gcc-patches/2019-12/msg00920.html
> The only question is if this one has already fixed by PR93033.

Sorry, wrong comment.

[Bug c++/93143] [10 Regression] Multiple calls to static constexpr member function gives wrong code

2020-01-08 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93143

--- Comment #5 from bin cheng  ---
(In reply to Martin Sebor from comment #4)
> *** Bug 92926 has been marked as a duplicate of this bug. ***

I sent a patch fixing this at
https://gcc.gnu.org/ml/gcc-patches/2019-12/msg00920.html
The only question is whether this one has already been fixed by PR93033.

[Bug c++/92926] New: Wrong code generated because of shared tree node in gimplify

2019-12-12 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92926

Bug ID: 92926
   Summary: Wrong code generated because of shared tree node in
gimplify
   Product: gcc
   Version: 10.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: amker at gcc dot gnu.org
  Target Milestone: ---

The following code is reduced from cppcoro but is unrelated to coroutines.

#include <cassert>
#include <cstdint>
#include <string>
 class ipv6_address
 {
 public:
  constexpr ipv6_address(
   std::uint16_t part0,
   std::uint16_t part1,
   std::uint16_t part2,
   std::uint16_t part3,
   std::uint16_t part4,
   std::uint16_t part5,
   std::uint16_t part6,
   std::uint16_t part7);

  static constexpr ipv6_address loopback();
  std::string to_string() const;

 private:
  alignas(std::uint64_t) std::uint8_t m_bytes[16];
 };

constexpr ipv6_address::ipv6_address(
  std::uint16_t part0,
  std::uint16_t part1,
  std::uint16_t part2,
  std::uint16_t part3,
  std::uint16_t part4,
  std::uint16_t part5,
  std::uint16_t part6,
  std::uint16_t part7)
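  : m_bytes{
   static_cast<std::uint8_t>(part0 >> 8),
   static_cast<std::uint8_t>(part0),
   static_cast<std::uint8_t>(part1 >> 8),
   static_cast<std::uint8_t>(part1),
   static_cast<std::uint8_t>(part2 >> 8),
   static_cast<std::uint8_t>(part2),
   static_cast<std::uint8_t>(part3 >> 8),
   static_cast<std::uint8_t>(part3),
   static_cast<std::uint8_t>(part4 >> 8),
   static_cast<std::uint8_t>(part4),
   static_cast<std::uint8_t>(part5 >> 8),
   static_cast<std::uint8_t>(part5),
   static_cast<std::uint8_t>(part6 >> 8),
   static_cast<std::uint8_t>(part6),
   static_cast<std::uint8_t>(part7 >> 8),
   static_cast<std::uint8_t>(part7) }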
{}

constexpr ipv6_address ipv6_address::loopback()
{
  return ipv6_address{ 0, 0, 0, 0, 0, 0, 0, 1 };
}

char hex_char(std::uint8_t value)
{
  return value < 10 ?
    static_cast<char>('0' + value) :
    static_cast<char>('a' + value - 10);
}

std::string ipv6_address::to_string() const
{
  std::uint32_t longestZeroRunStart = 0;
  std::uint32_t longestZeroRunLength = 0;
  for (std::uint32_t i = 0; i < 8; )
  {
    if (m_bytes[2 * i] == 0 && m_bytes[2 * i + 1] == 0)
    {
      const std::uint32_t zeroRunStart = i;
      ++i;
      while (i < 8 && m_bytes[2 * i] == 0 && m_bytes[2 * i + 1] == 0)
      {
        ++i;
      }

      std::uint32_t zeroRunLength = i - zeroRunStart;
      if (zeroRunLength > longestZeroRunLength)
      {
        longestZeroRunLength = zeroRunLength;
        longestZeroRunStart = zeroRunStart;
      }
    }
    else
    {
      ++i;
    }
  }

  char buffer[40];

  char* c = &buffer[0];

  auto appendPart = [&](std::uint32_t index)
  {
    const std::uint8_t highByte = m_bytes[index * 2];
    const std::uint8_t lowByte = m_bytes[index * 2 + 1];

    if (highByte > 0 || lowByte > 15)
    {
      if (highByte > 0)
      {
        if (highByte > 15)
        {
          *c++ = hex_char(highByte >> 4);
        }
        *c++ = hex_char(highByte & 0xF);
      }
      *c++ = hex_char(lowByte >> 4);
    }
    *c++ = hex_char(lowByte & 0xF);
  };

  if (longestZeroRunLength >= 2)
  {
    for (std::uint32_t i = 0; i < longestZeroRunStart; ++i)
    {
      if (i > 0)
      {
        *c++ = ':';
      }

      appendPart(i);
    }

    *c++ = ':';
    *c++ = ':';

    for (std::uint32_t i = longestZeroRunStart + longestZeroRunLength; i < 8; ++i)
    {
      appendPart(i);

      if (i < 7)
      {
        *c++ = ':';
      }
    }
  }
  else
  {
    appendPart(0);
    for (std::uint32_t i = 1; i < 8; ++i)
    {
      *c++ = ':';
      appendPart(i);
    }
  }

  assert((c - &buffer[0]) <= sizeof(buffer));

  return std::string{ &buffer[0], c };
}

std::string __attribute__((noinline)) foo ()
{
   return ipv6_address::loopback().to_string();
}

ipv6_address __attribute__((noinline)) bar ()
{
   return ipv6_address::loopback();
}

int main() {
  std::string s = foo ();
  ipv6_address a = bar ();
  assert(a.to_string() == s);
  return 0;
}

Compiling using following command line:
$ ./g++ -std=c++17 -m64 -O3 z.cc -o a.out
$ ./a.out
a.out: z.cc:168: int main(): Assertion `a.to_string() == s' failed.
Aborted

The root cause is that ipv6_address::loopback is a constexpr function: what it
returns is folded into a constant CONSTRUCTOR by the C++ FE, and that
CONSTRUCTOR is shared translation-unit wide via constexpr_call_table.  As a
result the CONSTRUCTOR as well as its vector of elements is shared between foo
and bar.

In gimplify, CONSTRUCTOR_ELTS is optimized and cleared, which changes the
shared node as seen by the other function.

Will send a patch for discussion.
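
The hazard described above can be modeled without any GCC internals.  The
sketch below is only an analogy: struct ctor_node, lower_in_place and unshare
are hypothetical names, not GCC types or functions.  It shows a cached node
whose elements are cleared by the first consumer, so that a second consumer
sees an empty node unless it works on its own copy.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a cached constant initializer node.  */
struct ctor_node
{
  size_t nelts;
  unsigned char elts[16];
};

/* Models a lowering step that consumes the elements and clears them in
   place.  Returns how many elements it saw.  */
size_t
lower_in_place (struct ctor_node *node)
{
  size_t consumed = node->nelts;
  node->nelts = 0;
  return consumed;
}

/* Returns a private copy of NODE, or NULL if allocation fails.  The
   caller frees it.  */
struct ctor_node *
unshare (const struct ctor_node *node)
{
  struct ctor_node *copy = malloc (sizeof *copy);
  if (copy)
    memcpy (copy, node, sizeof *copy);
  return copy;
}
```

If two callers both pass the same cached node to lower_in_place, the first
sees 16 elements and the second sees 0.  Lowering a copy obtained from unshare
leaves the cached node intact.  Whether copying or avoiding the in-place
clearing is the right fix for the PR is a separate question that this sketch
does not answer.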

[Bug middle-end/92574] Inefficient code for multidimensional array assess

2019-11-19 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92574

--- Comment #2 from bin cheng  ---
Similar to https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57534
The original idea was to handle this as much as possible in ivopts, which is
difficult given that the ivopts code has lots of (scev/niter) validity checks.
In the aforementioned straight-line "ivopts", we only need to factor out the
common part, choose an addressing mode, and rewrite the memory references.
Maybe a light-weight pass could do the job using the existing ivopts facilities.

[Bug c++/85471] closing a "thread" in "C++" using "pthread_exit(NULL)" creates a "SIGABRT"

2019-10-24 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85471

bin cheng  changed:

   What|Removed |Added

 CC||amker at gcc dot gnu.org

--- Comment #6 from bin cheng  ---
I ran into a Stack Overflow entry with the following code:
#include <pthread.h>
#include <stddef.h>

static void cleanup(void *ptr)
{
}

void *child(void *ptr)
{
  pthread_cleanup_push(cleanup, NULL);
  pthread_exit(NULL);
  pthread_cleanup_pop(1);
  return NULL;
}

int main()
{
  pthread_t foo;
  pthread_create(&foo, NULL, child, NULL);
  pthread_join(foo, NULL);
  return 0;
}

The abort can be reproduced when compiled using gcc-8.3 with following options:
$ g++ -o a.out test.cc -g -Wall -fexceptions  -pthread -static-libstdc++
-static-libgcc
$ gdb --args ./a.out
(gdb) r
(gdb) bt
#0  0xbf4972c8 in raise () from /lib64/libc.so.6
#1  0xbf498940 in abort () from /lib64/libc.so.6
#2  0x0040ec94 in _Unwind_SetGR ()
#3  0x00401c4c in __gxx_personality_v0 ()
#4  0xbec3fab8 in _Unwind_ForcedUnwind_Phase2
(exc=exc@entry=0xbf462670, context=context@entry=0xbf461560,
frames_p=frames_p@entry=0xbf461198)
at ../../../libgcc/unwind.inc:182
#5  0xbec3fea0 in _Unwind_ForcedUnwind (exc=0xbf462670,
stop=0xbf5f7950 , stop_argument=0xbf461a30) at
../../../libgcc/unwind.inc:217
#6  0xbf5fa15c in _Unwind_ForcedUnwind () from /lib64/libpthread.so.0
#7  0xbf5f7aac in __pthread_unwind () from /lib64/libpthread.so.0
#8  0xbf5f1a08 in pthread_exit () from /lib64/libpthread.so.0
#9  0x00401460 in child (ptr=0x0) at test.cc:13
#10 0xbf5f0bb0 in start_thread () from /lib64/libpthread.so.0
#11 0xbf53e4c0 in thread_start () from /lib64/libc.so.6

The issue in this case is caused by -static-libgcc; I am not sure if it's the
same as the original case.

Thanks

[Bug debug/90231] ivopts causes iterator in the loop

2019-10-17 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90231

--- Comment #12 from bin cheng  ---
(In reply to Jakub Jelinek from comment #10)
> Actually (int) ((ivtmp.11 - (unsigned long) dst_10) / 4), sorry.
> On 64-bit targets this will never be a problem, are you worried about 32-bit
> targets where int and pointers are the same width and for a loop with say up
> to INT_MAX iterations ivtmp.11 would wrap around?  Then dst[i] would be
> invalid too.
> So as long as the IVs aren't added there out of the blue sky, with larger
> steps than what is really used, it shouldn't be an issue.
> Or can say a loop that does:
> unsigned int j = x;
> for (int i = 0; i < n; i++)
>   {
> j += 32;
> use (i, j);
>   }
> use j as unsigned int IV with step 32 replace the i int IV with step 1?  If
> yes, then I'd understand that (int) ((j - x) / 32) might not be correct
> expression all the time, e.g. if j == x, then i might be 0, or 0x800
> etc., but (int) ((j - x) / 32) will be 0.

Yes, as mentioned in #11, we need to choose an IV of the same class when
rewriting.  Reusing the existing code makes that harder; after all, I don't
want to disturb existing code because of debug-stmt rewriting.
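
The wrap-around case from the quoted example can be checked with a few lines
of C.  This is a minimal sketch assuming a 32-bit unsigned j with step 32, as
in the quote.  The helper names j_after and recover_i are hypothetical.

```c
#include <stdint.h>

/* Value of the unsigned 32-bit IV j after I iterations of "j += 32",
   starting from X.  The arithmetic wraps modulo 2^32.  */
uint32_t
j_after (uint32_t x, uint32_t i)
{
  return x + 32u * i;
}

/* Attempt to recover i from j using the expression from the quote.  */
int
recover_i (uint32_t j, uint32_t x)
{
  return (int) ((j - x) / 32u);
}
```

For small iteration counts the round trip is exact, for example
recover_i (j_after (100, 5), 100) is 5.  After 2^27 iterations j is back at x,
so recover_i yields 0 although i is 2^27.  That is why the candidate has to
come from the same class of IVs as i for the mapping to be invertible.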

[Bug debug/90231] ivopts causes iterator in the loop

2019-10-17 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90231

--- Comment #11 from bin cheng  ---
(In reply to Richard Biener from comment #9)
> (In reply to bin cheng from comment #7)
> > The orignal iv needs to be represented in debug bind stmt is:
> >  64 IV struct:
> >  65   SSA_NAME: i_18
> >  66   Type: int
> >  67   Base: 0
> >  68   Step: 1
> >  69   Biv:  Y
> >  70   Overflowness wrto loop niter: No-overflow
> > 
> > While the possible candidate is:
> > 185 Candidate 8:
> > 186   Var befor: ivtmp.11
> > 187   Var after: ivtmp.11
> > 188   Incr POS: before exit test
> > 189   IV struct:
> > 190 Type:   unsigned long
> > 191 Base:   (unsigned long) dst_10(D)
> > 192 Step:   4
> > 193 Object: (void *) dst_10(D)
> > 194 Biv:N
> > 195 Overflowness wrto loop niter:   Overflow
> > 
> > Strictly speaking, with above information, we can't compute i_18 using
> > ivtmp.11 correctly in all cases, because ivtmp.11 could overflow.  Of
> > course, the overflow-ness in this case could be improved, thus solve the
> > problem.  Or there is another method: we can do the computation anyway, it
> > may give wrong value in some cases, but we are in debug stmt, value which is
> > correct in most cases is better than optimized away, sensible?
> 
> Actually we do know that ivtmp.11 doesn't overflow.
> 
> Since we can express the use of i in dst[i] by the new IV we can express
> i in terms of the new IV at the point of its original use as well, I see
> no way the transform isn't bijective.  The complication here is just
It's bijective if we can choose a candidate derived from the same class of
induction variables as "i"; however, the code rewriting debug stmts currently
selects the cand using a simple heuristic, so it's not guaranteed that a cand
from the right class would be chosen.
Also, we reuse the existing iv_use -> iv_cand computation code when rewriting
debug stmts.
> that we have to undo the 'use in dst[i]' effect somehow, but for simple
> cases or rewrite_use_* this should be doable.

[Bug debug/90231] ivopts causes iterator in the loop

2019-10-17 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90231

--- Comment #7 from bin cheng  ---
The original IV that needs to be represented in the debug bind stmt is:
IV struct:
  SSA_NAME: i_18
  Type: int
  Base: 0
  Step: 1
  Biv:  Y
  Overflowness wrto loop niter: No-overflow

While the possible candidate is:
Candidate 8:
  Var befor: ivtmp.11
  Var after: ivtmp.11
  Incr POS: before exit test
  IV struct:
    Type:   unsigned long
    Base:   (unsigned long) dst_10(D)
    Step:   4
    Object: (void *) dst_10(D)
    Biv:    N
    Overflowness wrto loop niter:   Overflow

Strictly speaking, with the above information, we can't compute i_18 from
ivtmp.11 correctly in all cases, because ivtmp.11 could overflow.  Of course,
the overflow-ness analysis could be improved for this case, which would solve
the problem.  Or there is another approach: do the computation anyway.  It may
give a wrong value in some cases, but we are in a debug stmt, and a value which
is correct in most cases is better than one that is optimized away.  Sensible?

Thanks,
bin

[Bug tree-optimization/91775] Can eliminate compare from loop with known number of iterations

2019-10-08 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91775

--- Comment #6 from bin cheng  ---
The address type iv_use has pointer type and 64-bit precision, while the
iv_cands added (by the ivcanon pass) have unsigned int type.  So decremental
candidates are skipped because of the following code:

  /* Check if we have enough precision to express the values of use.  */
  if (TYPE_PRECISION (utype) > TYPE_PRECISION (ctype))
    return infinite_cost;

Looks like better overflow-ness analysis is required here:
Candidate 6:
  Incr POS: orig biv
  IV struct:
    Type:   unsigned int
    Base:   1024
    Step:   4294967295
    Biv:    N
    Overflowness wrto loop niter:   Overflow  <--- here.
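
To see what the base and step in this dump mean numerically, the sketch below
evaluates the candidate once in its own 32-bit unsigned type and once naively
in 64 bits.  It assumes unsigned int is 32 bits wide.  The function names
cand_value_32 and cand_value_64 are made up for the illustration.

```c
#include <stdint.h>

/* Candidate value after I iterations, computed in 32-bit unsigned
   arithmetic: 1024 + 4294967295 * i modulo 2^32, which is 1024 - i.  */
uint32_t
cand_value_32 (uint32_t i)
{
  return UINT32_C (1024) + UINT32_C (4294967295) * i;
}

/* The same base and step evaluated in 64 bits, where the multiplication
   does not wrap for these inputs.  */
uint64_t
cand_value_64 (uint32_t i)
{
  return UINT64_C (1024) + UINT64_C (4294967295) * i;
}
```

In 32 bits the step is -1 modulo 2^32, so the candidate simply counts down
from 1024 to 0 over 1024 iterations.  In 64 bits the same base and step give
4294968319 after one iteration, which only matches the 32-bit value after
truncation.  This is presumably why a wider use cannot be expressed with this
candidate unless the analysis can prove the countdown never wraps.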

[Bug rtl-optimization/91137] [7 Regression] Wrong code with -O3

2019-09-02 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91137

--- Comment #15 from bin cheng  ---
Author: amker
Date: Mon Sep  2 10:10:44 2019
New Revision: 275304

URL: https://gcc.gnu.org/viewcvs?rev=275304&root=gcc&view=rev
Log:
Backport from mainline
2019-07-18  Bin Cheng  

PR tree-optimization/91137
* tree-ssa-loop-ivopts.c (struct ivopts_data): New field.
(tree_ssa_iv_optimize_init, alloc_iv, tree_ssa_iv_optimize_finalize):
Init, use and fini the above new field.
(determine_base_object_1): New function.
(determine_base_object): Reimplement using walk_tree.

2019-07-18  Bin Cheng  

PR tree-optimization/91137
* gcc.c-torture/execute/pr91137.c: New test.

Added:
branches/gcc-7-branch/gcc/testsuite/gcc.c-torture/execute/pr91137.c
Modified:
branches/gcc-7-branch/gcc/ChangeLog
branches/gcc-7-branch/gcc/testsuite/ChangeLog
branches/gcc-7-branch/gcc/tree-ssa-loop-ivopts.c

[Bug rtl-optimization/91137] [7/8 Regression] Wrong code with -O3

2019-08-30 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91137

--- Comment #14 from bin cheng  ---
Author: amker
Date: Fri Aug 30 11:02:48 2019
New Revision: 275064

URL: https://gcc.gnu.org/viewcvs?rev=275064&root=gcc&view=rev
Log:
Backport from mainline
2019-07-18  Bin Cheng  

PR tree-optimization/91137
* tree-ssa-loop-ivopts.c (struct ivopts_data): New field.
(tree_ssa_iv_optimize_init, alloc_iv, tree_ssa_iv_optimize_finalize):
Init, use and fini the above new field.
(determine_base_object_1): New function.
(determine_base_object): Reimplement using walk_tree.

2019-07-18  Bin Cheng  

PR tree-optimization/91137
* gcc.c-torture/execute/pr91137.c: New test.

Added:
branches/gcc-8-branch/gcc/testsuite/gcc.c-torture/execute/pr91137.c
Modified:
branches/gcc-8-branch/gcc/ChangeLog
branches/gcc-8-branch/gcc/testsuite/ChangeLog
branches/gcc-8-branch/gcc/tree-ssa-loop-ivopts.c

[Bug rtl-optimization/91137] [7/8/9 Regression] Wrong code with -O3

2019-07-23 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91137

--- Comment #13 from bin cheng  ---
Author: amker
Date: Wed Jul 24 01:28:33 2019
New Revision: 273754

URL: https://gcc.gnu.org/viewcvs?rev=273754&root=gcc&view=rev
Log:
Backport from mainline
2019-07-18  Bin Cheng  

PR tree-optimization/91137
* tree-ssa-loop-ivopts.c (struct ivopts_data): New field.
(tree_ssa_iv_optimize_init, alloc_iv, tree_ssa_iv_optimize_finalize):
Init, use and fini the above new field.
(determine_base_object_1): New function.
(determine_base_object): Reimplement using walk_tree.

gcc/testsuite
2019-07-18  Bin Cheng  

PR tree-optimization/91137
* gcc.c-torture/execute/pr91137.c: New test.

Added:
branches/gcc-9-branch/gcc/testsuite/gcc.c-torture/execute/pr91137.c
Modified:
branches/gcc-9-branch/gcc/ChangeLog
branches/gcc-9-branch/gcc/testsuite/ChangeLog
branches/gcc-9-branch/gcc/tree-ssa-loop-ivopts.c

[Bug rtl-optimization/91137] [7/8/9/10 Regression] Wrong code with -O3

2019-07-20 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91137

--- Comment #11 from bin cheng  ---
Hi, I suppose this patch should be backported to the 8 and 7 branches if no
further issues turn up.

[Bug rtl-optimization/91137] [7/8/9/10 Regression] Wrong code with -O3

2019-07-18 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91137

--- Comment #10 from bin cheng  ---
Author: amker
Date: Thu Jul 18 08:38:09 2019
New Revision: 273570

URL: https://gcc.gnu.org/viewcvs?rev=273570&root=gcc&view=rev
Log:
PR tree-optimization/91137
* tree-ssa-loop-ivopts.c (struct ivopts_data): New field.
(tree_ssa_iv_optimize_init, alloc_iv, tree_ssa_iv_optimize_finalize):
Init, use and fini the above new field.
(determine_base_object_1): New function.
(determine_base_object): Reimplement using walk_tree.

gcc/testsuite
PR tree-optimization/91137
* gcc.c-torture/execute/pr91137.c: New test.

Added:
trunk/gcc/testsuite/gcc.c-torture/execute/pr91137.c
Modified:
trunk/gcc/ChangeLog
trunk/gcc/testsuite/ChangeLog
trunk/gcc/tree-ssa-loop-ivopts.c

[Bug rtl-optimization/91137] [7/8/9/10 Regression] Wrong code with -O3

2019-07-15 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91137

--- Comment #8 from bin cheng  ---
(In reply to rguent...@suse.de from comment #7)
> On Mon, 15 Jul 2019, amker at gcc dot gnu.org wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91137
> > 
> > --- Comment #6 from bin cheng  ---
> > (In reply to Richard Biener from comment #2)
> > > 
> > > and I can very well imagine we're getting confused by find_base_term
> > > logic here.
> > > 
> > > There's logic in IVOPTs to not generate IVs based on two different
> > > objects but somehow it doesn't trigger here.
> > 
> > Hmm, it's because determine_base_object failed to identify the `base_object`
> > for IV because it has non-pointer type:
> > IV struct:
> >   SSA_NAME: _32
> >   Type: unsigned long
> >   Base: (unsigned long)  + 19600
> >   Step: 4
> >   Biv:  N
> >   Overflowness wrto loop niter: Overflow
> > 
> > And we have short-circuit in determine_base_object:
> > 
> > static tree
> > determine_base_object (tree expr)
> > {
> >   enum tree_code code = TREE_CODE (expr);
> >   tree base, obj;
> > 
> >   /* If this is a pointer casted to any type, we need to determine
> >  the base object for the pointer; so handle conversions before
> >  throwing away non-pointer expressions.  */
> >   if (CONVERT_EXPR_P (expr))
> > return determine_base_object (TREE_OPERAND (expr, 0));
> > 
> >   if (!POINTER_TYPE_P (TREE_TYPE (expr)))
> > return NULL_TREE;
> > 
> > The IV is generated from inner loop ivopts as we rewrite using unsigned 
> > type.
> > 
> > Any suggestion how to fix this?
> 
> I think we need to elide this check and make the following code
> more powerful which includes actually handling PLUS/MINUS_EXPR.
> There's also ptr + (unsigned) to be considered for the
> POINTER_PLUS_EXPR case - thus we cannot simply only search the
> pointer chain.
> 
> Which then leads us into exponential behavior if this is asked
> on SCEV results which may have tree sharing, thus we need some
> 'visited' machinery.  In the end I think we should re-do
Will work on a patch.  One thing I am unclear about is the ptr + (unsigned)
stuff, when would we have this? Could you provide an example please?

> it like I re-did contains_abnormal_ssa_name_p, use
> walk_tree_without_duplicates.  Btw, what should happen if the
> walk discovers two bases are used in the expression?
I guess it depends on why there are multiple bases in the first place?

[Bug rtl-optimization/91137] [7/8/9/10 Regression] Wrong code with -O3

2019-07-15 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91137

--- Comment #6 from bin cheng  ---
(In reply to Richard Biener from comment #2)
> 
> and I can very well imagine we're getting confused by find_base_term
> logic here.
> 
> There's logic in IVOPTs to not generate IVs based on two different
> objects but somehow it doesn't trigger here.

Hmm, it's because determine_base_object failed to identify the `base_object`
for the IV, since it has a non-pointer type:
IV struct:
  SSA_NAME: _32
  Type: unsigned long
  Base: (unsigned long)  + 19600
  Step: 4
  Biv:  N
  Overflowness wrto loop niter: Overflow

And we have a short-circuit in determine_base_object:

static tree
determine_base_object (tree expr)
{
  enum tree_code code = TREE_CODE (expr);
  tree base, obj;

  /* If this is a pointer casted to any type, we need to determine
     the base object for the pointer; so handle conversions before
     throwing away non-pointer expressions.  */
  if (CONVERT_EXPR_P (expr))
    return determine_base_object (TREE_OPERAND (expr, 0));

  if (!POINTER_TYPE_P (TREE_TYPE (expr)))
    return NULL_TREE;

The IV is generated by the inner loop's ivopts, as we rewrite using an unsigned
type.

Any suggestion how to fix this?
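
The effect of the short-circuit can be reproduced on a toy expression tree.
The sketch below is not GCC code: enum kind, struct expr and both functions
are hypothetical, and the object name used when building such a tree is an
assumption, since the base in the dump above is not fully shown.  It contrasts
a walk that gives up on non-pointer expressions with one that also looks
through additions.

```c
#include <stddef.h>

/* Hypothetical expression kinds; these are not GCC tree codes.  */
enum kind { K_ADDR, K_CONVERT, K_PLUS, K_CONST };

struct expr
{
  enum kind kind;
  int is_pointer;               /* Nonzero if the expression has pointer type.  */
  const char *object;           /* Object name, used by K_ADDR only.  */
  const struct expr *op0, *op1; /* Operands, NULL when unused.  */
};

/* Mirrors the quoted prefix: look through conversions, then give up on
   anything that is not pointer-typed.  */
const char *
base_short_circuit (const struct expr *e)
{
  if (e->kind == K_CONVERT)
    return base_short_circuit (e->op0);
  if (!e->is_pointer)
    return NULL;
  return e->kind == K_ADDR ? e->object : NULL;
}

/* Variant that also searches both operands of an addition and returns
   the first base it finds.  */
const char *
base_through_plus (const struct expr *e)
{
  const char *base;

  switch (e->kind)
    {
    case K_ADDR:
      return e->object;
    case K_CONVERT:
      return base_through_plus (e->op0);
    case K_PLUS:
      base = base_through_plus (e->op0);
      return base ? base : base_through_plus (e->op1);
    default:
      return NULL;
    }
}
```

For an expression shaped like (unsigned long) &a + 19600, the outermost node
is an integer-typed addition, so base_short_circuit returns NULL while
base_through_plus finds "a".  The sketch deliberately leaves out sharing of
subtrees and the case of two different bases in one expression.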

[Bug rtl-optimization/91137] [7/8/9/10 Regression] Wrong code with -O3

2019-07-11 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91137

--- Comment #5 from bin cheng  ---
Will try to find some time this weekend, sorry for the delay.

[Bug tree-optimization/90240] [10 Regression] ICE in try_improve_iv_set, at tree-ssa-loop-ivopts.c:6694

2019-06-18 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90240

--- Comment #12 from bin cheng  ---
(In reply to Richard Biener from comment #11)
> Is this now fixed?

Yes, fixed on trunk.  The only question is whether it should be backported to
GCC-9.

[Bug tree-optimization/57534] [7/8/9/10 Regression]: Performance regression versus 4.7.3, 4.8.1 is ~15% slower

2019-05-08 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57534

--- Comment #34 from bin cheng  ---
So we could have three different addressing modes here.
  1. What we have now:
        leaq    0(,%rbp,8), %rax
        movsd   8(%rbx,%rax), %xmm0
        addsd   (%rbx,%rbp,8), %xmm0
        addq    $8, %rbp
        addsd   16(%rbx,%rax), %xmm0
        addsd   24(%rbx,%rax), %xmm0
        addsd   %xmm0, %xmm1
        movsd   32(%rbx,%rax), %xmm0
        addsd   40(%rbx,%rax), %xmm0
        addsd   48(%rbx,%rax), %xmm0
        addsd   56(%rbx,%rax), %xmm0
        addsd   %xmm0, %xmm2
        cmpq    %rsi, %rbp
  2. GCC-4.7:
        fldl   (%esi,%ebx,8)
        lea    0x8(%ebx),%eax
        faddl  0x8(%esi,%ebx,8)
        cmp    %eax,%edi
        faddl  0x10(%esi,%ebx,8)
        faddl  0x18(%esi,%ebx,8)
        faddp  %st,%st(2)
        fldl   0x20(%esi,%ebx,8)
        faddl  0x28(%esi,%ebx,8)
        faddl  0x30(%esi,%ebx,8)
        faddl  0x38(%esi,%ebx,8)
        faddp  %st,%st(1)
  3. With slsr change:
        leaq    0(%rbp,%rbx,8), %rax
        addq    $8, %rbx
        movsd   (%rax), %xmm0
        addsd   8(%rax), %xmm0
        addsd   16(%rax), %xmm0
        addsd   24(%rax), %xmm0
        addsd   %xmm0, %xmm1
        movsd   32(%rax), %xmm0
        addsd   40(%rax), %xmm0
        addsd   48(%rax), %xmm0
        addsd   56(%rax), %xmm0
        addsd   %xmm0, %xmm2
        cmpq    %rsi, %rbx

It was reported that 2. is better than 1.  Jeff also recommended 3.

What I don't understand is:
A) Why is 2. better than 1.?  It seems to have more computation in the addresses.
B) Is 3. the best one?  It has the simplest addressing mode, but requires
one additional lea because of strength reduction.

Thanks.

[Bug tree-optimization/90078] [7/8 Regression] ICE with deep templates caused by overflow

2019-05-08 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90078

--- Comment #14 from bin cheng  ---
Author: amker
Date: Wed May  8 11:37:45 2019
New Revision: 271008

URL: https://gcc.gnu.org/viewcvs?rev=271008=gcc=rev
Log:
PR tree-optimization/90078
* tree-ssa-loop-ivopts.c (INFTY): Increase value for infinite cost.
(struct comp_cost): Promote type of members to int64_t.
(infinite_cost): Don't set complexity in initialization.
(comp_cost::operator +,-,+=,-+,/=,*=): Assert when cost computation
overflows to infinite_cost.
(adjust_setup_cost): Promote type of parameter and cost computation
to int64_t.
(struct ainc_cost_data, struct iv_ca): Promote type of member to
int64_t.
(get_scaled_computation_cost_at, determine_iv_cost): Promote type of
cost computation to int64_t.
(determine_group_iv_costs, iv_ca_dump, find_optimal_iv_set): Use
int64_t's format specifier in dump.

gcc/testsuite
* g++.dg/tree-ssa/pr90078.C: New test.

Added:
trunk/gcc/testsuite/g++.dg/tree-ssa/pr90078.C
Modified:
trunk/gcc/ChangeLog
trunk/gcc/testsuite/ChangeLog
trunk/gcc/tree-ssa-loop-ivopts.c

[Bug tree-optimization/90240] [10 Regression] ICE in try_improve_iv_set, at tree-ssa-loop-ivopts.c:6694

2019-05-08 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90240

--- Comment #10 from bin cheng  ---
Author: amker
Date: Wed May  8 11:24:38 2019
New Revision: 271007

URL: https://gcc.gnu.org/viewcvs?rev=271007=gcc=rev
Log:
PR tree-optimization/90240
* tree-ssa-loop-ivopts.c (get_scaled_computation_cost_at): Scale cost
with respect to scaling factor pre-computed for each basic block.
(try_improve_iv_set): Return bool if best_cost equals to iv_ca cost.
(find_optimal_iv_set_1): Free iv_ca set if it has infinite_cost.
(COST_SCALING_FACTOR_BOUND, determine_scaling_factor): New.
(tree_ssa_iv_optimize_loop): Call determine_scaling_factor.  Extend
live range for array of loop's basic blocks.  Cleanup aux field of
loop's basic blocks.

gcc/testsuite
* gfortran.dg/graphite/pr90240.f: New test.

Added:
trunk/gcc/testsuite/gfortran.dg/graphite/pr90240.f
Modified:
trunk/gcc/ChangeLog
trunk/gcc/testsuite/ChangeLog
trunk/gcc/tree-ssa-loop-ivopts.c

[Bug tree-optimization/57534] [7/8/9/10 Regression]: Performance regression versus 4.7.3, 4.8.1 is ~15% slower

2019-05-08 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57534

bin cheng  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |amker at gcc dot gnu.org

--- Comment #33 from bin cheng  ---
Came back to this one.


void timer_stop();

volatile long keepgoing = 0;
double hand_benchmark_cache_ronly( double *x, long limit, long *oloops,
                                   double *ous) {
  long index = 0, loops = 0;
  double sum = (double)0;
  double sum2 = (double)0;
again:
  sum += x[index] + x[index+1] + x[index+2] + x[index+3];
  sum2 += x[index+4] + x[index+5] + x[index+6] + x[index+7];
  if ((index += 8) < limit) goto again;
  else if (keepgoing) {
    index = 0;
    goto again;
  }
  timer_stop();
  x[0] = (double)sum + (double)sum2;
  x[1] = (double)index;
}

The ideal fix for the above test would be to identify the first goto as a loop,
so that IVOPTs can do strength reduction on the address IVs.
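
As a rough illustration of what strength reduction on the address IVs would
mean here, the inner goto loop can be written by hand as a do-while whose
induction variable is a pointer advanced by 8 elements per iteration.  This is
only a sketch, not compiler output: the names sum_indexed and sum_reduced are
made up for the example, and the keepgoing restart path is left out.

```c
/* Indexed form, as in the test: each access is x + index * 8 + constant.  */
static double
sum_indexed (const double *x, long limit)
{
  long index = 0;
  double sum = 0.0, sum2 = 0.0;
  do
    {
      sum += x[index] + x[index + 1] + x[index + 2] + x[index + 3];
      sum2 += x[index + 4] + x[index + 5] + x[index + 6] + x[index + 7];
    }
  while ((index += 8) < limit);
  return sum + sum2;
}

/* Hand strength-reduced form: one pointer IV, only constant offsets.  */
static double
sum_reduced (const double *x, long limit)
{
  const double *p = x;
  const double *end = x + limit;
  double sum = 0.0, sum2 = 0.0;
  do
    {
      sum += p[0] + p[1] + p[2] + p[3];
      sum2 += p[4] + p[5] + p[6] + p[7];
      p += 8;
    }
  while (p < end);
  return sum + sum2;
}
```

Both functions perform the same additions in the same order, so they return
the same value for the same input; only the address computation differs.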

For the case below, on the other hand:
int ind;
int cond(void);

double hand_benchmark_cache_ronly( double *x) {
  double sum=0.0;
  while (cond())
    sum += x[ind] + x[ind+1] + x[ind+2] + x[ind+3];
  return sum;
}

It's hard to handle in IVOPTs, because neither niter nor scev analysis
succeeds.  The IVOPTs implementation is centered on induction variables, so it
would be a non-trivial change to support such a case.

However, I wondered why slsr was missed in the previous analysis.  It's designed
to strength reduce such code.  Quoting from its comment:

   Specifically, we are interested in references for which 
   get_inner_reference returns a base address, offset, and bitpos as
   follows:

 base:MEM_REF (T1, C1)
 offset:  MULT_EXPR (PLUS_EXPR (T2, C2), C3)
 bitpos:  C4 * BITS_PER_UNIT

   Here T1 and T2 are arbitrary trees, and C1, C2, C3, C4 are 
   arbitrary integer constants.  Note that C2 may be zero, in which
   case the offset will be MULT_EXPR (T2, C3).

   When this pattern is recognized, the original memory reference
   can be replaced with:

 MEM_REF (POINTER_PLUS_EXPR (T1, MULT_EXPR (T2, C3)),
  C1 + (C2 * C3) + C4)

It explicitly states that addresses here should be tracked, associated and
reduced as we wanted:  (X + index * 8) + const_offset_x.
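
To make the quoted pattern concrete for this test: for x[index + 2] with
8-byte doubles, T1 = x, C1 = 0, T2 = index, C2 = 2, C3 = 8 and C4 = 0, so the
restructured reference is MEM_REF (x + index * 8, 16).  The sketch below does
the same byte arithmetic by hand in plain C.  It only illustrates the formula;
the function names are invented and this is not how slsr itself is implemented.

```c
#include <string.h>

/* Original form: base T1 plus offset (T2 + C2) * C3, with C1 = C4 = 0.
   C3 is sizeof (double), assumed to be 8 on the targets discussed here.  */
static double
load_original (const double *t1, long t2, long c2)
{
  const char *addr = (const char *) t1 + (t2 + c2) * (long) sizeof (double);
  double v;
  memcpy (&v, addr, sizeof v);
  return v;
}

/* Restructured form: shared part T1 + T2 * C3, constant part C2 * C3.  */
static double
load_restructured (const double *t1, long t2, long c2)
{
  const char *shared = (const char *) t1 + t2 * (long) sizeof (double);
  long constant = c2 * (long) sizeof (double);
  double v;
  memcpy (&v, shared + constant, sizeof v);
  return v;
}
```

The shared part is the same for all eight references in the loop body, which
is why only one address computation per iteration should be needed.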

I think it's a missed address slsr optimization, i.e., it clearly failed to
identify a CAND_REF candidate for the memory reference.  After looking into the
code, I think the problem is in slsr_process_ref and restructure_reference.

I'll see if I can fix this...

[Bug tree-optimization/90078] [7/8 Regression] ICE with deep templates caused by overflow

2019-04-30 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90078

--- Comment #13 from bin cheng  ---
Reverted 270500 on trunk too for easier backport to GCC9.

[Bug tree-optimization/90078] [7/8 Regression] ICE with deep templates caused by overflow

2019-04-29 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90078

--- Comment #12 from bin cheng  ---
Author: amker
Date: Tue Apr 30 03:00:59 2019
New Revision: 270673

URL: https://gcc.gnu.org/viewcvs?rev=270673=gcc=rev
Log:
PR tree-optimization/90240
Revert:
2019-04-23  Bin Cheng  

PR tree-optimization/90078
* tree-ssa-loop-ivopts.c (comp_cost::operator +,-,+=,-+,/=,*=): Add
checks for infinite_cost overflow.

* gcc/testsuite/g++.dg/tree-ssa/pr90078.C: New test.

Removed:
trunk/gcc/testsuite/g++.dg/tree-ssa/pr90078.C
Modified:
trunk/gcc/ChangeLog
trunk/gcc/testsuite/ChangeLog
trunk/gcc/tree-ssa-loop-ivopts.c

[Bug tree-optimization/90240] [10 Regression] ICE in try_improve_iv_set, at tree-ssa-loop-ivopts.c:6694

2019-04-29 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90240

--- Comment #9 from bin cheng  ---
Author: amker
Date: Tue Apr 30 03:00:59 2019
New Revision: 270673

URL: https://gcc.gnu.org/viewcvs?rev=270673=gcc=rev
Log:
PR tree-optimization/90240
Revert:
2019-04-23  Bin Cheng  

PR tree-optimization/90078
* tree-ssa-loop-ivopts.c (comp_cost::operator +,-,+=,-+,/=,*=): Add
checks for infinite_cost overflow.

* gcc/testsuite/g++.dg/tree-ssa/pr90078.C: New test.

Removed:
trunk/gcc/testsuite/g++.dg/tree-ssa/pr90078.C
Modified:
trunk/gcc/ChangeLog
trunk/gcc/testsuite/ChangeLog
trunk/gcc/tree-ssa-loop-ivopts.c

[Bug tree-optimization/90270] [8/9/10 Regression] Do not select best induction variable optimization

2019-04-29 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90270

--- Comment #11 from bin cheng  ---
For the record, this test reveals another issue: the original IV candidate is
not considered:

Group 0:
  Type: REFERENCE ADDRESS
  Use 0.0:
At stmt:_1 = final_counts[i_21];
At pos: final_counts[i_21]
IV struct:
  Type: unsigned int *
  Base: (unsigned int *) _counts
  Step: 4
  Object:   (void *) _counts
  Biv:  N
  Overflowness wrto loop niter: Overflow

Candidate 7:
  Incr POS: orig biv
  IV struct:
    Type:   unsigned int
    Base:   0
    Step:   1
    Biv:    N
    Overflowness wrto loop niter:   No-overflow

:
Group 0:
  cand  cost  compl.  inv.expr.  inv.vars
  1     9     2       NIL;       NIL;
  6     2     2       1;         NIL;
  8     0     0       NIL;       NIL;
  10    9     1       NIL;       NIL;

Group 1:
  cand  cost  compl.  inv.expr.  inv.vars
  1     9     2       NIL;       NIL;
  6     2     2       2;         NIL;
  9     0     0       NIL;       NIL;
  10    9     1       NIL;       NIL;

[Bug tree-optimization/90270] [8/9/10 Regression] Do not select best induction variable optimization

2019-04-29 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90270

--- Comment #10 from bin cheng  ---
(In reply to Richard Biener from comment #9)
> (In reply to bin cheng from comment #7)
> > Also, when calling move_fixed_address_to_symbol, fixed_address_object_p
> > looks too restricted, it only considers link time constant address.  In this
> > case, it's an array object in stack.
> 
> But this is because a stack access isn't $reloc but $sp + offset and thus
> _not_ a symbol.
From ivopts/loop's point of view, the address ($sp + offset) is loaded into
register, then the register is used to address elements in array.  In other
words, it doesn't really matter if the address is global and determined by
linker or local and determined by stack frame.

> 
> But as you noticed IVOPTs computing TARGET_MEM_REF so "early" is a bit
> brittle due to later eventual forwardings.  And those forwardings are
> hard to avoid because they affect fundamental predicates like
> may_propagate_copy where we decide early whether we can propagte into
> all uses before actually visiting them.
Can we avoid propagating into TARGET_MEM_REF if it creates an invalid addressing
mode?  IIUC, passes (like ivopts, slsr) creating TARGET_MEM_REF do generate a
"correct" addressing mode, so it doesn't make much sense to create invalid ones
afterwards.

[Bug tree-optimization/90270] [8/9/10 Regression] Do not select best induction variable optimization

2019-04-28 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90270

--- Comment #7 from bin cheng  ---
Also, when calling move_fixed_address_to_symbol, fixed_address_object_p looks
too restrictive: it only considers link-time constant addresses.  In this case,
it's an array object on the stack.

[Bug tree-optimization/90270] [8/9/10 Regression] Do not select best induction variable optimization

2019-04-28 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90270

--- Comment #6 from bin cheng  ---
(In reply to Andrew Pinski from comment #5)
> (In reply to bin cheng from comment #4)
> > On AArch64, iovpts generates following code:
> >[local count: 954449108]:
> >   # crc_20 = PHI 
> >   # ivtmp.5_18 = PHI <1(2), ivtmp.5_17(5)>
> >   _19 = _counts + 18446744073709551612;
> >   _1 = MEM[base: _19, index: ivtmp.5_18, step: 4, offset: 0B];
> >   crc_10 = crcu32 (_1, crc_20);
> >   _5 = _counts + 18446744073709551612;
> 
> I thought we had decided _counts + 18446744073709551612 would be
> invalid gimple anyways as we are taking the address of one element before.

Could you direct me to the discussion about this decision?  I remember raising
this question once (probably in private).  In this case, we need to revise
ivopts to avoid adding candidates which could violate this.

Anyway, it's an independent issue because the iv_cand could be moved forward by
one element, as in:

> >[local count: 954449108]:
> >   # crc_20 = PHI 
> >   # ivtmp.5_18 = PHI <0(2), ivtmp.5_17(5)>
> >   _19 = _counts;
> >   _1 = MEM[base: _19, index: ivtmp.5_18, step: 4, offset: 0B];
> >   crc_10 = crcu32 (_1, crc_20);
> >   _5 = _counts;
> 

Unfortunately, the cost computation still has problems generating this code.

[Bug tree-optimization/90270] [8/9/10 Regression] Do not select best induction variable optimization

2019-04-28 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90270

--- Comment #4 from bin cheng  ---
On AArch64, ivopts generates the following code:
   [local count: 954449108]:
  # crc_20 = PHI 
  # ivtmp.5_18 = PHI <1(2), ivtmp.5_17(5)>
  _19 = _counts + 18446744073709551612;
  _1 = MEM[base: _19, index: ivtmp.5_18, step: 4, offset: 0B];
  crc_10 = crcu32 (_1, crc_20);
  _5 = _counts + 18446744073709551612;
  _2 = MEM[base: _5, index: ivtmp.5_18, step: 4, offset: 0B];
  crc_12 = crcu32 (_2, crc_10);
  ivtmp.5_17 = ivtmp.5_18 + 1;
  if (ivtmp.5_17 != 9)
goto ; [87.50%]
  else
goto ; [12.50%]
Which looks optimal to me if _19/_5 can be hoisted out of loop.  And it is
intended to be hoisted by rtl liv.  (TREE liv doesn't help much, that's another
story)

Problem is in dom3 pass, cprop_operand, _19/_5 is propagated into memory access
although it causes invalid addressing mode on AArch64:
  [[(void *)_counts + -4B], [(void *)_counts + -4B]] 
EQUIVALENCES: { _19 } (1 elements)
Optimizing statement _1 = MEM[base: _19, index: ivtmp.5_18, step: 4, offset:
0B];
  Replaced '_19' with constant '[(void *)_counts + -4B]'
  Folded to: _1 = MEM[symbol: final_counts, index: ivtmp.5_18, step: 4, offset:
-4B];
LKUP STMT _1 = MEM[symbol: final_counts, index: ivtmp.5_18, step: 4, offset:
-4B] with .MEM_22
2>>> STMT _1 = MEM[symbol: final_counts, index: ivtmp.5_18, step: 4, offset:
-4B] with .MEM_22

It's kept in this form to the end of GIMPLE, then badly legitimized.

So ivopts worked hard to get the addressing mode and invariant expression
correct in this case; we need to avoid premature transformations afterwards.

BTW, with dom disabled by -fno-tree-dominator-opts, vrp2 does the same
transformation too, so -fno-tree-vrp is also necessary to get the optimal code.

Well, you can argue [base + iv << 2] is sub-optimal compared to [base + iv],
but that's hard to tune.  Also, a bias towards the original IV is in general
preferred for reasons like smaller setup code, better debug info, and even
performance in complicated loops.

[Bug tree-optimization/90240] [10 Regression] ICE in try_improve_iv_set, at tree-ssa-loop-ivopts.c:6694

2019-04-28 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90240

--- Comment #8 from bin cheng  ---
Patch proposed at:
https://gcc.gnu.org/ml/gcc-patches/2019-04/msg01101.html

[Bug tree-optimization/90240] [9 Regression] ICE in try_improve_iv_set, at tree-ssa-loop-ivopts.c:6694

2019-04-25 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90240

--- Comment #4 from bin cheng  ---
(In reply to Jakub Jelinek from comment #3)
> Graphite, so IMHO not a release blocker.

But the issue is critical: it could happen at general optimization levels for a
loop nest with a huge scaling factor.

So, find_optimal_iv_set_1 first chooses a candidate set, then makes different
attempts at cost descent by modifying the candidate set.  The facts are:
  1) The algorithm uses a global variable of the following structure and keeps
track of the cost in place during computation.
   struct iv_ca {
 //...
 comp_cost cand_use_cost;
 //...
 comp_cost cost;
   };
  2) The algorithm is heuristic, so it's possible to reach an intermediate state
with higher cost.
  3) As in the previous comment, a loop nest with a huge scaling factor can
easily result in infinite_cost.
  4) Once iv_ca.{cand_use_cost, cost} of the global variable reaches
infinite_cost, an ICE is the best thing that could happen.
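
A toy model of point 4), under the assumption that cost updates saturate at an
infinite value: once the running total has been clamped, removing the
contribution that caused the clamp cannot restore the real cost, so later
comparisons are meaningless.  TOY_INFTY, toy_add and toy_sub are hypothetical
names for this illustration, not the ivopts code.

```c
#include <stdint.h>

#define TOY_INFTY 10000000  /* infinite_cost value quoted in comment #2 */

/* Saturating add: anything at or above TOY_INFTY is clamped.  */
static int64_t
toy_add (int64_t total, int64_t delta)
{
  int64_t sum = total + delta;
  return sum >= TOY_INFTY ? TOY_INFTY : sum;
}

/* In-place removal of a contribution.  If TOTAL was clamped earlier,
   the true total is already lost and the result is wrong.  */
static int64_t
toy_sub (int64_t total, int64_t delta)
{
  return total - delta;
}
```

For example, starting from 9000000, adding 2000000 clamps to 10000000, and
removing the same 2000000 again yields 8000000 rather than the original
9000000.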

We could replace the gcc_assert with an algorithm failure and then give up on
ivopts, but IMHO that would miss quite a lot of optimizations.

The conclusion: the candidate choosing algorithm doesn't work well with
infinite_cost.
I don't know how to fix this trivially.  For now, even restricting the scaling
factor is a practical change.  Will give it a try.

[Bug tree-optimization/90240] [9 Regression] ICE in try_improve_iv_set, at tree-ssa-loop-ivopts.c:6694

2019-04-25 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90240

--- Comment #2 from bin cheng  ---
Also, the cost in the inner loop is scaled by a big number:
Scaling cost based on bb prob by 1.00: 0 (scratch: 0) -> 0 (1/1)
Scaling cost based on bb prob by 1.00: 32 (scratch: 0) -> 32 (1/1)
Scaling cost based on bb prob by 1.00: 41 (scratch: 0) -> 41 (1/1)
Scaling cost based on bb prob by 1.00: 21 (scratch: 0) -> 21 (1/1)
Scaling cost based on bb prob by 1.00: 45 (scratch: 0) -> 45 (1/1)
Scaling cost based on bb prob by 1.00: 21 (scratch: 0) -> 21 (1/1)
Scaling cost based on bb prob by 1.00: 17 (scratch: 0) -> 17 (1/1)

Resulting:
Group 19:
  cand  cost  compl.  inv.expr.  inv.vars
  1     41    0       NIL;       1, 4
  2     21    0       NIL;       4
  3     45    0       NIL;       1, 4
  4     21    0       NIL;       4
  5     17    0       35;        NIL;
  30    0     0       NIL;       NIL;
  67    32    0       NIL;       1, 4

Given we have 70 groups of iv_use, this easily overflows infinite_cost, which is
10,000,000.

One thing that is unclear is that the overflow happens in the middle of the
candidate choosing algorithm.

[Bug tree-optimization/90240] [9 Regression] ICE in try_improve_iv_set, at tree-ssa-loop-ivopts.c:6694

2019-04-25 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90240

bin cheng  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |amker at gcc dot gnu.org

--- Comment #1 from bin cheng  ---
Probably something to do with the recent changes to comp_cost::operators.

[Bug debug/90231] ivopts causes iterator in the loop

2019-04-24 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90231

bin cheng  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |amker at gcc dot gnu.org

--- Comment #5 from bin cheng  ---
I will try to fix it for GCC10.  Thanks

[Bug tree-optimization/90021] [9 Regression] ICE in index_in_loop_nest, at tree-data-ref.h:587 since r270203

2019-04-22 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90021

--- Comment #5 from bin cheng  ---
(In reply to Jakub Jelinek from comment #4)
> From what I can see, a fix for this has been acked 11 days ago:
> https://gcc.gnu.org/ml/gcc-patches/2019-04/msg00413.html
> Bin, are you going to commit it?

I just committed it.  There was a typo in the PR number in the ChangeLog entry,
so this PR was not updated.  For the record, it's
https://gcc.gnu.org/viewcvs/gcc?view=revision=270499

[Bug tree-optimization/90078] [7/8/9 Regression] ICE with deep templates caused by overflow [PATCH]

2019-04-22 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90078

--- Comment #9 from bin cheng  ---
Author: amker
Date: Tue Apr 23 04:07:46 2019
New Revision: 270500

URL: https://gcc.gnu.org/viewcvs?rev=270500=gcc=rev
Log:
PR tree-optimization/90078
* tree-ssa-loop-ivopts.c (comp_cost::operator +,-,+=,-+,/=,*=): Add
checks for infinite_cost overflow.

gcc/testsuite
* gcc/testsuite/g++.dg/tree-ssa/pr90078.C: New test.

Also fix typo in ChangeLog entry for revision 270499.

Added:
trunk/gcc/testsuite/g++.dg/tree-ssa/pr90078.C
Modified:
trunk/gcc/ChangeLog
trunk/gcc/testsuite/ChangeLog
trunk/gcc/tree-ssa-loop-ivopts.c

[Bug testsuite/86153] [8 regression] test case g++.dg/pr83239.C fails starting with r261585

2019-04-16 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86153

bin cheng  changed:

   What|Removed |Added

 CC||amker at gcc dot gnu.org

--- Comment #16 from bin cheng  ---
Should this be backported to GCC8?

[Bug c++/90078] [7/8/9 Regression] ICE with deep templates caused by overflow [PATCH]

2019-04-16 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90078

--- Comment #6 from bin cheng  ---
(In reply to Martin Liška from comment #5)
> (In reply to bin cheng from comment #4)
> > In get_scaled_computation_cost_at, we have very big ratio between
> > bb_count/loop_count:
> > 
> > (gdb) p data->current_loop->latch->count   
> > $50 = {static n_bits = 61, static max_count = 2305843009213693950, static
> > uninitialized_count = 2305843009213693951, m_val = 158483, m_quality =
> > profile_guessed_local}
> > (gdb) p gimple_bb(at)->count
> > $51 = {static n_bits = 61, static max_count = 2305843009213693950, static
> > uninitialized_count = 2305843009213693951, m_val = 1569139790, m_quality =
> > profile_guessed_local}
> > (gdb) p 1569139790 / 158483
> > $52 = 9900
> > (gdb) p cost
> > $53 = {cost = 20, complexity = 2, scratch = 1}
> > (gdb) p 19 * 9900
> > $54 = 188100
> > 
> > as a result, sum_cost soon reaches to overflow of infinite_cost.  Shall we
> > cap the ratio so that it doesn't grow too quick?  Of course, some benchmark
> > data is needed for this heuristic tuning.
> 
> I would implement the capping in comp_cost struct where each individual
> operator
> can cap to infinite. What do you think Bin?
Implementing the capping to infinite_cost in comp_cost::operators is less
invasive.  OTOH, capping bb_freq/loop_freq has its own advantages: once the
cost reaches infinite, it becomes meaningless for comparison as well as for
candidate choosing, whereas capping bb_freq/loop_freq can still express the
hotness of code to some extent.
Let's fix the issue by capping comp_cost::operators first for this stage 4, and
revisit the idea of capping bb_freq/loop_freq with more benchmark data in the
next Stage 1.  How about that?
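
The difference between the two approaches can be sketched as follows.  This is
a toy comparison with made-up names and a made-up bound (TOY_BOUND); it only
shows that clamping the sum makes two hot candidates indistinguishable, while
bounding the bb_freq/loop_freq ratio keeps their relative order.

```c
#include <stdint.h>

#define TOY_INFTY 10000000   /* infinite_cost value mentioned in PR90240 */
#define TOY_BOUND 20         /* hypothetical cap on the scaling ratio */

/* Approach 1: scale freely, then clamp the result to infinity.  */
static int64_t
cost_clamped (int64_t cost, int64_t ratio, int64_t uses)
{
  int64_t sum = cost * ratio * uses;
  return sum >= TOY_INFTY ? TOY_INFTY : sum;
}

/* Approach 2: cap the ratio before scaling, which keeps the sum much
   smaller (though it does not by itself rule out reaching infinity).  */
static int64_t
cost_ratio_capped (int64_t cost, int64_t ratio, int64_t uses)
{
  if (ratio > TOY_BOUND)
    ratio = TOY_BOUND;
  return cost * ratio * uses;
}
```

With a ratio of 9900 and 70 uses, per-use costs of 19 and 20 both clamp to the
same value under approach 1, while approach 2 still ranks 19 below 20.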

Thanks.
> 
> > 
> > 
> > Another problem is the generated binary has segment fault issue even
> > compiled O0:
> > 
> > $ ./g++ -O0 pr90078.cc -o a.out -ftemplate-depth=100 -ftime-report  -g
> > -std=c++14
> > $ gdb --args ./a.out
> > 
> > Dump of assembler code for function main():
> >    0x00400572 <+0>:  push   %rbp
> >    0x00400573 <+1>:  mov    %rsp,%rbp
> >    0x00400576 <+4>:  sub    $0x2625a020,%rsp
> >    0x0040057d <+11>: lea    -0x2625a020(%rbp),%rax
> >    0x00400584 <+18>: mov    %rax,%rdi
> > => 0x00400587 <+21>: callq  0x4006c0  > 100, 100>::Tensor4()>
> >    0x0040058c <+26>: lea    -0x4c4b410(%rbp),%rax
> >    0x00400593 <+33>: lea    -0xe4e1c10(%rbp),%rdx
> > The segment fault happens at the callq instruction.
> 
> Yes, same happens also for clang. It's a stack overflow:
> 
> $ g++ pr90078.cpp  -ftemplate-depth=111 -fsanitize=address && ./a.out 
> AddressSanitizer:DEADLYSIGNAL
> =
> ==5750==ERROR: AddressSanitizer: stack-overflow on address 0x7fffd9da3af0
> (pc 0x004011cb bp 0x7fffdc60 sp 0x7fffd9da3af0 T0)
> #0 0x4011ca in main (/home/marxin/Programming/testcases/a.out+0x4011ca)
> #1 0x76d32b7a in __libc_start_main ../csu/libc-start.c:308
> #2 0x401109 in _start (/home/marxin/Programming/testcases/a.out+0x401109)
> 
> SUMMARY: AddressSanitizer: stack-overflow
> (/home/marxin/Programming/testcases/a.out+0x4011ca) in main
> ==5750==ABORTING

[Bug c++/90078] [7/8/9 Regression] ICE with deep templates caused by overflow [PATCH]

2019-04-15 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90078

--- Comment #4 from bin cheng  ---
In get_scaled_computation_cost_at, we have very big ratio between
bb_count/loop_count:

(gdb) p data->current_loop->latch->count   
$50 = {static n_bits = 61, static max_count = 2305843009213693950, static
uninitialized_count = 2305843009213693951, m_val = 158483, m_quality =
profile_guessed_local}
(gdb) p gimple_bb(at)->count
$51 = {static n_bits = 61, static max_count = 2305843009213693950, static
uninitialized_count = 2305843009213693951, m_val = 1569139790, m_quality =
profile_guessed_local}
(gdb) p 1569139790 / 158483
$52 = 9900
(gdb) p cost
$53 = {cost = 20, complexity = 2, scratch = 1}
(gdb) p 19 * 9900
$54 = 188100

As a result, sum_cost soon overflows infinite_cost.  Shall we cap the ratio so
that it doesn't grow too quickly?  Of course, some benchmark data is needed for
this heuristic tuning.
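
Some back-of-the-envelope arithmetic on how quickly this overflows, assuming
infinite_cost is 10,000,000 (the value mentioned in PR90240) and that every use
contributes the 188100 computed above.  Both assumptions are simplifications,
and the helper name is invented for the example.

```c
#include <stdint.h>

/* Smallest number of equal-cost uses whose sum reaches INFTY.  */
static int64_t
uses_until_infinite (int64_t scaled_cost, int64_t infty)
{
  return (infty + scaled_cost - 1) / scaled_cost;   /* ceiling division */
}
```

With the numbers above, uses_until_infinite (188100, 10000000) is 54, so a few
dozen such uses are enough to reach infinite_cost.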


Another problem is that the generated binary segfaults even when compiled at
-O0:

$ ./g++ -O0 pr90078.cc -o a.out -ftemplate-depth=100 -ftime-report  -g
-std=c++14
$ gdb --args ./a.out

Dump of assembler code for function main():
   0x00400572 <+0>:  push   %rbp
   0x00400573 <+1>:  mov    %rsp,%rbp
   0x00400576 <+4>:  sub    $0x2625a020,%rsp
   0x0040057d <+11>: lea    -0x2625a020(%rbp),%rax
   0x00400584 <+18>: mov    %rax,%rdi
=> 0x00400587 <+21>: callq  0x4006c0 ::Tensor4()>
   0x0040058c <+26>: lea    -0x4c4b410(%rbp),%rax
   0x00400593 <+33>: lea    -0xe4e1c10(%rbp),%rdx

The segmentation fault happens at the callq instruction.

[Bug tree-optimization/90021] [9 Regression] ICE in index_in_loop_nest, at tree-data-ref.h:587 since r270203

2019-04-09 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90021

--- Comment #2 from bin cheng  ---
We have {{0, +, 1}_6, +, 1}_4 in this case, and _6 is an outer loop of the
loop_nest.  Function add_multivariate_self_dist was intentionally skipped in the
PR89725 patch, but control flow gets to it because
  1) In analyze_miv_subscript, the equal access_fn case is specially handled,
rather than going through general miv analysis.
  2) In add_other_self_distances, evolution_function_is_univariate_p returns
false for the above access_fn.

It looks like we could also introduce another parameter, loopnum, to
evolution_function_is_univariate_p, just like
evolution_function_is_affine_multivariate_p, to consider the outer loop's chrec
as an invariant symbol here.  OTOH, making changes in
add_multivariate_self_dist still doesn't seem right in this case.

[Bug tree-optimization/90021] [9 Regression] ICE in index_in_loop_nest, at tree-data-ref.h:587 since r270203

2019-04-09 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90021

--- Comment #1 from bin cheng  ---
Sorry for the breakage, I will have a look.

[Bug middle-end/89725] [8/9 Regression] ICE in get_fnname_from_decl, at varasm.c:1723

2019-03-31 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89725

--- Comment #11 from bin cheng  ---
When a data reference has more access functions than the loop_nest used for data
dependence analysis has loops, we need to skip/ignore the access functions
corresponding to loops not in the loop_nest.  So far this only happens in loop
interchange, since we want to reuse data references collected in the outer loop.

When computing the classic dist/dir vectors, we need to avoid out-of-bound
memory accesses.

Univariate SCEVs can simply be bypassed by checking the loop/chrec_variable, as
in the patch in comment #7.  Of course, add_other_self_distances needs to be
handled as well.

On the other hand, bypassing multivariate SCEVs would be harder and the impact
is not yet clear.  However, we can take another strategy: handling the SCEV of
an outer loop as invariant (a symbol) with respect to the loop_nest during
dependence analysis.  As a matter of fact, the current code already does this in
various places, e.g., by calling evolution_function_is_invariant_rec_p etc.
After scanning, I think the only piece missing is in analyze_miv_subscript.

I am testing a patch.

[Bug middle-end/89725] ICE in get_fnname_from_decl, at varasm.c:1723

2019-03-29 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89725

--- Comment #9 from bin cheng  ---
(In reply to Richard Biener from comment #8)
> (In reply to bin cheng from comment #7)
> > I am testing below simple fix, it bypass access functions doesn't belong to
> > analyzing loop_nest:
> > 
> > diff --git a/gcc/tree-data-ref.c b/gcc/tree-data-ref.c
> > index e536b463e96..410d44f43e8 100644
> > --- a/gcc/tree-data-ref.c
> > +++ b/gcc/tree-data-ref.c
> > @@ -4272,6 +4272,7 @@ build_classic_dist_vector_1 (struct
> > data_dependence_relation *ddr,
> >  {
> >unsigned i;
> >lambda_vector init_v = lambda_vector_new (DDR_NB_LOOPS (ddr));
> > +  struct loop *loop = DDR_LOOP_NEST (ddr)[0];
> >  
> >for (i = 0; i < DDR_NUM_SUBSCRIPTS (ddr); i++)
> >  {
> > @@ -4302,6 +4303,15 @@ build_classic_dist_vector_1 (struct
> > data_dependence_relation *ddr,
> >   return false;
> > }
> >  
> > + /* When data references are collected in a loop while data
> > +dependences are analyzed in loop nest nested in the loop, we
> > +would have more number of access functions than number of
> > +loops.  Skip access functions of loops not in the loop nest.
> > +
> > +See PR89725 for more information.  */
> > + if (flow_loop_nested_p (get_loop (cfun, var_a), loop))
> > +   continue;
> > +
> >   dist = int_cst_value (SUB_DISTANCE (subscript));
> >   index = index_in_loop_nest (var_a, DDR_LOOP_NEST (ddr));
> >   *index_carry = MIN (index, *index_carry);
> > 
> > Plus the assert in index_in_loop_nest.
> 
> I wondered about chrecs like { 1, +, { 0 +, 1 }_1 }_2 (inner loop step
> or initial value evolves wrt outer loop).  We'd not catch that here.
> 
> Also if the above is possible then why not simply strip those
> subscripts when we build the DDR?  That way the few other cases
> we do index_in_loop_nest also are "fixed".
> 
> Meanwhile testing of my patch finished but shows an ICE for
> 
> FAIL: gfortran.dg/vect/pr81303.f   -O   scan-tree-dump-times linterchange
> "is in
> terchanged" 1
> FAIL: gfortran.dg/vect/pr81303.f   -O  (internal compiler error)
> FAIL: gfortran.dg/vect/pr81303.f   -O  (test for excess errors)
> 
> #1  0x00a61759 in vec::operator[] (
> this=0x3119f50 = {...}, ix=3)
> at /space/rguenther/src/gcc-sccvn/gcc/vec.h:845
> 845   gcc_checking_assert (ix < m_vecpfx.m_num);
> (gdb) 
> #3  0x01f2723a in should_interchange_loops (i_idx=3, o_idx=2, 
> datarefs=..., i_stmt_cost=41, o_stmt_cost=5, innermost_loops_p=true, 
> dump_info_p=true)
> at /space/rguenther/src/gcc-sccvn/gcc/gimple-loop-interchange.cc:1460
> 1460  tree iloop_stride = (*stride)[i_idx], oloop_stride =
> (*stride)[o_idx];
> 
> where the interchange code would need further changes for my change of the
> loop-nest for DDRs.
> 
> That said, can we strip subscripts for outer loops in
> initialize_data_dependence_relation when we compute them?
> OTOH the cases where we can ignore the subscript are not so clear
> given that the outer loop behavior can very well compute
Agree there may be more opportunities to disambiguate dependence with more
SCEVed access function of outer loop. 

> non-aliasing.  So selectively pruning just the unwanted distance
> vectors looks safe.
As you mentioned, multivariate needs to be handled with the outer loop SCEV
treated as some kind of invariant.  This is necessary whether we bypass it in
dist vector construction or in DDR initialization/computation.  As you
suggested, we can't undo it yet...

> 
> But what about similar code in add_multivariate_self_dist or
> add_other_self_distances?

[Bug middle-end/89725] ICE in get_fnname_from_decl, at varasm.c:1723

2019-03-29 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89725

--- Comment #7 from bin cheng  ---
I am testing the simple fix below; it bypasses access functions that don't
belong to the loop_nest being analyzed:

diff --git a/gcc/tree-data-ref.c b/gcc/tree-data-ref.c
index e536b463e96..410d44f43e8 100644
--- a/gcc/tree-data-ref.c
+++ b/gcc/tree-data-ref.c
@@ -4272,6 +4272,7 @@ build_classic_dist_vector_1 (struct
data_dependence_relation *ddr,
 {
   unsigned i;
   lambda_vector init_v = lambda_vector_new (DDR_NB_LOOPS (ddr));
+  struct loop *loop = DDR_LOOP_NEST (ddr)[0];

   for (i = 0; i < DDR_NUM_SUBSCRIPTS (ddr); i++)
 {
@@ -4302,6 +4303,15 @@ build_classic_dist_vector_1 (struct
data_dependence_relation *ddr,
  return false;
}

+ /* When data references are collected in a loop while data
+dependences are analyzed in loop nest nested in the loop, we
+would have more number of access functions than number of
+loops.  Skip access functions of loops not in the loop nest.
+
+See PR89725 for more information.  */
+ if (flow_loop_nested_p (get_loop (cfun, var_a), loop))
+   continue;
+
  dist = int_cst_value (SUB_DISTANCE (subscript));
  index = index_in_loop_nest (var_a, DDR_LOOP_NEST (ddr));
  *index_carry = MIN (index, *index_carry);

Plus the assert in index_in_loop_nest.
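
The out-of-bound access being avoided can be modelled in a few lines.  In this
sketch a loop nest is just an array of loop numbers and toy_index_in_nest is a
stand-in for index_in_loop_nest; the real function works on a vec of loops, and
the patch above uses flow_loop_nested_p rather than the "not found" test used
here.  All toy_ names are hypothetical.

```c
#include <stddef.h>

/* Return the position of loop number VAR in NEST, or LEN if VAR does
   not belong to the nest (the case the added assert would catch).  */
static size_t
toy_index_in_nest (int var, const int *nest, size_t len)
{
  size_t i;
  for (i = 0; i < len; i++)
    if (nest[i] == var)
      break;
  return i;
}

/* Record DIST for loop VAR in DIST_V, which has LEN entries.  Access
   functions of loops outside the nest are skipped instead of being
   written at DIST_V[LEN].  Returns 1 if recorded, 0 if skipped.  */
static int
toy_record_dist (int var, int dist, const int *nest, size_t len, int *dist_v)
{
  size_t index = toy_index_in_nest (var, nest, len);
  if (index >= len)
    return 0;
  dist_v[index] = dist;
  return 1;
}
```

With a nest of loops {4, 5}, an access function evolving in loop 6 is skipped
and the distance vector is left untouched.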

[Bug middle-end/89725] ICE in get_fnname_from_decl, at varasm.c:1723

2019-03-28 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89725

--- Comment #6 from bin cheng  ---
(In reply to Richard Biener from comment #4)
> I think the issue is that the DDR is bogus - loop interchange computes
> data-refs
> for a deeper nest (including some outer loops) than it ends up doing
> dependence checking later on.  But we have access functions analyzed with
> respect to outer loops already.
> 
> I think it would be possible to handle this in data dependence computation,
> simply treating evolutions in outer loops as invariants.  Eventually the
> access functions evolving in outer loops can also be pruned?  We can't
> really undo SCEV analysis on them.
> 
> I think that Jakubs fix is too conservative though.
> 
> Since we fail when we cannot compute the "invalid" subscript distance at the
> moment the safest fix would probably to create the DDR with the loop-nest
> we originally analyzed?  Bin?
Unfortunately, no.  The access functions are analyzed with respect to the
outer loops in order to cache the find-data-references process and thus save
compilation time.  In fact, we end up computing DDRs with respect to the
deeper loop_nest here because the computation with the originally analyzed
loop_nest has already failed.  So this change wouldn't do anything other than
compute the same DDRs twice (and both would fail).

There may be a couple of ways out.
1. Cancel the data reference caching by collecting DRs for loop_nest.  At this
stage, this might be the safest fix, but it is very expensive.
2. Fix the DDR analysis code, for example as you suggested; or maybe we can
simply bypass the irrelevant part when computing the dir/dist vector?
3. Note we already have prune_data_refs_not_in_loop; we could prune the access
functions too.  Not sure if this is feasible, or whether it's useful enough to
be exposed as a tree-data-ref.h interface.  Will check.


> diff --git a/gcc/tree-data-ref.h b/gcc/tree-data-ref.h
> index 11aa806a64d..54651e903ff 100644
> --- a/gcc/tree-data-ref.h
> +++ b/gcc/tree-data-ref.h
> @@ -585,6 +585,7 @@ index_in_loop_nest (int var, vec<loop_p> loop_nest)
>  if (loopi->num == var)
>break;
>  
> +  gcc_assert (var_index < loop_nest.length ());
>return var_index;
>  }
Guess this code should be included anyway, right?

Thanks

[Bug middle-end/89849] New: Worse code at O3 because of slp

2019-03-27 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89849

Bug ID: 89849
   Summary: Worse code at O3 because of slp
   Product: gcc
   Version: 9.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: amker at gcc dot gnu.org
  Target Milestone: ---

Hi,
This is the code sample from scovit@IRC:


struct ciao { long a; long b; };

//__declspec(noinline)
__attribute((noinline))
struct ciao square(int num) {
    struct ciao beta;
    beta.a = num;
    beta.b = num*num;
    return beta;
}

int main(int a) {
    struct ciao tje = square(a);
    return tje.a * tje.b;
}

O3 generates:
square:
.LFB0:
        .cfi_startproc
        movslq  %edi, %rax
        imull   %edi, %edi
        movq    %rax, %xmm0
        movslq  %edi, %rdi
        movq    %rdi, %xmm1
        punpcklqdq      %xmm1, %xmm0
        movaps  %xmm0, -24(%rsp)
        movq    -24(%rsp), %rax
        movq    -16(%rsp), %rdx
        ret
        .cfi_endproc
.LFE0:
        .size   square, .-square
        .section        .text.startup,"ax",@progbits
        .p2align 4
        .globl  main
        .type   main, @function
main:
.LFB1:
        .cfi_startproc
        subq    $8, %rsp
        .cfi_def_cfa_offset 16
        call    square
        addq    $8, %rsp
        .cfi_def_cfa_offset 8
        imull   %edx, %eax
        ret

While O1/O2 generate:
square:
.LFB0:
        .cfi_startproc
        movslq  %edi, %rax
        imull   %edi, %edi
        movslq  %edi, %rdx
        ret
        .cfi_endproc
.LFE0:
        .size   square, .-square
        .globl  main
        .type   main, @function
main:
.LFB1:
        .cfi_startproc
        call    square
        imull   %edx, %eax
        ret

Looks like SLP gives:
square (int num)
{
  vector(2) long int * vectp.7;
  vector(2) long int * vectp.6;
  struct ciao D.1917;
  long int _1;
  int _2;
  long int _3;
  vector(2) long int _8;
  vector(2) long int vect_cst__9;

  <bb 2> [local count: 1073741824]:
  _1 = (long int) num_4(D);
  _2 = num_4(D) * num_4(D);
  _3 = (long int) _2;
  _8 = {_1, _3};
  vect_cst__9 = _8;
  MEM[(struct ciao *)] = vect_cst__9;
  return D.1917;

}

And later passes fail to clean this up.
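
For illustration only, the two shapes can be written side by side in C. This
is a hand-written analogue, not compiler output: v2l and square_slp_like are
hypothetical names, and the vector_size attribute is a GCC/Clang extension.
The second function builds both fields as one two-lane vector and copies it
into the struct as a whole, which is roughly what the GIMPLE above does,
while the first fills each field with a scalar as O1/O2 do.

```c
#include <assert.h>
#include <string.h>

struct ciao { long a; long b; };

/* Two-lane vector of long (GCC/Clang vector extension).  */
typedef long v2l __attribute__ ((vector_size (2 * sizeof (long))));

/* Scalar form: each field is computed and stored on its own.  */
struct ciao
square_scalar (int num)
{
  struct ciao beta;
  beta.a = num;
  beta.b = num * num;
  return beta;
}

/* SLP-like form: build {a, b} as a vector, then store it over the
   whole struct in one go.  */
struct ciao
square_slp_like (int num)
{
  struct ciao beta;
  v2l v = { (long) num, (long) (num * num) };

  /* The struct and the vector must have the same size for the copy.  */
  assert (sizeof beta == sizeof v);
  memcpy (&beta, &v, sizeof beta);
  return beta;
}
```

Both functions return the same value; the point of the report is only that
the second shape costs extra vector build, store and reload instructions
when the struct is returned in registers.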

[Bug testsuite/89834] New test case gcc.dg/vect/pr81740-2.c introduced in r269938 fails on power 7

2019-03-26 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89834

--- Comment #5 from bin cheng  ---
Thanks very much for reporting and fixing the issue.

[Bug rtl-optimization/89487] [7/8 Regression] ICE in expand_expr_addr_expr_1, at expr.c:7993

2019-03-16 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89487

--- Comment #9 from bin cheng  ---
(In reply to Jakub Jelinek from comment #8)
> *** Bug 89731 has been marked as a duplicate of this bug. ***

Hi Jakub, is this (and the duplicate) fixed by the previous patches, or is
the issue still there?  Thanks.
