[Bug tree-optimization/102131] [12 Regression] wrong code at -O1 and above on x86_64-linux-gnu
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102131

--- Comment #4 from bin cheng ---
(In reply to Jiu Fu Guo from comment #3)
> The issue may come from 'iv0 cmp iv1' transform:
>
> if (c -->if (c>=b) in-loop
> -->if (b<=c) in-loop
>
> c: {4, +, 3}
> b: {1, +, 1}
>
> if ({1, +, 1} <= {4, +, 3})
> ==> if ({1,+,-2} <= {4,+,0}) here, error occur
> ==> if ({1,+,-2} < {5,+,0}) le-->lt

So is this a duplicate of PR100740?

Thanks
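
A minimal sketch of why the quoted step can go wrong, assuming both IVs are evaluated in a 32-bit unsigned type (the quoted comment does not state the type, so this is an assumption). The helper name first_divergence is hypothetical and is not GCC code. It compares the original test b <= c with the folded test {1,+,-2} <= 4 and reports the first iteration at which they disagree.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not GCC code: return the first iteration at which
   the original test (b <= c) and the folded test ({1,+,-2} <= 4)
   disagree, or -1 if they agree for every iteration below max_iter.  */
int
first_divergence (uint32_t max_iter)
{
  assert (max_iter <= (uint32_t) INT32_MAX);
  for (uint32_t i = 0; i < max_iter; i++)
    {
      uint32_t b = 1u + i;           /* b: {1, +, 1} */
      uint32_t c = 4u + 3u * i;      /* c: {4, +, 3} */
      uint32_t folded = 1u - 2u * i; /* {1, +, -2}; wraps once i >= 1 */
      int original_test = b <= c;
      int folded_test = folded <= 4u;
      if (original_test != folded_test)
        return (int) i;
    }
  return -1;
}
```

Under these assumptions the two tests already disagree at iteration 1, because 1 - 2 wraps to a large unsigned value while b <= c still holds.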
[Bug tree-optimization/101145] niter analysis fails for until-wrap condition
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101145

--- Comment #7 from bin cheng ---
(In reply to Jiu Fu Guo from comment #5)
> (In reply to bin cheng from comment #4)
> > (In reply to Jiu Fu Guo from comment #3)
> > > Yes, while the code in adjust_cond_for_loop_until_wrap seems somehow
> > > tricky:
> > >
> > >   /* Only support simple cases for the moment. */
> > >   if (TREE_CODE (iv0->base) != INTEGER_CST
> > >       || TREE_CODE (iv1->base) != INTEGER_CST)
> > >     return false;
> > >
> > > This code requires both sides are constant.
> > Actually it requires an IV with constant base.
>
> I also feel that the intention of this function may only require one side
> constant for IV0 CODE IV1.
> As tests, for below loop, adjust_cond_for_loop_until_wrap return false:
>
> foo (int *__restrict__ a, int *__restrict__ b, unsigned i)
> {
>   while (++i > 100)
>     *a++ = *b++ + 1;
> }
>
> For below code, adjust_cond_for_loop_until_wrap returns true:
>   i = UINT_MAX - 200;
>   while (++i > 100)
>     *a++ = *b++ + 1;

Oh, sorry for being misleading. When I mentioned that it requires something (...), I was describing the current behavior, not saying that the conditions are necessary. Feel free to improve such cases. Looking into niter analysis, these cases (trade-offs) are not rare.

Thanks
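
For the until-wrap loop quoted above, the iteration count can be sketched as follows. Both function names are hypothetical and this is not the niter analysis code. The closed form assumes the first test succeeds, which holds for the quoted case i = UINT_MAX - 200 with bound 100.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical helper: count body executions of
     while (++i > bound) body;
   by actually running the loop.  It always terminates because i
   eventually wraps to 0, and 0 > bound is false.  */
unsigned
until_wrap_niter_brute (unsigned i, unsigned bound)
{
  unsigned n = 0;
  while (++i > bound)
    n++;
  return n;
}

/* Closed form, assuming the first test succeeds (i != UINT_MAX and
   i + 1 > bound): the body runs for the values i + 1 .. UINT_MAX,
   then ++i wraps to 0 and the loop exits.  */
unsigned
until_wrap_niter_closed (unsigned i, unsigned bound)
{
  assert (i != UINT_MAX && i + 1 > bound);
  return UINT_MAX - i;
}
```

With i = UINT_MAX - 200 and bound 100, both functions give 200 iterations.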
[Bug tree-optimization/101145] niter analysis fails for until-wrap condition
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101145

--- Comment #4 from bin cheng ---
(In reply to Jiu Fu Guo from comment #3)
> Yes, while the code in adjust_cond_for_loop_until_wrap seems somehow tricky:
>
>   /* Only support simple cases for the moment. */
>   if (TREE_CODE (iv0->base) != INTEGER_CST
>       || TREE_CODE (iv1->base) != INTEGER_CST)
>     return false;
>
> This code requires both sides are constant.

Actually it requires an IV with constant base.
[Bug tree-optimization/101145] niter analysis fails for until-wrap condition
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101145

--- Comment #2 from bin cheng ---
(In reply to Richard Biener from comment #1)
> This comes up with a pending patch to split loops like
>
> void
> foo (int *a, int *b, unsigned l, unsigned n)
> {
>   while (++l != n)
>     a[l] = b[l] + 1;
> }
>
> into
>
>   while (++l > n)
>     a[l] = b[l] + 1;
>   while (++l < n)
>     a[l] = b[l] + 1;
>
> since for the second loop (the "usual" case involving no wrapping of the IV)
> this results in affine IVs and thus analyzable data dependence.

Special cases like "i++ > constant" are handled in function adjust_cond_for_loop_until_wrap; however, it only handles a constant invariant on the other side right now. Will see how to cover simple cases as reported here.
[Bug tree-optimization/101173] [9/10/11/12 Regression] wrong code at -O3 on x86_64-linux-gnu
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101173

--- Comment #5 from bin cheng ---
(In reply to Richard Biener from comment #3)
> So we're exchanging the inner two loops
>
>   a[1][3] = 8;
>   for (int b = 1; b <= 5; b++)
>     for (int d = 0; d <= 5; d++)
>       for (c = 0; c <= 5; c++)
>         a[b][c] = a[b][c + 2] & 216;
>
> to
>
>   a[1][3] = 8;
>   for (int b = 1; b <= 5; b++)
>     for (c = 0; c <= 5; c++)
>       for (int d = 0; d <= 5; d++)
>         a[b][c] = a[b][c + 2] & 216;
>
> but that looks wrong from a dependence analysis perspective. We have
>
> (compute_affine_dependence
> ref_a: a[b_33][_1], stmt_a: _2 = a[b_33][_1];
> ref_b: a[b_33][c.3_32], stmt_b: a[b_33][c.3_32] = _3;
> (analyze_overlapping_iterations
> (chrec_a = {2, +, 1}_5)
> (chrec_b = {0, +, 1}_5)
> (analyze_siv_subscript
> (analyze_subscript_affine_affine
> (overlaps_a = [0 + 1 * x_1])
> (overlaps_b = [2 + 1 * x_1]))
> )
> (overlap_iterations_a = [0 + 1 * x_1])
> (overlap_iterations_b = [2 + 1 * x_1]))
> (analyze_overlapping_iterations
> (chrec_a = {1, +, 1}_1)
> (chrec_b = {1, +, 1}_1)
> (overlap_iterations_a = [0])
> (overlap_iterations_b = [0]))
> (analyze_overlapping_iterations
> (chrec_a = {0, +, 1}_5)
> (chrec_b = {2, +, 1}_5)
> (analyze_siv_subscript
> (analyze_subscript_affine_affine
> (overlaps_a = [2 + 1 * x_1])
> (overlaps_b = [0 + 1 * x_1]))
> )
> (overlap_iterations_a = [2 + 1 * x_1])
> (overlap_iterations_b = [0 + 1 * x_1]))
> (analyze_overlapping_iterations
> (chrec_a = {1, +, 1}_1)
> (chrec_b = {1, +, 1}_1)
> (overlap_iterations_a = [0])
> (overlap_iterations_b = [0]))
> (build_classic_dist_vector
> dist_vector = ( 0 0 2
> )
> )
> )
>
> I don't see anything wrong with that at a first glance so the bug must be in
> tree_loop_interchange::valid_data_dependences it checks
>
>   /* Be conservative, skip case if either direction at i_idx/o_idx
>      levels is not '=' or '<'. */
>   if (dist_vect[i_idx] < 0 || dist_vect[o_idx] < 0)
>     return false;
>
> dist_vect is [0 0 2], i_idx 2 and o_idx 1 but I think that dist_vect[o_idx]
> should exclude zero, thus
>
> diff --git a/gcc/gimple-loop-interchange.cc b/gcc/gimple-loop-interchange.cc
> index f45b9364644..265e36c48d4 100644
> --- a/gcc/gimple-loop-interchange.cc
> +++ b/gcc/gimple-loop-interchange.cc
> @@ -1043,8 +1043,8 @@ tree_loop_interchange::valid_data_dependences
> (unsigned i_idx, unsigned o_idx,
>             continue;
>
>           /* Be conservative, skip case if either direction at i_idx/o_idx
> -            levels is not '=' or '<'. */
> -         if (dist_vect[i_idx] < 0 || dist_vect[o_idx] < 0)
> +            levels is not '=' (for the inner loop) or '<'. */
> +         if (dist_vect[i_idx] < 0 || dist_vect[o_idx] <= 0)
>             return false;
>         }
>      }
>
> Bin - does this analysis look sound?

Hi Richard,

Thanks very much for helping with this. Sorry, I need a bit more time to answer this question.

Thanks again.
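
The two loop nests quoted above can be compared directly. This is a small sketch, not the PR's testcase: the array bounds (6 x 8) and zero initialization are assumptions, chosen so that the read of a[b][c + 2] stays in bounds, and the function names are hypothetical.

```c
#include <assert.h>
#include <string.h>

enum { ROWS = 6, COLS = 8 };  /* assumed bounds, not from the PR */

/* Loop order of the first quoted nest: d outside c.  */
int
run_original (void)
{
  int a[ROWS][COLS];
  assert (5 + 2 < COLS);  /* largest column read stays in bounds */
  memset (a, 0, sizeof a);
  a[1][3] = 8;
  for (int b = 1; b <= 5; b++)
    for (int d = 0; d <= 5; d++)
      for (int c = 0; c <= 5; c++)
        a[b][c] = a[b][c + 2] & 216;
  return a[1][1];
}

/* Loop order after exchanging the two inner loops: c outside d.  */
int
run_interchanged (void)
{
  int a[ROWS][COLS];
  assert (5 + 2 < COLS);
  memset (a, 0, sizeof a);
  a[1][3] = 8;
  for (int b = 1; b <= 5; b++)
    for (int c = 0; c <= 5; c++)
      for (int d = 0; d <= 5; d++)
        a[b][c] = a[b][c + 2] & 216;
  return a[1][1];
}
```

In the original order, a[1][1] picks up the 8 during d = 0 and is overwritten with 0 during d = 1, after a[1][3] has been cleared. In the interchanged order, a[1][1] is finished before a[1][3] is ever written, so it keeps the 8. The two orders therefore produce different final arrays under these assumptions.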
[Bug tree-optimization/100740] [9/10/11/12 Regression] wrong code at -O1 and above on x86_64-linux-gnu since r9-4145
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100740

bin cheng changed:

           What    |Removed                       |Added
----------------------------------------------------------------------------
           Assignee|unassigned at gcc dot gnu.org |amker at gcc dot gnu.org

--- Comment #2 from bin cheng ---
Mine. Sorry for the breakage.
[Bug tree-optimization/100499] Different results with -fpeel-loops -ftree-loop-vectorize options
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100499

--- Comment #19 from bin cheng ---
(In reply to bin cheng from comment #18)
> Did some experiments, there are two fallouts after explicitly returning
> false for unsigned/wrapping types in MULT_EXPR/MINUS_EXPR/PLUS_EXPR. One is
> the mentioned use of multiple_of_p in number_of_iterations_ne, the other is
> for alignment warning in stor-layout.c. As pointed out, the latter case is
> known not overflow/wrap.
>
> So I am thinking to introduce an additional parameter indicating that caller
> knows "top" doesn't overflow/wrap, otherwise, try to get rid of the
> undocumented assumption. we can always improve the accuracy using ranger or
> other tools. Not sure if this is the right way to do.
>
> As for MULT_NO_OVERFLOW/PLUS_NO_OVERFLOW, IMHO, it's not that simple? For
> example, unsigned_num(multiple of 4, and larger than 0) + 0xfffc is
> multiple of 4, but it's overflow behavior on which we rely here.

Hmm, 4 is special and not a correct example. Consider:

  n (unsigned, multiple of 3, and > 0) + 0xfffd

It's a multiple of 3, but we need to rely on wrapping to get that answer.
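
A small sketch of the last example, assuming a 16-bit wrapping type so that the constant 0xfffd as written equals 2^16 - 3 (the width is an assumption, and the function names are hypothetical). It computes the sum once with wrapping and once exactly in a wider type.

```c
#include <assert.h>
#include <stdint.h>

/* n + 0xfffd computed in a 16-bit wrapping type (assumed width).  */
uint16_t
add_wrapping (uint16_t n)
{
  assert (n > 0 && n % 3 == 0);
  return (uint16_t) (n + 0xfffdu);
}

/* The same sum computed without wrapping, in a wider type.  */
uint32_t
add_exact (uint16_t n)
{
  assert (n > 0 && n % 3 == 0);
  return (uint32_t) n + 0xfffdu;
}
```

For n = 3 the wrapped sum is 0, a multiple of 3, while the exact sum 65536 leaves remainder 1. The "multiple of 3" answer therefore holds only because of the wrap.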
[Bug tree-optimization/100499] Different results with -fpeel-loops -ftree-loop-vectorize options
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100499

--- Comment #18 from bin cheng ---
Did some experiments; there are two fallouts after explicitly returning false for unsigned/wrapping types in MULT_EXPR/MINUS_EXPR/PLUS_EXPR. One is the mentioned use of multiple_of_p in number_of_iterations_ne, the other is for the alignment warning in stor-layout.c. As pointed out, the latter case is known not to overflow/wrap.

So I am thinking of introducing an additional parameter indicating that the caller knows "top" doesn't overflow/wrap; otherwise, try to get rid of the undocumented assumption. We can always improve the accuracy using ranger or other tools. Not sure if this is the right way to go.

As for MULT_NO_OVERFLOW/PLUS_NO_OVERFLOW, IMHO, it's not that simple? For example, unsigned_num (multiple of 4, and larger than 0) + 0xfffc is a multiple of 4, but it's the overflow behavior that we rely on here.
[Bug tree-optimization/100499] Different results with -fpeel-loops -ftree-loop-vectorize options
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100499

--- Comment #14 from bin cheng ---
(In reply to Richard Biener from comment #12)
> So in number_of_iterations_ne it looks like the step 's' is always constant
> which makes me wonder if we can somehow use ranger to tell multiple_of_p
> (type, c, s)
> or at least whether, if c is x * s, the multiplication could have overflowed?

Yeah, I am looking into whether "multiple of" can feasibly be checked in niter analysis, with the help of some basic information from multiple_of_p. BTW, I have not been following the changes in "ranger"; how should I use it in the analysis? Is it similar to value range info?

Thanks
[Bug tree-optimization/100499] Different results with -fpeel-loops -ftree-loop-vectorize options
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100499

--- Comment #13 from bin cheng ---
(In reply to Richard Biener from comment #10)
> (In reply to bin cheng from comment #9)
> > Seems we have a long standing bug in fold-const.c:multiple_of_p in case of
> > wrapping types. Take unsigned int as an example:
> > (0xfffc * 0x3) % 0x3 = 0x1
> > But multiple_of_p returns true here.
> >
> > The same issue also stands for MINUS_EXPR and PLUS_EXPR. Given
> > multiple_of_p is used elsewhere, the fix might break existing optimizations.
> > Especially, number of loop iterations is computed in unsigned types
>
> multiple_of_p is mostly used in contexts where overflow "cannot happen"
> (in TYPE/DECL_SIZE computation context), and in niter analysis it seems to
> be guarded similarly. This restriction of multiple_of_p seems undocumented,

Oh, I was not aware of this. Actually, my previous change to it seems to have broken this assumption already. Will see how to fix or revert the change.

> so fixing that might be good.
>
> Now, you don't say what's the chain of events that lead to a multiple_of_p
> call
> eventually leading to the wrong answer, but I guess it's the code added
> under the
>
> + if (!niter->control.no_overflow
> +     && (integer_onep (s) || multiple_of_p (type, c, s)))
>
> check as !niter->control.no_overflow seems to suggest that the multiple_of_p
> check is not properly guarded?
[Bug tree-optimization/90078] [9 Regression] ICE with deep templates caused by overflow
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90078 --- Comment #19 from bin cheng --- I will check if the latter fix can be easily backported to GCC-9.
[Bug tree-optimization/100499] Different results with -fpeel-loops -ftree-loop-vectorize options
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100499

--- Comment #9 from bin cheng ---
Seems we have a long-standing bug in fold-const.c:multiple_of_p in the case of wrapping types. Take unsigned int as an example:

  (0xfffc * 0x3) % 0x3 = 0x1

But multiple_of_p returns true here.

The same issue also stands for MINUS_EXPR and PLUS_EXPR. Given that multiple_of_p is used elsewhere, the fix might break existing optimizations, especially since the number of loop iterations is computed in unsigned types.
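
The arithmetic in this example can be checked with a short sketch. It assumes a 16-bit wrapping type, which is the width at which the constant 0xfffc as written makes the product wrap. That width is an assumption on my part, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Remainder of (x * 3) by 3 when the product wraps at 16 bits (assumed
   width).  The operands are promoted for the multiplication, so the
   cast back to uint16_t is what performs the wrap.  */
unsigned
wrapped_product_mod3 (uint16_t x)
{
  uint16_t product = (uint16_t) (x * 3u);
  return product % 3u;
}

/* Usage: a product that wraps is no longer a multiple of 3, while a
   product that does not wrap still is.  */
void
wrapped_product_demo (void)
{
  assert (wrapped_product_mod3 (0xfffc) == 1);
  assert (wrapped_product_mod3 (4) == 0);
}
```

So "x * 3 is a multiple of 3" is only a safe conclusion when the multiplication is known not to wrap.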
[Bug tree-optimization/100499] Different results with -fpeel-loops -ftree-loop-vectorize options
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100499

bin cheng changed:

           What    |Removed                       |Added
----------------------------------------------------------------------------
           Assignee|unassigned at gcc dot gnu.org |amker at gcc dot gnu.org

--- Comment #7 from bin cheng ---
(In reply to Martin Liška from comment #4)
> (In reply to Martin Liška from comment #3)
> > But expected result is end g_2823 = 32768, right?
> > Clang returns the same result 32768.
>
> Which regresses since r7-2373-g69b806f6a60efcf1.

Hmm, that was a fix from long ago. Will investigate this. Sorry for the breakage.
[Bug tree-optimization/98736] [10/11 Regression] Wrong partition order generated in loop distribution pass since r10-619-g5879ab5fafedc8f6
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98736 --- Comment #6 from bin cheng --- Shall this be backported to 10/11 later? Thanks.
[Bug tree-optimization/95638] [10 Regression] Legit-looking code doesn't work with -O2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95638

bin cheng changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
      Known to work|                            |10.2.0
         Resolution|---                         |FIXED

--- Comment #15 from bin cheng ---
Confirmed fixed for 10.2.0 also. Closing.
[Bug tree-optimization/98736] [10/11 Regression] Wrong partition order generated in loop distribution pass since r10-619-g5879ab5fafedc8f6
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98736

--- Comment #4 from bin cheng ---
(In reply to bin cheng from comment #3)
> hmm, seems topological order isn't enough for distributing a loop nest, we
> need topological order plus inner loop depth-first.

Well, not really. In this case, the problem is that the reverse post-order algorithm puts "a[c] = d[3];" before the inner loop, which violates the original program order. It seems this can be fixed by using inner-loop depth-first order w.r.t. how we distribute the inner loop, but I am not sure if this always preserves program order, because the loop has been reformed by various optimizers.
[Bug tree-optimization/99067] Missed optimization for induction variable elimination
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99067

--- Comment #3 from bin cheng ---
Though not sure if the underlying root causes are the same, I think these are two different issues; at least, they are handled by different parts of the code in IVOPTs. The first one is a known issue in GCC: IV elimination is complicated yet has been quite conservative for a long time. For the second one, we indeed don't know whether "i*N+j" wraps or not. Even so, we might be able to improve IVOPTs under the condition of wrapping behavior.
[Bug tree-optimization/98736] [10/11 Regression] Wrong partition order generated in loop distribution pass since r10-619-g5879ab5fafedc8f6
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98736 --- Comment #3 from bin cheng --- hmm, seems topological order isn't enough for distributing a loop nest, we need topological order plus inner loop depth-first.
[Bug tree-optimization/99067] Missed optimization for induction variable elimination
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99067

bin cheng changed:

           What    |Removed                       |Added
----------------------------------------------------------------------------
           Assignee|unassigned at gcc dot gnu.org |amker at gcc dot gnu.org

--- Comment #2 from bin cheng ---
Mine, will have a look. Thanks for reporting.
[Bug c++/97627] [9/10/11 Regression] loop end condition missing - endless loop with -fPIC
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97627

--- Comment #12 from bin cheng ---
a. Why is the loop considered infinite?
b. Do we need to skip fake exit edges in niter analysis?
[Bug c++/97627] [9/10/11 Regression] loop end condition missing - endless loop with -fPIC
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97627

--- Comment #11 from bin cheng ---
(In reply to bin cheng from comment #10)
> hmm,
> For below basic block:
>   ;; basic block 4, loop depth 2, maybe hot
>   ;;  prev block 3, next block 9, flags: (NEW, VISITED)
>   ;;  pred:       3 (FALLTHRU,EXECUTABLE)
>   ;;              7 (FALLTHRU,DFS_BACK,EXECUTABLE)
>   # RANGE [0, 2147483647] NONZERO 2147483647
>   # c_5 = PHI <0(3), c_17(7)>
>   # .MEM_8 = PHI <.MEM_7(3), .MEM_9(7)>
>   if (_2 < c_5)
>     goto <bb 8>; [INV]
>   else
>     goto <bb 9>; [INV]
>   ;;  succ:       8 (TRUE_VALUE,EXECUTABLE)
>   ;;              9 (FALSE_VALUE,EXECUTABLE)
>
> Code in :
>
>   basic_block *body = get_loop_body (loop);
>   exits = get_loop_exit_edges (loop, body);
>   likely_exit = single_likely_exit (loop, exits);
>   FOR_EACH_VEC_ELT (exits, i, ex)
>     {
>       if (ex == likely_exit)
>         {
>           gimple *stmt = last_stmt (ex->src);
>           if (stmt != NULL)
>             {
>
> gets three exit edges, one of which is bb1>, as a result, 0 niter is
> computed for this exit in function number_of_iterations_exit_assumptions.
> This seems strange, is it a fake edge added for some reason?
>
> Thanks

Right, it's added by connect_infinite_loops_to_exit.
[Bug c++/97627] [9/10/11 Regression] loop end condition missing - endless loop with -fPIC
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97627

--- Comment #10 from bin cheng ---
hmm,
For below basic block:
  ;; basic block 4, loop depth 2, maybe hot
  ;;  prev block 3, next block 9, flags: (NEW, VISITED)
  ;;  pred:       3 (FALLTHRU,EXECUTABLE)
  ;;              7 (FALLTHRU,DFS_BACK,EXECUTABLE)
  # RANGE [0, 2147483647] NONZERO 2147483647
  # c_5 = PHI <0(3), c_17(7)>
  # .MEM_8 = PHI <.MEM_7(3), .MEM_9(7)>
  if (_2 < c_5)
    goto <bb 8>; [INV]
  else
    goto <bb 9>; [INV]
  ;;  succ:       8 (TRUE_VALUE,EXECUTABLE)
  ;;              9 (FALSE_VALUE,EXECUTABLE)

Code in :

  basic_block *body = get_loop_body (loop);
  exits = get_loop_exit_edges (loop, body);
  likely_exit = single_likely_exit (loop, exits);
  FOR_EACH_VEC_ELT (exits, i, ex)
    {
      if (ex == likely_exit)
        {
          gimple *stmt = last_stmt (ex->src);
          if (stmt != NULL)
            {

gets three exit edges, one of which is bb1>, as a result, 0 niter is computed for this exit in function number_of_iterations_exit_assumptions. This seems strange, is it a fake edge added for some reason?

Thanks
[Bug c++/97627] [9/10/11 Regression] loop end condition missing - endless loop with -fPIC
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97627

--- Comment #9 from bin cheng ---
(In reply to Jakub Jelinek from comment #8)
> Still broken on current 10 branch, as written works fine on the trunk due to
> the C++ FE loop changes.
> Bin, did you have time to look into this yet?

I am very sorry; it seems I have two correctness PRs now. Will try to investigate these this weekend.
[Bug tree-optimization/98598] Missed opportunity to optimize dependent loads in loops
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98598

--- Comment #4 from bin cheng ---
Didn't go deep into the case. For the simple cases taken as examples, it's possible to interchange the two loops, thus enabling loop invariant code motion. Though loop interchange may fail because of complicated data dependences, we may take some useful points from it, for example, the cost model checking new loop invariants w.r.t. the outer loop.
[Bug c++/97627] [9/10/11 Regression] loop end condition missing - endless loop with -fPIC
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97627 --- Comment #5 from bin cheng --- (In reply to Jakub Jelinek from comment #3) > Started with r9-4145-ga81e2c6240655f60a49c16e0d8bbfd2ba40bba51 Sorry for the breakage. Will fix this.
[Bug tree-optimization/78427] missed optimization of loop condition
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78427

bin cheng changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amker at gcc dot gnu.org

--- Comment #5 from bin cheng ---
(In reply to Antony Polukhin from comment #4)
> Any progress?

Oh, I missed this one. Will try to find time later.

Thanks
[Bug target/96201] x86 movsd/movsq string instructions and alignment inference
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96201

bin cheng changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amker at gcc dot gnu.org

--- Comment #2 from bin cheng ---
The reason is that memory references in f3 are not identified as address type IV uses. I don't remember the details, but it's intended by the commit below:

commit 653a4b32fe72e33bfd4cdd4c25493049524a3805
Author: Bin Cheng
Date:   Thu Mar 2 11:25:11 2017 +

    re PR tree-optimization/66768 (address space gets lost on literal pointer)

    PR tree-optimization/66768
    * tree-ssa-loop-ivopts.c (find_interesting_uses_address): Skip addr
    iv_use if base object can't be determined.

    gcc/testsuite
    * gcc.target/i386/pr66768.c: New test.

    From-SVN: r245837

For f1/f2, IVOPTs fails to identify the base object because the pointers are converted from integers. We need to tell the difference better. For f3, __builtin_assume_aligned is optimized away by GCC-10 before IVOPTs.
[Bug tree-optimization/95638] [10 Regression] Legit-looking code doesn't work with -O2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95638 --- Comment #14 from bin cheng --- (In reply to Richard Biener from comment #13) > GCC 10.2 is released, adjusting target milestone. Hmm, this should be fixed on GCC10/GCC9. I backported PR95638/PR95804 separately using cherry-pick, so the backport information for latter PR is not reflected here.
[Bug rtl-optimization/96031] suboptimal codegen for store low 16-bits value
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96031

--- Comment #2 from bin cheng ---
Interesting case, I see two issues in the generated asm. One is the unnecessary bitwise and, the other is allocating different registers for the induction variable and the base address. However, it looks like neither issue is caused by ivopts. Check the dump:

  [local count: 105119324]:
  _12 = (short unsigned int) step_8(D);
  ivtmp.10_11 = (unsigned long)
  _18 = len_7(D) + 4294967294;
  _19 = (unsigned long) _18;
  _20 = _19 * 2;
  _21 = (unsigned long)
  _22 = _21 + 2;
  _23 = _20 + _22;

  [local count: 955630224]:
  # ivtmp.8_15 = PHI <_12(4), ivtmp.8_5(6)>
  # ivtmp.10_16 = PHI
  _3 = ivtmp.8_15;
  _2 = (void *) ivtmp.10_16;
  MEM[base: _2, offset: 2B] = _3;
  ivtmp.8_5 = ivtmp.8_15 + _12;
  ivtmp.10_4 = ivtmp.10_16 + 2;
  if (ivtmp.10_4 != _23)
    goto ; [89.00%]
  else
    goto ; [11.00%]

  [local count: 105119324]:
  goto ; [100.00%]

  [local count: 850510900]:
  goto ; [100.00%]

As far as I can tell, it's optimal. The register allocation issue is introduced by rtl PRE; apparently we should not save the add 2 instruction in the last iteration with a false dependence, which is more harmful. As for ivopt, I can see a minor improvement by replacing the != exit condition with <=, thus saving the add 2 instruction computing _22, which happens to "disable" the wrong PRE transformation.

Ah, I see it's already classified as rtl-optimization.

Thanks
[Bug tree-optimization/95804] [11 Regression] ICE in generate_code_for_partition, at tree-loop-distribution.c:1323 since r11-1565-g2c0069fafb53ccb7
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95804

--- Comment #11 from bin cheng ---
(In reply to Richard Biener from comment #8)
> Fixed - note it needs to be backported when the PR95638 fix is backported.

I backported PR95638/PR95804 to the GCC-10/GCC-9 branches. However, it is unnecessary to backport to GCC-8, because the starting issue (pr94125) is not exposed on it.
[Bug tree-optimization/95804] [11 Regression] ICE in generate_code_for_partition, at tree-loop-distribution.c:1323 since r11-1565-g2c0069fafb53ccb7
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95804 --- Comment #6 from bin cheng --- (In reply to Martin Liška from comment #5) > @Bin: Any news about this? Patch is approved, will apply soon. Thanks
[Bug tree-optimization/95638] [10/11 Regression] Legit-looking code doesn't work with -O2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95638 --- Comment #9 from bin cheng --- (In reply to Jakub Jelinek from comment #8) > So fixed on the trunk, waiting for 10 backport? Sorry, https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95804 is also in this part which I believe is related to this fix. Will backport the full patch after fixing 95804. Thanks
[Bug tree-optimization/95804] ice in generate_code_for_partition, at tree-loop-distribution.c:1323
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95804

--- Comment #2 from bin cheng ---
(In reply to Richard Biener from comment #1)
> Confirmed. We seem to end up with a reduction partition not in the last
> position thus miss some required partition merging.

Sorry for the breakage. Whew, this part IS a can of worms. Will investigate it.
[Bug tree-optimization/94969] [8/10 Regression] Invalid loop distribution since r8-2390-gdfbddbeb1ca912c9
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94969

--- Comment #16 from bin cheng ---
(In reply to Richard Biener from comment #15)
> I don't see the commit on the GCC 10 branch nor the GCC 8 branch. Master
> and GCC 9 are fixed though.

Will backport to 10 and 8, thanks for the reminder.
[Bug c++/95638] [10/11 Regression] Legit-looking code doesn't work with -O2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95638 --- Comment #6 from bin cheng --- We call graphds_scc twice to break alias dependence, with alias dependence edges skipped in the second call. The code (both before and after r10-7184-ge4e9a59105a81cdd6c1328b0a5ed9fe4cc82840e) tries to rectify post order information after the second call, however it never gets it right. Actually I don't think it can be easily rectified (if possible). Will test another patch which records/restores post order information for the second call.
[Bug c++/95638] Legit-looking code doesn't work with -O2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95638

--- Comment #5 from bin cheng ---
(In reply to Jakub Jelinek from comment #1)
> All I can say is that bisection shows (at least when preprocessed with g++
> 8.3.1 first) that this changed behavior in
> r10-7184-ge4e9a59105a81cdd6c1328b0a5ed9fe4cc82840e
> No time to analyze if it is a bug in the code or on the GCC side.
> CCing patch author.

Thanks for CCing, I will look into it this weekend.
[Bug tree-optimization/95199] Remove extra variable created for memory reference in loop vectorization.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95199 --- Comment #7 from bin cheng --- (In reply to rguent...@suse.de from comment #6) > On Thu, 21 May 2020, zhoukaipeng3 at huawei dot com wrote: > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95199 > > > > --- Comment #4 from Kaipeng Zhou --- > > Sorry for not expressing clearly. > > > > I have debugged the testcase you provided. Not eliminating them is not > > caused > > by IFN. The relevant code is in the "get_computation_aff_1" function. > > > > In IVOPTs the IV_STEPs must be checked by function "constant_multiple_of" > > before using an IV variable to eliminate the other. But if the tree_code of > > input IV_STEP is SSA_NAME, the function will return false. In your > > testcase, > > the tree_code of IV_STEP is MULT_EXPR, so it return true. > > > > Gimple for my testcase: > >[local count: 8589933]: > > _83 = (sizetype) inc_y_22(D); > > _84 = _83 * POLY_INT_CST [16, 16]; > > _85 = (long unsigned int) inc_y_22(D); > > _86 = _85 * 8; > > _87 = (ssizetype) _86; > > _88 = _87 /[ex] 8; > > _89 = (long unsigned int) _88; > > _90 = VEC_SERIES_EXPR <0, _89>; > > vect_cst__95 = [vec_duplicate_expr] m_17(D); > > _97 = (sizetype) inc_x_20(D); > > _98 = _97 * POLY_INT_CST [16, 16]; > > _99 = (long unsigned int) inc_x_20(D); > > _100 = _99 * 8; > > _101 = (ssizetype) _100; > > _102 = _101 /[ex] 8; > > _103 = (long unsigned int) _102; > > _104 = VEC_SERIES_EXPR <0, _103>; > > _109 = (sizetype) inc_x_20(D); > > _110 = _109 * POLY_INT_CST [16, 16]; > > _111 = (long unsigned int) inc_x_20(D); > > The issue is you have two copies of > (sizetype) inc_x_20(D) * POLY_INT_CST [16, 16]; > and IVOPTs does not perform CSE. vinfo->ivexpr_map is supposed to > catch those "IV base and/or step expressions". So look where > they are inserted and check the CSE map is used. Alternatively > fixup hashing/comparing to handle POLY_INT_CST [16, 16] if that > is the reason for the missed CSE. 
>

Yes, it's because cse_and_gimplify_to_preheader is not called for gathering/scattering. Should be easily fixed by the following patch:

diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
index e7822c44951..ba9ee5c4996 100644
--- a/gcc/tree-vect-stmts.c
+++ b/gcc/tree-vect-stmts.c
@@ -2961,6 +2961,7 @@ vect_get_strided_load_store_ops (stmt_vec_info stmt_info,
   tree bump = size_binop (MULT_EXPR,
                           fold_convert (sizetype, unshare_expr (DR_STEP (dr))),
                           size_int (TYPE_VECTOR_SUBPARTS (vectype)));
+  bump = cse_and_gimplify_to_preheader (loop_vinfo, bump);
   *dataref_bump = force_gimple_operand (bump, &stmts, true, NULL_TREE);
   if (stmts)
     gsi_insert_seq_on_edge_immediate (loop_preheader_edge (loop), stmts);
[Bug tree-optimization/95199] Remove extra variable created for memory reference in loop vectorization.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95199

--- Comment #5 from bin cheng ---
(In reply to Richard Biener from comment #1)
> But IVOPTs is supposed to know how to eliminate equal IVs. Maybe it's
> confused
> about the IFN uses?

It's a known issue that IVOPTs has difficulty recognizing equal BASEs. For now it tries to identify/eliminate them with limited expansion work, which isn't enough for complicated cases. I sent a patch to do IVOPTs a favor in vectorization, but didn't follow up. Without digging into the code, I am not sure if this is a similar issue. Will have a look this weekend.

Thanks
[Bug tree-optimization/94969] [8/9/10/11 Regression] Invalid loop distribution since r8-2390-gdfbddbeb1ca912c9
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94969

--- Comment #10 from bin cheng ---
Hi, should I backport this and PR95110 to the branches?

Thanks
[Bug tree-optimization/95019] Optimizer produces suboptimal code related to -ftree-ivopts
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95019

--- Comment #3 from bin cheng ---
(In reply to zhongyu...@tom.com from comment #2)
> It is a generic issue for all targets, such as x86, it also don't enpand

Yes, as said, it's because SCEV currently doesn't model this, so it's not target specific.

> IVOPTs as index is not used for DEST and Src directly. we may need expand

Yes, extending IVOPTs to handle this case (and cases from other PRs) seems promising. Anyway, a patch is welcome, and I can do the review.

Thanks,

> IVOPTs, then different targets can select different one according their Cost
> model.
> Now, it seems ok for x86 as it have load/store insns folded the lshift
> operand, so it doesn't need separate lshift operand in loop body .
>
> == base on the ARM gcc 9.2.1 on https://gcc.godbolt.org, You'll get
> separate lshift operand lsl in loop kernel, and ARM64 gcc 8.2 will use ldr
> x3, [x1, x4, lsl 3] to avoid the separate lshift operand. so we can see all
> target dont select an IV with Step 8.
> C0ADA(unsigned long long, long long*, long long*):
>         push    {r4, r5, r6, r7, lr}    @
>         mov     r4, r0                  @ len, tmp135
>         mov     r5, r1                  @ len, tmp136
>         orrs    r1, r4, r5              @ tmp137, len
>         beq     .L1                     @,
>         mov     r1, #0                  @ C05A1,
> .L3:
>         lsl     r0, r1, #3              @ _2, C05A1,
>         add     ip, r2, r1, lsl #3      @ tmp120, Src, C05A1,
>         ldr     lr, [r2, r0]            @ _4, *_3
>         ldr     ip, [ip, #4]            @ _4, *_3
>         umull   r6, r7, lr, lr          @ tmp125, _4, _4
>         mul     ip, lr, ip              @ tmp122, _4, tmp122
>         adds    r1, r1, r4              @ C05A1, C05A1, len
>         subs    r4, r4, #1              @ len, len,
>         sbc     r5, r5, #0              @ len, len,
>         add     r0, r3, r0              @ tmp121, Dest, _2
>         add     r7, r7, ip, lsl #1      @,, tmp122,
>         orrs    lr, r4, r5              @ tmp138, len
>         stm     r0, {r6-r7}             @ *_5, tmp125
>         bne     .L3                     @,
> .L1:
>         pop     {r4, r5, r6, r7, lr}    @
>         bx      lr                      @
>
> Thanks for your notice.
[Bug tree-optimization/95019] Optimizer produces suboptimal code related to -ftree-ivopts
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95019

--- Comment #1 from bin cheng ---
Please provide the exact configuration/compilation command lines in the bug report next time, which would save others time when reproducing, considering I haven't touched MIPS for years.

As for this specific issue, note that right now SCEV can't model C05A1, and thus neither DEST[C05A1] nor Src[C05A1], so there is not much IVOPTs can do in its current shape. We did discuss extending the pass to handle non-SCEV memory references in other PRs, but unless that is implemented, I see no easy fix here.

Thanks
[Bug tree-optimization/94969] [8/9/10/11 Regression] Invalid loop distribution since r8-2390-gdfbddbeb1ca912c9
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94969 --- Comment #8 from bin cheng --- Root cause is in build_classic_dist_vector -> constant_access_functions which adds unit distance vector only in case of constant access function. It should cover invariant cases. Testing a patch. Thanks
[Bug tree-optimization/94969] [8/9/10/11 Regression] Invalid loop distribution since r8-2390-gdfbddbeb1ca912c9
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94969 --- Comment #7 from bin cheng --- (In reply to Richard Biener from comment #5) > So I think the issue is not dependence testing but loop distribution > accepting a > zero dependence distance as OK. Of course dependence analysis is quite > useless > here since the accesses are to the same location in every iteration. > > Bin, maybe you can share your thoughts on this issue? > > The testcase doesn't need bitfields - those just disable the cost model > which otherwise prevents the distribution. > > diff --git a/gcc/tree-loop-distribution.c b/gcc/tree-loop-distribution.c > index 44423215332..ac272d63c3d 100644 > --- a/gcc/tree-loop-distribution.c > +++ b/gcc/tree-loop-distribution.c > @@ -2852,6 +2852,7 @@ loop_distribution::finalize_partitions (class loop > *loop, >/* Don't distribute current loop into too many loops given we don't have > memory stream cost model. Be even more conservative in case of loop > nest distribution. */ > +#if 0 >if ((same_type_p && num_builtin == 0 > && (loop->inner == NULL || num_normal != 2 || num_partial_memset != > 1)) >|| (loop->inner != NULL > @@ -2867,6 +2868,7 @@ loop_distribution::finalize_partitions (class loop > *loop, > } >partitions->truncate (1); > } > +#endif > >/* Fuse memset builtins if possible. */ >if (partitions->length () > 1) > > > makes the testcase miscompiled even with the : 7 and : 2 commented, so plain > > struct S { > signed m; > signed e; > }; I think there is something wrong in data dependence analysis, however, Richard's change just exposed it. Given below loop and data refs: for (...) { array[loop_invariant] = x; // ref1 array[loop_invariant] ^= 1; // ref2 } There are both output dependence for ref2(iteration i) -> ref1 (iteration i + 1), and for ref1(iteration i) -> ref2(iteration i). It seems to me now the first one is missing. Will dig deeper.
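For illustration only, here is a minimal sketch of why both dependences matter for the loop shape described above. It is not the testcase from the PR: the names `fused` and `distributed` are made up, and a plain `int` stands in for `array[loop_invariant]`.

```c
#include <stddef.h>

/* Original loop: both statements hit the same loop-invariant slot. */
int fused(int x, size_t n)
{
    int slot = 0;               /* stands in for array[loop_invariant] */
    for (size_t i = 0; i < n; i++) {
        slot = x;               /* ref1 */
        slot ^= 1;              /* ref2 */
    }
    return slot;
}

/* Distributed form: every ref1 runs first, then every ref2.
   This ignores the ref2 (iteration i) -> ref1 (iteration i + 1) dependence. */
int distributed(int x, size_t n)
{
    int slot = 0;
    for (size_t i = 0; i < n; i++)
        slot = x;
    for (size_t i = 0; i < n; i++)
        slot ^= 1;
    return slot;
}
```

With two iterations, the fused loop ends with `x ^ 1` in the slot, while the distributed loops toggle the low bit twice and end with `x`, so the two forms are not interchangeable.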
[Bug tree-optimization/93674] [8/9 Regression] GCC eliminates conditions it should not, when strict-enums is on
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93674 --- Comment #18 from bin cheng --- (In reply to Richard Earnshaw from comment #17) > Has not been backported yet. Will do it. Thanks
[Bug tree-optimization/94125] [9 Regression] wrong code at -O3 on x86_64-linux-gnu
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94125 --- Comment #11 from bin cheng --- (In reply to Richard Biener from comment #10) > Thanks Bin, fixed on trunk sofar. Hmm, if it's fine, I will backport this to GCC9. Thanks
[Bug tree-optimization/94125] [9/10 Regression] wrong code at -O3 on x86_64-linux-gnu
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94125 --- Comment #7 from bin cheng --- Patch at https://gcc.gnu.org/pipermail/gcc-patches/2020-March/542038.html
It's a latent bug exposed by the mentioned alias analysis change, however:

unsigned char b, f;
short d[1][8][1], *g = &d[0][3][0];

int main ()
{
  int k[] = { 0, 0, 0, 4, 0, 0 };
  for (int c = 2; c >= 0; c--)
    {
      b = f;
      *g = k[c + 3];
      k[c + 1] = 0;
    }
  for (int i = 0; i < 8; i++)
    if (d[0][i][0] != 0)
      __builtin_abort ();
  return 0;
}

We can't tell no-alias info for pairs and . Is this expected, or should it be improved? Thanks
[Bug tree-optimization/94125] [9/10 Regression] wrong code at -O3 on x86_64-linux-gnu
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94125 --- Comment #5 from bin cheng --- Thanks for CCing me, I will have a look this weekend.
[Bug tree-optimization/93674] [8/9/10 Regression] GCC eliminates conditions it should not, when strict-enums is on
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93674 bin cheng changed: What|Removed |Added Assignee|unassigned at gcc dot gnu.org |amker at gcc dot gnu.org --- Comment #13 from bin cheng --- Sorry for missing this.
[Bug tree-optimization/92244] vectorized loop updating 2 copies of the same pointer (for in-place reversal cross in the middle)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92244 --- Comment #5 from bin cheng --- The vectorizer generates the following address bases:

  _79 = (sizetype) len_6(D);
  _80 = _79 + 18446744073709551600;
  vectp.14_78 = head_7(D) + _80;
  _89 = (sizetype) len_6(D);
  _90 = _89 + 18446744073709551600;
  vectp.20_88 = head_7(D) + _90;

IVOPTs only does limited expansion of the base by calling expand_simple_operations, which is not enough for this case. Let me experiment with aggressive expansion using tree_to_aff_combination_expand. It should be able to fix this issue; however, aggressive expansion itself might cause regressions.
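As a side note on the dump above, the large constant is just -16 in unsigned 64-bit arithmetic, so both bases are `head + len - 16`. A small sketch, assuming sizetype is a 64-bit unsigned type on this target (the function name is made up for illustration):

```c
#include <stdint.h>

/* Adds the constant from the dump in 64-bit unsigned arithmetic.
   18446744073709551600 is 2^64 - 16, so this computes len - 16 modulo 2^64. */
uint64_t add_dump_constant(uint64_t len)
{
    return len + UINT64_C(18446744073709551600);
}
```

So `_80` and `_90` hold the same value; the two vector pointers only look different until the base expressions are expanded far enough to compare them.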
[Bug tree-optimization/93334] -O3 generates useless code checking for overlapping memset ?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93334 --- Comment #2 from bin cheng --- (In reply to Richard Biener from comment #1) > Confirmed. The issue is that the overlap would be an issue if the stores > were using different values like > > void test_simple_code(long l, double* mem, long ofs2) { > for (long k=0; k mem[k] = 0.0; > mem[ofs2 +k] = 1.0; > } > } > > and we're simply not optimizing the case where the write-after-write > dependence can be ignored because the stored value is always the same. > I'm also not sure whether that's easy to do ... Bin? I will check if it can be handled as a special case. Thanks.
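To make the point about the stored value concrete, here is a minimal sketch (hypothetical helper names, not code from GCC or from the PR) that runs the fused loop and the split loops on overlapping ranges and compares the results:

```c
/* Fused form, as in the quoted loop, with the stored values as parameters. */
void fused_store(long l, double *mem, long ofs2, double v1, double v2)
{
    for (long k = 0; k < l; k++) {
        mem[k] = v1;
        mem[ofs2 + k] = v2;
    }
}

/* Split form: what distributing the loop into two store loops would do. */
void split_store(long l, double *mem, long ofs2, double v1, double v2)
{
    for (long k = 0; k < l; k++)
        mem[k] = v1;
    for (long k = 0; k < l; k++)
        mem[ofs2 + k] = v2;
}

/* Returns 1 if both forms leave the same memory contents when the two
   ranges overlap (ofs2 = 1, l = 4), and 0 otherwise. */
int same_result(double v1, double v2)
{
    double a[5] = { 0 }, b[5] = { 0 };
    fused_store(4, a, 1, v1, v2);
    split_store(4, b, 1, v1, v2);
    for (int i = 0; i < 5; i++)
        if (a[i] != b[i])
            return 0;
    return 1;
}
```

When `v1` and `v2` differ, the overlap makes the two forms disagree; when they are equal, the write-after-write order no longer matters, which is the special case discussed above.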
[Bug c++/93143] [10 Regression] Multiple calls to static constexpr member function gives wrong code
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93143 --- Comment #7 from bin cheng --- (In reply to bin cheng from comment #6) > (In reply to bin cheng from comment #5) > > (In reply to Martin Sebor from comment #4) > > > *** Bug 92926 has been marked as a duplicate of this bug. *** > > > > I sent a patch fixing this a > > https://gcc.gnu.org/ml/gcc-patches/2019-12/msg00920.html > > The only question is if this one has already fixed by PR93033. > > Sorry, wrong comment. Hmm, seems my original comment is not wrong, and this issue still exists. I will update the patch.
[Bug c++/93143] [10 Regression] Multiple calls to static constexpr member function gives wrong code
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93143 --- Comment #6 from bin cheng --- (In reply to bin cheng from comment #5) > (In reply to Martin Sebor from comment #4) > > *** Bug 92926 has been marked as a duplicate of this bug. *** > > I sent a patch fixing this a > https://gcc.gnu.org/ml/gcc-patches/2019-12/msg00920.html > The only question is if this one has already fixed by PR93033. Sorry, wrong comment.
[Bug c++/93143] [10 Regression] Multiple calls to static constexpr member function gives wrong code
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93143 --- Comment #5 from bin cheng --- (In reply to Martin Sebor from comment #4) > *** Bug 92926 has been marked as a duplicate of this bug. *** I sent a patch fixing this at https://gcc.gnu.org/ml/gcc-patches/2019-12/msg00920.html The only question is whether this one has already been fixed by PR93033.
[Bug c++/92926] New: Wrong code generated because of shared tree node in gimplify
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92926 Bug ID: 92926 Summary: Wrong code generated because of shared tree node in gimplify Product: gcc Version: 10.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c++ Assignee: unassigned at gcc dot gnu.org Reporter: amker at gcc dot gnu.org Target Milestone: --- Following code is reduced from cppcoro but is irrelevant to coroutine. #include #include class ipv6_address { public: constexpr ipv6_address( std::uint16_t part0, std::uint16_t part1, std::uint16_t part2, std::uint16_t part3, std::uint16_t part4, std::uint16_t part5, std::uint16_t part6, std::uint16_t part7); static constexpr ipv6_address loopback(); std::string to_string() const; private: alignas(std::uint64_t) std::uint8_t m_bytes[16]; }; constexpr ipv6_address::ipv6_address( std::uint16_t part0, std::uint16_t part1, std::uint16_t part2, std::uint16_t part3, std::uint16_t part4, std::uint16_t part5, std::uint16_t part6, std::uint16_t part7) : m_bytes{ static_cast(part0 >> 8), static_cast(part0), static_cast(part1 >> 8), static_cast(part1), static_cast(part2 >> 8), static_cast(part2), static_cast(part3 >> 8), static_cast(part3), static_cast(part4 >> 8), static_cast(part4), static_cast(part5 >> 8), static_cast(part5), static_cast(part6 >> 8), static_cast(part6), static_cast(part7 >> 8), static_cast(part7) } {} constexpr ipv6_address ipv6_address::loopback() { return ipv6_address{ 0, 0, 0, 0, 0, 0, 0, 1 }; } char hex_char(std::uint8_t value) { return value < 10 ? 
static_cast('0' + value) : static_cast('a' + value - 10); } std::string ipv6_address::to_string() const { std::uint32_t longestZeroRunStart = 0; std::uint32_t longestZeroRunLength = 0; for (std::uint32_t i = 0; i < 8; ) { if (m_bytes[2 * i] == 0 && m_bytes[2 * i + 1] == 0) { const std::uint32_t zeroRunStart = i; ++i; while (i < 8 && m_bytes[2 * i] == 0 && m_bytes[2 * i + 1] == 0) { ++i; } std::uint32_t zeroRunLength = i - zeroRunStart; if (zeroRunLength > longestZeroRunLength) { longestZeroRunLength = zeroRunLength; longestZeroRunStart = zeroRunStart; } } else { ++i; } } char buffer[40]; char* c = [0]; auto appendPart = [&](std::uint32_t index) { const std::uint8_t highByte = m_bytes[index * 2]; const std::uint8_t lowByte = m_bytes[index * 2 + 1]; if (highByte > 0 || lowByte > 15) { if (highByte > 0) { if (highByte > 15) { *c++ = hex_char(highByte >> 4); } *c++ = hex_char(highByte & 0xF); } *c++ = hex_char(lowByte >> 4); } *c++ = hex_char(lowByte & 0xF); }; if (longestZeroRunLength >= 2) { for (std::uint32_t i = 0; i < longestZeroRunStart; ++i) { if (i > 0) { *c++ = ':'; } appendPart(i); } *c++ = ':'; *c++ = ':'; for (std::uint32_t i = longestZeroRunStart + longestZeroRunLength; i < 8; ++i) { appendPart(i); if (i < 7) { *c++ = ':'; } } } else { appendPart(0); for (std::uint32_t i = 1; i < 8; ++i) { *c++ = ':'; appendPart(i); } } assert((c - [0]) <= sizeof(buffer)); return std::string{ [0], c }; } std::string __attribute__((noinline)) foo () { return ipv6_address::loopback().to_string(); } ipv6_address __attribute__((noinline)) bar () { return ipv6_address::loopback(); } int main() { std::string s = foo (); ipv6_address a = bar (); assert(a.to_string() == s); return 0; } Compiling using following command line: $ ./g++ -std=c++17 -m64 -O3 z.cc -o a.out $ ./a.out a.out: z.cc:168: int main(): Assertion `a.to_string() == s' failed. 
Aborted

The root cause is that ipv6_address::loopback is a constexpr function: what it returns is folded into a const ctor by the C++ FE, and that const ctor is shared translation-unit wide through constexpr_call_table. As a result, the ctor as well as its vector elements are shared between foo and bar. During gimplification, CONSTRUCTOR_ELTS is optimized and cleared, which changes the shared node seen by the other function. Will send a patch for discussion.
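The sharing hazard described above can be sketched in plain C. This is only an analogy: `struct ctor`, `lower` and `elts_seen_by_second_user` are hypothetical stand-ins, not GCC data structures, and the sketch does not claim to show how the actual patch resolves the problem.

```c
#include <stddef.h>

/* Hypothetical stand-in for a cached constructor node with its elements. */
struct ctor {
    size_t n_elts;
    const unsigned char *elts;
};

static const unsigned char loopback_bytes[16] = { [15] = 1 };

/* Stand-in for a lowering step that consumes the elements and clears them. */
void lower(struct ctor *c)
{
    c->n_elts = 0;
    c->elts = NULL;
}

/* The first user lowers either the shared node or a private copy of it;
   the return value is the element count the second user then sees. */
size_t elts_seen_by_second_user(int unshare)
{
    struct ctor shared = { 16, loopback_bytes };
    struct ctor first_copy = shared;
    lower(unshare ? &first_copy : &shared);
    return shared.n_elts;
}
```

Lowering the shared node leaves the second user with an emptied constructor, while lowering a private copy leaves the cached node intact.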
[Bug middle-end/92574] Inefficient code for multidimensional array assess
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92574 --- Comment #2 from bin cheng --- Similar to https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57534 The original idea was to handle this as much as possible in ivopts, which is difficult given that the ivopts code has lots of (scev/niter) validity checks. In the aforementioned straight-line "ivopts", we only need to factor out the common part, choose an addressing mode, and rewrite the memory references. Maybe a light-weight pass could do the job using the existing ivopts facilities.
[Bug c++/85471] closing a "thread" in "C++" using "pthread_exit(NULL)" creates a "SIGABRT"
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85471 bin cheng changed: What|Removed |Added CC||amker at gcc dot gnu.org --- Comment #6 from bin cheng --- I ran into a stackoverflow entry with the following code:

#include <pthread.h>
#include <stddef.h>

static void cleanup(void *ptr)
{
}

void *child(void *ptr)
{
  pthread_cleanup_push(cleanup, NULL);
  pthread_exit(NULL);
  pthread_cleanup_pop(1);
  return NULL;
}

int main()
{
  pthread_t foo;
  pthread_create(&foo, NULL, child, NULL);
  pthread_join(foo, NULL);
  return 0;
}

The abort can be reproduced when compiled using gcc-8.3 with the following options:

$ g++ -o a.out test.cc -g -Wall -fexceptions -pthread -static-libstdc++ -static-libgcc
$ gdb --args ./a.out
(gdb) r
(gdb) bt
#0 0xbf4972c8 in raise () from /lib64/libc.so.6
#1 0xbf498940 in abort () from /lib64/libc.so.6
#2 0x0040ec94 in _Unwind_SetGR ()
#3 0x00401c4c in __gxx_personality_v0 ()
#4 0xbec3fab8 in _Unwind_ForcedUnwind_Phase2 (exc=exc@entry=0xbf462670, context=context@entry=0xbf461560, frames_p=frames_p@entry=0xbf461198) at ../../../libgcc/unwind.inc:182
#5 0xbec3fea0 in _Unwind_ForcedUnwind (exc=0xbf462670, stop=0xbf5f7950, stop_argument=0xbf461a30) at ../../../libgcc/unwind.inc:217
#6 0xbf5fa15c in _Unwind_ForcedUnwind () from /lib64/libpthread.so.0
#7 0xbf5f7aac in __pthread_unwind () from /lib64/libpthread.so.0
#8 0xbf5f1a08 in pthread_exit () from /lib64/libpthread.so.0
#9 0x00401460 in child (ptr=0x0) at test.cc:13
#10 0xbf5f0bb0 in start_thread () from /lib64/libpthread.so.0
#11 0xbf53e4c0 in thread_start () from /lib64/libc.so.6

The issue in this case is caused by -static-libgcc; I am not sure whether it is the same as the original case. Thanks
[Bug debug/90231] ivopts causes iterator in the loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90231 --- Comment #12 from bin cheng --- (In reply to Jakub Jelinek from comment #10) > Actually (int) ((ivtmp.11 - (unsigned long) dst_10) / 4), sorry. > On 64-bit targets this will never be a problem, are you worried about 32-bit > targets where int and pointers are the same width and for a loop with say up > to INT_MAX iterations ivtmp.11 would wrap around? Then dst[i] would be > invalid too. > So as long as the IVs aren't added there out of the blue sky, with larger > steps than what is really used, it shouldn't be an issue. > Or can say a loop that does: > unsigned int j = x; > for (int i = 0; i < n; i++) > { > j += 32; > use (i, j); > } > use j as unsigned int IV with step 32 replace the i int IV with step 1? If > yes, then I'd understand that (int) ((j - x) / 32) might not be correct > expression all the time, e.g. if j == x, then i might be 0, or 0x800 > etc., but (int) ((j - x) / 32) will be 0. Yes, as mentioned in #11, we need to choose the same class IV in rewriting. And reuse of existing code makes it harder, after all, I don't want to disturb existing code because of debug-stmt rewriting.
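The wrap-around concern in the quoted `j += 32` example can be checked with a few lines of C. This is a sketch with made-up function names, assuming a 32-bit unsigned IV; the iteration count 0x8000000 (2^27) is my own choice because 32 * 2^27 wraps to 0 in 32 bits.

```c
#include <stdint.h>

/* Value of j after i iterations of "j += 32", starting from x,
   in 32-bit unsigned arithmetic. */
uint32_t j_after(uint32_t x, uint32_t i)
{
    return x + UINT32_C(32) * i;
}

/* Attempt to recover i from j, in the form discussed above. */
int32_t recover_i(uint32_t j, uint32_t x)
{
    return (int32_t)((j - x) / 32);
}
```

For small iteration counts the recovery is exact, but after 2^27 iterations `j` equals `x` again and the expression yields 0, which is why the candidate used for the debug expression should come from the same class of IV as `i`.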
[Bug debug/90231] ivopts causes iterator in the loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90231 --- Comment #11 from bin cheng --- (In reply to Richard Biener from comment #9) > (In reply to bin cheng from comment #7) > > The orignal iv needs to be represented in debug bind stmt is: > > 64 IV struct: > > 65 SSA_NAME: i_18 > > 66 Type: int > > 67 Base: 0 > > 68 Step: 1 > > 69 Biv: Y > > 70 Overflowness wrto loop niter: No-overflow > > > > While the possible candidate is: > > 185 Candidate 8: > > 186 Var befor: ivtmp.11 > > 187 Var after: ivtmp.11 > > 188 Incr POS: before exit test > > 189 IV struct: > > 190 Type: unsigned long > > 191 Base: (unsigned long) dst_10(D) > > 192 Step: 4 > > 193 Object: (void *) dst_10(D) > > 194 Biv:N > > 195 Overflowness wrto loop niter: Overflow > > > > Strictly speaking, with above information, we can't compute i_18 using > > ivtmp.11 correctly in all cases, because ivtmp.11 could overflow. Of > > course, the overflow-ness in this case could be improved, thus solve the > > problem. Or there is another method: we can do the computation anyway, it > > may give wrong value in some cases, but we are in debug stmt, value which is > > correct in most cases is better than optimized away, sensible? > > Actually we do know that ivtmp.11 doesn't overflow. > > Since we can express the use of i in dst[i] by the new IV we can express > i in terms of the new IV at the point of its original use as well, I see > no way the transform isn't bijective. The complication here is just It's bijective if we can choose candidate derived from the same class of induction variables as "i", however, code rewriting debug-stmt currently selects cand using simple heuristic, it's not guaranteed cand from right class would be chosen. Also we reuse existing iv_use -> iv_cand computation code in rewriting debug-stmt. > that we have to undo the 'use in dst[i]' effect somehow, but for simple > cases or rewrite_use_* this should be doable.
[Bug debug/90231] ivopts causes iterator in the loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90231 --- Comment #7 from bin cheng --- The original iv that needs to be represented in the debug bind stmt is:

IV struct:
  SSA_NAME: i_18
  Type: int
  Base: 0
  Step: 1
  Biv: Y
  Overflowness wrto loop niter: No-overflow

While the possible candidate is:

Candidate 8:
  Var befor: ivtmp.11
  Var after: ivtmp.11
  Incr POS: before exit test
  IV struct:
    Type: unsigned long
    Base: (unsigned long) dst_10(D)
    Step: 4
    Object: (void *) dst_10(D)
    Biv: N
    Overflowness wrto loop niter: Overflow

Strictly speaking, with the above information we can't compute i_18 from ivtmp.11 correctly in all cases, because ivtmp.11 could overflow. Of course, the overflow-ness analysis for this case could be improved, which would solve the problem. Or there is another method: we can do the computation anyway. It may give a wrong value in some cases, but since we are in a debug stmt, a value that is correct in most cases is better than one that is optimized away. Sensible? Thanks, bin
[Bug tree-optimization/91775] Can eliminate compare from loop with known number of iterations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91775 --- Comment #6 from bin cheng --- The address type iv_use has pointer type and 64-bit precision, while the iv_cands added (by the ivcanon pass) have unsigned int type. So decremental candidates are skipped because of the following code:

  /* Check if we have enough precision to express the values of use.  */
  if (TYPE_PRECISION (utype) > TYPE_PRECISION (ctype))
    return infinite_cost;

Looks like better overflow-ness analysis is required here:

Candidate 6:
  Incr POS: orig biv
  IV struct:
    Type: unsigned int
    Base: 1024
    Step: 4294967295
    Biv: N
    Overflowness wrto loop niter: Overflow    <--- here.
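Two details of the dump above can be spelled out in a short sketch. The function names are hypothetical, and the second function only mirrors the shape of the quoted precision check; it is not the GCC code.

```c
#include <stdint.h>

/* One step of the candidate from the dump: a 32-bit unsigned IV with
   step 4294967295, which is a decrement by one modulo 2^32. */
uint32_t cand_step(uint32_t iv)
{
    return iv + UINT32_C(4294967295);
}

/* Same shape as the quoted check: a use that needs more bits than the
   candidate provides cannot be expressed by that candidate. */
int cand_too_narrow(unsigned use_precision, unsigned cand_precision)
{
    return use_precision > cand_precision;
}
```

Starting from base 1024 the candidate simply counts down, but because it is only 32 bits wide it is rejected for the 64-bit address use unless it can be shown not to wrap.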
[Bug rtl-optimization/91137] [7 Regression] Wrong code with -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91137 --- Comment #15 from bin cheng ---
Author: amker
Date: Mon Sep 2 10:10:44 2019
New Revision: 275304

URL: https://gcc.gnu.org/viewcvs?rev=275304&root=gcc&view=rev
Log:
Backport from mainline
2019-07-18 Bin Cheng

PR tree-optimization/91137
* tree-ssa-loop-ivopts.c (struct ivopts_data): New field.
(tree_ssa_iv_optimize_init, alloc_iv, tree_ssa_iv_optimize_finalize): Init, use and fini the above new field.
(determine_base_object_1): New function.
(determine_base_object): Reimplement using walk_tree.

2019-07-18 Bin Cheng

PR tree-optimization/91137
* gcc.c-torture/execute/pr91137.c: New test.

Added:
    branches/gcc-7-branch/gcc/testsuite/gcc.c-torture/execute/pr91137.c
Modified:
    branches/gcc-7-branch/gcc/ChangeLog
    branches/gcc-7-branch/gcc/testsuite/ChangeLog
    branches/gcc-7-branch/gcc/tree-ssa-loop-ivopts.c
[Bug rtl-optimization/91137] [7/8 Regression] Wrong code with -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91137 --- Comment #14 from bin cheng ---
Author: amker
Date: Fri Aug 30 11:02:48 2019
New Revision: 275064

URL: https://gcc.gnu.org/viewcvs?rev=275064&root=gcc&view=rev
Log:
Backport from mainline
2019-07-18 Bin Cheng

PR tree-optimization/91137
* tree-ssa-loop-ivopts.c (struct ivopts_data): New field.
(tree_ssa_iv_optimize_init, alloc_iv, tree_ssa_iv_optimize_finalize): Init, use and fini the above new field.
(determine_base_object_1): New function.
(determine_base_object): Reimplement using walk_tree.

2019-07-18 Bin Cheng

PR tree-optimization/91137
* gcc.c-torture/execute/pr91137.c: New test.

Added:
    branches/gcc-8-branch/gcc/testsuite/gcc.c-torture/execute/pr91137.c
Modified:
    branches/gcc-8-branch/gcc/ChangeLog
    branches/gcc-8-branch/gcc/testsuite/ChangeLog
    branches/gcc-8-branch/gcc/tree-ssa-loop-ivopts.c
[Bug rtl-optimization/91137] [7/8/9 Regression] Wrong code with -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91137 --- Comment #13 from bin cheng ---
Author: amker
Date: Wed Jul 24 01:28:33 2019
New Revision: 273754

URL: https://gcc.gnu.org/viewcvs?rev=273754&root=gcc&view=rev
Log:
Backport from mainline
2019-07-18 Bin Cheng

PR tree-optimization/91137
* tree-ssa-loop-ivopts.c (struct ivopts_data): New field.
(tree_ssa_iv_optimize_init, alloc_iv, tree_ssa_iv_optimize_finalize): Init, use and fini the above new field.
(determine_base_object_1): New function.
(determine_base_object): Reimplement using walk_tree.

gcc/testsuite
2019-07-18 Bin Cheng

PR tree-optimization/91137
* gcc.c-torture/execute/pr91137.c: New test.

Added:
    branches/gcc-9-branch/gcc/testsuite/gcc.c-torture/execute/pr91137.c
Modified:
    branches/gcc-9-branch/gcc/ChangeLog
    branches/gcc-9-branch/gcc/testsuite/ChangeLog
    branches/gcc-9-branch/gcc/tree-ssa-loop-ivopts.c
[Bug rtl-optimization/91137] [7/8/9/10 Regression] Wrong code with -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91137 --- Comment #11 from bin cheng --- Hi, I suppose this patch should be backported to GCC 8/7 if there are no further issues.
[Bug rtl-optimization/91137] [7/8/9/10 Regression] Wrong code with -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91137 --- Comment #10 from bin cheng ---
Author: amker
Date: Thu Jul 18 08:38:09 2019
New Revision: 273570

URL: https://gcc.gnu.org/viewcvs?rev=273570&root=gcc&view=rev
Log:
PR tree-optimization/91137
* tree-ssa-loop-ivopts.c (struct ivopts_data): New field.
(tree_ssa_iv_optimize_init, alloc_iv, tree_ssa_iv_optimize_finalize): Init, use and fini the above new field.
(determine_base_object_1): New function.
(determine_base_object): Reimplement using walk_tree.

gcc/testsuite
PR tree-optimization/91137
* gcc.c-torture/execute/pr91137.c: New test.

Added:
    trunk/gcc/testsuite/gcc.c-torture/execute/pr91137.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-ssa-loop-ivopts.c
[Bug rtl-optimization/91137] [7/8/9/10 Regression] Wrong code with -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91137 --- Comment #8 from bin cheng --- (In reply to rguent...@suse.de from comment #7) > On Mon, 15 Jul 2019, amker at gcc dot gnu.org wrote: > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91137 > > > > --- Comment #6 from bin cheng --- > > (In reply to Richard Biener from comment #2) > > > > > > and I can very well imagine we're getting confused by find_base_term > > > logic here. > > > > > > There's logic in IVOPTs to not generate IVs based on two different > > > objects but somehow it doesn't trigger here. > > > > Hmm, it's because determine_base_object failed to identify the `base_object` > > for IV because it has non-pointer type: > > IV struct: > > SSA_NAME: _32 > > Type: unsigned long > > Base: (unsigned long) + 19600 > > Step: 4 > > Biv: N > > Overflowness wrto loop niter: Overflow > > > > And we have short-circuit in determine_base_object: > > > > static tree > > determine_base_object (tree expr) > > { > > enum tree_code code = TREE_CODE (expr); > > tree base, obj; > > > > /* If this is a pointer casted to any type, we need to determine > > the base object for the pointer; so handle conversions before > > throwing away non-pointer expressions. */ > > if (CONVERT_EXPR_P (expr)) > > return determine_base_object (TREE_OPERAND (expr, 0)); > > > > if (!POINTER_TYPE_P (TREE_TYPE (expr))) > > return NULL_TREE; > > > > The IV is generated from inner loop ivopts as we rewrite using unsigned > > type. > > > > Any suggestion how to fix this? > > I think we need to elide this check and make the following code > more powerful which includes actually handling PLUS/MINUS_EXPR. > There's also ptr + (unsigned) to be considered for the > POINTER_PLUS_EXPR case - thus we cannot simply only search the > pointer chain. > > Which then leads us into exponential behavior if this is asked > on SCEV results which may have tree sharing, thus we need some > 'visited' machinery. In the end I think we should re-do Will work on a patch. 
One thing I am unclear about is the ptr + (unsigned) stuff, when would we have this? Could you provide an example please? > it like I re-did contains_abnormal_ssa_name_p, use > walk_tree_without_duplicates. Btw, what should happen if the > walk discovers two bases are used in the expression? I guess it depends on why there are multiple bases in the first place?
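To illustrate the difference between the early-exit lookup and a full walk, here is a small self-contained sketch. Everything in it is hypothetical: the node kinds, `struct expr`, the object name "obj" and both functions are toy stand-ins rather than GCC trees, and the sketch leaves out the 'visited' bookkeeping and the two-bases question raised above (it just returns the first base it finds).

```c
#include <stddef.h>

enum kind { ADDR_OF, INT_CST, CONVERT, PLUS };  /* toy node kinds */

struct expr {
    enum kind kind;
    int is_pointer;                 /* does this node have pointer type? */
    const char *object;             /* for ADDR_OF: the object's name */
    const struct expr *op0, *op1;
};

/* Early-exit variant, shaped like the quoted code: conversions are looked
   through, but any other non-pointer expression is given up on. */
const char *base_early_exit(const struct expr *e)
{
    if (e->kind == CONVERT)
        return base_early_exit(e->op0);
    if (!e->is_pointer)
        return NULL;
    if (e->kind == ADDR_OF)
        return e->object;
    if (e->kind == PLUS)
        return base_early_exit(e->op0);
    return NULL;
}

/* Full-walk variant: visits all operands regardless of their type. */
const char *base_full_walk(const struct expr *e)
{
    const char *b;
    switch (e->kind) {
    case ADDR_OF:
        return e->object;
    case CONVERT:
        return base_full_walk(e->op0);
    case PLUS:
        b = base_full_walk(e->op0);
        return b ? b : base_full_walk(e->op1);
    default:
        return NULL;
    }
}

/* An expression of the form (unsigned long) &obj + 19600. */
static const struct expr obj_addr = { ADDR_OF, 1, "obj", NULL, NULL };
static const struct expr as_ulong = { CONVERT, 0, NULL, &obj_addr, NULL };
static const struct expr offset   = { INT_CST, 0, NULL, NULL, NULL };
const struct expr iv_base = { PLUS, 0, NULL, &as_ulong, &offset };
```

On `iv_base`, the early-exit variant stops at the non-pointer PLUS and finds nothing, whereas the full walk reaches the address underneath the conversion.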
[Bug rtl-optimization/91137] [7/8/9/10 Regression] Wrong code with -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91137 --- Comment #6 from bin cheng --- (In reply to Richard Biener from comment #2)
> and I can very well imagine we're getting confused by find_base_term
> logic here.
>
> There's logic in IVOPTs to not generate IVs based on two different
> objects but somehow it doesn't trigger here.

Hmm, it's because determine_base_object failed to identify the `base_object` for the IV, since it has non-pointer type:

IV struct:
  SSA_NAME: _32
  Type: unsigned long
  Base: (unsigned long) + 19600
  Step: 4
  Biv: N
  Overflowness wrto loop niter: Overflow

And we have a short-circuit in determine_base_object:

static tree
determine_base_object (tree expr)
{
  enum tree_code code = TREE_CODE (expr);
  tree base, obj;

  /* If this is a pointer casted to any type, we need to determine
     the base object for the pointer; so handle conversions before
     throwing away non-pointer expressions.  */
  if (CONVERT_EXPR_P (expr))
    return determine_base_object (TREE_OPERAND (expr, 0));

  if (!POINTER_TYPE_P (TREE_TYPE (expr)))
    return NULL_TREE;

The IV is generated by ivopts on the inner loop, as we rewrite using unsigned type.

Any suggestion how to fix this?
[Bug rtl-optimization/91137] [7/8/9/10 Regression] Wrong code with -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91137 --- Comment #5 from bin cheng --- I will try to find some time this weekend, sorry for the delay.
[Bug tree-optimization/90240] [10 Regression] ICE in try_improve_iv_set, at tree-ssa-loop-ivopts.c:6694
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90240 --- Comment #12 from bin cheng --- (In reply to Richard Biener from comment #11) > Is this now fixed? Yes, fixed on trunk. The only question is whether it should be backported to GCC 9.
[Bug tree-optimization/57534] [7/8/9/10 Regression]: Performance regression versus 4.7.3, 4.8.1 is ~15% slower
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57534 --- Comment #34 from bin cheng --- So we could have three different addressing modes here.

1. What we have now:

        leaq    0(,%rbp,8), %rax
        movsd   8(%rbx,%rax), %xmm0
        addsd   (%rbx,%rbp,8), %xmm0
        addq    $8, %rbp
        addsd   16(%rbx,%rax), %xmm0
        addsd   24(%rbx,%rax), %xmm0
        addsd   %xmm0, %xmm1
        movsd   32(%rbx,%rax), %xmm0
        addsd   40(%rbx,%rax), %xmm0
        addsd   48(%rbx,%rax), %xmm0
        addsd   56(%rbx,%rax), %xmm0
        addsd   %xmm0, %xmm2
        cmpq    %rsi, %rbp

2. GCC-4.7:

        fldl    (%esi,%ebx,8)
        lea     0x8(%ebx),%eax
        faddl   0x8(%esi,%ebx,8)
        cmp     %eax,%edi
        faddl   0x10(%esi,%ebx,8)
        faddl   0x18(%esi,%ebx,8)
        faddp   %st,%st(2)
        fldl    0x20(%esi,%ebx,8)
        faddl   0x28(%esi,%ebx,8)
        faddl   0x30(%esi,%ebx,8)
        faddl   0x38(%esi,%ebx,8)
        faddp   %st,%st(1)

3. With slsr change:

        leaq    0(%rbp,%rbx,8), %rax
        addq    $8, %rbx
        movsd   (%rax), %xmm0
        addsd   8(%rax), %xmm0
        addsd   16(%rax), %xmm0
        addsd   24(%rax), %xmm0
        addsd   %xmm0, %xmm1
        movsd   32(%rax), %xmm0
        addsd   40(%rax), %xmm0
        addsd   48(%rax), %xmm0
        addsd   56(%rax), %xmm0
        addsd   %xmm0, %xmm2
        cmpq    %rsi, %rbx

It was reported that 2. is better than 1., and Jeff recommended 3. What I don't understand is:
A) Why is 2. better than 1.? It seems to have more computation in the addresses.
B) Is 3. the best one? It has the simplest addressing mode, but does require one additional lea because of strength reduction.

Thanks.
[Bug tree-optimization/90078] [7/8 Regression] ICE with deep templates caused by overflow
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90078 --- Comment #14 from bin cheng ---
Author: amker
Date: Wed May 8 11:37:45 2019
New Revision: 271008

URL: https://gcc.gnu.org/viewcvs?rev=271008&root=gcc&view=rev
Log:
PR tree-optimization/90078
* tree-ssa-loop-ivopts.c (INFTY): Increase value for infinite cost.
(struct comp_cost): Promote type of members to int64_t.
(infinite_cost): Don't set complexity in initialization.
(comp_cost::operator +,-,+=,-+,/=,*=): Assert when cost computation overflows to infinite_cost.
(adjust_setup_cost): Promote type of parameter and cost computation to int64_t.
(struct ainc_cost_data, struct iv_ca): Promote type of member to int64_t.
(get_scaled_computation_cost_at, determine_iv_cost): Promote type of cost computation to int64_t.
(determine_group_iv_costs, iv_ca_dump, find_optimal_iv_set): Use int64_t's format specifier in dump.

gcc/testsuite
* g++.dg/tree-ssa/pr90078.C: New test.

Added:
    trunk/gcc/testsuite/g++.dg/tree-ssa/pr90078.C
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-ssa-loop-ivopts.c
[Bug tree-optimization/90240] [10 Regression] ICE in try_improve_iv_set, at tree-ssa-loop-ivopts.c:6694
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90240 --- Comment #10 from bin cheng ---
Author: amker
Date: Wed May 8 11:24:38 2019
New Revision: 271007

URL: https://gcc.gnu.org/viewcvs?rev=271007&root=gcc&view=rev
Log:
PR tree-optimization/90240
* tree-ssa-loop-ivopts.c (get_scaled_computation_cost_at): Scale cost with respect to scaling factor pre-computed for each basic block.
(try_improve_iv_set): Return bool if best_cost equals to iv_ca cost.
(find_optimal_iv_set_1): Free iv_ca set if it has infinite_cost.
(COST_SCALING_FACTOR_BOUND, determine_scaling_factor): New.
(tree_ssa_iv_optimize_loop): Call determine_scaling_factor. Extend live range for array of loop's basic blocks. Cleanup aux field of loop's basic blocks.

gcc/testsuite
* gfortran.dg/graphite/pr90240.f: New test.

Added:
    trunk/gcc/testsuite/gfortran.dg/graphite/pr90240.f
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-ssa-loop-ivopts.c
[Bug tree-optimization/57534] [7/8/9/10 Regression]: Performance regression versus 4.7.3, 4.8.1 is ~15% slower
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57534 bin cheng changed: What|Removed |Added Assignee|unassigned at gcc dot gnu.org |amker at gcc dot gnu.org --- Comment #33 from bin cheng --- Came back to this one.

void timer_stop();
volatile long keepgoing = 0;

double hand_benchmark_cache_ronly( double *x, long limit, long *oloops, double *ous)
{
  long index = 0, loops = 0;
  double sum = (double)0;
  double sum2 = (double)0;
again:
  sum += x[index] + x[index+1] + x[index+2] + x[index+3];
  sum2 += x[index+4] + x[index+5] + x[index+6] + x[index+7];
  if ((index += 8) < limit)
    goto again;
  else if (keepgoing)
    {
      index = 0;
      goto again;
    }
  timer_stop();
  x[0] = (double)sum + (double)sum2;
  x[1] = (double)index;
}

The ideal fix for the above test would be identifying the first goto as a loop, so that IVOPTs can do strength reduction on the address ivs. While for the case below:

int ind;
int cond(void);

double hand_benchmark_cache_ronly( double *x)
{
  double sum=0.0;
  while (cond())
    sum += x[ind] + x[ind+1] + x[ind+2] + x[ind+3];
  return sum;
}

It's hard to handle in IVOPTs, because neither niter nor scev analysis succeeds. The IVOPTs implementation is centered on induction variables, so it would be a non-trivial change to support such a case. However, I wondered why we missed slsr in the previous analysis? It's designed to strength reduce such code. Quoting from its comment:

   Specifically, we are interested in references for which
   get_inner_reference returns a base address, offset, and bitpos as
   follows:

     base:    MEM_REF (T1, C1)
     offset:  MULT_EXPR (PLUS_EXPR (T2, C2), C3)
     bitpos:  C4 * BITS_PER_UNIT

   Here T1 and T2 are arbitrary trees, and C1, C2, C3, C4 are
   arbitrary integer constants.  Note that C2 may be zero, in which
   case the offset will be MULT_EXPR (T2, C3).

   When this pattern is recognized, the original memory reference
   can be replaced with:

     MEM_REF (POINTER_PLUS_EXPR (T1, MULT_EXPR (T2, C3)),
              C1 + (C2 * C3) + C4)

It explicitly states that addresses here should be tracked, associated and reduced as we wanted: (X + index * 8) + const_offset_x. I think it's a missed address slsr optimization, i.e., it clearly failed to identify a CAND_REF candidate for the memory reference. After looking into the code, I think the problem is in slsr_process_ref and restructure_reference. Trying to see if I can fix this...
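The restructuring quoted from the slsr comment can be checked numerically. Below is a rough sketch with hypothetical function names, taking C1 = 0 and C4 = 0 and using C3 = sizeof(double) as in the `x[index + c]` references above; it only models the address arithmetic, not the pass itself.

```c
#include <stdint.h>
#include <stddef.h>

/* Address as written in the source: T1 + (T2 + C2) * C3. */
uintptr_t addr_original(uintptr_t t1, long t2, long c2)
{
    return t1 + (uintptr_t)((t2 + c2) * (long)sizeof(double));
}

/* Restructured form: a shared part T1 + T2 * C3 plus the constant C2 * C3. */
uintptr_t addr_restructured(uintptr_t t1, long t2, long c2)
{
    uintptr_t shared = t1 + (uintptr_t)(t2 * (long)sizeof(double));
    return shared + (uintptr_t)(c2 * (long)sizeof(double));
}
```

Since `shared` is the same for `x[index]` through `x[index+7]`, it only needs to be computed once per iteration, and each reference then differs by a constant offset.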
[Bug tree-optimization/90078] [7/8 Regression] ICE with deep templates caused by overflow
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90078 --- Comment #13 from bin cheng --- Reverted 270500 on trunk too for easier backport to GCC9.
[Bug tree-optimization/90078] [7/8 Regression] ICE with deep templates caused by overflow
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90078 --- Comment #12 from bin cheng ---
Author: amker
Date: Tue Apr 30 03:00:59 2019
New Revision: 270673

URL: https://gcc.gnu.org/viewcvs?rev=270673&root=gcc&view=rev
Log:
        PR tree-optimization/90240
        Revert:
        2019-04-23  Bin Cheng

        PR tree-optimization/90078
        * tree-ssa-loop-ivopts.c (comp_cost::operator +,-,+=,-+,/=,*=): Add
        checks for infinite_cost overflow.

        * gcc/testsuite/g++.dg/tree-ssa/pr90078.C: New test.

Removed:
    trunk/gcc/testsuite/g++.dg/tree-ssa/pr90078.C
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-ssa-loop-ivopts.c
[Bug tree-optimization/90240] [10 Regression] ICE in try_improve_iv_set, at tree-ssa-loop-ivopts.c:6694
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90240 --- Comment #9 from bin cheng ---
Author: amker
Date: Tue Apr 30 03:00:59 2019
New Revision: 270673

URL: https://gcc.gnu.org/viewcvs?rev=270673&root=gcc&view=rev
Log:
        PR tree-optimization/90240
        Revert:
        2019-04-23  Bin Cheng

        PR tree-optimization/90078
        * tree-ssa-loop-ivopts.c (comp_cost::operator +,-,+=,-+,/=,*=): Add
        checks for infinite_cost overflow.

        * gcc/testsuite/g++.dg/tree-ssa/pr90078.C: New test.

Removed:
    trunk/gcc/testsuite/g++.dg/tree-ssa/pr90078.C
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-ssa-loop-ivopts.c
[Bug tree-optimization/90270] [8/9/10 Regression] Do not select best induction variable optimization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90270 --- Comment #11 from bin cheng ---
For the record, this test reveals another issue: the original iv candidate is not considered:

Group 0:
  Type: REFERENCE ADDRESS
  Use 0.0:
    At stmt: _1 = final_counts[i_21];
    At pos: final_counts[i_21]
    IV struct:
      Type: unsigned int *
      Base: (unsigned int *) _counts
      Step: 4
      Object: (void *) _counts
      Biv: N
      Overflowness wrto loop niter: Overflow

Candidate 7:
  Incr POS: orig biv
  IV struct:
    Type: unsigned int
    Base: 0
    Step: 1
    Biv: N
    Overflowness wrto loop niter: No-overflow

:
Group 0:
  cand  cost  compl.  inv.expr.  inv.vars
  1     9     2       NIL;       NIL;
  6     2     2       1;         NIL;
  8     0     0       NIL;       NIL;
  10    9     1       NIL;       NIL;

Group 1:
  cand  cost  compl.  inv.expr.  inv.vars
  1     9     2       NIL;       NIL;
  6     2     2       2;         NIL;
  9     0     0       NIL;       NIL;
  10    9     1       NIL;       NIL;
[Bug tree-optimization/90270] [8/9/10 Regression] Do not select best induction variable optimization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90270 --- Comment #10 from bin cheng ---
(In reply to Richard Biener from comment #9)
> (In reply to bin cheng from comment #7)
> > Also, when calling move_fixed_address_to_symbol, fixed_address_object_p
> > looks too restricted, it only considers link time constant address. In this
> > case, it's an array object in stack.
>
> But this is because a stack access isn't $reloc but $sp + offset and thus
> _not_ a symbol.

From ivopts/loop's point of view, the address ($sp + offset) is loaded into a register, and the register is then used to address elements in the array. In other words, it doesn't really matter whether the address is global and determined by the linker, or local and determined by the stack frame.

> But as you noticed IVOPTs computing TARGET_MEM_REF so "early" is a bit
> brittle due to later eventual forwardings. And those forwardings are
> hard to avoid because they affect fundamental predicates like
> may_propagate_copy where we decide early whether we can propagate into
> all uses before actually visiting them.

Can we avoid propagating into a TARGET_MEM_REF if doing so creates an invalid addressing mode? IIUC, passes that create TARGET_MEM_REF (like ivopts, slsr) do generate "correct" addressing modes, so it doesn't make much sense to create invalid ones afterwards.
[Bug tree-optimization/90270] [8/9/10 Regression] Do not select best induction variable optimization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90270 --- Comment #7 from bin cheng --- Also, when calling move_fixed_address_to_symbol, fixed_address_object_p looks too restrictive: it only considers link-time constant addresses. In this case, it's an array object on the stack.
[Bug tree-optimization/90270] [8/9/10 Regression] Do not select best induction variable optimization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90270 --- Comment #6 from bin cheng ---
(In reply to Andrew Pinski from comment #5)
> (In reply to bin cheng from comment #4)
> > On AArch64, ivopts generates following code:
> >   [local count: 954449108]:
> >   # crc_20 = PHI
> >   # ivtmp.5_18 = PHI <1(2), ivtmp.5_17(5)>
> >   _19 = _counts + 18446744073709551612;
> >   _1 = MEM[base: _19, index: ivtmp.5_18, step: 4, offset: 0B];
> >   crc_10 = crcu32 (_1, crc_20);
> >   _5 = _counts + 18446744073709551612;
>
> I thought we had decided _counts + 18446744073709551612 would be
> invalid gimple anyways as we are taking the address of one element before.

Could you direct me to the discussion about this decision? I remember raising this question once (probably in private). In this case, we need to revise ivopts to avoid adding candidates which could violate this. Anyway, it's an independent issue, because the iv_cand could be shifted forward by one element as:

> >   [local count: 954449108]:
> >   # crc_20 = PHI
> >   # ivtmp.5_18 = PHI <0(2), ivtmp.5_17(5)>
> >   _19 = _counts;
> >   _1 = MEM[base: _19, index: ivtmp.5_18, step: 4, offset: 0B];
> >   crc_10 = crcu32 (_1, crc_20);
> >   _5 = _counts;

Unfortunately, the cost computation still has problems that prevent it from generating this code.
[Bug tree-optimization/90270] [8/9/10 Regression] Do not select best induction variable optimization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90270 --- Comment #4 from bin cheng ---
On AArch64, ivopts generates the following code:

  [local count: 954449108]:
  # crc_20 = PHI
  # ivtmp.5_18 = PHI <1(2), ivtmp.5_17(5)>
  _19 = _counts + 18446744073709551612;
  _1 = MEM[base: _19, index: ivtmp.5_18, step: 4, offset: 0B];
  crc_10 = crcu32 (_1, crc_20);
  _5 = _counts + 18446744073709551612;
  _2 = MEM[base: _5, index: ivtmp.5_18, step: 4, offset: 0B];
  crc_12 = crcu32 (_2, crc_10);
  ivtmp.5_17 = ivtmp.5_18 + 1;
  if (ivtmp.5_17 != 9)
    goto ; [87.50%]
  else
    goto ; [12.50%]

This looks optimal to me if _19/_5 can be hoisted out of the loop, and it is intended to be hoisted by rtl liv. (TREE liv doesn't help much, that's another story.) The problem is in the dom3 pass: in cprop_operand, _19/_5 is propagated into the memory access even though this creates an invalid addressing mode on AArch64:

  [[(void *)_counts + -4B], [(void *)_counts + -4B]] EQUIVALENCES: { _19 } (1 elements)
  Optimizing statement _1 = MEM[base: _19, index: ivtmp.5_18, step: 4, offset: 0B];
    Replaced '_19' with constant '[(void *)_counts + -4B]'
    Folded to: _1 = MEM[symbol: final_counts, index: ivtmp.5_18, step: 4, offset: -4B];
  LKUP STMT _1 = MEM[symbol: final_counts, index: ivtmp.5_18, step: 4, offset: -4B] with .MEM_22
  2>>> STMT _1 = MEM[symbol: final_counts, index: ivtmp.5_18, step: 4, offset: -4B] with .MEM_22

It is kept in this form until the end of GIMPLE, then badly legitimized. So ivopts worked hard to get the addressing mode and invariant expression right in this case; we need to avoid premature transformations afterwards.

BTW, with dom disabled by -fno-tree-dominator-opts, vrp2 does the same transformation too, so -fno-tree-vrp is also necessary to get the optimal code.

Well, you can argue that [base + iv << 2] is sub-optimal compared to [base + iv], but that's hard to tune. Also, a bias towards the original IV is in general preferred for reasons like smaller setup code, better debug info, and even performance in complicated loops.
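As a side note on reading the dump: the constant 18446744073709551612 in `_counts + 18446744073709551612` is a 64-bit unsigned pointer offset, and the later dump lines print the same value as `-4B`. A minimal sketch of that conversion follows; the helper names are hypothetical, and the 4-byte element size is taken from the `step: 4` in the dump.

```cpp
#include <cstdint>

// Hypothetical helper: interpret a 64-bit unsigned offset as a signed
// byte offset, written so that it does not rely on implementation-defined
// unsigned-to-signed conversion.
constexpr std::int64_t signed_byte_offset(std::uint64_t raw) {
  return raw <= static_cast<std::uint64_t>(INT64_MAX)
             ? static_cast<std::int64_t>(raw)
             : -static_cast<std::int64_t>(~raw) - 1;
}

// Hypothetical helper: express the byte offset in array elements.
constexpr std::int64_t element_offset(std::uint64_t raw, std::int64_t elem_size) {
  return signed_byte_offset(raw) / elem_size;
}
```

With the value from the dump, the byte offset is -4 and the element offset is -1, i.e. the base points one element before the array, which is why the induction variable starts at 1 rather than 0.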
[Bug tree-optimization/90240] [10 Regression] ICE in try_improve_iv_set, at tree-ssa-loop-ivopts.c:6694
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90240 --- Comment #8 from bin cheng --- Patch proposed at: https://gcc.gnu.org/ml/gcc-patches/2019-04/msg01101.html
[Bug tree-optimization/90240] [9 Regression] ICE in try_improve_iv_set, at tree-ssa-loop-ivopts.c:6694
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90240 --- Comment #4 from bin cheng ---
(In reply to Jakub Jelinek from comment #3)
> Graphite, so IMHO not a release blocker.

But the issue is critical: it could happen at general optimization levels for a loop nest with a huge scaling factor. find_optimal_iv_set_1 first chooses a candidate set, then makes various attempts at cost descent by modifying the candidate set. The facts are:

1) The algorithm uses a global variable of the following structure and keeps track of the cost in place during computation.

    struct iv_ca
    {
      //...
      comp_cost cand_use_cost;
      //...
      comp_cost cost;
    };

2) The algorithm is heuristic, so it's possible to reach an intermediate state with a higher cost.
3) As in the previous comment, a loop nest with a huge scaling factor can easily result in infinite_cost.
4) Once iv_ca.{cand_use_cost, cost} in the global variable reaches infinite_cost, an ICE is the best thing that could happen. We could replace the gcc_assert with an algorithm failure and then give up on ivopts, but IMHO that would miss quite a lot of optimizations.

The conclusion: the candidate choosing algorithm doesn't work well with infinite_cost. I don't know how to fix this trivially. For now, even restricting the scaling factor is a practical change. I will give it a try.
[Bug tree-optimization/90240] [9 Regression] ICE in try_improve_iv_set, at tree-ssa-loop-ivopts.c:6694
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90240 --- Comment #2 from bin cheng ---
Also, the cost in the inner loop is scaled by a big number:

  Scaling cost based on bb prob by 1.00: 0 (scratch: 0) -> 0 (1/1)
  Scaling cost based on bb prob by 1.00: 32 (scratch: 0) -> 32 (1/1)
  Scaling cost based on bb prob by 1.00: 41 (scratch: 0) -> 41 (1/1)
  Scaling cost based on bb prob by 1.00: 21 (scratch: 0) -> 21 (1/1)
  Scaling cost based on bb prob by 1.00: 45 (scratch: 0) -> 45 (1/1)
  Scaling cost based on bb prob by 1.00: 21 (scratch: 0) -> 21 (1/1)
  Scaling cost based on bb prob by 1.00: 17 (scratch: 0) -> 17 (1/1)

Resulting:

Group 19:
  cand  cost  compl.  inv.expr.  inv.vars
  1     41    0       NIL;       1, 4
  2     21    0       NIL;       4
  3     45    0       NIL;       1, 4
  4     21    0       NIL;       4
  5     17    0       35;        NIL;
  30    0     0       NIL;       NIL;
  67    32    0       NIL;       1, 4

Given that we have 70 groups of iv_use, this easily overflows infinite_cost, which is 10,000,000. One thing that is unclear is that the overflow happens in the middle of the candidate choosing algorithm.
[Bug tree-optimization/90240] [9 Regression] ICE in try_improve_iv_set, at tree-ssa-loop-ivopts.c:6694
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90240 bin cheng changed: What|Removed |Added Assignee|unassigned at gcc dot gnu.org |amker at gcc dot gnu.org --- Comment #1 from bin cheng --- probably something with recent changes on comp_cost::operators
[Bug debug/90231] ivopts causes iterator in the loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90231 bin cheng changed: What|Removed |Added Assignee|unassigned at gcc dot gnu.org |amker at gcc dot gnu.org --- Comment #5 from bin cheng --- I will try to fix it for GCC10. Thanks
[Bug tree-optimization/90021] [9 Regression] ICE in index_in_loop_nest, at tree-data-ref.h:587 since r270203
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90021 --- Comment #5 from bin cheng ---
(In reply to Jakub Jelinek from comment #4)
> From what I can see, a fix for this has been acked 11 days ago:
> https://gcc.gnu.org/ml/gcc-patches/2019-04/msg00413.html
> Bin, are you going to commit it?

I just committed it. There was a typo in the PR number of the ChangeLog entry, so this PR was not updated. For the record, it's https://gcc.gnu.org/viewcvs/gcc?view=revision&revision=270499
[Bug tree-optimization/90078] [7/8/9 Regression] ICE with deep templates caused by overflow [PATCH]
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90078 --- Comment #9 from bin cheng ---
Author: amker
Date: Tue Apr 23 04:07:46 2019
New Revision: 270500

URL: https://gcc.gnu.org/viewcvs?rev=270500&root=gcc&view=rev
Log:
        PR tree-optimization/90078
        * tree-ssa-loop-ivopts.c (comp_cost::operator +,-,+=,-+,/=,*=): Add
        checks for infinite_cost overflow.

        gcc/testsuite
        * gcc/testsuite/g++.dg/tree-ssa/pr90078.C: New test.

        Also fix typo in ChangeLog entry for revision 270499.

Added:
    trunk/gcc/testsuite/g++.dg/tree-ssa/pr90078.C
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-ssa-loop-ivopts.c
[Bug testsuite/86153] [8 regression] test case g++.dg/pr83239.C fails starting with r261585
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86153 bin cheng changed: What|Removed |Added CC||amker at gcc dot gnu.org --- Comment #16 from bin cheng --- Should this be backported to GCC8?
[Bug c++/90078] [7/8/9 Regression] ICE with deep templates caused by overflow [PATCH]
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90078 --- Comment #6 from bin cheng ---
(In reply to Martin Liška from comment #5)
> (In reply to bin cheng from comment #4)
> > In get_scaled_computation_cost_at, we have very big ratio between
> > bb_count/loop_count:
> >
> > (gdb) p data->current_loop->latch->count
> > $50 = {static n_bits = 61, static max_count = 2305843009213693950, static
> > uninitialized_count = 2305843009213693951, m_val = 158483, m_quality =
> > profile_guessed_local}
> > (gdb) p gimple_bb(at)->count
> > $51 = {static n_bits = 61, static max_count = 2305843009213693950, static
> > uninitialized_count = 2305843009213693951, m_val = 1569139790, m_quality =
> > profile_guessed_local}
> > (gdb) p 1569139790 / 158483
> > $52 = 9900
> > (gdb) p cost
> > $53 = {cost = 20, complexity = 2, scratch = 1}
> > (gdb) p 19 * 9900
> > $54 = 188100
> >
> > as a result, sum_cost soon reaches to overflow of infinite_cost. Shall we
> > cap the ratio so that it doesn't grow too quick? Of course, some benchmark
> > data is needed for this heuristic tuning.
>
> I would implement the capping in comp_cost struct where each individual
> operator
> can cap to infinite. What do you think Bin?

Implementing the capping to infinite_cost in comp_cost::operators is less invasive. OTOH, capping bb_freq/loop_freq has its own advantages: once a cost reaches infinite_cost, it becomes meaningless for comparison as well as for candidate choosing, whereas capping bb_freq/loop_freq can still express the hotness of the code to some extent. Let's fix the issue by capping comp_cost::operators first for this stage 4, and revisit the idea of capping bb_freq/loop_freq with more benchmark data in the next Stage 1. How about that?

Thanks.
> > Another problem is the generated binary has segment fault issue even
> > compiled O0:
> >
> > $ ./g++ -O0 pr90078.cc -o a.out -ftemplate-depth=100 -ftime-report -g
> > -std=c++14
> > $ gdb --args ./a.out
> >
> > Dump of assembler code for function main():
> >    0x00400572 <+0>:  push %rbp
> >    0x00400573 <+1>:  mov %rsp,%rbp
> >    0x00400576 <+4>:  sub $0x2625a020,%rsp
> >    0x0040057d <+11>: lea -0x2625a020(%rbp),%rax
> >    0x00400584 <+18>: mov %rax,%rdi
> > => 0x00400587 <+21>: callq 0x4006c0 > 100, 100>::Tensor4()>
> >    0x0040058c <+26>: lea -0x4c4b410(%rbp),%rax
> >    0x00400593 <+33>: lea -0xe4e1c10(%rbp),%rdx
> >
> > The segment fault happens at the callq instruction.
>
> Yes, same happens also for clang. It's a stack overflow:
>
> $ g++ pr90078.cpp -ftemplate-depth=111 -fsanitize=address && ./a.out
> AddressSanitizer:DEADLYSIGNAL
> =
> ==5750==ERROR: AddressSanitizer: stack-overflow on address 0x7fffd9da3af0
> (pc 0x004011cb bp 0x7fffdc60 sp 0x7fffd9da3af0 T0)
> #0 0x4011ca in main (/home/marxin/Programming/testcases/a.out+0x4011ca)
> #1 0x76d32b7a in __libc_start_main ../csu/libc-start.c:308
> #2 0x401109 in _start (/home/marxin/Programming/testcases/a.out+0x401109)
>
> SUMMARY: AddressSanitizer: stack-overflow
> (/home/marxin/Programming/testcases/a.out+0x4011ca) in main
> ==5750==ABORTING
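The operator-level capping discussed in this comment can be illustrated with a small saturating-arithmetic sketch. This is not the code from tree-ssa-loop-ivopts.c: the type `SketchCost`, the constant `kInfiniteCost` and both functions are hypothetical stand-ins, and the limit of 10,000,000 is the infinite_cost value quoted in PR90240.

```cpp
#include <cstdint>

// Hypothetical stand-in for infinite_cost (value quoted in PR90240).
constexpr std::int64_t kInfiniteCost = 10000000;

// Hypothetical, much simplified stand-in for comp_cost.
struct SketchCost {
  std::int64_t cost;
};

constexpr bool is_infinite(SketchCost c) { return c.cost >= kInfiniteCost; }

// Addition that saturates at kInfiniteCost instead of growing past it.
constexpr SketchCost add_capped(SketchCost a, SketchCost b) {
  if (is_infinite(a) || is_infinite(b))
    return SketchCost{kInfiniteCost};
  std::int64_t sum = a.cost + b.cost;
  return SketchCost{sum >= kInfiniteCost ? kInfiniteCost : sum};
}

// Scaling that saturates as well; assumes factor > 0.
constexpr SketchCost scale_capped(SketchCost a, std::int64_t factor) {
  if (is_infinite(a) || a.cost > kInfiniteCost / factor)
    return SketchCost{kInfiniteCost};
  return SketchCost{a.cost * factor};
}
```

The sketch also shows the drawback raised in the reply: once two different costs have both saturated, they compare equal, so the ordering between them is lost for candidate selection.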
[Bug c++/90078] [7/8/9 Regression] ICE with deep templates caused by overflow [PATCH]
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90078 --- Comment #4 from bin cheng ---
In get_scaled_computation_cost_at, we have a very big bb_count/loop_count ratio:

(gdb) p data->current_loop->latch->count
$50 = {static n_bits = 61, static max_count = 2305843009213693950, static uninitialized_count = 2305843009213693951, m_val = 158483, m_quality = profile_guessed_local}
(gdb) p gimple_bb(at)->count
$51 = {static n_bits = 61, static max_count = 2305843009213693950, static uninitialized_count = 2305843009213693951, m_val = 1569139790, m_quality = profile_guessed_local}
(gdb) p 1569139790 / 158483
$52 = 9900
(gdb) p cost
$53 = {cost = 20, complexity = 2, scratch = 1}
(gdb) p 19 * 9900
$54 = 188100

As a result, sum_cost soon overflows infinite_cost. Shall we cap the ratio so that it doesn't grow too quickly? Of course, some benchmark data is needed for tuning this heuristic.

Another problem is that the generated binary segfaults even when compiled at O0:

$ ./g++ -O0 pr90078.cc -o a.out -ftemplate-depth=100 -ftime-report -g -std=c++14
$ gdb --args ./a.out

Dump of assembler code for function main():
   0x00400572 <+0>:  push %rbp
   0x00400573 <+1>:  mov %rsp,%rbp
   0x00400576 <+4>:  sub $0x2625a020,%rsp
   0x0040057d <+11>: lea -0x2625a020(%rbp),%rax
   0x00400584 <+18>: mov %rax,%rdi
=> 0x00400587 <+21>: callq 0x4006c0 ::Tensor4()>
   0x0040058c <+26>: lea -0x4c4b410(%rbp),%rax
   0x00400593 <+33>: lea -0xe4e1c10(%rbp),%rdx

The segfault happens at the callq instruction.
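The numbers in the gdb session above can be reproduced with a short sketch, which also shows how few uses of this size are needed to reach the limit. The function names are hypothetical; the scaling formula `(cost - scratch) * ratio` is only what the `19 * 9900` line suggests and may not match get_scaled_computation_cost_at exactly; the limit of 10,000,000 is the infinite_cost value quoted in PR90240.

```cpp
#include <cstdint>

// Hypothetical stand-in for infinite_cost (value quoted in PR90240).
constexpr std::int64_t kInfiniteCost = 10000000;

// Integer ratio between the basic block count and the loop latch count.
constexpr std::int64_t count_ratio(std::int64_t bb_count, std::int64_t loop_count) {
  return bb_count / loop_count;
}

// Assumed scaling: only the non-scratch part of the cost is multiplied.
constexpr std::int64_t scaled_cost(std::int64_t cost, std::int64_t scratch,
                                   std::int64_t ratio) {
  return (cost - scratch) * ratio;
}

// Smallest number of uses whose summed cost reaches the limit.
constexpr std::int64_t uses_to_reach(std::int64_t per_use, std::int64_t limit) {
  return (limit + per_use - 1) / per_use;
}
```

With the values from the session, the ratio is 9900, one use costs 188100, and 54 such uses are enough to reach 10,000,000, which is fewer than the 70 iv_use groups mentioned in PR90240.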
[Bug tree-optimization/90021] [9 Regression] ICE in index_in_loop_nest, at tree-data-ref.h:587 since r270203
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90021 --- Comment #2 from bin cheng ---
We have {{0, +, 1}_6, +, 1}_4 in this case, and _6 is an outer loop of loop_nest. Function add_multivariate_self_dist was intentionally skipped in the PR89725 patch, but control flow gets to it because:

1) In analyze_miv_subscript, the equal access_fn case is handled specially rather than by the general miv analysis.
2) In add_other_self_distances, evolution_function_is_univariate_p returns false for the above access_fn.

It looks like we could also introduce another parameter, loopnum, to evolution_function_is_univariate_p, just like evolution_function_is_affine_multivariate_p, so that the outer loop's chrec is considered an invariant symbol here. OTOH, making changes in add_multivariate_self_dist still doesn't seem right in this case.
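To make the idea of treating the outer loop's chrec as an invariant symbol concrete, here is a minimal sketch that evaluates {{0, +, 1}_6, +, 1}_4. The struct and function names are hypothetical and the model is purely numeric; it says nothing about how GCC represents or analyzes chrecs internally.

```cpp
// Hypothetical numeric model of an affine chrec {base, +, step}_loop.
struct AffineChrecSketch {
  long base;
  long step;
};

// Value of {base, +, step} after `iteration` iterations of its loop.
constexpr long evaluate(AffineChrecSketch chrec, long iteration) {
  return chrec.base + chrec.step * iteration;
}

// {{0, +, 1}_6, +, 1}_4: the base is itself a chrec in loop 6.
// For a fixed iteration of loop 6 the base is just a number, so with
// respect to loop 4 the whole expression behaves like a univariate
// chrec {symbol, +, 1}_4.
constexpr long evaluate_nested(long loop6_iteration, long loop4_iteration) {
  long invariant_base = evaluate(AffineChrecSketch{0, 1}, loop6_iteration);
  return evaluate(AffineChrecSketch{invariant_base, 1}, loop4_iteration);
}
```

Under this view the step in loop 4 is the constant 1 whatever the state of loop 6 is, which is the property a loopnum-aware univariate check would rely on.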
[Bug tree-optimization/90021] [9 Regression] ICE in index_in_loop_nest, at tree-data-ref.h:587 since r270203
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90021 --- Comment #1 from bin cheng --- Sorry for the breakage, I will have a look.
[Bug middle-end/89725] [8/9 Regression] ICE in get_fnname_from_decl, at varasm.c:1723
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89725 --- Comment #11 from bin cheng ---
When a data reference has more access functions than the loop_nest used for data dependence analysis has loops, we need to skip/ignore the access functions corresponding to loops not in the loop_nest. So far this only happens in loop interchange, since we want to reuse data references collected in the outer loop.

When computing the classic dist/dir vectors, we need to avoid out-of-bound memory accesses. Univariate SCEVs can simply be bypassed by checking the loop/chrec_variable, as in the patch in comment #7. Of course, add_other_self_distances needs to be handled as well.

On the other hand, bypassing multivariate SCEVs would be harder and the impact is not yet clear; however, we can take another strategy and handle the SCEV of the outer loop as an invariant (symbol) with respect to the loop_nest during dependence analysis. As a matter of fact, the current code already does this in various places, e.g., by calling evolution_function_is_invariant_rec_p. After scanning the code, I think the only missing piece is in analyze_miv_subscript. I am testing a patch.
[Bug middle-end/89725] ICE in get_fnname_from_decl, at varasm.c:1723
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89725 --- Comment #9 from bin cheng ---
(In reply to Richard Biener from comment #8)
> (In reply to bin cheng from comment #7)
> > I am testing below simple fix, it bypass access functions doesn't belong to
> > analyzing loop_nest:
> >
> > diff --git a/gcc/tree-data-ref.c b/gcc/tree-data-ref.c
> > index e536b463e96..410d44f43e8 100644
> > --- a/gcc/tree-data-ref.c
> > +++ b/gcc/tree-data-ref.c
> > @@ -4272,6 +4272,7 @@ build_classic_dist_vector_1 (struct
> > data_dependence_relation *ddr,
> >  {
> >    unsigned i;
> >    lambda_vector init_v = lambda_vector_new (DDR_NB_LOOPS (ddr));
> > +  struct loop *loop = DDR_LOOP_NEST (ddr)[0];
> >
> >    for (i = 0; i < DDR_NUM_SUBSCRIPTS (ddr); i++)
> >      {
> > @@ -4302,6 +4303,15 @@ build_classic_dist_vector_1 (struct
> > data_dependence_relation *ddr,
> >        return false;
> >      }
> >
> > +  /* When data references are collected in a loop while data
> > +     dependences are analyzed in loop nest nested in the loop, we
> > +     would have more number of access functions than number of
> > +     loops.  Skip access functions of loops not in the loop nest.
> > +
> > +     See PR89725 for more information.  */
> > +  if (flow_loop_nested_p (get_loop (cfun, var_a), loop))
> > +    continue;
> > +
> >    dist = int_cst_value (SUB_DISTANCE (subscript));
> >    index = index_in_loop_nest (var_a, DDR_LOOP_NEST (ddr));
> >    *index_carry = MIN (index, *index_carry);
> >
> > Plus the assert in index_in_loop_nest.
>
> I wondered about chrecs like { 1, +, { 0 +, 1 }_1 }_2 (inner loop step
> or initial value evolves wrt outer loop). We'd not catch that here.
>
> Also if the above is possible then why not simply strip those
> subscripts when we build the DDR? That way the few other cases
> we do index_in_loop_nest also are "fixed".
>
> Meanwhile testing of my patch finished but shows an ICE for
>
> FAIL: gfortran.dg/vect/pr81303.f -O scan-tree-dump-times linterchange
> "is interchanged" 1
> FAIL: gfortran.dg/vect/pr81303.f -O (internal compiler error)
> FAIL: gfortran.dg/vect/pr81303.f -O (test for excess errors)
>
> #1 0x00a61759 in vec::operator[] (
> this=0x3119f50 = {...}, ix=3)
> at /space/rguenther/src/gcc-sccvn/gcc/vec.h:845
> 845 gcc_checking_assert (ix < m_vecpfx.m_num);
> (gdb)
> #3 0x01f2723a in should_interchange_loops (i_idx=3, o_idx=2,
> datarefs=..., i_stmt_cost=41, o_stmt_cost=5, innermost_loops_p=true,
> dump_info_p=true)
> at /space/rguenther/src/gcc-sccvn/gcc/gimple-loop-interchange.cc:1460
> 1460 tree iloop_stride = (*stride)[i_idx], oloop_stride =
> (*stride)[o_idx];
>
> where the interchange code would need further changes for my change of the
> loop-nest for DDRs.
>
> That said, can we strip subscripts for outer loops in
> initialize_data_dependence_relation when we compute them?
> OTOH the cases where we can ignore the subscript are not so clear
> given that the outer loop behavior can very well compute

Agreed, there may be more opportunities to disambiguate dependences when the access functions carry more SCEV information about the outer loop.

> non-aliasing. So selectively pruning just the unwanted distance
> vectors looks safe.

As you mentioned, the multivariate case needs to be handled by treating the outer loop SCEV as some kind of invariant. This is necessary no matter whether we bypass it in dist vector construction or in DDR initialization/computation. As you suggested, we can't undo it yet...

> But what about similar code in add_multivariate_self_dist or
> add_other_self_distances?
[Bug middle-end/89725] ICE in get_fnname_from_decl, at varasm.c:1723
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89725 --- Comment #7 from bin cheng ---
I am testing the simple fix below; it bypasses access functions that don't belong to the loop_nest being analyzed:

diff --git a/gcc/tree-data-ref.c b/gcc/tree-data-ref.c
index e536b463e96..410d44f43e8 100644
--- a/gcc/tree-data-ref.c
+++ b/gcc/tree-data-ref.c
@@ -4272,6 +4272,7 @@ build_classic_dist_vector_1 (struct data_dependence_relation *ddr,
 {
   unsigned i;
   lambda_vector init_v = lambda_vector_new (DDR_NB_LOOPS (ddr));
+  struct loop *loop = DDR_LOOP_NEST (ddr)[0];
 
   for (i = 0; i < DDR_NUM_SUBSCRIPTS (ddr); i++)
     {
@@ -4302,6 +4303,15 @@ build_classic_dist_vector_1 (struct data_dependence_relation *ddr,
 	      return false;
 	    }
 
+	  /* When data references are collected in a loop while data
+	     dependences are analyzed in loop nest nested in the loop, we
+	     would have more number of access functions than number of
+	     loops.  Skip access functions of loops not in the loop nest.
+
+	     See PR89725 for more information.  */
+	  if (flow_loop_nested_p (get_loop (cfun, var_a), loop))
+	    continue;
+
 	  dist = int_cst_value (SUB_DISTANCE (subscript));
 	  index = index_in_loop_nest (var_a, DDR_LOOP_NEST (ddr));
 	  *index_carry = MIN (index, *index_carry);

Plus the assert in index_in_loop_nest.
[Bug middle-end/89725] ICE in get_fnname_from_decl, at varasm.c:1723
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89725 --- Comment #6 from bin cheng ---
(In reply to Richard Biener from comment #4)
> I think the issue is that the DDR is bogus - loop interchange computes
> data-refs
> for a deeper nest (including some outer loops) than it ends up doing
> dependence checking later on. But we have access functions analyzed with
> respect to outer loops already.
>
> I think it would be possible to handle this in data dependence computation,
> simply treating evolutions in outer loops as invariants. Eventually the
> access functions evolving in outer loops can also be pruned? We can't
> really undo SCEV analysis on them.
>
> I think that Jakubs fix is too conservative though.
>
> Since we fail when we cannot compute the "invalid" subscript distance at the
> moment the safest fix would probably to create the DDR with the loop-nest
> we originally analyzed? Bin?

Unfortunately, no. The access functions are analyzed wrt outer loops in order to cache the find-data-reference process and thus save compilation time. Actually, we end up computing the ddr wrt the deeper loop_nest here because the computation with the originally analyzed loop_nest has failed. So this change won't do anything other than compute the same DDRs twice (and both would fail).

There may be a couple of ways out.

1. Cancel the data reference caching by collecting DRs for each loop_nest. At this stage, this might be the safest fix, but it is very expensive.
2. Fix the DDR analysis code. For example as you suggested, or maybe we can simply bypass the irrelevant part when computing the dir/dist vector?
3. Note that we already call prune_data_refs_not_in_loop; we could prune the access functions too. I am not sure whether this is feasible, nor whether it's useful enough to be exposed as a tree-data-ref.h interface. I will check.

> diff --git a/gcc/tree-data-ref.h b/gcc/tree-data-ref.h
> index 11aa806a64d..54651e903ff 100644
> --- a/gcc/tree-data-ref.h
> +++ b/gcc/tree-data-ref.h
> @@ -585,6 +585,7 @@ index_in_loop_nest (int var, vec loop_nest)
>      if (loopi->num == var)
>        break;
>
> +  gcc_assert (var_index < loop_nest.length ());
>    return var_index;
>  }

I guess this code should be included anyway, right?

Thanks
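The effect of the quoted assert, together with the option of bypassing loops that are not in the nest, can be sketched outside of GCC as follows. `LoopSketch` and both functions are hypothetical, and `std::vector` stands in for GCC's `vec`; the real index_in_loop_nest in tree-data-ref.h differs in its types.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a loop; only the loop number matters here.
struct LoopSketch {
  int num;
};

// Returns true when a loop with number `var` is part of the nest.
// A caller can use this to skip access functions of outer loops.
bool loop_in_nest(int var, const std::vector<LoopSketch>& loop_nest) {
  for (const LoopSketch& loop : loop_nest)
    if (loop.num == var)
      return true;
  return false;
}

// Position of loop number `var` in the nest. Without the assertion, a
// missing loop silently yields loop_nest.size(), an out-of-bound index.
std::size_t index_in_loop_nest_sketch(int var,
                                      const std::vector<LoopSketch>& loop_nest) {
  std::size_t var_index = 0;
  for (; var_index < loop_nest.size(); ++var_index)
    if (loop_nest[var_index].num == var)
      break;
  assert(var_index < loop_nest.size());
  return var_index;
}
```

A caller following the bypass approach would test `loop_in_nest` first and only then ask for the index, so the assertion documents an invariant rather than acting as the primary check.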
[Bug middle-end/89849] New: Worse code at O3 because of slp
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89849

            Bug ID: 89849
           Summary: Worse code at O3 because of slp
           Product: gcc
           Version: 9.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: middle-end
          Assignee: unassigned at gcc dot gnu.org
          Reporter: amker at gcc dot gnu.org
  Target Milestone: ---

Hi,
This is the code sample from scovit@IRC:

    struct ciao {
      long a;
      long b;
    };

    //__declspec(noinline)
    __attribute((noinline))
    struct ciao square(int num) {
      struct ciao beta;
      beta.a = num;
      beta.b = num*num;
      return beta;
    }

    int main(int a) {
      struct ciao tje = square(a);
      return tje.a * tje.b;
    }

O3 generates:

square:
.LFB0:
        .cfi_startproc
        movslq  %edi, %rax
        imull   %edi, %edi
        movq    %rax, %xmm0
        movslq  %edi, %rdi
        movq    %rdi, %xmm1
        punpcklqdq      %xmm1, %xmm0
        movaps  %xmm0, -24(%rsp)
        movq    -24(%rsp), %rax
        movq    -16(%rsp), %rdx
        ret
        .cfi_endproc
.LFE0:
        .size   square, .-square
        .section        .text.startup,"ax",@progbits
        .p2align 4
        .globl  main
        .type   main, @function
main:
.LFB1:
        .cfi_startproc
        subq    $8, %rsp
        .cfi_def_cfa_offset 16
        call    square
        addq    $8, %rsp
        .cfi_def_cfa_offset 8
        imull   %edx, %eax
        ret

While O1/O2 generate:

square:
.LFB0:
        .cfi_startproc
        movslq  %edi, %rax
        imull   %edi, %edi
        movslq  %edi, %rdx
        ret
        .cfi_endproc
.LFE0:
        .size   square, .-square
        .globl  main
        .type   main, @function
main:
.LFB1:
        .cfi_startproc
        call    square
        imull   %edx, %eax
        ret

It looks like SLP gives:

square (int num)
{
  vector(2) long int * vectp.7;
  vector(2) long int * vectp.6;
  struct ciao D.1917;
  long int _1;
  int _2;
  long int _3;
  vector(2) long int _8;
  vector(2) long int vect_cst__9;

  [local count: 1073741824]:
  _1 = (long int) num_4(D);
  _2 = num_4(D) * num_4(D);
  _3 = (long int) _2;
  _8 = {_1, _3};
  vect_cst__9 = _8;
  MEM[(struct ciao *)] = vect_cst__9;
  return D.1917;
}

And later passes fail to clean it up.
[Bug testsuite/89834] New test case gcc.dg/vect/pr81740-2.c introduced in r269938 fails on power 7
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89834 --- Comment #5 from bin cheng --- Thanks very much for reporting and fixing the issue.
[Bug rtl-optimization/89487] [7/8 Regression] ICE in expand_expr_addr_expr_1, at expr.c:7993
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89487 --- Comment #9 from bin cheng --- (In reply to Jakub Jelinek from comment #8) > *** Bug 89731 has been marked as a duplicate of this bug. *** Hi Jakub, is this (and the duplicate) fixed by the previous patches, or is the issue still there? Thanks.