[Bug tree-optimization/115060] Probable an issue around usage of vect_look_through_possible_promotion in tree-vect-patterns.cc
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115060 --- Comment #3 from Feng Xue --- Linaro reported a regression: https://linaro.atlassian.net/browse/GNU-1226

Actually, this is not a regression, but it exposes a new bug in vect_look_through_possible_promotion. The function fails to find the root definition when the cast chain involves more than two promotions with a sign change, as in:

  long a = (long)b;             // promotion cast
    -> int b = (int)c;          // promotion cast, sign change
      -> unsigned short c = ...;

For this case, the function thinks the 2nd cast has a different sign than the 1st and stops looking through, while "unsigned short -> int" is in fact a natural widening that preserves the value.
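As a minimal illustration (hypothetical, not the Linaro test case), such a three-step promotion chain can arise in a simple widening reduction:

  long
  f (unsigned short *c, int n)
  {
    long sum = 0;
    for (int i = 0; i < n; i++)
      sum += c[i];   /* unsigned short -> int -> long before the addition */
    return sum;
  }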
[Bug tree-optimization/115228] New: Suspicious code in tree-vect-data-refs.cc
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115228 Bug ID: 115228 Summary: Suspicious code in tree-vect-data-refs.cc Product: gcc Version: unknown Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: fxue at os dot amperecomputing.com Target Milestone: ---

Does the condition code below, in function vect_get_smallest_scalar_type(), miss the case of "SAD_EXPR"?

  if (assign)
    {
      scalar_type = TREE_TYPE (gimple_assign_lhs (assign));
      if (gimple_assign_cast_p (assign)
          || gimple_assign_rhs_code (assign) == DOT_PROD_EXPR
          || gimple_assign_rhs_code (assign) == WIDEN_SUM_EXPR
          || gimple_assign_rhs_code (assign) == WIDEN_MULT_EXPR
          || gimple_assign_rhs_code (assign) == WIDEN_LSHIFT_EXPR
          || gimple_assign_rhs_code (assign) == FLOAT_EXPR)
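If SAD_EXPR indeed belongs here, the condition would presumably gain one more arm; this is only a sketch, not a committed change:

  if (gimple_assign_cast_p (assign)
      || gimple_assign_rhs_code (assign) == DOT_PROD_EXPR
      || gimple_assign_rhs_code (assign) == SAD_EXPR
      || gimple_assign_rhs_code (assign) == WIDEN_SUM_EXPR
      || gimple_assign_rhs_code (assign) == WIDEN_MULT_EXPR
      || gimple_assign_rhs_code (assign) == WIDEN_LSHIFT_EXPR
      || gimple_assign_rhs_code (assign) == FLOAT_EXPR)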
[Bug tree-optimization/115060] New: Probable an issue around usage of vect_look_through_possible_promotion in tree-vect-patterns.cc
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115060 Bug ID: 115060 Summary: Probable an issue around usage of vect_look_through_possible_promotion in tree-vect-patterns.cc Product: gcc Version: 15.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: fxue at os dot amperecomputing.com Target Milestone: ---

The purpose of the function is to peel off type casts to find the root definition of a given value. If patterns are involved, intermediate pattern-defined SSAs are traversed instead of the originals. A subtlety here is that the root SSA (the return value) might be the original one, even if it has been replaced by a pattern. For example:

  a = (T1) patt_b;
  patt_b = (T2) c;        // b = ...
  patt_c = not-a-cast;    // c = old_seq

Given 'a', the function will return 'c', instead of 'patt_c'. If a caller only uses type information of the returned SSA, there is no problem. But if the caller's pattern-recognition analysis parses the definition statement of the returned SSA, the new pattern statement is bypassed. This is inconsistent with the processing logic of the pattern formation pass, which does not seem quite rational and looks like an issue, though it does not cause any miscompilation. Is there anything I missed here?

Take one code snippet as an example:

  vect_recog_mulhs_pattern ()
  {
    ...
    vect_unpromoted_value unprom_rshift_input;
    tree rshift_input = vect_look_through_possible_promotion
      (vinfo, gimple_assign_rhs1 (last_stmt), &unprom_rshift_input);
    ...
    /* Get the definition of the shift input.  */
    stmt_vec_info rshift_input_stmt_info
      = vect_get_internal_def (vinfo, rshift_input);
    if (!rshift_input_stmt_info)
      return NULL;

    gassign *rshift_input_stmt
      = dyn_cast <gassign *> (rshift_input_stmt_info->stmt);
    if (!rshift_input_stmt)
      return NULL;
    ...
    if (gimple_assign_rhs_code (rshift_input_stmt) == PLUS_EXPR)
      // How about if rshift_input_stmt has a pattern replacement?
      {
        ...
      }
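If the intent is to parse the pattern replacement rather than the original definition, one possible adjustment on the caller side (only a sketch, not a tested change) is to hop to the pattern statement explicitly via vect_stmt_to_vectorize:

  stmt_vec_info rshift_input_stmt_info
    = vect_get_internal_def (vinfo, rshift_input);
  if (!rshift_input_stmt_info)
    return NULL;

  /* Sketch: follow the pattern replacement, if any, before inspecting
     the definition statement.  */
  rshift_input_stmt_info = vect_stmt_to_vectorize (rshift_input_stmt_info);

  gassign *rshift_input_stmt
    = dyn_cast <gassign *> (rshift_input_stmt_info->stmt);
  if (!rshift_input_stmt)
    return NULL;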
[Bug tree-optimization/114769] [14 Regression] Suspicious code in vect_recog_sad_pattern() since r14-1832
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114769 --- Comment #3 from Feng Xue --- When half_type is null, the call chain would be:

  vect_supportable_direct_optab_p (vinfo, sum_type, SAD_EXPR, NULL, ...)
    -> get_vectype_for_scalar_type (vinfo, NULL)
      -> get_related_vectype_for_scalar_type (vinfo, /*scalar_type=*/NULL, ...)

           if ((!INTEGRAL_TYPE_P (scalar_type)
                && !POINTER_TYPE_P (scalar_type)
                && !SCALAR_FLOAT_TYPE_P (scalar_type))

and the NULL scalar_type is dereferenced here, so this segfaults.
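A defensive guard in get_related_vectype_for_scalar_type would avoid the crash; this is only a sketch, not the actual fix:

  /* Sketch: bail out when no scalar type is available.  */
  if (!scalar_type
      || (!INTEGRAL_TYPE_P (scalar_type)
          && !POINTER_TYPE_P (scalar_type)
          && !SCALAR_FLOAT_TYPE_P (scalar_type)))
    return NULL_TREE;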
[Bug tree-optimization/114769] New: Suspicious code in vect_recog_sad_pattern()
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114769 Bug ID: 114769 Summary: Suspicious code in vect_recog_sad_pattern() Product: gcc Version: 14.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: fxue at os dot amperecomputing.com Target Milestone: ---

This function may contain buggy code, introduced by the recent support for the new ABD pattern. It calls "vect_recog_absolute_difference" to check the ABS/ABSU statement and, if that succeeds, expects "vect_unpromoted_value unprom[2]" to be properly set for the two operands. However, in the final code snippet this is missed, so "unprom" could hold unwanted values.

  /* Failed to find a widen operation so we check for a regular MINUS_EXPR.  */
  gassign *diff = dyn_cast <gassign *> (STMT_VINFO_STMT (diff_stmt_vinfo));
  if (diff_stmt && diff
      && gimple_assign_rhs_code (diff) == MINUS_EXPR
      && TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (abs_oprnd)))
    {
      *diff_stmt = diff;
      *half_type = NULL_TREE;   // unprom is not set accordingly
      return true;
    }

This execution path would be triggered on an architecture that supports only the dot-product instruction, but has neither "IFN_ABD" nor "IFN_VEC_WIDEN_ABD".
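One possible way to keep "unprom" consistent on this path is to record the two MINUS_EXPR operands before returning. This is only a sketch (it assumes vect_unpromoted_value::set_op is appropriate here), not a verified fix:

      /* Sketch: fill unprom[] from the MINUS_EXPR operands so the caller
         does not read stale values.  */
      unprom[0].set_op (gimple_assign_rhs1 (diff), vect_internal_def, NULL);
      unprom[1].set_op (gimple_assign_rhs2 (diff), vect_internal_def, NULL);
      *diff_stmt = diff;
      *half_type = NULL_TREE;
      return true;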
[Bug tree-optimization/114440] New: Fail to recognize a chain of lane-reduced operations for loop reduction vect
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114440 Bug ID: 114440 Summary: Fail to recognize a chain of lane-reduced operations for loop reduction vect Product: gcc Version: 14.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: fxue at os dot amperecomputing.com Target Milestone: ---

In a loop reduction path containing a lane-reduced operation (DOT_PROD/SAD/WIDEN_SUM), the current vectorizer cannot handle the pattern if there are other operations, which might be normal ones or other lane-reduced ones. A pseudo example:

  char *d0, *d1;
  char *s0, *s1;
  char *w;
  int *n;
  ...
  int sum = 0;
  for (i) {
    ...
    sum += d0[i] * d1[i];      /* DOT_PROD */
    ...
    sum += abs(s0[i] - s1[i]); /* SAD */
    ...
    sum += w[i];               /* WIDEN_SUM */
    ...
    sum += n[i];               /* Normal */
    ...
  }
  ... = sum;

For this case, the reduction vectype varies with the operations, which causes a mismatch between the counts of vectorized defs and uses; a possible fix is to generate extra trivial pass-through copies. Given a concrete example:

  sum = 0;
  for (i) {
    sum += d0[i] * d1[i];  /* 16*char -> 4*int */
    sum += n[i];           /* 4*int -> 4*int */
  }

the final vectorized statements could be:

  sum_v0 = { 0, 0, 0, 0 };
  sum_v1 = { 0, 0, 0, 0 };
  sum_v2 = { 0, 0, 0, 0 };
  sum_v3 = { 0, 0, 0, 0 };

  for (i / 16) {
    sum_v0 += DOT_PROD (v_d0[i: 0 .. 15], v_d1[i: 0 .. 15]);
    sum_v1 += 0; // copy
    sum_v2 += 0; // copy
    sum_v3 += 0; // copy

    sum_v0 += v_n[i: 0 .. 3];
    sum_v1 += v_n[i: 4 .. 7];
    sum_v2 += v_n[i: 8 .. 11];
    sum_v3 += v_n[i: 12 .. 15];
  }

  sum = REDUC_PLUS(sum_v0 + sum_v1 + sum_v2 + sum_v3);

In the above sequence, each summation statement simply forms one pattern. However, we could easily compose a somewhat more complicated variant that gets into a similar situation, namely a chain of lane-reduced operations coming from the non-reduction addend of one summation statement, like:

  sum += d0[i] * d1[i] + abs(s0[i] - s1[i]) + n[i];

Probably, this requires some extension in the vector pattern formation stage to split the patterns.
[Bug tree-optimization/113091] Over-estimate SLP vector-to-scalar cost for non-live pattern statement
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113091 Feng Xue changed: What|Removed |Added Status|NEW |RESOLVED Resolution|--- |FIXED --- Comment #10 from Feng Xue --- Fixed
[Bug target/113326] Optimize vector shift with constant delta on shifting-count operand
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113326 --- Comment #7 from Feng Xue --- (In reply to Richard Biener from comment #6)
> (In reply to Andrew Pinski from comment #5)
> > One more thing:
> > ```
> > vect_shift_0 = vect_value >> { 0, 1, 2, 3 };
> > vect_shift_1 = vect_value >> { 4, 5, 6, 7 };
> > vect_shift_2 = vect_value >> { 8, 9, 10, 11 };
> > vect_shift_3 = vect_value >> { 12, 13, 14, 15 };
> > ```
> > vs
> > ```
> > vect_shift_0 = vect_value >> { 0, 1, 2, 3 };
> > vect_shift_1 = vect_shift_0 >> { 4, 4, 4, 4 };
> > vect_shift_2 = vect_shift_0 >> { 8, 8, 8, 8 };
> > vect_shift_3 = vect_shift_0 >> { 12, 12, 12, 12 };
> > ```
> >
> > the first has fully independent operations while in the second case, there
> > is one dependent and the are independent operations.
> >
> > On cores which has many vector units the first one might be faster than the
> > second one. So this needs a cost model too.
>
> Note the vectorizer has the shift values dependent as well (across
> iterations), we just constant propagate after unrolling here.
>
> Note this is basically asking for "strength-reduction" of expensive
> constants which could be more generally useful and not only for this
> specific shift case. Consider the same example but with an add instead
> of a shift for example, the same exact set of constants will appear.

It is. But I have only found that vector shift gets special treatment of constant operands based on their numerical pattern; I am not sure whether any other operator does. BTW, here is a scalar-version strength reduction for shift:

  int a = value >> n;
  int b = value >> (n + 6);
==>
  int a = value >> n;
  int b = a >> 6;   // (n + 6) is not needed

But this is not covered by the current scalar strength-reduction pass.
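To illustrate Richard's point that an add produces exactly the same set of constants, a hypothetical analogue of the original test case (illustration only; "bar" is not from the report) would be:

  int test(int array[16]);

  int bar(int value)
  {
    int array[16];

    for (int i = 0; i < 16; i++)
      array[i] = value + i;   /* add instead of shift: vectorization uses the
                                 same { 0, 1, 2, 3 }, { 4, 5, 6, 7 }, ...
                                 constant vectors */

    return test(array);
  }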
[Bug target/113326] Optimize vector shift with constant delta on shifting-count operand
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113326 --- Comment #3 from Feng Xue --- (In reply to Andrew Pinski from comment #1)
> Note on aarch64 with SVE, you should be able to generate those constants
> without a load, using the index instruction.

Ok. Thanks for the note. This still requires an extra instruction, whereas the constant delta could be encoded in the shift instruction as an immediate operand.

> Basically this requires an "un-shift" pass and most likely should be done at
> the RTL level though that might be too late.
> Maybe isel?

I'm thinking of adding the processing in pass_lower_vector_ssa, which also contains other peephole vector SSA optimizations, not just lowering.
[Bug tree-optimization/113326] New: Optimize vector shift with constant delta on shifting-count operand
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113326 Bug ID: 113326 Summary: Optimize vector shift with constant delta on shifting-count operand Product: gcc Version: unknown Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: fxue at os dot amperecomputing.com Target Milestone: ---

For a shift by an induction variable, loop vectorization could generate a series of vector shifts whose shifting-count operands are advanced by a constant delta. We could use the result of the first vector shift as a base, and transform the others into vector shifts of that result by a uniform delta, which would eliminate the non-uniform shift-count vector constants and thereby save a load on most architectures.

  int test(int array[16]);

  int foo(int value)
  {
    int array[16];

    for (int i = 0; i < 16; i++)
      array[i] = value >> i;

    return test(array);
  }

  // Current
  vect_value   = { value, value, value, value };
  vect_shift_0 = vect_value >> { 0, 1, 2, 3 };
  vect_shift_1 = vect_value >> { 4, 5, 6, 7 };
  vect_shift_2 = vect_value >> { 8, 9, 10, 11 };
  vect_shift_3 = vect_value >> { 12, 13, 14, 15 };

  // To be optimized to
  vect_value   = { value, value, value, value };
  vect_shift_0 = vect_value >> { 0, 1, 2, 3 };
  vect_shift_1 = vect_shift_0 >> { 4, 4, 4, 4 };
  vect_shift_2 = vect_shift_0 >> { 8, 8, 8, 8 };
  vect_shift_3 = vect_shift_0 >> { 12, 12, 12, 12 };
[Bug tree-optimization/113104] Suboptimal loop-based slp node splicing across iterations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113104 Feng Xue changed: What|Removed |Added Resolution|FIXED |--- Status|RESOLVED|REOPENED --- Comment #7 from Feng Xue --- Partial permutation (especially extract-low/high) would incur inefficient splicing in some situation even the vector mode with lowest cost is used. int test(unsigned array[4][4]); int foo(unsigned short *a, unsigned long n) { unsigned array[4][4]; for (unsigned i = 0; i < 4; i++, a += n) { array[i][0] = (a[0] << 3) - (a[4] << 6); array[i][1] = (a[1] << 3) - (a[5] << 6); array[i][2] = (a[2] << 3) - (a[6] << 6); array[i][3] = (a[3] << 3) - (a[7] << 6); } return test(array); } // After vect-compare-cost fix mov x2, x0 stp x29, x30, [sp, -80]! add x3, x2, x1, lsl 1 lsl x1, x1, 1 mov x29, sp add x4, x3, x1 add x0, sp, 16 ldr q5, [x2] ldr q31, [x4, x1] ldr q0, [x3, x1] ldr q1, [x2, x1] moviv28.4s, 0// zip1v29.2d, v0.2d, v31.2d// zip1v2.2d, v5.2d, v1.2d // zip2v31.2d, v0.2d, v31.2d// zip2v1.2d, v5.2d, v1.2d // zip1v30.8h, v29.8h, v28.8h // zip1v4.8h, v2.8h, v28.8h // superfluous zip1v27.8h, v31.8h, v28.8h // zip1v3.8h, v1.8h, v28.8h // zip2v29.8h, v29.8h, v28.8h // zip2v31.8h, v31.8h, v28.8h // zip2v2.8h, v2.8h, v28.8h // zip2v1.8h, v1.8h, v28.8h // shl v30.4s, v30.4s, 3 shl v29.4s, v29.4s, 3 shl v4.4s, v4.4s, 3 shl v2.4s, v2.4s, 3 shl v27.4s, v27.4s, 6 shl v31.4s, v31.4s, 6 shl v3.4s, v3.4s, 6 shl v1.4s, v1.4s, 6 sub v27.4s, v30.4s, v27.4s sub v31.4s, v29.4s, v31.4s sub v3.4s, v4.4s, v3.4s sub v1.4s, v2.4s, v1.4s stp q27, q31, [sp, 48] stp q3, q1, [sp, 16] bl test ldp x29, x30, [sp], 80 ret // Expect it to be optimized as: lsl x3, x1, 1 mov x2, x0 stp x29, x30, [sp, -80]! add x1, x2, x1, lsl 1 add x4, x1, x3 mov x29, sp add x0, sp, 16 ldr q30, [x2, x3] ldr q0, [x2] ushll v31.4s, v30.4h, 3 ushll2 v30.4s, v30.8h, 6 ushll v29.4s, v0.4h, 3 ushll2 v0.4s, v0.8h, 6 sub v30.4s, v31.4s, v30.4s sub v0.4s, v29.4s, v0.4s str q0, [sp, 16] ldr q0, [x1, x3] str q30, [sp, 32] ldr q30, [x4, x3] ushll v29.4s, v0.4h, 3 ushll2 v0.4s, v0.8h, 6 ushll v31.4s, v30.4h, 3 ushll2 v30.4s, v30.8h, 6 sub v0.4s, v29.4s, v0.4s sub v30.4s, v31.4s, v30.4s stp q0, q30, [sp, 48] bl test ldp x29, x30, [sp], 80 ret Based on cost arising from splicing, we still need a way to select the most profitable vector mode per slp node, over the global vector mode specified in vinfo.
[Bug tree-optimization/113091] Over-estimate SLP vector-to-scalar cost for non-live pattern statement
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113091 --- Comment #8 from Feng Xue --- https://gcc.gnu.org/pipermail/gcc-patches/2023-December/641547.html
[Bug tree-optimization/113091] Over-estimate SLP vector-to-scalar cost for non-live pattern statement
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113091 --- Comment #7 from Feng Xue --- > The issue here is that because the "outer" pattern consumes > patt_64 = (int) patt_63 it should have adjusted _2 = (int) _1 > stmt-to-vectorize > as being the outer pattern root stmt for all this logic to work correctly. We could not simply make this adjustment since pattern recognition does not require replaced SSA to be of single-use. If we change the above case to attach another scalar use to "_2" as: int foo(char *a, char *b) { unsigned array[8]; int a0 = a[0]; // _2 = (int) _1; array[0] = (a0 - b[0]); array[1] = (a[1] - b[1]); array[2] = (a[2] - b[2]); array[3] = (a[3] - b[3]); array[4] = (a[4] - b[4]); array[5] = (a[5] - b[5]); array[6] = (a[6] - b[6]); array[7] = (a[7] - b[7]); return test(array) + a0; } The pattern statement "patt_64 = (int) patt_63" for "_2 = (int) _1" should be kept. So we also need the check of "identifying whether a scalar stmt takes part in vectorization or not" to ensure the adjustment is doable.
[Bug tree-optimization/113091] Over-estimate SLP vector-to-scalar cost for non-live pattern statement
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113091 --- Comment #6 from Feng Xue --- (In reply to Richard Sandiford from comment #5) > > The issue here is that because the "outer" pattern consumes > > patt_64 = (int) patt_63 it should have adjusted _2 = (int) _1 > > stmt-to-vectorize > > as being the outer pattern root stmt for all this logic to work correctly. > > I don't think it can though, at least not in general. The final pattern > stmt has to compute the same value as the original scalar stmt. Could current pattern replacement support N:1 mapping (N stmts -> 1 pattern)? If not, probably this handing would break related code somewhere.
[Bug tree-optimization/113104] Suboptimal loop-based slp node splicing across iterations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113104 --- Comment #2 from Feng Xue --- (In reply to Richard Biener from comment #1) > See my proposal on the mailing list to lift the restriction of sticking to a > single vector size, I think this is another example showing this. If you > use BB level vectorization by disabling loop vectorization but not SLP > vectorization the code should improve? Yes, the loop is fully unrolled, and BB SLP would. I could not find the proposal, would you share me a link? Thanks
[Bug tree-optimization/113104] New: Suboptimal loop-based slp node splicing across iterations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113104 Bug ID: 113104 Summary: Suboptimal loop-based slp node splicing across iterations Product: gcc Version: 14.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: fxue at os dot amperecomputing.com Target Milestone: --- Given a partial vector-sized slp node in loop, code generation would utilize inter-iteration parallelism to archive full vectorization by splicing defs of the node in multiple iterations into one vector. This strategy is not always good, and could be refined in some situation. To be specific, we'd better not splice node if it participates in a full-vector-sized operation, otherwise a permute and vextract that are really unneeded would be introduced. Suppose target vector size is 128-bit, and a slp node is mapped to VEC_OP in an iteration. Depending on whether backend supports LO/HI version of the operation, there are two kinds code sequence for splicing. // Isolated 2 iterations res_v128_I0 = VEC_OP(opnd_v64_I0, ...) // iteration #0 res_v128_I1 = VEC_OP(opnd_v64_I1, ...) // iteration #1 // Spliced (1) opnd_v128_I0_I1 = { opnd_v64_I0, opnd_v64_I1 } // extra permute opnd_v64_lo = [vec_unpack_lo_expr] opnd_v128_I0_I1; // extra vextract opnd_v64_hi = [vec_unpack_hi_expr] opnd_v128_I0_I1; // extra vextract res_v128_I0 = VEC_OP(opnd_v64_lo, ...) res_v128_I1 = VEC_OP(opnd_v64_hi, ...) // Spliced (2) opnd_v128_I0_I1 = { opnd_v64_I0, opnd_v64_I1 } // extra permute res_v128_I0 = VEC_OP_LO(opnd_v128_i0_i1, ...) // similar or same as VEC_OP res_v128_I1 = VEC_OP_HI(opnd_v128_i0_i1, ...) // similar or same as VEC_OP Sometime, such permute and vextract might be optimized away by backend passes. But sometime, it can not. Here is a case on aarch64. int test(unsigned array[4][4]); int foo(unsigned short *a, unsigned long n) { unsigned array[4][4]; for (unsigned i = 0; i < 4; i++, a += n) { array[i][0] = a[0] << 6; array[i][1] = a[1] << 6; array[i][2] = a[2] << 6; array[i][3] = a[3] << 6; } return test(array); } // Current code generation mov x2, x0 stp x29, x30, [sp, -80]! add x3, x2, x1, lsl 1 lsl x1, x1, 1 mov x29, sp add x4, x3, x1 ldr d0, [x2] moviv30.4s, 0 add x0, sp, 16 ldr d31, [x2, x1] ldr d29, [x3, x1] ldr d28, [x4, x1] ins v0.d[1], v31.d[0]// ins v29.d[1], v28.d[0] // zip1v1.8h, v0.8h, v30.8h // superfluous zip2v0.8h, v0.8h, v30.8h // zip1v31.8h, v29.8h, v30.8h // zip2v29.8h, v29.8h, v30.8h // shl v1.4s, v1.4s, 6 shl v0.4s, v0.4s, 6 shl v31.4s, v31.4s, 6 shl v29.4s, v29.4s, 6 stp q1, q0, [sp, 16] stp q31, q29, [sp, 48] bl test ldp x29, x30, [sp], 80 ret // May be optimized to: stp x29, x30, [sp, -80]! mov x29, sp mov x2, x0 add x0, sp, 16 lsl x3, x1, 1 add x1, x2, x1, lsl 1 add x4, x1, x3 ldr d31, [x2, x3] ushll v31.4s, v31.4h, 6 ldr d30, [x2] ushll v30.4s, v30.4h, 6 str q30, [sp, 16] ldr d30, [x1, x3] ushll v30.4s, v30.4h, 6 str q31, [sp, 32] ldr d31, [x4, x3] ushll v31.4s, v31.4h, 6 stp q30, q31, [sp, 48] bl test ldp x29, x30, [sp], 80 ret
[Bug tree-optimization/113091] Over-estimate SLP vector-to-scalar cost for non-live pattern statement
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113091 --- Comment #3 from Feng Xue --- The function vect_bb_vectorization_profitable_p resorts to a recursive way to identify scalar use, for this case, setting STMT_VINFO_LIVE_P or not would not change scalar cost computation.
[Bug tree-optimization/113091] Over-estimate SLP vector-to-scalar cost for non-live pattern statement
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113091 --- Comment #2 from Feng Xue --- (In reply to Richard Biener from comment #1) > It's the logic > > FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_STMTS (node), i, stmt_info) > { > if (svisited.contains (stmt_info)) > continue; > stmt_vec_info orig_stmt_info = vect_orig_stmt (stmt_info); > if (STMT_VINFO_IN_PATTERN_P (orig_stmt_info) > && STMT_VINFO_RELATED_STMT (orig_stmt_info) != stmt_info) > /* Only the pattern root stmt computes the original scalar value. */ > continue; > bool mark_visited = true; > gimple *orig_stmt = orig_stmt_info->stmt; > ssa_op_iter op_iter; > def_operand_p def_p; > FOR_EACH_PHI_OR_STMT_DEF (def_p, orig_stmt, op_iter, SSA_OP_DEF) > { > imm_use_iterator use_iter; > gimple *use_stmt; > stmt_vec_info use_stmt_info; > FOR_EACH_IMM_USE_STMT (use_stmt, use_iter, DEF_FROM_PTR (def_p)) > if (!is_gimple_debug (use_stmt)) > { > use_stmt_info = bb_vinfo->lookup_stmt (use_stmt); > if (!use_stmt_info > || !PURE_SLP_STMT (vect_stmt_to_vectorize > (use_stmt_info))) > { > STMT_VINFO_LIVE_P (stmt_info) = true; > > specifically the last check. That's supposed to pick up the "main" pattern > that's now covering the scalar stmt. > > But somehow the "main" pattern, > > patt_67 = .VEC_WIDEN_MINUS (_1, _3); // _5 = _2 - _4; > patt_68 = (signed short) patt_67; // _5 = _2 - _4; > patt_69 = (int) patt_68; // _5 = _2 - _4; > > doesn't get picked up here. I wonder what's the orig_stmt and the def > picked and what original scalar use we end up in where the > vect_stmt_to_vectorize isn't the "last" pattern. Maybe we really want This problem happens at slp node: note: node 0x425bc38 (max_nunits=8, refcnt=1) vector(8) char note: op template: _1 = *a_50(D); note: stmt 0 _1 = *a_50(D); note: stmt 1 _7 = MEM[(char *)a_50(D) + 1B]; note: stmt 2 _13 = MEM[(char *)a_50(D) + 2B]; note: stmt 3 _19 = MEM[(char *)a_50(D) + 3B]; note: stmt 4 _25 = MEM[(char *)a_50(D) + 4B]; note: stmt 5 _31 = MEM[(char *)a_50(D) + 5B]; note: stmt 6 _37 = MEM[(char *)a_50(D) + 6B]; note: stmt 7 _43 = MEM[(char *)a_50(D) + 7B]; The orig_stmt is "_1 = *a_50(D)" The use stmt is "_2 = (int) _1", whose pattern statement is "patt_64 = (int) patt_63", which is not referenced by any original or other pattern statements. Or in other word, the orig_stmt could be absorbed into a vector operation, without any outlier scalar use. The fore-mentioned "last check" in vect_bb_slp_mark_live_stmts would make the orig_stmt to be STMT_VINFO_LIVE_P, which actually implies it has scalar use (though it should not have), the difference is re-generating the def somewhere, rather than retaining the original scalar statement. And the following "vectorizable_live_operation" would account the new operations into vectorization cost of the SLP instance. The function vect_bb_vectorization_profitable_p resorts to a recursive way to identify scalar use, for this case, setting STMT_VINFO_LIVE_P or not would change scalar cost computation. If we can avoid such fake-liveness adjustment on the statements we are interested in, vectorization cost could beat scalar cost, and make the former succeed. Unvectorized: mov x2, x0 stp x29, x30, [sp, -48]! 
  mov     x29, sp
  ldrb    w3, [x1]
  ldrb    w4, [x1, 1]
  add     x0, sp, 16
  ldrb    w9, [x2]
  ldrb    w8, [x2, 1]
  sub     w9, w9, w3
  ldrb    w7, [x2, 2]
  ldrb    w3, [x1, 2]
  sub     w8, w8, w4
  ldrb    w6, [x2, 3]
  ldrb    w4, [x1, 3]
  sub     w7, w7, w3
  ldrb    w10, [x1, 5]
  ldrb    w3, [x1, 4]
  sub     w6, w6, w4
  ldrb    w5, [x2, 4]
  ldrb    w4, [x2, 5]
  sub     w5, w5, w3
  ldrb    w3, [x2, 6]
  sub     w4, w4, w10
  ldrb    w2, [x2, 7]
  ldrb    w10, [x1, 6]
  ldrb    w1, [x1, 7]
  sub     w3, w3, w10
  stp     w9, w8, [sp, 16]
  sub     w1, w2, w1
  stp     w7, w6, [sp, 24]
  stp     w5, w4, [sp, 32]
  stp     w3, w1, [sp, 40]
  bl      test
  ldp     x29, x30, [sp], 48
  ret

Vectorized:

  mov     x2, x0
  stp     x29, x30, [sp, -48]!
  mov     x29, sp
  ldr     d1, [x1]
  add     x0, sp, 16
  ldr     d0, [x2]
  usubl   v0.8h, v0.8b, v1.8b
  sxtl    v1.4s, v0.4h
  sxtl2   v0.4s, v0.8h
  stp     q1, q0, [sp, 16]
  bl      test
  ldp     x29, x30, [sp], 48
  ret

> these "overlapping" patterns, but IMHO having "two entries" into
> a chain of scalar stmts is bad and we should link up the whole matched
> sequence to the final "root" instead?
>
> That said, the
[Bug tree-optimization/113091] New: Over-estimate SLP vector-to-scalar cost for non-live pattern statement
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113091 Bug ID: 113091 Summary: Over-estimate SLP vector-to-scalar cost for non-live pattern statement Product: gcc Version: 14.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: fxue at os dot amperecomputing.com Target Milestone: ---

Gcc fails to vectorize the below testcase on aarch64.

  int test(unsigned array[8]);

  int foo(char *a, char *b)
  {
    unsigned array[8];

    array[0] = (a[0] - b[0]);
    array[1] = (a[1] - b[1]);
    array[2] = (a[2] - b[2]);
    array[3] = (a[3] - b[3]);
    array[4] = (a[4] - b[4]);
    array[5] = (a[5] - b[5]);
    array[6] = (a[6] - b[6]);
    array[7] = (a[7] - b[7]);

    return test(array);
  }

The dump shows that the loads of a[i] and b[i] are considered to be live as scalar references, which results in an over-estimated vector-to-scalar cost.

  *a_50(D) 1 times vec_to_scalar costs 2 in epilogue
  MEM[(char *)a_50(D) + 1B] 1 times vec_to_scalar costs 2 in epilogue
  MEM[(char *)a_50(D) + 2B] 1 times vec_to_scalar costs 2 in epilogue
  MEM[(char *)a_50(D) + 3B] 1 times vec_to_scalar costs 2 in epilogue
  MEM[(char *)a_50(D) + 4B] 1 times vec_to_scalar costs 2 in epilogue
  MEM[(char *)a_50(D) + 5B] 1 times vec_to_scalar costs 2 in epilogue
  MEM[(char *)a_50(D) + 6B] 1 times vec_to_scalar costs 2 in epilogue
  MEM[(char *)a_50(D) + 7B] 1 times vec_to_scalar costs 2 in epilogue
  *b_51(D) 1 times vec_to_scalar costs 2 in epilogue
  MEM[(char *)b_51(D) + 1B] 1 times vec_to_scalar costs 2 in epilogue
  MEM[(char *)b_51(D) + 2B] 1 times vec_to_scalar costs 2 in epilogue
  MEM[(char *)b_51(D) + 3B] 1 times vec_to_scalar costs 2 in epilogue
  MEM[(char *)b_51(D) + 4B] 1 times vec_to_scalar costs 2 in epilogue
  MEM[(char *)b_51(D) + 5B] 1 times vec_to_scalar costs 2 in epilogue
  MEM[(char *)b_51(D) + 6B] 1 times vec_to_scalar costs 2 in epilogue
  MEM[(char *)b_51(D) + 7B] 1 times vec_to_scalar costs 2 in epilogue

Subtraction on char type is recognized as widen-sub, and involves two kinds of pattern replacement.

  * Original
  _1 = *a_50(D);
  _2 = (int) _1;
  _3 = *b_51(D);
  _4 = (int) _3;
  _5 = _2 - _4;

  * After pattern replacement
  patt_63 = (unsigned short) _1;        // _2 = (int) _1;
  patt_64 = (int) patt_63;              // _2 = (int) _1;
  patt_65 = (unsigned short) _3;        // _4 = (int) _3;
  patt_66 = (int) patt_65;              // _4 = (int) _3;
  patt_67 = .VEC_WIDEN_MINUS (_1, _3);  // _5 = _2 - _4;
  patt_68 = (signed short) patt_67;     // _5 = _2 - _4;
  patt_69 = (int) patt_68;              // _5 = _2 - _4;

For the statement "_2 = (int) _1", its vectorization representative "patt_64 = (int) patt_63" is not marked as PURE_SLP, so it is conservatively considered to have scalar uses and to be live outside of the SLP bb (in the function vect_bb_slp_mark_live_stmts). However, the pattern definition is actually dead and should not contribute to the vector-to-scalar cost. Since defs from pattern statements are not part of the function body, we cannot track their def/use chains as for ordinary SSAs. Probably, we may have a quick fix for one situation: if the original SSA "_2" has a single use, its existence is fully covered by the vectorized operation, no matter what it would look like without pattern replacement.
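A pseudocode-level sketch of that quick fix (hypothetical; it assumes has_single_use and the pattern flag are the right tests inside vect_bb_slp_mark_live_stmts):

              /* Sketch: the use statement was itself replaced by a pattern
                 and its scalar result has no other consumers, so the original
                 def is fully covered by the vectorized operation; do not mark
                 it live.  */
              if (use_stmt_info
                  && STMT_VINFO_IN_PATTERN_P (use_stmt_info))
                {
                  tree use_lhs = gimple_get_lhs (use_stmt_info->stmt);
                  if (use_lhs && has_single_use (use_lhs))
                    continue;
                }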
[Bug target/106671] aarch64: BTI instruction are not inserted for cross-section direct calls
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106671 Feng Xue changed: What|Removed |Added CC||fxue at os dot amperecomputing.com --- Comment #9 from Feng Xue --- On some occasions, we may not use the new ld, the kernel-building relies on its own runtime linker which is used for kernel modules. So I created a patch (https://gcc.gnu.org/pipermail/gcc-patches/2023-August/626084.html), and this provides user another option that could be done at the compiler side.
[Bug tree-optimization/109427] New: Wrong param description in param.opt
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109427 Bug ID: 109427 Summary: Wrong param description in param.opt Product: gcc Version: unknown Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: fxue at os dot amperecomputing.com Target Milestone: ---

-param=vect-induction-float=
Common Joined UInteger Var(param_vect_induction_float) Init(1) IntegerRage(0, 1) Param Optimization
                                                               ^^^
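Presumably the entry was meant to read "IntegerRange"; the corrected spelling would be:

-param=vect-induction-float=
Common Joined UInteger Var(param_vect_induction_float) Init(1) IntegerRange(0, 1) Param Optimization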
[Bug ipa/108445] Address expression on global variable is not normalized
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108445 --- Comment #1 from Feng Xue --- Created attachment 54297 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=54297&action=edit testcase
[Bug ipa/108445] New: Address expression on global variable is not normalized
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108445 Bug ID: 108445 Summary: Address expression on global variable is not normalized Product: gcc Version: 13.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: ipa Assignee: unassigned at gcc dot gnu.org Reporter: fxue at os dot amperecomputing.com CC: marxin at gcc dot gnu.org Target Milestone: --- Created attachment 54296 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=54296=edit testcase Compile two files with option "-flto -g -O2", addr_f1.c int gArray[16]; addr_f2.c extern int gArray[]; int foo(int *a) { int *p = a; return *p; } int main(int argc, char *argv[]) { if (argc & 1) gArray[argc - 1] = 1; if (argc > 1) return foo(gArray); return 0; } We will get an ICE as: addr_f2.c:10:5: error: invalid address operand in ‘mem_ref’ 10 | int main(int argc, char *argv[]) | ^ MEM[(int *)]; # VUSE <.MEM_9> _8 = MEM[(int *)]; during GIMPLE pass: fixup_cfg addr_f2.c:10:5: internal compiler error: verify_gimple failed 0xe87e8b verify_gimple_in_cfg(function*, bool, bool) ../../gcc/tree-cfg.cc:5647 0xd55eb4 execute_function_todo ../../gcc/passes.cc:2091 0xd567e3 do_per_function ../../gcc/passes.cc:1694 0xd567e3 execute_todo ../../gcc/passes.cc:2145 Detailed representation of above is _REF[], not a normalized address expression, which is rejected by gimple verifier. For an address expression on global variable, LTO gimple serializer would do a trick transformation to change it to such kind of redundant form, then rely on gimple de-serializer to revert the change. However, in some situation, de-serializer may fail to do that(as the testcase), this does not cause ICE because it almost happens on gimple-debug statement. With the commit "r13-4743-gda85bfc75024a92b97e60e4436863dd5789786ec", an constant address expression might be shared by gimple-debug and other statements , and thus makes the issue exposed. 1. # DEBUG BEGIN_STMT 2. # DEBUG a => 3. # DEBUG INLINE_ENTRY foo 4. # DEBUG BEGIN_STMT 5. # DEBUG p => 6. # DEBUG BEGIN_STMT 7. _10 = MEM[(int *)]; After early-inlining, the "main" function will end up with above gimples, in which stmt "5" and "7" refer to shared
[Bug rtl-optimization/108117] Wrong instruction scheduling on value coming from abnormal SSA
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108117 --- Comment #8 from Feng Xue --- (In reply to Andrew Pinski from comment #7) > (In reply to Feng Xue from comment #6) > > (In reply to Andrew Pinski from comment #2) > > > https://en.cppreference.com/w/c/program/setjmp > > > > I think that most programmers are not aware of this, neither I for sure. > > Usage of volatile here is not that intuitive as for purpose of > > multi-execution synchronization > > You should read up on > https://en.cppreference.com/w/cpp/utility/program/signal too. >From viewpoint of programmer, setjmp/longjmp is somewhat similar to C++ exception handling, which happens in one logical execution context, while signal implies two unrelated contexts. In another angle, because gcc already model control flow and SSA web for setjmp/longjmp, explicit volatile specification is not really needed. But signal mechanism is kind of asynchronous exception, for which it is impractical for compiler to do that, so volatile is must.
[Bug rtl-optimization/108117] Wrong instruction scheduling on value coming from abnormal SSA
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108117 --- Comment #6 from Feng Xue --- (In reply to Andrew Pinski from comment #2) > https://en.cppreference.com/w/c/program/setjmp I think that most programmers are not aware of this, neither I for sure. Usage of volatile here is not that intuitive as for purpose of multi-execution synchronization, it just tells compiler to keep variable in memory, instead of register. With current SSA representation in GCC, probably, this could be automatically identified without explicit source level specifier. That is, any SSA name occurring in abnormal phi should go to memory during expansion to rtl.
[Bug rtl-optimization/108117] New: Wrong instruction scheduling on value coming from abnormal SSA
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108117 Bug ID: 108117 Summary: Wrong instruction scheduling on value coming from abnormal SSA Product: gcc Version: unknown Status: UNCONFIRMED Severity: normal Priority: P3 Component: rtl-optimization Assignee: unassigned at gcc dot gnu.org Reporter: fxue at os dot amperecomputing.com Target Milestone: ---

Compile the following code with "-O2" on AArch64.

  #include <setjmp.h>
  #include <stdio.h>

  jmp_buf ex_buf;

  __attribute__((noinline)) int fn_throw(int x)
  {
     if (x == 1)
       longjmp(ex_buf, 1);
     return 1;
  }

  int main(int argc, char** argv)
  {
     int va = 0;
     int vb = 0;

     if (!setjmp(ex_buf)) {
         va = fn_throw(1);  /* throw via longjmp */
         vb = 1;
     } else
         printf("Got exception, va = %d\n", va);

     if (vb)
         printf("Failed, vb should not = %d!\n", vb);

     return 0;
  }

Since "fn_throw" involves abnormal control-flow transfer, any statement after "va = fn_throw(1)" should not be hoisted before the call. In this case, "vb = 1" is moved before it by the RTL instruction scheduler, which leads to an incorrect result. Similar to C++ exception handling, setjmp/longjmp generates SSA names occurring in abnormal PHIs, and these should be treated specially. However, it looks like the RTL passes do not respect this characteristic in their code motions. Currently the issue is only exposed on AArch64, where the sched1 pass is enabled, but it is also a potential problem on other backends.
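Per the setjmp documentation referenced in the comments above, a source-level workaround (illustration only) is to declare the locals volatile, so their values are not kept in registers across the setjmp:

  volatile int va = 0;
  volatile int vb = 0;

  if (!setjmp(ex_buf)) {
      va = fn_throw(1);  /* throw via longjmp */
      vb = 1;
  }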
[Bug tree-optimization/107828] tree-inlining would generate SSA with incorrect def stmt
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107828 Feng Xue changed: What|Removed |Added Resolution|INVALID |--- Status|RESOLVED|WAITING --- Comment #4 from Feng Xue --- (In reply to Martin Jambor from comment #2) > I don't see any correctness issue here. It is true that when IPA-SRA > needs to create a temporary SSA_NAME for an assignment with > VIEW_CONVERT_EXPR (that is the only statement that extra_stmts can > contain at the moment), the SSA_NAME then get's re-mapped to another > while woe could have just re-use the existing one, it is already > created within the new function. > > We could use the knowledge that extra_stmts is really only this one > statement or nothing and avoid the remapping but that would be at the > expense of universality and extensibility of the interface between > tree-inline.cc and ipa-param-manipulations.cc. > > Please let me know if I am missing something (and preferably with a > test-case that demonstrates the problem). Sorry for late response. It is hard to compose a simple test case, since this could only be exposed with our own private code. To reproduce it is not that straightforward, but we could forcefully do that with a minor change to "ipa_param_body_adjustments::modify_assignment" in ipa-param-manipulation.cc. diff --git a/gcc/ipa-param-manipulation.cc b/gcc/ipa-param-manipulation.cc index cee0e23f946..202c0fda32b 100644 --- a/gcc/ipa-param-manipulation.cc +++ b/gcc/ipa-param-manipulation.cc @@ -1787,7 +1787,7 @@ ipa_param_body_adjustments::modify_assignment (gimple *stmt, any = modify_expression (lhs_p, false); any |= modify_expression (rhs_p, false); if (any - && !useless_type_conversion_p (TREE_TYPE (*lhs_p), TREE_TYPE (*rhs_p))) + && true /* !useless_type_conversion_p (TREE_TYPE (*lhs_p), TREE_TYPE (*rhs_p)) */) { if (TREE_CODE (*rhs_p) == CONSTRUCTOR) { @@ -1805,7 +1805,14 @@ ipa_param_body_adjustments::modify_assignment (gimple *stmt, *rhs_p); tree tmp = force_gimple_operand (new_rhs, extra_stmts, true, NULL_TREE); - gimple_assign_set_rhs1 (stmt, tmp); + + tree tmp1 = make_ssa_name (tmp); + gimple *copy = gimple_build_assign (tmp1, tmp); + + auto gsi = gsi_last (*extra_stmts); + gsi_insert_after_without_update (, copy, GSI_NEW_STMT); + + gimple_assign_set_rhs1 (stmt, tmp1); } return true; } The purpose of above code piece is to add trivial ssa copy statement to make the below "extra_stmts" not empty: remap_gimple_stmt() { ... if (id->param_body_adjs) { ... if (!gimple_seq_empty_p (extra_stmts)) { memset (, 0, sizeof (wi)); wi.info = id; for (gimple_stmt_iterator egsi = gsi_start (extra_stmts); !gsi_end_p (egsi); gsi_next ()) walk_gimple_op (gsi_stmt (egsi), remap_gimple_op_r, ); ^ ... } } ... } Now, we use the case gcc.dg/ipa/ipa-sra-1.c, and option is "-O3 -flto -fipa-sra". Trace into "remap_gimple_op_r" as above, check remapping on copy: <&0x76c282d0> ISRA.0_1 = ISRA.0; A new SSA name "ISRA.0_2" is created, after setting definition statement for it via: if (TREE_CODE (*tp) == SSA_NAME) { *tp = remap_ssa_name (*tp, id); *walk_subtrees = 0; if (is_lhs) SSA_NAME_DEF_STMT (*tp) = wi_p->stmt; ^ return NULL; } Here, both "ISRA.0_1" and "ISRA.O_2" belong to same function, and point to same def stmt.
[Bug lto/107829] New: Trivial compile time tracking code
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107829 Bug ID: 107829 Summary: Trivial compile time tracking code Product: gcc Version: unknown Status: UNCONFIRMED Severity: normal Priority: P3 Component: lto Assignee: unassigned at gcc dot gnu.org Reporter: fxue at os dot amperecomputing.com CC: marxin at gcc dot gnu.org Target Milestone: --- In the function "materialize_cgraph" of lto.cc: /* Start the appropriate timer depending on the mode that we are operating in. */ lto_timer = (flag_wpa) ? TV_WHOPR_WPA : (flag_ltrans) ? TV_WHOPR_LTRANS : TV_LTO; timevar_push (lto_timer); current_function_decl = NULL; set_cfun (NULL); if (!quiet_flag) fprintf (stderr, "\n"); timevar_pop (lto_timer); Execution of code enclosed by time-variable "lto_timer" is expected to be of zero time. Adding time tracking here seems to be superfluous.
[Bug tree-optimization/107828] New: tree-inlining would generate SSA with incorrect def stmt
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107828 Bug ID: 107828 Summary: tree-inlining would generate SSA with incorrect def stmt Product: gcc Version: unknown Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: fxue at os dot amperecomputing.com Target Milestone: --- In the function "remap_gimple_op_r" of tree-inlining.cc: ... if (TREE_CODE (*tp) == SSA_NAME) { *tp = remap_ssa_name (*tp, id); *walk_subtrees = 0; if (is_lhs) SSA_NAME_DEF_STMT (*tp) = wi_p->stmt; return NULL; } The definition statement of original "*tp" may be same as wi_p->stmt. So, we will end up with a statement that it is pointed by both old and new SSA name, while the old one should have been reclaimed. And this happens when some parameter needs to be adjusted during inline versioning as: remap_gimple_stmt() { ... if (id->param_body_adjs) { ... if (!gimple_seq_empty_p (extra_stmts)) { memset (, 0, sizeof (wi)); wi.info = id; for (gimple_stmt_iterator egsi = gsi_start (extra_stmts); !gsi_end_p (egsi); gsi_next ()) walk_gimple_op (gsi_stmt (egsi), remap_gimple_op_r, ); ^ ... } } ... }
[Bug tree-optimization/107818] New: Overflow of linemap breaks its chronological order
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107818 Bug ID: 107818 Summary: Overflow of linemap breaks its chronological order Product: gcc Version: unknown Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: fxue at os dot amperecomputing.com Target Milestone: --- Large-size source codes might exceed representation space of linemap. When this happens, UNKNOWN_LOCATION(0) would inserted to the end of linemap. And the action breaks the order constraint on the map, which requires all logical locations contained by it should remain chronologically-ordered, so that binary search could be used.(Comments in linemap_ordinary_map_lookup). In the function "linemap_add": ... if (start_location >= LINE_MAP_MAX_LOCATION) /* We ran out of line map space. */ start_location = 0; line_map_ordinary *map = linemap_check_ordinary (new_linemap (set, start_location)); ^^^ UNKNOWN_LOCATION(0) is also added to linemap map->reason = reason; ... Afterwards, logical location to source line would be mis-translated. pr86872 only partially fixed see-able ICE due to linemap overflow, but made a this hidden issue.
[Bug ipa/107670] New: Suspicious redundant code in ipa-cp
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107670 Bug ID: 107670 Summary: Suspicious redundant code in ipa-cp Product: gcc Version: unknown Status: UNCONFIRMED Severity: normal Priority: P3 Component: ipa Assignee: unassigned at gcc dot gnu.org Reporter: fxue at os dot amperecomputing.com CC: marxin at gcc dot gnu.org Target Milestone: --- Function "ipa_prop_read_jump_functions()" in ipa-prop.c: ipa_check_create_node_params (); ipa_check_create_edge_args (); These two are also called inside the immediately following statement: ipa_register_cgraph_hooks(); Function "ipa_fn_summary_read()" in ipa-fnsummary.cc: ipa-cp's hooks will be registered twice in this function. One is indirectly done by ipa_prop_read_jump_functions(), and the other is by a nearby call to ipa_register_cgraph_hooks.
[Bug tree-optimization/107066] Field initialized before ctor is mis-optimized away by DSE
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107066 --- Comment #3 from Feng Xue --- Got it. Thanks for that.
[Bug tree-optimization/107066] New: Field initialized before ctor is mis-optimized away by DSE
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107066 Bug ID: 107066 Summary: Field initialized before ctor is mis-optimized away by DSE Product: gcc Version: unknown Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: fxue at os dot amperecomputing.com Target Milestone: --- By means of user-defined new operator, it is possible that a field is initialized before constructor. #include class A { public: int f1; int f2; A() : f2(2) { } void *operator new(size_t size) { void *mem = ::operator new(size); A *obj = static_cast(mem); obj->f1 = 1; return obj; } }; A* foo () { return new A(); } The original gimple code of foo() is: struct A * foo () { void * D.2444; void * _9; : _9 = operator new (8); MEM[(struct A *)_9].f1 = 1; MEM[(struct A *)_9] ={v} {CLOBBER}; MEM[(struct A *)_9].f2 = 2; return _9; } In gimple, there exists a pseudo clobber statement marking beginning of constructor code. Although the statement is of no side effect, it is regarded as normal store by DSE when determining store redundancy. Consequently, DSE thought that "MEM[(struct A *)_9].f1 = 1" was killed by "MEM[(struct A *)_9] ={v} {CLOBBER}", and removed it. After DSE pass,the foo becomes: struct A * foo () { void * D.2444; void * _9; : _9 = operator new (8); MEM[(struct A *)_9] ={v} {CLOBBER}; MEM[(struct A *)_9].f2 = 2; return _9; }
[Bug ipa/102513] [10/11/12 Regression] Many false positive warnings with recursive function
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102513 --- Comment #10 from Feng Xue --- (In reply to Martin Jambor from comment #8) > I am about to thest the following patch. In longer-run, it would be better > to never generate lattice values outside of the value_range but there is an > ordering problem, we need the complete VR info before we can use it. I plan > to rearrange IPA-CP into making multiple passes over the lattice dependency > graph and this should quite naturally be solved by doing this kind of > resursive-value-generation only in second and later passes. > > > diff --git a/gcc/ipa-cp.cc b/gcc/ipa-cp.cc > index 453e9c93cc3..cbbb8bbc80a 100644 > --- a/gcc/ipa-cp.cc > +++ b/gcc/ipa-cp.cc > @@ -6154,8 +6154,16 @@ decide_whether_version_node (struct cgraph_node *node) > { > ipcp_value *val; > for (val = lat->values; val; val = val->next) > - ret |= decide_about_value (node, i, -1, val, , > - _gen_clones); > + { > + if (!plats->m_value_range.bottom_p () > + && !plats->m_value_range.m_vr.contains_p (val->value)) > + { > + gcc_checking_assert (val->self_recursion_generated_p ()); > + continue; > + } > + ret |= decide_about_value (node, i, -1, val, , > +_gen_clones); > + } > } > >if (!plats->aggs_bottom) Here is a complication that value range for recursion index might not be not easily computed, and could not prevent IPA-CP generating useless copy. Constraint of recursion index comes from "block2[level][x]", not its value range deduced from condition predicate (level > 0). Change the case to cover up value range of "level", and we will get same warning. So in the circumstances, one way for us is to disable warning for these copies? extern int block2[7][256]; extern unsigned G_START; extern unsigned G_SCALE; static int encode_block(int block2[7][256], unsigned level) { int best_score = 0; for (unsigned x = G_START; x < G_SCALE * level; x++) { int v = block2[1][x]; block2[level][x] = 0; best_score += v * v; } if (G_SCALE * level > G_START && best_score > 64) { int score = 0; score += encode_block(block2, level - 1); score += encode_block(block2, level - 1); if (score < best_score) { best_score = score; } } return best_score; }
[Bug ipa/104377] Unreachable code in create_specialized_node of ipa-prop.c?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104377 --- Comment #4 from Feng Xue --- (In reply to Martin Jambor from comment #2) > (In reply to Feng Xue from comment #1) > > > > OK. I does missed something. Here we could not hold assumption that > > ipcp_decision_stage() only sees raw cgraph node, since sometime in the > > future some new ipa pass may be added prior to ipa-cp, and this pass > > introduces clone node. > > Right, initially IPA-SRA was developed as a pass before IPA-CP and it > may well be that we decide to swap the order again. > > > > > However, there is a questionable point about the code snippet > > > > if (!node->can_change_signature > > || old_adj->op != IPA_PARAM_OP_COPY > > || (!known_csts[old_adj->base_index] > > && ipa_is_param_used (info, old_adj->base_index))) > > > > In ipa-cp, known_csts is for the node, has no relation to the node's origin > > node, but here it is accessed via index of the latter (old_adj->base_index), > > will this cause out-of-bound error? > > I think the code is correct. Assume IPA-SRA running before IPA-CP, and > we're compiling a function with two argument, with indices 0 and 1. > > Analysis phases of both passes run before the IPA (WPA) phases of > either. This is important to keep in mind. > > IPA SRA removes the first one with index zero as useless, IPA-CP wants > to remove the second one with index 1, possibly because it is constant > everywhere. In oder to that it has to combine the pre-existing > adjustments with its own changes. > > Before create_specialized_node, the pass checks whether previous > passes did not kill some parameters and stops caring about them, but > it does not re-index anything, all lattices, jump functions, > everything, still keep their positions (and thus indices) they got in > the analysis phase. > > Then create_specialized_node hits this loop. For i=0 encounters an > old_adj element that actually describes the parameter which originally > had index 1. The pass looks up in base_index what the original > (pre-IPA) index of the parameter was (1) and uses those to look up in > its own structures whether it wants to remove it or not. > > Bounds should be always OK, base_index should never be greater than > the original pre-IPA number of parameters (-1) and known_csts should > always have just as many parameters. > > Does that make sense? Yes. Thanks for your explanation.
[Bug tree-optimization/102513] [10/11/12 Regression] Many false positive warnings with recursive function
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102513 Feng Xue changed: What|Removed |Added CC||fxue at os dot amperecomputing.com --- Comment #7 from Feng Xue --- (In reply to Jakub Jelinek from comment #6) > It is in > /* Recursively generate lattice values with a limited count. */ > FOR_EACH_VEC_ELT (val_seeds, i, src_val) > { > for (int j = 1; j < max_recursive_depth; j++) > { > tree cstval = get_val_across_arith_op (opcode, opnd1_type, > opnd2, > src_val, res_type); > if (!cstval > || !ipacp_value_safe_for_type (res_type, cstval)) > break; > > ret |= dest_lat->add_value (cstval, cs, src_val, src_idx, > src_offset, _val, j); > gcc_checking_assert (src_val); > } > } > (but there is another spot doing the similar thing) where it would be nice > to also break if cstval is non-NULL and safe for type, but is outside of the > value range. I have no idea how to get from this spot at that value range > though. By default, ipcp is told to clone a recursive function 8 times, that exceeds value space of index in this case. We could rely on ipa fnsummary on condition predicate of a call to avoid generating never-executed copy. I will take it.
[Bug ipa/104377] Unreachable code in create_specialized_node of ipa-prop.c?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104377 --- Comment #1 from Feng Xue --- (In reply to Feng Xue from comment #0) > For function create_specialized_node(), the "node" to operated on seems > always to be an original cgraph node, never a clone node. From call graph > related to the function, we know that ipcp_decision_stage () only passes raw > cgraph node downwards to its callees. Then, "node" reaching > create_specialized_node() would not be a clone, so the code enclosed by "if > (old_adjustments)" might be of no use. But I am not sure sure if there is > some thing that I missed. > > ipcp_driver > | > '--> ipcp_decision_stage >| >'--> decide_whether_version_node > | > |--> decide_about_value > | | > '-'--> create_specialized_node OK. I does missed something. Here we could not hold assumption that ipcp_decision_stage() only sees raw cgraph node, since sometime in the future some new ipa pass may be added prior to ipa-cp, and this pass introduces clone node. However, there is a questionable point about the code snippet if (!node->can_change_signature || old_adj->op != IPA_PARAM_OP_COPY || (!known_csts[old_adj->base_index] && ipa_is_param_used (info, old_adj->base_index))) In ipa-cp, known_csts is for the node, has no relation to the node's origin node, but here it is accessed via index of the latter (old_adj->base_index), will this cause out-of-bound error?
[Bug ipa/104377] New: Unreachable code in create_specialized_node of ipa-prop.c?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104377 Bug ID: 104377 Summary: Unreachable code in create_specialized_node of ipa-prop.c? Product: gcc Version: unknown Status: UNCONFIRMED Severity: normal Priority: P3 Component: ipa Assignee: unassigned at gcc dot gnu.org Reporter: fxue at os dot amperecomputing.com CC: marxin at gcc dot gnu.org Target Milestone: --- For function create_specialized_node(), the "node" to operated on seems always to be an original cgraph node, never a clone node. From call graph related to the function, we know that ipcp_decision_stage () only passes raw cgraph node downwards to its callees. Then, "node" reaching create_specialized_node() would not be a clone, so the code enclosed by "if (old_adjustments)" might be of no use. But I am not sure sure if there is some thing that I missed. ipcp_driver | '--> ipcp_decision_stage | '--> decide_whether_version_node | |--> decide_about_value | | '-'--> create_specialized_node
[Bug tree-optimization/103786] New: Suspicious code in verify_type
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103786 Bug ID: 103786 Summary: Suspicious code in verify_type Product: gcc Version: unknown Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: fxue at os dot amperecomputing.com Target Milestone: ---

  tree ct = TYPE_CANONICAL (t);
  if (!ct)
    ;
  else if (TYPE_CANONICAL (t) != ct)
                           ^
                           should be ct?
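For reference, the check the report suggests was intended (a sketch only) compares the canonical type against its own canonical:

  tree ct = TYPE_CANONICAL (t);
  if (!ct)
    ;
  else if (TYPE_CANONICAL (ct) != ct)
    ...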
[Bug tree-optimization/100802] VRP fails to fold comparison using known value orders
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100802 Feng Xue changed: What|Removed |Added Status|RESOLVED|VERIFIED --- Comment #2 from Feng Xue --- Verified.
[Bug bootstrap/102681] [12 Regression] AArch64 bootstrap failure
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102681 --- Comment #4 from Feng Xue --- (In reply to Martin Sebor from comment #3) > Simply initializing the variable as in the patch below avoids the warning. > The control flow in the code is sufficiently opaque to make it worthwhile > from a readability point irrespective of whether or not the variable can, in > fact, be used uninitialized. > > index e50d3fc3b62..c7f0a405ff6 100644 > --- a/gcc/calls.c > +++ b/gcc/calls.c > @@ -199,7 +199,7 @@ stack_region_maybe_used_p (poly_uint64 lower_bound, > poly_uint64 upper_bound, > static void > mark_stack_region_used (poly_uint64 lower_bound, poly_uint64 upper_bound) > { > - unsigned HOST_WIDE_INT const_lower, const_upper; > + unsigned HOST_WIDE_INT const_lower, const_upper = 0; >const_lower = constant_lower_bound (lower_bound); >if (upper_bound.is_constant (_upper)) > for (unsigned HOST_WIDE_INT i = const_lower; i < const_upper; ++i) This code looks good, the warning seems to be an over-kill. Will this change be checked in as a fix?
[Bug tree-optimization/102681] New: AArch64 bootstrap failure
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102681 Bug ID: 102681 Summary: AArch64 bootstrap failure Product: gcc Version: unknown Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: fxue at os dot amperecomputing.com Target Milestone: --- This occurs after commit "Loosen loop crossing restriction in threader" (ec0124e0acb556cdf5dba0e8d0ca6b69d9537fcc). In function ‘void mark_stack_region_used(poly_uint64, poly_uint64)’, inlined from ‘rtx_def* emit_library_call_value_1(int, rtx, rtx, libcall_type, machine_mode, int, rtx_mode_t*)’ at ../../gcc/calls.c:4536:29: ../../gcc/calls.c:206:26: error: ‘const_upper’ may be used uninitialized in this function [-Werror=maybe-uninitialized] 206 | stack_usage_map[i] = 1; | ~~~^~~ ../../gcc/calls.c: In function ‘rtx_def* emit_library_call_value_1(int, rtx, rtx, libcall_type, machine_mode, int, rtx_mode_t*)’: ../../gcc/calls.c:202:39: note: ‘const_upper’ was declared here 202 | unsigned HOST_WIDE_INT const_lower, const_upper; | ^~~
[Bug tree-optimization/102451] Suspicious null-pointer dereference in delete_dead_or_redundant_call
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102451 Feng Xue changed: What|Removed |Added Status|NEW |RESOLVED Resolution|--- |FIXED --- Comment #5 from Feng Xue --- Fixed
[Bug tree-optimization/102400] Field might miss initialization in vn_reference_insert_pieces()
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102400 Feng Xue changed: What|Removed |Added Resolution|--- |FIXED Status|NEW |RESOLVED --- Comment #4 from Feng Xue --- Fixed
[Bug tree-optimization/102451] Suspicious null-pointer dereference in delete_dead_or_redundant_call
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102451 --- Comment #2 from Feng Xue --- (In reply to Richard Biener from comment #1) > Confirmed. Mind fixing it by recording the basic-block index before > removing/replacing? OK.
[Bug tree-optimization/102451] New: Suspicious null-pointer dereference in delete_dead_or_redundant_call
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102451 Bug ID: 102451 Summary: Suspicious null-pointer dereference in delete_dead_or_redundant_call Product: gcc Version: unknown Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: fxue at os dot amperecomputing.com Target Milestone: ---

In the code snippet:

  if (lhs)
    {
      if (gsi_replace (gsi, new_stmt, true))
        bitmap_set_bit (need_eh_cleanup, gimple_bb (stmt)->index);
    }
  else
    {
      if (gsi_remove (gsi, true))
        bitmap_set_bit (need_eh_cleanup, gimple_bb (stmt)->index);
    }

it is known that "gsi" points to "stmt", and gsi_replace/gsi_remove will clear the basic block of "stmt". bitmap_set_bit then dereferences "index" through a null bb.
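Following Richard's suggestion quoted in comment #2 above, a sketch of the fix records the basic block before the statement is detached:

  /* Sketch: remember the containing block while "stmt" is still in the CFG.  */
  basic_block bb = gimple_bb (stmt);

  if (lhs)
    {
      if (gsi_replace (gsi, new_stmt, true))
        bitmap_set_bit (need_eh_cleanup, bb->index);
    }
  else
    {
      if (gsi_remove (gsi, true))
        bitmap_set_bit (need_eh_cleanup, bb->index);
    }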
[Bug tree-optimization/102400] New: Field might miss initialization in vn_reference_insert_pieces()
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102400 Bug ID: 102400 Summary: Field might miss initialization in vn_reference_insert_pieces() Product: gcc Version: unknown Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: fxue at os dot amperecomputing.com Target Milestone: --- There is no initialization for vr1->result_vdef, and it may contain non-zero garbage value.
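A one-line sketch of the initialization being asked for, placed where the other vr1 fields are filled in vn_reference_insert_pieces () (illustrative; the actual fix may differ):

  vr1->result_vdef = NULL_TREE;   /* do not leave the field holding stack garbage */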
[Bug c++/102214] New: ICE when compiling local class with -fno-weak
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102214 Bug ID: 102214 Summary: ICE when compiling local class with -fno-weak Product: gcc Version: unknown Status: UNCONFIRMED Severity: normal Priority: P3 Component: c++ Assignee: unassigned at gcc dot gnu.org Reporter: fxue at os dot amperecomputing.com Target Milestone: --- Created attachment 51413 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=51413=edit testcase Compiling attached testcase with option "-flto -O3 -fno-weak" will get: during IPA pass: *free_lang_data ic.cpp:5:11: internal compiler error: in mangle_decl, at cp/mangle.c:4088 5 | class InnerParent { | ^~~ 0xbe3ef2 mangle_decl(tree_node*) ../../gcc/cp/mangle.c:4088 0x17b0ac4 decl_assembler_name(tree_node*) ../../gcc/tree.c:710 0x17bd358 assign_assembler_name_if_needed(tree_node*) ../../gcc/tree.c:6318 0x17bd471 free_lang_data_in_cgraph ../../gcc/tree.c:6366 0x17bd606 free_lang_data ../../gcc/tree.c:6411 0x17bd7ac execute ../../gcc/tree.c:6483
[Bug ipa/101502] Inconsistent behavior in maybe_record_node()
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101502 --- Comment #2 from Feng Xue --- (In reply to Martin Liška from comment #1) > @Honza: Can you please take a look? > @Feng: Do you have a test-case for it, please? No. I derived this from reading the code logic, but I am not 100% sure.
[Bug ipa/101502] New: Inconsistent behavior in maybe_record_node()
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101502 Bug ID: 101502 Summary: Inconsistent behavior in maybe_record_node() Product: gcc Version: unknown Status: UNCONFIRMED Severity: normal Priority: P3 Component: ipa Assignee: unassigned at gcc dot gnu.org Reporter: fxue at os dot amperecomputing.com CC: marxin at gcc dot gnu.org Target Milestone: --- For code snippet in maybe_record_node(), behavior might be inconsistent when SANITIZE_UNREACHABLE is turned on. Suppose we are adding two functions (method_fn/__cxa_pure_virtual), content of nodes is order-sensitive. o. method_fn, __cxa_pure_virtual -> nodes: [ method_fn ] o. __cxa_pure_virtual, method_fn -> nodes: [ __cxa_pure_virtual, method_fn ] BTW: assertion on target_node->real_symbol_p() is redundant, since the condition is contained. else if (target_node != NULL && (TREE_PUBLIC (target) || DECL_EXTERNAL (target) || target_node->definition) && target_node->real_symbol_p ()) { gcc_assert (!target_node->inlined_to); gcc_assert (target_node->real_symbol_p ()); /* When sanitizing, do not assume that __cxa_pure_virtual is not called by valid program. */ if (flag_sanitize & SANITIZE_UNREACHABLE) ; /* Only add pure virtual if it is the only possible target. This way we will preserve the diagnostics about pure virtual called in many cases without disabling optimization in other. */ else if (pure_virtual) { if (nodes.length ()) return; } /* If we found a real target, take away cxa_pure_virtual. */ else if (!pure_virtual && nodes.length () == 1 && is_cxa_pure_virtual_p (nodes[0]->decl)) nodes.pop (); if (pure_virtual && nodes.length ()) return; if (!inserted->add (target)) { cached_polymorphic_call_targets->add (target_node); nodes.safe_push (target_node); } }
[Bug tree-optimization/100802] New: VRP fails to fold comparison with known value orders
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100802 Bug ID: 100802 Summary: VRP fails to fold comparison with known value orders Product: gcc Version: unknown Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: fxue at os dot amperecomputing.com Target Milestone: --- In the case below, it is obvious that we have value orders as i < a <= b inside loop. So "if (i >= b)" could be optimized away, but VRP fails to do that. int f1(); int f2(); int foo (unsigned a, unsigned b) { if (a <= b) { for (unsigned i = 0; i < a; i++) { f1 (); if (i >= b) f2 (); } } return 0; }
[Bug c++/100580] New: ICE with -fdump-passes
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100580 Bug ID: 100580 Summary: ICE with -fdump-passes Product: gcc Version: unknown Status: UNCONFIRMED Severity: normal Priority: P3 Component: c++ Assignee: unassigned at gcc dot gnu.org Reporter: fxue at os dot amperecomputing.com Target Milestone: --- Created attachment 50805 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50805=edit Source file Reproduce ICE using the command: g++ -fpreprocessed TimeControl.ii -std=c++98 -fdump-passes
[Bug tree-optimization/100222] New: Redundant mark_irreducible_loops () in predicate.c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100222 Bug ID: 100222 Summary: Redundant mark_irreducible_loops () in predicate.c Product: gcc Version: tree-ssa Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: fxue at os dot amperecomputing.com Target Milestone: --- Found two places where a redundant mark_irreducible_loops() is called after loop_optimizer_init (LOOPS_NORMAL): since "LOOPS_NORMAL" already includes "LOOPS_HAVE_MARKED_IRREDUCIBLE_REGIONS", the extra calls are unnecessary. They are in pass_profile::execute () and report_predictor_hitrates().
[Bug ipa/99951] Dead return value after modify_call() is not released
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99951 --- Comment #2 from Feng Xue --- Can we report an error in verify_ssa() when a non-default-def SSA name's defining statement has a NULL bb, which always indicates that the statement has been removed somewhere?
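A hedged sketch of what such a verifier check could look like; the exact placement inside verify_ssa() and the message wording are assumptions, not an actual patch:

  /* A non-default-def SSA name whose defining statement has no basic block
     was defined by a statement that has since been removed.  */
  gimple *def_stmt = SSA_NAME_DEF_STMT (name);
  if (!SSA_NAME_IS_DEFAULT_DEF (name)
      && def_stmt
      && gimple_bb (def_stmt) == NULL)
    error ("SSA_NAME defined by a statement that was removed");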
[Bug ipa/99951] New: Dead return value after modify_call() is not released
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99951 Bug ID: 99951 Summary: Dead return value after modify_call() is not released Product: gcc Version: tree-ssa Status: UNCONFIRMED Severity: normal Priority: P3 Component: ipa Assignee: unassigned at gcc dot gnu.org Reporter: fxue at os dot amperecomputing.com CC: marxin at gcc dot gnu.org Target Milestone: --- By ipa_param_adjustments::modify_call(), a call statement with a return value might be replaced with a new void-return call. But the original return-value SSA name is not released, which would give us a garbage SSA name when traversing with FOR_EACH_SSA_NAME. In the following code, all uses of stmt's lhs are redirected to a new one, and the lhs becomes unused, but it is never released from the SSA name table. if (tree lhs = gimple_call_lhs (stmt)) { if (!m_skip_return) gimple_call_set_lhs (new_stmt, lhs); else if (TREE_CODE (lhs) == SSA_NAME) { /* LHS should now by a default-def SSA. Unfortunately default-def SSA_NAMEs need a backing variable (or at least some code examining SSAs assumes it is non-NULL). So we either have to re-use the decl we have at hand or introdice a new one. */ tree repl = create_tmp_var (TREE_TYPE (lhs), "removed_return"); repl = get_or_create_ssa_default_def (cfun, repl); SSA_NAME_IS_DEFAULT_DEF (repl) = true; imm_use_iterator ui; use_operand_p use_p; gimple *using_stmt; FOR_EACH_IMM_USE_STMT (using_stmt, ui, lhs) { FOR_EACH_IMM_USE_ON_STMT (use_p, ui) { SET_USE (use_p, repl); } update_stmt (using_stmt); } } }
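A minimal sketch of the missing cleanup, appended after the use-redirection loop quoted above (illustrative only; whether modify_call() is the right place to do it is part of the question):

  /* All uses of the old return value have been redirected to "repl",
     so the old SSA name is dead and can go back to the name table.  */
  release_ssa_name (lhs);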
[Bug ipa/98815] Redundant free_dominance_info in cgraph_node::analyze()
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98815 --- Comment #2 from Feng Xue --- If we step into free_dominance_info(dir), it is a wrapper of free_dominance_info (cfun, dir), which means it assumes a non-NULL "cfun". Additionally, please go through the calling stack of free_dominance_info(): free_dominance_info () -> if (!dom_info_available_p ()) -> dom_info_state (fn, dir) != DOM_NONE -> if (!fn->cfg) return DOM_NONE; This shows that when "fn->cfg" is NULL, free_dominance_info() does nothing. Above all, if the conditional check "cfun && fn->cfg" is true, the two calls to free_dominance_info() in cgraph_node::analyze() are redundant, and if the check is false, these calls are trivial no-ops. So we could remove those calls in cgraph_node::analyze(), not in execute_pass_list().
[Bug tree-optimization/98815] New: Redundant free_dominance_info in cgraph_node::analyze()
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98815 Bug ID: 98815 Summary: Redundant free_dominance_info in cgraph_node::analyze() Product: gcc Version: tree-ssa Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: fxue at os dot amperecomputing.com Target Milestone: --- Since "execute_pass_list ()" already clears dominance information, the two free_dominance_info() calls after the line "execute_pass_list (cfun, g->get_passes ()->all_lowering_passes)" seem to be redundant and could be removed.
[Bug rtl-optimization/98782] IRA artificially creating spills due to BB frequencies
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98782 Feng Xue changed: What|Removed |Added CC||fxue at os dot amperecomputing.com --- Comment #1 from Feng Xue --- The value "foo + 1024" is spilled for both cases, but in different way. For bad case, spill outside loop, and only reload inside. While for good case, spill/reload pair occurs around the call to "bar", which might also consider extra cost of using caller-saved registers. It seems that IRA has two different logics to handle spilling. [Bad case] foo: stp x29, x30, [sp, -80]! mov w5, 753 mov x29, sp stp x19, x20, [sp, 16] mov x19, x1 mul w1, w0, w5 stp x21, x22, [sp, 32] mov w22, 5271 add w2, w1, 7 mov w21, w5 mul w3, w0, w22 mov w20, 760 mov w22, 0 str w0, [sp, 76] add x0, x19, 1024 str x0, [sp, 64] // Spill (foo + 1024) .p2align 3,,7 .L5: ldrbw0, [x19] cbz w0, .L2 ldr w0, [sp, 76] stp w1, w2, [sp, 56] str w3, [sp, 72] bl bar ldrbw0, [x19, 1]! ldp w1, w2, [sp, 56] add w21, w21, w0 ldr w3, [sp, 72] mul w20, w20, w0 ldr x0, [sp, 64] // Reload (foo + 1024) add w22, w22, w20 cmp x19, x0 bne .L5 b .L4 .p2align 2,,3 .L2: ldrbw0, [x19, 1]! add w21, w21, w0 mul w20, w20, w0 ldr x0, [sp, 64] // Reload (foo + 1024) add w22, w22, w20 cmp x0, x19 bne .L5 .L4: add w0, w20, w21 add w0, w0, w22 ldp x19, x20, [sp, 16] ldp x21, x22, [sp, 32] ldp x29, x30, [sp], 80 ret [Good case:] foo: stp x29, x30, [sp, -80]! mov w5, 753 add x7, x1, 1024 mul w2, w0, w5 mov x29, sp stp x21, x22, [sp, 32] mov w21, 5271 mov w22, w5 stp x19, x20, [sp, 16] mov x19, x1 mul w3, w0, w21 stp w2, w0, [sp, 72] // Spill x(%w0) add w2, w2, 7 // t2(%w2) mov w21, 0 mov w20, 760 .p2align 3,,7 .L5: ldrbw0, [x19] cbz w0, .L2 ldp w1, w0, [sp, 72] // Reload x stp w2, w3, [sp, 56] // Spill t2 str x7, [sp, 64] // Spill (foo + 1024) bl bar ldrbw0, [x19, 1]! ldr x7, [sp, 64] // Reload (foo + 1024) add w22, w22, w0 ldp w2, w3, [sp, 56] // Reload t2 mul w20, w20, w0 add w21, w21, w20 cmp x19, x7 bne .L5 b .L4 .p2align 2,,3 .L2: ldrbw0, [x19, 1]! add w22, w22, w0 mul w20, w20, w0 add w21, w21, w20 cmp x7, x19 bne .L5 .L4: add w0, w20, w22 add w0, w0, w21 ldp x19, x20, [sp, 16] ldp x21, x22, [sp, 32] ldp x29, x30, [sp], 80 ret Even for good case, we could expect better spill/reload generation. Refer to comments above, "x" and "t2" are similar, both loop invariant, but handled differently. Spilling "t2" inside loop is worst than spilling it outside, as what IRA does for "x". Both issues could be correlated to same thing.
[Bug tree-optimization/97066] [11 Regression] regression caused by r11-3207
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97066 --- Comment #1 from Feng Xue --- Both cases will be simplified with new pattern, and this is correct. Will change test code to make that not happen.
[Bug tree-optimization/94234] missed ccp folding for (addr + 8 * n) - (addr + 8 * (n - 1))
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94234 Feng Xue changed: What|Removed |Added Resolution|--- |FIXED Status|NEW |RESOLVED --- Comment #6 from Feng Xue --- Fixed.
[Bug tree-optimization/92712] [8/9 Regression] Performance regression with assumed values
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92712 Feng Xue changed: What|Removed |Added CC||fxue at os dot amperecomputing.com --- Comment #24 from Feng Xue --- (In reply to Jakub Jelinek from comment #15) > For > A*B+A*C -> (B+C)*A > the problematic cases are > A==-1 and B > 0 and C==(max-B)+1 (i.e. when B+C overflows to min) > or A==0 and B < 0 and C<min-B > or A==0 and B > 0 and C>max-B > (last two cases cover when B+C overflows) > For > A*B-A*C -> (B-C)*A > the problematic cases are > A==-1 and B > 0 and C==min+B (i.e. when B-C is min) > or A==0 and B < -1 and C>B-min > or A==0 and B >= 0 and C<B-max > (last two cases cover when B-C overflows) > Again, we perform the operation right now if A is not 0 and not -1 for > certain. > I guess we could handle those cases by using something like > check_for_binary_op_overflow, except that for the case where A might be -1 > and plusminus equal to MINUS_EXPR we also need to make sure the result of > B-C is known not to be min. Another point: if B+-C can be folded to an existing gimple value, might we deduce that B+-C does not overflow?
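To make the first quoted case concrete (assuming 32-bit int): with A == -1, B == 1 and C == INT_MAX, the original A*B + A*C evaluates to -1 + (-INT_MAX) == INT_MIN with no overflow, while the folded (B + C)*A would first compute 1 + INT_MAX, which already overflows. So the transform is only safe when A is known not to be 0 or -1, or when B +- C is known not to overflow, which is what the suggestion about B +- C folding to an existing gimple value is aiming at.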
[Bug tree-optimization/94352] New: Suspicious code in tree-ssa-propagate.c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94352 Bug ID: 94352 Summary: Suspicious code in tree-ssa-propagate.c Product: gcc Version: tree-ssa Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: fxue at os dot amperecomputing.com Target Milestone: --- In function ssa_propagation_engine::ssa_propagate(), a call chain below might lead to a reference to uninitialized data. ssa_propagate() -> ssa_prop_init() -> add_control_edge() -> "curr_order" But ssa_prop_init() is called before "curr_order" is initialized to 0.
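A minimal sketch of the reordering the report implies for ssa_propagation_engine::ssa_propagate() (illustrative; the member names are taken from the report):

  /* Make the ordering counter sane before ssa_prop_init () runs, since
     add_control_edge () already reads it during initialization.  */
  curr_order = 0;
  ssa_prop_init ();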
[Bug tree-optimization/94234] New: missed ccp folding for (addr + 8 * n) - (addr + 8 * (n - 1))
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94234 Bug ID: 94234 Summary: missed ccp folding for (addr + 8 * n) - (addr + 8 * (n - 1)) Product: gcc Version: unknown Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: fxue at os dot amperecomputing.com Target Milestone: --- CCP could not fold the following expression to a constant "8": size_t foo (char *a, size_t n) { char *b1 = a + 8 * n; char *b2 = a + 8 * (n - 1); return b1 - b2; } But if we change b1 and b2 to integer type, folding happens. size_t foo (char *a, size_t n) { size_t b1 = (size_t)(a + 8 * n); size_t b2 = (size_t)(a + 8 * (n - 1)); return b1 - b2; }
[Bug tree-optimization/93023] New: give preference to address iv without offset in ivopts
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93023 Bug ID: 93023 Summary: give preference to address iv without offset in ivopts Product: gcc Version: unknown Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: fxue at os dot amperecomputing.com Target Milestone: --- From address IVs with the same base and index, ivopts always picks up the one with a non-zero offset. This does not incur extra cost on architectures like X86, which have an LEA instruction to combine the offset into the address computation. But on ARM, one more add-with-offset instruction is required. X86: lea addr_reg, base[index + offset] ARM: add addr_reg, base, index add addr_reg, addr_reg, offset So choosing the IV w/o offset can save one instruction in most situations. Here is an example; compile it on aarch64. int data[100]; int fn1 (); void fn2 (int b, int n) { int i; for (i = 0; i < n; i++, b++) { data[b + 10] = 1; fn1 (); data[b + 3] = 2; } } Analysis into ivopts shows that those address IVs have the same in-loop cost, and the IV w/o offset does have a smaller pre-loop setup cost. But since the setup cost is averaged over each iteration, the minor cost difference disappears due to round-off by integer division. To fix this round-off error, the cost can be represented in a more accurate way, such as adding a fraction part to make it a fixed-point number.
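A hedged illustration of the fixed-point idea at the end of the report; COST_SCALE and the helper function are invented for the example and are not existing ivopts code:

  /* Average a setup cost over the iteration count while keeping 4 fractional
     bits, so a small setup-cost difference is not lost to integer division.  */
  #define COST_SCALE 16   /* hypothetical fixed-point scale (2^4) */

  static unsigned
  averaged_setup_cost (unsigned setup_cost, unsigned avg_niters)
  {
    return setup_cost * COST_SCALE / avg_niters;   /* result in 1/16 units */
  }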
[Bug tree-optimization/93004] New: Suspicious code in tree-ssa-loop-ivopts.c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93004 Bug ID: 93004 Summary: Suspicious code in tree-ssa-loop-ivopts.c Product: gcc Version: unknown Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: fxue at os dot amperecomputing.com Target Milestone: --- In function determine_group_iv_cost_address(), /* Uses in a group can share setup code, so only add setup cost once. */ cost -= cost.scratch; /* Compute and add costs for rest uses of this group. */ for (i = 1; i < group->vuses.length () && !sum_cost.infinite_cost_p (); i++) { struct iv_use *next = group->vuses[i]; /* TODO: We could skip computing cost for sub iv_use when it has the same cost as the first iv_use, but the cost really depends on the offset and where the iv_use is. */ cost = get_computation_cost (data, next, cand, true, NULL, &can_autoinc, &inv_expr); if (inv_expr) { if (!inv_exprs) inv_exprs = BITMAP_ALLOC (NULL); bitmap_set_bit (inv_exprs, inv_expr->id); } sum_cost += cost; } The change to "cost" ahead of the loop is meaningless, since "cost" will be refreshed inside the loop. According to the comment, I guess "cost -= cost.scratch" should be placed after "cost = get_computation_cost()".
[Bug tree-optimization/92862] New: Suspicious codes in tree-ssa-loop-niter.c and predict.c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92862 Bug ID: 92862 Summary: Suspicious codes in tree-ssa-loop-niter.c and predict.c Product: gcc Version: unknown Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: fxue at os dot amperecomputing.com Target Milestone: --- Function loop_only_exit_p() in tree-ssa-loop-niter.c: for (i = 0; i < loop->num_nodes; i++) { for (bsi = gsi_start_bb (body[i]); !gsi_end_p (bsi); gsi_next (&bsi)) if (stmt_can_terminate_bb_p (gsi_stmt (bsi))) { return true; } } Shouldn't we return false here, not true? Function predict_paths_leading_to_edge() in predict.c: FOR_EACH_EDGE (e2, ei, bb->succs) if (e2->dest != e->src && e2->dest != e->dest && !unlikely_executed_edge_p (e) && !dominated_by_p (CDI_POST_DOMINATORS, e->src, e2->dest)) { has_nonloop_edge = true; break; } "e" is invariant in this loop; I guess unlikely_executed_edge_p (e) should be unlikely_executed_edge_p (e2), which is more reasonable. And we can find similar code in predict_paths_for_bb () that uses the latter.
[Bug tree-optimization/92839] New: Normalize memory address to same base in non-loop code
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92839 Bug ID: 92839 Summary: Normalize memory address to same base in non-loop code Product: gcc Version: unknown Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: fxue at os dot amperecomputing.com Target Milestone: --- If possible, IVOPTs can transform memory accesses to make them use the same base, which can decrease register pressure in a loop. But it does not handle a memory address that is not an IV, or one in pure non-loop code. Consider the following example, compiled with -O3: we can find that the data[b + ?] accesses will use the same base register. But if we remove the line "for (i = 0; i < n; i++, b++)" to make it non-loop code, the data[b + ?] accesses will use different bases, with the result that we have to consume 10 registers as bases, which is much higher register pressure. And since the live ranges of these base registers cross a function call, most of them will be spilled to memory. int data[100]; int fn(); void foo(int b, int n) { int i; for (i = 0; i < n; i++, b++) // remove the statement to be non-loop { data[b + 1] += 1; data[b + 3] += 3; data[b + 5] += 5; data[b + 7] += 7; data[b + 9] += 9; data[b + 11] += 11; data[b + 13] += 13; data[b + 15] += 15; data[b + 17] += 17; data[b + 19] += 19; fn (); data[b + 1] -= 1; data[b + 3] -= 3; data[b + 5] -= 5; data[b + 7] -= 7; data[b + 9] -= 9; data[b + 11] -= 11; data[b + 15] -= 13; data[b + 17] -= 15; data[b + 19] -= 19; } }
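A hedged sketch of the source-level effect being asked for (the compiler would do the equivalent rewrite on GIMPLE; "base" and the function name are invented): express all the accesses against one shared base so that only a single base register stays live across the call.

  void foo_normalized (int b, int n)
  {
    int *base = &data[b];          /* one shared base register */
    base[1] += 1;
    base[3] += 3;
    /* ... the remaining accesses as fixed offsets from "base" ... */
    fn ();
    base[1] -= 1;
    /* ... */
  }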
[Bug ipa/92133] Support multi versioning on self recursive function
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92133 --- Comment #11 from Feng Xue --- --param ipa-cp-eval-threshold=1 --param ipcp-unit-growth=80 is enough.
[Bug ipa/92133] Support multi versioning on self recursive function
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92133 --- Comment #9 from Feng Xue --- Ok. For any followups on this, I'll create new tracker.
[Bug tree-optimization/92689] Improve stmt_may_clobber_ref_p_1 on constant memory reference
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92689 --- Comment #6 from Feng Xue --- Good case. I did miss something: a const pointer does not imply it is restrict, and even for real const data, we can create a non-const pointer alias to it by using an explicit type cast.
[Bug tree-optimization/92689] Improve stmt_may_clobber_ref_p_1 on constant memory reference
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92689 --- Comment #2 from Feng Xue --- int fn(); int goo(const int *p) { return fn(); } int data; int foo (const int *p) { int i; int t; data = *p; for (i = 0; i < 100; i++) { int t = *p + 1; goo (); } return *p; } Compile the above with -O3 -fno-early-inlining -fno-inline -fno-partial-inlining, and add a breakpoint at stmt_may_clobber_ref_p_1(), you will find the call stmt_may_clobber_ref_p_1 (stmt/* data = *p; */, *p) returns true.
[Bug tree-optimization/92689] New: Improve stmt_may_clobber_ref_p_1 on constant memory reference
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92689 Bug ID: 92689 Summary: Improve stmt_may_clobber_ref_p_1 on constant memory reference Product: gcc Version: unknown Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: fxue at os dot amperecomputing.com Target Milestone: --- If an ao_ref represents constant data, stmt_may_clobber_ref_p_1() and call_may_clobber_ref_p() can return false. But I am not sure whether there is any subtlety behind this that I have missed.
[Bug ipa/91682] IPA-cp can not propagate value for by-ref argument in form of *arg = param op constant
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91682 Feng Xue changed: What|Removed |Added Status|UNCONFIRMED |RESOLVED Resolution|--- |FIXED --- Comment #2 from Feng Xue --- Fixed and closed.
[Bug ipa/91088] IPA-cp cost evaluation is too conservative for "if (f(param) cmp const_val)" condition
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91088 Feng Xue changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED --- Comment #5 from Feng Xue --- Fixed and closed.
[Bug ipa/91089] IPA-cp does not setup proper cost model for switch default case in function versioning
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91089 Feng Xue changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED --- Comment #4 from Feng Xue --- Fixed and closed.
[Bug ipa/92133] Support multi versioning on self recursive function
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92133 --- Comment #5 from Feng Xue --- (In reply to Thomas Koenig from comment #4) > Author: tkoenig > Date: Sun Nov 3 22:33:53 2019 > New Revision: 277760 > > URL: https://gcc.gnu.org/viewcvs?rev=277760=gcc=rev > Log: > 2019-11-03 Thomas Koenig > > PR fortran/92133 > * trans-decl.c (gfc_get_symbol_decl): If __def_init actually > contains a value, put it into the read-only section. > > > Modified: > trunk/gcc/fortran/ChangeLog > trunk/gcc/fortran/trans-decl.c Your patch's tracker number should be 92113, not 92133.
[Bug ipa/92133] Support multi versioning on self recursive function
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92133 --- Comment #2 from Feng Xue --- (In reply to Martin Liška from comment #1) > Let me take a look. I've created a patch (https://gcc.gnu.org/ml/gcc-patches/2019-10/msg01260.html), could you take some time to review it?
[Bug ipa/92133] New: Support multi versioning on self recursive function
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92133 Bug ID: 92133 Summary: Support multi versioning on self recursive function Product: gcc Version: unknown Status: UNCONFIRMED Severity: normal Priority: P3 Component: ipa Assignee: unassigned at gcc dot gnu.org Reporter: fxue at os dot amperecomputing.com CC: marxin at gcc dot gnu.org Target Milestone: --- For a recursive function, IPA does not allow constant propagation on a parameter that is used to control the recursion. It is common that such a parameter is incremented/decremented by (or otherwise arithmetically combined with) a constant before entering the next recursion. The following example gives the general source code pattern. int recur_fn (int i) { if (i == 6) { do_work (); return 0; } do_prepare (); recur_fn (i + 1); do_post (); return 0; } int foo() { ... recur_fn (1); ... } A straightforward optimization is to duplicate recur_fn () 6 times, which constitutes a calling chain of the specialized copies: recur_fn() -> recur_fn() -> ... -> recur_fn()
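A hedged sketch of what the requested multi-versioning would produce for the example (the clone names are invented; real IPA-CP clones would be named differently):

  /* Clones for i == 6 and i == 5; the clones for i == 1 .. 4 are analogous,
     and foo () would then call the i == 1 clone directly.  */
  int recur_fn_i6 (void) { do_work (); return 0; }
  int recur_fn_i5 (void) { do_prepare (); recur_fn_i6 (); do_post (); return 0; }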
[Bug ipa/91682] New: IPA-cp can not propagate value for by-ref argument in form of *arg = param op constant
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91682 Bug ID: 91682 Summary: IPA-cp can not propagate value for by-ref argument in form of *arg = param op constant Product: gcc Version: unknown Status: UNCONFIRMED Severity: normal Priority: P3 Component: ipa Assignee: unassigned at gcc dot gnu.org Reporter: fxue at os dot amperecomputing.com CC: marxin at gcc dot gnu.org Target Milestone: --- Currently, IPA only supports CP on a by-ref argument that is assigned a constant. But if the value of the argument is computed via a unary/binary operation on the caller's parameter, it fails to do that. For example: struct S { int a, b; }; void callee2 (struct S *s) { // use s->a; // use s->b; } void callee1 (int a, int *p) { struct S s; s.a = a; s.b = *p + 1; callee2 (&s); } void caller () { int a = 1; int b = 2; callee1 (a, &b); } In callee1(), we can know that the value of s.a is 1 and s.b is 3, so it is possible to propagate these values to callee2().
[Bug tree-optimization/91571] New: TBAA does not work for ao_ref created by ao_ref_init_from_ptr_and_size()
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91571 Bug ID: 91571 Summary: TBAA does not work for ao_ref created by ao_ref_init_from_ptr_and_size() Product: gcc Version: unknown Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: fxue at os dot amperecomputing.com Target Milestone: --- Given an SSA_NAME pointer, its type information is discarded by ao_ref_init_from_ptr_and_size() when building an ao_ref from the pointer, thus TBAA for this kind of ao_ref is effectively disabled. Related code snippet: else { gcc_assert (POINTER_TYPE_P (TREE_TYPE (ptr))); ref->base = build2 (MEM_REF, char_type_node, ptr, null_pointer_node); .. } . ref->ref_alias_set = 0; ref->base_alias_set = 0; If it were enabled, some optimizations relying on this could produce better results. Here is an example for IPA-CP, but it is not the only one (others include dse and strlen-opt). struct A { int a; }; struct B { int b; }; void f2 (struct A *pa, struct B *pb) { ... } void f1(int v, struct A *pa, struct B *pb) { pa->a = 1; pb->b = v; f2 (pa, pb); } The ao_ref for f2's argument "pa" is set up by ao_ref_init_from_ptr_and_size(). AA cannot disambiguate pa->a and pb->b due to the loss of the original type information, so IPA-CP reaches the conservative conclusion that the constant "1" assigned to pa->a might not reach f2(pa, pb), and thus misses a chance for constant propagation. I checked the uses of ao_ref_init_from_ptr_and_size(); it is mainly used in alias analysis of function call arguments. It would be better to enable TBAA in this scenario, but I am not sure whether there is any special consideration behind the current behavior.
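A hedged sketch of the direction the report hints at (illustrative only; a real change would have to establish when the pointer's static type can be trusted for TBAA):

  /* Instead of forcing the alias sets to 0 ("alias everything"), derive them
     from the pointed-to type of PTR.  */
  ref->ref_alias_set = get_alias_set (TREE_TYPE (TREE_TYPE (ptr)));
  ref->base_alias_set = ref->ref_alias_set;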
[Bug ipa/91508] New: Segfault due to referencing removed cgraph_node
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91508 Bug ID: 91508 Summary: Segfault due to referencing removed cgraph_node Product: gcc Version: unknown Status: UNCONFIRMED Severity: normal Priority: P3 Component: ipa Assignee: unassigned at gcc dot gnu.org Reporter: fxue at os dot amperecomputing.com CC: marxin at gcc dot gnu.org Target Milestone: --- Caught a segfault when compiling pr63766.C with a debug-build gcc. pr63766.C:48:1: internal compiler error: Segmentation fault 48 | } | ^ 0x14ace78 crash_signal ../../gcc/toplev.c:326 0x8fd7c5 tree_check(tree_node*, char const*, int, char const*, tree_code) ../../gcc/tree.h:3256 0x1342850 do_per_function_toporder(void (*)(function*, void*), void*) ../../gcc/passes.c:1703 0x1345471 execute_ipa_pass_list(opt_pass*) ../../gcc/passes.c:2919 0xe10dba ipa_passes ../../gcc/cgraphunit.c:2480 0xe1120c symbol_table::compile() ../../gcc/cgraphunit.c:2618 0xe117d2 symbol_table::finalize_compilation_unit() ../../gcc/cgraphunit.c:2868 Analysis points to do_per_function_toporder() in passes.c, in which handling a cgraph_node in the worklist might remove a later, not-yet-processed node. So it registers a removal hook to track removed nodes in a hash set, and uses the node's uid as the hash-set key. But this trick has a problem in that the node's uid becomes garbage once the node has been removed. An alternative is to use the node address as the key. /* Function could be inlined and removed as unreachable. */ if (node == NULL || removed_nodes.contains (node->get_uid ())) ^^
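A hedged sketch of the alternative keying mentioned at the end of the report (illustrative; the actual fix may differ):

  /* Key the removed-node set by node address instead of by uid, so nothing
     is read from a node after it has been removed.  */
  static hash_set<cgraph_node *> removed_nodes;

  /* ... in the walk ... */
  if (node == NULL || removed_nodes.contains (node))
    continue;

One caveat with address keying is that a freed node's address could later be reused, so the set would have to be maintained (or cleared) per walk.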
[Bug ipa/91468] New: Suspicious codes in ipa-prop.c and ipa-cp.c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91468 Bug ID: 91468 Summary: Suspicious codes in ipa-prop.c and ipa-cp.c Product: gcc Version: unknown Status: UNCONFIRMED Severity: normal Priority: P3 Component: ipa Assignee: unassigned at gcc dot gnu.org Reporter: fxue at os dot amperecomputing.com CC: marxin at gcc dot gnu.org Target Milestone: --- Some might be a bug, and some might be redundant. ipa-prop.c: In function ipcp_modif_dom_walker::before_dom_children(), vce = false; t = rhs; while (handled_component_p (t)) { /* V_C_E can do things like convert an array of integers to one bigger integer and similar things we do not handle below. */ if (TREE_CODE (rhs) == VIEW_CONVERT_EXPR) { vce = true; break; } t = TREE_OPERAND (t, 0); } if (vce) continue; Should "rhs" in "if (TREE_CODE (rhs) == VIEW_CONVERT_EXPR)" be "t"? In function update_jump_functions_after_inlining(), if (dst->type == IPA_JF_ANCESTOR) { .. if (src->type == IPA_JF_PASS_THROUGH && src->value.pass_through.operation == NOP_EXPR) { .. } else if (src->type == IPA_JF_PASS_THROUGH && TREE_CODE_CLASS (src->value.pass_through.operation) == tcc_unary) { dst->value.ancestor.formal_id = src->value.pass_through.formal_id; dst->value.ancestor.agg_preserved = false; } .. } If we suppose pass_through operation is "negate_expr" (while it is not a reasonable operation on pointer type), the code might be incorrect. It's better to specify expected unary operations here. In function compute_complex_assign_jump_func(), case GIMPLE_UNARY_RHS: if (is_gimple_assign (stmt) && gimple_assign_rhs_class (stmt) == GIMPLE_UNARY_RHS && ! CONVERT_EXPR_CODE_P (gimple_assign_rhs_code (stmt))) ipa_set_jf_unary_pass_through (jfunc, index, gimple_assign_rhs_code (stmt)); The condition "is_gimple_assign (stmt) && gimple_assign_rhs_class (stmt) == GIMPLE_UNARY_RHS" seems to be redundant, might be omit. ipa-cp.c: In function merge_agg_lats_step(), if (**aglat && (**aglat)->offset == offset) { if ((**aglat)->size != val_size || ((**aglat)->next && (**aglat)->next->offset < offset + val_size)) { set_agg_lats_to_bottom (dest_plats); return false; } gcc_checking_assert (!(**aglat)->next || (**aglat)->next->offset >= offset + val_size); return true; } The condition "|| ((**aglat)->next && (**aglat)->next->offset < offset + val_size))" seems to be always false, because the next item should not be overlapped with its prev, this is what merge_agg_lats_step() tries to ensure.
[Bug ipa/91089] IPA-cp does not setup proper cost model for switch default case in function versioning
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91089 --- Comment #2 from Feng Xue --- I've already created a patch under review. Please give some comments. Here it is: https://gcc.gnu.org/ml/gcc-patches/2019-07/msg00937.html
[Bug ipa/91088] IPA-cp cost evaluation is too conservative for "if (f(param) cmp const_val)" condition
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91088 --- Comment #3 from Feng Xue --- I've already created a patch under review. Please give some comments. Here it is: https://gcc.gnu.org/ml/gcc-patches/2019-07/msg00959.html
[Bug ipa/91194] A suspicious condition in recursive_inlining
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91194 Feng Xue changed: What|Removed |Added Status|NEW |RESOLVED Resolution|--- |FIXED --- Comment #4 from Feng Xue --- Fixed
[Bug ipa/91194] New: A suspicious condition in recursive_inlining
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91194 Bug ID: 91194 Summary: A suspicious condition in recursive_inlining Product: gcc Version: unknown Status: UNCONFIRMED Severity: normal Priority: P3 Component: ipa Assignee: unassigned at gcc dot gnu.org Reporter: fxue at os dot amperecomputing.com CC: marxin at gcc dot gnu.org Target Milestone: --- A piece of code in recursive_inlining()/ipa-inline.c: while (!heap.empty ()) { struct cgraph_edge *curr = heap.extract_min (); struct cgraph_node *cnode, *dest = curr->callee; if (!can_inline_edge_p (curr, true) || can_inline_edge_by_limits_p (curr, true)) continue; The second condition does not seem to be logical; should it be "!can_inline_edge_by_limits_p (curr, true)"?
[Bug tree-optimization/91090] New: A suspicious code in tree-ssa-dom.c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91090 Bug ID: 91090 Summary: A suspicious code in tree-ssa-dom.c Product: gcc Version: unknown Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: fxue at os dot amperecomputing.com Target Milestone: --- Found this code snippet in simplify_stmt_for_jump_threading (): if (vr->kind () == VR_RANGE) { size_t i, j; find_case_label_range (switch_stmt, vr->min (), vr->max (), &i, &j); if (i == j) { tree label = gimple_switch_label (switch_stmt, i); tree singleton; if (CASE_HIGH (label) != NULL_TREE ? (tree_int_cst_compare (CASE_LOW (label), vr->min ()) <= 0 && tree_int_cst_compare (CASE_HIGH (label), vr->max ()) >= 0) : (vr->singleton_p (&singleton) && tree_int_cst_equal (CASE_LOW (label), singleton))) return label; if (i > j) return gimple_switch_label (switch_stmt, 0); } } Shouldn't the conditional "if (i > j)" be outside of "if (i == j)"?
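For clarity, the restructuring the question implies would look roughly like this (a sketch of intent, not a tested patch):

  if (i == j)
    {
      /* ... single-label handling as in the snippet above ... */
    }
  else if (i > j)
    /* No case label intersects the range, so the default label is taken.  */
    return gimple_switch_label (switch_stmt, 0);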
[Bug ipa/91089] New: IPA-cp does not setup proper cost model for switch default case in function versioning
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91089 Bug ID: 91089 Summary: IPA-cp does not setup proper cost model for switch default case in function versioning Product: gcc Version: unknown Status: UNCONFIRMED Severity: normal Priority: P3 Component: ipa Assignee: unassigned at gcc dot gnu.org Reporter: fxue at os dot amperecomputing.com CC: marxin at gcc dot gnu.org Target Milestone: --- IPA-cp always adds the execution cost of the switch default case into the cost evaluation for the other switch cases, which might miss opportunities for function versioning when only a certain case is proved to be executed after constant propagation. The following is an example, compiled with "-O3 -fno-inline". It is reasonable to make a function version of callee for "v=1", but actually that does not happen. And if we replace "default" with "case 4", it clones the function as we expect. In some situations, we can deduce a relatively small execution predicate for the default case. If all switch cases are in a contiguous value range, we can use a comparison against the range's low and high bounds to identify the default switch case. Additionally, range analysis can also give us more information to simplify the predicate. int foo(); int callee(int v) { int i = 0; switch (v) { case 1: i = 3; break; case 2: i = 9; break; default: foo(); foo(); foo(); foo(); foo(); foo(); foo(); foo(); foo(); } return i; } int caller() { return callee (1); }
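To make the proposal concrete for the example: the non-default cases cover the contiguous range [1, 2], so the default label could be guarded by the predicate (v < 1 || v > 2). With the propagated constant v == 1 that predicate folds to false, the heavy default block is known to be dead, and the cost model would then see a clearly profitable clone. This is a reading of the report, not a description of existing ipa-cp code.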
[Bug ipa/91088] New: IPA-cp cost evaluation is too conservative for "if (f(param) cmp const_val)" condition
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91088 Bug ID: 91088 Summary: IPA-cp cost evaluation is too conservative for "if (f(param) cmp const_val)" condition Product: gcc Version: unknown Status: UNCONFIRMED Severity: normal Priority: P3 Component: ipa Assignee: unassigned at gcc dot gnu.org Reporter: fxue at os dot amperecomputing.com CC: marxin at gcc dot gnu.org Target Milestone: --- Currently, IPA-cp only detects a conditional statement like "if (param cmp const_val)", and gives it a reasonable cost evaluation used in function versioning. But beyond this form, any generalized form f(param), i.e. a function of the parameter involving extra operations and constant values, is treated conservatively and has no use in the cost computation. The following is an example, compiled with "-O3 -fno-inline". The value "5" is propagated into callee(); for "if (v < 3)", ipa-cp can deduce that the "true branch" will not be executed, and the benefit of function versioning is large. But for "if (v * 2 < 6)", ipa-cp cannot handle that, and conservatively takes the cost of the "true branch" into account, so it reaches a "no need to clone the function" conclusion. int foo(); int callee(int v) { // if (v < 3) // will make a constprop copy with v = 5 // if (v * 2 < 6) // will not make a constprop copy { foo(); foo(); foo(); foo(); foo(); foo(); foo(); foo(); foo(); } else return 1; } int caller() { return callee (5); }
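Spelling out the second case: with the propagated constant v == 5, the condition v * 2 < 6 becomes 10 < 6, which is false, so the large "true branch" is dead exactly as in the v < 3 form. The point of the report is that the cost evaluation should be able to fold such a simple function of the parameter (here f(v) = v * 2 compared against 6) instead of giving up.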
[Bug ipa/90401] Missed propagation of by-ref constant argument to callee function
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90401 Feng Xue changed: What|Removed |Added Status|UNCONFIRMED |RESOLVED Resolution|--- |FIXED --- Comment #4 from Feng Xue --- Fixed
[Bug tree-optimization/89713] Optimize away an empty loop whose finiteness can not be analytically determined
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89713 Feng Xue changed: What|Removed |Added Status|UNCONFIRMED |RESOLVED Resolution|--- |FIXED --- Comment #6 from Feng Xue --- Fixed
[Bug ipa/90401] Missed propagation of by-ref constant argument to callee function
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90401 --- Comment #2 from Feng Xue --- (In reply to Richard Biener from comment #1) > Huh. IPA-CP dump difference: > > @@ -26,6 +26,8 @@ > Unknown VR > callsite int caller(int, int&)/2 -> int callee(int&)/1 : > param 0: UNKNOWN > + Aggregate passed by reference: > + offset: 0, cst: 1 > value: 0x0, mask: 0xfffc > VR ~[0, 0] >Jump functions of caller int callee(int&)/1: This jump function is for the 2nd call. For the 1st call, IPA-CP cannot deduce that. So there is not enough benefit to convince IPA-CP to clone "callee()". > > I guess somehow IPA-CP walks stmts for defs instead of virtual operands? Yes, it is. IPA-CP only walks stmts in the callsite's basic block.
[Bug ipa/90401] New: Missed propagation of by-ref constant argument to callee function
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90401 Bug ID: 90401 Summary: Missed propagation of by-ref constant argument to callee function Product: gcc Version: unknown Status: UNCONFIRMED Severity: normal Priority: P3 Component: ipa Assignee: unassigned at gcc dot gnu.org Reporter: fxue at os dot amperecomputing.com CC: marxin at gcc dot gnu.org Target Milestone: --- Created attachment 46318 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=46318=edit The case where ipa-cp fails on a by-ref constant argument IPA-cp cannot identify a constant by-ref argument to a function if the value definition of the argument is not in the same basic block as the callsite. In Fortran, it is very common that a constant function argument is placed in a local variable instead of being a literal constant. The following is a C++ case; the compile options are "-O3 -fno-inline", where "-fno-inline" is used to give ipa-cp a chance on the small function. If we move the statement "int a = 1" from inside the conditional block to the beginning of the function, ipa-cp fails to do function versioning on "callee()". int data; int callee(int &v) { if (v < 2) { return 0; } else { data = v; return 1; } } int caller(int c, int &r) { // int a = 1; <. /* ipa-cp on callee(), failed */ | if (c) { | // int a = 1; -' /* ipa-cp on callee(), ok */ return callee(a); } else { r = 1; return callee(r); } }
[Bug rtl-optimization/90174] Bad register spill due to top-down allocation order
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90174 --- Comment #9 from Feng Xue --- (In reply to Feng Xue from comment #5) > > I would say that top-down algorithm behaves better than bottom-up one. I > > implemented Callahan-Koblentz (bottom-up algorithm) in GCC about 10 years > > ago and it behaved worse than the current one. I think adding an additional > Intuitively, it's better to give preference to regions with higher frequency > to let them pick up registers earlier and with more freedom, which matches > with bottom-up coloring strategy. > > > pass over regions probably would solve the problem but it will complicate > > already complicated RA (which is about 60K lines now). In any case I'll > > think about this problem. > > > > The problem you are mentioning in this code is quite insignificant (one > > memory access in the top most region). > But it will be significant inside multi-level loop. Actually, the problem is > exposed by a 8-level loop in a real Fortran application. > > > > > I also checked the attachments. What I see is GCC generates a better big > > loop body (a loop with high register pressure). GCC generated loop body has > > 50 x86-64 insns with 22 memory accesses vs 56 with 26 memory access for > > LLVM. > As far as how to spilling live range at loop boundary is concerned, gcc is > not very efficient. To make loop body not be the focus, when we remove two > live-range variables in original case, we can get same amount of memory > accesses for gcc and llvm, both of which generates 9 register spills in loop > body. I attached modified case and assembly files. For the modified case, allocation of llvm and gcc remain nearly the same for loop body, but as to 'lv0', llvm gets better spill/split result over gcc.
[Bug rtl-optimization/90168] context-sensitive local register allocation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90168 --- Comment #4 from Feng Xue --- (In reply to Andrew Pinski from comment #3) > >or to use float type to hold frequency? > > This won't work correctly as floating point is different between hosts. > There has been some usage of floating point inside of GCC which was removed > because of that issue. I thought that was documented somewhere too. How about adjusting REG_FREQ_MAX to be the same as BB_FREQ_MAX? Currently REG_FREQ_MAX/BB_FREQ_MAX is 1/10.
[Bug rtl-optimization/90174] Bad register spill due to top-down allocation order
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90174 --- Comment #8 from Feng Xue --- Created attachment 46294 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=46294=edit asm file 2 generated by llvm
[Bug rtl-optimization/90174] Bad register spill due to top-down allocation order
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90174 --- Comment #7 from Feng Xue --- Created attachment 46293 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=46293=edit asm file 2 generated by gcc
[Bug rtl-optimization/90174] Bad register spill due to top-down allocation order
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90174 --- Comment #6 from Feng Xue --- Created attachment 46292 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=46292=edit test case with less live ranges
[Bug rtl-optimization/90174] Bad register spill due to top-down allocation order
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90174 --- Comment #5 from Feng Xue --- > I would say that top-down algorithm behaves better than bottom-up one. I > implemented Callahan-Koblentz (bottom-up algorithm) in GCC about 10 years > ago and it behaved worse than the current one. I think adding an additional Intuitively, it's better to give preference to regions with higher frequency, letting them pick up registers earlier and with more freedom, which matches the bottom-up coloring strategy. > pass over regions probably would solve the problem but it will complicate > already complicated RA (which is about 60K lines now). In any case I'll > think about this problem. > > The problem you are mentioning in this code is quite insignificant (one > memory access in the top most region). But it will be significant inside a multi-level loop. Actually, the problem was exposed by an 8-level loop in a real Fortran application. > > I also checked the attachments. What I see is GCC generates a better big > loop body (a loop with high register pressure). GCC generated loop body has > 50 x86-64 insns with 22 memory accesses vs 56 with 26 memory access for LLVM. As far as spilling live ranges at loop boundaries is concerned, gcc is not very efficient. To take the loop body out of focus, if we remove two live-range variables from the original case, we get the same number of memory accesses for gcc and llvm, both of which generate 9 register spills in the loop body. I attached the modified case and assembly files.
[Bug rtl-optimization/90174] Bad register spill due to top-down allocation order
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90174 --- Comment #3 from Feng Xue --- Created attachment 46237 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=46237=edit test case for aarch64 Add another case composed for aarch64.