Richard Guenther wrote: > On Tue, Feb 28, 2012 at 4:10 PM, Ulrich Weigand <uweig...@de.ibm.com> wrote: > > I'll still need to do proper testing and benchmarking, but I thought > > I'd post the patch anyway just as a heads-up ... > > > > Does this look reasonable to you? > > Yes, that looks reasonable. Though instead of unsigned_type_for > please use build_nonstandard_integer_type instead (if the type > is not already unsigned). unsigned_type_for does not necessarily > preserve type-precision and may even return NULL.

## Advertising

OK, I've changed the patch to use build_nonstandard_integer_type, and it still fixes the original problem. However, on further examination it turns out that this "fix" is not because the expression is actually simplified at tree level. Rather, it is just that because of some minor reassociation (the +1 is moved to the end), the *RTL* optimizers now see a (just as complex but) slightly different RTL sequence, and therefore combine now just happens to be able to fold everything together (using the RTL-level simplify_plus_minus machinery). Even worse, the patch causes a number of regressions: > FAIL: gcc.dg/tree-ssa/loop-15.c scan-tree-dump-times optimized "\\+" 0 > FAIL: gcc.dg/tree-ssa/loop-15.c scan-tree-dump-times optimized "n_. \\* n_." 1 > FAIL: gcc.dg/vect/no-scevccp-noreassoc-outer-4.c scan-tree-dump-times vect > "OUTER LOOP VECTORIZED." 1 The loop-15.c problem is probably just a missed optimization elsewhere that can be fixed. The problem is that a loop-end value used to be computed as "(n - 1) * n + n" which got simplified via fold_plusminus_mult_expr to "n * n". When computing in unsigned, however, the expression becomes something along the lines of "[...] * (unsigned) n + (unsigned) n" and the top-level of fold_binary_loc removes the *outermost* NOP_EXPRs resulting in "[...] * (unsigned) n + n" which fold_plusminus_mult_expr no longer recognizes as something it can handle (since "(unsigned) n" is not operand_equal_p to "n"). I think this can be fixed by having fold_plusminus_mult_expr call STRIP_NOPS on sub-operands, as is already done elsewhere in fold_binary_loc. This basically fixes the test case (due to some remaining casts, the scan-tree-dump patterns still don't match as-is, but the expected code is again generated). The no-scevccp-noreassoc-outer-4.c problem is more difficult, and would seem to point to a fundamental problem with performing the computation in unsigned mode. In this case, the end value of an inner loop forms part of a scalar evolution of the outer loop. When computing the end value in unsigned type, that scalar evolution is no longer detected. In part, this is because the scalar evolution machinery might need to be improved in handling casts. In particular, in some places it only does STRIP_USELESS_TYPE_CONVERSION; changing those to STRIP_NOPS makes it detect the evolution again. But I'm not fully convinced this is a safe change ... In any case, this still doesn't make the outer loop vectorizable, since it is no longer provably finite. Without the patch, this could be proven *because* the computation of the inner loop's end value used signed arithmetic which cannot overflow. By performing the computation in unsigned, this information is in fact lost. This would seem to indicate that unconditionally using unsigned arithmetic even if not strictly necessary might not always be the best solution either. I've gone back and looked that at the original problem that (start + 1) + (int)((unsigned int) ~start + (unsigned int) end) is not simplified by fold-const. Interestingly enough, it *is* valid to simplify this expression to just plain "end", even with signed arithmetic, since the transformation does not introduce any new potential overflows. This is actually recognized in fold_binary_loc, but only one special case. See the code around the following comment: /* The only case we can still associate with two variables is if they are the same, modulo negation. */ Unfortunately, this handles only A and -A; it doesn't recognize that A+1 is in fact the negation of ~A ... There is also code in tree-ssa-forwprop.c:associate_plusminus that handles quite a number of special cases where re-association *is* allowed even for signed arithmetic: (A +- B) - A -> +- B (A +- B) -+ B -> A (CST +- A) +- CST -> CST +- A (A + CST) +- CST -> A + CST ~A + A -> -1 ~A + 1 -> -A A - (A +- B) -> -+ B A +- (B +- A) -> +- B CST +- (CST +- A) -> CST +- A CST +- (A +- CST) -> CST +- A A + ~A -> -1 This handles some cases involving ~, but still not quite the case we need for this expression. In addition, the forward propagtion logic doesn't seem to handle casts very well (as opposed to fold-const, which makes liberal use of STRIP_NOPS). I'm wondering where to go from there: - Why are those special re-association cases handled only in forwprop, and not in fold-cost? I would have expected forwprop to just propagate subexpressions into a tree and then use fold-const to actually simplify ... - Should I try to improve forwprop to handle casts and additional re- association cases until it handles the above expression? - Or should the logic in fold-const be improved to add additional cases of re-association? Thanks for any further suggestions! I've attached some of the changes mentioned above for reference. Bye, Ulrich Patch: Compute loop-end value using unsigned type. Index: gcc/tree-scalar-evolution.c =================================================================== --- gcc/tree-scalar-evolution.c (revision 184625) +++ gcc/tree-scalar-evolution.c (working copy) @@ -474,11 +474,22 @@ return chrec_dont_know; else { + tree type = TREE_TYPE (evolution_fn), utype = type; tree res; + /* Since the number of iterations is always unsigned, we perform + the computation in an unsigned type (if feasible) to allow for + improved simplification of subexpressions. */ + if ((INTEGRAL_TYPE_P (type) && !TYPE_UNSIGNED (type)) + || POINTER_TYPE_P (type)) + utype = build_nonstandard_integer_type + (TYPE_PRECISION (type), 1); + /* evolution_fn is the evolution function in LOOP. Get its value in the nb_iter-th iteration. */ - res = chrec_apply (inner_loop->num, evolution_fn, nb_iter); + res = chrec_convert (utype, evolution_fn, NULL); + res = chrec_apply (inner_loop->num, res, nb_iter); + res = chrec_convert (type, res, NULL); if (chrec_contains_symbols_defined_in_loop (res, loop->num)) res = instantiate_parameters (loop, res); Patch: Use STRIP_NOPS on subexpressions in fold_plusminus_mult_expr Index: gcc/fold-const.c =================================================================== --- gcc/fold-const.c (revision 184625) +++ gcc/fold-const.c (working copy) @@ -7081,6 +7081,8 @@ { arg00 = TREE_OPERAND (arg0, 0); arg01 = TREE_OPERAND (arg0, 1); + STRIP_NOPS (arg00); + STRIP_NOPS (arg01); } else if (TREE_CODE (arg0) == INTEGER_CST) { @@ -7099,6 +7101,8 @@ { arg10 = TREE_OPERAND (arg1, 0); arg11 = TREE_OPERAND (arg1, 1); + STRIP_NOPS (arg10); + STRIP_NOPS (arg11); } else if (TREE_CODE (arg1) == INTEGER_CST) { Patch: Use STRIP_NOPS when following scalar evolutions Index: gcc/tree-scalar-evolution.c =================================================================== --- gcc/tree-scalar-evolution.c (revision 184625) +++ gcc/tree-scalar-evolution.c (working copy) @@ -1154,8 +1165,8 @@ rhs0 = TREE_OPERAND (expr, 0); rhs1 = TREE_OPERAND (expr, 1); type = TREE_TYPE (rhs0); - STRIP_USELESS_TYPE_CONVERSION (rhs0); - STRIP_USELESS_TYPE_CONVERSION (rhs1); + STRIP_NOPS (rhs0); + STRIP_NOPS (rhs1); res = follow_ssa_edge_binary (loop, at_stmt, type, rhs0, code, rhs1, halting_phi, evolution_of_loop, limit); break; @@ -1168,8 +1179,8 @@ rhs0 = TREE_OPERAND (expr, 0); rhs1 = TREE_OPERAND (expr, 1); type = TREE_TYPE (rhs0); - STRIP_USELESS_TYPE_CONVERSION (rhs0); - STRIP_USELESS_TYPE_CONVERSION (rhs1); + STRIP_NOPS (rhs0); + STRIP_NOPS (rhs1); res = follow_ssa_edge_binary (loop, at_stmt, type, rhs0, POINTER_PLUS_EXPR, rhs1, halting_phi, evolution_of_loop, limit); -- Dr. Ulrich Weigand GNU Toolchain for Linux on System z and Cell BE ulrich.weig...@de.ibm.com