Just realized that I missed the updated patch before.  Here it is...

Thanks,
bin
On Tue, Sep 8, 2015 at 6:07 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
> On Tue, Sep 8, 2015 at 6:06 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
>> On Wed, Sep 2, 2015 at 10:12 PM, Richard Biener
>> <richard.guent...@gmail.com> wrote:
>>> On Wed, Sep 2, 2015 at 5:26 AM, Bin Cheng <bin.ch...@arm.com> wrote:
>>>> Hi,
>>>> This patch is a new approach to fixing PR66388.  IVO today computes
>>>> an iv_use with an iv_cand that has at least the same type precision
>>>> as the use.  On 64-bit platforms like AArch64, this results in a
>>>> different iv_cand being created for each address type iv_use, which
>>>> increases register pressure.  As a matter of fact, the BIV should
>>>> be used for all iv_uses in some of these cases.  It is a latent
>>>> bug, but one that has recently become worse because of the overflow
>>>> changes.
>>>>
>>>> The original approach at
>>>> https://gcc.gnu.org/ml/gcc-patches/2015-07/msg01484.html could fix
>>>> the issue, except that it conflicts with IV elimination.  It seems
>>>> to me that this contradiction is impossible to mitigate.
>>>>
>>>> This new approach fixes the issue by adding a sizetype iv_cand for
>>>> BIVs directly.  In cases where the original BIV is preferred, the
>>>> sizetype iv_cand will be chosen.  As for code generation, the
>>>> sizetype iv_cand has the same effect as the original BIV.
>>>> Actually, it's better, because on most targets the BIV needs to be
>>>> explicitly extended to sizetype before it can be used in an address
>>>> expression.
>>>>
>>>> One shortcoming of this approach is that it may introduce more iv
>>>> candidates.  To minimize the impact, this patch does sophisticated
>>>> code analysis and adds the sizetype candidate for a BIV only if the
>>>> BIV is used as an index.  Moreover, it avoids adding a candidate of
>>>> the original type if the BIV is only used as an index.  Statistics
>>>> from compiling spec2k6 show that the increase in the number of
>>>> candidates is modest and can be ignored.
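>>>>
>>>> To illustrate with a made-up example (assuming an LP64 target where
>>>> sizetype has 64-bit precision; this snippet is mine for
>>>> illustration, not from the PR or the testsuite):
>>>>
>>>>   int
>>>>   sum (int *a, int n)
>>>>   {
>>>>     int i, s = 0;
>>>>     for (i = 0; i < n; i++)
>>>>       /* The address use needs (sizetype) i, i.e. an extension of
>>>>          the 32-bit biv to 64 bits, on every access; a sizetype
>>>>          candidate derived from the biv avoids that.  */
>>>>       s += a[i];
>>>>     return s;
>>>>   }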
>>>>
>>>> There are two more patches following, fixing corner cases revealed
>>>> by this one.  Together they bring an obvious perf improvement for
>>>> spec2k6/int on aarch64.
>>>>
>>>> Spec2k6/int
>>>> 400.perlbench     3.44%
>>>> 445.gobmk        -0.86%
>>>> 456.hmmer        14.83%
>>>> 458.sjeng         2.49%
>>>> 462.libquantum   -0.79%
>>>> GEOMEAN           1.68%
>>>>
>>>> There is also about a 0.36% improvement for spec2k6/fp, mostly
>>>> because of 436.cactusADM.  I believe it can be further improved,
>>>> but that should be another patch.
>>>>
>>>> I also collected benchmark data for x86_64.  Spec2k6/fp is not
>>>> affected.  As for spec2k6/int, though the geomean is improved
>>>> slightly, 400.perlbench is regressed by ~3%.  I can see that BIVs
>>>> are chosen for some loops instead of address candidates.
>>>> Generally, the loop header will be simplified because iv
>>>> elimination with the BIV is simpler; the number of instructions in
>>>> the loop body isn't changed.  I suspect the regression comes from
>>>> different addressing modes.  With the BIV, a complex addressing
>>>> mode like [base + index << scale + disp] is used rather than
>>>> [base + disp].  I guess the former decodes into more micro-ops and
>>>> is thus more expensive.  This guess can be confirmed by manually
>>>> suppressing the complex addressing mode with a higher address cost.
>>>> Now the question becomes why the overall cost of the BIV is
>>>> computed as lower while its actual cost is higher.  I noticed that
>>>> for most affected loops the loop header is bloated because of iv
>>>> elimination using the old address candidate.  The bloated loop
>>>> header results in a much higher cost than the BIV, so the BIV is
>>>> preferred.  I also noticed that the bloated loop header can
>>>> generally be simplified (I have a follow-up patch for this).  After
>>>> applying that local patch, the old address candidate is chosen and
>>>> most of the regression is recovered.
>>>> My conclusion is that the bloated loop header is to blame for the
>>>> regression, and that it can be resolved.
>>>>
>>>> Bootstrapped and tested on x86_64 and aarch64.  It fixes the
>>>> failure of gcc.target/i386/pr49781-1.c without new breakage.
>>>>
>>>> So what do you think?
>>>
>>> The data above looks ok to me.
>>>
>>> +static struct iv *
>>> +find_deriving_biv_for_iv (struct ivopts_data *data, struct iv *iv)
>>> +{
>>> +  aff_tree aff;
>>> +  struct expand_data exp_data;
>>> +
>>> +  if (!iv->ssa_name || TREE_CODE (iv->ssa_name) != SSA_NAME)
>>> +    return iv;
>>> +
>>> +  /* Expand IV's ssa_name till the deriving biv is found.  */
>>> +  exp_data.data = data;
>>> +  exp_data.biv = NULL;
>>> +  tree_to_aff_combination_expand (iv->ssa_name, TREE_TYPE (iv->ssa_name),
>>> +                                  &aff, &data->name_expansion_cache,
>>> +                                  stop_expand, &exp_data);
>>> +  return exp_data.biv;
>>>
>>> that's actually "abusing" tree_to_aff_combination_expand for simply
>>> walking SSA uses and their defs recursively until you hit "stop".
>>> ISTR past discussion about adding a generic walk_ssa_use interface
>>> for that.  Not sure whether it materialized under a name I can't
>>> remember or whether it didn't.
>> Thanks for reviewing.  I didn't find an existing interface for
>> walking up the definition chains of ssa vars.  In this updated patch,
>> I implemented a simple function which meets the minimal requirement
>> of walking up the definition chains of BIV variables.  I also count
>> the number of no_overflow BIVs that are not used in any address type
>> use.  Since there are generally only two BIVs in a loop, this can
>> save us from visiting definition chains most of the time.  Statistics
>> show that the number of calls to find_deriving_biv_for_expr plummets.
>>
>>>
>>> -  add_candidate (data, iv->base, iv->step, true, NULL);
>>> +  /* Check if this biv is used in address type use.  */
>>> +  if (iv->no_overflow && iv->have_address_use
>>> +      && INTEGRAL_TYPE_P (TREE_TYPE (iv->base))
>>> +      && TYPE_PRECISION (TREE_TYPE (iv->base)) < TYPE_PRECISION (sizetype))
>>> +    {
>>> +      tree type = unsigned_type_for (sizetype);
>>>
>>> sizetype is unsigned.
>> Fixed.
>>
>> Bootstrap and test on x86_64, is this OK?
>>
>> Thanks,
>> bin
>
> And here is the updated ChangeLog entry.
>
>
> 2015-09-08  Bin Cheng  <bin.ch...@arm.com>
>
>         PR tree-optimization/66388
>         * tree-ssa-loop-ivopts.c (struct iv, iv_cand, ivopts_data): New
>         fields.
>         (dump_iv): Dump no_overflow information.
>         (alloc_iv): Initialize new field for struct iv.
>         (mark_bivs): Count number of no_overflow bivs.
>         (find_deriving_biv_for_expr, record_biv_for_address_use): New
>         functions.
>         (idx_find_step): Call new functions above.
>         (add_candidate_1, add_candidate): New parameter.
>         (add_iv_candidate_for_biv): Add sizetype cand for BIV.
>         (get_computation_aff): Simplify conversion of cand for BIV.
>         (get_computation_cost_at): Step cand's base if necessary.
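
As a side note for reviewers: the conversion-skipping logic added to
get_computation_aff below relies on truncation commuting with addition
when the narrow biv doesn't overflow.  A tiny standalone sketch of the
identity (my toy example, not part of the patch or the testsuite):

  #include <assert.h>

  int
  main (void)
  {
    unsigned short a = 0xfff0;
    unsigned long step = 0x20;

    /* Truncating after computing in the wide type gives the same
       result as computing in the narrow type, so the inner widening
       conversion in the candidate can be dropped.  */
    assert ((unsigned short) ((unsigned long) a + step)
            == (unsigned short) (a + (unsigned short) step));
    return 0;
  }
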
Index: gcc/tree-ssa-loop-ivopts.c
===================================================================
--- gcc/tree-ssa-loop-ivopts.c	(revision 227163)
+++ gcc/tree-ssa-loop-ivopts.c	(working copy)
@@ -147,6 +147,8 @@ struct iv
   bool biv_p;		/* Is it a biv?  */
   bool have_use_for;	/* Do we already have a use for it?  */
   bool no_overflow;	/* True if the iv doesn't overflow.  */
+  bool have_address_use;/* For biv, indicate if it's used in any address
+			   type use.  */
 };
 
 /* Per-ssa version information (induction variable descriptions, etc.).  */
@@ -251,6 +253,8 @@ struct iv_cand
 			   where it is incremented.  */
   bitmap depends_on;	/* The list of invariants that are used in step of the
 			   biv.  */
+  struct iv *orig_iv;	/* The original iv if this cand is added from biv with
+			   smaller type.  */
 };
 
 /* Loop invariant expression hashtable entry.  */
@@ -336,6 +340,9 @@ struct ivopts_data
   /* The maximum invariant id.  */
   unsigned max_inv_id;
 
+  /* Number of no_overflow BIVs which are not used in memory address.  */
+  unsigned bivs_not_used_in_addr;
+
   /* Obstack for iv structure.  */
   struct obstack iv_obstack;
 
@@ -529,6 +536,9 @@ dump_iv (FILE *file, struct iv *iv, bool dump_name
 
   if (iv->biv_p)
     fprintf (file, "  is a biv\n");
+
+  if (iv->no_overflow)
+    fprintf (file, "  iv doesn't overflow wrto loop niter\n");
 }
 
 /* Dumps information about the USE to FILE.  */
@@ -1013,6 +1023,7 @@ alloc_iv (struct ivopts_data *data, tree base, tre
   iv->use_id = 0;
   iv->ssa_name = NULL_TREE;
   iv->no_overflow = no_overflow;
+  iv->have_address_use = false;
 
   return iv;
 }
@@ -1155,6 +1166,7 @@ mark_bivs (struct ivopts_data *data)
   basic_block incr_bb;
   gphi_iterator psi;
 
+  data->bivs_not_used_in_addr = 0;
   for (psi = gsi_start_phis (loop->header); !gsi_end_p (psi); gsi_next (&psi))
     {
       phi = psi.phi ();
@@ -1183,6 +1195,10 @@ mark_bivs (struct ivopts_data *data)
 
       iv->biv_p = true;
       incr_iv->biv_p = true;
+      if (iv->no_overflow)
+	data->bivs_not_used_in_addr++;
+      if (incr_iv->no_overflow)
+	data->bivs_not_used_in_addr++;
     }
 }
 
@@ -1621,6 +1637,143 @@ expr_invariant_in_loop_p (struct loop *loop, tree
   return true;
 }
 
+/* Given expression EXPR which computes an inductive value with respect
+   to the loop recorded in DATA, this function returns the biv from
+   which EXPR is derived, by tracing definition chains of ssa variables
+   in EXPR.  */
+
+static struct iv *
+find_deriving_biv_for_expr (struct ivopts_data *data, tree expr)
+{
+  struct iv *iv;
+  unsigned i, n;
+  tree e2, e1;
+  enum tree_code code;
+  gimple stmt;
+
+  if (expr == NULL_TREE)
+    return NULL;
+
+  if (is_gimple_min_invariant (expr))
+    return NULL;
+
+  code = TREE_CODE (expr);
+  if (IS_EXPR_CODE_CLASS (TREE_CODE_CLASS (code)))
+    {
+      n = TREE_OPERAND_LENGTH (expr);
+      for (i = 0; i < n; i++)
+	{
+	  iv = find_deriving_biv_for_expr (data, TREE_OPERAND (expr, i));
+	  if (iv)
+	    return iv;
+	}
+    }
+
+  /* Stop if it's not an ssa name.  */
+  if (code != SSA_NAME)
+    return NULL;
+
+  iv = get_iv (data, expr);
+  if (!iv || integer_zerop (iv->step))
+    return NULL;
+  else if (iv->biv_p)
+    return iv;
+
+  stmt = SSA_NAME_DEF_STMT (expr);
+  if (gphi *phi = dyn_cast <gphi *> (stmt))
+    {
+      ssa_op_iter iter;
+      use_operand_p use_p;
+
+      if (virtual_operand_p (gimple_phi_result (phi)))
+	return NULL;
+
+      FOR_EACH_PHI_ARG (use_p, phi, iter, SSA_OP_USE)
+	{
+	  tree use = USE_FROM_PTR (use_p);
+	  iv = find_deriving_biv_for_expr (data, use);
+	  if (iv)
+	    return iv;
+	}
+      return NULL;
+    }
+  if (gimple_code (stmt) != GIMPLE_ASSIGN)
+    return NULL;
+
+  e1 = gimple_assign_rhs1 (stmt);
+  code = gimple_assign_rhs_code (stmt);
+  if (get_gimple_rhs_class (code) == GIMPLE_SINGLE_RHS)
+    return find_deriving_biv_for_expr (data, e1);
+
+  switch (code)
+    {
+    case MULT_EXPR:
+    case PLUS_EXPR:
+    case MINUS_EXPR:
+    case POINTER_PLUS_EXPR:
+      /* Increments and decrements by a constant are simple.  */
+      e2 = gimple_assign_rhs2 (stmt);
+      iv = find_deriving_biv_for_expr (data, e2);
+      if (iv)
+	return iv;
+
+      /* Fallthru.  */
+    CASE_CONVERT:
+      /* Casts are simple.  */
+      return find_deriving_biv_for_expr (data, e1);
+
+    default:
+      break;
+    }
+
+  return NULL;
+}
+
+/* Record that BIV, as well as its predecessor and successor bivs, is
+   used in address type uses.  */
+
+static void
+record_biv_for_address_use (struct ivopts_data *data, struct iv *biv)
+{
+  unsigned i;
+  tree type, base_1, base_2;
+  bitmap_iterator bi;
+
+  if (!biv || !biv->biv_p || integer_zerop (biv->step)
+      || biv->have_address_use || !biv->no_overflow)
+    return;
+
+  type = TREE_TYPE (biv->base);
+  if (!INTEGRAL_TYPE_P (type))
+    return;
+
+  biv->have_address_use = true;
+  data->bivs_not_used_in_addr--;
+  base_1 = fold_build2 (PLUS_EXPR, type, biv->base, biv->step);
+  EXECUTE_IF_SET_IN_BITMAP (data->relevant, 0, i, bi)
+    {
+      struct iv *iv = ver_info (data, i)->iv;
+
+      if (!iv || !iv->biv_p || integer_zerop (iv->step)
+	  || iv->have_address_use || !iv->no_overflow)
+	continue;
+
+      if (type != TREE_TYPE (iv->base)
+	  || !INTEGRAL_TYPE_P (TREE_TYPE (iv->base)))
+	continue;
+
+      if (!operand_equal_p (biv->step, iv->step, 0))
+	continue;
+
+      base_2 = fold_build2 (PLUS_EXPR, type, iv->base, iv->step);
+      if (operand_equal_p (base_1, iv->base, 0)
+	  || operand_equal_p (base_2, biv->base, 0))
+	{
+	  iv->have_address_use = true;
+	  data->bivs_not_used_in_addr--;
+	}
+    }
+}
+
 /* Cumulates the steps of indices into DATA and replaces their values
    with the initial ones.  Returns false when the value of the index
    cannot be determined.  Callback for for_each_index.  */
@@ -1711,6 +1864,13 @@ idx_find_step (tree base, tree *idx, void *data)
 
   step = fold_build2 (MULT_EXPR, sizetype, step, iv_step);
   dta->step = fold_build2 (PLUS_EXPR, sizetype, dta->step, step);
+  if (dta->ivopts_data->bivs_not_used_in_addr)
+    {
+      if (!iv->biv_p)
+	iv = find_deriving_biv_for_expr (dta->ivopts_data, iv->ssa_name);
+
+      record_biv_for_address_use (dta->ivopts_data, iv);
+    }
 
   return true;
 }
@@ -2603,7 +2763,8 @@ find_depends (tree *expr_p, int *ws ATTRIBUTE_UNUS
 static struct iv_cand *
 add_candidate_1 (struct ivopts_data *data,
 		 tree base, tree step, bool important, enum iv_position pos,
-		 struct iv_use *use, gimple incremented_at)
+		 struct iv_use *use, gimple incremented_at,
+		 struct iv *orig_iv = NULL)
 {
   unsigned i;
   struct iv_cand *cand = NULL;
@@ -2685,6 +2846,7 @@ add_candidate_1 (struct ivopts_data *data,
       else
 	cand->ainc_use = NULL;
 
+      cand->orig_iv = orig_iv;
       if (dump_file && (dump_flags & TDF_DETAILS))
 	dump_cand (dump_file, cand);
     }
@@ -2793,15 +2955,17 @@ add_autoinc_candidates (struct ivopts_data *data,
 
 static void
 add_candidate (struct ivopts_data *data,
-	       tree base, tree step, bool important, struct iv_use *use)
+	       tree base, tree step, bool important, struct iv_use *use,
+	       struct iv *orig_iv = NULL)
 {
   gcc_assert (use == NULL || use->sub_id == 0);
   if (ip_normal_pos (data->current_loop))
-    add_candidate_1 (data, base, step, important, IP_NORMAL, use, NULL);
+    add_candidate_1 (data, base, step, important,
+		     IP_NORMAL, use, NULL, orig_iv);
   if (ip_end_pos (data->current_loop)
      && allow_ip_end_pos_p (data->current_loop))
-    add_candidate_1 (data, base, step, important, IP_END, use, NULL);
+    add_candidate_1 (data, base, step, important, IP_END, use, NULL, orig_iv);
 }
 
 /* Adds standard iv candidates.  */
@@ -2836,8 +3000,23 @@ add_iv_candidate_for_biv (struct ivopts_data *data
   tree def;
   struct iv_cand *cand;
 
-  add_candidate (data, iv->base, iv->step, true, NULL);
+  /* Check if this biv is used in address type use.  */
+  if (iv->no_overflow && iv->have_address_use
+      && INTEGRAL_TYPE_P (TREE_TYPE (iv->base))
+      && TYPE_PRECISION (TREE_TYPE (iv->base)) < TYPE_PRECISION (sizetype))
+    {
+      tree base = fold_convert (sizetype, iv->base);
+      tree step = fold_convert (sizetype, iv->step);
 
+      /* Add iv cand of same precision as index part in TARGET_MEM_REF.  */
+      add_candidate (data, base, step, true, NULL, iv);
+      /* Add iv cand of the original type only if it has nonlinear use.  */
+      if (iv->have_use_for)
+	add_candidate (data, iv->base, iv->step, true, NULL);
+    }
+  else
+    add_candidate (data, iv->base, iv->step, true, NULL);
+
   /* The same, but with initial value zero.  */
   if (POINTER_TYPE_P (TREE_TYPE (iv->base)))
     add_candidate (data, size_int (0), iv->step, true, NULL);
@@ -3358,6 +3537,28 @@ get_computation_aff (struct loop *loop,
   /* If the conversion is not noop, perform it.  */
   if (TYPE_PRECISION (utype) < TYPE_PRECISION (ctype))
     {
+      if (cand->orig_iv != NULL && CONVERT_EXPR_P (cbase)
+	  && (CONVERT_EXPR_P (cstep) || TREE_CODE (cstep) == INTEGER_CST))
+	{
+	  tree inner_base, inner_step, inner_type;
+	  inner_base = TREE_OPERAND (cbase, 0);
+	  if (CONVERT_EXPR_P (cstep))
+	    inner_step = TREE_OPERAND (cstep, 0);
+	  else
+	    inner_step = cstep;
+
+	  inner_type = TREE_TYPE (inner_base);
+	  /* If the candidate is added from a biv whose type is smaller
+	     than ctype, we know both the candidate and the biv won't
+	     overflow.  In this case, it's safe to skip the conversion
+	     in the candidate.  As an example,
+	     (unsigned short)((unsigned long)A) equals (unsigned short)A
+	     if A has a type no larger than short.  */
+	  if (TYPE_PRECISION (inner_type) <= TYPE_PRECISION (uutype))
+	    {
+	      cbase = inner_base;
+	      cstep = inner_step;
+	    }
+	}
       cstep = fold_convert (uutype, cstep);
       cbase = fold_convert (uutype, cbase);
       var = fold_convert (uutype, var);
@@ -4525,6 +4726,13 @@ get_computation_cost_at (struct ivopts_data *data,
 	  (ratio, mem_mode,
 	   TYPE_ADDR_SPACE (TREE_TYPE (utype))))
 	{
+	  if (cstepi == 0 && stmt_is_after_inc)
+	    {
+	      if (POINTER_TYPE_P (ctype))
+		cbase = fold_build2 (POINTER_PLUS_EXPR, ctype, cbase, cstep);
+	      else
+		cbase = fold_build2 (PLUS_EXPR, ctype, cbase, cstep);
+	    }
 	  cbase
 	    = fold_build2 (MULT_EXPR, ctype, cbase, build_int_cst (ctype, ratio));
 	  cost = difference_cost (data,