On Thu, Aug 17, 2017 at 2:24 PM, Bin.Cheng <amker.ch...@gmail.com> wrote: > On Thu, Aug 17, 2017 at 12:35 PM, Richard Sandiford > <richard.sandif...@linaro.org> wrote: >> "Bin.Cheng" <amker.ch...@gmail.com> writes: >>> On Wed, Aug 16, 2017 at 6:50 PM, Richard Sandiford >>> <richard.sandif...@linaro.org> wrote: >>>> "Bin.Cheng" <amker.ch...@gmail.com> writes: >>>>> On Wed, Aug 16, 2017 at 5:00 PM, Richard Sandiford >>>>> <richard.sandif...@linaro.org> wrote: >>>>>> "Bin.Cheng" <amker.ch...@gmail.com> writes: >>>>>>> On Wed, Aug 16, 2017 at 2:38 PM, Richard Sandiford >>>>>>> <richard.sandif...@linaro.org> wrote: >>>>>>>> The first loop in the testcase regressed after my recent changes to >>>>>>>> dr_analyze_innermost. Previously we would treat "i" as an iv even >>>>>>>> for bb analysis and end up with: >>>>>>>> >>>>>>>> DR_BASE_ADDRESS: p or q >>>>>>>> DR_OFFSET: 0 >>>>>>>> DR_INIT: 0 or 4 >>>>>>>> DR_STEP: 16 >>>>>>>> >>>>>>>> We now always keep the step as 0 instead, so for an int "i" we'd have: >>>>>>>> >>>>>>>> DR_BASE_ADDRESS: p or q >>>>>>>> DR_OFFSET: (intptr_t) i >>>>>>>> DR_INIT: 0 or 4 >>>>>>>> DR_STEP: 0 >>>>>>>> >>>>>>>> This is also what we'd like to have for the unsigned "i", but the >>>>>>>> problem is that strip_constant_offset thinks that the "i + 1" in >>>>>>>> "(intptr_t) (i + 1)" could wrap and so doesn't peel off the "+ 1". >>>>>>>> The [i + 1] accesses therefore have a DR_OFFSET equal to the SSA >>>>>>>> name that holds "(intptr_t) (i + 1)", meaning that the accesses no >>>>>>>> longer seem to be related to the [i] ones. >>>>>>> >>>>>>> Didn't read the change in detail, so sorry if I mis-understood the >>>>>>> issue. >>>>>>> I made changes in scev to better fold type conversion by various sources >>>>>>> of information, for example, vrp, niters, undefined overflow behavior >>>>>>> etc. >>>>>>> In theory these information should be available for other >>>>>>> optimizers without >>>>>>> querying scev. For the mentioned test, vrp should compute accurate >>>>>>> range >>>>>>> information for "i" so that we can fold (intptr_t) (i + 1) it without >>>>>>> worrying >>>>>>> overflow. Note we don't do it in generic folding because >>>>>>> (intptr_t) (i) + 1 >>>>>>> could be more expensive (especially in case of (T)(i + j)), or because >>>>>>> the >>>>>>> CST part is in bigger precision after conversion. >>>>>>> But such folding is wanted in several places, e.g, IVOPTs. To provide >>>>>>> such >>>>>>> an interface, we changed tree-affine and made it do aggressive fold. I >>>>>>> am >>>>>>> curious if it's possible to use aff_tree to implement >>>>>>> strip_constant_offset >>>>>>> here since aggressive folding is wanted. After all, using additional >>>>>>> chrec >>>>>>> looks like a little heavy wrto the simple test. >>>>>> >>>>>> Yeah, using aff_tree does work here when the bounds are constant. >>>>>> It doesn't look like it works for things like: >>>>>> >>>>>> double p[1000]; >>>>>> double q[1000]; >>>>>> >>>>>> void >>>>>> f4 (unsigned int n) >>>>>> { >>>>>> for (unsigned int i = 0; i < n; i += 4) >>>>>> { >>>>>> double a = q[i] + p[i]; >>>>>> double b = q[i + 1] + p[i + 1]; >>>>>> q[i] = a; >>>>>> q[i + 1] = b; >>>>>> } >>>>>> } >>>>>> >>>>>> though, where the bounds on the global arrays guarantee that [i + 1] >>>>>> can't >>>>>> overflow, even though "n" is unconstrained. The patch as posted handles >>>>>> this case too. >>>>> BTW is this a missed optimization in value range analysis? The range >>>>> information for i should flow in a way like: array boundary -> niters >>>>> -> scev/vrp. >>>>> I think that's what niters/scev do in analysis. >>>> >>>> Yeah, maybe :-) It looks like the problem is that when SLP runs, >>>> the previous VRP pass came before loop header copying, so the (single) >>>> header has to cope with n == 0 case. Thus we get: >>> Ah, there are several passes in between vrp and pass_ch, not sure if >>> any such pass depends on vrp intensively. I would suggestion reorder >>> the two passes, or standalone VRP interface updating information for >>> loop region after header copied? This is a non-trivial issue that >>> needs to be fixed. Niters analyzer rely on >>> simplify_using_initial_conditions heavily to get the same information, >>> which in my opinion should be provided by VRP. Though this won't be >>> able to obsolete simplify_using_initial_conditions because VRP is weak >>> in symbolic range... >>> >>>> >>>> Visiting statement: >>>> i_15 = ASSERT_EXPR <i_6, i_6 < n_9(D)>; >>>> Intersecting >>>> [0, n_9(D) + 4294967295] EQUIVALENCES: { i_6 } (1 elements) >>>> and >>>> [0, 0] >>>> to >>>> [0, 0] EQUIVALENCES: { i_6 } (1 elements) >>>> Intersecting >>>> [0, 0] EQUIVALENCES: { i_6 } (1 elements) >>>> and >>>> [0, 1000] >>>> to >>>> [0, 0] EQUIVALENCES: { i_6 } (1 elements) >>>> Found new range for i_15: [0, 0] >>>> >>>> Visiting statement: >>>> _3 = i_15 + 1; >>>> Match-and-simplified i_15 + 1 to 1 >>>> Intersecting >>>> [1, 1] >>>> and >>>> [0, +INF] >>>> to >>>> [1, 1] >>>> Found new range for _3: [1, 1] >>>> >>>> (where _3 is the index we care about), followed by: >>>> >>>> Visiting statement: >>>> i_15 = ASSERT_EXPR <i_6, i_6 < n_9(D)>; >>>> Intersectings/aarch64-linux/trunk-orig/debug/gcc' >>>> [0, n_9(D) + 4294967295] EQUIVALENCES: { i_6 } (1 elements) >>>> and >>>> [0, 4] >>>> to >>>> [0, n_9(D) + 4294967295] EQUIVALENCES: { i_6 } (1 elements) >>>> Intersecting >>>> [0, n_9(D) + 4294967295] EQUIVALENCES: { i_6 } (1 elements) >>>> and >>>> [0, 1000] >>>> to >>>> [0, n_9(D) + 4294967295] EQUIVALENCES: { i_6 } (1 elements) >>>> Found new range for i_15: [0, n_9(D) + 4294967295] >>>> >>>> Visiting statement: >>>> _3 = i_15 + 1; >>>> Intersecting >>>> [0, +INF] >>>> and >>>> [0, +INF] >>>> to >>>> [0, +INF] >>>> Found new range for _3: [0, +INF] >>>> >>>> I guess in this case it would be better to intersect the i_15 ranges >>>> to [0, 1000] rather than [0, n_9(D) + 4294967295]. >>>> >>>> It does work if another VRP pass runs after CH. But even then, >>>> is it a good idea to rely on the range info being kept up-to-date >>>> all the way through to SLP? A lot happens inbetween. >>> To some extend yes. Now I understand that SCEV uses niters >>> information to prove no_overflow. Niters analysis does infer better >>> information from array boundary, while VRP fails to do that. I don't >>> worry much about gap between vrp pass and slp, it's basically the same >>> as niters. Both information are analyzed at one point and meant to be >>> used by following passes. It's also not common for vrp information >>> become invalid given we are on SSA? >> >> I'm not worried so much about vrp information becoming invalid on >> an SSA name that existed when VRP was run. It's more a question >> of what happens about SSA names that get introduced after VRP, >> e.g. by things like dom, reassoc, PRE, etc. > For induction variables in concern, these passes shouldn't > aggressively introduces new variables I think. >> >>> Now that data address is not analyzed against loop, VRP would be the >>> only information we can use unless adding back scev analysis. IMHO, >>> the patch is doing so in another way than before. It requires >>> additional chrec data structure. I remember the previous patch >>> enables more slp vectorization, is it because of "step" information >>> from scev? >> >> Do you mean that: >> >> 2017-07-03 Richard Sandiford <richard.sandif...@linaro.org> >> >> * tree-data-ref.c (dr_analyze_innermost): Replace the "nest" >> parameter with a "loop" parameter and use it instead of the >> loop containing DR_STMT. Don't check simple_iv when doing >> BB analysis. Describe the two analysis modes in the comment. >> >> enabled more SLP vectorisation in bb-slp-pr65935.c? That was due >> to us not doing IV analysis for BB vectorisation, and ensuring that >> the step was always zero. > Which means vectorizer code can handle not IV-analyzed offset, but > can't for analyzed form? >> >>> In this patch, step information is simply discarded. I am >>> wondering if possible to always analyze scev within innermost loop for >>> slp while discards step information. >> >> Well, we don't calculate a step for bb analysis (i.e. it's not the case >> the patch calculates step information and then simply discards it). >> I don't see how that would work. For bb analysis, the DR_OFFSET + DR_INIT >> has to give the offset for every execution of the block, not just the >> first iteration of the containing loop. So if we get back a nonzero >> step, we have to do something with it. > Yeah. >> >> But: >> >> (a) the old simple_iv analysis is more expensive than simply calling >> analyze_scev, so I don't think this is a win in terms of complexity. >> >> (b) for bb analysis, there's nothing particularly special about the >> innermost loop. It makes more sense to analyse it in the innermost >> loop for which the offset is invariant, as shown by the second >> testcase in the patch. >> >> (c) The patch helps with loop vectorisation too, since analysing the >> starting DR_OFFSET in the context of the containing loop can help >> in a similar way as analysing the full offset does for SLP. > > I have to admit I am not very much into this method. It complicates > structure as well as code. > Mostly because now dr_init are split into two different fields and one > of it is lazily computed. > > For example: >> @@ -2974,12 +2974,12 @@ vect_vfa_segment_size (struct data_refer >> vect_no_alias_p (struct data_reference *a, struct data_reference *b, >> tree segment_length_a, tree segment_length_b) >> { >> - gcc_assert (TREE_CODE (DR_INIT (a)) == INTEGER_CST >> - && TREE_CODE (DR_INIT (b)) == INTEGER_CST); >> - if (tree_int_cst_equal (DR_INIT (a), DR_INIT (b))) >> + gcc_assert (TREE_CODE (DR_CHREC_INIT (a)) == INTEGER_CST >> + && TREE_CODE (DR_CHREC_INIT (b)) == INTEGER_CST); >> + if (tree_int_cst_equal (DR_CHREC_INIT (a), DR_CHREC_INIT (b))) >> return false; >> >> - tree seg_a_min = DR_INIT (a); >> + tree seg_a_min = DR_CHREC_INIT (a); >> tree seg_a_max = fold_build2 (PLUS_EXPR, TREE_TYPE (seg_a_min), >> seg_a_min, segment_length_a); >> /* For negative step, we need to adjust address range by TYPE_SIZE_UNIT >> @@ -2990,10 +2990,10 @@ vect_no_alias_p (struct data_reference * >> tree unit_size = TYPE_SIZE_UNIT (TREE_TYPE (DR_REF (a))); >> seg_a_min = fold_build2 (PLUS_EXPR, TREE_TYPE (seg_a_max), >> seg_a_max, unit_size); >> - seg_a_max = fold_build2 (PLUS_EXPR, TREE_TYPE (DR_INIT (a)), >> - DR_INIT (a), unit_size); >> + seg_a_max = fold_build2 (PLUS_EXPR, TREE_TYPE (DR_CHREC_INIT (a)), >> + DR_CHREC_INIT (a), unit_size); >> } >> - tree seg_b_min = DR_INIT (b); >> + tree seg_b_min = DR_CHREC_INIT (b); >> tree seg_b_max = fold_build2 (PLUS_EXPR, TREE_TYPE (seg_b_min), >> seg_b_min, segment_length_b); >> if (tree_int_cst_compare (DR_STEP (b), size_zero_node) < 0) > > Use of DR_INIT is simply replaced by DR_CHREC_INIT. Is it safe to do > so in case of non-ZERO > DR_INIT? It worries me that I may need to think twice before > referring to DR_INIT because it's > not clear when DR_OFFSET is split and DR_CHREC_INIT becomes non-ZERO. > It may simply > because I am too dumb to handle this. I will leave this to richi.
I'm trying to understand this a bit (not liking it very much in its current form). Can code currently using DR_OFFSET and DR_INIT simply use DR_CHREC_INIT and CHREC_LEFT (DR_CHREC_OFFSET) (or DR_CHREC_OFFSET if DR_CHREC_OFFSET is not a CHREC)? If yes, would there be any downside in doing that? If not, then why? That said, I'm all for making DR info more powerful. But I detest having extra fields and adding confusion as to when to use which. Thus if we can make DR_CHREC_INIT be DR_INIT and DR_OFFSET an inline function accessing CHREC_LEFT if suitable plus exposing DR_OFFSET_CHREC that would make me more happy. One downside might be that the scalar evolution of the offset might pull in SSA vars into "DR_OFFSET" that are "far away" and thus less optimal for code-generation when one re-constructs a ref by adding the components? Thanks, Richard. > Thanks, > bin >> >> Thanks, >> Richard >> >>> >>> Thanks, >>> bin >>>> >>>> FWIW, the old simple_iv check that I removed for bb data-ref analysis >>>> relies on SCEV analysis too, so I don't think this is more expensive >>>> than what we had before. >>>> >>>> Thanks, >>>> Richard