Re: PR81635: Use chrecs to help find related data refs

Richard Biener Fri, 18 Aug 2017 03:35:52 -0700

On Thu, Aug 17, 2017 at 2:24 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
> On Thu, Aug 17, 2017 at 12:35 PM, Richard Sandiford
> <richard.sandif...@linaro.org> wrote:
>> "Bin.Cheng" <amker.ch...@gmail.com> writes:
>>> On Wed, Aug 16, 2017 at 6:50 PM, Richard Sandiford
>>> <richard.sandif...@linaro.org> wrote:
>>>> "Bin.Cheng" <amker.ch...@gmail.com> writes:
>>>>> On Wed, Aug 16, 2017 at 5:00 PM, Richard Sandiford
>>>>> <richard.sandif...@linaro.org> wrote:
>>>>>> "Bin.Cheng" <amker.ch...@gmail.com> writes:
>>>>>>> On Wed, Aug 16, 2017 at 2:38 PM, Richard Sandiford
>>>>>>> <richard.sandif...@linaro.org> wrote:
>>>>>>>> The first loop in the testcase regressed after my recent changes to
>>>>>>>> dr_analyze_innermost.  Previously we would treat "i" as an iv even
>>>>>>>> for bb analysis and end up with:
>>>>>>>>
>>>>>>>>    DR_BASE_ADDRESS: p or q
>>>>>>>>    DR_OFFSET: 0
>>>>>>>>    DR_INIT: 0 or 4
>>>>>>>>    DR_STEP: 16
>>>>>>>>
>>>>>>>> We now always keep the step as 0 instead, so for an int "i" we'd have:
>>>>>>>>
>>>>>>>>    DR_BASE_ADDRESS: p or q
>>>>>>>>    DR_OFFSET: (intptr_t) i
>>>>>>>>    DR_INIT: 0 or 4
>>>>>>>>    DR_STEP: 0
>>>>>>>>
>>>>>>>> This is also what we'd like to have for the unsigned "i", but the
>>>>>>>> problem is that strip_constant_offset thinks that the "i + 1" in
>>>>>>>> "(intptr_t) (i + 1)" could wrap and so doesn't peel off the "+ 1".
>>>>>>>> The [i + 1] accesses therefore have a DR_OFFSET equal to the SSA
>>>>>>>> name that holds "(intptr_t) (i + 1)", meaning that the accesses no
>>>>>>>> longer seem to be related to the [i] ones.
>>>>>>>
>>>>>>> Didn't read the change in detail, so sorry if I mis-understood the 
>>>>>>> issue.
>>>>>>> I made changes in scev to better fold type conversion by various sources
>>>>>>> of information, for example, vrp, niters, undefined overflow behavior 
>>>>>>> etc.
>>>>>>> In theory these information should be available for other
>>>>>>> optimizers without
>>>>>>> querying scev.  For the mentioned test, vrp should compute accurate 
>>>>>>> range
>>>>>>> information for "i" so that we can fold (intptr_t) (i + 1) it without
>>>>>>> worrying
>>>>>>> overflow.  Note we don't do it in generic folding because
>>>>>>> (intptr_t) (i) + 1
>>>>>>> could be more expensive (especially in case of (T)(i + j)), or because 
>>>>>>> the
>>>>>>> CST part is in bigger precision after conversion.
>>>>>>> But such folding is wanted in several places, e.g, IVOPTs.  To provide 
>>>>>>> such
>>>>>>> an interface, we changed tree-affine and made it do aggressive fold.  I 
>>>>>>> am
>>>>>>> curious if it's possible to use aff_tree to implement 
>>>>>>> strip_constant_offset
>>>>>>> here since aggressive folding is wanted.  After all, using additional 
>>>>>>> chrec
>>>>>>> looks like a little heavy wrto the simple test.
>>>>>>
>>>>>> Yeah, using aff_tree does work here when the bounds are constant.
>>>>>> It doesn't look like it works for things like:
>>>>>>
>>>>>>     double p[1000];
>>>>>>     double q[1000];
>>>>>>
>>>>>>     void
>>>>>>     f4 (unsigned int n)
>>>>>>     {
>>>>>>       for (unsigned int i = 0; i < n; i += 4)
>>>>>>         {
>>>>>>           double a = q[i] + p[i];
>>>>>>           double b = q[i + 1] + p[i + 1];
>>>>>>           q[i] = a;
>>>>>>           q[i + 1] = b;
>>>>>>         }
>>>>>>     }
>>>>>>
>>>>>> though, where the bounds on the global arrays guarantee that [i + 1] 
>>>>>> can't
>>>>>> overflow, even though "n" is unconstrained.  The patch as posted handles
>>>>>> this case too.
>>>>> BTW is this a missed optimization in value range analysis?  The range
>>>>> information for i should flow in a way like: array boundary -> niters
>>>>> -> scev/vrp.
>>>>> I think that's what niters/scev do in analysis.
>>>>
>>>> Yeah, maybe :-)  It looks like the problem is that when SLP runs,
>>>> the previous VRP pass came before loop header copying, so the (single)
>>>> header has to cope with n == 0 case.  Thus we get:
>>> Ah, there are several passes in between vrp and pass_ch, not sure if
>>> any such pass depends on vrp intensively.  I would suggestion reorder
>>> the two passes, or standalone VRP interface updating information for
>>> loop region after header copied?   This is a non-trivial issue that
>>> needs to be fixed.  Niters analyzer rely on
>>> simplify_using_initial_conditions heavily to get the same information,
>>> which in my opinion should be provided by VRP.  Though this won't be
>>> able to obsolete simplify_using_initial_conditions because VRP is weak
>>> in symbolic range...
>>>
>>>>
>>>>   Visiting statement:
>>>>   i_15 = ASSERT_EXPR <i_6, i_6 < n_9(D)>;
>>>>   Intersecting
>>>>     [0, n_9(D) + 4294967295]  EQUIVALENCES: { i_6 } (1 elements)
>>>>   and
>>>>     [0, 0]
>>>>   to
>>>>     [0, 0]  EQUIVALENCES: { i_6 } (1 elements)
>>>>   Intersecting
>>>>     [0, 0]  EQUIVALENCES: { i_6 } (1 elements)
>>>>   and
>>>>     [0, 1000]
>>>>   to
>>>>     [0, 0]  EQUIVALENCES: { i_6 } (1 elements)
>>>>   Found new range for i_15: [0, 0]
>>>>
>>>>   Visiting statement:
>>>>   _3 = i_15 + 1;
>>>>   Match-and-simplified i_15 + 1 to 1
>>>>   Intersecting
>>>>     [1, 1]
>>>>   and
>>>>     [0, +INF]
>>>>   to
>>>>     [1, 1]
>>>>   Found new range for _3: [1, 1]
>>>>
>>>> (where _3 is the index we care about), followed by:
>>>>
>>>>   Visiting statement:
>>>>   i_15 = ASSERT_EXPR <i_6, i_6 < n_9(D)>;
>>>>   Intersectings/aarch64-linux/trunk-orig/debug/gcc'
>>>>     [0, n_9(D) + 4294967295]  EQUIVALENCES: { i_6 } (1 elements)
>>>>   and
>>>>     [0, 4]
>>>>   to
>>>>     [0, n_9(D) + 4294967295]  EQUIVALENCES: { i_6 } (1 elements)
>>>>   Intersecting
>>>>     [0, n_9(D) + 4294967295]  EQUIVALENCES: { i_6 } (1 elements)
>>>>   and
>>>>     [0, 1000]
>>>>   to
>>>>     [0, n_9(D) + 4294967295]  EQUIVALENCES: { i_6 } (1 elements)
>>>>   Found new range for i_15: [0, n_9(D) + 4294967295]
>>>>
>>>>   Visiting statement:
>>>>   _3 = i_15 + 1;
>>>>   Intersecting
>>>>     [0, +INF]
>>>>   and
>>>>     [0, +INF]
>>>>   to
>>>>     [0, +INF]
>>>>   Found new range for _3: [0, +INF]
>>>>
>>>> I guess in this case it would be better to intersect the i_15 ranges
>>>> to [0, 1000] rather than [0, n_9(D) + 4294967295].
>>>>
>>>> It does work if another VRP pass runs after CH.  But even then,
>>>> is it a good idea to rely on the range info being kept up-to-date
>>>> all the way through to SLP?  A lot happens inbetween.
>>> To some extend yes.  Now I understand that SCEV uses niters
>>> information to prove no_overflow.  Niters analysis does infer better
>>> information from array boundary, while VRP fails to do that.  I don't
>>> worry much about gap between vrp pass and slp, it's basically the same
>>> as niters.  Both information are analyzed at one point and meant to be
>>> used by following passes.  It's also not common for vrp information
>>> become invalid given we are on SSA?
>>
>> I'm not worried so much about vrp information becoming invalid on
>> an SSA name that existed when VRP was run.  It's more a question
>> of what happens about SSA names that get introduced after VRP,
>> e.g. by things like dom, reassoc, PRE, etc.
> For induction variables in concern, these passes shouldn't
> aggressively introduces new variables I think.
>>
>>> Now that data address is not analyzed against loop, VRP would be the
>>> only information we can use unless adding back scev analysis.  IMHO,
>>> the patch is doing so in another way than before.  It requires
>>> additional chrec data structure.  I remember the previous patch
>>> enables more slp vectorization, is it because of "step" information
>>> from scev?
>>
>> Do you mean that:
>>
>> 2017-07-03  Richard Sandiford  <richard.sandif...@linaro.org>
>>
>>         * tree-data-ref.c (dr_analyze_innermost): Replace the "nest"
>>         parameter with a "loop" parameter and use it instead of the
>>         loop containing DR_STMT.  Don't check simple_iv when doing
>>         BB analysis.  Describe the two analysis modes in the comment.
>>
>> enabled more SLP vectorisation in bb-slp-pr65935.c?  That was due
>> to us not doing IV analysis for BB vectorisation, and ensuring that
>> the step was always zero.
> Which means vectorizer code can handle not IV-analyzed offset, but
> can't for analyzed form?
>>
>>> In this patch, step information is simply discarded.  I am
>>> wondering if possible to always analyze scev within innermost loop for
>>> slp while discards step information.
>>
>> Well, we don't calculate a step for bb analysis (i.e. it's not the case
>> the patch calculates step information and then simply discards it).
>> I don't see how that would work.  For bb analysis, the DR_OFFSET + DR_INIT
>> has to give the offset for every execution of the block, not just the
>> first iteration of the containing loop.  So if we get back a nonzero
>> step, we have to do something with it.
> Yeah.
>>
>> But:
>>
>> (a) the old simple_iv analysis is more expensive than simply calling
>>     analyze_scev, so I don't think this is a win in terms of complexity.
>>
>> (b) for bb analysis, there's nothing particularly special about the
>>     innermost loop.  It makes more sense to analyse it in the innermost
>>     loop for which the offset is invariant, as shown by the second
>>     testcase in the patch.
>>
>> (c) The patch helps with loop vectorisation too, since analysing the
>>     starting DR_OFFSET in the context of the containing loop can help
>>     in a similar way as analysing the full offset does for SLP.
>
> I have to admit I am not very much into this method.  It complicates
> structure as well as code.
> Mostly because now dr_init are split into two different fields and one
> of it is lazily computed.
>
> For example:
>> @@ -2974,12 +2974,12 @@ vect_vfa_segment_size (struct data_refer
>>  vect_no_alias_p (struct data_reference *a, struct data_reference *b,
>>                   tree segment_length_a, tree segment_length_b)
>>  {
>> -  gcc_assert (TREE_CODE (DR_INIT (a)) == INTEGER_CST
>> -          && TREE_CODE (DR_INIT (b)) == INTEGER_CST);
>> -  if (tree_int_cst_equal (DR_INIT (a), DR_INIT (b)))
>> +  gcc_assert (TREE_CODE (DR_CHREC_INIT (a)) == INTEGER_CST
>> +          && TREE_CODE (DR_CHREC_INIT (b)) == INTEGER_CST);
>> +  if (tree_int_cst_equal (DR_CHREC_INIT (a), DR_CHREC_INIT (b)))
>>      return false;
>>
>> -  tree seg_a_min = DR_INIT (a);
>> +  tree seg_a_min = DR_CHREC_INIT (a);
>>    tree seg_a_max = fold_build2 (PLUS_EXPR, TREE_TYPE (seg_a_min),
>>                  seg_a_min, segment_length_a);
>>    /* For negative step, we need to adjust address range by TYPE_SIZE_UNIT
>> @@ -2990,10 +2990,10 @@ vect_no_alias_p (struct data_reference *
>>        tree unit_size = TYPE_SIZE_UNIT (TREE_TYPE (DR_REF (a)));
>>        seg_a_min = fold_build2 (PLUS_EXPR, TREE_TYPE (seg_a_max),
>>                     seg_a_max, unit_size);
>> -      seg_a_max = fold_build2 (PLUS_EXPR, TREE_TYPE (DR_INIT (a)),
>> -                   DR_INIT (a), unit_size);
>> +      seg_a_max = fold_build2 (PLUS_EXPR, TREE_TYPE (DR_CHREC_INIT (a)),
>> +                   DR_CHREC_INIT (a), unit_size);
>>      }
>> -  tree seg_b_min = DR_INIT (b);
>> +  tree seg_b_min = DR_CHREC_INIT (b);
>>    tree seg_b_max = fold_build2 (PLUS_EXPR, TREE_TYPE (seg_b_min),
>>                  seg_b_min, segment_length_b);
>>    if (tree_int_cst_compare (DR_STEP (b), size_zero_node) < 0)
>
> Use of DR_INIT is simply replaced by DR_CHREC_INIT.  Is it safe to do
> so in case of non-ZERO
> DR_INIT?  It worries me that I may need to think twice before
> referring to DR_INIT because it's
> not clear when DR_OFFSET is split and DR_CHREC_INIT becomes non-ZERO.
> It may simply
> because I am too dumb to handle this.  I will leave this to richi.


I'm trying to understand this a bit (not liking it very much in its
current form).

Can code currently using DR_OFFSET and DR_INIT simply use
DR_CHREC_INIT and CHREC_LEFT (DR_CHREC_OFFSET) (or DR_CHREC_OFFSET
if DR_CHREC_OFFSET is not a CHREC)?  If yes, would there be any downside
in doing that?  If not, then why?

That said, I'm all for making DR info more powerful.  But I detest
having extra fields
and adding confusion as to when to use which.  Thus if we can make DR_CHREC_INIT
be DR_INIT and DR_OFFSET an inline function accessing CHREC_LEFT if
suitable plus exposing DR_OFFSET_CHREC that would make me more happy.

One downside might be that the scalar evolution of the offset might pull in
SSA vars into "DR_OFFSET" that are "far away" and thus less optimal for
code-generation when one re-constructs a ref by adding the components?

Thanks,
Richard.


> Thanks,
> bin
>>
>> Thanks,
>> Richard
>>
>>>
>>> Thanks,
>>> bin
>>>>
>>>> FWIW, the old simple_iv check that I removed for bb data-ref analysis
>>>> relies on SCEV analysis too, so I don't think this is more expensive
>>>> than what we had before.
>>>>
>>>> Thanks,
>>>> Richard

Re: PR81635: Use chrecs to help find related data refs

Reply via email to