On 2021-07-12 23:53, guojiufu via Gcc-patches wrote:
> On 2021-07-12 22:46, Richard Biener wrote:
>> On Mon, 12 Jul 2021, guojiufu wrote:
>>
>>> On 2021-07-12 18:02, Richard Biener wrote:
>>> > On Mon, 12 Jul 2021, guojiufu wrote:
>>> >
>>> >> On 2021-07-12 16:57, Richard Biener wrote:
>>> >> > On Mon, 12 Jul 2021, guojiufu wrote:
>>> >> >
>>> >> >> On 2021-07-12 14:20, Richard Biener wrote:
>>> >> >> > On Fri, 9 Jul 2021, Segher Boessenkool wrote:
>>> >> >> >
>>> >> >> >> On Fri, Jul 09, 2021 at 08:43:59AM +0200, Richard Biener wrote:
>>> >> >> >> > I wonder if there's a way to query the target what modes the
>>> >> >> >> > doloop
>>> >> >> >> > pattern can handle (not being too familiar with the doloop
>>> >> >> >> > code).
>>> >> >> >>
>>> >> >> >> You can look what modes are allowed for operand 0 of doloop_end,
>>> >> >> >> perhaps? Although that is a define_expand, not a define_insn, so
>>> >> >> >> it
>>> >> >> >> is
>>> >> >> >> hard to introspect.
>>> >> >> >>
>>> >> >> >> > Why do you need to do any checks besides the new type being
>>> >> >> >> > able to
>>> >> >> >> > represent all IV values? The original doloop IV will never
>>> >> >> >> > wrap
>>> >> >> >> > (OTOH if niter is U*_MAX then we compute niter + 1 which will
>>> >> >> >> > become
>>> >> >> >> > zero ... I suppose the doloop might still do the correct thing
>>> >> >> >> > here
>>> >> >> >> > but it also still will with a IV with larger type).
>>> >> >>
>>> >> >> The issue comes from U*_MAX (the original short type's MAX), as you
>>> >> >> said: there niter + 1 becomes zero. And because the step for doloop
>>> >> >> is -1, in the larger type 'zero - 1' will be a very large number
>>> >> >> (e.g. 0xff...ff); but in the original short type 'zero - 1' is a
>>> >> >> small value (e.g. 0xff).
>>> >> >
>>> >> > But for the larger type the small type MAX + 1 fits and does not
>>> >> > yield
>>> >> > zero so it should still work exactly as before, no? Of course you
>>> >> > have to compute the + 1 in the larger type.
>>> >> >
>>> >> You are right: if the "+ 1" is computed in the larger type it is ok,
>>> >> as in the code below:
>>> >> ```
>>> >>   /* Using a word-size type may be faster.  */
>>> >>   if (TYPE_PRECISION (ntype) < BITS_PER_WORD)
>>> >>     {
>>> >>       ntype = lang_hooks.types.type_for_size (BITS_PER_WORD, 1);
>>> >>       niter = fold_convert (ntype, niter);
>>> >>     }
>>> >>
>>> >>   tree base = fold_build2 (PLUS_EXPR, ntype, unshare_expr (niter),
>>> >>                            build_int_cst (ntype, 1));
>>> >>
>>> >>   add_candidate (data, base, build_int_cst (ntype, -1), true, NULL,
>>> >>                  NULL, true);
>>> >> ```
>>> >> The issue with this is that it generates more stmts for doloop.xxx:
>>> >>   _12 = (unsigned int) xx(D);
>>> >>   _10 = _12 + 4294967295;
>>> >>   _24 = (long unsigned int) _10;
>>> >>   doloop.6_8 = _24 + 1;
>>> >>
>>> >> If we use the previous patch, with "+ 1" done in the original type,
>>> >> the stmts look like:
>>> >>   _12 = (unsigned int) xx(D);
>>> >>   doloop.6_8 = (long unsigned int) _12;
>>> >> This is the reason for checking
>>> >> wi::ltu_p (niter_desc->max, wi::to_widest (TYPE_MAX_VALUE (ntype)))
>>> >
>>> > But this then only works when there's an upper bound on the number
>>> > of iterations. Note you should not use TYPE_MAX_VALUE here but
>>> > you can instead use
>>> >
>>> > wi::ltu_p (niter_desc->max, wi::to_widest (wi::max_value
>>> > (TYPE_PRECISION (ntype), TYPE_SIGN (ntype))));
>>>
>>> Ok, Thanks!
>>> I remember you mentioned that:
>>> widest_int::from (wi::max_value (TYPE_PRECISION (ntype), TYPE_SIGN
>>> (ntype)),
>>> TYPE_SIGN (ntype))
>>> would be better than
>>> wi::to_widest (TYPE_MAX_VALUE (ntype)).
>>>
>>> It seems that "TYPE_MAX_VALUE (ntype)" is
>>> "NUMERICAL_TYPE_CHECK (NODE)->type_non_common.maxval", which does a
>>> numerical check, returns the maxval field, and then calls
>>> wi::to_widest.
>>>
>>> The other code, "widest_int::from (wi::max_value (..,..), ..)", calls
>>> wi::max_value and widest_int::from.
>>>
>>> I'm wondering if wi::to_widest (TYPE_MAX_VALUE (ntype)) is cheaper?
>>
>> TYPE_MAX_VALUE can be "surprising": it does not necessarily match the
>> underlying mode's precision. At some point we've tried to eliminate
>> most of its uses, not sure what the situation/position is right now.
> Ok, got it, thanks.
> I will use "widest_int::from (wi::max_value (..,..),..)".
>
>>
>>> > I think the -1 above comes from number of latch iterations vs. header
>>> > entries - it's a common source for this kind of issues. range analysis
>>> > might be able to prove that we can still merge the two adds even with
>>> > the intermediate extension.
>>> Yes, as you mentioned here, it relates to the number of latch iterations.
>>> For loops like: while (l < n) or for (i = 0; i < n; i++)
>>> the niter is usually 'n - 1' after the loop is transformed into
>>> 'do-while' form.
> For this kind of loop, the max value of the iteration count "n - 1"
> would be "max_value_of_type(n) - 1", which is wi::ltu than the type's
> max value. This kind of loop is common, and we could use
> wi::ltu_p (max, max_value_of_type) to check.
>
> For a loop like:
>   do ;
>   while (n-- > 0);  /* while (n-- > low); */
>
> niter_desc->max will be wi::eq to the type's max value, niter would be
> "n", and then doloop.xx is 'n + 1'.
>
>>> I will look at how to merge these two adds safely at this point
>>> when generating the doloop iv (maybe using range info). Thanks!
>>>
In some cases, it may not be easy to merge the -1/+1 pair
(from doloop-1.c):
```
int __attribute__ ((noinline))
test (unsigned char n)
{
  do ;
  while (--n > 0);
  return n;
}

int main ()
{
  unsigned char z = 0;
  return test (z);
}
```
For this loop, niter_desc->max is 255 and niter is 'n - 1', so doloop.xx
is "'n - 1' + 1", and 'n - 1' should be computed in the uchar type
(QImode).
It is ok to compute both '- 1' and '+ 1' in the original shorter type.
It is also ok to compute 'n - 1' in the original type and '+ 1' in the
larger mode.
But it is not ok to compute both '- 1' and '+ 1' in the larger mode.
As discussed previously: when 'n - 1' is U*_MAX (the original short
type's MAX), '(n - 1) + 1' becomes zero; and since the step for doloop
is -1, in the larger type 'zero - 1' will be a very large number.
This is ok (original):
  unsigned char doloop.5;
  # doloop.5_4 = PHI <n_2(D)(2), doloop.5_6(5)>
  doloop.5_6 = doloop.5_4 - 1;
It would be wrong as:
  doloop.5_7 = (long unsigned int) n_2(D);
  # doloop.5_4 = PHI <doloop.5_7(2), doloop.5_6(5)>
  doloop.5_6 = doloop.5_4 - 1;
>>> >
>>> > Is this pre-loop extra add really offsetting the in-loop doloop
>>> > improvements?
>>> I'm not quite catching this question, sorry. I guess your concern is
>>> whether the "+1" is an offset: it may not be; the "+1" may just mean
>>> that doloop.xx decreases from niter + 1 down to 0 (all numbers > 0).
>>> If I misunderstand, thanks for pointing it out.
>>
>> I'm questioning the argument that not being able to eliminate the +1-1
>> pair affects the overall cost improvement for the doloop IV type change,
>> which hopefully is _not_ just loop IV setup cost but improving
>> performance in the loop body itself?
> Yes, I think so. It would affect the doloop IV setup cost; it may also
> affect the loop itself on some targets, if the target prefers the
> changed type: using one count-and-branch instruction for the wider
> type, but cmp/jump instructions for other types.
> If the loop is nested in an outer loop, eliminating the +1/-1 pair
> would improve outer-loop performance.