On October 21, 2019 4:06:49 PM GMT+02:00, Richard Sandiford 
<richard.sandif...@arm.com> wrote:
>Richard Biener <richard.guent...@gmail.com> writes:
>> On Mon, Oct 21, 2019 at 12:08 PM Richard Sandiford
>> <richard.sandif...@arm.com> wrote:
>>>
>>> Richard Biener <richard.guent...@gmail.com> writes:
>>> > On October 20, 2019 2:54:48 PM GMT+02:00, Richard Sandiford
>>> > <richard.sandif...@arm.com> wrote:
>>> >>Richard Biener <richard.guent...@gmail.com> writes:
>>> >>> On October 19, 2019 5:06:40 PM GMT+02:00, Richard Sandiford
>>> >>> <richard.sandif...@arm.com> wrote:
>>> >>>>After the previous patch, it seems more natural to apply the
>>> >>>>PARAM_SLP_MAX_INSNS_IN_BB threshold as soon as we know what
>>> >>>>the region is, rather than delaying it to vect_slp_analyze_bb_1.
>>> >>>>(But rather than carve out the biggest region possible and then
>>> >>>>reject it, wouldn't it be better to stop when the region gets
>>> >>>>too big, so we at least have a chance of vectorising something?)
>>> >>>>
>>> >>>>It also seems more natural for vect_slp_bb_region to create the
>>> >>>>bb_vec_info itself rather than (a) having to pass bits of data
>>> >>>>down for the initialisation and (b) forcing vect_slp_analyze_bb_1
>>> >>>>to free it on every failure return.
>>> >>>>
>>> >>>>Tested on aarch64-linux-gnu and x86_64-linux-gnu.  OK to install?
>>> >>>
>>> >>> Ok. But I wonder what the reason was for this limit? Dependency
>>> >>> analysis was greatly simplified, being no longer quadratic here.
>>> >>> Can you check where the limit originally came from? But indeed
>>> >>> splitting the region makes more sense then, but at dataref group
>>> >>> boundaries I'd say.
>>> >>
>>> >>Yeah, looks like it was the complexity of dependence analysis:
>>> >>
>>> >>  https://gcc.gnu.org/ml/gcc-patches/2009-05/msg01303.html
>>> >
>>> > OK. We no longer compute dependences between all memory refs but
>>> > only verify we can do those code-motions we need to do. That's of
>>> > course much harder to check a bound on upfront (it's still
>>> > quadratic in the worst case). I'm also not sure this is ever a
>>> > problem, but we might instead count the number of stmts involving
>>> > memory?
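
A rough sketch of that memory-statement count (the GIMPLE accessors here
are real, but where the check would live and the reuse of the existing
parameter are only illustrative, not part of the patch):

  /* Sketch: apply the region size limit only to statements that
     reference memory, i.e. that carry a virtual use.  */
  unsigned n_mem_stmts = 0;
  for (gimple_stmt_iterator gsi = gsi_start_bb (bb); !gsi_end_p (gsi);
       gsi_next (&gsi))
    if (gimple_vuse (gsi_stmt (gsi)))
      ++n_mem_stmts;
  if (n_mem_stmts > (unsigned) PARAM_VALUE (PARAM_SLP_MAX_INSNS_IN_BB))
    /* ... treat the region as too big ... */;
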
>>>
>>> OK, here's what I tried.  It exposes quadratic behaviour for
>>> gcc.dg/pr87985.c on x86_64 though.  To make things worse, this is
>>> all in the name of "vectorising" using V1DI. :-)
>>>
>>> In that test we have N data references in which the even elements
>>> appear to be vectorisable but the odd elements don't.  First time
>>> round, we hit:
>>>
>>>   /* For basic block SLP, try to break the group up into multiples
>>>      of the vector size.  */
>>>   unsigned HOST_WIDE_INT const_nunits;
>>>   if (is_a <bb_vec_info> (vinfo)
>>>       && STMT_VINFO_GROUPED_ACCESS (stmt_info)
>>>       && DR_GROUP_FIRST_ELEMENT (stmt_info)
>>>       && nunits.is_constant (&const_nunits))
>>>
>>> This breaks off the leading even element and recurses, discards the
>>> leading odd element, and then recurses on the rest.  We then come
>>> round to the same code when recursing on the rest.
>>>
>>> It looks like matches is valid for all elements, so I guess we should
>>> divide the group based on all false elements, not just the first?
>>
>> What do you mean by "valid for all elements"?  matches[i] is
>> true or false for all elements?  If it is true we shouldn't split
>> at all.
>
>Initially I'd wondered whether the elements were undefined after the
>first "false", i.e. if we broke early at that point.  But it looks like
>true and false elements after the first false are still meaningful.
>
>> But indeed the recursion is bad if we split out one vector per step;
>> we should split the whole group in "half" instead.
>>
>> Still, the code should ensure we have a "half" that works OK
>> and one that possibly doesn't.
>>
>> OK, I see we have {true, true, false <repeats ... times> } for
>> matches but the 2nd iteration gets {true, false, ... } and doesn't
>> split for me.  Of course if you have V1DI then you'll notice
>> that matches[0] is _always_ true ... I guess we should simply
>> never try splitting for const_nunits == 1?  Or even never
>> do BB vectorization with such a vector type.
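
A minimal sketch of that guard, assuming it sits next to the group
splitting code quoted earlier (the placement and the failure-return
convention are illustrative, not the actual source):

  /* Sketch: with single-element vectors, matches[0] is always true,
     so splitting off the leading "vector" can never make progress;
     give up on splitting instead of recursing.  */
  if (const_nunits == 1)
    return false;
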
>
>I don't think it's specific to const_nunits == 1.  E.g. if we have
>matches == { true, true, false, true } repeating for const_nunits == 2,
>we'll have the same problem of trying:
>
>  N
>  2, N-4
>     2, N-8
>        2, N-12
>           2, N-16
>              ....
>
>which is still quadratic.

Yeah. 
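
For concreteness, a self-contained toy model of that splitting scheme
(the names and structure are illustrative, not the vectoriser's real
interface).  Each call charges `size' units for recomputing matches on
its subgroup, keeps the matched prefix, drops one vector's worth of
mismatched elements and recurses on the rest, which is quadratic
overall:

  #include <cstdio>
  #include <vector>

  /* Toy model of the current scheme: charge `size' per call for the
     analysis that recomputes `matches' on the subgroup, then peel the
     matched prefix, drop const_nunits mismatched elements and recurse
     on the rest.  */
  static unsigned long
  peel_and_recurse (const std::vector<bool> &matches, unsigned start,
                    unsigned size, unsigned const_nunits)
  {
    if (size < const_nunits)
      return 0;
    unsigned long work = size;        /* analysis of this subgroup */
    unsigned i = start;
    while (i < start + size && matches[i])
      ++i;
    unsigned group1_size = i - start;
    if (group1_size == size)
      return work;                    /* whole subgroup matched */
    unsigned dropped = group1_size + const_nunits;
    if (dropped >= size)
      return work;
    return work + peel_and_recurse (matches, start + dropped,
                                    size - dropped, const_nunits);
  }

  int
  main ()
  {
    /* matches == { true, true, false, true } repeating, as above.  */
    unsigned n = 1024;
    std::vector<bool> matches (n);
    for (unsigned i = 0; i < n; ++i)
      matches[i] = (i % 4 != 2);
    /* Prints "work: 131584 for 1024 elements", i.e. about n^2/8.  */
    printf ("work: %lu for %u elements\n",
            peel_and_recurse (matches, 0, n, 2), n);
    return 0;
  }
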

>Dividing genuinely into half would fix that; so would dividing the
>whole group based on the group1_size.

I think the smaller split was meant as an optimization, and originally
we didn't recurse on the rest... So yes, simply halving might help.
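
Continuing the toy model above (same caveats, and note it reuses one
global matches array where the real code would recompute matches per
subgroup), halving looks like this and does O(N log N) work instead of
O(N^2):

  /* Toy model of halving: same `size' charge per call, but on failure
     split at the midpoint, rounded down to a whole number of vectors,
     and recurse on both halves.  */
  static unsigned long
  halve_and_recurse (const std::vector<bool> &matches, unsigned start,
                     unsigned size, unsigned const_nunits)
  {
    if (size < const_nunits)
      return 0;
    unsigned long work = size;        /* analysis of this subgroup */
    bool all_matched = true;
    for (unsigned i = start; i < start + size; ++i)
      all_matched &= matches[i];
    if (all_matched)
      return work;                    /* whole subgroup vectorises */
    unsigned half = (size / 2 / const_nunits) * const_nunits;
    if (half == 0)
      return work;                    /* too small to split further */
    return work
           + halve_and_recurse (matches, start, half, const_nunits)
           + halve_and_recurse (matches, start + half, size - half,
                                const_nunits);
  }
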

>Or we could try using the whole matches array to divide the group up
>into chunks that are worth recursing on, with each chunk having at most
>half the original group size and with the matches elements for each
>chunk being all-true or all-false.
>
>Thanks,
>Richard
