Re: [RFC][tree-vect]PR 88915: Further vectorize second loop when versioning

Andre Vieira (lists) Mon, 05 Aug 2019 11:10:54 -0700

Hi Richard,

Thanks for the feedback! See comments inline.


On 01/08/2019 16:26, Richard Biener wrote:

On Tue, 30 Jul 2019, Andre Vieira (lists) wrote:



On 30/07/2019 13:16, Andre Vieira (lists) wrote:

Hi Richard,

I believe this is in line with what you were explaining to me earlier. The
one thing I haven't quite done here is the jump around for if there is no
prolog peeling. Though I think skip_vectors introduces the jumping we need.

The main gist of it is:
1) if '--param vect-epilogues-nomask=1' is passed we refrain from adding the
versioning threshold check when versioning a loop at first,
2) later we allow the vectorizer to use skip_vectors to basically check
niters to be able to skip the vectorized loop on to the right epilogue,
3) after vectorizing the epilogue, we check whether this was a versioned
loop and whether we skipped adding the versioning threshold, if so we add a
threshold based on the last VF

There is a caveat here, if we don't vectorize the epilogue, because say our
architecture doesn't know how, we will end up with a regression.
Let's say we have a loop for which our target can vectorize for 32x but not
16x, we get:

if (alias & threshold_for_16x ())
{
    if (niters > 31)
      vectorized_31 ();
    else
      scalar ();
}
else
   scalar ();

Imagine our iteration count is 18, all we hit is the scalar loop, but now we
go through 1x alias and 2x niters. Whereas potentially we could have done
with just 1x niters.


True.  Note we should swap the checks to

   if (threshold_for_16x && alias)

I agree, I just haven't figured out what the best way of doing it. I amtrying TRUTH_ANDIF_EXPR now, but fold_build2 turns that into aTRUTH_AND_EXPR. Can I assume it does that for aarch64 because it canuse conditional compare to merge the two checks? To be fair the codegenerated for aarch64 for the two checks I am looking at doesn't looktoo bad.

I am pondering figuring out why fold decides to transform it back into aTRUTH_AND_EXPR. Otherwise I think I might just try splitting the basicblock. Though I am not entirely sure adding an extra jump in there isalways a good thing.

A mitigation for this could be to only add the threshold check if we
actually vectorized the last loop, I believe a:
'if (LOOP_VINFO_VECTORIZABLE_P (loop_vinfo)) return NULL;' inside that new
"else" in vect_transform_loop would do the trick. Though then we will still
have that extra alias check... >
I am going to hack around in the back-end to "fake" a situation like this
and see what happens.


Right should have quickly tried this before sending the email, what happens is
actually vect_transform_loop never gets called because try_vectorize_loop_1
will recognize it can't vectorize the epilogue. So we end up with the
"mitigation" result I suggested, where we simply don't have a versioning
threshold which might still not be ideal.


I think the main issue is how we have implemented epilogue vectorization.
Ideally when vectorizing the loop () we'd recognize all VFs we can handle
and thus know beforehand.  I think that's already done for the sake
of openmp simd now so doing this when epilogue vectorization is enabled
as well wouldn't be too bad - so then we know, at the time we do the
versioning, what and how many vectorized epilogues we create.  See
vect_analyze_loop and the loop over vector sizes.

I think I managed to do this by passing a boolean initialized to thePARAM_VALUE (PARAM_VECT_EPILOGUES_NOMASK) to vect_analyze_loop and alsochanging the looping over vector_sizes to count the number ofvectorizable loops whilst using the same code path as loop->simdlen ifvect_epilogues_nomask is true.

Then I set vect_epilogues_nomask to false if the number of vectorizedloops isn't higher than 1. I'll follow up with a new patch for you tosee, which will most likely make more sense than what I just wrote upthere ;)


Another potential issue arises when the profitability threshold obtained
from the cost model isn't guaranteed to be either LE or EQ to the versioning
threshold. In that case there are two separate checks, which now we no
longer will attempt to fold.


And I just realized I don't take the cost model threshold into account when
creating the new threshold check, I guess if it is ordered_p we should again
take the max there between the new threshold check and the cost threshold for
the new check.


Also see https://gcc.gnu.org/ml/gcc-patches/2018-06/msg01397.html for
a patch that computes costs of each possible VF, with that we could
compute a better combined estimate for minimum number of iters
(also considering the extra jumps to reach the main VF/2 loop).

Started to look at this. Just to check something, this patch enablespicking a lower VF loop if it determines that its single iteration costis more than N times lower than the higher VF loop, where N is theration between higher VF and lower VF no?

Sounds like a good check to have, but I don't see how this is related tomy current changes. I mean other than that it could turn off epiloguevectorization altogether since it could decide to only vectorize once.

Did you mean to show me this as a way to incorporate costs for the extrajumps somehow? For instance, taking the ordered max of currentversioning threshold and the SINGLE_ITERATION_COST + 1 (for the extra jump)?

I also noticed we purposefully only vectorize the epilogue once, withvect-epilogues-nomask=1, I am guessing that's just because nobody gotaround to it yet? Specially since even vectorizing it once goes wrongsometimes.


I am trying to find a way to test and benchmark these changes. Unfortunately
I am having trouble doing this for AVX512 as I find that the old '--param
vect_epilogues_nomask=1' already causes wrong codegen in SPEC2017 for the
gcc and perlbench benchmarks.


There's a reason it is not enabled by default.  But I'd like to
fix bugs it has so can you possibly reduce testcases and file
bugs for it?

Yeah that makes sense, also options that are off by default tend tobitrot fast! I did have an initial look at it, but the gcc benchmarkisn't an easy one to dissect. I will give it another go sometime thisweek, see if I can figure out at least what loop is being miscompiled.

Do you see any use in running any set of tests in the testsuite with theparameter for smaller testcases?

Thanks,
Richard.



Cheers,
Andre

On 19/07/2019 13:38, Andre Vieira (lists) wrote:



On 19/07/2019 12:35, Richard Biener wrote:

On Fri, 19 Jul 2019, Andre Vieira (lists) wrote:



On 15/07/2019 11:54, Richard Biener wrote:

On Mon, 15 Jul 2019, Andre Vieira (lists) wrote:



On 12/07/2019 11:19, Richard Biener wrote:

On Thu, 11 Jul 2019, Andre Vieira (lists) wrote:



I have code that can split the condition and alias checks in
'vect_loop_versioning'. For this approach I am considering keeping
that
bit of
code and seeing if I can patch up the checks after vectorizing the
epilogue
further. I think initially I will just go with a "hacked up" way
of
passing
down the bb with the iteration check and split the false edge
every time
we
vectorize it further. Will keep you posted on progress. If you
have any
pointers/tips they are most welc ome!


I thought to somehow force the idea that we have a prologue loop
to the vectorizer so it creates the number-of-vectorized iterations
check and branch around the main (highest VF) vectorized loop.


Hmm I think I may have skimmed over this earlier. I am reading it now
and am
not entirely sure what you mean by this. Specifically the "number of
vectorized iterations" or how a prologue loop plays a role in this.


When there is no prologue then the versioning condition currently
ensures we always enter the vector loop.  Only if there is a prologue
is there a check and branch eventually skipping right to the epilogue.
If the versioning condition (now using a lower VF) doesn't guarantee
this we have to add this jump-around.


Right, I haven't looked at the prologue path yet. I had a quick look and
can't find where this branch skipping to the epilogue is constructed.  I
will take a better look after I got my current example to work.


I guess this sheds some light on the comment above. And it definitely
implies
we would need to know the lowest VF when creating this condition.
Which is
tricky.


We can simply use the smallest vector size supported by the target to
derive it from the actual VF, no?


So I could wait to introduce this check after all epilogue vectorization
is done, back track to the last niter check and duplicate that in the
outer loop.

What I didn't want to do was use the smallest possible vector size for the
target because I was under the impression that does not necessarily
correspond to the smallest VF we may have for a loop, as the vectorizer
may have decided not to vectorize for that vector size because of costs?
If it I can assume this never happens, that once it starts to vectorize
epilogues that it will keep vectorizing them for any vector size it knows
off then yeah I can use that.

I'm not sure I understand - why would you have any check not inside
the outer loop?  Yes, we now eventually hoist versioning checks
but the VF checks for the individual variants should be around
the vectorized loop itself (so not really part of the versioning check).


Yeah I agree. I was just explaining what I had done wrong now.

Cheers,
Andre

PS: I often find myself having to patch the DOMINATOR information,
sometimes
its easy to, but sometimes it can get pretty complicated. I wonder
whether it
would make sense to write something that traverses a loop and corrects
this,
if it doesn't exist already.


There's iterate-fix-dominators, but unless you create new edges/blocks
manually rather than doing split-block/redirect-edge which should do
dominator updating for you.


Ah I was doing everything manually after having some bad experiences with
lv_add_condition_to_bb.  I will have a look at those thanks!

Cheers,
Andre


Richard.


Richard.

Re: [RFC][tree-vect]PR 88915: Further vectorize second loop when versioning

Reply via email to