Hi Both,
Thanks for all the reviews/patches so far 😊
> >
> > Looks good, but I wonder what we can do to at least make the multiple
> > exit case behave reasonably? The vectorizer keeps track
>
> > of a "canonical" exit, would it be possible to pass in the main exit
> > edge and use that instead of single_exit (), would other exits then
> > behave somewhat reasonable or would we totally screw things up here?
> > That is, the "canonical" exit would be the counting exit while the
> > other exits are on data driven conditions and thus wouldn't change
> > probability when we reduce the number of iterations(?)
>
> I can add canonical_exit parameter and make the function to direct flow to it
> if
> possible. However overall I think fixup depends on what transformation led to
> the change.
>
> Assuming that vectorizer did no prologues and apilogues and we vectorized
> with factor N, then I think the update could be done more specifically as
> follows.
>
If it helps, how this patch series addresses multiple exits by forcing a scalar
epilogue, all non canonical_exits would have been redirected to this scalar
epilogue, so the remaining scalar iteration count will be at most VF.
Regards,
Tamar
> We know that header block count dropped by 4. So we can start from that
> and each time we reach basic block with exit edge, we know the original count
> of the edge. This count is unchanged, so one can rescale probabilities out of
> that BB accordingly. If loop has no inner loops, we can just walk the body in
> RPO and propagate scales downwards and we sould arrive to right result
>
> I originally added the bound parameter to handle prologues/epilogues which
> gets new artificial bound. In prologue I think you are right that the flow
> will be
> probably directed to the conditional counting iterations.
>
> In epilogue we add no artificial iteration cap, so maybe it is more realistic
> to
> simply scale up probability of all exits?
>
> To see what is going on I tried following testcase:
>
> int a[99];
> test()
> {
> for (int i = 0; i < 99; i++)
> a[i]++;
> }
>
> What surprises me is that vectorizer at -O2 does nothing and we end up
> unrolling the loop:
>
> L2:
> addl $1, (%rax)
> addl $1, 4(%rax)
> addl $1, 8(%rax)
> addq $12, %rax
> cmpq $a+396, %rax
>
> Which seems sily thing to do. Vectorized loop with epilogue doing 2 and
> 1 addition would be better.
>
> With -O3 we vectorize it:
>
>
> .L2:
> movdqa (%rax), %xmm0
> addq $16, %rax
> paddd %xmm1, %xmm0
> movaps %xmm0, -16(%rax)
> cmpq %rax, %rdx
> jne .L2
> movq a+384(%rip), %xmm0
> addl $1, a+392(%rip)
> movq .LC1(%rip), %xmm1
> paddd %xmm1, %xmm0
> movq %xmm0, a+384(%rip)
>
>
> and correctly drop vectorized loop body to 24 iterations. However the
> epilogue has loop for vector size 2 predicted to iterate once (it won't)
>
> ;; basic block 7, loop depth 0, count 10737416 (estimated locally), maybe
> hot
> ;; prev block 5, next block 8, flags: (NEW, VISITED)
> ;; pred: 3 [4.0% (adjusted)] count:10737416 (estimated locally)
> (FALSE_VALUE,EXECUTABLE)
> ;; succ: 8 [always] count:10737416 (estimated locally)
> (FALLTHRU,EXECUTABLE)
>
> ;; basic block 8, loop depth 1, count 21474835 (estimated locally), maybe
> hot
> ;; prev block 7, next block 9, flags: (NEW, REACHABLE, VISITED)
> ;; pred: 9 [always] count:10737417 (estimated locally)
> (FALLTHRU,DFS_BACK,EXECUTABLE)
> ;; 7 [always] count:10737416 (estimated locally)
> (FALLTHRU,EXECUTABLE)
> # i_9 = PHI <i_17(9), 96(7)>
> # ivtmp_13 = PHI <ivtmp_18(9), 3(7)>
> # vectp_a.14_40 = PHI <vectp_a.14_41(9), &MEM <int[99]> [(void *)&a +
> 384B](7)>
> # vectp_a.18_46 = PHI <vectp_a.18_47(9), &MEM <int[99]> [(void *)&a +
> 384B](7)>
> # ivtmp_49 = PHI <ivtmp_50(9), 0(7)>
> vect__14.16_42 = MEM <vector(2) int> [(int *)vectp_a.14_40];
> _14 = a[i_9];
> vect__15.17_44 = vect__14.16_42 + { 1, 1 };
> _15 = _14 + 1;
> MEM <vector(2) int> [(int *)vectp_a.18_46] = vect__15.17_44;
> i_17 = i_9 + 1;
> ivtmp_18 = ivtmp_13 - 1;
> vectp_a.14_41 = vectp_a.14_40 + 8;
> vectp_a.18_47 = vectp_a.18_46 + 8;
> ivtmp_50 = ivtmp_49 + 1;
> if (ivtmp_50 < 1)
> goto <bb 9>; [50.00%]
> else
> goto <bb 12>; [50.00%]
>
> and finally the scalar copy
>
> ;; basic block 12, loop depth 0, count 10737416 (estimated locally), maybe
> hot
> ;; prev block 9, next block 13, flags: (NEW, VISITED)
> ;; pred: 8 [50.0% (adjusted)] count:10737418 (estimated locally)
> (FALSE_VALUE,EXECUTABLE)
> ;; succ: 13 [always] count:10737416 (estimated locally) (FALLTHRU)
>
> ;; basic block 13, loop depth 1, count 1063004409 (estimated locally),
> maybe hot
> ;; prev block 12, next block 14, flags: (NEW, REACHABLE, VISITED)
> ;; pred: 14 [always] count:1052266996 (estimated locally)
> (FALLTHRU,DFS_BACK,EXECUTABLE)
> ;; 12 [always] count:10737416 (estimated locally) (FALLTHRU)
> # i_30 = PHI <i_36(14), 98(12)>
> # ivtmp_32 = PHI <ivtmp_37(14), 1(12)>
> _33 = a[i_30];
> _34 = _33 + 1;
> a[i_30] = _34;
> i_36 = i_30 + 1;
> ivtmp_37 = ivtmp_32 - 1;
> if (ivtmp_37 != 0)
> goto <bb 14>; [98.99%]
> else
> goto <bb 4>; [1.01%]
>
> With also small but non-zero iteration probability. This is papered
> over by my yesterday patch. But it seems to me that it would be a lot better
> if
> vectorizer understood that the epilogue will be loopless and accounted it to
> the cost model that would probably make it easy to enable it at cheap costs
> too.
>
> Clang 16 at -O2 is much more aggressive by both vectorizing and unroling:
>
> test: # @test
> .cfi_startproc
> # %bb.0:
> movdqa a(%rip), %xmm1
> pcmpeqd %xmm0, %xmm0
> psubd %xmm0, %xmm1
> movdqa %xmm1, a(%rip)
> movdqa a+16(%rip), %xmm1
> psubd %xmm0, %xmm1
> movdqa %xmm1, a+16(%rip)
> movdqa a+32(%rip), %xmm1
> psubd %xmm0, %xmm1
> movdqa %xmm1, a+32(%rip)
> movdqa a+48(%rip), %xmm1
> psubd %xmm0, %xmm1
> movdqa %xmm1, a+48(%rip)
> movdqa a+64(%rip), %xmm1
> psubd %xmm0, %xmm1
> movdqa %xmm1, a+64(%rip)
> movdqa a+80(%rip), %xmm1
> psubd %xmm0, %xmm1
> movdqa %xmm1, a+80(%rip)
> movdqa a+96(%rip), %xmm1
> psubd %xmm0, %xmm1
> movdqa %xmm1, a+96(%rip)
> movdqa a+112(%rip), %xmm1
> psubd %xmm0, %xmm1
> ....
> movdqa %xmm1, a+240(%rip)
> movdqa a+256(%rip), %xmm1
> psubd %xmm0, %xmm1
> movdqa %xmm1, a+256(%rip)
> movdqa a+272(%rip), %xmm1
> psubd %xmm0, %xmm1
> movdqa %xmm1, a+272(%rip)
> movdqa a+288(%rip), %xmm1
> psubd %xmm0, %xmm1
> movdqa %xmm1, a+288(%rip)
> movdqa a+304(%rip), %xmm1
> psubd %xmm0, %xmm1
> movdqa %xmm1, a+304(%rip)
> movdqa a+320(%rip), %xmm1
> psubd %xmm0, %xmm1
> movdqa %xmm1, a+320(%rip)
> movdqa a+336(%rip), %xmm1
> psubd %xmm0, %xmm1
> movdqa %xmm1, a+336(%rip)
> movdqa a+352(%rip), %xmm1
> psubd %xmm0, %xmm1
> movdqa %xmm1, a+352(%rip)
> movdqa a+368(%rip), %xmm1
> psubd %xmm0, %xmm1
> movdqa %xmm1, a+368(%rip)
> addl $1, a+384(%rip)
> addl $1, a+388(%rip)
> addl $1, a+392(%rip)
> retq
>
> Honza