| Issue | 179534 |
| --- | --- |
| Summary | [LoopVectorizer, SystemZ] Missed interleaving opportunities. |
| Labels | new issue |
| Assignees | |
| Reporter | JonPsson1 |
@fhahn
Thank you for the explanation at https://discourse.llvm.org/t/loop-unrolling-in-loopvectorize-vs-loopunrollpass/78480/4 (please also see the follow-up question there). With that, I understand that LV interleaving, with its better cost analysis, should (at least in theory) be preferable to relying on the LoopUnroller.
One objective here on SystemZ is to avoid small loops. Unfortunately, I find that some vectorized loops do not get interleaved even though they end up very small in the assembly output, because their computed cost is above the default (LoopVectorizer) SmallLoopCost limit of 20.
Of course, increasing that limit remedies this in my reduced test case, but given that the assembly loop has just four instructions (including the IV update and the branch), it might be a better idea to tweak the cost computation instead.
This is the reduced test case input (compiled with an experimental setting of 8 for max interleave factor):
```
define noalias noundef nonnull ptr @Perl_pp_quotemeta(ptr writeonly captures(none) %d.1) local_unnamed_addr #0 {
entry:
  br label %while.cond

while.cond.loopexit: ; preds = %while.body487
  %incdec.ptr488.lcssa = phi ptr [ %incdec.ptr488, %while.body487 ]
  br label %while.cond

while.cond: ; preds = %while.cond.loopexit, %entry
  %s.0 = phi ptr [ null, %entry ], [ %incdec.ptr488.lcssa, %while.cond.loopexit ]
  br label %while.body487

while.body487: ; preds = %while.cond, %while.body487
  %ulen.13 = phi i64 [ 1, %while.cond ], [ %dec, %while.body487 ]
  %d.122 = phi ptr [ %d.1, %while.cond ], [ %incdec.ptr489, %while.body487 ]
  %s.11 = phi ptr [ %s.0, %while.cond ], [ %incdec.ptr488, %while.body487 ]
  %dec = add i64 %ulen.13, 1
  %incdec.ptr488 = getelementptr i8, ptr %s.11, i64 1
  %incdec.ptr489 = getelementptr i8, ptr %d.122, i64 1
  %0 = load i8, ptr %s.11, align 1
  store i8 %0, ptr %d.122, align 1
  %tobool486.not = icmp eq i64 %dec, 0
  br i1 %tobool486.not, label %while.cond.loopexit, label %while.body487
}
```
The LoopVectorizer debug output for this loop:
```
Cost of 1 for VF 16: induction instruction %dec = add i64 %ulen.13, 1
Cost of 1 for VF 16: induction instruction %ulen.13 = phi i64 [ 1, %while.cond ], [ %dec, %while.body487 ]
Cost of 0 for VF 16: induction instruction %incdec.ptr489 = getelementptr i8, ptr %d.122, i64 1
Cost of 1 for VF 16: induction instruction %d.122 = phi ptr [ %d.1, %while.cond ], [ %incdec.ptr489, %while.body487 ]
Cost of 0 for VF 16: induction instruction %incdec.ptr488 = getelementptr i8, ptr %s.11, i64 1
Cost of 16 for VF 16: induction instruction %s.11 = phi ptr [ %s.0, %while.cond ], [ %incdec.ptr488, %while.body487 ]
Cost of 1 for VF 16: exit condition instruction %tobool486.not = icmp eq i64 %dec, 0
LV: Selecting VF: 16.
LoopVectorizer: Selecting VF: VF_16 Cost_23 ScalarCost_8
LV: The target has 14 registers of Generic::ScalarRC register class
LV: The target has 32 registers of Generic::VectorRC register class
LV: Loop cost is 23
LV: IC is 4
LV: VF is 16
LV: Not Interleaving.
```
In this case at least, the cost of 16 for the %s.11 phi seems far too high. I am not sure why it gets this cost, but I can see that %s.11 is tied to the outer loop phi (%s.0) as well, which probably influences this. A simple observation is that this is a pointer (ptr) that is only incremented by a constant 1 in each iteration. In the vectorized loop the increment will simply be VF instead, so there shouldn't be any extra cost here.
My question is whether it would be possible to adjust this cost. Is something missing from the cost computation, or is this a known tradeoff?
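To make the arithmetic concrete, here is a small Python sketch that only re-adds the numbers from the debug output. The per-instruction costs, the loop cost of 23 and the limit of 20 are taken from the report; the strict "cost below limit" comparison and the hypothetical phi cost of 1 are assumptions made for illustration, not a description of the actual LoopVectorizer code.

```python
# Per-instruction costs for VF 16, copied from the LV debug output.
listed_costs = {
    "%dec (add)": 1,
    "%ulen.13 (phi)": 1,
    "%incdec.ptr489 (gep)": 0,
    "%d.122 (phi)": 1,
    "%incdec.ptr488 (gep)": 0,
    "%s.11 (phi)": 16,
    "%tobool486.not (icmp)": 1,
}

REPORTED_LOOP_COST = 23  # "LV: Loop cost is 23"
SMALL_LOOP_COST = 20     # default limit mentioned in the report


def is_small_loop(cost, limit=SMALL_LOOP_COST):
    # Assumption: a loop counts as small when its cost is strictly below the limit.
    return cost < limit


# Sum of the costs that appear in the excerpt.
listed_total = sum(listed_costs.values())

# Part of the reported loop cost that comes from instructions not listed
# in the excerpt (presumably the load, the store and the branch).
unlisted = REPORTED_LOOP_COST - listed_total

# Hypothetical what-if: cost the %s.11 phi at 1, like the other two phis.
adjusted = REPORTED_LOOP_COST - listed_costs["%s.11 (phi)"] + 1

print("listed:", listed_total, "unlisted:", unlisted)
print("reported cost is small:", is_small_loop(REPORTED_LOOP_COST))
print("adjusted cost:", adjusted, "is small:", is_small_loop(adjusted))
```

Under these assumptions the single phi accounts for 16 of the 23 cost units, and without it the loop would fall well under the limit.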
Output:
```
vector.body: ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %next.gep = getelementptr i8, ptr %d.1, i64 %index
  %next.gep6 = getelementptr i8, ptr %s.0, i64 %index
  %wide.load = load <16 x i8>, ptr %next.gep6, align 1
  store <16 x i8> %wide.load, ptr %next.gep, align 1
  %index.next = add nuw i64 %index, 16
  %3 = icmp eq i64 %index.next, -16
  br i1 %3, label %while.body487, label %vector.body, !llvm.loop !0
```
The resulting machine code (MIR) for the vector loop:
```
bb.4.vector.body:
  renamable $v0 = VL renamable $r1d, 0, renamable $r3d :: (load (s128) from %ir.next.gep6, align 1)
  VST killed renamable $v0, renamable $r2d, 0, renamable $r3d :: (store (s128) into %ir.next.gep, align 1)
  renamable $r3d = nuw LA killed renamable $r3d, 16, $noreg
  CGIJ renamable $r3d, -16, 6, %bb.4, implicit-def dead $cc
```
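As a rough illustration of why interleaving matters for such a tiny loop, the Python sketch below counts instructions per copied byte. It assumes, purely for illustration, that an interleaved body needs one vector load and one vector store per interleaved part plus the same two overhead instructions seen in the loop above (the IV update and the compare-and-branch), with no extra address arithmetic. The helper names are hypothetical and the real code generation may differ.

```python
def instructions_per_iteration(ic, overhead=2):
    # One vector load and one vector store per interleaved part,
    # plus the loop overhead (IV update and compare-and-branch).
    return 2 * ic + overhead


def instructions_per_byte(vf, ic):
    # Each iteration copies vf bytes per interleaved part (i8 elements).
    return instructions_per_iteration(ic) / (vf * ic)


for ic in (1, 2, 4):
    print(
        "IC", ic,
        "instructions per iteration:", instructions_per_iteration(ic),
        "per byte:", instructions_per_byte(16, ic),
    )
```

With IC 1 this reproduces the four-instruction loop above; under the stated assumptions, the IC of 4 reported by LV would spread the two overhead instructions over 64 bytes instead of 16.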
@uweigand