Hi Segher,

on 2020/9/4 10:16 PM, Segher Boessenkool wrote:
> Hi!
>
> On Fri, Sep 04, 2020 at 04:47:37PM +0800, Kewen.Lin wrote:
>>>> Apart from that, one P9 specific point is that the update form load isn't
>>>> preferred, the reason is that the instruction can not retire until both
>>>> parts complete, it can hold up subsequent instructions from retiring.
>>>> If the addi stalls (starvation), the instruction can not retire and can
>>>> cause things stuck.  It seems also something we can model here?
>>>
>>> This is (almost) no problem on p9, since we no longer have issue groups.
>>> It can hold up older insns from retiring, sure, but they *will* have
>>> finished, and p9 can retire 64 insns per cycle.  The "completion wall"
>>> is gone.  The only problem is if things stick around so long that
>>> resources run out... but you're talking 100s of insns there.
>>
>> Theoretically it's fine, but the addi starvation was observed in the FP/SIMD
>> instructions intensive loop code, which did cause some worse performance. :(
>
> "addi starvation" has nothing to do with addi (it also happens for other
> insns), and nothing with update form memory insns either.  What happens
> is simply that no shorter latency insns are issued by the core so long
> as longer latency insns (like most float insns) are available.  So in
> really nice floating point loops we execute the few integer add insns
> much too late, much later than they were in the machine code, which then
> makes the memory insns late as well, etc.
>

Yeah, the starvation issue isn't addi specific, but in FP/SIMD intensive
loops, addi/add insns make up most (or all) of the shorter latency insns
most of the time.  So I'd argue that it's related. :)  Since they are
mainly IV updates, the memory insns depend on them and the FP/SIMD insns
depend on the memory insns, so a late add can easily set off a chain of
stalls.  I guess that's why some people call it "addi starvation".

As in the example Bin gave in another email, more auto-inc candidates
mean more IV updates (cracked ADDIs).  If one or several common index IVs
can be shared among the memory insns (fewer ADDIs), we can reduce the
number of shorter latency insns.  As far as I know, some compilers are
implemented not to prefer auto-inc candidates, which can mitigate the
starvation issue in those FP/SIMD intensive loops to some extent.

BR,
Kewen
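
For illustration only, here is a rough C sketch of the two IV shapes
discussed above.  The function names axpy_ptr and axpy_idx are made up
for this example, and the source-level shape is only a hint: which IVs
the compiler finally picks depends on ivopts and the target cost model,
so this may not map one-to-one onto the generated code.

```c
#include <stddef.h>

/* Shape A (hypothetical example): one pointer IV per array.  Each pointer
   bump is an auto-inc (update form) candidate, so the loop carries three
   IV updates, i.e. three short latency adds once the insns are cracked.  */
void axpy_ptr(double *restrict y, const double *restrict x,
              const double *restrict z, double a, size_t n)
{
    const double *end = x + n;
    while (x != end)
        *y++ = a * *x++ + *z++;
}

/* Shape B (hypothetical example): a single index IV shared by all three
   memory accesses, so only one add per iteration feeds the loads and the
   store (indexed addressing, base + index).  */
void axpy_idx(double *restrict y, const double *restrict x,
              const double *restrict z, double a, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + z[i];
}
```

Both functions compute the same result; the only difference is how many
integer IV updates compete with the FP insns for issue in each iteration.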