Hi Segher,

on 2020/9/4 10:16 PM, Segher Boessenkool wrote:
> Hi!
> 
> On Fri, Sep 04, 2020 at 04:47:37PM +0800, Kewen.Lin wrote:
>>>> Apart from that, one P9 specific point is that the update form load isn't
>>>> preferred,  the reason is that the instruction can not retire until both
>>>> parts complete, it can hold up subsequent instructions from retiring.
>>>> If the addi stalls (starvation), the instruction can not retire and can
>>>> cause things stuck.  It seems also something we can model here?
>>>
>>> This is (almost) no problem on p9, since we no longer have issue groups.
>>> It can hold up older insns from retiring, sure, but they *will* have
>>> finished, and p9 can retire 64 insns per cycle.  The "completion wall"
>>> is gone.  The only problem is if things stick around so long that
>>> resources run out...  but you're talking 100s of insns there.
>>
>> Theoretically it's fine, but the addi starvation was observed in the FP/SIMD
>> instructions intensive loop code, which did cause some worse performance.  :(
> 
> "addi starvation" has nothing to do with addi (it also happens for other
> insns), and nothing with update form memory insns either.  What happens
> is simply that no shorter latency insns are issued by the core so long
> as longer latency insns (like most float insns) are available.  So in
> really nice floating point loops we execute the few integer add insns
> much too late, much later than they were in the machine code, which then
> makes the memory insns late as well, etc.
> 

Yeah, the starvation issue isn't addi specific, but in FP/SIMD-intensive
loops, "addi/add" makes up most or all of the shorter latency insns
most of the time.  So I'd argue that it's related. :)
Since they are mainly for IV updates, the memory insns depend on them,
the FP/SIMD insns depend on the memory insns, ..., so it can easily
cause a chain reaction of stalls.  I guess that's why some people call
it "addi starvation".

As in the example Bin gave in another email, more auto-inc candidates
mean more IV updates (cracked ADDIs).  If one or a few common index IVs
can be shared among the memory insns (fewer ADDIs), we can reduce the
number of shorter latency insns.  As far as I know, some compilers
chose not to prefer auto-inc candidates, which can mitigate the
starvation issue in those FP/SIMD-intensive loops to some extent.
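
To make the contrast concrete, here is a rough source-level sketch in C.
The function names are made up for illustration, and what ivopts really
selects depends on the cost model and the options, so please read it as
an assumption about the two shapes rather than as actual generated code.
The first form gives each stream its own pointer IV (three address
updates per iteration, each an auto-inc candidate); the second shares
one index IV among all the memory accesses (a single integer add).

```c
#include <stddef.h>

/* Form A: every stream has its own pointer IV.  Each pointer bump is a
   candidate for an update-form (auto-inc) access, so the loop carries
   three address updates per iteration.  */
void
triad_ptr_ivs (double *restrict y, const double *restrict x,
               const double *restrict z, double a, size_t n)
{
  const double *end = x + n;
  while (x != end)
    *y++ = a * *x++ + *z++;
}

/* Form B: one shared index IV feeds all three accesses (indexed
   addressing), so only one integer add is needed per iteration.  */
void
triad_shared_iv (double *restrict y, const double *restrict x,
                 const double *restrict z, double a, size_t n)
{
  for (size_t i = 0; i < n; i++)
    y[i] = a * x[i] + z[i];
}
```

Both compute the same result; the only difference is how many short
latency integer updates sit next to the FP work in the loop body.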

BR,
Kewen
