https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101908

--- Comment #43 from Hongtao.liu <crazylht at gmail dot com> ---
One thing I found by experiment:
If I insert 64 "vaddps %xmm18, %xmm19, %xmm20" instructions (with no dependence
between them, just to fill the pipeline) before the stalled load, the STLF
stall case is as fast as the no-stall case on CLX. I guess this is the
"distance" you mean.

Is there any existing structure in GCC from which I can get the latency from
entry to the load instruction? Of course, for a loop with an unknown trip count
the latency can't be estimated exactly. Similarly, when the load is in a
join_bb, I guess we need to calculate an "average" latency over all possible
predecessors?
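
To make the join_bb question concrete, here is a rough sketch of one way such
an "average" could be formed: weight each predecessor's estimated latency by
the probability of arriving through that edge. The struct, its fields, and the
function name are hypothetical and for illustration only; they are not existing
GCC structures or APIs, and the per-edge latencies are assumed to come from
some estimate that is not shown here.

```c
#include <stddef.h>

/* Hypothetical per-predecessor record; NOT a GCC data structure. */
struct pred_edge {
    double probability; /* assumed probability of reaching the join via this edge */
    double latency;     /* assumed latency (cycles) from entry along this path */
};

/* Probability-weighted mean latency at a join block.
   Returns -1.0 when there is no usable estimate (no predecessors,
   or all probabilities are zero). */
double join_latency(const struct pred_edge *preds, size_t n)
{
    double weighted = 0.0;
    double total = 0.0;

    for (size_t i = 0; i < n; i++) {
        weighted += preds[i].probability * preds[i].latency;
        total += preds[i].probability;
    }
    return total > 0.0 ? weighted / total : -1.0;
}
```

With two equally likely predecessors at 10 and 30 cycles this gives 20 cycles.
Whether a mean is the right choice is open: if a short path is taken often
enough to hit the stall, the minimum over predecessors might be the safer
number to compare against the "distance".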
