https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101908
--- Comment #43 from Hongtao.liu <crazylht at gmail dot com> ---
One thing I found by experiment: after inserting 64 copies of vaddps %xmm18, %xmm19, %xmm20 (with no dependences between them, just to emulate work in the pipeline) before the stalled load, the STLF stall case is as fast as the no-stall case on CLX. I guess this is the "distance" you mean.

Is there any existing structure in GCC from which I can get the latency from entry to the load instruction? Of course, for a loop with an unknown trip count the latency can't be estimated exactly. Similarly, when the load is in a join_bb, I guess we need to calculate an "average" latency over all possible predecessors?
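
To make the "average" latency idea concrete, here is a minimal sketch, not GCC code. It assumes an acyclic CFG whose blocks are already in topological order, a per-block latency estimate in cycles, and a probability on each incoming edge. Block, Edge and expected_entry_latency are hypothetical names invented for this illustration; they are not GCC's basic_block/edge structures, and loops are deliberately not handled.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified CFG types for illustration only.
// These are NOT GCC's basic_block / edge.
struct Edge {
    int src;      // index of the predecessor block
    double prob;  // probability that this edge is taken when leaving src
};

struct Block {
    double latency;           // assumed estimate: cycles spent in the block body
    std::vector<Edge> preds;  // incoming edges
};

// Expected latency from the start of the entry block (index 0) to the start
// of each block. Assumes an acyclic CFG in topological order, so every
// predecessor has a smaller index than its successor.
std::vector<double> expected_entry_latency(const std::vector<Block>& blocks) {
    std::vector<double> freq(blocks.size(), 0.0);  // probability of reaching block
    std::vector<double> dist(blocks.size(), 0.0);  // expected cycles on arrival
    if (blocks.empty()) return dist;
    freq[0] = 1.0;
    for (std::size_t i = 1; i < blocks.size(); ++i) {
        double weighted = 0.0;
        for (const Edge& e : blocks[i].preds) {
            assert(e.src >= 0 && static_cast<std::size_t>(e.src) < i);
            // Weight of this path: chance of reaching src, then taking the edge.
            double w = freq[e.src] * e.prob;
            freq[i] += w;
            weighted += w * (dist[e.src] + blocks[e.src].latency);
        }
        // Average over predecessors, weighted by how often each one is taken.
        dist[i] = freq[i] > 0.0 ? weighted / freq[i] : 0.0;
    }
    return dist;
}
```

For a diamond where the entry block costs 4 cycles, one arm costs 64 cycles and is taken 25% of the time, and the other arm costs 8 cycles, the join block gets 0.25 * (4 + 64) + 0.75 * (4 + 8) = 26 cycles. A pass could compare such a number against some distance threshold for the load; where that threshold should sit is an open question here.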