http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56935
--- Comment #5 from rguenther at suse dot de <rguenther at suse dot de> 2013-04-16 07:48:47 UTC ---
On Mon, 15 Apr 2013, ysrumyan at gmail dot com wrote:

> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56935
>
> --- Comment #4 from Yuri Rumyantsev <ysrumyan at gmail dot com> 2013-04-15 14:54:50 UTC ---
> Richard,
>
> both subq's access the same cache line, which means that after the 1st
> store the 2nd load will stall until the data cache update finishes (this
> is not an exact explanation, but if you'd like I can find a stronger and
> more correct definition of the memory conflict). As a result the
> non-vectorizable code will run much slower, and we saw such a slowdown
> on 253.perl from cpu2000.

I fear this is beyond the scope of the vectorizer cost model in its
current form. Clearly what it computes is correct if the cost is defined
as a sum of individual stmt costs (which is how the scalar cost is
computed). The vectorizer cost model now gives the target the power to
look at the whole vectorized sequence and compute something better than
the sum of the individual vectorized stmt costs, but currently the x86
target does not use this power.

Factoring in the instruction cache, it's still not clear that a possible
extra cache miss for ifetch is worth avoiding the "stall" due to the
store forwarding issue. (btw, your explanation looks odd - there is no
dependency between the two - yes, if they share the same slot at the
store buffer's granularity then maybe a failed store forward (due to
_no_ dependency) may cause that issue?)

Note that for basic-block vectorization we want to keep an even closer
eye on code size, and the same cost model is used for basic-block and
loop SLP.

Richard.
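
To make the distinction between the two kinds of cost concrete, here is a minimal toy sketch in C. It is not GCC code and does not use GCC's target hooks: `struct stmt`, `sum_cost`, `sequence_cost`, and the `stall_penalty` parameter are all hypothetical names invented for illustration, and the "store followed by a load from the same cache line" rule is only a crude stand-in for the conflict described in comment #4.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical statement kinds; these are NOT GCC's own enums. */
enum stmt_kind { STMT_LOAD, STMT_STORE, STMT_ARITH };

/* Hypothetical per-statement record used only by this sketch. */
struct stmt {
    enum stmt_kind kind;
    unsigned cache_line; /* line touched by a load/store; unused for arith */
    int cost;            /* individual statement cost */
};

/* Cost defined as the plain sum of individual statement costs. */
int sum_cost(const struct stmt *stmts, size_t n)
{
    assert(n == 0 || stmts != NULL);
    int total = 0;
    for (size_t i = 0; i < n; i++)
        total += stmts[i].cost;
    return total;
}

/* Cost computed by looking at the whole sequence: start from the plain
   sum, then add an assumed penalty for every load that immediately
   follows a store to the same cache line. */
int sequence_cost(const struct stmt *stmts, size_t n, int stall_penalty)
{
    int total = sum_cost(stmts, n);
    for (size_t i = 1; i < n; i++) {
        if (stmts[i - 1].kind == STMT_STORE
            && stmts[i].kind == STMT_LOAD
            && stmts[i - 1].cache_line == stmts[i].cache_line)
            total += stall_penalty;
    }
    return total;
}
```

For two back-to-back load/subtract/store groups on one cache line, `sum_cost` sees only six unit-cost statements, while `sequence_cost` additionally charges the assumed penalty for the second load. Whether such a penalty is realistic, and how large it should be, is exactly the open question in this thread.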