https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120234
--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
r16-869-ge3d3d6d7d2c8ab has added

+      /* For integer construction, the number of actual GPR -> XMM
+         moves will be somewhere between 0 and n.
+         We do not have very good idea about actual number, since
+         the source may be a constant, memory or a chain of
+         instructions that will be later converted by
+         scalar-to-vector pass.  */
+      if (kind == vec_construct
+          && GET_MODE_BITSIZE (mode) == 256)
+        cost *= 2;
+      else if (kind == vec_construct
+               && GET_MODE_BITSIZE (mode) == 512)
+        cost *= 3;

which doesn't make much sense - we are tracking an upper bound on the number
of GPR to XMM moves.  Multiplying by 2 or 3 for each unique scalar element
source doesn't make any sense.  Of course for this bug we're dealing with SSE
vector sizes, which are not affected by this.

I cannot find a mail to gcc-patches for this change.

The overall change made the core cost for CTORs cheaper (from addss to
sse_op), so maybe this is to compensate.  But we're also using
COSTS_N_INSNS (ix86_cost->integer_to_sse) / 2 instead of the
ix86_cost->sse_to_integer cost, which duplicates costs, but we do not apply
that for ix86_cost->sse_op.  The integer to SSE cost is then effectively 12
here, the same cost as a load from memory, while the insertion cost is 4.

I'll note it's all moot for the testcase since eventually we look for what
forwprop does later but what the backend refuses to let the vectorizer do.

I think the costing works as intended, but the actual numbers seem quite off
(but the r16-869-ge3d3d6d7d2c8ab commit indicated already that they are).
It is IMO quite difficult to improve the high-level things if the low-level
(the target costing) is off.

I'll propose partial reversion of r16-869-ge3d3d6d7d2c8ab, but that doesn't
help here.  Best would be to not kneecap the vectorizer with the
TARGET_MMX_WITH_SSE guardrail here.  Or decide we simply don't care about
-m32 and XFAIL there.
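
For illustration only, the arithmetic above can be sketched in standalone C. This is a simplified model, not the i386 backend code: COSTS_N_INSNS mirrors the scaling in GCC's rtl.h (one insn is 4 cost units), while ASSUMED_INTEGER_TO_SSE = 6 and the helper names gpr_to_xmm_cost and scale_vec_construct are hypothetical, with the value chosen only so that the result matches the 12 quoted above.

```c
/* Same scaling as GCC's COSTS_N_INSNS in rtl.h: one insn == 4 units.  */
#define COSTS_N_INSNS(n) ((n) * 4)

/* Assumption: tuning-table value picked so the result matches the
   "effectively 12" quoted in the comment; not read from any real
   processor_costs table.  */
enum { ASSUMED_INTEGER_TO_SSE = 6 };

/* Per-element GPR -> XMM charge, COSTS_N_INSNS (integer_to_sse) / 2.  */
int
gpr_to_xmm_cost (void)
{
  return COSTS_N_INSNS (ASSUMED_INTEGER_TO_SSE) / 2;
}

/* Hypothetical helper applying the size-dependent multiplier from the
   quoted hunk to an already computed vec_construct cost.  */
int
scale_vec_construct (int cost, int mode_bits)
{
  if (mode_bits == 256)
    return cost * 2;
  if (mode_bits == 512)
    return cost * 3;
  return cost;
}
```

Under these assumptions, an 8-element 256-bit construction whose upper bound is 8 moves costs 8 * gpr_to_xmm_cost () = 96, and scale_vec_construct (96, 256) turns that into 192, i.e. the price of 16 moves for at most 8 scalar sources.  128-bit (SSE) modes pass through unchanged.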
