https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122095
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Target|X86_64 |x86_64-*-*
--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Hongtao Liu from comment #3)
> There could be STLF issue if there's following 16-byte load from s1. (it's
> expensive to have a global view of how s1 is used, but since the type is
> _m128i, there's probably 16-byte load for it).
I agree that GCC's emitted code is "safer" in this regard when it's not on a
latency-critical path.
It's definitely a missed optimization for -Os and possibly when you put
this into a loop over an array of __m128i.
We might want to look into lowering _mm_insert_epi8 and friends to GIMPLE
(or implement the intrinsics in terms of the vector extension).
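
To illustrate the second option, here is a minimal sketch of a byte insert
written with the generic vector extension rather than a target builtin, plus
the loop-over-an-array case mentioned above. All names (v16qi_sketch,
insert_epi8_sketch, insert_all_sketch) are hypothetical and the lane index 3
is arbitrary; this is not how the intrinsic headers are currently written,
only what such a definition could look like.

```c
#include <stddef.h>

/* 16 one-byte lanes via the GNU vector extension.  The typedef name is
   hypothetical, not the one used inside GCC's intrinsic headers.  */
typedef signed char v16qi_sketch __attribute__ ((vector_size (16)));

/* Hypothetical stand-in for _mm_insert_epi8: replace lane IDX (0..15)
   of V with the low byte of X, using plain vector subscripting so the
   operation is visible to the middle end as an element store.  */
static inline v16qi_sketch
insert_epi8_sketch (v16qi_sketch v, int x, int idx)
{
  v[idx & 15] = (signed char) x;
  return v;
}

/* The loop case: insert X into the same lane of every element of an
   array of 16-byte vectors.  */
static void
insert_all_sketch (v16qi_sketch *s, size_t n, int x)
{
  for (size_t i = 0; i < n; i++)
    s[i] = insert_epi8_sketch (s[i], x, 3);
}
```

With the insert expressed this way, whether to emit a single byte store or a
full load/insert/store sequence becomes a decision the optimizers can make
per context (for example -Os versus a following 16-byte load that would hit
the STLF concern).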