https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93039
--- Comment #3 from Alexander Monakov <amonakov at gcc dot gnu.org> ---

> The question is for which CPUs is it actually faster to use SSE?

In the context of chains where the source and the destination need to be SSE registers, pretty much all CPUs?

Inter-unit moves typically have some latency, e.g. recent AMD (since Zen) and Intel (Skylake) have latency 3 for sse<->gpr moves (surprisingly though four generations prior to Skylake had latency 1). Older AMDs with shared fpu had even worse latencies. At the same time SSE integer ops have comparable latencies and throughput to gpr ones, so generally moving a chain to SSE ops isn't making it slower. Plus it helps with register pressure.

When either the source or the destination of a chain is bound to a general register or memory, it's ok to continue doing it on general regs.
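
For illustration only, here is a minimal sketch of the kind of chain meant above: a float whose bits are manipulated as an integer, so both the source and the destination naturally live in an SSE register. The function names are hypothetical, and the code assumes an x86 target with SSE2. The first function leaves instruction selection to the compiler, which may or may not do the `and` in a general register via two sse<->gpr moves, depending on version and flags. The second spells the same operation with SSE integer intrinsics, so the value is only ever handled as an SSE vector.

```c
#include <emmintrin.h>  /* SSE2 intrinsics; assumes an x86 target with SSE2 */
#include <stdint.h>
#include <string.h>

/* Float-as-int manipulation in plain C: clear the sign bit.
   The compiler is free to do the AND in a general register
   (xmm -> gpr, and, gpr -> xmm) or to keep it in an SSE register. */
float clear_sign_scalar(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);   /* reinterpret float bits as integer */
    bits &= UINT32_C(0x7fffffff);     /* drop the sign bit */
    memcpy(&x, &bits, sizeof x);      /* reinterpret back as float */
    return x;
}

/* The same chain written with SSE integer ops, so the source code
   never asks for the value in a general register. */
float clear_sign_sse(float x)
{
    __m128i v = _mm_castps_si128(_mm_set_ss(x));       /* x in lane 0, rest zero */
    v = _mm_and_si128(v, _mm_set1_epi32(0x7fffffff));  /* integer AND in SSE */
    return _mm_cvtss_f32(_mm_castsi128_ps(v));         /* lane 0 back as float */
}
```

Both functions compute the same result; comparing the assembly a given compiler emits for each (for example with `-O2 -S`) shows whether the first one takes the round trip through a general register.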