On Thu, Mar 21, 2024 at 12:09:44PM -0500, Nathan Bossart wrote:
> On Thu, Mar 21, 2024 at 11:30:30AM +0700, John Naylor wrote:
>> Further, now that the algorithm is more SIMD-appropriate, I wonder
>> what doing 4 registers at a time is actually buying us for either SSE2
>> or AVX2. It might just be a matter of scale, but that would be good to
>> understand.
>
> I'll follow up with these numbers shortly.

It looks like the 4-register code still outperforms the 2-register code,
except for a handful of cases where there aren't many elements.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com