On Friday, 3 November 2023 at 15:11:31 UTC, Bogdan wrote:
Can anyone help me to understand what I am missing?
Your loop is likely dominated by the sin() calls, and the rest of the loop is too simple for hand-written intrinsics to beat what the compiler already generates.
What you could do is use the intrinsics to implement a _mm_sin_ps that computes four sines at once; then you'll see an improvement at scale.
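To give an idea of what a four-wide sine could look like, here is a rough sketch in C using only SSE2 intrinsics. The names sin4_ps and sin_array4 are made up for illustration, and the approximation (range reduction to [-pi/2, pi/2] followed by a Taylor polynomial through r^11) is just one simple choice, not a tuned implementation. It assumes inputs of moderate magnitude; accuracy degrades for large |x| because the range reduction is done in single precision.

```c
#include <assert.h>
#include <stddef.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: approximate sin() for four floats at once.
   Assumes |x| is moderate (x / pi must fit comfortably in an int32). */
static inline __m128 sin4_ps(__m128 x)
{
    const __m128 inv_pi = _mm_set1_ps(0.318309886f);
    const __m128 pi     = _mm_set1_ps(3.14159265f);

    /* k = round(x / pi), so r = x - k*pi lies in about [-pi/2, pi/2]. */
    __m128i k  = _mm_cvtps_epi32(_mm_mul_ps(x, inv_pi));
    __m128  kf = _mm_cvtepi32_ps(k);
    __m128  r  = _mm_sub_ps(x, _mm_mul_ps(kf, pi));

    /* Taylor series of sin(r) through r^11, evaluated with Horner's rule. */
    __m128 r2 = _mm_mul_ps(r, r);
    __m128 p  = _mm_set1_ps(-1.0f / 39916800.0f);                       /* -1/11! */
    p = _mm_add_ps(_mm_mul_ps(p, r2), _mm_set1_ps( 1.0f / 362880.0f));  /*  1/9!  */
    p = _mm_add_ps(_mm_mul_ps(p, r2), _mm_set1_ps(-1.0f / 5040.0f));    /* -1/7!  */
    p = _mm_add_ps(_mm_mul_ps(p, r2), _mm_set1_ps( 1.0f / 120.0f));     /*  1/5!  */
    p = _mm_add_ps(_mm_mul_ps(p, r2), _mm_set1_ps(-1.0f / 6.0f));       /* -1/3!  */
    p = _mm_add_ps(_mm_mul_ps(p, r2), _mm_set1_ps( 1.0f));
    p = _mm_mul_ps(p, r);

    /* sin(x) = (-1)^k * sin(r): flip the sign bit in lanes where k is odd. */
    __m128i sign = _mm_slli_epi32(_mm_and_si128(k, _mm_set1_epi32(1)), 31);
    return _mm_xor_ps(p, _mm_castsi128_ps(sign));
}

/* Hypothetical driver: apply sin4_ps to an array, four elements per step. */
static void sin_array4(const float *in, float *out, size_t n)
{
    assert(n % 4 == 0);  /* sketch only: no scalar tail handling */
    for (size_t i = 0; i < n; i += 4)
        _mm_storeu_ps(out + i, sin4_ps(_mm_loadu_ps(in + i)));
}
```

The point is that the whole sine is computed in vector registers, so the rest of your loop can stay in SIMD instead of dropping back to one scalar sin() call per element.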