On Wednesday, 19 August 2015 at 10:08:48 UTC, ponce wrote:
Even in video codecs, AVX2 is not that useful: it barely brings a 10% improvement over SSE, and you have to be extra careful with the SSE-AVX transition penalty. And to reap this benefit you would have to write intrinsics or assembly.

Masked-out AVX instructions are turned into NOPs, so you can remove conditionals from inner loops. The performance of new instructions also tends to improve generation by generation.
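
To illustrate removing a conditional from an inner loop, here is a minimal scalar sketch in C. It is not AVX code and uses no intrinsics; it only mimics the way SIMD comparisons produce an all-ones or all-zeros mask per lane. The function names and the "add where positive" operation are made up for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: add delta[i] to x[i] only where x[i] is positive.
   Branchy form, with a conditional inside the inner loop. */
void add_where_positive_branchy(int32_t *x, const int32_t *delta, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        if (x[i] > 0)
            x[i] += delta[i];
    }
}

/* Masked form: the comparison becomes an all-ones (-1) or all-zeros (0)
   mask, and the masked-off elements simply add 0. There is no conditional
   in the loop body, which is the shape a vectorizer or hand-written SIMD
   code would use. */
void add_where_positive_masked(int32_t *x, const int32_t *delta, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        int32_t mask = -(int32_t)(x[i] > 0); /* -1 if true, 0 if false */
        x[i] += delta[i] & mask;
    }
}
```

Both functions should produce the same result for inputs that do not overflow; whether the masked form is actually faster depends on the compiler and the target.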

For AVX-512 I can't even imagine what to use such large registers for. Larger registers => more spilling because of calling conventions, and more fiddling around with complicated shuffle instructions. There are steeply diminishing returns with increasing register size.

You have to plan your data layout, which is why libraries should target it, so end users don't have to think too much about it. If your computations are trivial, then you are essentially memory I/O limited. SoA processing, such as mapping a pure function over a collection of arrays, isn't really limited by shuffling.
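
As a rough sketch of what SoA (structure of arrays) processing looks like, here is a pure function mapped over a collection of arrays in plain C. The `Vec3SoA` type and the function names are hypothetical, chosen only for illustration. Each component lives in its own contiguous array, so the loop reads consecutive elements and needs no shuffles to gather its operands.

```c
#include <stddef.h>

/* Hypothetical SoA layout: one contiguous array per component,
   instead of an array of {x, y, z} structs. */
typedef struct {
    const float *x;
    const float *y;
    const float *z;
    size_t len;
} Vec3SoA;

/* A pure function of its arguments: no side effects, no shared state. */
static inline float squared_length(float x, float y, float z)
{
    return x * x + y * y + z * z;
}

/* Map the pure function over the arrays. Element i depends only on
   element i of each input, so consecutive iterations are independent
   and can be processed several at a time. */
void map_squared_length(const Vec3SoA *v, float *out)
{
    for (size_t i = 0; i < v->len; ++i)
        out[i] = squared_length(v->x[i], v->y[i], v->z[i]);
}
```

With so little arithmetic per element, a loop like this is likely to be bound by memory bandwidth rather than by the width of the registers.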
