On Wednesday, 19 August 2015 at 10:16:18 UTC, Ola Fosheim Grøstad
wrote:
On Wednesday, 19 August 2015 at 10:08:48 UTC, ponce wrote:
Even in video codecs, AVX2 is not that useful and barely brings
a 10% improvement over SSE, and only if you are extra careful
about the SSE-AVX transition penalty. And to reap this benefit
you would have to write intrinsics/assembly.
Masked AVX instructions are turned into NOPs. So you can remove
conditionals from inner loops. Performance of new instructions
tends to improve generation by generation.
Loops in video coding already have no conditionals. And for the
ones that do, the conditionals were already removable with
existing instructions.
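
To illustrate the kind of conditional removal being discussed, here is a minimal sketch in portable C, not taken from any real codec. The function name select_greater_u8 and the choice of 8-bit samples are assumptions made for the example. The comparison becomes an all-ones or all-zeros mask, which is the scalar analogue of an SSE2 compare followed by and / andnot / or:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: out[i] = (a[i] > b[i]) ? a[i] : b[i],
   written without a branch in the loop body. */
void select_greater_u8(const uint8_t *a, const uint8_t *b,
                       uint8_t *out, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        /* (a[i] > b[i]) is 0 or 1; negating and truncating gives
           0x00 or 0xFF, i.e. a per-element select mask. */
        uint8_t mask = (uint8_t)-(a[i] > b[i]);

        /* Keep a[i] where the mask is set, b[i] where it is clear. */
        out[i] = (uint8_t)((a[i] & mask) | (b[i] & (uint8_t)~mask));
    }
}
```

Whether a compiler turns this into packed compare and logic instructions depends on the target and flags; the point is only that the select needs no branch and no masked-execution support.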
For AVX-512 I can't even imagine what to use such large
registers for. Larger registers => more spilling because of
calling conventions, and more fiddling around with complicated
shuffle instructions. There are steeply diminishing returns
with increasing register size.
You have to plan your data layout. Which is why libraries
should target it, so end users don't have to think too much
about it. If your computations are trivial, then you are
essentially memory I/O limited. SOA processing, such as mapping
a pure function over a collection of arrays, isn't really
limited by shuffling.
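
As a rough sketch of what "mapping a pure function over a collection of arrays" can look like, assuming a hypothetical Vec3SoA layout and a made-up length_squared function (neither comes from the thread): each component lives in its own contiguous array, so every element is computed from same-index loads and no shuffling between lanes is needed.

```c
#include <stddef.h>

/* Hypothetical structure-of-arrays layout: one array per component. */
typedef struct {
    const float *x;
    const float *y;
    const float *z;
    size_t count;
} Vec3SoA;

/* Pure function: the result depends only on its arguments. */
static inline float length_squared(float x, float y, float z)
{
    return x * x + y * y + z * z;
}

/* Map length_squared over all elements. Each iteration reads the
   same index from three separate arrays, so the loop is a plain
   element-wise operation that a compiler may vectorize. */
void soa_length_squared(const Vec3SoA *v, float *out)
{
    for (size_t i = 0; i < v->count; ++i)
        out[i] = length_squared(v->x[i], v->y[i], v->z[i]);
}
```

With a function this trivial, the loop is likely to be bound by memory bandwidth rather than by arithmetic, which is the caveat raised just above.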
I stand by what I know and measured: precious few things are
sped up by AVX-xxx. It is almost always better to invest this
time optimizing somewhere else.