> While not part of this change, the unrolled loops look as though
> they just destroy the cpu cache.
> I'd like to be convinced that anything does CRC over long enough buffers
> to make it a gain at all.
btrfs data checksumming is one area.
> With modern (not that modern now) superscalar cpus you can often
> get the loop instructions 'for free'.
A branch on POWER8 is a three-cycle redirect. The vpmsum instructions
are 6 cycles.
> Sometimes pipelining the loop is needed to get full throughput.
> Unlike the IP checksum, you don't even have to 'loop carry' the
> cpu carry flag.
It went through quite a lot of simulation to reach peak performance.
The loop is quite delicate: we have to pace it just right to avoid
some pipeline reject conditions.
Note also that we already modulo schedule the loop across three
iterations, which is required to hide the latency of the vpmsum
instructions.
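
For anyone unfamiliar with the term, here is a rough scalar sketch of
what modulo scheduling across three iterations looks like. It is not the
actual POWER8 code: slow_op(), fold_simple() and fold_pipelined() are
hypothetical names, and a plain multiply stands in for a long-latency
instruction such as vpmsum. Each trip through the steady-state loop works
on three different elements at once, so the result of the long-latency
step is not needed until the following trip.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a long-latency operation (NOT vpmsum). */
static uint64_t slow_op(uint64_t x)
{
	return x * 0x9E3779B97F4A7C15ULL;
}

/* Reference loop: load, operate and accumulate in the same iteration. */
uint64_t fold_simple(const uint64_t *buf, size_t n)
{
	uint64_t acc = 0;
	size_t i;

	for (i = 0; i < n; i++)
		acc ^= slow_op(buf[i]);
	return acc;
}

/*
 * Same computation, software pipelined over three stages:
 *   stage 1: load element i
 *   stage 2: slow_op() on element i-1
 *   stage 3: accumulate the result for element i-2
 */
uint64_t fold_pipelined(const uint64_t *buf, size_t n)
{
	uint64_t acc = 0;
	uint64_t loaded, product;
	size_t i;

	/* Too short to fill the pipeline: fall back to the simple loop. */
	if (n < 2)
		return fold_simple(buf, n);

	/* Prologue: fill the pipeline with elements 0 and 1. */
	product = slow_op(buf[0]);
	loaded = buf[1];

	/* Steady state: three elements are in flight on every trip. */
	for (i = 2; i < n; i++) {
		acc ^= product;			/* stage 3, element i-2 */
		product = slow_op(loaded);	/* stage 2, element i-1 */
		loaded = buf[i];		/* stage 1, element i   */
	}

	/* Epilogue: drain the last two elements. */
	acc ^= product;
	acc ^= slow_op(loaded);
	return acc;
}
```

Both functions return the same value for any input. The pipelined form
only changes which iteration each stage runs in, which is what gives the
hardware independent work to overlap with the long-latency step.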