On Mon, Apr 22, 2013 at 9:04 PM, Florian Pflug <f...@phlo.org> wrote:
> The one downside of the fnv1+shift approach is that it's built around
> the assumption that processing 64 bytes at once is the sweet spot. That
> might be true for x86 and x86_64 today, but it won't stay that way for
> long, and quite surely isn't true for other architectures. That doesn't
> necessarily rule it out, but it certainly weakens the argument that
> slipping it into 9.3 avoids having to change the algorithm later...

The version as tested actually processes 128 bytes at once. The ideal
shape depends on multiplication latency, multiplication throughput and
the number of registers available. Specifically,
BLCKSZ/mul_throughput_in_bytes needs to be larger than
BLCKSZ/(N_SUMS*sizeof(uint32))*(mul latency + 2*xor latency). For the
latest Intel processors the values are 8192/16 = 512 and
8192/(32*4)*(5 + 2*1) = 448. 128 bytes is also 8 registers, which is
the highest power of two fitting into the architectural registers (16).
This means that the value chosen is indeed the sweet spot for x86
today. For future processors we can expect the multiplication width to
increase, and possibly the latency too, shifting the sweet spot towards
higher widths. In fact, Haswell, coming out later this year, should have
AVX2 instructions that introduce integer ops on 256-bit registers,
making the current choice already suboptimal.
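
To make the arithmetic above easy to redo for other processors, here is
a minimal sketch of the comparison. The struct and function names are
hypothetical and exist only for this illustration, and reading both
sides of the inequality as cycle counts per block is my assumption. The
only numbers used are the ones quoted above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical parameter bundle; the field names are illustrative only. */
typedef struct
{
    unsigned block_size;          /* BLCKSZ, bytes per page */
    unsigned n_sums;              /* N_SUMS, number of parallel 32-bit sums */
    unsigned mul_bytes_per_cycle; /* multiplication throughput in bytes */
    unsigned mul_latency;         /* multiplication latency in cycles */
    unsigned xor_latency;         /* xor latency in cycles */
} WidthModel;

/* Left-hand side: BLCKSZ / mul_throughput_in_bytes. */
static unsigned
throughput_bound(const WidthModel *m)
{
    assert(m->mul_bytes_per_cycle > 0);
    return m->block_size / m->mul_bytes_per_cycle;
}

/*
 * Right-hand side:
 * BLCKSZ / (N_SUMS * sizeof(uint32)) * (mul latency + 2 * xor latency).
 */
static unsigned
latency_bound(const WidthModel *m)
{
    unsigned rounds;

    assert(m->n_sums > 0);
    rounds = m->block_size / (m->n_sums * (unsigned) sizeof(uint32_t));
    return rounds * (m->mul_latency + 2 * m->xor_latency);
}

/* The width is wide enough when the throughput bound is the larger one. */
static int
width_is_sufficient(const WidthModel *m)
{
    return throughput_bound(m) > latency_bound(m);
}
```

Filling the struct with 8192, 32, 16, 5 and 1 reproduces the 512 and
448 figures given above, so the condition holds for N_SUMS = 32.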

All that said, having a lower width won't make the algorithm slower on
future processors; it will just leave some parallelism on the table
that could be used to make it even faster. The line in the sand needed
to be drawn somewhere, and I chose the maximum comfortable width today,
fearing that even that would be shot down based on code size.
Coincidentally, 32 elements is also the internal parallelism that GPUs
have settled on. We could bump the width up by one notch to buy some
future safety, but after that I'm skeptical we will see any
conventional processors that would benefit from a higher width. I just
tested, and the auto-vectorized version runs at basically identical
speed, as GCC's inability to do good register allocation means that it
juggles values between registers and the stack one way or the other.
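
For readers who have not looked at the patch, the "width" discussed
here is the number of independent sums the inner loop carries. Below is
a rough sketch of that loop shape, not the patch itself: the shift
amount, the seed values, the final xor-fold and the function names are
all assumptions made for illustration. Only N_SUMS = 32 and the
xor/multiply/xor dependency chain come from the discussion above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define N_SUMS    32            /* lanes; 32 * 4 bytes = 128 bytes per round */
#define FNV_PRIME 16777619u     /* the standard 32-bit FNV prime */
#define MIX_SHIFT 17            /* assumed shift amount, illustrative only */

/* One lane step: xor in the input, multiply, xor the high bits back in. */
static inline uint32_t
mix(uint32_t sum, uint32_t value)
{
    uint32_t tmp = sum ^ value;

    return (tmp * FNV_PRIME) ^ (tmp >> MIX_SHIFT);
}

static uint32_t
checksum_sketch(const unsigned char *data, size_t len)
{
    const size_t round_bytes = N_SUMS * sizeof(uint32_t);
    uint32_t     sums[N_SUMS];
    uint32_t     result = 0;

    /* The sketch only handles whole rounds, as a page-sized input would be. */
    assert(len % round_bytes == 0);

    /* Arbitrary distinct seeds (an assumption, not taken from the patch). */
    for (int j = 0; j < N_SUMS; j++)
        sums[j] = FNV_PRIME * (uint32_t) (j + 1);

    for (size_t off = 0; off < len; off += round_bytes)
    {
        uint32_t words[N_SUMS];

        memcpy(words, data + off, round_bytes);

        /* The lanes are independent, so this loop can be vectorized. */
        for (int j = 0; j < N_SUMS; j++)
            sums[j] = mix(sums[j], words[j]);
    }

    /* Fold the lanes together; a plain xor is used here for brevity. */
    for (int j = 0; j < N_SUMS; j++)
        result ^= sums[j];

    return result;
}
```

Raising the width by one notch would correspond to doubling N_SUMS in a
loop of this shape, which is also why the inner loop grows with it.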

So to recap, I don't know of any CPUs where a lower value would be
better. Raising the width by one notch would mean better performance
on future processors, but raising it further would just bloat the size
of the inner loop without much benefit in sight.

Ants Aasma

Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de

Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)