On Wed, Apr 17, 2013 at 6:54 PM, Florian Pflug <f...@phlo.org> wrote:
> On Apr17, 2013, at 16:47 , Ants Aasma <a...@cybertec.at> wrote:
>> This made me remember, the issue I had was with high order bits, not
>> with low order ones, somehow I got them confused. The exact issue is
>> that the high order bits don't affect any bit lower than them. It's
>> easy to see that if you remember the shift and add nature of multiply.
>> Unfortunately XOR will not fix that. Neither will adding an offset
>> basis. This is the fundamental thing that is behind the not-so-great
>> uncorrelated bit error detection rate.
> Right. We could maybe fix that by extending the update step to
>   tmp = s[j] ^ d[i,j]
>   s[j] = (t * PRIME) ^ (t >> 1)
> or something like that. Shifting t instead of (t * PRIME) should
> help to reduce the performance impact, since a reordering CPU should
> be able to parallelize the multiple and the shift. Note though that
> I haven't really though that through extensively - the general idea
> should be sound, but whether 1 is a good shifting amount I do not
> know.

I was thinking about something similar too. The big issue here is that
the parallel checksums already hide each other latencies effectively
executing one each of movdqu/pmullw/paddw each cycle, that's why the
N_SUMS adds up to 128 bytes not 16 bytes.

I went ahead and coded up both the parallel FNV-1a and parallel FNV-1a
+ srl1-xor variants and ran performance tests and detection rate tests
on both.

Performance results:
Mul-add checksums: 12.9 bytes/s
FNV-1a checksums: 13.5 bytes/s
FNV-1a + srl-1: 7.4 bytes/s

Detection rates:
False positive rates:
                 Add-mul       FNV-1a     FNV-1a + srl-1
Single bit flip: 1:inf         1:129590   1:64795
Double bit flip: 1:148         1:511      1:53083
Triple bit flip: 1:673         1:5060     1:61511
  Quad bit flip: 1:1872        1:19349    1:68320
Write 0x00 byte: 1:774538137   1:118776   1:68952
Write 0xFF byte: 1:165399500   1:137489   1:68958
  Partial write: 1:59949       1:71939    1:89923
  Write garbage: 1:64866       1:64980    1:67732
Write run of 00: 1:57077       1:61140    1:59723
Write run of FF: 1:63085       1:59609    1:62977

Test descriptions:
N bit flip: picks N random non-overlapping bits and flips their value.
Write X byte: overwrites a single byte with X.
Partial write: picks a random cut point, overwrites everything from
there to end with 0x00.
Write garbage/run of X: picks two random cut points and fills
everything in between with random values/X bytes.

So adding in the shifted value nearly cuts the performance in half. I
think that by playing with the instruction order I might coax the CPU
scheduler to schedule the instructions better, but even in best case
it will be somewhat slower. The point to keep in mind that even this
slower speed is still faster than hardware accelerated CRC32, so all
in all the hit might not be so bad. The effect on false positive rates
for double bit errors is particularly impressive. I'm now running a
testrun that shift right by 13 to see how that works out, intuitively
it should help dispersing the bits a lot faster.

>> I wonder if we use 32bit FNV-1a's (the h = (h^v)*p variant) with
>> different offset-basis values, would it be enough to just XOR fold the
>> resulting values together. The algorithm looking like this:
> Hm, this will make the algorithm less resilient to some particular
> input permutations (e.g. those which swap the 64*i-th and the (64+1)-ith
> words), but those seem very unlikely to occur randomly. But if we're
> worried about that, we could use your linear combination method for
> the aggregation phase.

I don't think it significantly reduces resilience to permutations
thanks to using different basis offsets and multiply not distributing
over xor.

>> Speaking against this option is the fact that we will need to do CPU
>> detection at startup to make it fast on the x86 that support SSE4.1,
>> and the fact that AMD CPUs before 2011 will run it an order of
>> magnitude slower (but still faster than the best CRC).
> Hm, CPU detection isn't that hard, and given the speed at which Intel
> currently invents new instructions we'll end up going that route sooner
> or later anyway, I think.

Sure it's not that hard but it does have an order of magnitude more
design decisions than #if defined(__x86_64__). Maybe a first stab
could avoid a generic infrastructure and just have the checksum
function as a function pointer, with the default "trampoline"
implementation running a cpuid and overwriting the function pointer
with either the optimized or generic versions and then calling it.

>> Any opinions if it would be a reasonable tradeoff to have a better
>> checksum with great performance on latest x86 CPUs and good
>> performance on other architectures at the expense of having only ok
>> performance on older AMD CPUs?
> The loss on AMD is offset by the increased performance on machines
> where we can't vectorize, I'd say.

+1 Old AMD machines won't soon be used by anyone caring about
performance, where a lousy checksum algorithm will stick around for a

Ants Aasma
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de

Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:

Reply via email to