On Wed, Apr 17, 2013 at 6:54 PM, Florian Pflug <f...@phlo.org> wrote: > On Apr17, 2013, at 16:47 , Ants Aasma <a...@cybertec.at> wrote: >> This made me remember, the issue I had was with high order bits, not >> with low order ones, somehow I got them confused. The exact issue is >> that the high order bits don't affect any bit lower than them. It's >> easy to see that if you remember the shift and add nature of multiply. >> Unfortunately XOR will not fix that. Neither will adding an offset >> basis. This is the fundamental thing that is behind the not-so-great >> uncorrelated bit error detection rate. > > Right. We could maybe fix that by extending the update step to > > tmp = s[j] ^ d[i,j] > s[j] = (t * PRIME) ^ (t >> 1) > > or something like that. Shifting t instead of (t * PRIME) should > help to reduce the performance impact, since a reordering CPU should > be able to parallelize the multiple and the shift. Note though that > I haven't really though that through extensively - the general idea > should be sound, but whether 1 is a good shifting amount I do not > know.
I was thinking about something similar too. The big issue here is that the parallel checksums already hide each other latencies effectively executing one each of movdqu/pmullw/paddw each cycle, that's why the N_SUMS adds up to 128 bytes not 16 bytes. I went ahead and coded up both the parallel FNV-1a and parallel FNV-1a + srl1-xor variants and ran performance tests and detection rate tests on both. Performance results: Mul-add checksums: 12.9 bytes/s FNV-1a checksums: 13.5 bytes/s FNV-1a + srl-1: 7.4 bytes/s Detection rates: False positive rates: Add-mul FNV-1a FNV-1a + srl-1 Single bit flip: 1:inf 1:129590 1:64795 Double bit flip: 1:148 1:511 1:53083 Triple bit flip: 1:673 1:5060 1:61511 Quad bit flip: 1:1872 1:19349 1:68320 Write 0x00 byte: 1:774538137 1:118776 1:68952 Write 0xFF byte: 1:165399500 1:137489 1:68958 Partial write: 1:59949 1:71939 1:89923 Write garbage: 1:64866 1:64980 1:67732 Write run of 00: 1:57077 1:61140 1:59723 Write run of FF: 1:63085 1:59609 1:62977 Test descriptions: N bit flip: picks N random non-overlapping bits and flips their value. Write X byte: overwrites a single byte with X. Partial write: picks a random cut point, overwrites everything from there to end with 0x00. Write garbage/run of X: picks two random cut points and fills everything in between with random values/X bytes. So adding in the shifted value nearly cuts the performance in half. I think that by playing with the instruction order I might coax the CPU scheduler to schedule the instructions better, but even in best case it will be somewhat slower. The point to keep in mind that even this slower speed is still faster than hardware accelerated CRC32, so all in all the hit might not be so bad. The effect on false positive rates for double bit errors is particularly impressive. I'm now running a testrun that shift right by 13 to see how that works out, intuitively it should help dispersing the bits a lot faster. >> I wonder if we use 32bit FNV-1a's (the h = (h^v)*p variant) with >> different offset-basis values, would it be enough to just XOR fold the >> resulting values together. The algorithm looking like this: > > Hm, this will make the algorithm less resilient to some particular > input permutations (e.g. those which swap the 64*i-th and the (64+1)-ith > words), but those seem very unlikely to occur randomly. But if we're > worried about that, we could use your linear combination method for > the aggregation phase. I don't think it significantly reduces resilience to permutations thanks to using different basis offsets and multiply not distributing over xor. >> Speaking against this option is the fact that we will need to do CPU >> detection at startup to make it fast on the x86 that support SSE4.1, >> and the fact that AMD CPUs before 2011 will run it an order of >> magnitude slower (but still faster than the best CRC). > > Hm, CPU detection isn't that hard, and given the speed at which Intel > currently invents new instructions we'll end up going that route sooner > or later anyway, I think. Sure it's not that hard but it does have an order of magnitude more design decisions than #if defined(__x86_64__). Maybe a first stab could avoid a generic infrastructure and just have the checksum function as a function pointer, with the default "trampoline" implementation running a cpuid and overwriting the function pointer with either the optimized or generic versions and then calling it. >> Any opinions if it would be a reasonable tradeoff to have a better >> checksum with great performance on latest x86 CPUs and good >> performance on other architectures at the expense of having only ok >> performance on older AMD CPUs? > > The loss on AMD is offset by the increased performance on machines > where we can't vectorize, I'd say. +1 Old AMD machines won't soon be used by anyone caring about performance, where a lousy checksum algorithm will stick around for a while. Regards, Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de -- Sent via pgsql-hackers mailing list (firstname.lastname@example.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers