# Re: [HACKERS] Enabling Checksums

On Wed, Apr 17, 2013 at 2:26 AM, Florian Pflug <f...@phlo.org> wrote:
>>> This raises two question. First, why are there two primes? You could
>>> just as well using a single prime q and set p=q^64 mod 2^16. You then
>>> get
>>>  S = sum V[i,j] * q^(64*(64-i) + (64-j)
>>>    = sum V[i,j] * q^(4096 - 64*(i-1) - j)
>>> You get higher prime powers that way, but you can easily choose a prime
>>> that yields distinct values mod 2^16 for exponents up to 16383. Your
>>> PRIME2, for example, does. (It wraps around for 16384, i.e.
>>> PRIME2^16384 = 1 mod 2^16, but that's true for every possible prime since
>>> 16384 is the Carmichael function's value at 2^16)
>>
>> The experimental detection rate is about the same if we use a single
>> prime. But I think you have the analytical form wrong here. It should
>> be given q = p:
>>
>>    S = sum V[i,j] * p^(64-i) * p^(64-j)
>>      = sum V[i,j] * p^(64 - i + 64 - j)
>>      = sum V[i,j] * p^(128 - i -j)
>
> Yeah, if you set q = p that's true. My suggestion was p=q^64 though...

So it was, I guess it was too late here and I missed it... All thing
considered that is a good suggestion, if for nothing else, the generic
implementation can be smaller this way.

>>> Second, why does it use addition instead of XOR? It seems that FNV
>>
>> Testing showed slightly better detection rate for adds. Intuitively I
>> think it's because the carry introduces some additional mixing.
>
> Hm, but OTOH it makes S linear in V, i.e. if you have two inputs
> V1,V2 and V = V1 + V2, then S = S1 + S2. Also, if V' = V*m, then
> S' = S*m. The second property is quite undesirable, I think. Assume
> all the V[i,j] are divisible by 2^k, i.e. have zeros at all bit
> positions 0..(k-1). Then, due to linearity, S is also divisible by
> 2^k, i.e. also has no ones before the k-th bit. This means, for example
> that if you hash values values which all have their lowest bit cleared,
> you get only 2^15 distinct hash values. If they all have the two
> lowest bits cleared, you get only 2^14 distinct values, and so on…
>
> Generally, linearity doesn't seem to be a property that one wants
> in a hash I think, so my suggestion is to stick to XOR.

This made me remember, the issue I had was with high order bits, not
with low order ones, somehow I got them confused. The exact issue is
that the high order bits don't affect any bit lower than them. It's
easy to see that if you remember the shift and add nature of multiply.
Unfortunately XOR will not fix that. Neither will adding an offset
basis. This is the fundamental thing that is behind the not-so-great
uncorrelated bit error detection rate.

While I understand that linearity is not a desirable property, I
couldn't think of a realistic case where it would hurt. I can see how
it can hurt checksums of variable length values, but for our fixed
buffer case it's definitely not so clear cut. On the pro side the
distributive property that is behind linearity allowed me to do final
aggregation in a tree form, performing the multiplies in parallel
instead of linearly. This adds up to the difference between 250 cycles
(64*(3 cycle IMUL + 1 cycle XOR)) and 25 cycles (4*5 cycle pmullw + 5
cycle addw). Given that the main loop is about 576 cycles, this is a
significant difference.

>>> Here, btw, is a page on FNV hashing. It mentions a few rules for
>>> picking suitable primes
>>>
>>> http://www.isthe.com/chongo/tech/comp/fnv
>>
>> Unfortunately the rules don't apply here because of the hash size.
>
> Yeah :-(.
>
> I noticed that their 32-bit prime only has a single one outside
> the first 16 bits. Maybe we can take advantage of that and use a
> 32-bit state while still providing decent performance on machines
> without a 32-bit x 32-bit -> 32-bit multiply instruction?

Looking at the Power instruction set, a 32bit mul by the FNV prime
would look like this:

vmulouh tmp1, hash, prime
vslw tmp2, hash, 24

That is 4 instructions to multiply 4 values. Depending on the specific
execution ports on the processor it might faster or slower than the
scalar version but not by a whole lot. Main benefit would be that the
intermediate state could be held in registers.

> If we lived in an Intel-only world, I'd suggest going with a
> the last CPUs without it came out over 5 years ago, I think.
> (Core2 and later support SSE4.1, and some later Core1 do too)
>
> But unfortunately things look bleak even for other x86
> implementations - AMD support SSE4.1 only starting with
> Bulldozer, which came out 2011 or so I believe. Leaving the x86
> realm, it seems that only ARM's NEON provides the instructions
> we'd need - AltiVec seems to be support only 16-bit multiplies,
> and from what some quick googling brought up, MIPS and SPARC
> SIMD instructions look no better..
>
> OTOH, chances are that nobody will ever do SIMD implementations
> for those machines. In that case, working in 32-bit chunks instead
> of 16-bit chunks would be beneficial, since it requires half the
> number of instructions…

Great job finding the information about other instructionsets. I
checked Intel manuals and Itanium too is one of the 16bit pmul
architectures.

Working in 32-bit chunks would also help non-x86 platforms by reducing
the number of registers needed to hold state. Those architectures are
not as register starved and can hold most of the required state in
registers. This would speed them up to about the same speed as
vectorizing.

I wonder if we use 32bit FNV-1a's (the h = (h^v)*p variant) with
different offset-basis values, would it be enough to just XOR fold the
resulting values together. The algorithm looking like this:

static uint16
PageCalcChecksum16(Page page, BlockNumber blkno)
{
uint32 sums[N_SUMS];
uint32 (*pageArr)[N_SUMS] = (uint32 (*)[N_SUMS]) page;
uint32 final_sum;
int i, j;

/* initialize partial checksums to arbitrary offsets */
memcpy(sums, checksum_offsets, sizeof(checksum_offsets));

/* calculate N_SUMS parallel FNV-1a hashes over the page */
for (i = 0; BLCKSZ/sizeof(uint32)/N_SUMS; i++)
for (j = 0; j < N_SUMS; j++)
sums[j] = (sums[j] ^ pageArr[0][j]) * FNV_PRIME;

/* XOR fold hashes together */
final_sum = sums[i];
for (i = 1; i < N_SUMS; i++)
final_sum ^= sums[i];

/* mix in block number */
final_sum ^= blkno;

/* truncate to 16bits by modulo prime and offset by 1 to avoid zero */
return (final_sum % CHECKSUM_TRUNC) + 1;
}

The SSE4.1 implementation of this would be as fast as the last pat,
generic version will be faster and we avoid the linearity issue. By
using different offsets for each of the partial hashes we don't
directly suffer from commutativity of the final xor folding. By using
the xor-then-multiply variant the last values hashed have their bits
mixed before folding together.

Speaking against this option is the fact that we will need to do CPU
detection at startup to make it fast on the x86 that support SSE4.1,
and the fact that AMD CPUs before 2011 will run it an order of
magnitude slower (but still faster than the best CRC).

Any opinions if it would be a reasonable tradeoff to have a better
checksum with great performance on latest x86 CPUs and good
performance on other architectures at the expense of having only ok
performance on older AMD CPUs?

Also, any good suggestions where should we do CPU detection when we go
this route?

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26