On Apr17, 2013, at 16:47 , Ants Aasma <a...@cybertec.at> wrote:
> This made me remember, the issue I had was with high order bits, not
> with low order ones, somehow I got them confused. The exact issue is
> that the high order bits don't affect any bit lower than them. It's
> easy to see that if you remember the shift and add nature of multiply.
> Unfortunately XOR will not fix that. Neither will adding an offset
> basis. This is the fundamental thing that is behind the not-so-great
> uncorrelated bit error detection rate.

Right. We could maybe fix that by extending the update step to

  tmp = s[j] ^ d[i,j]
  s[j] = (t * PRIME) ^ (t >> 1)

or something like that. Shifting t instead of (t * PRIME) should
help to reduce the performance impact, since a reordering CPU should
be able to parallelize the multiple and the shift. Note though that
I haven't really though that through extensively - the general idea
should be sound, but whether 1 is a good shifting amount I do not

> While I understand that linearity is not a desirable property, I
> couldn't think of a realistic case where it would hurt. I can see how
> it can hurt checksums of variable length values, but for our fixed
> buffer case it's definitely not so clear cut. On the pro side the
> distributive property that is behind linearity allowed me to do final
> aggregation in a tree form, performing the multiplies in parallel
> instead of linearly. This adds up to the difference between 250 cycles
> (64*(3 cycle IMUL + 1 cycle XOR)) and 25 cycles (4*5 cycle pmullw + 5
> cycle addw). Given that the main loop is about 576 cycles, this is a
> significant difference.

> I wonder if we use 32bit FNV-1a's (the h = (h^v)*p variant) with
> different offset-basis values, would it be enough to just XOR fold the
> resulting values together. The algorithm looking like this:

Hm, this will make the algorithm less resilient to some particular
input permutations (e.g. those which swap the 64*i-th and the (64+1)-ith
words), but those seem very unlikely to occur randomly. But if we're
worried about that, we could use your linear combination method for
the aggregation phase.

> Speaking against this option is the fact that we will need to do CPU
> detection at startup to make it fast on the x86 that support SSE4.1,
> and the fact that AMD CPUs before 2011 will run it an order of
> magnitude slower (but still faster than the best CRC).

Hm, CPU detection isn't that hard, and given the speed at which Intel
currently invents new instructions we'll end up going that route sooner
or later anyway, I think. 

> Any opinions if it would be a reasonable tradeoff to have a better
> checksum with great performance on latest x86 CPUs and good
> performance on other architectures at the expense of having only ok
> performance on older AMD CPUs?

The loss on AMD is offset by the increased performance on machines
where we can't vectorize, I'd say.

best regards,
Florian Pflug

Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:

Reply via email to