On Wed, Apr 10, 2013 at 4:36 AM, Jeff Davis <pg...@j-davis.com> wrote:
> On Tue, 2013-04-09 at 05:35 +0300, Ants Aasma wrote:
>> And here you go. I decided to be verbose with the comments as it's
>> easier to delete a comment than to write one. I also left in a huge jumble
>> of macros to calculate the contents of a helper var during compile
>> time. This can easily be replaced with the calculated values once we
>> settle on specific parameters.
> Great, thank you.
> Is it possible to put an interface over it that somewhat resembles the
> CRC checksum (INIT/COMP/FIN)? It looks a little challenging because of
> the nature of the algorithm, but it would make it easier to extend to
> other places (e.g. WAL). It doesn't have to match the INIT/COMP/FIN
> pattern exactly.

The algorithm has 128 bytes of state. Storing it to memory on every step
would negate any performance gain, and C doesn't have a way to keep it in
registers. If we can trust the compiler not to clobber the xmm
registers, then it could be split up into the following pieces:
1. init
2. process 128 bytes
3. aggregate state
4. mix in block number
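
To make the four pieces concrete, here is a minimal scalar sketch of how
such a split could look. It is not the code from the patch: the
multiply-and-add update, the multiplier value, the XOR aggregation and all
of the sketch_* names are assumptions made for illustration. Only the state
size (64 uint16 sums = 128 bytes) and the order of the steps come from the
list above. The state lives in a struct here, so this relies on the
compiler inlining the steps rather than spilling the state to memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define N_SUMS      64                           /* 64 x uint16 = 128 bytes of state */
#define CHUNK_BYTES (N_SUMS * sizeof(uint16_t))  /* one step consumes 128 bytes */
#define SKETCH_MULT 0x6f4fu                      /* placeholder odd multiplier (assumption) */

typedef struct { uint16_t sums[N_SUMS]; } sketch_state;

/* Step 1: init. */
static void sketch_init(sketch_state *st)
{
    memset(st->sums, 0, sizeof st->sums);
}

/* Step 2: process 128 bytes, one 16-bit word per lane (assumed update rule). */
static void sketch_process(sketch_state *st, const unsigned char *chunk)
{
    uint16_t words[N_SUMS];

    memcpy(words, chunk, CHUNK_BYTES);
    for (int j = 0; j < N_SUMS; j++)
        st->sums[j] = (uint16_t)(st->sums[j] * SKETCH_MULT + words[j]);
}

/* Step 3: aggregate state; the lanes are simply XORed together here. */
static uint16_t sketch_aggregate(const sketch_state *st)
{
    uint16_t result = 0;

    for (int j = 0; j < N_SUMS; j++)
        result = (uint16_t)(result ^ st->sums[j]);
    return result;
}

/* Step 4: mix in block number; platform independent, so written only once. */
static uint16_t sketch_mix_blkno(uint16_t sum, uint32_t blkno)
{
    return (uint16_t)(sum ^ blkno ^ (blkno >> 16));
}

/* All four steps chained; len must be a multiple of 128 bytes. */
static uint16_t sketch_checksum(const unsigned char *page, size_t len, uint32_t blkno)
{
    sketch_state st;

    assert(len % CHUNK_BYTES == 0);
    sketch_init(&st);
    for (size_t off = 0; off < len; off += CHUNK_BYTES)
        sketch_process(&st, page + off);
    return sketch_mix_blkno(sketch_aggregate(&st), blkno);
}
```

With this shape, a vectorized build would only need to replace
sketch_process (and possibly sketch_aggregate).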

Even if we don't split it up, factoring out steps 1..3 would make
sense: step 4 has no reason to be platform specific, so at the moment
it is simply duplicated.

> Regardless, we should have some kind of fairly generic interface and
> move the code to its own file (e.g. checksum.c).
> To make the interface more generic, would it make sense to require the
> caller to save the page's stored checksum and zero it before
> calculating? That would avoid the awkwardness of avoiding the
> pd_checksum field. For example (code for illustration only):

Yes, that would help make it reusable.
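
A rough sketch of what the save-and-zero convention could look like from
the caller's side. This is not the illustration from the quoted message.
The function names, the field_off parameter and the stand-in checksum
routine are all hypothetical. The only point is that the checksum routine
itself no longer needs to know where the checksum field lives.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in checksum over a whole buffer. NOT the proposed algorithm. */
static uint16_t sketch_checksum_buffer(const unsigned char *buf, size_t len)
{
    uint32_t a = 1, b = 0;

    for (size_t i = 0; i < len; i++)
    {
        a = (a + buf[i]) % 65521u;
        b = (b + a) % 65521u;
    }
    return (uint16_t)(a ^ b);
}

/*
 * Caller-side wrapper: save the stored checksum, zero the field, checksum
 * the entire page, then restore the field. field_off is the (hypothetical)
 * byte offset of the 16-bit checksum field within the page.
 */
static uint16_t sketch_page_checksum(unsigned char *page, size_t len, size_t field_off)
{
    uint16_t saved;
    uint16_t zero = 0;
    uint16_t result;

    assert(field_off + sizeof(uint16_t) <= len);

    memcpy(&saved, page + field_off, sizeof saved);   /* save */
    memcpy(page + field_off, &zero, sizeof zero);     /* zero */
    result = sketch_checksum_buffer(page, len);       /* generic routine */
    memcpy(page + field_off, &saved, sizeof saved);   /* restore */
    return result;
}
```

The result does not depend on whatever value is currently stored in the
field, and the generic routine can be reused for buffers that have no such
field at all.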

> That would make it possible to use a different word size -- is uint16
> optimal or would a larger word be more efficient?

Larger words would have better mixing, as multiplies mix 4 bytes at a
time instead of 2. Performance of the vectorized version will be the
same, as it is tied to the vector length, but the unvectorized version
will get a speedup. I picked 16 bits not because of the checksum hole
but because the pmullw instruction is guaranteed to be available on all
64-bit CPUs (it is part of SSE2), whereas pmulld is only available on
the latest CPUs (it requires SSE4.1).
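
For illustration, here is what one 16-bit multiply-and-add step over a
single 128-bit register could look like with SSE2 intrinsics, next to a
portable scalar version that computes the same thing. _mm_mullo_epi16 is
the intrinsic that maps to pmullw. The update rule, the multiplier value
and the function names are assumptions for this sketch, not taken from the
patch.

```c
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>          /* SSE2 intrinsics */
#endif

#define LANES       8           /* eight 16-bit lanes per 128-bit register */
#define SKETCH_MULT 0x6f4fu     /* placeholder odd multiplier (assumption) */

/* Portable reference: sums[j] = sums[j] * MULT + data[j], modulo 2^16. */
static void step16_scalar(uint16_t sums[LANES], const uint16_t data[LANES])
{
    for (int j = 0; j < LANES; j++)
        sums[j] = (uint16_t)(sums[j] * SKETCH_MULT + data[j]);
}

/* Same step on all eight lanes at once when SSE2 is available. */
static void step16(uint16_t sums[LANES], const uint16_t data[LANES])
{
#if defined(__SSE2__)
    __m128i s = _mm_loadu_si128((const __m128i *) sums);
    __m128i d = _mm_loadu_si128((const __m128i *) data);
    __m128i m = _mm_set1_epi16((short) SKETCH_MULT);

    /* low 16 bits of each product (pmullw), then a wrapping add (paddw) */
    s = _mm_add_epi16(_mm_mullo_epi16(s, m), d);
    _mm_storeu_si128((__m128i *) sums, s);
#else
    step16_scalar(sums, data);
#endif
}
```

A 32-bit variant would use _mm_mullo_epi32 from <smmintrin.h> instead,
which is the pmulld instruction and needs SSE4.1.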

> It looks like the block size needs to be an even multiple of
> sizeof(uint16)*NSUMS. And it also looks like it's hard to combine
> different regions of memory into the same calculation (unless we want to
> just calculate them separately and XOR them or something). Does that
> mean that this is not suitable for WAL at all?

I think it would be possible to define a padding scheme for
irregularly sized memory segments where we would only need a lead-out
step for blocks that are not a multiple of 128 bytes. The
performance of it would need to be measured. All in all, it's not
really a great match for WAL. While all of the fast checksums process
many bytes in a single iteration, they still process an order of
magnitude fewer bytes per iteration and so have an easier time with
irregularly shaped blocks.
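
One possible shape for such a padding scheme, as an unmeasured sketch:
full 128-byte chunks take the normal path, and only a trailing partial
chunk goes through a zero-padded lead-out. The update rule, the
multiplier, the XOR aggregation and the decision to mix in the length (so
that trailing zero bytes are not confused with padding) are all
assumptions for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define N_SUMS      64
#define CHUNK_BYTES (N_SUMS * sizeof(uint16_t))  /* 128 bytes per iteration */
#define SKETCH_MULT 0x6f4fu                      /* placeholder odd multiplier */

/* One regular iteration over exactly 128 bytes (assumed update rule). */
static void sketch_chunk(uint16_t sums[N_SUMS], const unsigned char *chunk)
{
    uint16_t words[N_SUMS];

    memcpy(words, chunk, CHUNK_BYTES);
    for (int j = 0; j < N_SUMS; j++)
        sums[j] = (uint16_t)(sums[j] * SKETCH_MULT + words[j]);
}

/* Checksum a buffer of arbitrary length using a zero-padded lead-out. */
static uint16_t sketch_checksum_any_length(const unsigned char *data, size_t len)
{
    uint16_t sums[N_SUMS] = {0};
    size_t   full = len / CHUNK_BYTES;
    size_t   tail = len % CHUNK_BYTES;
    uint16_t result = 0;

    assert(data != NULL || len == 0);

    for (size_t i = 0; i < full; i++)
        sketch_chunk(sums, data + i * CHUNK_BYTES);

    /* Lead-out: only needed when len is not a multiple of 128 bytes. */
    if (tail != 0)
    {
        unsigned char padded[CHUNK_BYTES] = {0};

        memcpy(padded, data + full * CHUNK_BYTES, tail);
        sketch_chunk(sums, padded);
    }

    for (int j = 0; j < N_SUMS; j++)
        result = (uint16_t)(result ^ sums[j]);

    /* Mix in the length so that padding zeros differ from real zeros. */
    return (uint16_t)(result ^ (uint16_t) len ^ (uint16_t) (len >> 16));
}
```

For short, irregular records the lead-out copy runs on almost every call,
which is where the extra cost relative to a byte-oriented checksum would
show up.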

> Using SIMD for WAL is not a requirement at all; I just thought it might
> be a nice benefit for non-checksum-enabled users in some later release.

I think we should first deal with using it for page checksums and if
future versions want to reuse some of the code for WAL checksums then
we can rearrange the code.

Ants Aasma
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de