On Wed, Apr 10, 2013 at 4:36 AM, Jeff Davis <pg...@j-davis.com> wrote:
> On Tue, 2013-04-09 at 05:35 +0300, Ants Aasma wrote:
>> And here you go. I decided to be verbose with the comments as it's
>> easier to delete a comment than to write one. I also left in a huge
>> jumble of macros to calculate the contents of a helper var during
>> compile time. This can easily be replaced with the calculated values
>> once we settle on specific parameters.
>
> Great, thank you.
>
> Is it possible to put an interface over it that somewhat resembles the
> CRC checksum (INIT/COMP/FIN)? It looks a little challenging because of
> the nature of the algorithm, but it would make it easier to extend to
> other places (e.g. WAL). It doesn't have to match the INIT/COMP/FIN
> pattern exactly.

The algorithm has 128 bytes of state. Storing it on every step would
negate any performance gains, and C doesn't have a way to keep it in
registers. If we can trust that the compiler doesn't clobber the xmm
registers, then it could be split up into the following pieces:

1. init
2. process 128 bytes
3. aggregate state
4. mix in block number

Even if we don't split it up, factoring out steps 1..3 would make sense,
as there is no point in making step 4 platform specific, where it would
just be duplicated.

> Regardless, we should have some kind of fairly generic interface and
> move the code to its own file (e.g. checksum.c).
>
> To make the interface more generic, would it make sense to require the
> caller to save the page's stored checksum and zero it before
> calculating? That would avoid the awkwardness of avoiding the
> pd_checksum field. For example (code for illustration only):

Yes, that would help make it reusable.

> That would make it possible to use a different word size -- is uint16
> optimal or would a larger word be more efficient?

Larger words would have better mixing, as multiplies mix 4 bytes at a
time instead of 2. Performance of the vectorized version will be the
same, as it is tied to the vector length, but the unvectorized version
will get a speedup. The reason I picked 16 bits is not actually related
to the checksum hole; it is that the pmullw instruction (SSE2) is
guaranteed to be available on all 64-bit x86 CPUs, whereas pmulld
(SSE4.1) is only available on the latest CPUs.

> It looks like the block size needs to be an even multiple of
> sizeof(uint16)*NSUMS. And it also looks like it's hard to combine
> different regions of memory into the same calculation (unless we want to
> just calculate them separately and XOR them or something). Does that
> mean that this is not suitable for WAL at all?

I think it would be possible to define a padding scheme for irregularly
sized memory segments where we would only need a lead-out command for
blocks that are not a multiple of 128 bytes.
The performance of it would need to be measured. All in all, it's not
really a great match for WAL. While all of the fast checksums process
many bytes in a single iteration, they still process an order of
magnitude fewer bytes per iteration and so have an easier time with
irregularly shaped blocks.

> Using SIMD for WAL is not a requirement at all; I just thought it might
> be a nice benefit for non-checksum-enabled users in some later release.

I think we should first deal with using it for page checksums, and if
future versions want to reuse some of the code for WAL checksums then we
can rearrange the code.

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
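
For illustration only, the four-piece split discussed in this message
(init, process 128 bytes, aggregate state, mix in block number) might
look roughly like the plain C below. This is a sketch, not the code
from the patch: every name, the seed values, the multiplier and the
aggregation step are placeholders chosen for the example. It assumes
the caller has already saved and zeroed the stored checksum field and
that the length is a multiple of 128 bytes.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* All names and constants here are hypothetical, not taken from the patch. */
#define N_SUMS      64                           /* 64 x uint16 = 128 bytes of state */
#define CHUNK_BYTES (N_SUMS * sizeof(uint16_t))
#define SKETCH_MULT 0x4F9Bu                      /* arbitrary odd multiplier */

typedef struct
{
    uint16_t sums[N_SUMS];
} sketch_state;

/* One mixing step: 16-bit multiply, then XOR in the new word. */
static uint16_t
sketch_mix(uint16_t sum, uint16_t word)
{
    /* Widen first so the multiply cannot overflow a signed int. */
    return (uint16_t) (((uint32_t) sum * SKETCH_MULT) ^ word);
}

/* Step 1: init. */
static void
sketch_init(sketch_state *st)
{
    for (size_t i = 0; i < N_SUMS; i++)
        st->sums[i] = (uint16_t) (i + 1);        /* placeholder seed values */
}

/* Step 2: process 128 bytes; each partial sum absorbs one 16-bit word. */
static void
sketch_process(sketch_state *st, const unsigned char *chunk)
{
    for (size_t i = 0; i < N_SUMS; i++)
    {
        uint16_t word;

        memcpy(&word, chunk + i * sizeof(word), sizeof(word));
        st->sums[i] = sketch_mix(st->sums[i], word);
    }
}

/* Step 3: aggregate the partial sums into a single 16-bit value. */
static uint16_t
sketch_aggregate(const sketch_state *st)
{
    uint16_t result = 0;

    for (size_t i = 0; i < N_SUMS; i++)
        result = sketch_mix(result, st->sums[i]);
    return result;
}

/* Step 4: mix in the block number; nothing platform specific here. */
static uint16_t
sketch_mix_blkno(uint16_t sum, uint32_t blkno)
{
    return (uint16_t) (sum ^ (blkno & 0xFFFF) ^ (blkno >> 16));
}

/*
 * Steps 1..4 together. len is assumed to be a multiple of CHUNK_BYTES,
 * and the caller is assumed to have zeroed the stored checksum field.
 */
static uint16_t
sketch_page_checksum(const unsigned char *page, size_t len, uint32_t blkno)
{
    sketch_state st;

    sketch_init(&st);
    for (size_t off = 0; off + CHUNK_BYTES <= len; off += CHUNK_BYTES)
        sketch_process(&st, page + off);
    return sketch_mix_blkno(sketch_aggregate(&st), blkno);
}
```

The sketch only shows the structure. It keeps the state in a struct in
memory, which is exactly the cost described above, so a vectorized
build would presumably need steps 1..3 inlined into one function to let
the state stay in registers, with only step 4 shared across platforms.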