Re: CRC algorithm (was Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes)
On Tue, Sep 16, 2014 at 6:49 AM, Heikki Linnakangas wrote:
>>> As it happens, I also wrote an implementation of Slice-by-4 the other day
>>
>> If Heikki's version works I see little need to use my/Abhijit's
>> patch. That version has part of it under the zlib license. If Heikki's
>> version is a 'clean room', then I'd say we go with it. It looks really
>> quite similar though... We can make minor changes like additional
>> unrolling without problems later on.
>
> I used http://create.stephan-brumme.com/crc32/#slicing-by-8-overview as
> reference - you can probably see the similarity. Any implementation is
> going to look more or less the same, though; there aren't that many ways
> to write the implementation.

So, it seems like the status of this patch is:

1. It probably has a bug, since Amit's testing seemed to show that it
   wasn't returning the same results as unpatched master.
2. The performance tests showed a significant win on an important workload.
3. It's not in any CommitFest anywhere.

Given point #2, it seems like we ought to find a way to keep this from
sliding into oblivion.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: CRC algorithm (was Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes)
On Tue, Sep 16, 2014 at 4:27 PM, Andres Freund wrote:
> On 2014-09-16 13:49:20 +0300, Heikki Linnakangas wrote:
>> I used http://create.stephan-brumme.com/crc32/#slicing-by-8-overview as
>> reference - you can probably see the similarity. Any implementation is
>> going to look more or less the same, though; there aren't that many ways
>> to write the implementation.
>
> True.
>
> I think I see the problem causing Amit's test to fail. Amit, did
> you use the powerpc machine?

Yes.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: CRC algorithm (was Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes)
On 2014-09-16 13:49:20 +0300, Heikki Linnakangas wrote:
> I used http://create.stephan-brumme.com/crc32/#slicing-by-8-overview as
> reference - you can probably see the similarity. Any implementation is
> going to look more or less the same, though; there aren't that many ways
> to write the implementation.

True.

I think I see the problem causing Amit's test to fail. Amit, did
you use the powerpc machine?

Heikki, you swap bytes unconditionally - afaics that's wrong on big
endian systems. My patch had:

+ static inline uint32 swab32(const uint32 x);
+ static inline uint32 swab32(const uint32 x)
+ {
+	return ((x & (uint32)0x000000ffUL) << 24) |
+		((x & (uint32)0x0000ff00UL) << 8) |
+		((x & (uint32)0x00ff0000UL) >> 8) |
+		((x & (uint32)0xff000000UL) >> 24);
+ }
+
+ #if defined __BIG_ENDIAN__
+ #define cpu_to_be32(x) (x)
+ #else
+ #define cpu_to_be32(x) swab32(x)
+ #endif

I guess yours needs something similar. I personally like the cpu_to_be*
naming - it imo makes it pretty clear what happens.

Greetings,

Andres Freund

-- 
Andres Freund	http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: CRC algorithm (was Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes)
On 09/16/2014 01:28 PM, Andres Freund wrote:
> On 2014-09-16 15:43:06 +0530, Amit Kapila wrote:
>> On Sat, Sep 13, 2014 at 1:33 AM, Heikki Linnakangas wrote:
>>> On 09/12/2014 10:54 PM, Abhijit Menon-Sen wrote:
>>>> At 2014-09-12 22:38:01 +0300, hlinnakan...@vmware.com wrote:
>>>>> We probably should consider switching to a faster CRC algorithm again,
>>>>> regardless of what we do with compression.
>>>>
>>>> As it happens, I'm already working on resurrecting a patch that Andres
>>>> posted in 2010 to switch to zlib's faster CRC implementation.
>>>
>>> As it happens, I also wrote an implementation of Slice-by-4 the other
>>> day :-). Haven't gotten around to post it, but here it is.
>>
>> In case we are using the implementation for everything that uses the
>> COMP_CRC32() macro, won't it give problems for older version
>> databases? I have created a database with HEAD code and then
>> tried to start the server after applying this patch; it gives the
>> error below:
>> FATAL: incorrect checksum in control file
>
> That's indicative of a bug. This really shouldn't cause such problems -
> at least my version was compatible with the current definition, and IIRC
> Heikki's should be the same in theory. If I read it right.
>
>> In general, the idea sounds quite promising. To see how it performs
>> on small to medium size data, I have used the attached test which was
>> written by you (with some additional tests) during performance testing
>> of the WAL reduction patch in 9.4.
>
> Yes, we should really do this.
>
>> The patched version gives better results in all cases
>> (in the range of 10~15%). Though this is not the perfect test, it
>> gives a fair idea that the patch is quite promising. I think to test
>> the benefit from CRC calculation for full pages, we can have a
>> checkpoint during each test (maybe after the insert). Let me know
>> what other kinds of tests you think are required to see the
>> gain/loss from this patch.
>
> I actually think we don't really need this. It's pretty evident that
> slice-by-4 is a clear improvement.
>
>> I think the main difference between this patch and what Andres
>> developed some time back was code for a manually unrolled loop
>> doing 32 bytes at once, so once Andres or Abhijit posts an
>> updated version, we can do some performance tests to see
>> if there is any additional gain.
>
> If Heikki's version works I see little need to use my/Abhijit's
> patch. That version has part of it under the zlib license. If Heikki's
> version is a 'clean room', then I'd say we go with it. It looks really
> quite similar though... We can make minor changes like additional
> unrolling without problems later on.

I used http://create.stephan-brumme.com/crc32/#slicing-by-8-overview as
reference - you can probably see the similarity. Any implementation is
going to look more or less the same, though; there aren't that many ways
to write the implementation.

- Heikki
Re: CRC algorithm (was Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes)
On 2014-09-16 15:43:06 +0530, Amit Kapila wrote:
> On Sat, Sep 13, 2014 at 1:33 AM, Heikki Linnakangas wrote:
>> On 09/12/2014 10:54 PM, Abhijit Menon-Sen wrote:
>>> At 2014-09-12 22:38:01 +0300, hlinnakan...@vmware.com wrote:
>>>> We probably should consider switching to a faster CRC algorithm again,
>>>> regardless of what we do with compression.
>>>
>>> As it happens, I'm already working on resurrecting a patch that Andres
>>> posted in 2010 to switch to zlib's faster CRC implementation.
>>
>> As it happens, I also wrote an implementation of Slice-by-4 the other
>> day :-). Haven't gotten around to post it, but here it is.
>
> In case we are using the implementation for everything that uses the
> COMP_CRC32() macro, won't it give problems for older version
> databases? I have created a database with HEAD code and then
> tried to start the server after applying this patch; it gives the
> error below:
> FATAL: incorrect checksum in control file

That's indicative of a bug. This really shouldn't cause such problems -
at least my version was compatible with the current definition, and IIRC
Heikki's should be the same in theory. If I read it right.

> In general, the idea sounds quite promising. To see how it performs
> on small to medium size data, I have used the attached test which was
> written by you (with some additional tests) during performance testing
> of the WAL reduction patch in 9.4.

Yes, we should really do this.

> The patched version gives better results in all cases
> (in the range of 10~15%). Though this is not the perfect test, it
> gives a fair idea that the patch is quite promising. I think to test
> the benefit from CRC calculation for full pages, we can have a
> checkpoint during each test (maybe after the insert). Let me know
> what other kinds of tests you think are required to see the
> gain/loss from this patch.

I actually think we don't really need this. It's pretty evident that
slice-by-4 is a clear improvement.

> I think the main difference between this patch and what Andres
> developed some time back was code for a manually unrolled loop
> doing 32 bytes at once, so once Andres or Abhijit posts an
> updated version, we can do some performance tests to see
> if there is any additional gain.

If Heikki's version works I see little need to use my/Abhijit's
patch. That version has part of it under the zlib license. If Heikki's
version is a 'clean room', then I'd say we go with it. It looks really
quite similar though... We can make minor changes like additional
unrolling without problems later on.

Greetings,

Andres Freund

-- 
Andres Freund	http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: CRC algorithm (was Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes)
On Sat, Sep 13, 2014 at 1:33 AM, Heikki Linnakangas wrote:
> On 09/12/2014 10:54 PM, Abhijit Menon-Sen wrote:
>> At 2014-09-12 22:38:01 +0300, hlinnakan...@vmware.com wrote:
>>> We probably should consider switching to a faster CRC algorithm again,
>>> regardless of what we do with compression.
>>
>> As it happens, I'm already working on resurrecting a patch that Andres
>> posted in 2010 to switch to zlib's faster CRC implementation.
>
> As it happens, I also wrote an implementation of Slice-by-4 the other
> day :-). Haven't gotten around to post it, but here it is.

In case we are using the implementation for everything that uses the
COMP_CRC32() macro, won't it give problems for older version
databases? I have created a database with HEAD code and then
tried to start the server after applying this patch; it gives the
error below:
FATAL: incorrect checksum in control file

In general, the idea sounds quite promising. To see how it performs
on small to medium size data, I have used the attached test which was
written by you (with some additional tests) during performance testing
of the WAL reduction patch in 9.4.
Performance Data
----------------
Non-default settings:
autovacuum = off
checkpoint_segments = 256
checkpoint_timeout = 20 min

HEAD -

testname | wal_generated | duration
---------+---------------+---------
two short fields, no change | 583802008 | 11.6727559566498
two short fields, no change | 580888024 | 11.8558299541473
two short fields, no change | 580889680 | 11.5449349880219
two short fields, one changed | 620646400 | 11.6657111644745
two short fields, one changed | 620667904 | 11.6010649204254
two short fields, one changed | 622079320 | 11.6774570941925
two short fields, both changed | 620649656 | 12.0892491340637
two short fields, both changed | 620648360 | 12.1650269031525
two short fields, both changed | 620653952 | 12.2125108242035
one short and one long field, no change | 329018192 | 4.74178600311279
one short and one long field, no change | 329021664 | 4.71507883071899
one short and one long field, no change | 330326496 | 4.84932994842529
ten tiny fields, all changed | 701358488 | 14.236780166626
ten tiny fields, all changed | 701355328 | 14.0777900218964
ten tiny fields, all changed | 701358272 | 14.1000919342041
hundred tiny fields, all changed | 315656568 | 6.99316620826721
hundred tiny fields, all changed | 314875488 | 6.85715913772583
hundred tiny fields, all changed | 315263768 | 6.94613790512085
hundred tiny fields, half changed | 314878360 | 6.89090895652771
hundred tiny fields, half changed | 314877216 | 7.05924606323242
hundred tiny fields, half changed | 314881816 | 6.93445992469788
hundred tiny fields, half nulled | 236244136 | 6.43347096443176
hundred tiny fields, half nulled | 236248104 | 6.30539107322693
hundred tiny fields, half nulled | 236501040 | 6.33403086662292
9 short and 1 long, short changed | 262373616 | 4.24646091461182
9 short and 1 long, short changed | 262375136 | 4.49821400642395
9 short and 1 long, short changed | 262379840 | 4.38264393806458
(27 rows)

Patched -

testname | wal_generated | duration
---------+---------------+---------
two short fields, no change | 580897400 | 10.6518769264221
two short fields, no change | 581779816 | 10.7118690013885
two short fields, no change | 581013224 | 10.8294110298157
two short fields, one changed | 620646264 | 10.8309078216553
two short fields, one changed | 620652872 | 10.8480410575867
two short fields, one changed | 620812376 | 10.9162290096283
two short fields, both changed | 620651792 | 10.9025599956512
two short fields, both changed | 620652304 | 10.7771129608154
two short fields, both changed | 620649960 | 11.0185468196869
one short and one long field, no change | 329022000 | 3.88278198242188
one short and one long field, no change | 329023656 | 4.01899003982544
one short and one long field, no change | 329022992 | 3.91587209701538
ten tiny fields, all changed | 701353296 | 12.7748699188232
ten tiny fields, all changed | 701354848 | 12.761589050293
ten tiny fields, all changed | 701356520 | 12.6703131198883
hundred tiny fields, all changed | 314879424 | 6.25606894493103
hundred tiny fields, all changed | 314878416 | 6.32905578613281
hundred tiny fields, all changed | 314878464 | 6.28877377510071
Re: CRC algorithm (was Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes)
On 12/09/2014 17:23, "Andres Freund" wrote:
> On 2014-09-12 23:03:00 +0300, Heikki Linnakangas wrote:
>> On 09/12/2014 10:54 PM, Abhijit Menon-Sen wrote:
>>> At 2014-09-12 22:38:01 +0300, hlinnakan...@vmware.com wrote:
>>>> We probably should consider switching to a faster CRC algorithm again,
>>>> regardless of what we do with compression.
>>>
>>> As it happens, I'm already working on resurrecting a patch that Andres
>>> posted in 2010 to switch to zlib's faster CRC implementation.
>>
>> As it happens, I also wrote an implementation of Slice-by-4 the other
>> day :-). Haven't gotten around to post it, but here it is.
>>
>> What algorithm does zlib use for CRC calculation?
>
> Also slice-by-4, with a manually unrolled loop doing 32 bytes at once,
> using individual slice-by-4's. IIRC I tried, and removing that slowed
> things down overall. What it also did was move the CRC to a function.
> I'm not sure why I did it that way, but it really might be beneficial -
> if you look at profiles today there are sometimes icache/decoding
> stalls...
>
> Hm. Let me look:
> http://archives.postgresql.org/message-id/201005202227.49990.andres%40anarazel.de
>
> Ick, there's quite some debugging leftovers ;)
>
> I think it might be a good idea to also switch the polynomial at the
> same time. I really really think we should, when the hardware supports
> it, use the polynomial that's available in SSE4.2. It has similar
> properties and can be implemented in software just the same...
>
> Greetings,
>
> Andres Freund

This Google library is worth a look, https://code.google.com/p/crcutil/,
as it has some extremely optimized versions.
Re: CRC algorithm (was Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes)
On 2014-09-12 23:03:00 +0300, Heikki Linnakangas wrote:
> On 09/12/2014 10:54 PM, Abhijit Menon-Sen wrote:
>> At 2014-09-12 22:38:01 +0300, hlinnakan...@vmware.com wrote:
>>> We probably should consider switching to a faster CRC algorithm again,
>>> regardless of what we do with compression.
>>
>> As it happens, I'm already working on resurrecting a patch that Andres
>> posted in 2010 to switch to zlib's faster CRC implementation.
>
> As it happens, I also wrote an implementation of Slice-by-4 the other
> day :-). Haven't gotten around to post it, but here it is.
>
> What algorithm does zlib use for CRC calculation?

Also slice-by-4, with a manually unrolled loop doing 32 bytes at once,
using individual slice-by-4's. IIRC I tried, and removing that slowed
things down overall. What it also did was move the CRC to a function.
I'm not sure why I did it that way, but it really might be beneficial -
if you look at profiles today there are sometimes icache/decoding
stalls...

Hm. Let me look:
http://archives.postgresql.org/message-id/201005202227.49990.andres%40anarazel.de

Ick, there's quite some debugging leftovers ;)

I think it might be a good idea to also switch the polynomial at the
same time. I really really think we should, when the hardware supports
it, use the polynomial that's available in SSE4.2. It has similar
properties and can be implemented in software just the same...

Greetings,

Andres Freund

-- 
Andres Freund	http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
CRC algorithm (was Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes)
On 09/12/2014 10:54 PM, Abhijit Menon-Sen wrote:
> At 2014-09-12 22:38:01 +0300, hlinnakan...@vmware.com wrote:
>> We probably should consider switching to a faster CRC algorithm again,
>> regardless of what we do with compression.
>
> As it happens, I'm already working on resurrecting a patch that Andres
> posted in 2010 to switch to zlib's faster CRC implementation.

As it happens, I also wrote an implementation of Slice-by-4 the other
day :-). Haven't gotten around to post it, but here it is.

What algorithm does zlib use for CRC calculation?

- Heikki

diff --git a/src/include/utils/pg_crc.h b/src/include/utils/pg_crc.h
index 375c405..da1af98 100644
--- a/src/include/utils/pg_crc.h
+++ b/src/include/utils/pg_crc.h
@@ -37,25 +37,60 @@ typedef uint32 pg_crc32;
 
 /* Finish a CRC calculation */
 #define FIN_CRC32(crc)	((crc) ^= 0xFFFFFFFF)
 
+static inline uint32
+swap(uint32 x)
+{
+	/* XXX: use configure check for this? */
+#if defined(__GNUC__) || defined(__clang__)
+	return __builtin_bswap32(x);
+#else
+	return (x >> 24) | ((x >> 8) & 0xFF00) | ((x << 8) & 0x00FF0000) | (x << 24);
+#endif
+}
+
 /* Accumulate some (more) bytes into a CRC */
 #define COMP_CRC32(crc, data, len) \
 do { \
 	const unsigned char *__data = (const unsigned char *) (data); \
 	uint32 __len = (len); \
+	pg_crc32 __crc = (crc); \
+	\
+	/* Process byte at a time until the data address is aligned */ \
+	while ((((uintptr_t) __data) & 3) != 0 && __len > 0) \
+	{ \
+		int __tab_index = ((int) ((__crc) >> 24) ^ *__data++) & 0xFF; \
+		(__crc) = pg_crc32_table[0][__tab_index] ^ ((__crc) << 8); \
+		__len--; \
+	} \
+	/* Process 4 bytes a time */ \
+	while (__len >= 4) \
+	{ \
+		uint32 *current = (uint32 *) __data; \
 \
-	while (__len-- > 0) \
+		(__crc) = swap(__crc); \
+		(__crc) ^= *current; \
+		(__crc) = pg_crc32_table[3][ (__crc) & 0xFF] \
+			^ pg_crc32_table[2][((__crc)>>8 ) & 0xFF] \
+			^ pg_crc32_table[1][((__crc)>>16) & 0xFF] \
+			^ pg_crc32_table[0][ (__crc)>>24]; \
+		__len -= 4; \
+		__data += 4; \
+	} \
+	/* Process the last 0-3 bytes one byte at a time */ \
+	while (__len > 0) \
 	{ \
-		int __tab_index = ((int) ((crc) >> 24) ^ *__data++) & 0xFF; \
-		(crc) = pg_crc32_table[__tab_index] ^ ((crc) << 8); \
+		int __tab_index = ((int) ((__crc) >> 24) ^ *__data++) & 0xFF; \
+		(__crc) = pg_crc32_table[0][__tab_index] ^ ((__crc) << 8); \
+		__len--; \
 	} \
+	(crc) = __crc; \
 } while (0)
 
 /* Check for equality of two CRCs */
 #define EQ_CRC32(c1,c2)  ((c1) == (c2))
 
 /* Constant table for CRC calculation */
-extern CRCDLLIMPORT const uint32 pg_crc32_table[];
-
+extern CRCDLLIMPORT const uint32 pg_crc32_table[4][256];
 
 #ifdef PROVIDE_64BIT_CRC
diff --git a/src/include/utils/pg_crc_tables.h b/src/include/utils/pg_crc_tables.h
index a487ee4..fcc9a58 100644
--- a/src/include/utils/pg_crc_tables.h
+++ b/src/include/utils/pg_crc_tables.h
@@ -33,7 +33,9 @@
  * x^32+x^26+x^23+x^22+x^16+x^12+x^11+x^10+x^8+x^7+x^5+x^4+x^2+x+1.
  * (This is the same polynomial used in Ethernet checksums, for instance.)
  */
-const uint32 pg_crc32_table[256] = {
+const uint32 pg_crc32_table[4][256] =
+{
+{
 	0x00000000, 0x77073096, 0xEE0E612C, 0x990951BA,
 	0x076DC419, 0x706AF48F, 0xE963A535, 0x9E6495A3,
 	0x0EDB8832, 0x79DCB8A4, 0xE0D5E91E, 0x97D2D988,
@@ -98,6 +100,205 @@ const uint32 pg_crc32_table[256] = {
 	0xBAD03605, 0xCDD70693, 0x54DE5729, 0x23D967BF,
 	0xB3667A2E, 0xC4614AB8, 0x5D681B02, 0x2A6F2B94,
 	0xB40BBE37, 0xC30C8EA1, 0x5A05DF1B, 0x2D02EF8D
+},
+{
+	0x00000000, 0xC951729F, 0x49D3E37F, 0x808291E0,
+	0xF3A08CA3, 0x3AF1FE3C, 0xBA736FDC, 0x73221D43,
+	0x3C301F07, 0xF5616D98, 0x75E3FC78, 0xBCB28EE7,
+	0xCF9093A4, 0x06C1E13B, 0x864370DB, 0x4F120244,
+	0xD41608D9, 0x1D477A46, 0x9DC5EBA6, 0x54949939,
+	0x27B6847A, 0xEEE7F6E5, 0x6E656705, 0xA734159A,
+	0xE82617DE, 0x21776541, 0xA1F5F4A1, 0x68A4863E,
+	0x1B869B7D, 0xD2D7E9E2, 0x52557802, 0x9B040A9D,
+	0xDF2B2124, 0x167A53BB, 0x96F8C25B, 0x5FA9B0C4,
+	0x2C8BAD87, 0xE5DADF18, 0x65584EF8, 0xAC093C67,
+	0xE31B3E23, 0x2A4A4CBC, 0xAAC8DD5C, 0x6399AFC3,
+	0x10BBB280, 0xD9EAC01F, 0x596851FF, 0x90392360,
+	0x0B3D29FD, 0xC26C5B62, 0x42EECA82, 0x8BBFB81D,
+	0xF89DA55E, 0x31CCD7C1, 0xB14E4621, 0x781F34BE,
+	0x370D36FA, 0xFE5C4465, 0x7EDED585, 0xB78FA71A,
+	0xC4ADBA59, 0x0DFCC8C6, 0x8D7E5926, 0x442F2BB9,
+	0x65274409, 0xAC763696, 0x2CF4A776, 0xE5A5D5E9,
+	0x9687C8AA, 0x5FD6BA35, 0xDF542BD5, 0x1605594A,
+	0x59175B0E, 0x90462991, 0x10C4B871, 0xD995CAEE,
+	0xAAB7D7AD, 0x63E6A532, 0xE36434D2, 0x2A35464D,
+	0xB1314CD0, 0x78603E4F, 0xF8E2AFAF, 0x31B3DD30,
+	0x4291C073, 0x8BC0B2EC, 0x0B42230C, 0xC2135193,
+	0x8D0153D7, 0x44502148, 0xC4D2B0A8, 0x0D83C237,
+	0x7EA1DF74, 0xB7F0ADEB, 0x37723C0B, 0xFE234E94,
+	0xBA0C652D, 0x735D17B2, 0xF3DF8652, 0x3A8EF4CD,
+	0x49ACE98E, 0x80FD9B11, 0x007F0AF1, 0xC92E786E,
+	0x863C7A2A, 0x4F6D08B5, 0xCFEF9955, 0x06BEEBCA,
+	0x759CF689, 0xBCCD8416, 0x3C4F15F6, 0xF51E6769,
+	0x6E1A6DF4, 0xA74B1F6B, 0x27C98E8B, 0xEE98FC14,
+	0x9DBAE157, 0x54EB93C8, 0xD469022