On Fri, Dec 12, 2014 at 11:50 PM, Michael Paquier
<michael.paqu...@gmail.com> wrote:
>
> On Wed, Dec 10, 2014 at 11:25 PM, Bruce Momjian <br...@momjian.us> wrote:
>>
>> On Wed, Dec 10, 2014 at 07:40:46PM +0530, Rahila Syed wrote:
>> > The tests ran for around 30 mins. A manual checkpoint was run before
>> > each test.
>> >
>> > Compression   WAL generated   %compression   Latency-avg   CPU usage (seconds)                       TPS      Latency stddev
>> > on            1531.4 MB       ~35 %          7.351 ms      user diff: 562.67s system diff: 41.40s    135.96   13.759 ms
>> > off           2373.1 MB                      6.781 ms      user diff: 354.20s system diff: 39.67s    147.40   14.152 ms
>> >
>> > The compression obtained is quite high, close to 35 %.
>> > CPU usage at user level when compression is on is quite noticeably
>> > higher than when compression is off, but the gain in terms of WAL
>> > reduction is also high.
>>
>> I am sorry but I can't understand the above results due to wrapping.
>> Are you saying compression was twice as slow?
>
> I got curious to see how the compression of an entire record would
> perform, and how it compares for small WAL records, and here are some
> numbers based on the patch attached. This patch compresses the whole
> record, including the block headers, letting only XLogRecord out of it,
> with a flag indicating that the record is compressed (note that this
> patch contains a portion for replay that is untested; still, it gives
> an idea of how much compression of the whole record affects user CPU in
> this test case). It uses a buffer of 4 * BLCKSZ; if the record is
> longer than that, compression is simply given up. Those tests use the
> hack upthread that calculates user and system CPU with getrusage() in
> a backend.
>
> Here is the simple test case I used, with 512MB of shared_buffers and
> small records, filling up a bunch of buffers, dirtying them and then
> compressing FPWs with a checkpoint:
> #!/bin/bash
> psql <<EOF
> SELECT pg_backend_pid();
> CREATE TABLE aa (a int);
> CREATE TABLE results (phase text, position pg_lsn);
> CREATE EXTENSION IF NOT EXISTS pg_prewarm;
> ALTER TABLE aa SET (FILLFACTOR = 50);
> INSERT INTO results VALUES ('pre-insert', pg_current_xlog_location());
> INSERT INTO aa VALUES (generate_series(1,7000000)); -- 484MB
> SELECT pg_size_pretty(pg_relation_size('aa'::regclass));
> SELECT pg_prewarm('aa'::regclass);
> CHECKPOINT;
> INSERT INTO results VALUES ('pre-update', pg_current_xlog_location());
> UPDATE aa SET a = 7000000 + a;
> CHECKPOINT;
> INSERT INTO results VALUES ('post-update', pg_current_xlog_location());
> SELECT * FROM results;
> EOF

Re-using this test case, I have produced more results by changing the
fillfactor of the table:
=# select test || ', ffactor ' || ffactor,
     pg_size_pretty(post_update - pre_update),
     user_diff, system_diff from results;
           ?column?            | pg_size_pretty | user_diff | system_diff
-------------------------------+----------------+-----------+-------------
 FPW on + 2 bytes, ffactor 50  | 582 MB         | 42.391894 |    0.807444
 FPW on + 2 bytes, ffactor 20  | 229 MB         | 14.330304 |    0.729626
 FPW on + 2 bytes, ffactor 10  | 117 MB         |  7.335442 |    0.570996
 FPW off + 2 bytes, ffactor 50 | 746 MB         | 25.330391 |    1.248503
 FPW off + 2 bytes, ffactor 20 | 293 MB         | 10.537475 |    0.755448
 FPW off + 2 bytes, ffactor 10 | 148 MB         |  5.762775 |    0.763761
 HEAD, ffactor 50              | 746 MB         | 25.181729 |    1.133433
 HEAD, ffactor 20              | 293 MB         |  9.962242 |    0.765970
 HEAD, ffactor 10              | 148 MB         |  5.693426 |    0.775371
 Record, ffactor 50            | 582 MB         | 54.904374 |    0.678204
 Record, ffactor 20            | 229 MB         | 19.798268 |    0.807220
 Record, ffactor 10            | 116 MB         |  9.401877 |    0.668454
(12 rows)
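For reference, the user_diff and system_diff columns come from the
getrusage() hack mentioned upthread, which snapshots the backend's CPU
usage before and after the measured workload. Here is a minimal
standalone sketch of that measurement technique; the function names
are illustrative and not taken from the actual hack:

#include <stdio.h>
#include <sys/resource.h>

static struct rusage start_usage;

/* take a CPU usage snapshot before the measured workload */
static void
cpu_usage_start(void)
{
	getrusage(RUSAGE_SELF, &start_usage);
}

/* print user/system CPU consumed since cpu_usage_start() */
static void
cpu_usage_report(void)
{
	struct rusage end_usage;

	getrusage(RUSAGE_SELF, &end_usage);
	printf("user diff: %.6fs system diff: %.6fs\n",
		   (end_usage.ru_utime.tv_sec - start_usage.ru_utime.tv_sec) +
		   (end_usage.ru_utime.tv_usec - start_usage.ru_utime.tv_usec) / 1e6,
		   (end_usage.ru_stime.tv_sec - start_usage.ru_stime.tv_sec) +
		   (end_usage.ru_stime.tv_usec - start_usage.ru_stime.tv_usec) / 1e6);
}

int
main(void)
{
	volatile double x = 0;

	cpu_usage_start();
	for (long i = 0; i < 100000000; i++)	/* dummy CPU-bound workload */
		x += i;
	cpu_usage_report();
	return 0;
}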
The following tests are run:
- "Record" means the record-level compression
- "HEAD" is postgres at 1c5c70df
- "FPW off" is HEAD + patch with the switch set to off
- "FPW on" is HEAD + patch with the switch set to on

The gain in compression scales linearly with the length of the page
hole. There was visibly some noise in the tests: you can see that the
CPU of "FPW off" is a bit higher than HEAD. Something to be aware of,
by the way, is that this patch introduces an additional 8 bytes per
block image in WAL, as it contains extra information to control the
compression; in this case the uint16 compress_len present in
XLogRecordBlockImageHeader (sketched after this mail). For the
measurements done, knowing that 63638 FPWs have been written, that
makes a difference of a bit less than 500kB in WAL between HEAD and
"FPW off", in favor of HEAD (63638 * 8 bytes ~= 497kB). The gain with
compression is welcome; still, for the default there is a small price
to pay to track whether a block is compressed or not. The patch takes
advantage of that by not compressing the hole present in a page,
reducing CPU work a bit.

Attached are as well updated patches, switching wal_compression to
USERSET and cleaning up things related to this switch from
PGC_POSTMASTER. I am attaching as well the results I got; feel free to
have a look.

Regards,
-- 
Michael
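For readers jumping straight to the patches: here is roughly what the
block image header looks like once the second patch is applied. This is
a simplified sketch based on the description above, not the
authoritative definition (the xlogrecord.h change is listed in the
second patch's diffstat); the hole fields already exist today, the
patch adds compress_len:

typedef struct XLogRecordBlockImageHeader
{
	uint16		hole_offset;	/* number of bytes before the page "hole" */
	uint16		hole_length;	/* number of bytes in the "hole" */
	uint16		compress_len;	/* length of the compressed block image,
								 * or 0 if stored uncompressed */
} XLogRecordBlockImageHeader;

Replay, and tools like pg_xlogdump as in the hunk below, can then check
compress_len to know whether a stored image must be decompressed before
use.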
From 4025f5a7d23e6238e1a6ef647e4bc43e5aeaebf9 Mon Sep 17 00:00:00 2001 From: Michael Paquier <mich...@otacoo.com> Date: Tue, 25 Nov 2014 14:05:59 +0900 Subject: [PATCH 1/2] Move pg_lzcompress.c to src/common Exposing compression and decompression APIs of pglz makes possible its use by extensions and contrib modules. pglz_decompress contained a call to elog to emit an error message in case of corrupted data. This function is changed to return a status code to let its callers return an error instead. Compression function is changed similarly to make the whole set consistent. --- src/backend/access/heap/tuptoaster.c | 11 +- src/backend/utils/adt/Makefile | 4 +- src/backend/utils/adt/pg_lzcompress.c | 779 --------------------------------- src/common/Makefile | 3 +- src/common/pg_lzcompress.c | 784 ++++++++++++++++++++++++++++++++++ src/include/utils/pg_lzcompress.h | 19 +- src/tools/msvc/Mkvcbuild.pm | 3 +- 7 files changed, 813 insertions(+), 790 deletions(-) delete mode 100644 src/backend/utils/adt/pg_lzcompress.c create mode 100644 src/common/pg_lzcompress.c diff --git a/src/backend/access/heap/tuptoaster.c b/src/backend/access/heap/tuptoaster.c index d230387..8269016 100644 --- a/src/backend/access/heap/tuptoaster.c +++ b/src/backend/access/heap/tuptoaster.c @@ -142,7 +142,8 @@ heap_tuple_untoast_attr(struct varlena * attr) attr = (struct varlena *) palloc(PGLZ_RAW_SIZE(tmp) + VARHDRSZ); SET_VARSIZE(attr, PGLZ_RAW_SIZE(tmp) + VARHDRSZ); - pglz_decompress(tmp, VARDATA(attr)); + if (pglz_decompress(tmp, VARDATA(attr)) != PGLZ_OK) + elog(ERROR, "compressed data is corrupted"); pfree(tmp); } } @@ -167,7 +168,8 @@ heap_tuple_untoast_attr(struct varlena * attr) attr = (struct varlena *) palloc(PGLZ_RAW_SIZE(tmp) + VARHDRSZ); SET_VARSIZE(attr, PGLZ_RAW_SIZE(tmp) + VARHDRSZ); - pglz_decompress(tmp, VARDATA(attr)); + if (pglz_decompress(tmp, VARDATA(attr)) != PGLZ_OK) + elog(ERROR, "compressed data is corrupted"); } else if (VARATT_IS_SHORT(attr)) { @@ -239,7 +241,8 @@ heap_tuple_untoast_attr_slice(struct varlena * attr, preslice = (struct varlena *) palloc(size); SET_VARSIZE(preslice, size); - pglz_decompress(tmp, VARDATA(preslice)); + if (pglz_decompress(tmp, VARDATA(preslice)) != PGLZ_OK) + elog(ERROR, "compressed data is corrupted"); if (tmp != (PGLZ_Header *) attr) pfree(tmp); @@ -1253,7 +1256,7 @@ toast_compress_datum(Datum value) * we insist on a savings of more than 2 bytes to ensure we have a gain. 
*/ if (pglz_compress(VARDATA_ANY(DatumGetPointer(value)), valsize, - (PGLZ_Header *) tmp, PGLZ_strategy_default) && + (PGLZ_Header *) tmp, PGLZ_strategy_default) == PGLZ_OK && VARSIZE(tmp) < valsize - 2) { /* successful compression */ diff --git a/src/backend/utils/adt/Makefile b/src/backend/utils/adt/Makefile index 3ea9bf4..20e5ff1 100644 --- a/src/backend/utils/adt/Makefile +++ b/src/backend/utils/adt/Makefile @@ -25,8 +25,8 @@ OBJS = acl.o arrayfuncs.o array_selfuncs.o array_typanalyze.o \ jsonfuncs.o like.o lockfuncs.o mac.o misc.o nabstime.o name.o \ network.o network_gist.o network_selfuncs.o \ numeric.o numutils.o oid.o oracle_compat.o \ - orderedsetaggs.o pg_lzcompress.o pg_locale.o pg_lsn.o \ - pgstatfuncs.o pseudotypes.o quote.o rangetypes.o rangetypes_gist.o \ + orderedsetaggs.o pg_locale.o pg_lsn.o pgstatfuncs.o \ + pseudotypes.o quote.o rangetypes.o rangetypes_gist.o \ rangetypes_selfuncs.o rangetypes_spgist.o rangetypes_typanalyze.o \ regexp.o regproc.o ri_triggers.o rowtypes.o ruleutils.o \ selfuncs.o tid.o timestamp.o trigfuncs.o \ diff --git a/src/backend/utils/adt/pg_lzcompress.c b/src/backend/utils/adt/pg_lzcompress.c deleted file mode 100644 index fe08890..0000000 --- a/src/backend/utils/adt/pg_lzcompress.c +++ /dev/null @@ -1,779 +0,0 @@ -/* ---------- - * pg_lzcompress.c - - * - * This is an implementation of LZ compression for PostgreSQL. - * It uses a simple history table and generates 2-3 byte tags - * capable of backward copy information for 3-273 bytes with - * a max offset of 4095. - * - * Entry routines: - * - * bool - * pglz_compress(const char *source, int32 slen, PGLZ_Header *dest, - * const PGLZ_Strategy *strategy); - * - * source is the input data to be compressed. - * - * slen is the length of the input data. - * - * dest is the output area for the compressed result. - * It must be at least as big as PGLZ_MAX_OUTPUT(slen). - * - * strategy is a pointer to some information controlling - * the compression algorithm. If NULL, the compiled - * in default strategy is used. - * - * The return value is TRUE if compression succeeded, - * FALSE if not; in the latter case the contents of dest - * are undefined. - * - * void - * pglz_decompress(const PGLZ_Header *source, char *dest) - * - * source is the compressed input. - * - * dest is the area where the uncompressed data will be - * written to. It is the callers responsibility to - * provide enough space. The required amount can be - * obtained with the macro PGLZ_RAW_SIZE(source). - * - * The data is written to buff exactly as it was handed - * to pglz_compress(). No terminating zero byte is added. - * - * The decompression algorithm and internal data format: - * - * PGLZ_Header is defined as - * - * typedef struct PGLZ_Header { - * int32 vl_len_; - * int32 rawsize; - * } - * - * The header is followed by the compressed data itself. - * - * The data representation is easiest explained by describing - * the process of decompression. - * - * If VARSIZE(x) == rawsize + sizeof(PGLZ_Header), then the data - * is stored uncompressed as plain bytes. Thus, the decompressor - * simply copies rawsize bytes from the location after the - * header to the destination. - * - * Otherwise the first byte after the header tells what to do - * the next 8 times. We call this the control byte. - * - * An unset bit in the control byte means, that one uncompressed - * byte follows, which is copied from input to output. - * - * A set bit in the control byte means, that a tag of 2-3 bytes - * follows. 
A tag contains information to copy some bytes, that - * are already in the output buffer, to the current location in - * the output. Let's call the three tag bytes T1, T2 and T3. The - * position of the data to copy is coded as an offset from the - * actual output position. - * - * The offset is in the upper nibble of T1 and in T2. - * The length is in the lower nibble of T1. - * - * So the 16 bits of a 2 byte tag are coded as - * - * 7---T1--0 7---T2--0 - * OOOO LLLL OOOO OOOO - * - * This limits the offset to 1-4095 (12 bits) and the length - * to 3-18 (4 bits) because 3 is always added to it. To emit - * a tag of 2 bytes with a length of 2 only saves one control - * bit. But we lose one byte in the possible length of a tag. - * - * In the actual implementation, the 2 byte tag's length is - * limited to 3-17, because the value 0xF in the length nibble - * has special meaning. It means, that the next following - * byte (T3) has to be added to the length value of 18. That - * makes total limits of 1-4095 for offset and 3-273 for length. - * - * Now that we have successfully decoded a tag. We simply copy - * the output that occurred <offset> bytes back to the current - * output location in the specified <length>. Thus, a - * sequence of 200 spaces (think about bpchar fields) could be - * coded in 4 bytes. One literal space and a three byte tag to - * copy 199 bytes with a -1 offset. Whow - that's a compression - * rate of 98%! Well, the implementation needs to save the - * original data size too, so we need another 4 bytes for it - * and end up with a total compression rate of 96%, what's still - * worth a Whow. - * - * The compression algorithm - * - * The following uses numbers used in the default strategy. - * - * The compressor works best for attributes of a size between - * 1K and 1M. For smaller items there's not that much chance of - * redundancy in the character sequence (except for large areas - * of identical bytes like trailing spaces) and for bigger ones - * our 4K maximum look-back distance is too small. - * - * The compressor creates a table for lists of positions. - * For each input position (except the last 3), a hash key is - * built from the 4 next input bytes and the position remembered - * in the appropriate list. Thus, the table points to linked - * lists of likely to be at least in the first 4 characters - * matching strings. This is done on the fly while the input - * is compressed into the output area. Table entries are only - * kept for the last 4096 input positions, since we cannot use - * back-pointers larger than that anyway. The size of the hash - * table is chosen based on the size of the input - a larger table - * has a larger startup cost, as it needs to be initialized to - * zero, but reduces the number of hash collisions on long inputs. - * - * For each byte in the input, its hash key (built from this - * byte and the next 3) is used to find the appropriate list - * in the table. The lists remember the positions of all bytes - * that had the same hash key in the past in increasing backward - * offset order. Now for all entries in the used lists, the - * match length is computed by comparing the characters from the - * entries position with the characters from the actual input - * position. - * - * The compressor starts with a so called "good_match" of 128. - * It is a "prefer speed against compression ratio" optimizer. 
- * So if the first entry looked at already has 128 or more - * matching characters, the lookup stops and that position is - * used for the next tag in the output. - * - * For each subsequent entry in the history list, the "good_match" - * is lowered by 10%. So the compressor will be more happy with - * short matches the farer it has to go back in the history. - * Another "speed against ratio" preference characteristic of - * the algorithm. - * - * Thus there are 3 stop conditions for the lookup of matches: - * - * - a match >= good_match is found - * - there are no more history entries to look at - * - the next history entry is already too far back - * to be coded into a tag. - * - * Finally the match algorithm checks that at least a match - * of 3 or more bytes has been found, because thats the smallest - * amount of copy information to code into a tag. If so, a tag - * is omitted and all the input bytes covered by that are just - * scanned for the history add's, otherwise a literal character - * is omitted and only his history entry added. - * - * Acknowledgements: - * - * Many thanks to Adisak Pochanayon, who's article about SLZ - * inspired me to write the PostgreSQL compression this way. - * - * Jan Wieck - * - * Copyright (c) 1999-2014, PostgreSQL Global Development Group - * - * src/backend/utils/adt/pg_lzcompress.c - * ---------- - */ -#include "postgres.h" - -#include <limits.h> - -#include "utils/pg_lzcompress.h" - - -/* ---------- - * Local definitions - * ---------- - */ -#define PGLZ_MAX_HISTORY_LISTS 8192 /* must be power of 2 */ -#define PGLZ_HISTORY_SIZE 4096 -#define PGLZ_MAX_MATCH 273 - - -/* ---------- - * PGLZ_HistEntry - - * - * Linked list for the backward history lookup - * - * All the entries sharing a hash key are linked in a doubly linked list. - * This makes it easy to remove an entry when it's time to recycle it - * (because it's more than 4K positions old). - * ---------- - */ -typedef struct PGLZ_HistEntry -{ - struct PGLZ_HistEntry *next; /* links for my hash key's list */ - struct PGLZ_HistEntry *prev; - int hindex; /* my current hash key */ - const char *pos; /* my input position */ -} PGLZ_HistEntry; - - -/* ---------- - * The provided standard strategies - * ---------- - */ -static const PGLZ_Strategy strategy_default_data = { - 32, /* Data chunks less than 32 bytes are not - * compressed */ - INT_MAX, /* No upper limit on what we'll try to - * compress */ - 25, /* Require 25% compression rate, or not worth - * it */ - 1024, /* Give up if no compression in the first 1KB */ - 128, /* Stop history lookup if a match of 128 bytes - * is found */ - 10 /* Lower good match size by 10% at every loop - * iteration */ -}; -const PGLZ_Strategy *const PGLZ_strategy_default = &strategy_default_data; - - -static const PGLZ_Strategy strategy_always_data = { - 0, /* Chunks of any size are compressed */ - INT_MAX, - 0, /* It's enough to save one single byte */ - INT_MAX, /* Never give up early */ - 128, /* Stop history lookup if a match of 128 bytes - * is found */ - 6 /* Look harder for a good match */ -}; -const PGLZ_Strategy *const PGLZ_strategy_always = &strategy_always_data; - - -/* ---------- - * Statically allocated work arrays for history - * ---------- - */ -static int16 hist_start[PGLZ_MAX_HISTORY_LISTS]; -static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE + 1]; - -/* - * Element 0 in hist_entries is unused, and means 'invalid'. Likewise, - * INVALID_ENTRY_PTR in next/prev pointers mean 'invalid'. 
- */ -#define INVALID_ENTRY 0 -#define INVALID_ENTRY_PTR (&hist_entries[INVALID_ENTRY]) - -/* ---------- - * pglz_hist_idx - - * - * Computes the history table slot for the lookup by the next 4 - * characters in the input. - * - * NB: because we use the next 4 characters, we are not guaranteed to - * find 3-character matches; they very possibly will be in the wrong - * hash list. This seems an acceptable tradeoff for spreading out the - * hash keys more. - * ---------- - */ -#define pglz_hist_idx(_s,_e, _mask) ( \ - ((((_e) - (_s)) < 4) ? (int) (_s)[0] : \ - (((_s)[0] << 6) ^ ((_s)[1] << 4) ^ \ - ((_s)[2] << 2) ^ (_s)[3])) & (_mask) \ - ) - - -/* ---------- - * pglz_hist_add - - * - * Adds a new entry to the history table. - * - * If _recycle is true, then we are recycling a previously used entry, - * and must first delink it from its old hashcode's linked list. - * - * NOTE: beware of multiple evaluations of macro's arguments, and note that - * _hn and _recycle are modified in the macro. - * ---------- - */ -#define pglz_hist_add(_hs,_he,_hn,_recycle,_s,_e, _mask) \ -do { \ - int __hindex = pglz_hist_idx((_s),(_e), (_mask)); \ - int16 *__myhsp = &(_hs)[__hindex]; \ - PGLZ_HistEntry *__myhe = &(_he)[_hn]; \ - if (_recycle) { \ - if (__myhe->prev == NULL) \ - (_hs)[__myhe->hindex] = __myhe->next - (_he); \ - else \ - __myhe->prev->next = __myhe->next; \ - if (__myhe->next != NULL) \ - __myhe->next->prev = __myhe->prev; \ - } \ - __myhe->next = &(_he)[*__myhsp]; \ - __myhe->prev = NULL; \ - __myhe->hindex = __hindex; \ - __myhe->pos = (_s); \ - /* If there was an existing entry in this hash slot, link */ \ - /* this new entry to it. However, the 0th entry in the */ \ - /* entries table is unused, so we can freely scribble on it. */ \ - /* So don't bother checking if the slot was used - we'll */ \ - /* scribble on the unused entry if it was not, but that's */ \ - /* harmless. Avoiding the branch in this critical path */ \ - /* speeds this up a little bit. */ \ - /* if (*__myhsp != INVALID_ENTRY) */ \ - (_he)[(*__myhsp)].prev = __myhe; \ - *__myhsp = _hn; \ - if (++(_hn) >= PGLZ_HISTORY_SIZE + 1) { \ - (_hn) = 1; \ - (_recycle) = true; \ - } \ -} while (0) - - -/* ---------- - * pglz_out_ctrl - - * - * Outputs the last and allocates a new control byte if needed. - * ---------- - */ -#define pglz_out_ctrl(__ctrlp,__ctrlb,__ctrl,__buf) \ -do { \ - if ((__ctrl & 0xff) == 0) \ - { \ - *(__ctrlp) = __ctrlb; \ - __ctrlp = (__buf)++; \ - __ctrlb = 0; \ - __ctrl = 1; \ - } \ -} while (0) - - -/* ---------- - * pglz_out_literal - - * - * Outputs a literal byte to the destination buffer including the - * appropriate control bit. - * ---------- - */ -#define pglz_out_literal(_ctrlp,_ctrlb,_ctrl,_buf,_byte) \ -do { \ - pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \ - *(_buf)++ = (unsigned char)(_byte); \ - _ctrl <<= 1; \ -} while (0) - - -/* ---------- - * pglz_out_tag - - * - * Outputs a backward reference tag of 2-4 bytes (depending on - * offset and length) to the destination buffer including the - * appropriate control bit. 
- * ---------- - */ -#define pglz_out_tag(_ctrlp,_ctrlb,_ctrl,_buf,_len,_off) \ -do { \ - pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \ - _ctrlb |= _ctrl; \ - _ctrl <<= 1; \ - if (_len > 17) \ - { \ - (_buf)[0] = (unsigned char)((((_off) & 0xf00) >> 4) | 0x0f); \ - (_buf)[1] = (unsigned char)(((_off) & 0xff)); \ - (_buf)[2] = (unsigned char)((_len) - 18); \ - (_buf) += 3; \ - } else { \ - (_buf)[0] = (unsigned char)((((_off) & 0xf00) >> 4) | ((_len) - 3)); \ - (_buf)[1] = (unsigned char)((_off) & 0xff); \ - (_buf) += 2; \ - } \ -} while (0) - - -/* ---------- - * pglz_find_match - - * - * Lookup the history table if the actual input stream matches - * another sequence of characters, starting somewhere earlier - * in the input buffer. - * ---------- - */ -static inline int -pglz_find_match(int16 *hstart, const char *input, const char *end, - int *lenp, int *offp, int good_match, int good_drop, int mask) -{ - PGLZ_HistEntry *hent; - int16 hentno; - int32 len = 0; - int32 off = 0; - - /* - * Traverse the linked history list until a good enough match is found. - */ - hentno = hstart[pglz_hist_idx(input, end, mask)]; - hent = &hist_entries[hentno]; - while (hent != INVALID_ENTRY_PTR) - { - const char *ip = input; - const char *hp = hent->pos; - int32 thisoff; - int32 thislen; - - /* - * Stop if the offset does not fit into our tag anymore. - */ - thisoff = ip - hp; - if (thisoff >= 0x0fff) - break; - - /* - * Determine length of match. A better match must be larger than the - * best so far. And if we already have a match of 16 or more bytes, - * it's worth the call overhead to use memcmp() to check if this match - * is equal for the same size. After that we must fallback to - * character by character comparison to know the exact position where - * the diff occurred. - */ - thislen = 0; - if (len >= 16) - { - if (memcmp(ip, hp, len) == 0) - { - thislen = len; - ip += len; - hp += len; - while (ip < end && *ip == *hp && thislen < PGLZ_MAX_MATCH) - { - thislen++; - ip++; - hp++; - } - } - } - else - { - while (ip < end && *ip == *hp && thislen < PGLZ_MAX_MATCH) - { - thislen++; - ip++; - hp++; - } - } - - /* - * Remember this match as the best (if it is) - */ - if (thislen > len) - { - len = thislen; - off = thisoff; - } - - /* - * Advance to the next history entry - */ - hent = hent->next; - - /* - * Be happy with lesser good matches the more entries we visited. But - * no point in doing calculation if we're at end of list. - */ - if (hent != INVALID_ENTRY_PTR) - { - if (len >= good_match) - break; - good_match -= (good_match * good_drop) / 100; - } - } - - /* - * Return match information only if it results at least in one byte - * reduction. - */ - if (len > 2) - { - *lenp = len; - *offp = off; - return 1; - } - - return 0; -} - - -/* ---------- - * pglz_compress - - * - * Compresses source into dest using strategy. 
- * ---------- - */ -bool -pglz_compress(const char *source, int32 slen, PGLZ_Header *dest, - const PGLZ_Strategy *strategy) -{ - unsigned char *bp = ((unsigned char *) dest) + sizeof(PGLZ_Header); - unsigned char *bstart = bp; - int hist_next = 1; - bool hist_recycle = false; - const char *dp = source; - const char *dend = source + slen; - unsigned char ctrl_dummy = 0; - unsigned char *ctrlp = &ctrl_dummy; - unsigned char ctrlb = 0; - unsigned char ctrl = 0; - bool found_match = false; - int32 match_len; - int32 match_off; - int32 good_match; - int32 good_drop; - int32 result_size; - int32 result_max; - int32 need_rate; - int hashsz; - int mask; - - /* - * Our fallback strategy is the default. - */ - if (strategy == NULL) - strategy = PGLZ_strategy_default; - - /* - * If the strategy forbids compression (at all or if source chunk size out - * of range), fail. - */ - if (strategy->match_size_good <= 0 || - slen < strategy->min_input_size || - slen > strategy->max_input_size) - return false; - - /* - * Save the original source size in the header. - */ - dest->rawsize = slen; - - /* - * Limit the match parameters to the supported range. - */ - good_match = strategy->match_size_good; - if (good_match > PGLZ_MAX_MATCH) - good_match = PGLZ_MAX_MATCH; - else if (good_match < 17) - good_match = 17; - - good_drop = strategy->match_size_drop; - if (good_drop < 0) - good_drop = 0; - else if (good_drop > 100) - good_drop = 100; - - need_rate = strategy->min_comp_rate; - if (need_rate < 0) - need_rate = 0; - else if (need_rate > 99) - need_rate = 99; - - /* - * Compute the maximum result size allowed by the strategy, namely the - * input size minus the minimum wanted compression rate. This had better - * be <= slen, else we might overrun the provided output buffer. - */ - if (slen > (INT_MAX / 100)) - { - /* Approximate to avoid overflow */ - result_max = (slen / 100) * (100 - need_rate); - } - else - result_max = (slen * (100 - need_rate)) / 100; - - /* - * Experiments suggest that these hash sizes work pretty well. A large - * hash table minimizes collision, but has a higher startup cost. For a - * small input, the startup cost dominates. The table size must be a power - * of two. - */ - if (slen < 128) - hashsz = 512; - else if (slen < 256) - hashsz = 1024; - else if (slen < 512) - hashsz = 2048; - else if (slen < 1024) - hashsz = 4096; - else - hashsz = 8192; - mask = hashsz - 1; - - /* - * Initialize the history lists to empty. We do not need to zero the - * hist_entries[] array; its entries are initialized as they are used. - */ - memset(hist_start, 0, hashsz * sizeof(int16)); - - /* - * Compress the source directly into the output buffer. - */ - while (dp < dend) - { - /* - * If we already exceeded the maximum result size, fail. - * - * We check once per loop; since the loop body could emit as many as 4 - * bytes (a control byte and 3-byte tag), PGLZ_MAX_OUTPUT() had better - * allow 4 slop bytes. - */ - if (bp - bstart >= result_max) - return false; - - /* - * If we've emitted more than first_success_by bytes without finding - * anything compressible at all, fail. This lets us fall out - * reasonably quickly when looking at incompressible input (such as - * pre-compressed data). - */ - if (!found_match && bp - bstart >= strategy->first_success_by) - return false; - - /* - * Try to find a match in the history - */ - if (pglz_find_match(hist_start, dp, dend, &match_len, - &match_off, good_match, good_drop, mask)) - { - /* - * Create the tag and add history entries for all matched - * characters. 
- */ - pglz_out_tag(ctrlp, ctrlb, ctrl, bp, match_len, match_off); - while (match_len--) - { - pglz_hist_add(hist_start, hist_entries, - hist_next, hist_recycle, - dp, dend, mask); - dp++; /* Do not do this ++ in the line above! */ - /* The macro would do it four times - Jan. */ - } - found_match = true; - } - else - { - /* - * No match found. Copy one literal byte. - */ - pglz_out_literal(ctrlp, ctrlb, ctrl, bp, *dp); - pglz_hist_add(hist_start, hist_entries, - hist_next, hist_recycle, - dp, dend, mask); - dp++; /* Do not do this ++ in the line above! */ - /* The macro would do it four times - Jan. */ - } - } - - /* - * Write out the last control byte and check that we haven't overrun the - * output size allowed by the strategy. - */ - *ctrlp = ctrlb; - result_size = bp - bstart; - if (result_size >= result_max) - return false; - - /* - * Success - need only fill in the actual length of the compressed datum. - */ - SET_VARSIZE_COMPRESSED(dest, result_size + sizeof(PGLZ_Header)); - - return true; -} - - -/* ---------- - * pglz_decompress - - * - * Decompresses source into dest. - * ---------- - */ -void -pglz_decompress(const PGLZ_Header *source, char *dest) -{ - const unsigned char *sp; - const unsigned char *srcend; - unsigned char *dp; - unsigned char *destend; - - sp = ((const unsigned char *) source) + sizeof(PGLZ_Header); - srcend = ((const unsigned char *) source) + VARSIZE(source); - dp = (unsigned char *) dest; - destend = dp + source->rawsize; - - while (sp < srcend && dp < destend) - { - /* - * Read one control byte and process the next 8 items (or as many as - * remain in the compressed input). - */ - unsigned char ctrl = *sp++; - int ctrlc; - - for (ctrlc = 0; ctrlc < 8 && sp < srcend; ctrlc++) - { - if (ctrl & 1) - { - /* - * Otherwise it contains the match length minus 3 and the - * upper 4 bits of the offset. The next following byte - * contains the lower 8 bits of the offset. If the length is - * coded as 18, another extension tag byte tells how much - * longer the match really was (0-255). - */ - int32 len; - int32 off; - - len = (sp[0] & 0x0f) + 3; - off = ((sp[0] & 0xf0) << 4) | sp[1]; - sp += 2; - if (len == 18) - len += *sp++; - - /* - * Check for output buffer overrun, to ensure we don't clobber - * memory in case of corrupt input. Note: we must advance dp - * here to ensure the error is detected below the loop. We - * don't simply put the elog inside the loop since that will - * probably interfere with optimization. - */ - if (dp + len > destend) - { - dp += len; - break; - } - - /* - * Now we copy the bytes specified by the tag from OUTPUT to - * OUTPUT. It is dangerous and platform dependent to use - * memcpy() here, because the copied areas could overlap - * extremely! - */ - while (len--) - { - *dp = dp[-off]; - dp++; - } - } - else - { - /* - * An unset control bit means LITERAL BYTE. So we just copy - * one from INPUT to OUTPUT. - */ - if (dp >= destend) /* check for buffer overrun */ - break; /* do not clobber memory */ - - *dp++ = *sp++; - } - - /* - * Advance the control bit - */ - ctrl >>= 1; - } - } - - /* - * Check we decompressed the right amount. - */ - if (dp != destend || sp != srcend) - elog(ERROR, "compressed data is corrupt"); - - /* - * That's it. 
- */ -} diff --git a/src/common/Makefile b/src/common/Makefile index 7edbaaa..bd77c1d 100644 --- a/src/common/Makefile +++ b/src/common/Makefile @@ -23,7 +23,8 @@ include $(top_builddir)/src/Makefile.global override CPPFLAGS := -DFRONTEND $(CPPFLAGS) LIBS += $(PTHREAD_LIBS) -OBJS_COMMON = exec.o pgfnames.o psprintf.o relpath.o rmtree.o username.o wait_error.o +OBJS_COMMON = exec.o pg_lzcompress.o pgfnames.o psprintf.o relpath.o \ + rmtree.o username.o wait_error.o OBJS_FRONTEND = $(OBJS_COMMON) fe_memutils.o diff --git a/src/common/pg_lzcompress.c b/src/common/pg_lzcompress.c new file mode 100644 index 0000000..6163142 --- /dev/null +++ b/src/common/pg_lzcompress.c @@ -0,0 +1,784 @@ +/* ---------- + * pg_lzcompress.c - + * + * This is an implementation of LZ compression for PostgreSQL. + * It uses a simple history table and generates 2-3 byte tags + * capable of backward copy information for 3-273 bytes with + * a max offset of 4095. + * + * Entry routines: + * + * PGLZ_Status + * pglz_compress(const char *source, int32 slen, PGLZ_Header *dest, + * const PGLZ_Strategy *strategy); + * + * source is the input data to be compressed. + * + * slen is the length of the input data. + * + * dest is the output area for the compressed result. + * It must be at least as big as PGLZ_MAX_OUTPUT(slen). + * + * strategy is a pointer to some information controlling + * the compression algorithm. If NULL, the compiled + * in default strategy is used. + * + * The return value is PGLZ_OK if compression succeeded, + * or another state if not depending on the error reached; + * in the latter case the contents of dest are undefined. + * + * PGLZ_Status + * pglz_decompress(const PGLZ_Header *source, char *dest) + * + * source is the compressed input. + * + * dest is the area where the uncompressed data will be + * written to. It is the callers responsibility to + * provide enough space. The required amount can be + * obtained with the macro PGLZ_RAW_SIZE(source). + * + * The data is written to buff exactly as it was handed + * to pglz_compress(). No terminating zero byte is added. + * + * The return value is PGLZ_OK if decompression succeeded, + * or another state if not depending on the error reached. + * + * The decompression algorithm and internal data format: + * + * PGLZ_Header is defined as + * + * typedef struct PGLZ_Header { + * int32 vl_len_; + * int32 rawsize; + * } + * + * The header is followed by the compressed data itself. + * + * The data representation is easiest explained by describing + * the process of decompression. + * + * If VARSIZE(x) == rawsize + sizeof(PGLZ_Header), then the data + * is stored uncompressed as plain bytes. Thus, the decompressor + * simply copies rawsize bytes from the location after the + * header to the destination. + * + * Otherwise the first byte after the header tells what to do + * the next 8 times. We call this the control byte. + * + * An unset bit in the control byte means, that one uncompressed + * byte follows, which is copied from input to output. + * + * A set bit in the control byte means, that a tag of 2-3 bytes + * follows. A tag contains information to copy some bytes, that + * are already in the output buffer, to the current location in + * the output. Let's call the three tag bytes T1, T2 and T3. The + * position of the data to copy is coded as an offset from the + * actual output position. + * + * The offset is in the upper nibble of T1 and in T2. + * The length is in the lower nibble of T1. 
+ * + * So the 16 bits of a 2 byte tag are coded as + * + * 7---T1--0 7---T2--0 + * OOOO LLLL OOOO OOOO + * + * This limits the offset to 1-4095 (12 bits) and the length + * to 3-18 (4 bits) because 3 is always added to it. To emit + * a tag of 2 bytes with a length of 2 only saves one control + * bit. But we lose one byte in the possible length of a tag. + * + * In the actual implementation, the 2 byte tag's length is + * limited to 3-17, because the value 0xF in the length nibble + * has special meaning. It means, that the next following + * byte (T3) has to be added to the length value of 18. That + * makes total limits of 1-4095 for offset and 3-273 for length. + * + * Now that we have successfully decoded a tag. We simply copy + * the output that occurred <offset> bytes back to the current + * output location in the specified <length>. Thus, a + * sequence of 200 spaces (think about bpchar fields) could be + * coded in 4 bytes. One literal space and a three byte tag to + * copy 199 bytes with a -1 offset. Whow - that's a compression + * rate of 98%! Well, the implementation needs to save the + * original data size too, so we need another 4 bytes for it + * and end up with a total compression rate of 96%, what's still + * worth a Whow. + * + * The compression algorithm + * + * The following uses numbers used in the default strategy. + * + * The compressor works best for attributes of a size between + * 1K and 1M. For smaller items there's not that much chance of + * redundancy in the character sequence (except for large areas + * of identical bytes like trailing spaces) and for bigger ones + * our 4K maximum look-back distance is too small. + * + * The compressor creates a table for lists of positions. + * For each input position (except the last 3), a hash key is + * built from the 4 next input bytes and the position remembered + * in the appropriate list. Thus, the table points to linked + * lists of likely to be at least in the first 4 characters + * matching strings. This is done on the fly while the input + * is compressed into the output area. Table entries are only + * kept for the last 4096 input positions, since we cannot use + * back-pointers larger than that anyway. The size of the hash + * table is chosen based on the size of the input - a larger table + * has a larger startup cost, as it needs to be initialized to + * zero, but reduces the number of hash collisions on long inputs. + * + * For each byte in the input, its hash key (built from this + * byte and the next 3) is used to find the appropriate list + * in the table. The lists remember the positions of all bytes + * that had the same hash key in the past in increasing backward + * offset order. Now for all entries in the used lists, the + * match length is computed by comparing the characters from the + * entries position with the characters from the actual input + * position. + * + * The compressor starts with a so called "good_match" of 128. + * It is a "prefer speed against compression ratio" optimizer. + * So if the first entry looked at already has 128 or more + * matching characters, the lookup stops and that position is + * used for the next tag in the output. + * + * For each subsequent entry in the history list, the "good_match" + * is lowered by 10%. So the compressor will be more happy with + * short matches the farer it has to go back in the history. + * Another "speed against ratio" preference characteristic of + * the algorithm. 
+ * + * Thus there are 3 stop conditions for the lookup of matches: + * + * - a match >= good_match is found + * - there are no more history entries to look at + * - the next history entry is already too far back + * to be coded into a tag. + * + * Finally the match algorithm checks that at least a match + * of 3 or more bytes has been found, because thats the smallest + * amount of copy information to code into a tag. If so, a tag + * is omitted and all the input bytes covered by that are just + * scanned for the history add's, otherwise a literal character + * is omitted and only his history entry added. + * + * Acknowledgements: + * + * Many thanks to Adisak Pochanayon, who's article about SLZ + * inspired me to write the PostgreSQL compression this way. + * + * Jan Wieck + * + * Copyright (c) 1999-2014, PostgreSQL Global Development Group + * + * src/backend/utils/adt/pg_lzcompress.c + * ---------- + */ +#include "postgres.h" + +#include <limits.h> + +#include "utils/pg_lzcompress.h" + + +/* ---------- + * Local definitions + * ---------- + */ +#define PGLZ_MAX_HISTORY_LISTS 8192 /* must be power of 2 */ +#define PGLZ_HISTORY_SIZE 4096 +#define PGLZ_MAX_MATCH 273 + + +/* ---------- + * PGLZ_HistEntry - + * + * Linked list for the backward history lookup + * + * All the entries sharing a hash key are linked in a doubly linked list. + * This makes it easy to remove an entry when it's time to recycle it + * (because it's more than 4K positions old). + * ---------- + */ +typedef struct PGLZ_HistEntry +{ + struct PGLZ_HistEntry *next; /* links for my hash key's list */ + struct PGLZ_HistEntry *prev; + int hindex; /* my current hash key */ + const char *pos; /* my input position */ +} PGLZ_HistEntry; + + +/* ---------- + * The provided standard strategies + * ---------- + */ +static const PGLZ_Strategy strategy_default_data = { + 32, /* Data chunks less than 32 bytes are not + * compressed */ + INT_MAX, /* No upper limit on what we'll try to + * compress */ + 25, /* Require 25% compression rate, or not worth + * it */ + 1024, /* Give up if no compression in the first 1KB */ + 128, /* Stop history lookup if a match of 128 bytes + * is found */ + 10 /* Lower good match size by 10% at every loop + * iteration */ +}; +const PGLZ_Strategy *const PGLZ_strategy_default = &strategy_default_data; + + +static const PGLZ_Strategy strategy_always_data = { + 0, /* Chunks of any size are compressed */ + INT_MAX, + 0, /* It's enough to save one single byte */ + INT_MAX, /* Never give up early */ + 128, /* Stop history lookup if a match of 128 bytes + * is found */ + 6 /* Look harder for a good match */ +}; +const PGLZ_Strategy *const PGLZ_strategy_always = &strategy_always_data; + + +/* ---------- + * Statically allocated work arrays for history + * ---------- + */ +static int16 hist_start[PGLZ_MAX_HISTORY_LISTS]; +static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE + 1]; + +/* + * Element 0 in hist_entries is unused, and means 'invalid'. Likewise, + * INVALID_ENTRY_PTR in next/prev pointers mean 'invalid'. + */ +#define INVALID_ENTRY 0 +#define INVALID_ENTRY_PTR (&hist_entries[INVALID_ENTRY]) + +/* ---------- + * pglz_hist_idx - + * + * Computes the history table slot for the lookup by the next 4 + * characters in the input. + * + * NB: because we use the next 4 characters, we are not guaranteed to + * find 3-character matches; they very possibly will be in the wrong + * hash list. This seems an acceptable tradeoff for spreading out the + * hash keys more. 
+ * ---------- + */ +#define pglz_hist_idx(_s,_e, _mask) ( \ + ((((_e) - (_s)) < 4) ? (int) (_s)[0] : \ + (((_s)[0] << 6) ^ ((_s)[1] << 4) ^ \ + ((_s)[2] << 2) ^ (_s)[3])) & (_mask) \ + ) + + +/* ---------- + * pglz_hist_add - + * + * Adds a new entry to the history table. + * + * If _recycle is true, then we are recycling a previously used entry, + * and must first delink it from its old hashcode's linked list. + * + * NOTE: beware of multiple evaluations of macro's arguments, and note that + * _hn and _recycle are modified in the macro. + * ---------- + */ +#define pglz_hist_add(_hs,_he,_hn,_recycle,_s,_e, _mask) \ +do { \ + int __hindex = pglz_hist_idx((_s),(_e), (_mask)); \ + int16 *__myhsp = &(_hs)[__hindex]; \ + PGLZ_HistEntry *__myhe = &(_he)[_hn]; \ + if (_recycle) { \ + if (__myhe->prev == NULL) \ + (_hs)[__myhe->hindex] = __myhe->next - (_he); \ + else \ + __myhe->prev->next = __myhe->next; \ + if (__myhe->next != NULL) \ + __myhe->next->prev = __myhe->prev; \ + } \ + __myhe->next = &(_he)[*__myhsp]; \ + __myhe->prev = NULL; \ + __myhe->hindex = __hindex; \ + __myhe->pos = (_s); \ + /* If there was an existing entry in this hash slot, link */ \ + /* this new entry to it. However, the 0th entry in the */ \ + /* entries table is unused, so we can freely scribble on it. */ \ + /* So don't bother checking if the slot was used - we'll */ \ + /* scribble on the unused entry if it was not, but that's */ \ + /* harmless. Avoiding the branch in this critical path */ \ + /* speeds this up a little bit. */ \ + /* if (*__myhsp != INVALID_ENTRY) */ \ + (_he)[(*__myhsp)].prev = __myhe; \ + *__myhsp = _hn; \ + if (++(_hn) >= PGLZ_HISTORY_SIZE + 1) { \ + (_hn) = 1; \ + (_recycle) = true; \ + } \ +} while (0) + + +/* ---------- + * pglz_out_ctrl - + * + * Outputs the last and allocates a new control byte if needed. + * ---------- + */ +#define pglz_out_ctrl(__ctrlp,__ctrlb,__ctrl,__buf) \ +do { \ + if ((__ctrl & 0xff) == 0) \ + { \ + *(__ctrlp) = __ctrlb; \ + __ctrlp = (__buf)++; \ + __ctrlb = 0; \ + __ctrl = 1; \ + } \ +} while (0) + + +/* ---------- + * pglz_out_literal - + * + * Outputs a literal byte to the destination buffer including the + * appropriate control bit. + * ---------- + */ +#define pglz_out_literal(_ctrlp,_ctrlb,_ctrl,_buf,_byte) \ +do { \ + pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \ + *(_buf)++ = (unsigned char)(_byte); \ + _ctrl <<= 1; \ +} while (0) + + +/* ---------- + * pglz_out_tag - + * + * Outputs a backward reference tag of 2-4 bytes (depending on + * offset and length) to the destination buffer including the + * appropriate control bit. + * ---------- + */ +#define pglz_out_tag(_ctrlp,_ctrlb,_ctrl,_buf,_len,_off) \ +do { \ + pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \ + _ctrlb |= _ctrl; \ + _ctrl <<= 1; \ + if (_len > 17) \ + { \ + (_buf)[0] = (unsigned char)((((_off) & 0xf00) >> 4) | 0x0f); \ + (_buf)[1] = (unsigned char)(((_off) & 0xff)); \ + (_buf)[2] = (unsigned char)((_len) - 18); \ + (_buf) += 3; \ + } else { \ + (_buf)[0] = (unsigned char)((((_off) & 0xf00) >> 4) | ((_len) - 3)); \ + (_buf)[1] = (unsigned char)((_off) & 0xff); \ + (_buf) += 2; \ + } \ +} while (0) + + +/* ---------- + * pglz_find_match - + * + * Lookup the history table if the actual input stream matches + * another sequence of characters, starting somewhere earlier + * in the input buffer. 
+ * ---------- + */ +static inline int +pglz_find_match(int16 *hstart, const char *input, const char *end, + int *lenp, int *offp, int good_match, int good_drop, int mask) +{ + PGLZ_HistEntry *hent; + int16 hentno; + int32 len = 0; + int32 off = 0; + + /* + * Traverse the linked history list until a good enough match is found. + */ + hentno = hstart[pglz_hist_idx(input, end, mask)]; + hent = &hist_entries[hentno]; + while (hent != INVALID_ENTRY_PTR) + { + const char *ip = input; + const char *hp = hent->pos; + int32 thisoff; + int32 thislen; + + /* + * Stop if the offset does not fit into our tag anymore. + */ + thisoff = ip - hp; + if (thisoff >= 0x0fff) + break; + + /* + * Determine length of match. A better match must be larger than the + * best so far. And if we already have a match of 16 or more bytes, + * it's worth the call overhead to use memcmp() to check if this match + * is equal for the same size. After that we must fallback to + * character by character comparison to know the exact position where + * the diff occurred. + */ + thislen = 0; + if (len >= 16) + { + if (memcmp(ip, hp, len) == 0) + { + thislen = len; + ip += len; + hp += len; + while (ip < end && *ip == *hp && thislen < PGLZ_MAX_MATCH) + { + thislen++; + ip++; + hp++; + } + } + } + else + { + while (ip < end && *ip == *hp && thislen < PGLZ_MAX_MATCH) + { + thislen++; + ip++; + hp++; + } + } + + /* + * Remember this match as the best (if it is) + */ + if (thislen > len) + { + len = thislen; + off = thisoff; + } + + /* + * Advance to the next history entry + */ + hent = hent->next; + + /* + * Be happy with lesser good matches the more entries we visited. But + * no point in doing calculation if we're at end of list. + */ + if (hent != INVALID_ENTRY_PTR) + { + if (len >= good_match) + break; + good_match -= (good_match * good_drop) / 100; + } + } + + /* + * Return match information only if it results at least in one byte + * reduction. + */ + if (len > 2) + { + *lenp = len; + *offp = off; + return 1; + } + + return 0; +} + + +/* ---------- + * pglz_compress - + * + * Compresses source into dest using strategy. + * ---------- + */ +PGLZ_Status +pglz_compress(const char *source, int32 slen, PGLZ_Header *dest, + const PGLZ_Strategy *strategy) +{ + unsigned char *bp = ((unsigned char *) dest) + sizeof(PGLZ_Header); + unsigned char *bstart = bp; + int hist_next = 1; + bool hist_recycle = false; + const char *dp = source; + const char *dend = source + slen; + unsigned char ctrl_dummy = 0; + unsigned char *ctrlp = &ctrl_dummy; + unsigned char ctrlb = 0; + unsigned char ctrl = 0; + bool found_match = false; + int32 match_len; + int32 match_off; + int32 good_match; + int32 good_drop; + int32 result_size; + int32 result_max; + int32 need_rate; + int hashsz; + int mask; + + /* + * Our fallback strategy is the default. + */ + if (strategy == NULL) + strategy = PGLZ_strategy_default; + + /* + * If the strategy forbids compression (at all or if source chunk size out + * of range), fail. + */ + if (strategy->match_size_good <= 0 || + slen < strategy->min_input_size || + slen > strategy->max_input_size) + return PGLZ_FORBIDDEN; + + /* + * Save the original source size in the header. + */ + dest->rawsize = slen; + + /* + * Limit the match parameters to the supported range. 
+ */ + good_match = strategy->match_size_good; + if (good_match > PGLZ_MAX_MATCH) + good_match = PGLZ_MAX_MATCH; + else if (good_match < 17) + good_match = 17; + + good_drop = strategy->match_size_drop; + if (good_drop < 0) + good_drop = 0; + else if (good_drop > 100) + good_drop = 100; + + need_rate = strategy->min_comp_rate; + if (need_rate < 0) + need_rate = 0; + else if (need_rate > 99) + need_rate = 99; + + /* + * Compute the maximum result size allowed by the strategy, namely the + * input size minus the minimum wanted compression rate. This had better + * be <= slen, else we might overrun the provided output buffer. + */ + if (slen > (INT_MAX / 100)) + { + /* Approximate to avoid overflow */ + result_max = (slen / 100) * (100 - need_rate); + } + else + result_max = (slen * (100 - need_rate)) / 100; + + /* + * Experiments suggest that these hash sizes work pretty well. A large + * hash table minimizes collision, but has a higher startup cost. For a + * small input, the startup cost dominates. The table size must be a power + * of two. + */ + if (slen < 128) + hashsz = 512; + else if (slen < 256) + hashsz = 1024; + else if (slen < 512) + hashsz = 2048; + else if (slen < 1024) + hashsz = 4096; + else + hashsz = 8192; + mask = hashsz - 1; + + /* + * Initialize the history lists to empty. We do not need to zero the + * hist_entries[] array; its entries are initialized as they are used. + */ + memset(hist_start, 0, hashsz * sizeof(int16)); + + /* + * Compress the source directly into the output buffer. + */ + while (dp < dend) + { + /* + * If we already exceeded the maximum result size, fail. + * + * We check once per loop; since the loop body could emit as many as 4 + * bytes (a control byte and 3-byte tag), PGLZ_MAX_OUTPUT() had better + * allow 4 slop bytes. + */ + if (bp - bstart >= result_max) + return PGLZ_OVERRUN; + + /* + * If we've emitted more than first_success_by bytes without finding + * anything compressible at all, fail. This lets us fall out + * reasonably quickly when looking at incompressible input (such as + * pre-compressed data). + */ + if (!found_match && bp - bstart >= strategy->first_success_by) + return PGLZ_INCOMPRESSIBLE; + + /* + * Try to find a match in the history + */ + if (pglz_find_match(hist_start, dp, dend, &match_len, + &match_off, good_match, good_drop, mask)) + { + /* + * Create the tag and add history entries for all matched + * characters. + */ + pglz_out_tag(ctrlp, ctrlb, ctrl, bp, match_len, match_off); + while (match_len--) + { + pglz_hist_add(hist_start, hist_entries, + hist_next, hist_recycle, + dp, dend, mask); + dp++; /* Do not do this ++ in the line above! */ + /* The macro would do it four times - Jan. */ + } + found_match = true; + } + else + { + /* + * No match found. Copy one literal byte. + */ + pglz_out_literal(ctrlp, ctrlb, ctrl, bp, *dp); + pglz_hist_add(hist_start, hist_entries, + hist_next, hist_recycle, + dp, dend, mask); + dp++; /* Do not do this ++ in the line above! */ + /* The macro would do it four times - Jan. */ + } + } + + /* + * Write out the last control byte and check that we haven't overrun the + * output size allowed by the strategy. + */ + *ctrlp = ctrlb; + result_size = bp - bstart; + if (result_size >= result_max) + return PGLZ_OVERRUN; + + /* + * Success - need only fill in the actual length of the compressed datum. + */ + SET_VARSIZE_COMPRESSED(dest, result_size + sizeof(PGLZ_Header)); + + return PGLZ_OK; +} + + +/* ---------- + * pglz_decompress - + * + * Decompresses source into dest. 
Returns false if a failure + * occurred, true in case of success. + * ---------- + */ +PGLZ_Status +pglz_decompress(const PGLZ_Header *source, char *dest) +{ + const unsigned char *sp; + const unsigned char *srcend; + unsigned char *dp; + unsigned char *destend; + + sp = ((const unsigned char *) source) + sizeof(PGLZ_Header); + srcend = ((const unsigned char *) source) + VARSIZE(source); + dp = (unsigned char *) dest; + destend = dp + source->rawsize; + + while (sp < srcend && dp < destend) + { + /* + * Read one control byte and process the next 8 items (or as many as + * remain in the compressed input). + */ + unsigned char ctrl = *sp++; + int ctrlc; + + for (ctrlc = 0; ctrlc < 8 && sp < srcend; ctrlc++) + { + if (ctrl & 1) + { + /* + * Otherwise it contains the match length minus 3 and the + * upper 4 bits of the offset. The next following byte + * contains the lower 8 bits of the offset. If the length is + * coded as 18, another extension tag byte tells how much + * longer the match really was (0-255). + */ + int32 len; + int32 off; + + len = (sp[0] & 0x0f) + 3; + off = ((sp[0] & 0xf0) << 4) | sp[1]; + sp += 2; + if (len == 18) + len += *sp++; + + /* + * Check for output buffer overrun, to ensure we don't clobber + * memory in case of corrupt input. Note: we must advance dp + * here to ensure the error is detected below the loop. We + * don't simply put the elog inside the loop since that will + * probably interfere with optimization. + */ + if (dp + len > destend) + { + dp += len; + break; + } + + /* + * Now we copy the bytes specified by the tag from OUTPUT to + * OUTPUT. It is dangerous and platform dependent to use + * memcpy() here, because the copied areas could overlap + * extremely! + */ + while (len--) + { + *dp = dp[-off]; + dp++; + } + } + else + { + /* + * An unset control bit means LITERAL BYTE. So we just copy + * one from INPUT to OUTPUT. + */ + if (dp >= destend) /* check for buffer overrun */ + break; /* do not clobber memory */ + + *dp++ = *sp++; + } + + /* + * Advance the control bit + */ + ctrl >>= 1; + } + } + + /* + * Check we decompressed the right amount. + */ + if (dp != destend || sp != srcend) + return PGLZ_OVERRUN; + + /* + * That's it. + */ + return PGLZ_OK; +} diff --git a/src/include/utils/pg_lzcompress.h b/src/include/utils/pg_lzcompress.h index 4af24a3..529619c 100644 --- a/src/include/utils/pg_lzcompress.h +++ b/src/include/utils/pg_lzcompress.h @@ -23,6 +23,19 @@ typedef struct PGLZ_Header int32 rawsize; } PGLZ_Header; +/* ---------- + * PGLZ_Status - + * + * Return status of compression and decompression functions. 
+ * ---------- + */ +typedef enum PGLZ_Status +{ + PGLZ_OK = 0, + PGLZ_INCOMPRESSIBLE, /* incompressible data */ + PGLZ_FORBIDDEN, /* strategy failure */ + PGLZ_OVERRUN /* exceed result size */ +} PGLZ_Status; /* ---------- * PGLZ_MAX_OUTPUT - @@ -105,8 +118,8 @@ extern const PGLZ_Strategy *const PGLZ_strategy_always; * Global function declarations * ---------- */ -extern bool pglz_compress(const char *source, int32 slen, PGLZ_Header *dest, - const PGLZ_Strategy *strategy); -extern void pglz_decompress(const PGLZ_Header *source, char *dest); +extern PGLZ_Status pglz_compress(const char *source, int32 slen, + PGLZ_Header *dest, const PGLZ_Strategy *strategy); +extern PGLZ_Status pglz_decompress(const PGLZ_Header *source, char *dest); #endif /* _PG_LZCOMPRESS_H_ */ diff --git a/src/tools/msvc/Mkvcbuild.pm b/src/tools/msvc/Mkvcbuild.pm index 004942c..6779b18 100644 --- a/src/tools/msvc/Mkvcbuild.pm +++ b/src/tools/msvc/Mkvcbuild.pm @@ -76,7 +76,8 @@ sub mkvcbuild push(@pgportfiles, 'rint.c') if ($vsVersion < '12.00'); our @pgcommonallfiles = qw( - exec.c pgfnames.c psprintf.c relpath.c rmtree.c username.c wait_error.c); + exec.c pg_lzcompress.c pgfnames.c psprintf.c relpath.c rmtree.c + username.c wait_error.c); our @pgcommonfrontendfiles = (@pgcommonallfiles, qw(fe_memutils.c)); -- 2.2.0
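Before moving to the second patch: the net effect of the API change
above, from a caller's perspective, is that both pglz entry points now
return a PGLZ_Status instead of bool/void, so each caller decides how
to report failures. A hypothetical backend-side snippet, for
illustration only and not part of the patch:

#include "postgres.h"
#include "utils/pg_lzcompress.h"

/* Round-trip some data through the reworked pglz API. */
static void
pglz_roundtrip(const char *source, int32 slen)
{
	PGLZ_Header *compressed;
	char	   *restored;

	compressed = (PGLZ_Header *) palloc(PGLZ_MAX_OUTPUT(slen));

	/*
	 * A real caller would typically fall back to storing the data
	 * uncompressed on PGLZ_INCOMPRESSIBLE or PGLZ_FORBIDDEN instead of
	 * erroring out.
	 */
	if (pglz_compress(source, slen, compressed,
					  PGLZ_strategy_default) != PGLZ_OK)
		elog(ERROR, "could not compress data");

	restored = (char *) palloc(PGLZ_RAW_SIZE(compressed));
	if (pglz_decompress(compressed, restored) != PGLZ_OK)
		elog(ERROR, "compressed data is corrupted");

	pfree(compressed);
	pfree(restored);
}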
From 4ef8c90b3bcd79d4f9363527d022d7c04cbae737 Mon Sep 17 00:00:00 2001 From: Michael Paquier <mich...@otacoo.com> Date: Tue, 25 Nov 2014 14:24:26 +0900 Subject: [PATCH 2/2] Support compression for full-page writes in WAL Compression is controlled with a new parameter called wal_compression. This parameter can be changed at session level to control WAL compression. --- contrib/pg_xlogdump/pg_xlogdump.c | 9 +- doc/src/sgml/config.sgml | 24 +++++ src/backend/access/transam/xlog.c | 5 ++ src/backend/access/transam/xloginsert.c | 121 ++++++++++++++++++++++---- src/backend/access/transam/xlogreader.c | 39 +++++++-- src/backend/utils/misc/guc.c | 9 ++ src/backend/utils/misc/postgresql.conf.sample | 1 + src/bin/pg_controldata/pg_controldata.c | 2 + src/bin/pg_resetxlog/pg_resetxlog.c | 2 + src/include/access/xlog.h | 1 + src/include/access/xlog_internal.h | 1 + src/include/access/xlogreader.h | 4 + src/include/access/xlogrecord.h | 5 ++ src/include/catalog/pg_control.h | 1 + 14 files changed, 201 insertions(+), 23 deletions(-) diff --git a/contrib/pg_xlogdump/pg_xlogdump.c b/contrib/pg_xlogdump/pg_xlogdump.c index 9f05e25..b3db55f 100644 --- a/contrib/pg_xlogdump/pg_xlogdump.c +++ b/contrib/pg_xlogdump/pg_xlogdump.c @@ -369,7 +369,9 @@ XLogDumpCountRecord(XLogDumpConfig *config, XLogDumpStats *stats, fpi_len = 0; for (block_id = 0; block_id <= record->max_block_id; block_id++) { - if (XLogRecHasBlockImage(record, block_id)) + if (XLogRecHasCompressedBlockImage(record, block_id)) + fpi_len += record->blocks[block_id].compress_len; + else if (XLogRecHasBlockImage(record, block_id)) fpi_len += BLCKSZ - record->blocks[block_id].hole_length; } @@ -465,9 +467,10 @@ XLogDumpDisplayRecord(XLogDumpConfig *config, XLogReaderState *record) blk); if (XLogRecHasBlockImage(record, block_id)) { - printf(" (FPW); hole: offset: %u, length: %u\n", + printf(" (FPW); hole: offset: %u, length: %u, compressed: %u\n", record->blocks[block_id].hole_offset, - record->blocks[block_id].hole_length); + record->blocks[block_id].hole_length, + record->blocks[block_id].compress_len); } putchar('\n'); } diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index d607eca..4778c77 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -2254,6 +2254,30 @@ include_dir 'conf.d' </listitem> </varlistentry> + <varlistentry id="guc-wal-compression" xreflabel="wal_compression"> + <term><varname>wal_compression</varname> (<type>boolean</type>) + <indexterm> + <primary><varname>wal_compression</> configuration parameter</primary> + </indexterm> + </term> + <listitem> + <para> + When this parameter is <literal>on</>, the <productname>PostgreSQL</> + server compresses the content of full-page writes when necessary and + inserts in WAL a records with smaller sizes, reducing the amount of + WAL stored on disk. + </para> + + <para> + Compression has the advantage of reducing the amount of disk I/O when + doing WAL-logging, at the cost of some extra CPU to perform the + compression of an image. At WAL replay, compressed images still need + some more CPU cycles to perform the decompression of each block image, + but it can reduce as well replay time in I/O bounded environments. 
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-wal-buffers" xreflabel="wal_buffers">
       <term><varname>wal_buffers</varname> (<type>integer</type>)
       <indexterm>
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 0f09add..50dfed0 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -88,6 +88,7 @@ char	   *XLogArchiveCommand = NULL;
 bool		EnableHotStandby = false;
 bool		fullPageWrites = true;
 bool		wal_log_hints = false;
+bool		wal_compression = false;
 bool		log_checkpoints = false;
 int			sync_method = DEFAULT_SYNC_METHOD;
 int			wal_level = WAL_LEVEL_MINIMAL;
@@ -4610,6 +4611,7 @@ BootStrapXLOG(void)
 	ControlFile->max_locks_per_xact = max_locks_per_xact;
 	ControlFile->wal_level = wal_level;
 	ControlFile->wal_log_hints = wal_log_hints;
+	ControlFile->wal_compression = wal_compression;
 	ControlFile->track_commit_timestamp = track_commit_timestamp;
 	ControlFile->data_checksum_version = bootstrap_data_checksum_version;
 
@@ -8498,6 +8500,7 @@ XLogReportParameters(void)
 {
 	if (wal_level != ControlFile->wal_level ||
 		wal_log_hints != ControlFile->wal_log_hints ||
+		wal_compression != ControlFile->wal_compression ||
 		MaxConnections != ControlFile->MaxConnections ||
 		max_worker_processes != ControlFile->max_worker_processes ||
 		max_prepared_xacts != ControlFile->max_prepared_xacts ||
@@ -8522,6 +8525,7 @@ XLogReportParameters(void)
 		xlrec.max_locks_per_xact = max_locks_per_xact;
 		xlrec.wal_level = wal_level;
 		xlrec.wal_log_hints = wal_log_hints;
+		xlrec.wal_compression = wal_compression;
 		xlrec.track_commit_timestamp = track_commit_timestamp;
 
 		XLogBeginInsert();
@@ -8537,6 +8541,7 @@ XLogReportParameters(void)
 		ControlFile->max_locks_per_xact = max_locks_per_xact;
 		ControlFile->wal_level = wal_level;
 		ControlFile->wal_log_hints = wal_log_hints;
+		ControlFile->wal_compression = wal_compression;
 		ControlFile->track_commit_timestamp = track_commit_timestamp;
 		UpdateControlFile();
 	}
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index f3d610f..0b65eaf 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -27,9 +27,13 @@
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
 #include "storage/proc.h"
+#include "utils/pg_lzcompress.h"
 #include "utils/memutils.h"
 #include "pg_trace.h"
 
+/* maximum size for compression buffer of block image */
+#define PGLZ_MAX_BLCKSZ PGLZ_MAX_OUTPUT(BLCKSZ)
+
 /*
  * For each block reference registered with XLogRegisterBuffer, we fill in
  * a registered_buffer struct.
@@ -50,6 +54,8 @@ typedef struct
 	XLogRecData bkp_rdatas[2];	/* temporary rdatas used to hold references to
 								 * backup block data in XLogRecordAssemble() */
+	char		compressed_page[PGLZ_MAX_BLCKSZ];	/* buffer to hold the
+													 * compressed page */
 } registered_buffer;
 
 static registered_buffer *registered_buffers;
@@ -57,6 +63,9 @@ static int	max_registered_buffers; /* allocated size */
 static int	max_registered_block_id = 0;	/* highest block_id + 1
 											 * currently registered */
 
+/* Scratch buffer used to store the block image to be compressed */
+static char compression_scratch[PGLZ_MAX_BLCKSZ];
+
 /*
  * A chain of XLogRecDatas to hold the "main data" of a WAL record, registered
 * with XLogRegisterData(...).
@@ -97,6 +106,9 @@ static XLogRecData *XLogRecordAssemble(RmgrId rmid, uint8 info,
 				   XLogRecPtr RedoRecPtr, bool doPageWrites,
 				   XLogRecPtr *fpw_lsn);
 
+static bool XLogCompressBackupBlock(char *page, uint32 orig_len,
+						char *dest, uint16 *len);
+
 /*
  * Begin constructing a WAL record. This must be called before the
 * XLogRegister* functions and XLogInsert().
 */
@@ -150,6 +162,7 @@ XLogEnsureRecordSpace(int max_block_id, int ndatas)
 
 	if (nbuffers > max_registered_buffers)
 	{
+		int			i;
 		registered_buffers = (registered_buffer *)
 			repalloc(registered_buffers,
 					 sizeof(registered_buffer) * nbuffers);
@@ -529,6 +542,7 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 		if (needs_backup)
 		{
 			Page		page = regbuf->page;
+			bool		compression_done = false;
 
 			/*
 			 * The page needs to be backed up, so set up *bimg
@@ -563,29 +577,76 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 			/* Fill in the remaining fields in the XLogRecordBlockData struct */
 			bkpb.fork_flags |= BKPBLOCK_HAS_IMAGE;
 
-			total_len += BLCKSZ - bimg.hole_length;
-
 			/*
-			 * Construct XLogRecData entries for the page content.
+			 * Construct XLogRecData entries for the page content. If page
+			 * compression is enabled, instead of creating a new entry,
+			 * store the data in a dedicated buffer to prepare it for
+			 * compression. If the page has a hole, skip it, so that hole
+			 * removal acts as a first level of compression.
 			 */
-			rdt_datas_last->next = &regbuf->bkp_rdatas[0];
-			rdt_datas_last = rdt_datas_last->next;
-			if (bimg.hole_length == 0)
+			if (wal_compression)
 			{
-				rdt_datas_last->data = page;
-				rdt_datas_last->len = BLCKSZ;
+				int			page_len = BLCKSZ - bimg.hole_length;
+				uint16		compression_len;
+
+				/* shape block image for compression and skip hole if any */
+				if (bimg.hole_length == 0)
+					memcpy(compression_scratch, page, BLCKSZ);
+				else
+				{
+					/* Copy page content without hole */
+					memcpy(compression_scratch, page, bimg.hole_offset);
+					memcpy(compression_scratch + bimg.hole_offset,
+						   page + bimg.hole_offset + bimg.hole_length,
+						   BLCKSZ - (bimg.hole_offset + bimg.hole_length));
+				}
+
+				/* Perform compression of block */
+				if (XLogCompressBackupBlock(compression_scratch,
+											page_len,
+											regbuf->compressed_page,
+											&compression_len))
+				{
+					/* compression is done, add record */
+					compression_done = true;
+					bimg.compress_len = compression_len;
+
+					rdt_datas_last->next = &regbuf->bkp_rdatas[0];
+					rdt_datas_last = rdt_datas_last->next;
+					rdt_datas_last->data = regbuf->compressed_page;
+					rdt_datas_last->len = compression_len;
+					total_len += compression_len;
+				}
 			}
-			else
+
+			/*
+			 * If compression was not performed, store this block image
+			 * normally.
+			 */
+			if (!compression_done)
 			{
-				/* must skip the hole */
-				rdt_datas_last->data = page;
-				rdt_datas_last->len = bimg.hole_offset;
+				total_len += BLCKSZ - bimg.hole_length;
 
-				rdt_datas_last->next = &regbuf->bkp_rdatas[1];
+				rdt_datas_last->next = &regbuf->bkp_rdatas[0];
 				rdt_datas_last = rdt_datas_last->next;
+				if (bimg.hole_length == 0)
+				{
+					rdt_datas_last->data = page;
+					rdt_datas_last->len = BLCKSZ;
+				}
+				else
+				{
+					/* must skip the hole */
+					rdt_datas_last->data = page;
+					rdt_datas_last->len = bimg.hole_offset;
 
-				rdt_datas_last->data = page + (bimg.hole_offset + bimg.hole_length);
-				rdt_datas_last->len = BLCKSZ - (bimg.hole_offset + bimg.hole_length);
+					rdt_datas_last->next = &regbuf->bkp_rdatas[1];
+					rdt_datas_last = rdt_datas_last->next;
+
+					rdt_datas_last->data = page + (bimg.hole_offset + bimg.hole_length);
+					rdt_datas_last->len = BLCKSZ - (bimg.hole_offset + bimg.hole_length);
+				}
+				bimg.compress_len = 0;
 			}
 		}
 
@@ -681,6 +742,35 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 }
 
 /*
+ * Create a compressed version of a backup block. If successful, return
+ * true and set 'len' to its length. If the block cannot be compressed,
+ * or if compression fails, return false.
+ */
+static bool
+XLogCompressBackupBlock(char *page, uint32 orig_len, char *dest, uint16 *len)
+{
+	/* leave if data cannot be compressed */
+	if (pglz_compress(page, orig_len, (PGLZ_Header *) dest,
+					  PGLZ_strategy_default) != PGLZ_OK)
+		return false;
+
+	/*
+	 * We recheck the actual size even if pglz_compress() reports success,
+	 * because it might be satisfied with having saved as little as one byte
+	 * in the compressed data --- which could turn into a net loss once you
+	 * consider header and alignment padding. Worst case, the compressed
+	 * format might require three padding bytes (plus header, which is
+	 * included in VARSIZE(buf)), whereas the uncompressed format would take
+	 * only one header byte and no padding if the value is short enough. So
+	 * we insist on a savings of more than 2 bytes to ensure we have a gain.
+	 */
+	*len = VARSIZE((struct varlena *) dest);
+	if (*len >= orig_len - 2)
+		return false;
+	return true;
+}
+
+/*
  * Determine whether the buffer referenced has to be backed up.
 *
 * Since we don't yet have the insert lock, fullPageWrites and forcePageWrites
@@ -875,6 +965,7 @@ InitXLogInsert(void)
 
 	if (registered_buffers == NULL)
 	{
+		int			i;
 		registered_buffers = (registered_buffer *)
 			MemoryContextAllocZero(xloginsert_cxt,
 				  sizeof(registered_buffer) * (XLR_NORMAL_MAX_BLOCK_ID + 1));
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 67d6223..462266a 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -20,6 +20,7 @@
 #include "access/xlog_internal.h"
 #include "access/xlogreader.h"
 #include "catalog/pg_control.h"
+#include "utils/pg_lzcompress.h"
 
 static bool allocate_recordbuf(XLogReaderState *state, uint32 reclength);
 
@@ -1034,7 +1035,13 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 			{
 				COPY_HEADER_FIELD(&blk->hole_offset, sizeof(uint16));
 				COPY_HEADER_FIELD(&blk->hole_length, sizeof(uint16));
-				datatotal += BLCKSZ - blk->hole_length;
+				COPY_HEADER_FIELD(&blk->compress_len, sizeof(uint16));
+
+				/* adapt depending on presence of compressed image */
+				if (blk->compress_len != 0)
+					datatotal += blk->compress_len;
+				else
+					datatotal += BLCKSZ - blk->hole_length;
 			}
 			if (!(fork_flags & BKPBLOCK_SAME_REL))
 			{
@@ -1089,7 +1096,12 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
 		if (blk->has_image)
 		{
 			blk->bkp_image = ptr;
-			ptr += BLCKSZ - blk->hole_length;
+
+			/* adapt depending on presence of compressed image */
+			if (blk->compress_len != 0)
+				ptr += blk->compress_len;
+			else
+				ptr += BLCKSZ - blk->hole_length;
 		}
 		if (blk->has_data)
 		{
@@ -1195,6 +1207,8 @@ bool
 RestoreBlockImage(XLogReaderState *record, uint8 block_id, char *page)
 {
 	DecodedBkpBlock *bkpb;
+	char	   *uncompressed_page = NULL;
+	char	   *block_image;
 
 	if (!record->blocks[block_id].in_use)
 		return false;
@@ -1202,20 +1216,35 @@ RestoreBlockImage(XLogReaderState *record, uint8 block_id, char *page)
 		return false;
 
 	bkpb = &record->blocks[block_id];
+	block_image = bkpb->bkp_image;
+
+	/* decompress block if needed before processing */
+	if (bkpb->compress_len != 0)
+	{
+		PGLZ_Header *header = (PGLZ_Header *) block_image;
+
+		uncompressed_page = (char *)
+			palloc(PGLZ_RAW_SIZE(header));
+		/* XXX: should check for status code here */
+		pglz_decompress(header, uncompressed_page);
+		block_image = uncompressed_page;
+	}
+
 	/* generate page, taking into account hole if necessary */
 	if (bkpb->hole_length == 0)
 	{
-		memcpy(page, bkpb->bkp_image, BLCKSZ);
+		memcpy(page, block_image, BLCKSZ);
 	}
 	else
 	{
-		memcpy(page, bkpb->bkp_image, bkpb->hole_offset);
+		memcpy(page, block_image, bkpb->hole_offset);
 		/* must zero-fill the hole */
 		MemSet(page + bkpb->hole_offset, 0, bkpb->hole_length);
 		memcpy(page + (bkpb->hole_offset + bkpb->hole_length),
-			   bkpb->bkp_image + bkpb->hole_offset,
+			   block_image + bkpb->hole_offset,
 			   BLCKSZ - (bkpb->hole_offset + bkpb->hole_length));
 	}
 
+	if (uncompressed_page)
+		pfree(uncompressed_page);
+
 	return true;
 }
 
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index b1bff7f..beb1bc2 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -927,6 +927,15 @@ static struct config_bool ConfigureNamesBool[] =
 		false,
 		NULL, NULL, NULL
 	},
+	{
+		{"wal_compression", PGC_USERSET, WAL_SETTINGS,
+			gettext_noop("Compresses full-page writes written in WAL files."),
+			NULL
+		},
+		&wal_compression,
+		false,
+		NULL, NULL, NULL
+	},
 	{
 		{"log_checkpoints", PGC_SIGHUP, LOGGING_WHAT,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index b053659..3e928f8 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -191,6 +191,7 @@
 #wal_buffers = -1			# min 32kB, -1 sets based on shared_buffers
 					# (change requires restart)
 #wal_writer_delay = 200ms		# 1-10000 milliseconds
+#wal_compression = off			# enable compression of full-page writes
 
 #commit_delay = 0			# range 0-100000, in microseconds
 #commit_siblings = 5			# range 1-1000
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index a838bb5..c15f5f4 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -294,6 +294,8 @@ main(int argc, char *argv[])
 		   wal_level_str(ControlFile.wal_level));
 	printf(_("Current wal_log_hints setting:        %s\n"),
 		   ControlFile.wal_log_hints ? _("on") : _("off"));
+	printf(_("Current wal_compression setting:      %s\n"),
+		   ControlFile.wal_compression ? _("on") : _("off"));
_("on") : _("off")); printf(_("Current max_connections setting: %d\n"), ControlFile.MaxConnections); printf(_("Current max_worker_processes setting: %d\n"), diff --git a/src/bin/pg_resetxlog/pg_resetxlog.c b/src/bin/pg_resetxlog/pg_resetxlog.c index f42d515..f4abe3c 100644 --- a/src/bin/pg_resetxlog/pg_resetxlog.c +++ b/src/bin/pg_resetxlog/pg_resetxlog.c @@ -579,6 +579,7 @@ GuessControlValues(void) ControlFile.wal_level = WAL_LEVEL_MINIMAL; ControlFile.wal_log_hints = false; + ControlFile.wal_compression = false; ControlFile.track_commit_timestamp = false; ControlFile.MaxConnections = 100; ControlFile.max_worker_processes = 8; @@ -795,6 +796,7 @@ RewriteControlFile(void) */ ControlFile.wal_level = WAL_LEVEL_MINIMAL; ControlFile.wal_log_hints = false; + ControlFile.wal_compression = false; ControlFile.track_commit_timestamp = false; ControlFile.MaxConnections = 100; ControlFile.max_worker_processes = 8; diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h index d06fbc0..6bdfa4a 100644 --- a/src/include/access/xlog.h +++ b/src/include/access/xlog.h @@ -98,6 +98,7 @@ extern char *XLogArchiveCommand; extern bool EnableHotStandby; extern bool fullPageWrites; extern bool wal_log_hints; +extern bool wal_compression; extern bool log_checkpoints; /* WAL levels */ diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h index 825cf54..fd058ad 100644 --- a/src/include/access/xlog_internal.h +++ b/src/include/access/xlog_internal.h @@ -186,6 +186,7 @@ typedef struct xl_parameter_change int max_locks_per_xact; int wal_level; bool wal_log_hints; + bool wal_compression; bool track_commit_timestamp; } xl_parameter_change; diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h index eb6cc89..3db312d 100644 --- a/src/include/access/xlogreader.h +++ b/src/include/access/xlogreader.h @@ -55,6 +55,7 @@ typedef struct char *bkp_image; uint16 hole_offset; uint16 hole_length; + uint16 compress_len; /* Buffer holding the rmgr-specific data associated with this block */ bool has_data; @@ -191,6 +192,9 @@ extern bool DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, ((decoder)->blocks[block_id].in_use) #define XLogRecHasBlockImage(decoder, block_id) \ ((decoder)->blocks[block_id].has_image) +#define XLogRecHasCompressedBlockImage(decoder, block_id) \ + (XLogRecHasBlockImage(decoder, block_id) && \ + (decoder)->blocks[block_id].compress_len != 0) extern bool RestoreBlockImage(XLogReaderState *recoder, uint8 block_id, char *dst); extern char *XLogRecGetBlockData(XLogReaderState *record, uint8 block_id, Size *len); diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h index 11ddfac..cb58422 100644 --- a/src/include/access/xlogrecord.h +++ b/src/include/access/xlogrecord.h @@ -103,11 +103,16 @@ typedef struct XLogRecordBlockHeader * such a "hole" from the stored data (and it's not counted in the * XLOG record's CRC, either). Hence, the amount of block data actually * present is BLCKSZ - hole_length bytes. + * + * compress_len indicates the length of this block when compressed. A length + * of 0 means that this block is not compressed. If the block image has a hole + * the block image is compressed without the hole. 
 */
 typedef struct XLogRecordBlockImageHeader
 {
 	uint16		hole_offset;	/* number of bytes before "hole" */
 	uint16		hole_length;	/* number of bytes in "hole" */
+	uint16		compress_len;	/* size of compressed block */
 } XLogRecordBlockImageHeader;
 
 #define SizeOfXLogRecordBlockImageHeader sizeof(XLogRecordBlockImageHeader)
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index 6e9cac9..296e5b0 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -175,6 +175,7 @@ typedef struct ControlFileData
 	 */
 	int			wal_level;
 	bool		wal_log_hints;
+	bool		wal_compression;
 	int			MaxConnections;
 	int			max_worker_processes;
 	int			max_prepared_xacts;
-- 
2.2.0
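As the parameter is PGC_USERSET, both code paths are easy to exercise from a
single psql session. A minimal sketch, using a hypothetical table t1 (any
table large enough to span a few hundred blocks will do):

	=# CREATE TABLE t1 AS SELECT generate_series(1, 100000) AS a;
	=# SET wal_compression = on;
	=# CHECKPOINT;
	=# UPDATE t1 SET a = a + 1;	-- first touch of each page after the
					-- checkpoint emits a compressed FPW
	=# SET wal_compression = off;
	=# CHECKPOINT;
	=# UPDATE t1 SET a = a + 1;	-- same workload, uncompressed FPWs

With the pg_xlogdump change above, the per-block output now prints the
compressed length, which makes it easy to verify which block images were
actually compressed.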
results.sql
Description: Binary data
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers