Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation

Heikki Linnakangas Wed, 29 Jan 2014 02:13:22 -0800

On 01/28/2014 07:01 PM, Heikki Linnakangas wrote:

On 01/27/2014 07:03 PM, Amit Kapila wrote:

I have tried to improve algorithm in another way so that we can get
benefit of same chunks during find match (something similar to lz).
The main change is to consider chunks at fixed boundary (4 byte)
and after finding match, try to find if there is a longer match than
current chunk. While finding longer match, it still takes care that
next bigger match should be at chunk boundary. I am not
completely sure about the chunk boundary may be 8 or 16 can give
better results.


Since you're only putting a value in the history every 4 bytes, you
wouldn't need to calculate the hash in a rolling fashion. You could just
take next four bytes, calculate hash, put it in history table. Then next
four bytes, calculate hash, and so on. Might save some cycles when
building the history table...

On a closer look, you're putting a chunk in the history table only everyfour bytes, but you're *also* checking the history table for a matchonly every four bytes. That completely destroys the shift-resistence ofthe algorithm. For example, if the new tuple is an exact copy of the oldtuple, except for one additional byte in the beginning, the algorithmwould fail to recognize that. It would be good to add a test case likethat in the test suite.

You can skip bytes when building the history table, or when findingmatches, but not both. Or you could skip N bytes, and then check formatches for the next four bytes, then skip again and so forth, as longas you always check four consecutive bytes (because the hash function iscalculated from four bytes).

I couldn't resist the challenge, and started hacking this. First, someobservations from your latest patch (pgrb_delta_encoding_v5.patch):

1. There are a lot of comments and code that refers to "chunks", whichseem obsolete. For example, ck_size field in PGRB_HistEntry is alwaysset to a constant, 4, except maybe at the very end of the historystring. The algorithm has nothing to do with Rabin-Karp anymore.

2. The 'hindex' field in PGRB_HistEntry is unused. Also, ck_start_pos isredundant with the index of the entry in the array, because the array isfilled in order. That only leaves us just the 'next' field, and that canbe represented as a int16 rather than a pointer. So, we really only needa simple int16 array as the history entries array.

3. You're not gaining much by calculating the hash in a rolling fashion.A single rolling step requires two multiplications and two sums, plusshifting the variables around. Calculating the hash from scratchrequires three multiplications and three sums.

4. Given that we're not doing the Rabin-Karp variable-length chunkthing, we could use a cheaper hash function to begin with. Like, the oneused in pglz. The multiply-by-prime method probably gives fewercollisions than pglz's shift/xor method, but I don't think that mattersas much as computation speed. No-one has mentioned or measured theeffect of collisions in this thread; that either means that it's anon-issue or that no-one's just realized how big a problem it is yet.I'm guessing that it's not a problem, and if it is, it's mitigated byonly trying to find matches every N bytes; collisions would make findingmatches slower, and that's exactly what skipping helps with.

After addressing the above, we're pretty much back to PGLZ approach. Ikept the change to only find matches every four bytes, that does makesome difference. And I like having this new encoding code in a separatefile, not mingled with pglz stuff, it's sufficiently different thatthat's better. I haven't done all much testing with this, so take itwith a grain of salt.

I don't know if this is better or worse than the other patches that havebeen floated around, but I though I might as well share it..


- Heikki

diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml
index e0b8a4e..c4ac2bd 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -1014,6 +1014,22 @@ CREATE [ [ GLOBAL | LOCAL ] { TEMPORARY | TEMP } | UNLOGGED ] TABLE [ IF NOT EXI
     </listitem>
    </varlistentry>
 
+   <varlistentry>
+    <term><literal>wal_compress_update</> (<type>boolean</>)</term>
+    <listitem>
+     <para>
+      Enables or disables the WAL tuple compression for <command>UPDATE</>
+      on this table.  Default value of this option is false to maintain
+      backward compatability for the command. If true, all the update
+      operations on this table which will place the new tuple on same page
+      as it's original tuple will compress the WAL for new tuple and
+      subsequently reduce the WAL volume.  It is recommended to enable
+      this option for tables where <command>UPDATE</> changes less than
+      50 percent of tuple data.
+     </para>
+     </listitem>
+    </varlistentry>
+
    </variablelist>
 
   </refsect2>
diff --git a/src/backend/access/common/heaptuple.c b/src/backend/access/common/heaptuple.c
index 3d8d2eb..27c1479 100644
--- a/src/backend/access/common/heaptuple.c
+++ b/src/backend/access/common/heaptuple.c
@@ -60,6 +60,7 @@
 #include "access/sysattr.h"
 #include "access/tuptoaster.h"
 #include "executor/tuptable.h"
+#include "utils/pg_rbcompress.h"
 
 
 /* Does att's datatype allow packing into the 1-byte-header varlena format? */
@@ -617,6 +618,44 @@ heap_copytuple_with_tuple(HeapTuple src, HeapTuple dest)
 	memcpy((char *) dest->t_data, (char *) src->t_data, src->t_len);
 }
 
+/* ----------------
+ * heap_delta_encode
+ *
+ *		Calculate the delta between two tuples and generate
+ *  encoded wal tuple (EWT), using pgrb. The result is stored
+ *  in *encdata.
+ * ----------------
+ */
+bool
+heap_delta_encode(TupleDesc tupleDesc, HeapTuple oldtup, HeapTuple newtup,
+				  char *encdata, uint32 *enclen)
+{
+	return pgrb_delta_encode(
+		(char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+		newtup->t_len - offsetof(HeapTupleHeaderData, t_bits),
+		(char *) oldtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+		oldtup->t_len - offsetof(HeapTupleHeaderData, t_bits),
+		encdata, enclen, NULL
+		);
+}
+
+/* ----------------
+ * heap_delta_decode
+ *
+ *		Decode a tuple using delta-encoded WAL tuple and old tuple version.
+ * ----------------
+ */
+void
+heap_delta_decode(char *encdata, uint32 enclen, HeapTuple oldtup, HeapTuple newtup)
+{
+	pgrb_delta_decode(encdata, enclen,
+			 (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+			 MaxHeapTupleSize - offsetof(HeapTupleHeaderData, t_bits),
+			 &newtup->t_len,
+			 (char *) oldtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+			 oldtup->t_len - offsetof(HeapTupleHeaderData, t_bits));
+}
+
 /*
  * heap_form_tuple
  *		construct a tuple from the given values[] and isnull[] arrays,
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index fa08c45..2123a61 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -85,6 +85,14 @@ static relopt_bool boolRelOpts[] =
 		},
 		false
 	},
+	{
+		{
+			"wal_compress_update",
+			"Compress the wal tuple for update operation on this relation",
+			RELOPT_KIND_HEAP
+		},
+		true
+	},
 	/* list terminator */
 	{{NULL}}
 };
@@ -1175,7 +1183,9 @@ default_reloptions(Datum reloptions, bool validate, relopt_kind kind)
 		{"check_option", RELOPT_TYPE_STRING,
 		offsetof(StdRdOptions, check_option_offset)},
 		{"user_catalog_table", RELOPT_TYPE_BOOL,
-		 offsetof(StdRdOptions, user_catalog_table)}
+		 offsetof(StdRdOptions, user_catalog_table)},
+		{"wal_compress_update", RELOPT_TYPE_BOOL,
+		 offsetof(StdRdOptions, wal_compress_update)}
 	};
 
 	options = parseRelOptions(reloptions, validate, kind, &numoptions);
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index a771ccb..2724188 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -70,6 +70,7 @@
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
 #include "utils/tqual.h"
+#include "utils/pg_rbcompress.h"
 
 
 /* GUC variable */
@@ -6597,6 +6598,12 @@ log_heap_update(Relation reln, Buffer oldbuf,
 	XLogRecPtr	recptr;
 	XLogRecData rdata[7];
 	Page		page = BufferGetPage(newbuf);
+	char	   *newtupdata;
+	int			newtuplen;
+	bool		compressed = false;
+
+	/* Structure which holds EWT */
+	char		buf[MaxHeapTupleSize];
 	bool		need_tuple_data = RelationIsLogicallyLogged(reln);
 
 	/* Caller should not call me on a non-WAL-logged relation */
@@ -6607,6 +6614,37 @@ log_heap_update(Relation reln, Buffer oldbuf,
 	else
 		info = XLOG_HEAP_UPDATE;
 
+	newtupdata = ((char *) newtup->t_data) + offsetof(HeapTupleHeaderData, t_bits);
+	newtuplen = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+
+	/*
+	 * EWT can be generated for all new tuple versions created by Update
+	 * operation. Currently we do it when both the old and new tuple versions
+	 * are on same page, because during recovery if the page containing old
+	 * tuple is corrupt, it should not cascade that corruption to other pages.
+	 * Under the general assumption that for long runs most updates tend to
+	 * create new tuple version on same page, there should not be significant
+	 * impact on WAL reduction or performance.
+	 *
+	 * We should not generate EWT when we need to backup the whole block in
+	 * WAL as in that case there is no saving by reduced WAL size.
+	 */
+
+	if (RelationIsEnabledForWalCompression(reln) &&
+		(oldbuf == newbuf) &&
+		!XLogCheckBufferNeedsBackup(newbuf))
+	{
+		uint32		enclen;
+
+		/* Delta-encode the new tuple using the old tuple */
+		if (heap_delta_encode(reln->rd_att, oldtup, newtup, buf, &enclen))
+		{
+			compressed = true;
+			newtupdata = buf;
+			newtuplen = enclen;
+		}
+	}
+
 	xlrec.target.node = reln->rd_node;
 	xlrec.target.tid = oldtup->t_self;
 	xlrec.old_xmax = HeapTupleHeaderGetRawXmax(oldtup->t_data);
@@ -6619,6 +6657,8 @@ log_heap_update(Relation reln, Buffer oldbuf,
 	xlrec.newtid = newtup->t_self;
 	if (new_all_visible_cleared)
 		xlrec.flags |= XLOG_HEAP_NEW_ALL_VISIBLE_CLEARED;
+	if (compressed)
+		xlrec.flags |= XLOG_HEAP_DELTA_ENCODED;
 
 	rdata[0].data = (char *) &xlrec;
 	rdata[0].len = SizeOfHeapUpdate;
@@ -6634,7 +6674,7 @@ log_heap_update(Relation reln, Buffer oldbuf,
 	xlhdr.header.t_infomask2 = newtup->t_data->t_infomask2;
 	xlhdr.header.t_infomask = newtup->t_data->t_infomask;
 	xlhdr.header.t_hoff = newtup->t_data->t_hoff;
-	xlhdr.t_len = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+	xlhdr.t_len = newtuplen;
 
 	/*
 	 * As with insert records, we need not store the rdata[2] segment
@@ -6647,10 +6687,13 @@ log_heap_update(Relation reln, Buffer oldbuf,
 	rdata[2].buffer_std = true;
 	rdata[2].next = &(rdata[3]);
 
-	/* PG73FORMAT: write bitmap [+ padding] [+ oid] + data */
-	rdata[3].data = (char *) newtup->t_data
-		+ offsetof(HeapTupleHeaderData, t_bits);
-	rdata[3].len = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+	/*
+	 * PG73FORMAT: write bitmap [+ padding] [+ oid] + data OR
+	 * PG94FORMAT [If encoded]: Control byte + history reference (2 - 3)bytes
+	 *							+ literal byte + ...
+	 */
+	rdata[3].data = newtupdata;
+	rdata[3].len = newtuplen;
 	rdata[3].buffer = need_tuple_data ? InvalidBuffer : newbuf;
 	rdata[3].buffer_std = true;
 	rdata[3].next = NULL;
@@ -7739,7 +7782,10 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
 	Page		page;
 	OffsetNumber offnum;
 	ItemId		lp = NULL;
+	HeapTupleData newtup;
+	HeapTupleData oldtup;
 	HeapTupleHeader htup;
+	HeapTupleHeader oldtupdata = NULL;
 	struct
 	{
 		HeapTupleHeaderData hdr;
@@ -7814,7 +7860,7 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
 	if (PageGetMaxOffsetNumber(page) < offnum || !ItemIdIsNormal(lp))
 		elog(PANIC, "heap_update_redo: invalid lp");
 
-	htup = (HeapTupleHeader) PageGetItem(page, lp);
+	oldtupdata = htup = (HeapTupleHeader) PageGetItem(page, lp);
 
 	htup->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
 	htup->t_infomask2 &= ~HEAP_KEYS_UPDATED;
@@ -7923,10 +7969,31 @@ newsame:;
 	Assert(newlen <= MaxHeapTupleSize);
 	htup = &tbuf.hdr;
 	MemSet((char *) htup, 0, sizeof(HeapTupleHeaderData));
-	/* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
-	memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
-		   (char *) xlrec + hsize,
-		   newlen);
+
+	/*
+	 * If the record is EWT then decode it.
+	 */
+	if (xlrec->flags & XLOG_HEAP_DELTA_ENCODED)
+	{
+		/*
+		 * PG94FORMAT: Control byte + history reference (2 - 3)bytes
+		 * + literal byte + ...
+		 */
+		oldtup.t_data = oldtupdata;
+		oldtup.t_len = ItemIdGetLength(lp);
+		newtup.t_data = htup;
+
+		heap_delta_decode((char *) xlrec + hsize, newlen, &oldtup, &newtup);
+		newlen = newtup.t_len;
+	}
+	else
+	{
+		/* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
+		memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
+			   (char *) xlrec + hsize,
+			   newlen);
+	}
+
 	newlen += offsetof(HeapTupleHeaderData, t_bits);
 	htup->t_infomask2 = xlhdr.header.t_infomask2;
 	htup->t_infomask = xlhdr.header.t_infomask;
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index b333d82..92c4f00 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2327,6 +2327,28 @@ XLogRecPtrToBytePos(XLogRecPtr ptr)
 }
 
 /*
+ * Determine whether the buffer referenced has to be backed up. Since we don't
+ * yet have the insert lock, fullPageWrites and forcePageWrites could change
+ * later, but will not cause any problem because this function is used only to
+ * identify whether EWT is required for update.
+ */
+bool
+XLogCheckBufferNeedsBackup(Buffer buffer)
+{
+	bool		doPageWrites;
+	Page		page;
+
+	page = BufferGetPage(buffer);
+
+	doPageWrites = XLogCtl->Insert.fullPageWrites || XLogCtl->Insert.forcePageWrites;
+
+	if (doPageWrites && PageGetLSN(page) <= RedoRecPtr)
+		return true;			/* buffer requires backup */
+
+	return false;				/* buffer does not need to be backed up */
+}
+
+/*
  * Determine whether the buffer referenced by an XLogRecData item has to
  * be backed up, and if so fill a BkpBlock struct for it.  In any case
  * save the buffer's LSN at *lsn.
diff --git a/src/backend/utils/adt/Makefile b/src/backend/utils/adt/Makefile
index 1ae9fa0..04dc17c 100644
--- a/src/backend/utils/adt/Makefile
+++ b/src/backend/utils/adt/Makefile
@@ -26,7 +26,7 @@ OBJS = acl.o arrayfuncs.o array_selfuncs.o array_typanalyze.o \
 	rowtypes.o regexp.o regproc.o ruleutils.o selfuncs.o \
 	tid.o timestamp.o varbit.o varchar.o varlena.o version.o xid.o \
 	network.o mac.o inet_cidr_ntop.o inet_net_pton.o \
-	ri_triggers.o pg_lzcompress.o pg_locale.o formatting.o \
+	ri_triggers.o pg_lzcompress.o pg_rbcompress.o pg_locale.o formatting.o \
 	ascii.o quote.o pgstatfuncs.o encode.o dbsize.o genfile.o trigfuncs.o \
 	tsginidx.o tsgistidx.o tsquery.o tsquery_cleanup.o tsquery_gist.o \
 	tsquery_op.o tsquery_rewrite.o tsquery_util.o tsrank.o \
diff --git a/src/backend/utils/adt/pg_rbcompress.c b/src/backend/utils/adt/pg_rbcompress.c
new file mode 100644
index 0000000..e1e9df6
--- /dev/null
+++ b/src/backend/utils/adt/pg_rbcompress.c
@@ -0,0 +1,522 @@
+/* ----------
+ * pg_rbcompress.c -
+ *
+ *		This is a delta encoding scheme. For each "old" input offset, it
+ *		calculates a hash of the next four bytes, and stores the offset in
+ *		lookup table indexed by the hash. Then, the new input is scanned,
+ *		looking for matches in the old input, using the lookup table.
+ *		Output format is almost identical to the LZ compression format used
+ *		in pg_lzcompress.c
+ *
+ * Copyright (c) 1999-2014, PostgreSQL Global Development Group
+ *
+ * src/backend/utils/adt/pg_rbcompress.c
+ * ----------
+ */
+#include "postgres.h"
+
+#include <limits.h>
+
+#include "utils/pg_rbcompress.h"
+
+/* ----------
+ * Local definitions
+ * ----------
+ */
+#define PGRB_MAX_HISTORY_LISTS	8192	/* must be power of 2 */
+#define PGRB_HISTORY_SIZE		4096
+#define PGRB_MAX_MATCH			273
+
+#define PGRB_MIN_CHUNK_SIZE		4
+
+
+
+/* ----------
+ * The provided standard strategies
+ * ----------
+ */
+static const PGRB_Strategy strategy_default_data = {
+	32,							/* Data chunks less than 32 bytes are not
+								 * compressed */
+	INT_MAX,					/* No upper limit on what we'll try to
+								 * compress */
+	25,							/* Require 25% compression rate, or not worth
+								 * it */
+	1024,						/* Give up if no compression in the first 1KB */
+	128,						/* Stop history lookup if a match of 128 bytes
+								 * is found */
+	10							/* Lower good match size by 10% at every loop
+								 * iteration */
+};
+const PGRB_Strategy *const PGRB_strategy_default = &strategy_default_data;
+
+/* ----------
+ * Statically allocated work arrays for history
+ * ----------
+ */
+static int16 hist_start[PGRB_MAX_HISTORY_LISTS];
+static uint16 rb_hist_entries[PGRB_HISTORY_SIZE + 1];
+
+/*
+ * Element 0 in hist_entries is unused, and means 'invalid'.
+ */
+#define INVALID_ENTRY			0
+
+
+/*
+ * Hash function, same as that used in pg_lzcompress.c.
+ */
+#define pgrb_hash_calc(_p,hindex)								\
+	hindex = ((_p)[0] << 6) ^ ((_p)[1] << 4) ^ ((_p)[2] << 2) ^ ((_p)[3])
+
+/* The same, calculated in a rolling fashion */
+#define pgrb_hash_init(_p,hindex)								\
+	hindex = ((_p)[0] << 4) ^ ((_p)[1] << 2) ^ ((_p)[2])
+
+#define pgrb_hash_roll(_p, hindex)								\
+	hindex = (hindex << 2) ^ (_p)[3]
+
+#define pgrb_hash_unroll(_p, hindex)								\
+	hindex = hindex ^ ((_p)[0] << 8)
+
+
+/* ----------
+ * pgrb_out_ctrl -
+ *
+ *		Outputs the last and allocates a new control byte if needed.
+ * ----------
+ */
+#define pgrb_out_ctrl(__ctrlp,__ctrlb,__ctrl,__buf) \
+do { \
+	if ((__ctrl & 0xff) == 0)												\
+	{																		\
+		*(__ctrlp) = __ctrlb;												\
+		__ctrlp = (__buf)++;												\
+		__ctrlb = 0;														\
+		__ctrl = 1;															\
+	}																		\
+} while (0)
+
+
+/* ----------
+ * pgrb_out_literal -
+ *
+ *		Outputs a literal byte to the destination buffer including the
+ *		appropriate control bit.
+ * ----------
+ */
+#define pgrb_out_literal(_ctrlp,_ctrlb,_ctrl,_buf,_byte) \
+do { \
+	pgrb_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf);								\
+	*(_buf)++ = (unsigned char)(_byte);										\
+	_ctrl <<= 1;															\
+} while (0)
+
+
+/* ----------
+ * pgrb_out_tag -
+ *
+ *		Outputs a backward reference tag of 2-4 bytes (depending on
+ *		offset and length) to the destination buffer including the
+ *		appropriate control bit.
+ * ----------
+ */
+#define pgrb_out_tag(_ctrlp,_ctrlb,_ctrl,_buf,_len,_off) \
+do { \
+	pgrb_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf);								\
+	_ctrlb |= _ctrl;														\
+	_ctrl <<= 1;															\
+	if (_len > 17)															\
+	{																		\
+		(_buf)[0] = (unsigned char)((((_off) & 0xf00) >> 4) | 0x0f);		\
+		(_buf)[1] = (unsigned char)(((_off) & 0xff));						\
+		(_buf)[2] = (unsigned char)((_len) - 18);							\
+		(_buf) += 3;														\
+	} else {																\
+		(_buf)[0] = (unsigned char)((((_off) & 0xf00) >> 4) | ((_len) - 3)); \
+		(_buf)[1] = (unsigned char)((_off) & 0xff);							\
+		(_buf) += 2;														\
+	}																		\
+} while (0)
+
+
+/* ----------
+ * pgrb_find_match -
+ *
+ * ----------
+ */
+static inline int
+pgrb_find_match(int16 *hstart, const char *hbegin,
+				const char* input_chunk_start,
+				int16 *lenp, int16 *offp,
+				const char *hend, int hindex)
+{
+	int16		hoff;
+	const char  *hp;
+	const char  *ip;
+	int32		len = 0;
+	int32		off = 0;
+
+	hoff = hstart[hindex];
+	while (hoff != INVALID_ENTRY)
+	{
+		int32		thisoff;
+		int32		thislen;
+		int32		maxlen;
+
+		/* We use offset 0 to mean invalid entries, the actual offset is +1 */
+		hp = hbegin + hoff - 1;
+		ip = input_chunk_start;
+
+		maxlen = PGRB_MAX_MATCH;
+		if (hend - hp < maxlen)
+			maxlen = hend - hp;
+
+		thisoff = hend - hp;
+		if (thisoff >= 0x0fff)
+			break;
+
+		/* How long is the match? */
+		thislen = 0;
+		while (*ip == *hp && thislen < maxlen)
+			thislen++;
+
+		/*
+		 * Remember this match as the best (if it is)
+		 */
+		if (thislen > len)
+		{
+			len = thislen;
+			off = thisoff;
+		}
+
+		/*
+		 * Advance to the next history entry
+		 */
+		hoff = rb_hist_entries[hoff];
+	}
+
+	/*
+	 * Return match information only if it results at least in one chunk
+	 * reduction.
+	 */
+	if (len >= PGRB_MIN_CHUNK_SIZE)
+	{
+		*lenp = len;
+		*offp = off;
+		return 1;
+	}
+
+	return 0;
+}
+
+static int
+choose_hash_size(int slen)
+{
+	int			hashsz;
+
+	/*
+	 * Experiments suggest that these hash sizes work pretty well. A large
+	 * hash table minimizes collision, but has a higher startup cost. For a
+	 * small input, the startup cost dominates. The table size must be a power
+	 * of two.
+	 */
+	if (slen < 128)
+		hashsz = 512;
+	else if (slen < 256)
+		hashsz = 1024;
+	else if (slen < 512)
+		hashsz = 2048;
+	else if (slen < 1024)
+		hashsz = 4096;
+	else
+		hashsz = 8192;
+	return hashsz;
+}
+
+/* ----------
+ * pgrb_delta_decode - Compresses source data by referring history.
+ *
+ *	source is the input data to be compressed
+ *	slen is the length of source data
+ *  history is the data which is used as reference for compression
+ *	hlen is the length of history data
+ *	The encoded result is written to dest, and its length is returned in
+ *	finallen.
+ *	The return value is TRUE if compression succeeded,
+ *	FALSE if not; in the latter case the contents of dest
+ *	are undefined.
+ *	----------
+ */
+bool
+pgrb_delta_encode(const char *source, int32 slen,
+				  const char *history, int32 hlen,
+				  char *dest, uint32 *finallen,
+				  const PGRB_Strategy *strategy)
+{
+	unsigned char *bp = ((unsigned char *) dest);
+	unsigned char *bstart = bp;
+	const char *dp = source;
+	const char *dend = source + slen;
+	const char *hp = history;
+	const char *hend = history + hlen;
+	unsigned char ctrl_dummy = 0;
+	unsigned char *ctrlp = &ctrl_dummy;
+	unsigned char ctrlb = 0;
+	unsigned char ctrl = 0;
+	bool		found_match = false;
+	int16		match_len = 0;
+	int16		match_off;
+	int32		result_size;
+	int32		result_max;
+	int32		need_rate;
+	int			hoff;
+	int			hashsz;
+	int			mask;
+	int32		hindex;
+
+	/*
+	 * Tuples of length greater than PGRB_HISTORY_SIZE are not allowed for
+	 * delta encode as this is the maximum size of history offset.
+	 */
+	if (hlen >= PGRB_HISTORY_SIZE || hlen < 4)
+		return false;
+
+	/*
+	 * Our fallback strategy is the default.
+	 */
+	if (strategy == NULL)
+		strategy = PGRB_strategy_default;
+
+	/*
+	 * If the strategy forbids compression (at all or if source chunk size out
+	 * of range), fail.
+	 */
+	if (strategy->match_size_good <= 0 ||
+		slen < strategy->min_input_size ||
+		slen > strategy->max_input_size)
+		return false;
+
+	need_rate = strategy->min_comp_rate;
+	if (need_rate < 0)
+		need_rate = 0;
+	else if (need_rate > 99)
+		need_rate = 99;
+
+	/*
+	 * Compute the maximum result size allowed by the strategy, namely the
+	 * input size minus the minimum wanted compression rate.  This had better
+	 * be <= slen, else we might overrun the provided output buffer.
+	 */
+	if (slen > (INT_MAX / 100))
+	{
+		/* Approximate to avoid overflow */
+		result_max = (slen / 100) * (100 - need_rate);
+	}
+	else
+		result_max = (slen * (100 - need_rate)) / 100;
+
+	hashsz = choose_hash_size(hlen);
+	mask = hashsz - 1;
+
+	/*
+	 * Initialize the history lists to empty.  We do not need to zero the
+	 * rb_hist_entries[] array; its entries are initialized as they are used.
+	 */
+	memset(hist_start, INVALID_ENTRY, hashsz * sizeof(int16));
+
+	/* Populate the history lists */
+	pgrb_hash_init(hp, hindex);
+	hoff = 0;
+	while (hp < hend - 4)
+	{
+		pgrb_hash_roll(hp, hindex);
+
+		/* add this offset to the history table */
+		hoff++;
+		rb_hist_entries[hoff] = hist_start[hindex & mask];
+		hist_start[hindex & mask] = hoff;
+
+		pgrb_hash_unroll(hp, hindex);
+		hp++;
+	}
+
+	/*
+	 * Loop through the input.
+	 */
+	while (dp < dend - 4)
+	{
+		/*
+		 * If we already exceeded the maximum result size, fail.
+		 *
+		 * We check once per loop; since the loop body could emit as many as 4
+		 * bytes (a control byte and 3-byte tag), PGLZ_MAX_OUTPUT() had better
+		 * allow 4 slop bytes.
+		 */
+		if (bp - bstart >= result_max)
+			return false;
+
+		pgrb_hash_calc(dp, hindex);
+
+		/*
+		 * Try to find a match in the history
+		 */
+		if (pgrb_find_match(hist_start, history, dp,
+							&match_len, &match_off,
+							hend, (hindex & mask)))
+		{
+			pgrb_out_tag(ctrlp, ctrlb, ctrl, bp, match_len, match_off);
+			dp += match_len;
+			found_match = true;
+		}
+		else
+		{
+			/*
+			 * No match found. To save effort, we only check for matches every
+			 * four bytes. This naturally reduces the compression rate
+			 * somewhat, but we prefer speed over compression rate.
+			 */
+			pgrb_out_literal(ctrlp, ctrlb, ctrl, bp, *dp);
+			dp++;
+			pgrb_out_literal(ctrlp, ctrlb, ctrl, bp, *dp);
+			dp++;
+			pgrb_out_literal(ctrlp, ctrlb, ctrl, bp, *dp);
+			dp++;
+			pgrb_out_literal(ctrlp, ctrlb, ctrl, bp, *dp);
+			dp++;
+		}
+	}
+
+	if (!found_match)
+		return false;
+
+	/* Handle the last few bytes as literals */
+	while (dp < dend)
+	{
+		pgrb_out_literal(ctrlp, ctrlb, ctrl, bp, *dp);
+		dp++;					/* Do not do this ++ in the line above! */
+	}
+
+	/*
+	 * Write out the last control byte and check that we haven't overrun the
+	 * output size allowed by the strategy.
+	 */
+	*ctrlp = ctrlb;
+	result_size = bp - bstart;
+
+#ifdef DELTA_DEBUG
+	elog(LOG, "old %d new %d compressed %d", hlen, slen, result_size);
+#endif
+
+	/*
+	 * Success - need only fill in the actual length of the compressed datum.
+	 */
+	*finallen = result_size;
+
+	return true;
+}
+
+/* ----------
+ * pgrb_delta_decode
+ *
+ *		Decompresses source into dest.
+ *		To decompress, it uses history if provided.
+ * ----------
+ */
+void
+pgrb_delta_decode(const char *source, uint32 srclen,
+				  char *dest, uint32 destlen, uint32 *finallen,
+				  const char *history, uint32 histlen)
+{
+	const unsigned char *sp;
+	const unsigned char *srcend;
+	unsigned char *dp;
+	unsigned char *destend;
+	const char *hend;
+
+	sp = ((const unsigned char *) source);
+	srcend = ((const unsigned char *) source) + srclen;
+	dp = (unsigned char *) dest;
+	destend = dp + destlen;
+	hend = history + histlen;
+
+	while (sp < srcend && dp < destend)
+	{
+		/*
+		 * Read one control byte and process the next 8 items (or as many as
+		 * remain in the compressed input).
+		 */
+		unsigned char ctrl = *sp++;
+		int			ctrlc;
+
+		for (ctrlc = 0; ctrlc < 8 && sp < srcend; ctrlc++)
+		{
+			if (ctrl & 1)
+			{
+				/*
+				 * Otherwise it contains the match length minus 3 and the
+				 * upper 4 bits of the offset. The next following byte
+				 * contains the lower 8 bits of the offset. If the length is
+				 * coded as 18, another extension tag byte tells how much
+				 * longer the match really was (0-255).
+				 */
+				int32		len;
+				int32		off;
+
+				len = (sp[0] & 0x0f) + 3;
+				off = ((sp[0] & 0xf0) << 4) | sp[1];
+				sp += 2;
+				if (len == 18)
+					len += *sp++;
+
+				/*
+				 * Check for output buffer overrun, to ensure we don't clobber
+				 * memory in case of corrupt input.  Note: we must advance dp
+				 * here to ensure the error is detected below the loop.  We
+				 * don't simply put the elog inside the loop since that will
+				 * probably interfere with optimization.
+				 */
+				if (dp + len > destend)
+				{
+					dp += len;
+					break;
+				}
+
+				/*
+				 * Now we copy the bytes specified by the tag from history to
+				 * OUTPUT. We can safely use memcpy here because source and
+				 * destination strings will not overlap as in case of LZ.
+				 */
+				memcpy(dp, hend - off, len);
+				dp += len;
+			}
+			else
+			{
+				/*
+				 * An unset control bit means LITERAL BYTE. So we just copy
+				 * one from INPUT to OUTPUT.
+				 */
+				if (dp >= destend)		/* check for buffer overrun */
+					break;		/* do not clobber memory */
+
+				*dp++ = *sp++;
+			}
+
+			/*
+			 * Advance the control bit
+			 */
+			ctrl >>= 1;
+		}
+	}
+
+	/*
+	 * Check we decompressed the right amount.
+	 */
+	if (sp != srcend)
+		elog(PANIC, "compressed data is corrupt");
+
+	/*
+	 * That's it.
+	 */
+	*finallen = ((char *) dp - dest);
+}
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index d4383ab..df64096 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -67,6 +67,7 @@
 #define XLOG_HEAP_CONTAINS_OLD_TUPLE		(1<<2)
 #define XLOG_HEAP_CONTAINS_OLD_KEY			(1<<3)
 #define XLOG_HEAP_CONTAINS_NEW_TUPLE		(1<<4)
+#define XLOG_HEAP_DELTA_ENCODED				(1<<5)
 
 /* convenience macro for checking whether any form of old tuple was logged */
 #define XLOG_HEAP_CONTAINS_OLD 						\
diff --git a/src/include/access/htup_details.h b/src/include/access/htup_details.h
index a3eba98..abb5620 100644
--- a/src/include/access/htup_details.h
+++ b/src/include/access/htup_details.h
@@ -740,6 +740,11 @@ extern HeapTuple heap_modify_tuple(HeapTuple tuple,
 extern void heap_deform_tuple(HeapTuple tuple, TupleDesc tupleDesc,
 				  Datum *values, bool *isnull);
 
+extern bool heap_delta_encode(TupleDesc tupleDesc, HeapTuple oldtup,
+				  HeapTuple newtup, char *encdata, uint32 *enclen);
+extern void heap_delta_decode (char *encdata, uint32 enclen, HeapTuple oldtup,
+				HeapTuple newtup);
+
 /* these three are deprecated versions of the three above: */
 extern HeapTuple heap_formtuple(TupleDesc tupleDescriptor,
 			   Datum *values, char *nulls);
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 47e3022..51d6925 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -279,6 +279,7 @@ typedef struct CheckpointStatsData
 extern CheckpointStatsData CheckpointStats;
 
 extern XLogRecPtr XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata);
+extern bool XLogCheckBufferNeedsBackup(Buffer buffer);
 extern void XLogFlush(XLogRecPtr RecPtr);
 extern bool XLogBackgroundFlush(void);
 extern bool XLogNeedsFlush(XLogRecPtr RecPtr);
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 9b8a4c9..717b90b 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -218,6 +218,7 @@ typedef struct StdRdOptions
 	bool		security_barrier;		/* for views */
 	int			check_option_offset;	/* for views */
 	bool		user_catalog_table;		/* use as an additional catalog relation */
+	bool		wal_compress_update;	/* compress wal tuple for update */
 } StdRdOptions;
 
 #define HEAP_MIN_FILLFACTOR			10
@@ -296,6 +297,15 @@ typedef struct StdRdOptions
 	 ((StdRdOptions *) (relation)->rd_options)->user_catalog_table : false)
 
 /*
+ * RelationIsEnabledForWalCompression
+ *		Returns whether the wal for update operation on relation can
+ *      be compressed.
+ */
+#define RelationIsEnabledForWalCompression(relation)	\
+	((relation)->rd_options ?				\
+	 ((StdRdOptions *) (relation)->rd_options)->wal_compress_update : true)
+
+/*
  * RelationIsValid
  *		True iff relation descriptor is valid.
  */
diff --git a/src/test/regress/expected/update.out b/src/test/regress/expected/update.out
index 71b856f..af46df2 100644
--- a/src/test/regress/expected/update.out
+++ b/src/test/regress/expected/update.out
@@ -97,3 +97,73 @@ SELECT a, b, char_length(c) FROM update_test;
 (2 rows)
 
 DROP TABLE update_test;
+--
+-- Test to update continuos and non continuos columns
+--
+DROP TABLE IF EXISTS update_test;
+NOTICE:  table "update_test" does not exist, skipping
+CREATE TABLE update_test (
+		bser bigserial,
+		bln boolean,
+		ename VARCHAR(25),
+		perf_f float(8),
+		grade CHAR,
+		dept CHAR(5) NOT NULL,
+		dob DATE,
+		idnum INT,
+		addr VARCHAR(30) NOT NULL,
+		destn CHAR(6),
+		Gend CHAR,
+		samba BIGINT,
+		hgt float,
+		ctime TIME
+);
+INSERT INTO update_test VALUES (
+		nextval('update_test_bser_seq'::regclass),
+		TRUE,
+		'Test',
+		7.169,
+		'B',
+		'CSD',
+		'2000-01-01',
+		520,
+		'road2,
+		streeeeet2,
+		city2',
+		'dcy2',
+		'M',
+		12000,
+		50.4,
+		'00:00:00.0'
+);
+SELECT * from update_test;
+ bser | bln | ename | perf_f | grade | dept  |    dob     | idnum |            addr             | destn  | gend | samba | hgt  |  ctime   
+------+-----+-------+--------+-------+-------+------------+-------+-----------------------------+--------+------+-------+------+----------
+    1 | t   | Test  |  7.169 | B     | CSD   | 01-01-2000 |   520 | road2,                     +| dcy2   | M    | 12000 | 50.4 | 00:00:00
+      |     |       |        |       |       |            |       |                 streeeeet2,+|        |      |       |      | 
+      |     |       |        |       |       |            |       |                 city2       |        |      |       |      | 
+(1 row)
+
+-- update first column
+UPDATE update_test SET bser = bser - 1 + 1;
+-- update middle column
+UPDATE update_test SET perf_f = 8.9;
+-- update last column
+UPDATE update_test SET ctime = '00:00:00.1';
+-- update 3 continuos columns
+UPDATE update_test SET destn = 'dcy2', samba = 0 WHERE Gend = 'M' and dept = 'CSD';
+-- update two non continuos columns
+UPDATE update_test SET destn = 'moved', samba = 0;
+UPDATE update_test SET bln = FALSE, hgt = 10.1;
+-- update causing some column alignment difference
+UPDATE update_test SET ename = 'Tes';
+UPDATE update_test SET dept = 'Test';
+SELECT * from update_test;
+ bser | bln | ename | perf_f | grade | dept  |    dob     | idnum |            addr             | destn  | gend | samba | hgt  |   ctime    
+------+-----+-------+--------+-------+-------+------------+-------+-----------------------------+--------+------+-------+------+------------
+    1 | f   | Tes   |    8.9 | B     | Test  | 01-01-2000 |   520 | road2,                     +| moved  | M    |     0 | 10.1 | 00:00:00.1
+      |     |       |        |       |       |            |       |                 streeeeet2,+|        |      |       |      | 
+      |     |       |        |       |       |            |       |                 city2       |        |      |       |      | 
+(1 row)
+
+DROP TABLE update_test;
diff --git a/src/test/regress/sql/update.sql b/src/test/regress/sql/update.sql
index a8a028f..1806992 100644
--- a/src/test/regress/sql/update.sql
+++ b/src/test/regress/sql/update.sql
@@ -59,3 +59,70 @@ UPDATE update_test SET c = repeat('x', 10000) WHERE c = 'car';
 SELECT a, b, char_length(c) FROM update_test;
 
 DROP TABLE update_test;
+
+
+--
+-- Test to update continuos and non continuos columns
+--
+
+DROP TABLE IF EXISTS update_test;
+CREATE TABLE update_test (
+		bser bigserial,
+		bln boolean,
+		ename VARCHAR(25),
+		perf_f float(8),
+		grade CHAR,
+		dept CHAR(5) NOT NULL,
+		dob DATE,
+		idnum INT,
+		addr VARCHAR(30) NOT NULL,
+		destn CHAR(6),
+		Gend CHAR,
+		samba BIGINT,
+		hgt float,
+		ctime TIME
+);
+
+INSERT INTO update_test VALUES (
+		nextval('update_test_bser_seq'::regclass),
+		TRUE,
+		'Test',
+		7.169,
+		'B',
+		'CSD',
+		'2000-01-01',
+		520,
+		'road2,
+		streeeeet2,
+		city2',
+		'dcy2',
+		'M',
+		12000,
+		50.4,
+		'00:00:00.0'
+);
+
+SELECT * from update_test;
+
+-- update first column
+UPDATE update_test SET bser = bser - 1 + 1;
+
+-- update middle column
+UPDATE update_test SET perf_f = 8.9;
+
+-- update last column
+UPDATE update_test SET ctime = '00:00:00.1';
+
+-- update 3 continuos columns
+UPDATE update_test SET destn = 'dcy2', samba = 0 WHERE Gend = 'M' and dept = 'CSD';
+
+-- update two non continuos columns
+UPDATE update_test SET destn = 'moved', samba = 0;
+UPDATE update_test SET bln = FALSE, hgt = 10.1;
+
+-- update causing some column alignment difference
+UPDATE update_test SET ename = 'Tes';
+UPDATE update_test SET dept = 'Test';
+
+SELECT * from update_test;
+DROP TABLE update_test;

-- 
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation

Reply via email to