On Thu, Nov 13, 2014 at 12:15 AM, Andres Freund <and...@2ndquadrant.com> wrote:
> On 2014-11-12 10:13:18 -0500, Robert Haas wrote:
> > On Tue, Nov 11, 2014 at 4:27 AM, Andres Freund <and...@2ndquadrant.com> 
> > wrote:
> > > The more important thing here is that I see little chance of this
> > > getting in before Heikki's larger rework of the wal format gets
> > > in. Since that'll change everything around anyway I'm unsure how much
> > > point there is to iterate till that's done. I know that sucks, but I
> > > don't see much of an alternative.
> >
> > Why not do this first?  Heikki's patch seems quite far from being
> > ready to commit at this point - it significantly increases WAL volume
> > and reduces performance.  Heikki may well be able to fix that, but I
> > don't know that it's a good idea to make everyone else wait while he
> > does.
> Because it imo builds the infrastructure to do the compression more
> sanely. I.e. provide proper space to store information about the
> compressedness of the blocks and such.

Now that the new WAL format has been committed, here are some comments
about this patch and what we can do. First, xlogrecord.h contains a
short description of what a record looks like. For a given block ID,
the block data portion looks like this:
1) the block image if BKPBLOCK_HAS_IMAGE, whose size is BLCKSZ minus
the hole length
2) data related to the block if BKPBLOCK_HAS_DATA, with a size
determined by what the caller inserts with XLogRegisterBufData for
that block.
The length of the data associated with a block cannot be determined
before XLogRegisterBufData is called. We could add a 3rd parameter to
XLogEnsureRecordSpace to allocate a buffer wide enough to hold all the
data of a record before compression (BLCKSZ * number of blocks + total
size of block data), but this seems really error-prone for new
features as well as existing ones. For those reasons I think it would
be wise not to include the block data in what is compressed.

This brings me to the second point: we would need to reorder the
entries in the record chain if we are going to compress all the blocks
of a record inside a single buffer. This has the following advantage:
- More compression, as shown by measurements on this thread
And the following disadvantages:
- Need to change the entries in the record chain once again for this
release, to something like the following for the block data (note that
the current record chain format is quite elegant btw):
compressed block images
block data of ID = M
block data of ID = N
- Slightly longer replay time, because we would need to loop twice
through the block data to fill in DecodedBkpBlock: once to decompress
all the blocks, and once for the data of each block. It is not much,
because not that many blocks are replayed per record, but still.

So, all those things taken together, plus a couple of hours hacking on
this code, make me think that it would be more elegant to do the
compression per block and not per group of blocks in a single record.
I actually found a couple of extra things:
- pg_lzcompress and pg_lzdecompress should be in src/port to make
pg_xlogdump work. Note that pg_lzdecompress has one call to elog,
hence it would be better to have it return a boolean status and let
the caller raise an error if decompression failed.
- In the previous patch versions, a WAL record was doing unnecessary
processing: first it built uncompressed block image entries, then
compressed them, and replaced the existing uncompressed entries in the
record chain with the compressed ones.
- CompressBackupBlocks enforced the compression buffer size to BLCKSZ,
which was incorrect for groups of blocks; it should have been BLCKSZ *
number of blocks.
- It looks better to add a simple uint16 in
XLogRecordBlockImageHeader to store the compressed length of a block,
with 0 meaning that the block is not compressed. This helps the new
decoder facility to track the length of data received. If a block has
a hole, it is compressed without it.

Now here are two patches:
- Move pg_lzcompress.c to src/port to make pg_xlogdump work with the
2nd patch. I imagine that this would be useful as well for client
utilities, similarly to what has been done for pg_crc some time ago.
- The patch itself doing the FPW compression. Note that it passes
regression tests, but at replay there is still one bug, triggered
roughly before numeric.sql when replaying changes on a standby. I am
still looking at it, but it prevents neither basic testing nor a
continuation of the discussion.
For now here are the patches either way, so feel free to comment.
From eda0730d991f8b4dfbacc4d7a953ec5bff8b2ffe Mon Sep 17 00:00:00 2001
From: Michael Paquier <mich...@otacoo.com>
Date: Fri, 21 Nov 2014 13:40:11 +0900
Subject: [PATCH 1/2] Fix flag marking GIN index as being built for new entries

This was somewhat missing in the current implementation, and led to
problems for code that needed special handling for fresh indexes being
built. Note that this does not impact current code, as no such
operations are done yet, but it may be a problem if a future bug fix
needs to make this distinction.
 src/backend/access/gin/gininsert.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index c1ad0fd..c6d8b40 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -191,6 +191,7 @@ ginEntryInsert(GinState *ginstate,
 	ginPrepareEntryScan(&btree, attnum, key, category, ginstate);
+	btree.isBuild = (buildStats != NULL);
 	stack = ginFindLeafPage(&btree, false);
 	page = BufferGetPage(stack->buffer);

From eb48305ac0295fa4a46ffec5f8db447cd4c5f6b2 Mon Sep 17 00:00:00 2001
From: Michael Paquier <mich...@otacoo.com>
Date: Fri, 21 Nov 2014 14:08:54 +0900
Subject: [PATCH 2/2] Support fillfactor for GIN indexes

Users can set this new storage parameter to control how densely the
entry and leaf pages of a newly-built index are packed. The fillfactor
ranges between 20 and 100.
 doc/src/sgml/ref/create_index.sgml     |  4 ++--
 src/backend/access/common/reloptions.c |  9 +++++++++
 src/backend/access/gin/gindatapage.c   | 22 ++++++++++++++++++----
 src/backend/access/gin/ginentrypage.c  | 20 +++++++++++++++++++-
 src/backend/access/gin/ginutil.c       |  3 ++-
 src/include/access/gin_private.h       |  3 +++
 6 files changed, 53 insertions(+), 8 deletions(-)

diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index 6b2ee28..c0ba24a 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -294,8 +294,8 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
     The optional <literal>WITH</> clause specifies <firstterm>storage
     parameters</> for the index.  Each index method has its own set of allowed
-    storage parameters.  The B-tree, hash, GiST and SP-GiST index methods all
-    accept this parameter:
+    storage parameters.  The B-tree, hash, GIN, GiST and SP-GiST index methods
+    all accept this parameter:
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index c16b38e..7137ba9 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -15,6 +15,7 @@
 #include "postgres.h"
+#include "access/gin_private.h"
 #include "access/gist_private.h"
 #include "access/hash.h"
 #include "access/htup_details.h"
@@ -133,6 +134,14 @@ static relopt_int intRelOpts[] =
+			"fillfactor",
+			"Packs gin index pages only to this percentage",
+		},
+	},
+	{
+		{
 			"Minimum number of tuple updates or deletes prior to vacuum",
diff --git a/src/backend/access/gin/gindatapage.c b/src/backend/access/gin/gindatapage.c
index 012225e..f322004 100644
--- a/src/backend/access/gin/gindatapage.c
+++ b/src/backend/access/gin/gindatapage.c
@@ -446,11 +446,21 @@ dataPlaceToPageLeaf(GinBtree btree, Buffer buf, GinBtreeStack *stack,
 	leafSegmentInfo *lastleftinfo;
 	ItemPointerData maxOldItem;
 	ItemPointerData remaining;
+	int			fillfactor;
 	rbound = *GinDataPageGetRightBound(page);
+	/* Grab option values */
+	if (btree->index->rd_options)
+	{
+		GinOptions *options = (GinOptions *) btree->index->rd_options;
+		fillfactor = options->fillfactor;
+	}
+	else
 	 * Count how many of the new items belong to this page.
@@ -511,15 +521,19 @@ dataPlaceToPageLeaf(GinBtree btree, Buffer buf, GinBtreeStack *stack,
 	 * If we're appending to the end of the page, we will append as many items
-	 * as we can fit (after splitting), and stop when the pages becomes full.
-	 * Otherwise we have to limit the number of new items to insert, because
-	 * once we start packing we can't just stop when we run out of space,
-	 * because we must make sure that all the old items still fit.
+	 * as we can fit up to the given fillfactor at build (after splitting),
+	 * and stop when the pages becomes full at this rate. Otherwise we have to
+	 * limit the number of new items to insert, because once we start packing
+	 * we can't just stop when we run out of space, because we must make sure
+	 * that all the old items still fit.
 	if (GinPageIsCompressed(page))
 		freespace = GinDataLeafPageGetFreeSpace(page);
+	else if (btree->isBuild)
+		freespace = BLCKSZ * (100 - fillfactor) / 100;
 	else
 		freespace = 0;
 	if (append)
diff --git a/src/backend/access/gin/ginentrypage.c b/src/backend/access/gin/ginentrypage.c
index 2dae7b9..36eccd7 100644
--- a/src/backend/access/gin/ginentrypage.c
+++ b/src/backend/access/gin/ginentrypage.c
@@ -462,6 +462,16 @@ entryIsEnoughSpace(GinBtree btree, Buffer buf, OffsetNumber off,
 	Size		releasedsz = 0;
 	Size		addedsz;
 	Page		page = BufferGetPage(buf);
+	int			fillfactor;
+	int			freespace;
+	if (btree->index->rd_options)
+	{
+		GinOptions *options = (GinOptions *) btree->index->rd_options;
+		fillfactor = options->fillfactor;
+	}
+	else
@@ -475,7 +485,15 @@ entryIsEnoughSpace(GinBtree btree, Buffer buf, OffsetNumber off,
 	addedsz = MAXALIGN(IndexTupleSize(insertData->entry)) + sizeof(ItemIdData);
-	if (PageGetFreeSpace(page) + releasedsz >= addedsz)
+	/*
+	 * Calculate freespace available. When building the index take into
+	 * account the fillfactor.
+	 */
+	if (btree->isBuild)
+		freespace = PageGetFreeSpace(page) - BLCKSZ * (100 - fillfactor) / 100;
+	else
+		freespace = PageGetFreeSpace(page);
+	if (freespace + releasedsz >= addedsz)
 		return true;
 	return false;
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index f593a72..bfdfd0c 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -527,7 +527,8 @@ ginoptions(PG_FUNCTION_ARGS)
 	static const relopt_parse_elt tab[] = {
 		{"fastupdate", RELOPT_TYPE_BOOL, offsetof(GinOptions, useFastUpdate)},
 		{"gin_pending_list_limit", RELOPT_TYPE_INT, offsetof(GinOptions,
-																pendingListCleanupSize)}
+															 pendingListCleanupSize)},
+		{"fillfactor", RELOPT_TYPE_INT, offsetof(GinOptions, fillfactor)}
 	options = parseRelOptions(reloptions, validate, RELOPT_KIND_GIN,
diff --git a/src/include/access/gin_private.h b/src/include/access/gin_private.h
index 3d46f20..855091b 100644
--- a/src/include/access/gin_private.h
+++ b/src/include/access/gin_private.h
@@ -315,6 +315,7 @@ typedef struct GinOptions
 	int32		vl_len_;		/* varlena header (do not touch directly!) */
 	bool		useFastUpdate;	/* use fast updates? */
 	int			pendingListCleanupSize;	/* maximum size of pending list */
+	int			fillfactor;		/* page fillfactor in percent (0..100) */
 } GinOptions;
@@ -327,6 +328,8 @@ typedef struct GinOptions
 	 ((GinOptions *) (relation)->rd_options)->pendingListCleanupSize : \
+#define GIN_MIN_FILLFACTOR			20
 /* Macros for buffer lock/unlock operations */
