Hi Zsolt and Dilip,

Thanks for the review and the useful comments!

On Tue, Feb 3, 2026 at 10:10 PM Zsolt Parragi <[email protected]> wrote:
>
> Hello!
>
> I did some testing with this patch, and I think there are some issues
> during restoration:
>
> 1. Isn't there a possible race / scheduling mistake during restore
> because of missing dependencies? The code now prints out "TABLE DATA
> (pages %u:%u)", while the restore code checks for the explicit "TABLE
> DATA" string for dependency tracking (pg_backup_archiver.c:2013 and a
> few other places). This causes POST DATA to have no dependency on the
> table data, and can be scheduled before we load all table data.

I have resolved this by adding a second array to the reverse-dependency
mechanism in buildTocEntryArrays() for chunked dumps, where I collect the
arrays of chunk ids in AH->tableDataChunkIds.

For this I extracted the list management part from DumpableObject:

typedef struct _DependencyList
{
	DumpId	   *dependencies;	/* dumpIds of objects this one depends on */
	int			nDeps;			/* number of valid dependencies */
	int			allocDeps;		/* allocated size of dependencies[] */
} DependencyList;

And added addStandaloneDependency(), based on addObjectDependency().

I simplified it to always use realloc, as realloc handles the NULL pointer case itself.

void
addStandaloneDependency(DependencyList *dobj, DumpId refId)
{
	if (dobj->nDeps >= dobj->allocDeps)
	{
		dobj->allocDeps = (dobj->allocDeps <= 0) ? 16 : dobj->allocDeps * 2;
		dobj->dependencies = pg_realloc_array(dobj->dependencies,
											  DumpId, dobj->allocDeps);
	}
	dobj->dependencies[dobj->nDeps++] = refId;
}

And then I use AH->tableDataChunkIds in repoint_table_dependencies() to
- replace the dependency on the table definition with a dependency on the first chunk
- append the remaining chunks at the end of the dependency list.

> I was able to verify the scheduling issue with an index: the INDEX
> part is scheduled too early, before all TABLE DATA completes, but then
> locking prevents it from progressing, so everything completed fine in
> the end. Even if that's guaranteed, which I'm not 100% sure of, it's
> still based on luck and not proper logic, and takes up a slot (or
> multiple), reducing parallelism.
>
> 2. Fixing the TABLE DATA strcmp checks solves the scheduling issue,
> but it's not that simple, because then it causes truncation issues
> during restore, which needs additional changes in the restore code. I
> did a quick fix for that by adding an additional condition to the
> created flag, and with that it seems to restore everything properly,
> and with proper ordering, only starting index/constraint/etc after all
> table data is completed. However this was definitely just a quick test
> fix, this needs a proper better solution.
>
> Other issues I see are more minor, but numerous:

I collect the chunk dependencies in a separate array, which
should solve the truncation issue.

Can you advise a good check to add to tap tests for verifying?

> 3. The patch still has lots of debug output (pg_log_WARNING("CHUNKING
> ...")); Is this intended? Shouldn't these be behind some verbose
> check, and maybe use info instead of warning?

These were left in to ease initial reviewing. I have now either removed
them or turned them into pg_log_debug().

> 4. The is_segment macro should have () around the use of tdiptr

Thanks, fixed.

> 5. There's still a 32000 magic constant, shouldn't that have some
> descriptive name / explanatory comment?

I turned this into "ctid < (pagenr+1, 0)" for clarity and
future-proofing, as it is not entirely impossible that we could at some
point have more than 32000 items per page.

> 6. formatting issues at multiple places, mostly missing spaces after
> if/while/for statements

My hope was that the pre-release automatic pgindent run takes care of
this.

I will eyeball the code to see if I find them, but I don't think I have a
good way to detect them all.

Suggestions very much welcome!

> 7. inconsistent error messages (not in range vs must be in range)

> 8. There's a remaining TODO that seems stale, current_chunk_start is
> already uint64

Removed.

> 9. typo: "the computed from pog_relation_size" -> "then computed from
> pg_relation_size"

Fixed.

On Thu, Feb 12, 2026 at 7:13 AM Dilip Kumar <[email protected]> wrote:
>
> On Wed, Jan 28, 2026 at 11:00 PM Hannu Krosing <[email protected]> wrote:
> >
> > v13 has added a proper test comparing original and restored table data
> >
> I was reviewing v13 and here are some initial comments I have
>
> 1. IMHO the commit message details about the work progress instead of
> a high level idea about what it actually does and how.
> Suggestion:
>
> SUBJECT: Add --max-table-segment-pages option to pg_dump for parallel
> table dumping.
>
> This patch introduces the ability to split large heap tables into segments
> based on a specified number of pages. These segments can then be dumped in
> parallel using the existing jobs infrastructure, significantly reducing
> the time required to dump very large tables.
>
> The implementation uses ctid-based range queries (e.g., WHERE ctid >=
> '(start,1)'
> AND ctid <= '(end,32000)') to extract specific chunks of the relation.
>
> <more architecture details and limitation if any>

SUBJECT: Add --max-table-segment-pages option to pg_dump for parallel
table dumping.

This patch introduces the ability to split large heap tables into segments
based on a specified number of pages. These segments can then be dumped in
parallel using the existing jobs infrastructure, significantly reducing
the time required to dump very large tables.

The --max-table-segment-pages number applies specifically to main-table
pages and guarantees nothing about output size.
The output can be empty if there are no live tuples in the page range,
or almost 200 GB if the pages hold only pointers to 1GB TOAST items.

The implementation uses ctid-based range queries (e.g., WHERE ctid >=
'(startPage,1)' AND ctid <= '(endPage+1,0)') to extract specific chunks of
the relation.

This is only efficiently supported on PostgreSQL version 14+, though it
does work, inefficiently, on earlier versions.

The patch only supports the "heap" access method, as other access methods
may not even have a ctid column.

> 2.
> + pg_log_warning("CHUNKING: set dopt.max_table_segment_pages to [%u]",
> dopt.max_table_segment_pages);
> + break;
>
> IMHO we don't need to place warning here while processing the input parameters

Either removed or turned into pg_log_debug().

> 3.
> + printf(_("  --max-table-segment-pages=NUMPAGES\n"
> +      "                               Number of main table pages
> above which data is \n"
> + "                               copied out in chunks, also
> determines the chunk size\n"));
>
> Check the comment formatting, all the parameter description starts
> with lower case, so better we start with "number" rather than "Number"

Fixed

> 4.
> + if (is_segment(tdinfo))
> + {
> + appendPQExpBufferStr(q, tdinfo->filtercond?" AND ":" WHERE ");
> + if(tdinfo->startPage == 0)
> + appendPQExpBuffer(q, "ctid <= '(%u,32000)'", tdinfo->endPage);
> + else if(tdinfo->endPage != InvalidBlockNumber)
> + appendPQExpBuffer(q, "ctid BETWEEN '(%u,1)' AND '(%u,32000)'",
> + tdinfo->startPage, tdinfo->endPage);
> + else
> + appendPQExpBuffer(q, "ctid >= '(%u,1)'", tdinfo->startPage);
> + pg_log_warning("CHUNKING: pages [%u:%u]",tdinfo->startPage, 
> tdinfo->endPage);
> + }
>
> IMHO we should explain this chunking logic in the comment above this code 
> block?

I added the comment.
I also changed the chunk end logic to "ctid < '(LastPage+1,0)'" for clarity and
future-proofing.

----
Best Regards

Hannu
From d9442eb6476ba27e0f3dee085e48de2efbb445d6 Mon Sep 17 00:00:00 2001
From: Hannu Krosing <[email protected]>
Date: Sat, 28 Mar 2026 11:53:39 +0100
Subject: [PATCH v14] Add --max-table-segment-pages option to pg_dump for
 parallel table dumping.

This patch introduces the ability to split large heap tables into segments
based on a specified number of pages. These segments can then be dumped in
parallel using the existing jobs infrastructure, significantly reducing
the time required to dump very large tables.

The --max-table-segment-pages number applies specifically to main-table
pages and guarantees nothing about output size.
The output can be empty if there are no live tuples in the page range,
or almost 200 GB if the pages hold only pointers to 1GB TOAST items.

The implementation uses ctid-based range queries (e.g., WHERE ctid >=
'(startPage,1)' AND ctid < '(endPage+1,0)') to extract specific chunks of
the relation.

This is only efficiently supported on PostgreSQL version 14+, though it
does work, inefficiently, on earlier versions.

The patch only supports the "heap" access method, as other access methods
may not even have a ctid column.
---
 doc/src/sgml/ref/pg_dump.sgml             |  24 +++
 src/bin/pg_dump/pg_backup.h               |   2 +
 src/bin/pg_dump/pg_backup_archiver.c      |  84 +++++++++-
 src/bin/pg_dump/pg_backup_archiver.h      |  12 +-
 src/bin/pg_dump/pg_dump.c                 | 177 +++++++++++++++++-----
 src/bin/pg_dump/pg_dump.h                 |  22 ++-
 src/bin/pg_dump/t/004_pg_dump_parallel.pl |  31 ++++
 src/fe_utils/option_utils.c               |  55 +++++++
 src/include/fe_utils/option_utils.h       |   3 +
 9 files changed, 364 insertions(+), 46 deletions(-)

diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index 7f538e90194..5f056bb4af6 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -1066,6 +1066,30 @@ PostgreSQL documentation
       </listitem>
      </varlistentry>
 
+     <varlistentry>
+      <term><option>--max-table-segment-pages=<replaceable class="parameter">npages</replaceable></option></term>
+      <listitem>
+       <para>
+        Dump data in segments based on the number of pages in the main relation.
+        If the number of data pages in the relation is more than <replaceable class="parameter">npages</replaceable>,
+        the data is split into segments of that many pages.
+        Individual segments can be dumped in parallel.
+       </para>
+
+       <note>
+        <para>
+         The option <option>--max-table-segment-pages</option> applies only to pages
+         in the main heap, so if the table has a large TOASTed part this has to be
+         taken into account when deciding on the number of pages to use.
+         In the extreme case a single 8kB heap page can hold ~200 TOAST pointers, each
+         corresponding to 1GB of data. If this data is also non-compressible then a
+         single-page segment can dump as a 200GB file.
+        </para>
+       </note>
+
+      </listitem>
+     </varlistentry>
+
      <varlistentry>
       <term><option>--no-comments</option></term>
       <listitem>
diff --git a/src/bin/pg_dump/pg_backup.h b/src/bin/pg_dump/pg_backup.h
index fda912ba0a9..11863a1915f 100644
--- a/src/bin/pg_dump/pg_backup.h
+++ b/src/bin/pg_dump/pg_backup.h
@@ -27,6 +27,7 @@
 #include "common/file_utils.h"
 #include "fe_utils/simple_list.h"
 #include "libpq-fe.h"
+#include "storage/block.h"
 
 
 typedef enum trivalue
@@ -179,6 +180,7 @@ typedef struct _dumpOptions
 	bool		aclsSkip;
 	const char *lockWaitTimeout;
 	int			dump_inserts;	/* 0 = COPY, otherwise rows per INSERT */
+	BlockNumber	max_table_segment_pages; /* chunk when relpages is above this */
 
 	/* flags for various command-line long options */
 	int			disable_dollar_quoting;
diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index 271a2c3e481..384add0713b 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -44,6 +44,7 @@
 #include "pg_backup_archiver.h"
 #include "pg_backup_db.h"
 #include "pg_backup_utils.h"
+#include "storage/block.h"
 
 #define TEXT_DUMP_HEADER "--\n-- PostgreSQL database dump\n--\n\n"
 #define TEXT_DUMPALL_HEADER "--\n-- PostgreSQL database cluster dump\n--\n\n"
@@ -154,6 +155,7 @@ InitDumpOptions(DumpOptions *opts)
 	opts->dumpSchema = true;
 	opts->dumpData = true;
 	opts->dumpStatistics = false;
+	opts->max_table_segment_pages = InvalidBlockNumber;
 }
 
 /*
@@ -1995,6 +1997,28 @@ _moveBefore(TocEntry *pos, TocEntry *te)
 	pos->prev = te;
 }
 
+/*
+ * Add a dependency id to a DependencyList object.
+ * This is currently used for collecting reverse
+ * dependencies for a chunked data dump.
+ *
+ * Note: duplicate dependencies are currently not eliminated
+ */
+void
+addStandaloneDependency(DependencyList *dobj, DumpId refId)
+{
+	pg_log_debug("Adding dep: list %p + dep %u", (void *) dobj->dependencies, refId);
+	if (dobj->nDeps >= dobj->allocDeps)
+	{
+		dobj->allocDeps = (dobj->allocDeps <= 0) ? 16 : dobj->allocDeps * 2;
+		dobj->dependencies = pg_realloc_array(dobj->dependencies,
+											  DumpId, dobj->allocDeps);
+		pg_log_debug("Realloced list %p to size %d", (void *) dobj->dependencies, dobj->allocDeps);
+	}
+	pg_log_debug("Added dep: list %p + dep %u", (void *) dobj->dependencies, refId);
+	dobj->dependencies[dobj->nDeps++] = refId;
+}
+
 /*
  * Build index arrays for the TOC list
  *
@@ -2014,6 +2038,7 @@ buildTocEntryArrays(ArchiveHandle *AH)
 
 	AH->tocsByDumpId = pg_malloc0_array(TocEntry *, (maxDumpId + 1));
 	AH->tableDataId = pg_malloc0_array(DumpId, (maxDumpId + 1));
+	AH->tableDataChunkIds = pg_malloc0_array(DependencyList, (maxDumpId + 1));
 
 	for (te = AH->toc->next; te != AH->toc; te = te->next)
 	{
@@ -2029,8 +2054,12 @@ buildTocEntryArrays(ArchiveHandle *AH)
 		 * TOC entry that has a DATA item.  We compute this by reversing the
 		 * TABLE DATA item's dependency, knowing that a TABLE DATA item has
 		 * just one dependency and it is the TABLE item.
+		 *
+		 * For chunked table data, the TABLE DATA item has a description like
+		 * "TABLE DATA (pages 100:199)", and we collect all such items as
+		 * reverse dependencies for the parent table's entry in tableDataChunkIds.
 		 */
-		if (strcmp(te->desc, "TABLE DATA") == 0 && te->nDeps > 0)
+		if (strncmp(te->desc, "TABLE DATA", 10) == 0 && te->nDeps > 0)
 		{
 			DumpId		tableId = te->dependencies[0];
 
@@ -2042,7 +2071,14 @@ buildTocEntryArrays(ArchiveHandle *AH)
 			if (tableId <= 0 || tableId > maxDumpId)
 				pg_fatal("bad table dumpId for TABLE DATA item");
 
-			AH->tableDataId[tableId] = te->dumpId;
+			if (te->desc[10] == '\0') /* te->desc == "TABLE DATA" */
+				AH->tableDataId[tableId] = te->dumpId;
+			else
+			{
+				/* Chunked table data, the description is "TABLE DATA (pages %u:%u)" */
+				addStandaloneDependency(&(AH->tableDataChunkIds[tableId]), te->dumpId);
+				pg_log_debug("Added chunked table data dependency: tableId %u + chunkId %u",
+							 tableId, te->dumpId);
+			}
 		}
 	}
 }
@@ -5017,6 +5053,12 @@ fix_dependencies(ArchiveHandle *AH)
  * that parallel restore will prioritize larger jobs (index builds, FK
  * constraint checks, etc) over smaller ones, avoiding situations where we
  * end a restore with only one active job working on a large table.
+ *
+ * In case of chunked dumps, we replace the dependency on the table with a
+ * dependency on the first chunk of data and add the remaining chunk ids,
+ * if any, to the end of the dependency list.
+ * We also calculate fullDataLength as the sum of the lengths of the chunk
+ * data items and use that to set the item's dataLength.
  */
 static void
 repoint_table_dependencies(ArchiveHandle *AH)
@@ -5032,8 +5074,9 @@ repoint_table_dependencies(ArchiveHandle *AH)
 		for (i = 0; i < te->nDeps; i++)
 		{
 			olddep = te->dependencies[i];
-			if (olddep <= AH->maxDumpId &&
-				AH->tableDataId[olddep] != 0)
+			if (olddep > AH->maxDumpId)
+				continue;
+			if (AH->tableDataId[olddep] != 0)
 			{
 				DumpId		tabledataid = AH->tableDataId[olddep];
 				TocEntry   *tabledatate = AH->tocsByDumpId[tabledataid];
@@ -5043,6 +5086,39 @@ repoint_table_dependencies(ArchiveHandle *AH)
 				pg_log_debug("transferring dependency %d -> %d to %d",
 							 te->dumpId, olddep, tabledataid);
 			}
+			else if (AH->tableDataChunkIds[olddep].nDeps > 0)
+			{
+				int			j;
+				DumpId		chunkdataid;
+				uint64		fullDataLength;
+				DependencyList *deplist = &AH->tableDataChunkIds[olddep];
+
+				/* first in list replaces the dependency on table */
+				chunkdataid = deplist->dependencies[0];
+				te->dependencies[i] = chunkdataid;
+				fullDataLength = AH->tocsByDumpId[chunkdataid]->dataLength;
+				pg_log_debug("transferring chunk list %d -> %d to %d",
+							 te->dumpId, olddep, chunkdataid);
+
+				if (deplist->nDeps > 1)
+				{
+					/* make space */
+					te->dependencies = pg_realloc_array(te->dependencies,
+												  DumpId,
+												  te->nDeps + deplist->nDeps - 1);
+
+					/* the rest are appended to dependencies */
+					for (j = 1; j < deplist->nDeps; j++)
+					{
+						chunkdataid = deplist->dependencies[j];
+						te->dependencies[te->nDeps + j - 1] = chunkdataid;
+						fullDataLength += AH->tocsByDumpId[chunkdataid]->dataLength;
+						pg_log_debug("adding chunk list %d -> %d to %d",
+									 te->dumpId, olddep, chunkdataid);
+					}
+					/* account for the appended chunk ids */
+					te->nDeps += deplist->nDeps - 1;
+				}
+				te->dataLength = Max(te->dataLength, fullDataLength);
+			}
 		}
 	}
 }
diff --git a/src/bin/pg_dump/pg_backup_archiver.h b/src/bin/pg_dump/pg_backup_archiver.h
index 365073b3eae..cfa3ea1bbd6 100644
--- a/src/bin/pg_dump/pg_backup_archiver.h
+++ b/src/bin/pg_dump/pg_backup_archiver.h
@@ -179,6 +179,13 @@ typedef enum
 	OUTPUT_OTHERDATA,			/* writing data as INSERT commands */
 } ArchiverOutput;
 
+typedef struct _DependencyList
+{
+	DumpId	   *dependencies;	/* dumpIds of objects this one depends on */
+	int			nDeps;			/* number of valid dependencies */
+	int			allocDeps;		/* allocated size of dependencies[] */
+} DependencyList;
+
 /*
  * For historical reasons, ACL items are interspersed with everything else in
  * a dump file's TOC; typically they're right after the object they're for.
@@ -311,6 +318,7 @@ struct _archiveHandle
 	/* arrays created after the TOC list is complete: */
 	struct _tocEntry **tocsByDumpId;	/* TOCs indexed by dumpId */
 	DumpId	   *tableDataId;	/* TABLE DATA ids, indexed by table dumpId */
+	DependencyList *tableDataChunkIds; /* dependencies indexed by dumpId */
 
 	struct _tocEntry *currToc;	/* Used when dumping data */
 	pg_compress_specification compression_spec; /* Requested specification for
@@ -377,7 +385,7 @@ struct _tocEntry
 	size_t		defnLen;		/* length of dumped definition */
 
 	/* working state while dumping/restoring */
-	pgoff_t		dataLength;		/* item's data size; 0 if none or unknown */
+	uint64		dataLength;		/* item's data size; 0 if none or unknown */
 	int			reqs;			/* do we need schema and/or data of object
 								 * (REQ_* bit mask) */
 	bool		created;		/* set for DATA member if TABLE was created */
@@ -437,6 +445,8 @@ extern int	TocIDRequired(ArchiveHandle *AH, DumpId id);
 TocEntry   *getTocEntryByDumpId(ArchiveHandle *AH, DumpId id);
 extern bool checkSeek(FILE *fp);
 
+extern void addStandaloneDependency(DependencyList *dobj, DumpId refId);
+
 #define appendStringLiteralAHX(buf,str,AH) \
 	appendStringLiteral(buf, str, (AH)->public.encoding, (AH)->public.std_strings)
 
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 5d1f7682f11..1e7d9a3f7f3 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -535,6 +535,7 @@ main(int argc, char **argv)
 		{"exclude-extension", required_argument, NULL, 17},
 		{"sequence-data", no_argument, &dopt.sequence_data, 1},
 		{"restrict-key", required_argument, NULL, 25},
+		{"max-table-segment-pages", required_argument, NULL, 26},
 
 		{NULL, 0, NULL, 0}
 	};
@@ -799,6 +800,12 @@ main(int argc, char **argv)
 				dopt.restrict_key = pg_strdup(optarg);
 				break;
 
+			case 26:
+				if (!option_parse_uint32(optarg, "--max-table-segment-pages", 1, MaxBlockNumber,
+									  &dopt.max_table_segment_pages))
+					exit_nicely(1);
+				break;
+
 			default:
 				/* getopt_long already emitted a complaint */
 				pg_log_error_hint("Try \"%s --help\" for more information.", progname);
@@ -1344,6 +1351,9 @@ help(const char *progname)
 	printf(_("  --extra-float-digits=NUM     override default setting for extra_float_digits\n"));
 	printf(_("  --filter=FILENAME            include or exclude objects and data from dump\n"
 			 "                               based on expressions in FILENAME\n"));
+	printf(_("  --max-table-segment-pages=NUMPAGES\n"
+			 "                               number of main table pages above which data is\n"
+			 "                               copied out in chunks, also determines the chunk size\n"));
 	printf(_("  --if-exists                  use IF EXISTS when dropping objects\n"));
 	printf(_("  --include-foreign-data=PATTERN\n"
 			 "                               include data of foreign tables on foreign\n"
@@ -2396,7 +2406,7 @@ dumpTableData_copy(Archive *fout, const void *dcontext)
 	 * dumping an old pg_largeobject_metadata defined WITH OIDS.  For other
 	 * cases a simple COPY suffices.
 	 */
-	if (tdinfo->filtercond || tbinfo->relkind == RELKIND_FOREIGN_TABLE ||
+	if (tdinfo->filtercond || is_segment(tdinfo) || tbinfo->relkind == RELKIND_FOREIGN_TABLE ||
 		(fout->dopt->binary_upgrade && fout->remoteVersion < 120000 &&
 		 tbinfo->dobj.catId.oid == LargeObjectMetadataRelationId))
 	{
@@ -2414,9 +2424,37 @@ dumpTableData_copy(Archive *fout, const void *dcontext)
 		else
 			appendPQExpBufferStr(q, "* ");
 
-		appendPQExpBuffer(q, "FROM %s %s) TO stdout;",
+		appendPQExpBuffer(q, "FROM %s %s",
 						  fmtQualifiedDumpable(tbinfo),
 						  tdinfo->filtercond ? tdinfo->filtercond : "");
+		/*
+		 * If it's a segment, we need to add a filter condition to select
+		 * the right page range:
+		 * - for the first segment (the one with startPage == 0) we add
+		 *   "ctid < (endPage+1,0)"
+		 * - for the last segment (the one with endPage == InvalidBlockNumber)
+		 *   we add "ctid >= (startPage,1)"; we leave the upper bound open
+		 *   for the case where more pages were added after we measured
+		 * - for middle segments we add
+		 *   "ctid >= (startPage,1) AND ctid < (endPage+1,0)"
+		 *
+		 * "ctid < (endPage+1,0)" was chosen as the range end instead of
+		 * "ctid <= (endPage,maxtuple)" so that we do not have to estimate
+		 * maxtuple.
+		 */
+		if (is_segment(tdinfo))
+		{
+			appendPQExpBufferStr(q, tdinfo->filtercond ? " AND " : " WHERE ");
+			if (tdinfo->startPage == 0)
+				appendPQExpBuffer(q, "ctid < '(%u,0)'", tdinfo->endPage + 1);
+			else if (tdinfo->endPage != InvalidBlockNumber)
+				appendPQExpBuffer(q, "ctid >= '(%u,1)' AND ctid < '(%u,0)'",
+								  tdinfo->startPage, tdinfo->endPage + 1);
+			else
+				appendPQExpBuffer(q, "ctid >= '(%u,1)'", tdinfo->startPage);
+		}
+
+		appendPQExpBufferStr(q, ") TO stdout;");
 	}
 	else
 	{
@@ -2424,6 +2462,10 @@ dumpTableData_copy(Archive *fout, const void *dcontext)
 						  fmtQualifiedDumpable(tbinfo),
 						  column_list);
 	}
+
+	if (is_segment(tdinfo))
+		pg_log_debug("CHUNKING: data query: %s", q->data);
+	
 	res = ExecuteSqlQuery(fout, q->data, PGRES_COPY_OUT);
 	PQclear(res);
 	destroyPQExpBuffer(clistBuf);
@@ -2919,42 +2961,89 @@ dumpTableData(Archive *fout, const TableDataInfo *tdinfo)
 	{
 		TocEntry   *te;
 
-		te = ArchiveEntry(fout, tdinfo->dobj.catId, tdinfo->dobj.dumpId,
-						  ARCHIVE_OPTS(.tag = tbinfo->dobj.name,
-									   .namespace = tbinfo->dobj.namespace->dobj.name,
-									   .owner = tbinfo->rolname,
-									   .description = "TABLE DATA",
-									   .section = SECTION_DATA,
-									   .createStmt = tdDefn,
-									   .copyStmt = copyStmt,
-									   .deps = &(tbinfo->dobj.dumpId),
-									   .nDeps = 1,
-									   .dumpFn = dumpFn,
-									   .dumpArg = tdinfo));
-
-		/*
-		 * Set the TocEntry's dataLength in case we are doing a parallel dump
-		 * and want to order dump jobs by table size.  We choose to measure
-		 * dataLength in table pages (including TOAST pages) during dump, so
-		 * no scaling is needed.
-		 *
-		 * However, relpages is declared as "integer" in pg_class, and hence
-		 * also in TableInfo, but it's really BlockNumber a/k/a unsigned int.
-		 * Cast so that we get the right interpretation of table sizes
-		 * exceeding INT_MAX pages.
+		/*
+		 * Data chunking works off relpages, which is computed exactly using
+		 * pg_relation_size() when --max-table-segment-pages was set.
+		 *
+		 * We also don't chunk if the table access method is not "heap".
+		 * TODO: we may add chunking for other access methods later, maybe
+		 * based on primary key ranges.
+		 */
-		te->dataLength = (BlockNumber) tbinfo->relpages;
-		te->dataLength += (BlockNumber) tbinfo->toastpages;
+		if (tbinfo->relpages <= dopt->max_table_segment_pages || 
+			strcmp(tbinfo->amname, "heap") != 0)
+		{
+			te = ArchiveEntry(fout, tdinfo->dobj.catId, tdinfo->dobj.dumpId,
+							ARCHIVE_OPTS(.tag = tbinfo->dobj.name,
+										.namespace = tbinfo->dobj.namespace->dobj.name,
+										.owner = tbinfo->rolname,
+										.description = "TABLE DATA",
+										.section = SECTION_DATA,
+										.createStmt = tdDefn,
+										.copyStmt = copyStmt,
+										.deps = &(tbinfo->dobj.dumpId),
+										.nDeps = 1,
+										.dumpFn = dumpFn,
+										.dumpArg = tdinfo));
 
-		/*
-		 * If pgoff_t is only 32 bits wide, the above refinement is useless,
-		 * and instead we'd better worry about integer overflow.  Clamp to
-		 * INT_MAX if the correct result exceeds that.
-		 */
-		if (sizeof(te->dataLength) == 4 &&
-			(tbinfo->relpages < 0 || tbinfo->toastpages < 0 ||
-			 te->dataLength < 0))
-			te->dataLength = INT_MAX;
+			/*
+			 * Set the TocEntry's dataLength in case we are doing a parallel dump
+			 * and want to order dump jobs by table size.  We choose to measure
+			 * dataLength in table pages (including TOAST pages) during dump, so
+			 * no scaling is needed.
+			 *
+			 * While pg_class.relpages is declared as "integer", it really
+			 * stores a BlockNumber, a/k/a unsigned int, so we convert it
+			 * back and store it as BlockNumber in TableInfo.
+			 * dataLength is now uint64, so it does not overflow even for
+			 * 2 x UINT32_MAX pages.
+			 */
+			te->dataLength = tbinfo->relpages;
+			te->dataLength += tbinfo->toastpages;
+		}
+		else
+		{
+			uint64 current_chunk_start = 0;
+			PQExpBuffer chunk_desc = createPQExpBuffer();
+
+			while (current_chunk_start < tbinfo->relpages)
+			{
+				TableDataInfo *chunk_tdinfo = (TableDataInfo *) pg_malloc(sizeof(TableDataInfo));
+
+				memcpy(chunk_tdinfo, tdinfo, sizeof(TableDataInfo));
+				AssignDumpId(&chunk_tdinfo->dobj);
+				addObjectDependency(&chunk_tdinfo->dobj, tbinfo->dobj.dumpId);
+				chunk_tdinfo->startPage = (BlockNumber) current_chunk_start;
+				chunk_tdinfo->endPage = chunk_tdinfo->startPage + dopt->max_table_segment_pages - 1;
+				
+				current_chunk_start += dopt->max_table_segment_pages;
+				if (current_chunk_start >= tbinfo->relpages)
+					chunk_tdinfo->endPage = InvalidBlockNumber; /* last chunk is for "all the rest" */
+
+				printfPQExpBuffer(chunk_desc, "TABLE DATA (pages %u:%u)", chunk_tdinfo->startPage, chunk_tdinfo->endPage);
+
+				te = ArchiveEntry(fout, chunk_tdinfo->dobj.catId, chunk_tdinfo->dobj.dumpId,
+							ARCHIVE_OPTS(.tag = tbinfo->dobj.name,
+										.namespace = tbinfo->dobj.namespace->dobj.name,
+										.owner = tbinfo->rolname,
+										.description = chunk_desc->data,
+										.section = SECTION_DATA,
+										.createStmt = tdDefn,
+										.copyStmt = copyStmt,
+										.deps = &(tbinfo->dobj.dumpId),
+										.nDeps = 1,
+										.dumpFn = dumpFn,
+										.dumpArg = chunk_tdinfo));
+
+				if (chunk_tdinfo->endPage == InvalidBlockNumber)
+					te->dataLength = tbinfo->relpages - chunk_tdinfo->startPage;
+				else
+					te->dataLength = dopt->max_table_segment_pages;
+				/* let's assume toast pages distribute evenly among chunks */
+				if (tbinfo->relpages)
+					te->dataLength += te->dataLength * tbinfo->toastpages / tbinfo->relpages;
+			}
+
+			destroyPQExpBuffer(chunk_desc);
+		}
 	}
 
 	destroyPQExpBuffer(copyBuf);
@@ -3081,6 +3170,8 @@ makeTableDataInfo(DumpOptions *dopt, TableInfo *tbinfo)
 	tdinfo->dobj.namespace = tbinfo->dobj.namespace;
 	tdinfo->tdtable = tbinfo;
 	tdinfo->filtercond = NULL;	/* might get set later */
+	tdinfo->startPage = InvalidBlockNumber; /* we use this as indication that no chunking is needed */
+	tdinfo->endPage = InvalidBlockNumber;
 	addObjectDependency(&tdinfo->dobj, tbinfo->dobj.dumpId);
 
 	/* A TableDataInfo contains data, of course */
@@ -7347,8 +7438,16 @@ getTables(Archive *fout, int *numTables)
 						 "c.relnamespace, c.relkind, c.reltype, "
 						 "c.relowner, "
 						 "c.relchecks, "
-						 "c.relhasindex, c.relhasrules, c.relpages, "
-						 "c.reltuples, c.relallvisible, ");
+						 "c.relhasindex, c.relhasrules, ");
+
+	/* fetch current relation size if chunking is requested */
+	if (dopt->max_table_segment_pages != InvalidBlockNumber)
+		appendPQExpBufferStr(query, "pg_relation_size(c.oid)/current_setting('block_size')::int AS relpages, ");
+	else
+		/* pg_class.relpages stores BlockNumber (uint32) in an int field, convert to oid to get unsigned int out */
+		appendPQExpBufferStr(query, "c.relpages::oid, ");
+
+	appendPQExpBufferStr(query, "c.reltuples, c.relallvisible, ");
 
 	if (fout->remoteVersion >= 180000)
 		appendPQExpBufferStr(query, "c.relallfrozen, ");
@@ -7589,7 +7688,7 @@ getTables(Archive *fout, int *numTables)
 		tblinfo[i].ncheck = atoi(PQgetvalue(res, i, i_relchecks));
 		tblinfo[i].hasindex = (strcmp(PQgetvalue(res, i, i_relhasindex), "t") == 0);
 		tblinfo[i].hasrules = (strcmp(PQgetvalue(res, i, i_relhasrules), "t") == 0);
-		tblinfo[i].relpages = atoi(PQgetvalue(res, i, i_relpages));
+		tblinfo[i].relpages = strtoul(PQgetvalue(res, i, i_relpages), NULL, 10);
 		if (PQgetisnull(res, i, i_toastpages))
 			tblinfo[i].toastpages = 0;
 		else
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 5a6726d8b12..84e682d585f 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -16,6 +16,7 @@
 
 #include "pg_backup.h"
 #include "catalog/pg_publication_d.h"
+#include "storage/block.h"
 
 
 #define oidcmp(x,y) ( ((x) < (y) ? -1 : ((x) > (y)) ?  1 : 0) )
@@ -335,7 +336,11 @@ typedef struct _tableInfo
 	Oid			owning_tab;		/* OID of table owning sequence */
 	int			owning_col;		/* attr # of column owning sequence */
 	bool		is_identity_sequence;
-	int32		relpages;		/* table's size in pages (from pg_class) */
+	BlockNumber	relpages;		/* table's size in pages (from pg_class),
+								 * converted to unsigned integer;
+								 * when --max-table-segment-pages is set it
+								 * is instead computed from pg_relation_size()
+								 */
 	int			toastpages;		/* toast table's size in pages, if any */
 
 	bool		interesting;	/* true if need to collect more data */
@@ -413,8 +418,21 @@ typedef struct _tableDataInfo
 	DumpableObject dobj;
 	TableInfo  *tdtable;		/* link to table to dump */
 	char	   *filtercond;		/* WHERE condition to limit rows dumped */
+	/* startPage and endPage to support segmented dump */
+	BlockNumber	startPage;		/* As we always know the lowest segment page
+								 * number, we can use InvalidBlockNumber here
+								 * to recognize the non-segmented case.
+								 * When it is 0 (the first page of the first
+								 * segment) we can omit the lower bound */
+	BlockNumber	endPage;		/* last page in segment for page-range dump;
+								 * startPage+max_table_segment_pages-1 for
+								 * most segments, but InvalidBlockNumber for
+								 * the last one to indicate an open range
+								 */
 } TableDataInfo;
 
+#define is_segment(tdiptr) ((tdiptr)->startPage != InvalidBlockNumber)
+
 typedef struct _indxInfo
 {
 	DumpableObject dobj;
@@ -449,7 +467,7 @@ typedef struct _relStatsInfo
 {
 	DumpableObject dobj;
 	Oid			relid;
-	int32		relpages;
+	BlockNumber	relpages;
 	char	   *reltuples;
 	int32		relallvisible;
 	int32		relallfrozen;
diff --git a/src/bin/pg_dump/t/004_pg_dump_parallel.pl b/src/bin/pg_dump/t/004_pg_dump_parallel.pl
index 738f34b1c1b..4f35aeed9b9 100644
--- a/src/bin/pg_dump/t/004_pg_dump_parallel.pl
+++ b/src/bin/pg_dump/t/004_pg_dump_parallel.pl
@@ -11,6 +11,7 @@ use Test::More;
 my $dbname1 = 'regression_src';
 my $dbname2 = 'regression_dest1';
 my $dbname3 = 'regression_dest2';
+my $dbname4 = 'regression_dest3';
 
 my $node = PostgreSQL::Test::Cluster->new('main');
 $node->init;
@@ -21,6 +22,7 @@ my $backupdir = $node->backup_dir;
 $node->run_log([ 'createdb', $dbname1 ]);
 $node->run_log([ 'createdb', $dbname2 ]);
 $node->run_log([ 'createdb', $dbname3 ]);
+$node->run_log([ 'createdb', $dbname4 ]);
 
 $node->safe_psql(
 	$dbname1,
@@ -87,4 +89,33 @@ $node->command_ok(
 	],
 	'parallel restore as inserts');
 
+$node->command_ok(
+	[
+		'pg_dump',
+		'--format' => 'directory',
+		'--max-table-segment-pages' => 2,
+		'--no-sync',
+		'--jobs' => 2,
+		'--file' => "$backupdir/dump3",
+		$node->connstr($dbname1),
+	],
+	'parallel dump with chunks of two heap pages');
+
+$node->command_ok(
+	[
+		'pg_restore', '--verbose',
+		'--dbname' => $node->connstr($dbname4),
+		'--jobs' => 3,
+		"$backupdir/dump3",
+	],
+	'parallel restore with chunks of two heap pages');
+
+my $table = 'tplain';
+my $tablehash_query = "SELECT '$table', sum(hashtext(t::text)), count(*) FROM $table AS t";
+
+my $result_1 = $node->safe_psql($dbname1, $tablehash_query);
+my $result_4 = $node->safe_psql($dbname4, $tablehash_query);
+
+is($result_4, $result_1, "Hash check for $table: restored db ($result_4) vs original db ($result_1)");
+
 done_testing();
diff --git a/src/fe_utils/option_utils.c b/src/fe_utils/option_utils.c
index 8d0659c1164..a516d8c86a9 100644
--- a/src/fe_utils/option_utils.c
+++ b/src/fe_utils/option_utils.c
@@ -83,6 +83,61 @@ option_parse_int(const char *optarg, const char *optname,
 	return true;
 }
 
+/*
+ * option_parse_uint32
+ *
+ * Parse an unsigned 32-bit integer value for an option.  If the parsing is
+ * successful, returns true and stores the result in *result if that's given;
+ * if parsing fails, returns false.
+ */
+bool
+option_parse_uint32(const char *optarg, const char *optname,
+				 uint32 min_range, uint32 max_range,
+				 uint32 *result)
+{
+	char	   *endptr;
+	unsigned long val;
+
+	/* Fail if there is a minus sign at the start of the value */
+	while (isspace((unsigned char) *optarg))
+		optarg++;
+	if (*optarg == '-')
+	{
+		pg_log_error("value \"%s\" for option %s cannot be negative",
+					 optarg, optname);
+		return false;
+	}
+
+	errno = 0;
+	val = strtoul(optarg, &endptr, 10);
+
+	/*
+	 * Skip any trailing whitespace; if anything but whitespace remains before
+	 * the terminating character, fail.
+	 */
+	while (*endptr != '\0' && isspace((unsigned char) *endptr))
+		endptr++;
+
+	if (*endptr != '\0')
+	{
+		pg_log_error("invalid value \"%s\" for option %s",
+					 optarg, optname);
+		return false;
+	}
+
+	/* as min_range and max_range are uint32, the range check also catches
+	 * an unsigned long val outside the 32-bit range */
+	if (errno == ERANGE || val < min_range || val > max_range)
+	{
+		pg_log_error("%s must be in range %u..%u", optname, min_range, max_range);
+		return false;
+	}
+
+	if (result)
+		*result = (uint32) val;
+	return true;
+}
+
 /*
  * Provide strictly harmonized handling of the --sync-method option.
  */
diff --git a/src/include/fe_utils/option_utils.h b/src/include/fe_utils/option_utils.h
index d975db77af2..67fd3650d7a 100644
--- a/src/include/fe_utils/option_utils.h
+++ b/src/include/fe_utils/option_utils.h
@@ -22,6 +22,9 @@ extern void handle_help_version_opts(int argc, char *argv[],
 extern bool option_parse_int(const char *optarg, const char *optname,
 							 int min_range, int max_range,
 							 int *result);
+extern bool option_parse_uint32(const char *optarg, const char *optname,
+							 uint32 min_range, uint32 max_range,
+							 uint32 *result);
 extern bool parse_sync_method(const char *optarg,
 							  DataDirSyncMethod *sync_method);
 extern void check_mut_excl_opts_internal(int n,...);
-- 
2.43.0
