Hi,
On Tue, Jan 7, 2025 at 5:09 AM Jelte Fennema-Nio <postg...@jeltef.nl> wrote:
> The core functionality works well in my opinion. I think it makes sense
> to spend the effort to move this from PoC quality to something
> committable. Below some of the things that are necessary to do that
> after an initial pass over the code (and trying it out):

Attached a patch.
On Tue, Jan 7, 2025 at 10:42 PM Atsushi Torikoshi <torikoshia.t...@gmail.com> wrote:
>> 9. I think this division by 2 could use some explanation in a comment.
>> I understand that you're doing this because linux divides its original
>> bytes using 512 bytes[2] and your additional factor of 2 gets that to
>> 1024 bytes. But that's not clear immediately from the code.
>> I'm also not convinced that 512 is the blocksize, or if this logic is
>> even correct on every platform. I'm wondering if maybe we should
>> simply show the blocks after all.
>
> Maybe so. I'll look into this and then decide the unit.
I looked up the manuals for the following operating systems, as
documented in [1], and it seems that all of them except Windows support
getrusage(2) and return ru_inblock/ru_oublock:
Linux, Windows, FreeBSD, OpenBSD, NetBSD, DragonFlyBSD, macOS, AIX,
Solaris, and illumos.
However, I'm unsure whether the unit of these values is consistently 512
bytes across all operating systems.
Additionally, I'm concerned that the timing at which these metrics are
incremented might vary between OSs.
For example, on Linux, ru_oublock seems to be incremented when a page is
dirtied, and it is calculated by shifting write_bytes [2] right by 9
bits, i.e., dividing it by 512.
I'm not sure whether other operating systems behave the same way.
That said, they all represent the number of storage I/O operations
performed or to be performed.
Therefore, I believe it would be reasonable to report the raw values
as-is, as they should still be useful for understanding storage I/O
activity.
Example output:
=# explain analyze select max(a), max(b) from t_big_ul;
(..snip..)
Planning:
Buffers: shared hit=31 read=54
Storage I/O: read=2744 times write=0 times
Planning Time: 30.685 ms
Execution:
Storage I/O: read=2563600 times write=0 times
Execution Time: 1685.272 ms
[1] https://www.postgresql.org/docs/devel/supported-platforms.html
[2] https://www.kernel.org/doc/Documentation/filesystems/proc.txt
--
Regards,
Atsushi Torikoshi
Seconded from NTT DATA GROUP CORPORATION to SRA OSS K.K.
From d3c9efcf1d44de4aafb5d09051db7a6f7fc4c5dc Mon Sep 17 00:00:00 2001
From: Atsushi Torikoshi <torikos...@oss.nttdata.com>
Date: Mon, 27 Jan 2025 16:37:06 +0900
Subject: [PATCH v1] Add storage I/O tracking to 'BUFFERS' option
The 'BUFFERS' option currently indicates whether a block hit the shared
buffers, but does not distinguish between a hit in the OS cache and an
actual storage I/O operation.
While shared buffers and the OS cache offer similar performance, storage
I/O is significantly slower. By measuring the number of storage reads
and writes, we can better identify whether storage I/O is a performance
bottleneck.
Add tracking of storage I/O usage by calling getrusage(2) at the start
and end points of both the planning and execution phases.
A more granular approach (tracking at each plan node, as the existing
BUFFERS counters do) was considered but found to be impractical due to
the high performance cost of frequent getrusage() calls.
TODO:
I believe this information is mainly useful when used in auto_explain.
I'll implement that later.
---
doc/src/sgml/ref/explain.sgml | 25 ++++--
src/backend/access/brin/brin.c | 8 +-
src/backend/access/nbtree/nbtsort.c | 8 +-
src/backend/commands/explain.c | 117 +++++++++++++++++++++++++-
src/backend/commands/prepare.c | 8 ++
src/backend/commands/vacuumparallel.c | 8 +-
src/backend/executor/execParallel.c | 35 ++++++--
src/backend/executor/instrument.c | 61 +++++++++++++-
src/include/commands/explain.h | 1 +
src/include/executor/execParallel.h | 2 +
src/include/executor/instrument.h | 19 ++++-
src/include/port/win32/sys/resource.h | 2 +
src/port/win32getrusage.c | 4 +
src/test/regress/expected/explain.out | 37 +++++++-
src/tools/pgindent/typedefs.list | 1 +
15 files changed, 298 insertions(+), 38 deletions(-)
diff --git a/doc/src/sgml/ref/explain.sgml b/doc/src/sgml/ref/explain.sgml
index 6361a14e6..7b45a34fc 100644
--- a/doc/src/sgml/ref/explain.sgml
+++ b/doc/src/sgml/ref/explain.sgml
@@ -198,13 +198,24 @@ ROLLBACK;
previously unmodified blocks that were changed by this query; while the
number of blocks <emphasis>written</emphasis> indicates the number of
previously-dirtied blocks evicted from cache by this backend during
- query processing.
- The number of blocks shown for an
- upper-level node includes those used by all its child nodes. In text
- format, only non-zero values are printed. Buffers information is
- included by default when <literal>ANALYZE</literal> is used but
- otherwise is not included by default, but can be enabled using this
- option.
+ query processing. In text format, only non-zero values are printed.
+ If possible, this option also displays the number of read and write
+ operations performed on storage during the planning and execution phases,
+ shown at the end of the plan. These values are obtained from the
+ <function>getrusage()</function> system call. Note that on platforms that
+ do not support <function>getrusage()</function>, such as Windows, no output
+ will be shown, even if reads or writes actually occur. Additionally, even
+ on platforms where <function>getrusage()</function> is supported, if the
+ kernel is built without the necessary options to track storage read and
+ write operations, no output will be shown.
+ The timing and unit of measurement for read and write operations may vary
+ depending on the platform. For example, on Linux, a read is counted only
+ if this process caused data to be fetched from the storage layer, and a
+ write is counted when a page is dirtied. On Linux, the unit of
+ measurement for read and write operations is 512 bytes.
+ Buffers information is included by default when <literal>ANALYZE</literal>
+ is used; otherwise it is not included by default, but can be enabled using
+ this option.
</para>
</listitem>
</varlistentry>
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 4289142e2..22a7ce559 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -2549,7 +2549,7 @@ _brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state)
* or we might get incomplete data.)
*/
for (i = 0; i < brinleader->pcxt->nworkers_launched; i++)
- InstrAccumParallelQuery(&brinleader->bufferusage[i], &brinleader->walusage[i]);
+ InstrAccumParallelQuery(&brinleader->bufferusage[i], NULL, &brinleader->walusage[i]);
/* Free last reference to MVCC snapshot, if one was used */
if (IsMVCCSnapshot(brinleader->snapshot))
@@ -2911,7 +2911,7 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
tuplesort_attach_shared(sharedsort, seg);
/* Prepare to track buffer usage during parallel execution */
- InstrStartParallelQuery();
+ InstrStartParallelQuery(NULL);
/*
* Might as well use reliable figure when doling out maintenance_work_mem
@@ -2926,8 +2926,8 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
/* Report WAL/buffer usage during parallel execution */
bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
- &walusage[ParallelWorkerNumber]);
+ InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber], NULL,
+ &walusage[ParallelWorkerNumber], NULL);
index_close(indexRel, indexLockmode);
table_close(heapRel, heapLockmode);
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 7aba852db..98cfde887 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1619,7 +1619,7 @@ _bt_end_parallel(BTLeader *btleader)
* or we might get incomplete data.)
*/
for (i = 0; i < btleader->pcxt->nworkers_launched; i++)
- InstrAccumParallelQuery(&btleader->bufferusage[i], &btleader->walusage[i]);
+ InstrAccumParallelQuery(&btleader->bufferusage[i], NULL, &btleader->walusage[i]);
/* Free last reference to MVCC snapshot, if one was used */
if (IsMVCCSnapshot(btleader->snapshot))
@@ -1827,7 +1827,7 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
}
/* Prepare to track buffer usage during parallel execution */
- InstrStartParallelQuery();
+ InstrStartParallelQuery(NULL);
/* Perform sorting of spool, and possibly a spool2 */
sortmem = maintenance_work_mem / btshared->scantuplesortstates;
@@ -1837,8 +1837,8 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
/* Report WAL/buffer usage during parallel execution */
bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
- &walusage[ParallelWorkerNumber]);
+ InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber], NULL,
+ &walusage[ParallelWorkerNumber], NULL);
#ifdef BTREE_BUILD_STATS
if (log_btree_build_stats)
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index c24e66f82..de77c592b 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -145,6 +145,8 @@ static void show_foreignscan_info(ForeignScanState *fsstate, ExplainState *es);
static const char *explain_get_index_name(Oid indexId);
static bool peek_buffer_usage(ExplainState *es, const BufferUsage *usage);
static void show_buffer_usage(ExplainState *es, const BufferUsage *usage);
+static bool peek_storageio_usage(ExplainState *es, const StorageIOUsage *usage);
+static void show_storageio_usage(ExplainState *es, const StorageIOUsage *usage);
static void show_wal_usage(ExplainState *es, const WalUsage *usage);
static void show_memory_counters(ExplainState *es,
const MemoryContextCounters *mem_counters);
@@ -475,6 +477,8 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
planduration;
BufferUsage bufusage_start,
bufusage;
+ StorageIOUsage storageio,
+ storageio_start;
MemoryContextCounters mem_counters;
MemoryContext planner_ctx = NULL;
MemoryContext saved_ctx = NULL;
@@ -496,7 +500,10 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
}
if (es->buffers)
+ {
bufusage_start = pgBufferUsage;
+ GetStorageIOUsage(&storageio_start);
+ }
INSTR_TIME_SET_CURRENT(planstart);
/* plan the query */
@@ -518,9 +525,17 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
}
+ if (es->buffers)
+ {
+ GetStorageIOUsage(&storageio);
+ storageio.inblock -= storageio_start.inblock;
+ storageio.outblock -= storageio_start.outblock;
+ }
+
/* run it (if needed) and produce output */
ExplainOnePlan(plan, into, es, queryString, params, queryEnv,
&planduration, (es->buffers ? &bufusage : NULL),
+ (es->buffers ? &storageio : NULL),
es->memory ? &mem_counters : NULL);
}
@@ -644,7 +659,7 @@ void
ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
const char *queryString, ParamListInfo params,
QueryEnvironment *queryEnv, const instr_time *planduration,
- const BufferUsage *bufusage,
+ const BufferUsage *bufusage, const StorageIOUsage *planstorageio,
const MemoryContextCounters *mem_counters)
{
DestReceiver *dest;
@@ -654,6 +669,7 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
int eflags;
int instrument_option = 0;
SerializeMetrics serializeMetrics = {0};
+ StorageIOUsage storageio_start;
Assert(plannedstmt->commandType != CMD_UTILITY);
@@ -663,7 +679,19 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
instrument_option |= INSTRUMENT_ROWS;
if (es->buffers)
+ {
+ GetStorageIOUsage(&storageio_start);
+
+ /*
+ * Initialize the global counters for parallel query workers. Even
+ * if the query is cancelled partway through, EXPLAIN always passes
+ * through here, so it is safe to reset the counters here.
+ */
+ pgStorageIOUsageParallel.inblock = 0;
+ pgStorageIOUsageParallel.outblock = 0;
+
instrument_option |= INSTRUMENT_BUFFERS;
+ }
if (es->wal)
instrument_option |= INSTRUMENT_WAL;
@@ -747,8 +775,9 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
/* Create textual dump of plan tree */
ExplainPrintPlan(es, queryDesc);
- /* Show buffer and/or memory usage in planning */
- if (peek_buffer_usage(es, bufusage) || mem_counters)
+ /* Show buffer, storage I/O, and/or memory usage in planning */
+ if (peek_buffer_usage(es, bufusage) || peek_storageio_usage(es, planstorageio) ||
+ mem_counters)
{
ExplainOpenGroup("Planning", "Planning", true, es);
@@ -760,8 +789,10 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
}
if (bufusage)
+ {
show_buffer_usage(es, bufusage);
-
+ show_storageio_usage(es, planstorageio);
+ }
if (mem_counters)
show_memory_counters(es, mem_counters);
@@ -813,6 +844,35 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
totaltime += elapsed_time(&starttime);
+ /* Show storage I/O usage in execution */
+ if (es->buffers)
+ {
+ StorageIOUsage storageio = {0};
+ StorageIOUsage storageio_end;
+
+ GetStorageIOUsage(&storageio_end);
+ StorageIOUsageAccumDiff(&storageio, &storageio_end, &storageio_start);
+ StorageIOUsageAdd(&storageio, &pgStorageIOUsageParallel);
+
+ if (peek_storageio_usage(es, &storageio))
+ {
+ ExplainOpenGroup("Execution", "Execution", true, es);
+
+ if (es->format == EXPLAIN_FORMAT_TEXT)
+ {
+ ExplainIndentText(es);
+ appendStringInfoString(es->str, "Execution:\n");
+ es->indent++;
+ }
+ show_storageio_usage(es, &storageio);
+
+ if (es->format == EXPLAIN_FORMAT_TEXT)
+ es->indent--;
+
+ ExplainCloseGroup("Execution", "Execution", true, es);
+ }
+ }
+
/*
* We only report execution time if we actually ran the query (that is,
* the user specified ANALYZE), and if summary reporting is enabled (the
@@ -4232,6 +4292,55 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage)
}
}
+/*
+ * Return whether show_storageio_usage would have anything to print, if given
+ * the same 'usage' data. Note that when the format is anything other than
+ * text, we print even if the counters are all zeroes.
+ */
+static bool
+peek_storageio_usage(ExplainState *es, const StorageIOUsage *usage)
+{
+ if (usage == NULL)
+ return false;
+
+ if (es->format != EXPLAIN_FORMAT_TEXT)
+ return true;
+
+ if (usage->inblock <= 0 && usage->outblock <= 0)
+ return false;
+
+ return true;
+}
+
+/*
+ * Show storage I/O usage.
+ */
+static void
+show_storageio_usage(ExplainState *es, const StorageIOUsage *usage)
+{
+ if (es->format == EXPLAIN_FORMAT_TEXT)
+ {
+ /* Show only positive counter values. */
+ if (usage->inblock <= 0 && usage->outblock <= 0)
+ return;
+
+ ExplainIndentText(es);
+ appendStringInfoString(es->str, "Storage I/O:");
+ appendStringInfo(es->str, " read=%ld times", (long) usage->inblock);
+ appendStringInfo(es->str, " write=%ld times", (long) usage->outblock);
+
+ appendStringInfoChar(es->str, '\n');
+ }
+ else
+ {
+ ExplainPropertyInteger("Storage I/O Read", NULL,
+ usage->inblock, es);
+ ExplainPropertyInteger("Storage I/O Write", NULL,
+ usage->outblock, es);
+ }
+}
+
/*
* Show WAL usage details.
*/
diff --git a/src/backend/commands/prepare.c b/src/backend/commands/prepare.c
index 8989c0c88..e11601b21 100644
--- a/src/backend/commands/prepare.c
+++ b/src/backend/commands/prepare.c
@@ -579,6 +579,8 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
instr_time planduration;
BufferUsage bufusage_start,
bufusage;
+ StorageIOUsage storageio,
+ storageio_start;
MemoryContextCounters mem_counters;
MemoryContext planner_ctx = NULL;
MemoryContext saved_ctx = NULL;
@@ -594,7 +596,11 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
}
if (es->buffers)
+ {
bufusage_start = pgBufferUsage;
+ GetStorageIOUsage(&storageio_start);
+ }
+
INSTR_TIME_SET_CURRENT(planstart);
/* Look it up in the hash table */
@@ -644,6 +650,7 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
{
memset(&bufusage, 0, sizeof(BufferUsage));
BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
+ GetStorageIOUsage(&storageio);
}
plan_list = cplan->stmt_list;
@@ -656,6 +663,7 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
if (pstmt->commandType != CMD_UTILITY)
ExplainOnePlan(pstmt, into, es, query_string, paramLI, pstate->p_queryEnv,
&planduration, (es->buffers ? &bufusage : NULL),
+ (es->buffers ? &storageio : NULL),
es->memory ? &mem_counters : NULL);
else
ExplainOneUtility(pstmt->utilityStmt, into, es, pstate, paramLI);
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index 0d92e694d..63d4b1398 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -737,7 +737,7 @@ parallel_vacuum_process_all_indexes(ParallelVacuumState *pvs, int num_index_scan
WaitForParallelWorkersToFinish(pvs->pcxt);
for (int i = 0; i < pvs->pcxt->nworkers_launched; i++)
- InstrAccumParallelQuery(&pvs->buffer_usage[i], &pvs->wal_usage[i]);
+ InstrAccumParallelQuery(&pvs->buffer_usage[i], NULL, &pvs->wal_usage[i]);
}
/*
@@ -1076,7 +1076,7 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
error_context_stack = &errcallback;
/* Prepare to track buffer usage during parallel execution */
- InstrStartParallelQuery();
+ InstrStartParallelQuery(NULL);
/* Process indexes to perform vacuum/cleanup */
parallel_vacuum_process_safe_indexes(&pvs);
@@ -1084,8 +1084,8 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
/* Report buffer/WAL usage during parallel execution */
buffer_usage = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_BUFFER_USAGE, false);
wal_usage = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
- &wal_usage[ParallelWorkerNumber]);
+ InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber], NULL,
+ &wal_usage[ParallelWorkerNumber], NULL);
TidStoreDetach(dead_items);
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index ff4d9dd1b..a7d5f0def 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -64,6 +64,7 @@
#define PARALLEL_KEY_QUERY_TEXT UINT64CONST(0xE000000000000008)
#define PARALLEL_KEY_JIT_INSTRUMENTATION UINT64CONST(0xE000000000000009)
#define PARALLEL_KEY_WAL_USAGE UINT64CONST(0xE00000000000000A)
+#define PARALLEL_KEY_STORAGEIO_USAGE UINT64CONST(0xE00000000000000B)
#define PARALLEL_TUPLE_QUEUE_SIZE 65536
@@ -597,6 +598,7 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
char *pstmt_space;
char *paramlistinfo_space;
BufferUsage *bufusage_space;
+ StorageIOUsage *storageiousage_space;
WalUsage *walusage_space;
SharedExecutorInstrumentation *instrumentation = NULL;
SharedJitInstrumentation *jit_instrumentation = NULL;
@@ -678,6 +680,13 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
mul_size(sizeof(WalUsage), pcxt->nworkers));
shm_toc_estimate_keys(&pcxt->estimator, 1);
+ /*
+ * Same thing for StorageIOUsage.
+ */
+ shm_toc_estimate_chunk(&pcxt->estimator,
+ mul_size(sizeof(StorageIOUsage), pcxt->nworkers));
+ shm_toc_estimate_keys(&pcxt->estimator, 1);
+
/* Estimate space for tuple queues. */
shm_toc_estimate_chunk(&pcxt->estimator,
mul_size(PARALLEL_TUPLE_QUEUE_SIZE, pcxt->nworkers));
@@ -773,6 +782,12 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
shm_toc_insert(pcxt->toc, PARALLEL_KEY_WAL_USAGE, walusage_space);
pei->wal_usage = walusage_space;
+ /* Same for StorageIOUsage. */
+ storageiousage_space = shm_toc_allocate(pcxt->toc,
+ mul_size(sizeof(StorageIOUsage), pcxt->nworkers));
+ shm_toc_insert(pcxt->toc, PARALLEL_KEY_STORAGEIO_USAGE, storageiousage_space);
+ pei->storageio_usage = storageiousage_space;
+
/* Set up the tuple queues that the workers will write into. */
pei->tqueue = ExecParallelSetupTupleQueues(pcxt, false);
@@ -1168,11 +1183,11 @@ ExecParallelFinish(ParallelExecutorInfo *pei)
WaitForParallelWorkersToFinish(pei->pcxt);
/*
- * Next, accumulate buffer/WAL usage. (This must wait for the workers to
- * finish, or we might get incomplete data.)
+ * Next, accumulate buffer, WAL, and Storage I/O usage. (This must wait
+ * for the workers to finish, or we might get incomplete data.)
*/
for (i = 0; i < nworkers; i++)
- InstrAccumParallelQuery(&pei->buffer_usage[i], &pei->wal_usage[i]);
+ InstrAccumParallelQuery(&pei->buffer_usage[i], &pei->storageio_usage[i], &pei->wal_usage[i]);
pei->finished = true;
}
@@ -1404,6 +1419,8 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
{
FixedParallelExecutorState *fpes;
BufferUsage *buffer_usage;
+ StorageIOUsage *storageio_usage;
+ StorageIOUsage storageio_usage_start = {0};
WalUsage *wal_usage;
DestReceiver *receiver;
QueryDesc *queryDesc;
@@ -1457,13 +1474,14 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
ExecSetTupleBound(fpes->tuples_needed, queryDesc->planstate);
/*
- * Prepare to track buffer/WAL usage during query execution.
+ * Prepare to track buffer, WAL, and storage I/O usage during query
+ * execution.
*
* We do this after starting up the executor to match what happens in the
* leader, which also doesn't count buffer accesses and WAL activity that
* occur during executor startup.
*/
- InstrStartParallelQuery();
+ InstrStartParallelQuery(&storageio_usage_start);
/*
* Run the plan. If we specified a tuple bound, be careful not to demand
@@ -1476,11 +1494,14 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
/* Shut down the executor */
ExecutorFinish(queryDesc);
- /* Report buffer/WAL usage during parallel execution. */
+ /* Report buffer, WAL, and storage I/O usage during parallel execution. */
buffer_usage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
+ storageio_usage = shm_toc_lookup(toc, PARALLEL_KEY_STORAGEIO_USAGE, false);
wal_usage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
- &wal_usage[ParallelWorkerNumber]);
+ &storageio_usage[ParallelWorkerNumber],
+ &wal_usage[ParallelWorkerNumber],
+ &storageio_usage_start);
/* Report instrumentation data if any instrumentation options are set. */
if (instrumentation != NULL)
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 2d3569b37..7664e01ce 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -13,16 +13,21 @@
*/
#include "postgres.h"
+#include <sys/resource.h>
#include <unistd.h>
#include "executor/instrument.h"
BufferUsage pgBufferUsage;
static BufferUsage save_pgBufferUsage;
+
+/* Only count parallel workers' usage */
+StorageIOUsage pgStorageIOUsageParallel;
WalUsage pgWalUsage;
static WalUsage save_pgWalUsage;
static void BufferUsageAdd(BufferUsage *dst, const BufferUsage *add);
+void StorageIOUsageAdd(StorageIOUsage *dst, const StorageIOUsage *add);
static void WalUsageAdd(WalUsage *dst, WalUsage *add);
@@ -197,27 +202,46 @@ InstrAggNode(Instrumentation *dst, Instrumentation *add)
/* note current values during parallel executor startup */
void
-InstrStartParallelQuery(void)
+InstrStartParallelQuery(StorageIOUsage *storageiousage)
{
save_pgBufferUsage = pgBufferUsage;
save_pgWalUsage = pgWalUsage;
+
+ if (storageiousage != NULL)
+ GetStorageIOUsage(storageiousage);
}
/* report usage after parallel executor shutdown */
void
-InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
+InstrEndParallelQuery(BufferUsage *bufusage, StorageIOUsage *storageiousage, WalUsage *walusage, StorageIOUsage *storageiousage_start)
{
memset(bufusage, 0, sizeof(BufferUsage));
BufferUsageAccumDiff(bufusage, &pgBufferUsage, &save_pgBufferUsage);
+
+ if (storageiousage != NULL && storageiousage_start != NULL)
+ {
+ struct StorageIOUsage storageiousage_end;
+
+ GetStorageIOUsage(&storageiousage_end);
+
+ memset(storageiousage, 0, sizeof(StorageIOUsage));
+ StorageIOUsageAccumDiff(storageiousage, &storageiousage_end, storageiousage_start);
+
+ ereport(DEBUG1,
+ (errmsg("parallel worker's storage I/O: inblock: %ld, outblock: %ld",
+ storageiousage->inblock, storageiousage->outblock)));
+ }
memset(walusage, 0, sizeof(WalUsage));
WalUsageAccumDiff(walusage, &pgWalUsage, &save_pgWalUsage);
}
/* accumulate work done by workers in leader's stats */
void
-InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
+InstrAccumParallelQuery(BufferUsage *bufusage, StorageIOUsage *storageiousage, WalUsage *walusage)
{
BufferUsageAdd(&pgBufferUsage, bufusage);
+ if (storageiousage != NULL)
+ StorageIOUsageAdd(&pgStorageIOUsageParallel, storageiousage);
WalUsageAdd(&pgWalUsage, walusage);
}
@@ -273,6 +297,37 @@ BufferUsageAccumDiff(BufferUsage *dst,
add->temp_blk_write_time, sub->temp_blk_write_time);
}
+/* helper functions for StorageIOUsage usage accumulation */
+void
+StorageIOUsageAdd(StorageIOUsage *dst, const StorageIOUsage *add)
+{
+ dst->inblock += add->inblock;
+ dst->outblock += add->outblock;
+}
+
+void
+StorageIOUsageAccumDiff(StorageIOUsage *dst, const StorageIOUsage *add, const StorageIOUsage *sub)
+{
+ dst->inblock += add->inblock - sub->inblock;
+ dst->outblock += add->outblock - sub->outblock;
+}
+
+/* Captures the current storage I/O usage statistics */
+void
+GetStorageIOUsage(StorageIOUsage *usage)
+{
+ struct rusage rusage;
+
+ if (getrusage(RUSAGE_SELF, &rusage))
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_SYSTEM_ERROR),
+ errmsg("getrusage() failed: %m")));
+ }
+ usage->inblock = rusage.ru_inblock;
+ usage->outblock = rusage.ru_oublock;
+}
+
/* helper functions for WAL usage accumulation */
static void
WalUsageAdd(WalUsage *dst, WalUsage *add)
diff --git a/src/include/commands/explain.h b/src/include/commands/explain.h
index ea7419951..8e67dd57b 100644
--- a/src/include/commands/explain.h
+++ b/src/include/commands/explain.h
@@ -108,6 +108,7 @@ extern void ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into,
ParamListInfo params, QueryEnvironment *queryEnv,
const instr_time *planduration,
const BufferUsage *bufusage,
+ const StorageIOUsage *planstorageio,
const MemoryContextCounters *mem_counters);
extern void ExplainPrintPlan(ExplainState *es, QueryDesc *queryDesc);
diff --git a/src/include/executor/execParallel.h b/src/include/executor/execParallel.h
index 5e7106c39..5c8bc76c5 100644
--- a/src/include/executor/execParallel.h
+++ b/src/include/executor/execParallel.h
@@ -26,6 +26,8 @@ typedef struct ParallelExecutorInfo
PlanState *planstate; /* plan subtree we're running in parallel */
ParallelContext *pcxt; /* parallel context we're using */
BufferUsage *buffer_usage; /* points to bufusage area in DSM */
+ StorageIOUsage *storageio_usage; /* points to storageio usage area in
+ * DSM */
WalUsage *wal_usage; /* walusage area in DSM */
SharedExecutorInstrumentation *instrumentation; /* optional */
struct SharedJitInstrumentation *jit_instrumentation; /* optional */
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index 5a6eff75c..7fc7281b1 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -41,6 +41,14 @@ typedef struct BufferUsage
instr_time temp_blk_write_time; /* time spent writing temp blocks */
} BufferUsage;
+typedef struct StorageIOUsage
+{
+ long inblock; /* # of times the file system had to perform
+ * input */
+ long outblock; /* # of times the file system had to perform
+ * output */
+} StorageIOUsage;
+
/*
* WalUsage tracks only WAL activity like WAL records generation that
* can be measured per query and is displayed by EXPLAIN command,
@@ -99,6 +107,7 @@ typedef struct WorkerInstrumentation
} WorkerInstrumentation;
extern PGDLLIMPORT BufferUsage pgBufferUsage;
+extern PGDLLIMPORT StorageIOUsage pgStorageIOUsageParallel;
extern PGDLLIMPORT WalUsage pgWalUsage;
extern Instrumentation *InstrAlloc(int n, int instrument_options,
@@ -109,11 +118,15 @@ extern void InstrStopNode(Instrumentation *instr, double nTuples);
extern void InstrUpdateTupleCount(Instrumentation *instr, double nTuples);
extern void InstrEndLoop(Instrumentation *instr);
extern void InstrAggNode(Instrumentation *dst, Instrumentation *add);
-extern void InstrStartParallelQuery(void);
-extern void InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
-extern void InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
+extern void InstrStartParallelQuery(StorageIOUsage *storageiousage);
+extern void InstrEndParallelQuery(BufferUsage *bufusage, StorageIOUsage *storageiousage, WalUsage *walusage, StorageIOUsage *storageiousage_start);
+extern void InstrAccumParallelQuery(BufferUsage *bufusage, StorageIOUsage *storageiousage, WalUsage *walusage);
extern void BufferUsageAccumDiff(BufferUsage *dst,
const BufferUsage *add, const BufferUsage *sub);
+extern void StorageIOUsageAccumDiff(StorageIOUsage *dst,
+ const StorageIOUsage *add, const StorageIOUsage *sub);
+extern void StorageIOUsageAdd(StorageIOUsage *dst, const StorageIOUsage *add);
+extern void GetStorageIOUsage(StorageIOUsage *usage);
extern void WalUsageAccumDiff(WalUsage *dst, const WalUsage *add,
const WalUsage *sub);
diff --git a/src/include/port/win32/sys/resource.h b/src/include/port/win32/sys/resource.h
index a14feeb58..270dc37c8 100644
--- a/src/include/port/win32/sys/resource.h
+++ b/src/include/port/win32/sys/resource.h
@@ -13,6 +13,8 @@ struct rusage
{
struct timeval ru_utime; /* user time used */
struct timeval ru_stime; /* system time used */
+ long ru_inblock; /* Currently always 0 for Windows */
+ long ru_oublock; /* Currently always 0 for Windows */
};
extern int getrusage(int who, struct rusage *rusage);
diff --git a/src/port/win32getrusage.c b/src/port/win32getrusage.c
index 6a197c943..27f0ea052 100644
--- a/src/port/win32getrusage.c
+++ b/src/port/win32getrusage.c
@@ -57,5 +57,9 @@ getrusage(int who, struct rusage *rusage)
rusage->ru_utime.tv_sec = li.QuadPart / 1000000L;
rusage->ru_utime.tv_usec = li.QuadPart % 1000000L;
+ /* Currently always 0 for Windows */
+ rusage->ru_inblock = 0;
+ rusage->ru_oublock = 0;
+
return 0;
}
diff --git a/src/test/regress/expected/explain.out b/src/test/regress/expected/explain.out
index ee31e41d5..f569d8f6f 100644
--- a/src/test/regress/expected/explain.out
+++ b/src/test/regress/expected/explain.out
@@ -127,10 +127,16 @@ select explain_filter('explain (analyze, buffers, format xml) select * from int8
<Local-Written-Blocks>N</Local-Written-Blocks> +
<Temp-Read-Blocks>N</Temp-Read-Blocks> +
<Temp-Written-Blocks>N</Temp-Written-Blocks> +
+ <Storage-I-O-Read>N</Storage-I-O-Read> +
+ <Storage-I-O-Write>N</Storage-I-O-Write> +
</Planning> +
<Planning-Time>N.N</Planning-Time> +
<Triggers> +
</Triggers> +
+ <Execution> +
+ <Storage-I-O-Read>N</Storage-I-O-Read> +
+ <Storage-I-O-Write>N</Storage-I-O-Write> +
+ </Execution> +
<Execution-Time>N.N</Execution-Time> +
</Query> +
</explain>
@@ -175,6 +181,8 @@ select explain_filter('explain (analyze, serialize, buffers, format yaml) select
Local Written Blocks: N +
Temp Read Blocks: N +
Temp Written Blocks: N +
+ Storage I/O Read: N +
+ Storage I/O Write: N +
Planning Time: N.N +
Triggers: +
Serialization: +
@@ -191,6 +199,9 @@ select explain_filter('explain (analyze, serialize, buffers, format yaml) select
Local Written Blocks: N +
Temp Read Blocks: N +
Temp Written Blocks: N +
+ Execution: +
+ Storage I/O Read: N +
+ Storage I/O Write: N +
Execution Time: N.N
(1 row)
@@ -237,7 +248,13 @@ select explain_filter('explain (buffers, format json) select * from int8_tbl i8'
"Local Dirtied Blocks": N, +
"Local Written Blocks": N, +
"Temp Read Blocks": N, +
- "Temp Written Blocks": N +
+ "Temp Written Blocks": N, +
+ "Storage I/O Read": N, +
+ "Storage I/O Read": N +
+ }, +
+ "Execution": { +
+ "Storage I/O Read": N, +
+ "Storage I/O Read": N +
} +
} +
]
@@ -299,11 +316,17 @@ select explain_filter('explain (analyze, buffers, format json) select * from int
"Local I/O Read Time": N.N, +
"Local I/O Write Time": N.N, +
"Temp I/O Read Time": N.N, +
- "Temp I/O Write Time": N.N +
+ "Temp I/O Write Time": N.N, +
+ "Storage I/O Read": N, +
+ "Storage I/O Read": N +
}, +
"Planning Time": N.N, +
"Triggers": [ +
], +
+ "Execution": { +
+ "Storage I/O Read": N, +
+ "Storage I/O Read": N +
+ }, +
"Execution Time": N.N +
} +
]
@@ -423,12 +446,18 @@ select explain_filter('explain (memory, analyze, format json) select * from int8
"Local Written Blocks": N, +
"Temp Read Blocks": N, +
"Temp Written Blocks": N, +
+ "Storage I/O Read": N, +
+ "Storage I/O Read": N, +
"Memory Used": N, +
"Memory Allocated": N +
}, +
"Planning Time": N.N, +
"Triggers": [ +
], +
+ "Execution": { +
+ "Storage I/O Read": N, +
+ "Storage I/O Read": N +
+ }, +
"Execution Time": N.N +
} +
]
@@ -641,6 +670,7 @@ select jsonb_pretty(
}, +
"Planning": { +
"Local Hit Blocks": 0, +
+ "Storage I/O Read": 0, +
"Temp Read Blocks": 0, +
"Local Read Blocks": 0, +
"Shared Hit Blocks": 0, +
@@ -653,6 +683,9 @@ select jsonb_pretty(
}, +
"Triggers": [ +
], +
+ "Execution": { +
+ "Storage I/O Read": 0 +
+ }, +
"Planning Time": 0.0, +
"Execution Time": 0.0 +
} +
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index a2644a2e6..730de3255 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2558,6 +2558,7 @@ SSL
SSLExtensionInfoContext
SSL_CTX
STARTUPINFO
+StorageIOUsage
STRLEN
SV
SYNCHRONIZATION_BARRIER
base-commit: 14793f47192b5eb02197cd03afb88559d7514b76
--
2.34.1