Hi,

On Tue, Jun 11, 2024 at 08:26:23AM +0000, Bertrand Drouvot wrote:
> Hi,
>
> On Tue, Jun 11, 2024 at 04:07:05PM +0900, Masahiko Sawada wrote:
>
> > Thank you for the proposal and the patch. I understand the motivation
> > of this patch.
>
> Thanks for looking at it!
>
> > Beside the point Nathan mentioned, I'm slightly worried
> > that massive parallel messages could be sent to the leader process
> > when the cost_limit value is low.
>
> I see, I can/will do some testing in this area and share the numbers.
Here is the result of the test. It has been launched several times and
produced the same (surprising) result each time.

====================== Context ================================================

The testing has been done with this relation (large_tbl) and its indexes:

postgres=# SELECT n.nspname, c.relname, count(*) AS buffers
           FROM pg_buffercache b
           JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid)
                AND b.reldatabase IN (0, (SELECT oid FROM pg_database
                                          WHERE datname = current_database()))
           JOIN pg_namespace n ON n.oid = c.relnamespace
           GROUP BY n.nspname, c.relname
           ORDER BY 3 DESC
           LIMIT 22;

 nspname |      relname       | buffers
---------+--------------------+---------
 public  | large_tbl          |  222280
 public  | large_tbl_filler13 |  125000
 public  | large_tbl_filler6  |  125000
 public  | large_tbl_filler5  |  125000
 public  | large_tbl_filler3  |  125000
 public  | large_tbl_filler15 |  125000
 public  | large_tbl_filler4  |  125000
 public  | large_tbl_filler20 |  125000
 public  | large_tbl_filler18 |  125000
 public  | large_tbl_filler14 |  125000
 public  | large_tbl_filler8  |  125000
 public  | large_tbl_filler11 |  125000
 public  | large_tbl_filler19 |  125000
 public  | large_tbl_filler7  |  125000
 public  | large_tbl_filler1  |  125000
 public  | large_tbl_filler12 |  125000
 public  | large_tbl_filler9  |  125000
 public  | large_tbl_filler17 |  125000
 public  | large_tbl_filler16 |  125000
 public  | large_tbl_filler10 |  125000
 public  | large_tbl_filler2  |  125000
 public  | large_tbl_pkey     |    5486
(22 rows)

All of them completely fit in memory (to avoid I/O read latency during the
vacuum).

The config, outside of defaults, is:

max_wal_size = 4GB
shared_buffers = 30GB
vacuum_cost_delay = 1
autovacuum_vacuum_cost_delay = 1
max_parallel_maintenance_workers = 8
max_parallel_workers = 10
vacuum_cost_limit = 10
autovacuum_vacuum_cost_limit = 10

My system is not overloaded, has enough resources to run this test, and only
this test is running.
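For completeness, the table could have been built along these lines. The exact
DDL is not part of this email, so the column types, row count and insert below
are assumptions made only to illustrate the shape of the relation (one filler
column and one index per filler, plus the primary key):

-- illustrative only: types and row count are guesses, not the actual DDL
CREATE TABLE large_tbl (
    id      bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY, -- large_tbl_pkey
    filler1 text, filler2 text, /* ... */ filler20 text
);

-- populate with enough rows so that the relation and indexes reach the
-- sizes reported above
INSERT INTO large_tbl (filler1, filler2, filler20)
    SELECT md5(g::text), md5((g + 1)::text), md5((g + 2)::text)
    FROM generate_series(1, 10000000) AS g;

CREATE INDEX large_tbl_filler1 ON large_tbl (filler1);
CREATE INDEX large_tbl_filler2 ON large_tbl (filler2);
-- ... one index per filler column, up to large_tbl_filler20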
====================== Results ================================================

========== With v2 (attached) applied on master

postgres=# VACUUM (PARALLEL 8) large_tbl;
VACUUM
Time: 1146873.016 ms (19:06.873)

The duration is split that way:

Vacuum phase: cumulative time (cumulative time delayed)
=======================================================
scanning heap:     00:08:16.414628 (time_delayed: 444370)
vacuuming indexes: 00:14:55.314699 (time_delayed: 2545293)
vacuuming heap:    00:19:06.814617 (time_delayed: 2767540)

I sampled active sessions from pg_stat_activity (one second interval); here is
the summary during the vacuuming indexes phase (ordered by count):

 leader_pid |  pid   |   wait_event   | count
------------+--------+----------------+-------
     452996 | 453225 | VacuumDelay    |   366
     452996 | 453223 | VacuumDelay    |   363
     452996 | 453226 | VacuumDelay    |   362
     452996 | 453224 | VacuumDelay    |   361
     452996 | 453222 | VacuumDelay    |   359
     452996 | 453221 | VacuumDelay    |   359
            | 452996 | VacuumDelay    |   331
            | 452996 | CPU            |    30
     452996 | 453224 | WALWriteLock   |    23
     452996 | 453222 | WALWriteLock   |    20
     452996 | 453226 | WALWriteLock   |    20
     452996 | 453221 | WALWriteLock   |    19
            | 452996 | WalSync        |    18
     452996 | 453225 | WALWriteLock   |    18
     452996 | 453223 | WALWriteLock   |    16
            | 452996 | WALWriteLock   |    15
     452996 | 453221 | CPU            |    14
     452996 | 453222 | CPU            |    14
     452996 | 453223 | CPU            |    12
     452996 | 453224 | CPU            |    10
     452996 | 453226 | CPU            |    10
     452996 | 453225 | CPU            |     8
     452996 | 453223 | WalSync        |     4
     452996 | 453221 | WalSync        |     2
     452996 | 453226 | WalWrite       |     2
     452996 | 453221 | WalWrite       |     1
            | 452996 | ParallelFinish |     1
     452996 | 453224 | WalSync        |     1
     452996 | 453225 | WalSync        |     1
     452996 | 453222 | WalWrite       |     1
     452996 | 453225 | WalWrite       |     1
     452996 | 453222 | WalSync        |     1
     452996 | 453226 | WalSync        |     1

========== On master (v2 not applied)

postgres=# VACUUM (PARALLEL 8) large_tbl;
VACUUM
Time: 1322598.087 ms (22:02.598)

Surprisingly it has been longer on master, by about 3 minutes.
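As a side note on the method: the per-second sampling mentioned above (and
below for master) was done with something along these lines. This is a sketch
only; the "samples" table, the filtering and the mapping of a NULL wait_event
to "CPU" are how I'd illustrate it here, not the exact script that was used:

CREATE TABLE samples (ts timestamptz, leader_pid int, pid int, wait_event text);

-- run every second while the vacuum is in progress (e.g. with \watch 1 in psql)
INSERT INTO samples
    SELECT now(), leader_pid, pid, COALESCE(wait_event, 'CPU')
    FROM pg_stat_activity
    WHERE state = 'active'
      AND (query ILIKE 'vacuum%' OR backend_type = 'parallel worker');

-- then summarize over the time window of the phase of interest
SELECT leader_pid, pid, wait_event, count(*)
FROM samples
GROUP BY leader_pid, pid, wait_event
ORDER BY count(*) DESC;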
Let's see how the time is split:

Vacuum phase: cumulative time
=============================
scanning heap:     00:08:07.061196
vacuuming indexes: 00:17:50.961228
vacuuming heap:    00:22:02.561199

I sampled active sessions from pg_stat_activity (one second interval); here is
the summary during the vacuuming indexes phase (ordered by count):

 leader_pid |  pid   |    wait_event     | count
------------+--------+-------------------+-------
     468682 | 468858 | VacuumDelay       |   548
     468682 | 468862 | VacuumDelay       |   547
     468682 | 468859 | VacuumDelay       |   547
     468682 | 468860 | VacuumDelay       |   545
     468682 | 468857 | VacuumDelay       |   543
     468682 | 468861 | VacuumDelay       |   542
            | 468682 | VacuumDelay       |   378
            | 468682 | ParallelFinish    |   182
     468682 | 468861 | WALWriteLock      |    19
     468682 | 468857 | WALWriteLock      |    19
     468682 | 468859 | WALWriteLock      |    18
     468682 | 468858 | WALWriteLock      |    16
     468682 | 468860 | WALWriteLock      |    15
     468682 | 468862 | WALWriteLock      |    15
     468682 | 468862 | CPU               |    12
     468682 | 468857 | CPU               |    10
     468682 | 468859 | CPU               |    10
     468682 | 468861 | CPU               |    10
            | 468682 | CPU               |     9
     468682 | 468860 | CPU               |     9
     468682 | 468860 | WalSync           |     8
            | 468682 | WALWriteLock      |     7
     468682 | 468858 | WalSync           |     6
     468682 | 468858 | CPU               |     6
     468682 | 468862 | WalSync           |     3
     468682 | 468857 | WalSync           |     3
     468682 | 468861 | WalSync           |     3
     468682 | 468859 | WalSync           |     2
     468682 | 468861 | WalWrite          |     2
     468682 | 468857 | WalWrite          |     1
     468682 | 468858 | WalWrite          |     1
     468682 | 468861 | WALBufMappingLock |     1
     468682 | 468857 | WALBufMappingLock |     1
            | 468682 | WALBufMappingLock |     1

====================== Observations ===========================================

As compared to v2:

1. scanning heap time is about the same
2. vacuuming indexes time is about 3 minutes longer on master
3. vacuuming heap time is about the same

One difference we can see in the sampling is that on master the
"ParallelFinish" wait event has been sampled about 182 times for the leader
(so it could be _the_ 3 minutes of interest).

On master the vacuuming indexes phase ran between 2024-06-13 10:11:34 and
2024-06-13 10:21:15. If I extract the exact minutes and the counts for the
"ParallelFinish" wait event I get:

 minute |   wait_event   | count
--------+----------------+-------
     18 | ParallelFinish |    48
     19 | ParallelFinish |    60
     20 | ParallelFinish |    60
     21 | ParallelFinish |    14

So it's likely that the leader waited on ParallelFinish for about 3 minutes at
the end of the vacuuming indexes phase (as this wait appeared in consecutive
samples).

====================== Conclusion =============================================

1. During the scanning heap and vacuuming heap phases no noticeable
performance degradation has been observed with v2 applied (as compared to
master) (cc'ing Robert as it's also related to his question about a noticeable
hit when everything is in memory in [1]).

2. During the vacuuming indexes phase, v2 has been faster (as compared to
master). The reason is that on master the leader has been waiting for about
3 minutes on "ParallelFinish" at the end.

====================== Remarks ================================================

As v2 is attached, please find below a summary of the current state of this
thread:

1. v2 implements delay_time as the actual wait time (and not the intended wait
time as proposed in v1).

2. some measurements have been done to check the impact of this new
instrumentation (see this email and [2]): no noticeable performance
degradation has been observed (and surprisingly it's rather the opposite, as
mentioned above).
3. there is an ongoing discussion about exposing the number of waits [2].

4. there is an ongoing discussion about exposing the effective cost limit [3].

5. it could be interesting to have a closer look at why the leader waits for
about 3 minutes on "ParallelFinish" on master and not with v2 applied (but
that's probably out of scope for this thread).

[1]: https://www.postgresql.org/message-id/CA%2BTgmoZiC%3DzeCDYuMpB%2BGb2yK%3DrTQCGMu0VoxehocKyHxr9Erg%40mail.gmail.com
[2]: https://www.postgresql.org/message-id/ZmmOOPwMFIltkdsN%40ip-10-97-1-34.eu-west-3.compute.internal
[3]: https://www.postgresql.org/message-id/Zml9%2Bu37iS7DFkJL%40ip-10-97-1-34.eu-west-3.compute.internal

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
>From 21eeab61c125a7ca4afccd3bc5961a1f060f0b9a Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot <bertranddrouvot...@gmail.com>
Date: Thu, 6 Jun 2024 12:35:57 +0000
Subject: [PATCH v2] Report the total amount of time that vacuum has been
 delayed due to cost delay

This commit adds one column: time_delayed to the pg_stat_progress_vacuum
system view to show the total amount of time in milliseconds that vacuum
has been delayed.

This uses the new parallel message type for progress reporting added
by f1889729dd.

Bump catversion because this changes the definition of
pg_stat_progress_vacuum.
---
 doc/src/sgml/monitoring.sgml         | 11 +++++++++++
 src/backend/catalog/system_views.sql |  3 ++-
 src/backend/commands/vacuum.c        | 15 +++++++++++++++
 src/include/catalog/catversion.h     |  2 +-
 src/include/commands/progress.h      |  1 +
 src/test/regress/expected/rules.out  |  3 ++-
 6 files changed, 32 insertions(+), 3 deletions(-)
  37.5% doc/src/sgml/
  42.3% src/backend/commands/
   5.9% src/include/catalog/
   3.2% src/include/commands/
   7.9% src/test/regress/expected/

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 053da8d6e4..cdd0f0e533 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -6290,6 +6290,17 @@ FROM pg_stat_get_backend_idset() AS backendid;
        <literal>cleaning up indexes</literal>.
       </para></entry>
      </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>time_delayed</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Total amount of time spent in milliseconds waiting due to <varname>vacuum_cost_delay</varname>
+       or <varname>autovacuum_vacuum_cost_delay</varname>. In case of parallel
+       vacuum the reported time is across all the workers and the leader.
+      </para></entry>
+     </row>
     </tbody>
    </tgroup>
   </table>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 53047cab5f..1345e99dcb 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1221,7 +1221,8 @@ CREATE VIEW pg_stat_progress_vacuum AS
         S.param2 AS heap_blks_total, S.param3 AS heap_blks_scanned,
         S.param4 AS heap_blks_vacuumed, S.param5 AS index_vacuum_count,
         S.param6 AS max_dead_tuple_bytes, S.param7 AS dead_tuple_bytes,
-        S.param8 AS indexes_total, S.param9 AS indexes_processed
+        S.param8 AS indexes_total, S.param9 AS indexes_processed,
+        S.param10 AS time_delayed
     FROM pg_stat_get_progress_info('VACUUM') AS S
         LEFT JOIN pg_database D ON S.datid = D.oid;
 
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 48f8eab202..5c40ee6e2c 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -40,6 +40,7 @@
 #include "catalog/pg_inherits.h"
 #include "commands/cluster.h"
 #include "commands/defrem.h"
+#include "commands/progress.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
 #include "nodes/makefuncs.h"
@@ -2380,13 +2381,27 @@ vacuum_delay_point(void)
 	/* Nap if appropriate */
 	if (msec > 0)
 	{
+		instr_time	delay_start;
+		instr_time	delay_time;
+
 		if (msec > vacuum_cost_delay * 4)
 			msec = vacuum_cost_delay * 4;
 
 		pgstat_report_wait_start(WAIT_EVENT_VACUUM_DELAY);
+		INSTR_TIME_SET_CURRENT(delay_start);
 		pg_usleep(msec * 1000);
+		INSTR_TIME_SET_CURRENT(delay_time);
 		pgstat_report_wait_end();
 
+		/* Report the amount of time we slept */
+		INSTR_TIME_SUBTRACT(delay_time, delay_start);
+		if (VacuumSharedCostBalance != NULL)
+			pgstat_progress_parallel_incr_param(PROGRESS_VACUUM_TIME_DELAYED,
+												INSTR_TIME_GET_MILLISEC(delay_time));
+		else
+			pgstat_progress_incr_param(PROGRESS_VACUUM_TIME_DELAYED,
+									   INSTR_TIME_GET_MILLISEC(delay_time));
+
 		/*
 		 * We don't want to ignore postmaster death during very long vacuums
 		 * with vacuum_cost_delay configured. We can't use the usual
diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index f0809c0e58..40b4f1d1e4 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -57,6 +57,6 @@
  */
 
 /*							yyyymmddN */
-#define CATALOG_VERSION_NO	202405161
+#define CATALOG_VERSION_NO	202406101
 
 #endif
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index 82a8fe6bd1..1fcefe9436 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -27,6 +27,7 @@
 #define PROGRESS_VACUUM_DEAD_TUPLE_BYTES		6
 #define PROGRESS_VACUUM_INDEXES_TOTAL			7
 #define PROGRESS_VACUUM_INDEXES_PROCESSED		8
+#define PROGRESS_VACUUM_TIME_DELAYED			9
 
 /* Phases of vacuum (as advertised via PROGRESS_VACUUM_PHASE) */
 #define PROGRESS_VACUUM_PHASE_SCAN_HEAP			1
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index ef658ad740..a499e44df1 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2053,7 +2053,8 @@ pg_stat_progress_vacuum| SELECT s.pid,
     s.param6 AS max_dead_tuple_bytes,
     s.param7 AS dead_tuple_bytes,
     s.param8 AS indexes_total,
-    s.param9 AS indexes_processed
+    s.param9 AS indexes_processed,
+    s.param10 AS time_delayed
    FROM (pg_stat_get_progress_info('VACUUM'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
      LEFT JOIN pg_database d ON ((s.datid = d.oid)));
 pg_stat_recovery_prefetch| SELECT stats_reset,
--
2.34.1
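For reference, with v2 applied the new counter can be watched like the other
progress fields while a vacuum is running, for example with something like:

-- time_delayed is in milliseconds and, for a parallel vacuum, is cumulated
-- across the leader and the workers
SELECT pid, relid::regclass, phase, time_delayed
FROM pg_stat_progress_vacuum;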