Hi,

On Tue, Jun 11, 2024 at 08:26:23AM +0000, Bertrand Drouvot wrote:
> Hi,
>
> On Tue, Jun 11, 2024 at 04:07:05PM +0900, Masahiko Sawada wrote:
>
> > Thank you for the proposal and the patch. I understand the motivation
> > of this patch.
>
> Thanks for looking at it!
>
> > Beside the point Nathan mentioned, I'm slightly worried
> > that massive parallel messages could be sent to the leader process
> > when the cost_limit value is low.
>
> I see, I can/will do some testing in this area and share the numbers.
Here is the result of the test. It has been launched several times and
produced the same (surprising) result each time.

====================== Context ================================================

The testing has been done with this relation (large_tbl) and its indexes:

postgres=# SELECT n.nspname, c.relname, count(*) AS buffers
           FROM pg_buffercache b
           JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid)
                AND b.reldatabase IN (0, (SELECT oid FROM pg_database
                                          WHERE datname = current_database()))
           JOIN pg_namespace n ON n.oid = c.relnamespace
           GROUP BY n.nspname, c.relname
           ORDER BY 3 DESC
           LIMIT 22;

 nspname |      relname       | buffers
---------+--------------------+---------
 public  | large_tbl          |  222280
 public  | large_tbl_filler13 |  125000
 public  | large_tbl_filler6  |  125000
 public  | large_tbl_filler5  |  125000
 public  | large_tbl_filler3  |  125000
 public  | large_tbl_filler15 |  125000
 public  | large_tbl_filler4  |  125000
 public  | large_tbl_filler20 |  125000
 public  | large_tbl_filler18 |  125000
 public  | large_tbl_filler14 |  125000
 public  | large_tbl_filler8  |  125000
 public  | large_tbl_filler11 |  125000
 public  | large_tbl_filler19 |  125000
 public  | large_tbl_filler7  |  125000
 public  | large_tbl_filler1  |  125000
 public  | large_tbl_filler12 |  125000
 public  | large_tbl_filler9  |  125000
 public  | large_tbl_filler17 |  125000
 public  | large_tbl_filler16 |  125000
 public  | large_tbl_filler10 |  125000
 public  | large_tbl_filler2  |  125000
 public  | large_tbl_pkey     |    5486
(22 rows)

All of them completely fit in memory (to avoid I/O read latency during the
vacuum).

The config, outside of defaults, is:

max_wal_size = 4GB
shared_buffers = 30GB
vacuum_cost_delay = 1
autovacuum_vacuum_cost_delay = 1
max_parallel_maintenance_workers = 8
max_parallel_workers = 10
vacuum_cost_limit = 10
autovacuum_vacuum_cost_limit = 10

My system is not overloaded, has enough resources to run this test, and only
this test is running.
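For completeness, the table could have been built along these lines. The exact
DDL is not part of this email, so the column types, row count and insert below
are assumptions made only to illustrate the shape of the relation (one filler
column and one index per filler, plus the primary key):

-- illustrative only: types and row count are guesses, not the actual DDL
CREATE TABLE large_tbl (
    id      bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY, -- large_tbl_pkey
    filler1 text, filler2 text, /* ... */ filler20 text
);

-- populate with enough rows so that the relation and indexes reach the
-- sizes reported above
INSERT INTO large_tbl (filler1, filler2, filler20)
    SELECT md5(g::text), md5((g + 1)::text), md5((g + 2)::text)
    FROM generate_series(1, 10000000) AS g;

CREATE INDEX large_tbl_filler1 ON large_tbl (filler1);
CREATE INDEX large_tbl_filler2 ON large_tbl (filler2);
-- ... one index per filler column, up to large_tbl_filler20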
====================== Results ================================================

========== With v2 (attached) applied on master

postgres=# VACUUM (PARALLEL 8) large_tbl;
VACUUM
Time: 1146873.016 ms (19:06.873)

The duration is split that way:

Vacuum phase: cumulative time (cumulative time delayed)
=======================================================
scanning heap:     00:08:16.414628 (time_delayed: 444370)
vacuuming indexes: 00:14:55.314699 (time_delayed: 2545293)
vacuuming heap:    00:19:06.814617 (time_delayed: 2767540)

I sampled active sessions from pg_stat_activity (one second interval); here is
the summary during the vacuuming indexes phase (ordered by count):

 leader_pid |  pid   |   wait_event   | count
------------+--------+----------------+-------
     452996 | 453225 | VacuumDelay    |   366
     452996 | 453223 | VacuumDelay    |   363
     452996 | 453226 | VacuumDelay    |   362
     452996 | 453224 | VacuumDelay    |   361
     452996 | 453222 | VacuumDelay    |   359
     452996 | 453221 | VacuumDelay    |   359
            | 452996 | VacuumDelay    |   331
            | 452996 | CPU            |    30
     452996 | 453224 | WALWriteLock   |    23
     452996 | 453222 | WALWriteLock   |    20
     452996 | 453226 | WALWriteLock   |    20
     452996 | 453221 | WALWriteLock   |    19
            | 452996 | WalSync        |    18
     452996 | 453225 | WALWriteLock   |    18
     452996 | 453223 | WALWriteLock   |    16
            | 452996 | WALWriteLock   |    15
     452996 | 453221 | CPU            |    14
     452996 | 453222 | CPU            |    14
     452996 | 453223 | CPU            |    12
     452996 | 453224 | CPU            |    10
     452996 | 453226 | CPU            |    10
     452996 | 453225 | CPU            |     8
     452996 | 453223 | WalSync        |     4
     452996 | 453221 | WalSync        |     2
     452996 | 453226 | WalWrite       |     2
     452996 | 453221 | WalWrite       |     1
            | 452996 | ParallelFinish |     1
     452996 | 453224 | WalSync        |     1
     452996 | 453225 | WalSync        |     1
     452996 | 453222 | WalWrite       |     1
     452996 | 453225 | WalWrite       |     1
     452996 | 453222 | WalSync        |     1
     452996 | 453226 | WalSync        |     1

========== On master (v2 not applied)

postgres=# VACUUM (PARALLEL 8) large_tbl;
VACUUM
Time: 1322598.087 ms (22:02.598)

Surprisingly it has been longer on master, by about 3 minutes.
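As a side note on the method: the per-second sampling mentioned above (and
below for master) was done with something along these lines. This is a sketch
only; the "samples" table, the filtering and the mapping of a NULL wait_event
to "CPU" are how I'd illustrate it here, not the exact script that was used:

CREATE TABLE samples (ts timestamptz, leader_pid int, pid int, wait_event text);

-- run every second while the vacuum is in progress (e.g. with \watch 1 in psql)
INSERT INTO samples
    SELECT now(), leader_pid, pid, COALESCE(wait_event, 'CPU')
    FROM pg_stat_activity
    WHERE state = 'active'
      AND (query ILIKE 'vacuum%' OR backend_type = 'parallel worker');

-- then summarize over the time window of the phase of interest
SELECT leader_pid, pid, wait_event, count(*)
FROM samples
GROUP BY leader_pid, pid, wait_event
ORDER BY count(*) DESC;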
Let's see how the time is split:

Vacuum phase: cumulative time
=============================
scanning heap:     00:08:07.061196
vacuuming indexes: 00:17:50.961228
vacuuming heap:    00:22:02.561199

I sampled active sessions from pg_stat_activity (one second interval); here is
the summary during the vacuuming indexes phase (ordered by count):

 leader_pid |  pid   |    wait_event     | count
------------+--------+-------------------+-------
     468682 | 468858 | VacuumDelay       |   548
     468682 | 468862 | VacuumDelay       |   547
     468682 | 468859 | VacuumDelay       |   547
     468682 | 468860 | VacuumDelay       |   545
     468682 | 468857 | VacuumDelay       |   543
     468682 | 468861 | VacuumDelay       |   542
            | 468682 | VacuumDelay       |   378
            | 468682 | ParallelFinish    |   182
     468682 | 468861 | WALWriteLock      |    19
     468682 | 468857 | WALWriteLock      |    19
     468682 | 468859 | WALWriteLock      |    18
     468682 | 468858 | WALWriteLock      |    16
     468682 | 468860 | WALWriteLock      |    15
     468682 | 468862 | WALWriteLock      |    15
     468682 | 468862 | CPU               |    12
     468682 | 468857 | CPU               |    10
     468682 | 468859 | CPU               |    10
     468682 | 468861 | CPU               |    10
            | 468682 | CPU               |     9
     468682 | 468860 | CPU               |     9
     468682 | 468860 | WalSync           |     8
            | 468682 | WALWriteLock      |     7
     468682 | 468858 | WalSync           |     6
     468682 | 468858 | CPU               |     6
     468682 | 468862 | WalSync           |     3
     468682 | 468857 | WalSync           |     3
     468682 | 468861 | WalSync           |     3
     468682 | 468859 | WalSync           |     2
     468682 | 468861 | WalWrite          |     2
     468682 | 468857 | WalWrite          |     1
     468682 | 468858 | WalWrite          |     1
     468682 | 468861 | WALBufMappingLock |     1
     468682 | 468857 | WALBufMappingLock |     1
            | 468682 | WALBufMappingLock |     1

====================== Observations ===========================================

As compared to v2:

1. scanning heap time is about the same
2. vacuuming indexes time is about 3 minutes longer on master
3. vacuuming heap time is about the same

One difference we can see in the sampling is that on master the
"ParallelFinish" wait event has been sampled about 182 times for the leader
(so it could be _the_ 3 minutes of interest).

On master the vacuuming indexes phase ran between 2024-06-13 10:11:34 and
2024-06-13 10:21:15. If I extract the exact minutes and the counts for the
"ParallelFinish" wait event I get:

 minute |   wait_event   | count
--------+----------------+-------
     18 | ParallelFinish |    48
     19 | ParallelFinish |    60
     20 | ParallelFinish |    60
     21 | ParallelFinish |    14

So it's likely that the leader waited on ParallelFinish for about 3 minutes at
the end of the vacuuming indexes phase (as this wait appeared in consecutive
samples).

====================== Conclusion =============================================

1. During the scanning heap and vacuuming heap phases no noticeable
performance degradation has been observed with v2 applied (as compared to
master) (cc'ing Robert as it's also related to his question about a noticeable
hit when everything is in memory in [1]).

2. During the vacuuming indexes phase, v2 has been faster (as compared to
master). The reason is that on master the leader has been waiting for about
3 minutes on "ParallelFinish" at the end.

====================== Remarks ================================================

As v2 is attached, please find below a summary of the current state of this
thread:

1. v2 implements delay_time as the actual wait time (and not the intended wait
time as proposed in v1).

2. some measurements have been done to check the impact of this new
instrumentation (see this email and [2]): no noticeable performance
degradation has been observed (and surprisingly it's rather the opposite, as
mentioned above).
3. there is an ongoing discussion about exposing the number of waits [2].

4. there is an ongoing discussion about exposing the effective cost limit [3].

5. it could be interesting to have a closer look at why the leader waits for
about 3 minutes on "ParallelFinish" on master and not with v2 applied (but
that's probably out of scope for this thread).

[1]: https://www.postgresql.org/message-id/CA%2BTgmoZiC%3DzeCDYuMpB%2BGb2yK%3DrTQCGMu0VoxehocKyHxr9Erg%40mail.gmail.com
[2]: https://www.postgresql.org/message-id/ZmmOOPwMFIltkdsN%40ip-10-97-1-34.eu-west-3.compute.internal
[3]: https://www.postgresql.org/message-id/Zml9%2Bu37iS7DFkJL%40ip-10-97-1-34.eu-west-3.compute.internal

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
>From 21eeab61c125a7ca4afccd3bc5961a1f060f0b9a Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot <bertranddrouvot...@gmail.com>
Date: Thu, 6 Jun 2024 12:35:57 +0000
Subject: [PATCH v2] Report the total amount of time that vacuum has been
 delayed due to cost delay

This commit adds one column: time_delayed to the pg_stat_progress_vacuum
system view to show the total amount of time in milliseconds that vacuum
has been delayed.

This uses the new parallel message type for progress reporting added
by f1889729dd.

Bump catversion because this changes the definition of
pg_stat_progress_vacuum.
---
 doc/src/sgml/monitoring.sgml         | 11 +++++++++++
 src/backend/catalog/system_views.sql |  3 ++-
 src/backend/commands/vacuum.c        | 15 +++++++++++++++
 src/include/catalog/catversion.h     |  2 +-
 src/include/commands/progress.h      |  1 +
 src/test/regress/expected/rules.out  |  3 ++-
 6 files changed, 32 insertions(+), 3 deletions(-)
  37.5% doc/src/sgml/
  42.3% src/backend/commands/
   5.9% src/include/catalog/
   3.2% src/include/commands/
   7.9% src/test/regress/expected/

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 053da8d6e4..cdd0f0e533 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -6290,6 +6290,17 @@ FROM pg_stat_get_backend_idset() AS backendid;
        <literal>cleaning up indexes</literal>.
       </para></entry>
      </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>time_delayed</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Total amount of time spent in milliseconds waiting due to <varname>vacuum_cost_delay</varname>
+       or <varname>autovacuum_vacuum_cost_delay</varname>. In case of parallel
+       vacuum the reported time is across all the workers and the leader.
+      </para></entry>
+     </row>
     </tbody>
    </tgroup>
   </table>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 53047cab5f..1345e99dcb 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1221,7 +1221,8 @@ CREATE VIEW pg_stat_progress_vacuum AS
         S.param2 AS heap_blks_total, S.param3 AS heap_blks_scanned,
         S.param4 AS heap_blks_vacuumed, S.param5 AS index_vacuum_count,
         S.param6 AS max_dead_tuple_bytes, S.param7 AS dead_tuple_bytes,
-        S.param8 AS indexes_total, S.param9 AS indexes_processed
+        S.param8 AS indexes_total, S.param9 AS indexes_processed,
+        S.param10 AS time_delayed
     FROM pg_stat_get_progress_info('VACUUM') AS S
         LEFT JOIN pg_database D ON S.datid = D.oid;
 
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 48f8eab202..5c40ee6e2c 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -40,6 +40,7 @@
 #include "catalog/pg_inherits.h"
 #include "commands/cluster.h"
 #include "commands/defrem.h"
+#include "commands/progress.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
 #include "nodes/makefuncs.h"
@@ -2380,13 +2381,27 @@ vacuum_delay_point(void)
 	/* Nap if appropriate */
 	if (msec > 0)
 	{
+		instr_time	delay_start;
+		instr_time	delay_time;
+
 		if (msec > vacuum_cost_delay * 4)
 			msec = vacuum_cost_delay * 4;
 
 		pgstat_report_wait_start(WAIT_EVENT_VACUUM_DELAY);
+		INSTR_TIME_SET_CURRENT(delay_start);
 		pg_usleep(msec * 1000);
+		INSTR_TIME_SET_CURRENT(delay_time);
 		pgstat_report_wait_end();
 
+		/* Report the amount of time we slept */
+		INSTR_TIME_SUBTRACT(delay_time, delay_start);
+		if (VacuumSharedCostBalance != NULL)
+			pgstat_progress_parallel_incr_param(PROGRESS_VACUUM_TIME_DELAYED,
+												INSTR_TIME_GET_MILLISEC(delay_time));
+		else
+			pgstat_progress_incr_param(PROGRESS_VACUUM_TIME_DELAYED,
+									   INSTR_TIME_GET_MILLISEC(delay_time));
+
 		/*
 		 * We don't want to ignore postmaster death during very long vacuums
 		 * with vacuum_cost_delay configured. We can't use the usual
diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index f0809c0e58..40b4f1d1e4 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -57,6 +57,6 @@
  */
 
 /*							yyyymmddN */
-#define CATALOG_VERSION_NO	202405161
+#define CATALOG_VERSION_NO	202406101
 
 #endif
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index 82a8fe6bd1..1fcefe9436 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -27,6 +27,7 @@
 #define PROGRESS_VACUUM_DEAD_TUPLE_BYTES		6
 #define PROGRESS_VACUUM_INDEXES_TOTAL			7
 #define PROGRESS_VACUUM_INDEXES_PROCESSED		8
+#define PROGRESS_VACUUM_TIME_DELAYED			9
 
 /* Phases of vacuum (as advertised via PROGRESS_VACUUM_PHASE) */
 #define PROGRESS_VACUUM_PHASE_SCAN_HEAP			1
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index ef658ad740..a499e44df1 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2053,7 +2053,8 @@ pg_stat_progress_vacuum| SELECT s.pid,
     s.param6 AS max_dead_tuple_bytes,
     s.param7 AS dead_tuple_bytes,
     s.param8 AS indexes_total,
-    s.param9 AS indexes_processed
+    s.param9 AS indexes_processed,
+    s.param10 AS time_delayed
    FROM (pg_stat_get_progress_info('VACUUM'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
      LEFT JOIN pg_database d ON ((s.datid = d.oid)));
 pg_stat_recovery_prefetch| SELECT stats_reset,
--
2.34.1
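For reference, with v2 applied the new counter can be watched like the other
progress fields while a vacuum is running, for example with something like:

-- time_delayed is in milliseconds and, for a parallel vacuum, is cumulated
-- across the leader and the workers
SELECT pid, relid::regclass, phase, time_delayed
FROM pg_stat_progress_vacuum;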