Hello,

In my previous post[1] on this list I brought up an issue with long-running checkpoints. I reduced checkpoint_timeout to a more reasonable value (15m, down from 60m) and forced/immediate checkpoints now mostly complete in under a minute. That thread, and another one[2] on the Clusterlabs mailing list, also helped me understand more about how PostgreSQL internals work, thanks everyone!

Now to my current issue: I took the advice to add more monitoring on replay lag (using pg_last_xact_replay_timestamp) and things are not looking good.
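For context, the standby-side check is essentially a query along these lines (a minimal sketch, the actual monitoring query may differ slightly; the CASE guard avoids reporting stale lag when the standby is fully caught up, since pg_last_xact_replay_timestamp() keeps returning the last replayed commit time while the primary is idle):

  SELECT CASE
           WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn() THEN 0
           ELSE EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp())
         END AS replay_lag_seconds;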
Last night replication lagged by almost 6 hours on one of the nodes[3], but eventually caught up. As you can see in that screenshot, ph-sql-03 is consistently slower to replay than ph-sql-05 (ph-sql-04 is the current master), and ph-sql-03 has different SSDs (Crucial MX300, vs Crucial MX500 in the other two), which makes me think this is I/O related.

When I check the replay_lag column of pg_stat_replication, the numbers are consistent with the data from pg_last_xact_replay_timestamp:

postgres=# SELECT application_name, replay_lag FROM pg_stat_replication;
 application_name |   replay_lag
------------------+-----------------
 ph-sql-03        | 00:15:16.179952
 ph-sql-05        | 00:10:01.078163

Currently this doesn't present an operational issue, as the slaves aren't used by applications (still waiting for development to make the necessary changes), so there are no queries running on the slaves at all apart from the occasional monitoring.

Cluster specifications:

all nodes:
- filesystem: ZFS stripe of mirrors
- 2x CPU: Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz (20 cores total)
- 128 GB RAM

ph-sql-03:     8x Crucial MX300 1050GB, underprovisioned to 1TB
ph-sql-0{4,5}: 8x Crucial MX500 1TB

postgresql.conf with GUCs changed from default:

shared_buffers = 8GB
work_mem = 64MB
maintenance_work_mem = 2GB
autovacuum_work_mem = 1GB
effective_io_concurrency = 200
max_worker_processes = 50
max_parallel_maintenance_workers = 8
max_parallel_workers_per_gather = 8
max_parallel_workers = 40
wal_level = replica
synchronous_commit = off
full_page_writes = off
wal_log_hints = on
wal_buffers = 128MB
checkpoint_timeout = 15min
max_wal_size = 8GB
min_wal_size = 1GB
checkpoint_completion_target = 0.9
archive_mode = on
archive_command = 'pgbackrest --stanza=pgdb2 archive-push %p'
max_wal_senders = 10
wal_keep_segments = 20
hot_standby = on
hot_standby_feedback = on
random_page_cost = 1.5
effective_cache_size = 48GB
default_statistics_target = 500
shared_preload_libraries = 'timescaledb, pg_cron, pg_prewarm'  # (change requires restart)
max_locks_per_transaction = 512

What are possible reasons for the high replay_lag? Is my storage just too slow? Are there any tunables available?

[1] https://www.postgresql.org/message-id/flat/CAEkBuzeno6ztiM1g4WdzKRJFgL8b2nfePNU%3Dq3sBiEZUm-D-sQ%40mail.gmail.com
[2] https://lists.clusterlabs.org/pipermail/users/2019-July/025967.html
[3] https://ibb.co/0sncjBZ

Thank you,
Tiemen Ruiten