I've got two Postgres 13 databases on AWS RDS.

  *   One is a master, the other a slave using logical replication.
  *   Replication has fallen behind by about 350 GB.
  *   The slave's CPU was maxed out for the past four days because of some 
long-running jobs, so I'm not sure how much logical replication was able to 
apply during that time.
  *   I killed those jobs, and now CPU on both the master and the slave is low.
  *   I look at the subscriber via `select * from pg_stat_subscription;` and 
see that latest_end_lsn is advancing albeit very slowly.
  *   The publisher says write/flush/replay lags are all 13 minutes behind but 
it's been like that for most of the day.
  *   I see no errors in the logs on either the publisher or subscriber outside 
of some simple SQL errors that users have been making.
  *   CloudWatch reports low CPU utilization, low I/O, and low network.
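
For context, a minimal sketch of a publisher-side query that expresses the 
backlog in bytes next to the time-based lags, rather than as raw LSN values. 
It assumes it is run on the publisher, and that the walsender for each slot 
can be matched through `active_pid`; the `bytes_behind` alias is just an 
illustrative name.

```sql
-- Run on the publisher (the "master").
-- For each logical slot: how much WAL has not yet been confirmed by the
-- subscriber, plus the lag columns reported for the slot's walsender.
SELECT s.slot_name,
       s.active,
       pg_size_pretty(
           pg_wal_lsn_diff(pg_current_wal_lsn(), s.confirmed_flush_lsn)
       ) AS bytes_behind,
       r.state,
       r.write_lag,
       r.flush_lag,
       r.replay_lag
FROM pg_replication_slots AS s
LEFT JOIN pg_stat_replication AS r
       ON r.pid = s.active_pid
WHERE s.slot_type = 'logical';
```

Running it a few minutes apart and comparing `bytes_behind` should show 
whether the gap is actually shrinking.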

Is there anything I can do here? Previously I set wal_receiver_timeout to 0 
because I had replication issues, and that helped things. I wish I had 
some visibility here to get any kind of confidence that it's going to pull 
through, but other than these lsn values and database logs, I'm not sure what 
to check.
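
A sketch of the subscriber-side checks that seem relevant, using only the 
standard catalog views. The column aliases are illustrative, and whether any 
tables are still in initial sync is an assumption here, not something I have 
confirmed.

```sql
-- Run on the subscriber (the "slave").
-- How long since the apply worker last received a message and last
-- reported a confirmed position back to the publisher.
SELECT subname,
       pid,
       received_lsn,
       latest_end_lsn,
       now() - last_msg_receipt_time AS since_last_message,
       now() - latest_end_time       AS since_last_confirmed
FROM pg_stat_subscription;

-- Tables that are not yet in the 'r' (ready) state, i.e. still
-- initializing, copying data, or finishing synchronization.
SELECT srrelid::regclass AS table_name,
       srsubstate
FROM pg_subscription_rel
WHERE srsubstate <> 'r';
```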

Sincerely,
mj
