[PERFORM] Replication Lag Causes

Mike Wilson Sat, 01 Nov 2014 15:34:22 -0700

I have two 9.3.4 PG instances that back a large internet website that has very 
seasonal traffic and can generate large query loads.  My instances are in a 
master-slave streaming replication setup  and are stable and in general perform 
very well.  The only issues we have with the boxes is that when the master is 
busy the slave may start to lag excessively.  I can give specifics as to what 
heavily loaded means and additionally the postgresql.conf for both boxes but my 
basic questions are:
   * What causes streaming replication lag to increase?  
   * What parameters can be tuned to reduce streaming replication lag?
   * Can a loaded slave affect lag adversely?
   * Can increasing max_wal_senders help reduce lag?


The reason I ask this is that as mentioned above the servers are stable and are 
real troopers in general as they back a very popular web site that puts the 
master under heavy seasonal load at times.  At those times though we see an 
almost exponential growth in streaming replication lag compared to load on the 
master.  

For example, the master is a very beefy Solaris:
   * 4 Recent Intel Zeons (16 physical cores)
   * 256 GB of ECC RAM
   * 12 TB of ZFS (spindle and SSD internal storage)
   * DB on disk size is 2TB
   * ZFS ARC cache of roughly 250G.
   * ZFS ARC2/ZIL configured for SSD’s (this is awesome by the way)

Basic PG Config:
   shared_buffers = 2GB
   work_mem = 128MB
   max_connections = 1700 (supports roughly 100 web servers)
   wal_keep_segments = 256 (roughly enough for 24 hours of operation under 
heavy load)
   wal_sender_timeout = 60s
   replication_timeout=(not set)
   wal_receiver_status_interval=10s
   max_wal_senders=6
   * wal archiving is off
   * 98% of the queries on the master complete in under 500ms.
   * No hung or very long running queries in general.

The master on a normal day maintains a load of about 0.5, during which 
replication lag to the slave is in hundreds milliseconds.  When the production 
db server is heavily hit though the load may go as high as 4 on the master and 
the streaming replication lag may increase to more than 2 hours relatively 
quickly.  Load on the slave is generally below 1 even when the master is 
heavily loaded.  The traffic to the master is primarily read with about 10% DML 
(new users, purchase records, etc).  DML statements increase proportionally 
when under load though.  The master and slave are connected via dedicated 10G 
fiber link and even under heavy load the utilization of the link is nowhere 
near close to saturation.  BTW, the slave does run some reported related 
queries throughout the day that might take up to a minute to complete.

I have the task of figuring out why this otherwise healthy DB starts to lag so 
badly under load and if there is anything that we could do about it.  I’ve been 
wondering particularly if we should up the max_wal_senders but from the docs it 
is unclear if that would help.  In my testing with pg_bench on our dev boxes 
which were the previous production hardware for these servers I have determined 
that it doesn’t take much DML load on the master to get the slave to start 
lagging severely.  I was wondering if this was expected and/or some design 
consideration?  Possibly streaming replication isn’t meant to be used for 
heavily hit databases and maintain small lag times?  I would like to believe 
that the fault is something we have done though and that there is some 
parameter we could tune to reduce this lag.

Any recommendations would be very helpful.  
 
Mike Wilson
Predicate Logic Consulting

[PERFORM] Replication Lag Causes

Reply via email to