Hey everyone,

While investigating why our streaming slave's pg_xlog directory keeps growing (http://archives.postgresql.org/pgsql-general/2012-08/msg00124.php), I think I've hit upon what's going on.
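For reference, this is roughly how I've been watching the replication position and the pg_xlog growth on the slave (nothing fancy; the data directory path below is from my setup, so adjust as needed):

    -- on the slave, from psql: last WAL location received / replayed
    SELECT pg_last_xlog_receive_location(), pg_last_xlog_replay_location();

    # on the slave, from a shell: rough count of segment files sitting in pg_xlog
    ls /var/lib/pgsql/data/pg_xlog | wc -l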
Using pg_controldata, I've found that our streaming slave hasn't run a checkpoint in over 2 weeks; or, more specifically, hasn't completed a checkpoint in 2 weeks. The server was synced and started up on the 27th, so there were about 4 days where it looks like checkpoints were happening correctly. I have attempted a manual CHECKPOINT on the slave with no effect (it returns instantly in the client, but pg_controldata never shows any change). On my test system the automatic checkpoint runs successfully every 5 minutes and removes the WAL segments it no longer needs to reach that restartpoint.

Here is the pg_controldata output from the slave:

pg_control version number:            903
Catalog version number:               201105231
Database system identifier:           5680552055141687295
Database cluster state:               in archive recovery
*pg_control last modified:            Tue 31 Jul 2012 12:43:59 PM MDT*
Latest checkpoint location:           178/5344C730
Prior checkpoint location:            178/4E8C1938
Latest checkpoint's REDO location:    178/4FC5E8E0
Latest checkpoint's TimeLineID:       10
Latest checkpoint's NextXID:          0/291590955
Latest checkpoint's NextOID:          67986575
Latest checkpoint's NextMultiXactId:  122827
Latest checkpoint's NextMultiOffset:  296304
Latest checkpoint's oldestXID:        64713442
Latest checkpoint's oldestXID's DB:   16408
Latest checkpoint's oldestActiveXID:  260568245
*Time of latest checkpoint:           Tue 31 Jul 2012 12:35:53 PM MDT*
Minimum recovery ending location:     199/EC6C0BA0
Backup start location:                0/0
Current wal_level setting:            hot_standby
Current max_connections setting:      600
Current max_prepared_xacts setting:   0
Current max_locks_per_xact setting:   256
Maximum data alignment:               8
Database block size:                  8192
Blocks per segment of large relation: 131072
WAL block size:                       8192
Bytes per WAL segment:                16777216
Maximum length of identifiers:        64
Maximum columns in an index:          32
Maximum size of a TOAST chunk:        1996
Date/time type storage:               64-bit integers
Float4 argument passing:              by value
Float8 argument passing:              by value

Compared to our master:

pg_control version number:            903
Catalog version number:               201105231
Database system identifier:           5680552055141687295
Database cluster state:               in production
pg_control last modified:             Mon 13 Aug 2012 11:35:35 AM MDT
Latest checkpoint location:           199/EC5E4218
Prior checkpoint location:            199/EBEFB120
Latest checkpoint's REDO location:    199/EC033110
Latest checkpoint's TimeLineID:       10
Latest checkpoint's NextXID:          0/291590955
Latest checkpoint's NextOID:          75744399
Latest checkpoint's NextMultiXactId:  122877
Latest checkpoint's NextMultiOffset:  296404
Latest checkpoint's oldestXID:        91782681
Latest checkpoint's oldestXID's DB:   16408
Latest checkpoint's oldestActiveXID:  291583665
Time of latest checkpoint:            Mon 13 Aug 2012 11:31:35 AM MDT
Minimum recovery ending location:     0/0
Backup start location:                0/0
Current wal_level setting:            hot_standby
Current max_connections setting:      600
Current max_prepared_xacts setting:   0
Current max_locks_per_xact setting:   256
Maximum data alignment:               8
Database block size:                  8192
Blocks per segment of large relation: 131072
WAL block size:                       8192
Bytes per WAL segment:                16777216
Maximum length of identifiers:        64
Maximum columns in an index:          32
Maximum size of a TOAST chunk:        1996
Date/time type storage:               64-bit integers
Float4 argument passing:              by value
Float8 argument passing:              by value

The master appears to be running its automatic checkpoints every 5 minutes, as expected.
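Both of the above dumps are straight from pg_controldata run against each cluster's data directory; for a quick side-by-side comparison I've just been filtering out the interesting lines, something along these lines (data directory path is from my setup):

    pg_controldata /var/lib/pgsql/data | grep -E 'cluster state|checkpoint|recovery'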
Our checkpoint configuration is currently:

checkpoint_completion_target | 0.8  | Time spent flushing dirty buffers during checkpoint, as fraction of checkpoint interval.
checkpoint_segments          | 128  | Sets the maximum distance in log segments between automatic WAL checkpoints.
checkpoint_timeout           | 5min | Sets the maximum time between automatic WAL checkpoints.
checkpoint_warning           | 30s  | Enables warnings if checkpoint segments are filled more frequently than this.

The settings are the same on both the slave and the master. We don't currently have log_checkpoints enabled on the slave (I'm going to make that change and schedule a resync for this evening to see if the log reveals anything; the exact change is in the P.S. below). I've done some googling and can't find anything that would let me see the status of a checkpoint currently in progress.

For more background, we're running PostgreSQL 9.1.1, built from source, on two x64 CentOS 5.8 machines. The total size of our /data directory is currently 94.5 GB, all in one database with 2193 schemas (each of which has around 1300 objects). From what I'm able to pull off the slave, there isn't constant I/O happening, so it doesn't look like checkpoints are running and just thrashing the disks. If there's any other information that might be helpful, I can pull it.

Does anyone have suggestions for what to try next (besides turning on checkpoint logging and seeing what it shows)?

Thanks
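P.S. In case it's useful, this is the change I'm planning to make on the slave this evening. As far as I can tell, in 9.1 log_checkpoints only needs a reload, not a restart, and restartpoints on the standby should then show up in the log (file locations below are from my setup, so adjust to taste):

    # postgresql.conf on the slave
    log_checkpoints = on

    -- then from psql, as a superuser
    SELECT pg_reload_conf();

    # and watch the slave's log for restartpoint activity
    tail -f /var/lib/pgsql/data/pg_log/postgresql-*.log | grep -i restartpoint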