Apologies about the formatting; resending as plain text.

Regards,
Steve


From: steven.jones1...@outlook.com
To: pgsql-performance@postgresql.org
Subject: [PERFORM] Very slow checkpoints
Date: Wed, 18 Mar 2015 11:21:08 +0000

Hi,

We have a so-far (to us) unexplained issue on our production systems that appeared 
after we roughly doubled the amount of data we import daily. On pure theoretical 
hardware performance we should be fine, but we are seeing some odd IO counters 
while the actual write throughput is very low. The use case is as follows:

  - typical DW - relatively constant periodic data loads - i.e. heavy write
  - we receive large CSV files of ~5-10 GB every 15 minutes, spread out over 5-7 
minutes
  - Custom ETL scripts process and filter the files in < 30 seconds, down to about 
5 GB of CSV ready to load
  - 2 loader queues load the files, picking off a file one-by-one
  - tables are partitioned daily, indexed on a primary key + timestamp (see the 
sketch after this list)
  - system is HP blade; 128 GB RAM, 2x 8-core, 12x 10k RPM RAID1+0 (database) on 
the first controller, 2x 15k RPM RAID1 (xlog) on a different controller
  - DB size is ~2.5 TB; a rotating load of 30 days keeps the database size stable
  - filesystem: zfs with lz4 compression
  - raw throughput of the database disk is > 700 MB/s sequential and > 150 MB/s 
random for reads, and roughly half of that for writes, in various benchmarks
  - CPU load is minimal when copy loads are taking place (i.e. after ETL has 
finished)
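
For context, the partition and load layout is roughly along these lines (a minimal 
sketch in PostgreSQL 9.x inheritance style; table, column and file names are made 
up for illustration, not our actual schema):

-- Parent table; the loaders never touch it directly.
CREATE TABLE events (
    id      bigint      NOT NULL,
    ts      timestamptz NOT NULL,
    payload text
);

-- One child partition per day, constrained and indexed as described above.
CREATE TABLE events_20150318 (
    CHECK (ts >= '2015-03-18' AND ts < '2015-03-19')
) INHERITS (events);

ALTER TABLE events_20150318 ADD PRIMARY KEY (id);
CREATE INDEX ON events_20150318 (ts);

-- Each loader queue COPYs one filtered CSV at a time into the current partition
-- (illustrative file path).
COPY events_20150318 (id, ts, payload)
    FROM '/data/etl/filtered_0001.csv' WITH (FORMAT csv);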

The issue is that the system is constantly checkpointing, regardless of various 
kernel and postgres settings. Having read through most of the history of this 
list and most of the recommendations on various blogs, we have been unable to 
find an answer to why the checkpoints are being written so slowly. Even when we 
disable all import processes, or drop the index, a checkpoint still takes 
> 1 hour. Stats point to checkpoint sizes of roughly 7 GB, which should take 
< 1 minute to write even at fully random IO rates; so even with imports fully 
disabled, why is checkpointing taking well over an hour?
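
For reference, the ~7 GB figure comes from the checkpoint statistics, along the 
lines of the query below (pg_stat_bgwriter counts 8 kB buffers written by the 
checkpointer); the timing estimate is just back-of-the-envelope math against the 
benchmark numbers above:

-- Volume written by the checkpointer since the stats were last reset.
SELECT checkpoints_timed,
       checkpoints_req,
       buffers_checkpoint,
       pg_size_pretty(buffers_checkpoint * 8192) AS total_checkpoint_writes,
       pg_size_pretty(buffers_checkpoint * 8192
                      / greatest(checkpoints_timed + checkpoints_req, 1)) AS avg_per_checkpoint
FROM pg_stat_bgwriter;

-- Back of the envelope: 7 GB per checkpoint at the ~150 MB/s random rate above is
-- about 7168 / 150 ~= 48 seconds, and still only ~95 seconds at half that rate
-- for writes; nowhere near an hour.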

One other thing that's noticed, but not measured, i.e. mostly anecdotal: for a 
period of [...] the iotop output looks like this:

  TID PRIO USER       DISK READ   DISK WRITE  SWAPIN      IO    COMMAND
32101 be/4 postgres    10.25 M/s  1085.86 K/s  0.00 %  96.80 %  postgres: checkpointer process
56661 be/4 postgres     6.84 M/s   591.61 K/s  0.00 %  90.91 %  postgres: dbauser db [local] COPY
56751 be/4 postgres     6.97 M/s   838.73 K/s  0.00 %  88.00 %  postgres: dbauser db [local] COPY
56744 be/4 postgres     6.13 M/s   958.55 K/s  0.00 %  85.48 %  postgres: dbauser db [local] COPY
56621 be/4 postgres     6.77 M/s  1288.05 K/s  0.00 %  83.96 %  postgres: dbauser db [local] COPY
32102 be/4 postgres     8.05 M/s  1340.47 K/s  0.00 %  82.47 %  postgres: writer process
 1005 be/0 root         0.00 B/s     0.00 B/s  0.00 %   5.81 %  [txg_sync]
32103 be/4 postgres     0.00 B/s    10.41 M/s  0.00 %   0.52 %  postgres: wal writer process

