Hi can share recovery.conf file settings??
El mar., 23 de oct. de 2018 a la(s) 00:28, Boris Sagadin ( bo...@infosplet.com) escribió: > Yes, turning wal_compression off improves things. Slave that was mentioned > unfortunately lagged too much before this setting was applied and was > turned off. However the remaining slave lags less now, although still > occasionally up to a few minutes. I think single threadedness of recovery > is a big slowdown for write heavy databases. Maybe an option to increase > wal_size beyond 16MB in v11 will help. > > In the meantime we'll solve this by splitting the DB to 2 or 3 clusters or > maybe trying out some sharding solution like Citus. > > > Boris > > On Sun, Oct 21, 2018 at 9:06 AM, Boris Sagadin <bo...@infosplet.com> > wrote: > >> Hello, >> >> I have a database running on i3.8xlarge (256GB RAM, 32 CPU cores, 4x >> 1.9TB NVMe drive) AWS instance with about 5TB of disk space occupied, ext4, >> Ubuntu 16.04. >> >> Multi-tenant DB with about 40000 tables, insert heavy. >> >> I started a new slave with identical HW specs, SR. DB started syncing >> from master, which took about 4 hours, then it started applying the WALs. >> However, it seems it can't catch up. Delay is still around 3 hours >> (measured with now() - pg_last_xact_replay_timestamp()), even a day later. >> It goes a few 100s up and down, but it seems to float around 3h mark. >> >> Disk IO is low at about 10%, measured with iostat, no connected clients, >> recovery process is at around 90% CPU single core usage. >> >> Tried tuning the various parameters, but with no avail. Only thing I >> found suspicious is stracing the recovery process constantly produces many >> errors such as: >> >> lseek(428, 0, SEEK_END) = 780124160 >> lseek(30, 0, SEEK_END) = 212992 >> read(9, 0x7ffe4001f557, 1) = -1 EAGAIN (Resource temporarily >> unavailable) >> lseek(680, 0, SEEK_END) = 493117440 >> read(9, 0x7ffe4001f557, 1) = -1 EAGAIN (Resource temporarily >> unavailable) >> lseek(774, 0, SEEK_END) = 583368704 >> >> ...[snip]... >> >> read(9, 0x7ffe4001f557, 1) = -1 EAGAIN (Resource temporarily >> unavailable) >> lseek(774, 0, SEEK_END) = 583368704 >> read(9, 0x7ffe4001f557, 1) = -1 EAGAIN (Resource temporarily >> unavailable) >> lseek(277, 0, SEEK_END) = 502882304 >> lseek(6, 516096, SEEK_SET) = 516096 >> read(6, >> "\227\320\5\0\1\0\0\0\0\340\7\246\26\274\0\0\315\0\0\0\0\0\0\0}\0178\5&/\260\r"..., >> 8192) = 8192 >> read(9, 0x7ffe4001f557, 1) = -1 EAGAIN (Resource temporarily >> unavailable) >> lseek(735, 0, SEEK_END) = 272809984 >> read(9, 0x7ffe4001f557, 1) = -1 EAGAIN (Resource temporarily >> unavailable) >> lseek(277, 0, SEEK_END) = 502882304 >> >> ls -l fd/9 >> lr-x------ 1 postgres postgres 64 Oct 21 06:21 fd/9 -> pipe:[46358] >> >> >> Perf top on recovery produces: >> >> 27.76% postgres [.] pglz_decompress >> 9.90% [kernel] [k] entry_SYSCALL_64_after_swapgs >> 7.09% postgres [.] hash_search_with_hash_value >> 4.26% libpthread-2.23.so [.] llseek >> 3.64% libpthread-2.23.so [.] __read_nocancel >> 2.80% [kernel] [k] __fget_light >> 2.67% postgres [.] 0x000000000034d3ba >> 1.85% [kernel] [k] ext4_llseek >> 1.84% postgres [.] pg_comp_crc32c_sse42 >> 1.44% postgres [.] hash_any >> 1.35% postgres [.] 0x000000000036afad >> 1.29% postgres [.] MarkBufferDirty >> 1.21% postgres [.] XLogReadRecord >> [...] >> >> Tried changing the process limits with prlimit to unlimited, but no >> change. >> >> I can turn off the WAL compression but I doubt this is the main culprit. >> Any ideas appreciated. >> >> Regards, >> Boris >> >> > -- Cordialmente, Ing. Hellmuth I. Vargas S. Esp. Telemática y Negocios por Internet Oracle Database 10g Administrator Certified Associate EnterpriseDB Certified PostgreSQL 9.3 Associate