On Thu, 21 Aug 2025 at 11:03, hubert depesz lubaczewski <dep...@depesz.com> wrote: > > On Wed, Aug 20, 2025 at 06:30:00PM +0100, Thom Brown wrote: > > Do you have THP enabled? Can you use mpstat and see what %steal shows as? > > Sorry, what is "THP"? I tried searching for "what is thp", and most > common search results are related to some chemical compound.
Ah, yeah I meant transparent hugepage: cat /sys/kernel/mm/transparent_hugepage/enabled This should show it being set as "never". > > mpstat, yes, we have it. Let me quickly show what I get: > > Two days ago, at ~ 10:10pm UTC I saw this in Pg logs: > > # grep -oP '^2025-08-19 22:09:2\d\.\d+ UTC' postgresql-2025-08-19_220000.csv > | uniq -c | grep -C3 -P '^\s*\d\d' > 2 2025-08-19 22:09:29.084 UTC > 1 2025-08-19 22:09:29.094 UTC > 2 2025-08-19 22:09:29.097 UTC > 70 2025-08-19 22:09:29.109 UTC > 90 2025-08-19 22:09:29.110 UTC > 6 2025-08-19 22:09:29.111 UTC > 1 2025-08-19 22:09:29.153 UTC > 1 2025-08-19 22:09:29.555 UTC > > As you can see we have 70, and then 90 messages all logged with the same > timestamp. All of them (160) were "duration:x" lines. > > At the time, mpstat (called using `mpstat 5`) looked like this: > > CPU %usr %nice %sys %iowait %irq %soft %steal > %guest %gnice %idle > 22:09:04 all 2.25 0.00 0.27 0.24 0.00 0.10 0.00 > 0.00 0.00 97.14 > 22:09:09 all 2.70 0.00 0.28 0.27 0.00 0.12 0.00 > 0.00 0.00 96.63 > 22:09:14 all 2.83 0.00 0.63 0.23 0.00 0.09 0.00 > 0.00 0.00 96.22 > 22:09:19 all 2.66 0.00 0.28 0.23 0.00 0.10 0.00 > 0.00 0.00 96.74 > 22:09:24 all 2.36 0.00 0.27 0.21 0.00 0.10 0.00 > 0.00 0.00 97.06 > 22:09:29 all 2.49 0.00 0.40 0.40 0.00 0.11 0.00 > 0.00 0.00 96.60 > 22:09:34 all 2.65 0.00 0.28 0.23 0.00 0.11 0.00 > 0.00 0.00 96.73 > 22:09:39 all 2.26 0.00 0.26 0.21 0.00 0.09 0.00 > 0.00 0.00 97.19 > 22:09:44 all 2.31 0.00 0.25 0.22 0.00 0.09 0.00 > 0.00 0.00 97.13 > 22:09:49 all 2.44 0.00 0.22 0.24 0.00 0.08 0.00 > 0.00 0.00 97.02 > 22:09:54 all 2.23 0.00 0.23 0.28 0.00 0.10 0.00 > 0.00 0.00 97.16 > 22:09:59 all 2.48 0.00 0.26 0.30 0.00 0.11 0.00 > 0.00 0.00 96.85 > 22:10:04 all 2.29 0.00 0.28 0.26 0.00 0.12 0.00 > 0.00 0.00 97.06 > 22:10:09 all 2.31 0.00 0.22 0.21 0.00 0.11 0.00 > 0.00 0.00 97.14 > 22:10:14 all 2.38 0.00 0.25 0.21 0.00 0.10 0.00 > 0.00 0.00 97.05 > 22:10:19 all 2.54 0.00 0.24 0.20 0.00 0.11 0.00 > 0.00 0.00 96.91 > 22:10:24 all 2.26 0.00 0.25 0.22 0.00 0.10 0.00 > 0.00 0.00 97.17 > 22:10:29 all 2.12 0.00 0.27 0.24 0.00 0.09 0.00 > 0.00 0.00 97.27 > 22:10:34 all 2.56 0.00 0.26 0.26 0.00 0.12 0.00 > 0.00 0.00 96.80 > 22:10:39 all 2.16 0.00 0.23 0.24 0.00 0.10 0.00 > 0.00 0.00 97.27 > 22:10:44 all 2.37 0.00 0.26 0.22 0.00 0.10 0.00 > 0.00 0.00 97.05 > 22:10:49 all 2.25 0.00 0.25 0.61 0.00 0.09 0.00 > 0.00 0.00 96.80 > 22:10:54 all 2.41 0.00 0.28 0.22 0.00 0.10 0.00 > 0.00 0.00 96.99 > 22:10:59 all 2.83 0.00 0.29 0.19 0.00 0.12 0.00 > 0.00 0.00 96.57 This output looks fine, so it doesn't show anything concerning, so suggests the issue is somehow on the Postgres side. Did you happen to poll pg_stat_activity at the time to see whether you had lots of IPC waits? I'm wondering whether the storage layer is freezing up for a moment. Thom