Greetings hackers,
I have I hope interesting observation (and nano patch proposal) on system where
statistics freshness is a critical factor. Autovacuum/autogathering statistics
was tuned to be pretty very aggressive:
autovacuum_vacuum_cost_delay=0 (makes autovacuum_vacuum_cost_limit irrelevant)
autovacuum_naptime=1s
autovacuum_max_workers=4
Some critical table partitions are configured with:
autovacuum_analyze_scale_factor=0.001, autovacuum_analyze_threshold=50000 to
force auto-analyze jobs pretty often. The interesting logs are like this:
automatic analyze of table "t1" system usage: CPU: user: 37.52 s, system: 23.01
s, elapsed: 252.14 s
automatic analyze of table "t2" system usage: CPU: user: 38.70 s, system: 22.63
s, elapsed: 317.33 s
automatic analyze of table "t2" system usage: CPU: user: 39.38 s, system: 21.43
s, elapsed: 213.58 s
automatic analyze of table "t1" system usage: CPU: user: 37.91 s, system: 24.49
s, elapsed: 229.45 s
and this is root-cause of my question. As you can see there is huge 3x-4x time
discrepancy between "elapsed" and user+system which is strange at least for me
as there should be no waiting (autovacuum_vacuum_cost_delay=0). According to
various tools it is true: Time was wasted somewhere else, but not in the
PostgreSQL analyze. The ps(1) and pidstat(1) also report the same for the
worker:
06:56:12 AM PID %usr %system %guest %CPU CPU Command
06:56:13 AM 114774 8.00 10.00 0.00 18.00 18 postgres
06:56:14 AM 114774 8.00 11.00 0.00 19.00 15 postgres
06:56:15 AM 114774 5.00 13.00 0.00 18.00 18 postgres
06:56:17 AM PID kB_rd/s kB_wr/s kB_ccwr/s Command
06:56:18 AM 114774 63746.53 0.00 0.00 postgres
06:56:19 AM 114774 62896.00 0.00 0.00 postgres
06:56:20 AM 114774 62920.00 0.00 0.00 postgres
One could argue that such autoanalyze worker could perform 5x more of work
(%CPU -> 100%) here. The I/O system is not performing a lot (total = 242MB/s
reads@22k IOPS, 7MB/s writes@7k IOPS, with service time 0.04ms), although
reporting high utilization I'm pretty sure it could push much more. There can
be up to 3x-4x of such 70-80MB/s analyzes in parallel (let's say 300MB/s for
statistics collection alone).
According to various gdb backtraces, a lot of time is spent here:
#0 0x00007f98cdfc9f73 in __pread_nocancel () from /lib64/libpthread.so.0
#1 0x0000000000741a16 in pread (__offset=811253760, __nbytes=8192,
__buf=0x7f9413ab7280, __fd=<optimized out>) at /usr/include/bits/unistd.h:84
#2 FileRead (file=<optimized out>, buffer=0x7f9413ab7280 "\037\005",
amount=8192, offset=811253760, wait_event_info=167772173) at fd.c:1883
#3 0x0000000000764b8f in mdread (reln=<optimized out>, forknum=<optimized
out>, blocknum=19890902, buffer=0x7f9413ab7280 "\037\005") at md.c:596
#4 0x000000000073d69b in ReadBuffer_common (smgr=<optimized out>,
relpersistence=112 'p', forkNum=MAIN_FORKNUM, blockNum=19890902,
mode=RBM_NORMAL, strategy=0x1102278, hit=0x7fffba7e2d4f)
at bufmgr.c:897
#5 0x000000000073e27e in ReadBufferExtended (reln=0x7f98c0c9ded0,
forkNum=MAIN_FORKNUM, blockNum=19890902, mode=<optimized out>,
strategy=<optimized out>) at bufmgr.c:665
#6 0x00000000004c2e2f in heapam_scan_analyze_next_block (scan=<optimized out>,
blockno=<optimized out>, bstrategy=<optimized out>) at heapam_handler.c:998
#7 0x0000000000597de1 in table_scan_analyze_next_block (bstrategy=<optimized
out>, blockno=<optimized out>, scan=0x10c8098) at
../../../src/include/access/tableam.h:1462
#8 acquire_sample_rows (onerel=0x7f98c0c9ded0, elevel=13, rows=0x1342e08,
targrows=1500000, totalrows=0x7fffba7e3160, totaldeadrows=0x7fffba7e3158) at
analyze.c:1048
#9 0x0000000000596a50 in do_analyze_rel (onerel=0x7f98c0c9ded0,
params=0x10744e4, va_cols=0x0, acquirefunc=0x597ca0 <acquire_sample_rows>,
relpages=26763525, inh=false,
in_outer_xact=false, elevel=13) at analyze.c:502
[..]
#12 0x00000000006e76b4 in autovacuum_do_vac_analyze (bstrategy=0x1102278,
tab=<optimized out>) at autovacuum.c:3118
[..]
The interesting thing that targrows=1.5mlns and that blocks are accessed (as
expected) in sorted order:
Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, blockno=19890910,
bstrategy=0x1102278) at heapam_handler.c:984
Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, blockno=19890912,
bstrategy=0x1102278) at heapam_handler.c:984
Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, blockno=19890922,
bstrategy=0x1102278) at heapam_handler.c:984
Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, blockno=19890935,
bstrategy=0x1102278) at heapam_handler.c:984
Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, blockno=19890996,
bstrategy=0x1102278) at heapam_handler.c:984
And probably this explains the discrepancy, perf with CPU usage reporting shows
a lot of frames waiting on I/O on readaheads done by ext4, trimmed for clarity:
# Children Self sys usr Command Shared Object Symbol
63.64% 0.00% 0.00% 0.00% postgres [kernel.kallsyms] [k]
entry_SYSCALL_64_after_hwframe
---entry_SYSCALL_64_after_hwframe
do_syscall_64
|--59.66%--sys_pread64
| vfs_read
| --59.09%--__vfs_read
| --58.24%--generic_file_read_iter
| |--47.44%--ondemand_readahead
| |
--46.88%--__do_page_cache_readahead
| |
|--32.67%--ext4_mpage_readpages
| | |
|--16.76%--submit_bio
| |
|--10.23%--blk_finish_plug
[..]
63.64% 1.99% 1.99% 0.00% postgres [kernel.kallsyms] [k]
do_syscall_64
|--61.65%--do_syscall_64
| |--59.66%--sys_pread64
| | vfs_read
| | --59.09%--__vfs_read
| | --58.24%--generic_file_read_iter
| |
|--47.44%--ondemand_readahead
| | |
--46.88%--__do_page_cache_readahead
61.36% 0.00% 0.00% 0.00% postgres postgres [.]
FileRead
---FileRead
__pread_nocancel
--60.51%--entry_SYSCALL_64_after_hwframe
do_syscall_64
--59.66%--sys_pread64
vfs_read
--59.09%--__vfs_read
--58.24%--generic_file_read_iter
|--47.44%--ondemand_readahead
|
--46.88%--__do_page_cache_readahead
61.36% 0.85% 0.00% 0.85% postgres libpthread-2.17.so [.]
__pread_nocancel
|--60.51%--__pread_nocancel
| entry_SYSCALL_64_after_hwframe
| do_syscall_64
| --59.66%--sys_pread64
| vfs_read
| --59.09%--__vfs_read
| --58.24%--generic_file_read_iter
|
|--47.44%--ondemand_readahead
| |
--46.88%--__do_page_cache_readahead
59.66% 0.00% 0.00% 0.00% postgres [kernel.kallsyms] [k]
sys_pread64
---sys_pread64
vfs_read
--59.09%--__vfs_read
--58.24%--generic_file_read_iter
|--47.44%--ondemand_readahead
|
--46.88%--__do_page_cache_readahead
|
|--32.67%--ext4_mpage_readpages
[..]
Perf --no-children also triple confirms that there isn't any function that is
burning a lot inside the worker:
# Overhead sys usr Command Shared Object Symbol
5.40% 0.00% 5.40% postgres [vdso] [.]
__vdso_clock_gettime
5.11% 0.00% 5.11% postgres postgres [.]
acquire_sample_rows
---acquire_sample_rows
3.98% 0.00% 3.98% postgres postgres [.]
heapam_scan_analyze_next_tuple
---heapam_scan_analyze_next_tuple
3.69% 3.69% 0.00% postgres [kernel.kallsyms] [k]
pvclock_clocksource_read
My questions are:
a) does anybody know if it is expected that getrusage() doesn't include
readahead times as current thread system time ? (I don't know by may be
performed by other kernel threads?) ru_stime is defined as "This is the total
amount of time spent executing in kernel mode". Maybe the "executing" is the
keyword here? (waiting != executing?)
b) initially I've wanted to add a new pg_rusage_show_verbose() that would also
add ru_inblock, but that wouldn't add much value to the end user. Also adding
another timing directly around table_scan_analyze_next_block() seems like the
bad idea as it involves locking underneah. So I've tried the most easy approach
to simply log $pgStatBlockReadTime as strictly I/O time spent in pread()
(ReadBuffer_common() already measures time). The attached patch for
PgSQL14-devel in heavy I/O conditions (with track_io_timings=on) logs the
following:
"LOG: automatic analyze of table "test.public.t1_default" system usage: IO
read time 0.69 s, CPU: user: 0.18 s, system: 0.13 s, elapsed: 0.92 s"
my interpretation would be that IO reading time was most limiting factor (69/92
= 75%), but *CPU* on kernel side was just 13s. It could give the enduser/DBA
the information needed, the information where's the bottleneck given the
autovacuum_vacuum_cost_delay=0. In autovacuum_vacuum_cost_delay>0 maybe it
would make sense to include also time spent on sleeping?
c) I'm curious if anybody has any I/O related insights into analyze.c
processing especially related to readaheads? E.g. maybe disabling readahead
would help for PostgreSQL analyze.c usecase on NVMe? Is it worth given that
only x% of blocks are needed? The only option I'm aware would be to e.g.
hash-partition the table (to introduce parallelism by autovacuums and enable
even workers). Any hints or comments?
All of the above observations from PostgreSQL 12.4 on Linux kernel 4.14 with
ext4/striped dm with 3x-4x NVMEs.
-Jakub Wartak.
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 8af12b5c6b..fea1bd6f44 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -312,6 +312,7 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
Oid save_userid;
int save_sec_context;
int save_nestlevel;
+ PgStat_Counter startblockreadtime = 0;
if (inh)
ereport(elevel,
@@ -347,6 +348,7 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
if (IsAutoVacuumWorkerProcess() && params->log_min_duration >= 0)
{
pg_rusage_init(&ru0);
+ startblockreadtime = pgStatBlockReadTime;
if (params->log_min_duration > 0)
starttime = GetCurrentTimestamp();
}
@@ -686,10 +688,11 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
TimestampDifferenceExceeds(starttime,
GetCurrentTimestamp(),
params->log_min_duration))
ereport(LOG,
- (errmsg("automatic analyze of table
\"%s.%s.%s\" system usage: %s",
+ (errmsg("automatic analyze of table
\"%s.%s.%s\" system usage: IO read time %.2f s, %s",
get_database_name(MyDatabaseId),
get_namespace_name(RelationGetNamespace(onerel)),
RelationGetRelationName(onerel),
+ (double)
(pgStatBlockReadTime - startblockreadtime)/1000000,
pg_rusage_show(&ru0))));
}