Hello community, CONTEXT:
Back in October I presented the talk "Debugging active queries with mid-flight instrumented explain plans" at PGConf EU 2024 (recording: https://www.youtube.com/watch?v=6ahTb-7C05c) presenting an experimental feature that enables visualization of in progress EXPLAIN ANALYZE executions. Given the positive feedback and requests, I am sending this patch with the feature, which I am calling Progressive Explain. PROPOSAL: This proposal introduces a feature to print execution plans of active queries in an in-memory shared hash object so that other sessions can visualize them with a new view: pg_stat_progress_explain. Plans are only printed if the new GUC parameter progressive_explain is enabled. For regular queries or queries started with EXPLAIN (without ANALYZE) the plan is printed only once at the start. For instrumented runs (started via EXPLAIN ANALYZE or when auto_explain flag log_analyze is enabled), the plan is printed on a fixed interval controlled by the new GUC parameter progressive_explain_interval. This plan includes all instrumentation stats computed so far (per node rows and execution time). New view: - pg_stat_progress_explain - pid: PID of the process running the query - last_explain: timestamp when plan was last printed - explain_count: amount of times plan was printed - total_explain_time: accumulated time spend printing plans (in ms) - explain: the actual plan (limited read privileges) New GUCs: - progressive_explain: if progressive plans are printed for local session. - type: bool - default: off - context: user - progressive_explain_interval: interval between each explain print. - type: int - default: 1s - min: 10ms - context: user - progressive_explain_sample_rate: fraction of rows processed by the query until progressive_explain_interval is evaluated to print a progressive plan - type: floating point - default: 0.1 - range: (0.0 - 1.0) - context: user - progressive_explain_output_size: max output size of the plan printed in the in-memory hash table. - type: int - default: 4096 - min: 100 - context: postmaster - progressive_explain_format: format used to print the plans. - type: enum - default: text - context: user - progressive_explain_settings: controls whether information about modified configuration is added to the printed plan. - type: bool - default: off - context: user - progressive_explain_verbose: controls whether verbose details are added to the printed plan. - type: bool - default: off - context: user DEMONSTRATION: postgres=# SET progressive_explain = ON; SET postgres=# EXPLAIN ANALYZE SELECT * FROM test t1 UNION ALL SELECT * FROM test t1; postgres=# select * from pg_stat_progress_explain; -[ RECORD 1 ]------+--------------------------------------------------------------------------------------------------------------------------------------- pid | 299663 last_explain | 2024-12-29 22:40:33.016833+00 explain_count | 5 total_explain_time | 0.205 explain | Append (cost=0.00..466670.40 rows=20000160 width=37) (actual time=0.052..3372.318 rows=14013813 loops=1) | Buffers: shared hit=4288 read=112501 | -> Seq Scan on test t1 (cost=0.00..183334.80 rows=10000080 width=37) (actual time=0.052..1177.428 rows=10000000 loops=1) | Buffers: shared hit=4288 read=79046 | -> Seq Scan on test t1_1 (cost=0.00..183334.80 rows=10000080 width=37) (actual time=0.072..608.481 rows=4013813 loops=1) (current) | Buffers: shared read=33455 | IMPLEMENTATION HIGHLIGHTS: - The initial plan is printed at the end of standard_ExecutorStart if progressive_explain is enabled, for both regular queries and instrumented ones (EXPLAIN ANALYZE): /* * Start progressive explain if enabled. */ if (progressive_explain) ProgressiveExplainBegin(queryDesc); - The incremental plan print for instrumented runs uses a dedicated ExecProcNode if progressive_explain is enabled: if (node->instrument) if (progressive_explain) node->ExecProcNode = ExecProcNodeInstrExplain; else node->ExecProcNode = ExecProcNodeInstr; else node->ExecProcNode = node->ExecProcNodeReal; - ExecProcNodeInstrExplain is identical to ExecProcNodeInstr with an additional part to print plans based on a sampling logic: /* * Update progressive explain based on sampling. */ if (pg_prng_double(&pg_global_prng_state) < progressive_explain_sample_rate) ProgressiveExplainUpdate(node); That logic was added because ExecProcNodeInstrExplain is called once per row processed (a lot of times) and performing the timestamp interval check with progressive_explain_interval to decide whether to print the plan (done inside ProgressiveExplainUpdate) is expensive. Benchmarks (shared at the end of this email) show that sampling + timestamp check gives much better results than performing the timestamp check at every ExecProcNodeInstrExplain call. - The plans are stored in a shared hash object (explainArray) allocated at database start, similar to procArray. ExplainHashShmemSize() computes shared memory needed for it, based on max_connections + max_parallel_workers for the amount of elements in the array and progressive_explain_output_size for the size per element. - A memory context release callback is configured in the memory context where the query is running, being responsible for updating explainArray even when the query doesn't finish gracefully. - Instrumented plans being printed incrementally need to clone instrumentation objects to change them, so each print uses a dedicated memory context that gets released after the output is constructed. This avoids extended private memory usage: /* Dedicated memory context for the current plan being printed */ tmpCxt = AllocSetContextCreate(CurrentMemoryContext, "Progressive Explain Temporary Context", ALLOCSET_DEFAULT_SIZES); - A new version of InstrEndLoop (InstrEndLoopForce) was created that allows to be called targeting in-progress instrumented objects. Those are common when traversing the plan tree of an active query. - Column explain from pg_stat_progress_explain can only be visualized by superusers or the same role that is running the query. If none of those conditions are met, users will see "<insufficient privilege>". - For instrumented runs, the printed includes 2 per node modifiers when appropriate: <current>: the plan node currently being processed. <never executed>: a plan node not processed yet. IMPLEMENTATION OVERHEAD: When not used, the overhead added is: - One IF at standard_ExecutorStart to check if progressive_explain is enabled - For instrumented runs (EXPLAIN ANALYZE), one IF at ExecProcNodeFirst to define ExecProcNode wrapper BENCHMARKS: Performed 3 scenarios of benchmarks: A) Comparison between unpatched PG18, patched with progressive explain disabled and patched with feature enabled globally (all queries printing the plan at query start: - PG18 without patch: postgres@ip-172-31-39-191:~$ /usr/local/pgsql/bin/pgbench -S -n -T 120 -c 30 number of transactions actually processed: 2173978 latency average = 1.655 ms tps = 18127.977529 (without initial connection time) - PG18 with patch: -- progressive_explain = off postgres@ip-172-31-39-191:~$ /usr/local/pgsql/bin/pgbench -S -n -T 120 -c 30 number of transactions actually processed: 2198806 latency average = 1.636 ms tps = 18333.157809 (without initial connection time) -- progressive_explain = on (prints plan only once per query) postgres@ip-172-31-39-191:~$ /usr/local/pgsql/bin/pgbench -S -n -T 120 -c 30 number of transactions actually processed: 2047459 latency average = 1.756 ms tps = 17081.477199 (without initial connection time) B) EXPLAIN ANALYZE performance with different progressive_explain_interval settings in patched: -- progressive_explain = off postgres@ip-172-31-39-191:~$ /usr/local/pgsql/bin/pgbench -n -T 120 -c 1 -f script.sql number of transactions actually processed: 27 latency average = 4492.845 ms -- progressive_explain = on -- progressive_explain_interval = 1s (default) -- progressive_explain_sample_rate = 0.01 (default) postgres@ip-172-31-39-191:~$ /usr/local/pgsql/bin/pgbench -n -T 300 -c 1 -f script.sql number of transactions actually processed: 26 latency average = 4656.067 ms -- progressive_explain = on -- progressive_explain_interval = 10ms -- progressive_explain_sample_rate = 0.01 (default) postgres@ip-172-31-39-191:~$ /usr/local/pgsql/bin/pgbench -n -T 300 -c 1 -f script.sql number of transactions actually processed: 26 latency average = 4785.608 ms C) EXPLAIN ANALYZE performance in patched with and without progressive_explain_sample_rate, ie, sampling with 2 different values and also no sampling logic: -- progressive_explain = off postgres@ip-172-31-39-191:~$ /usr/local/pgsql/bin/pgbench -n -T 120 -c 1 -f script.sql number of transactions actually processed: 27 latency average = 4492.845 ms -- progressive_explain = on -- progressive_explain_interval = 1s (default) -- progressive_explain_sample_rate = 0.01 (default) postgres@ip-172-31-39-191:~$ /usr/local/pgsql/bin/pgbench -n -T 120 -c 1 -f script.sql number of transactions actually processed: 26 latency average = 4656.067 ms -- progressive_explain = on -- progressive_explain_interval = 1s (default) -- progressive_explain_sample_rate = 1 postgres@ip-172-31-39-191:~$ /usr/local/pgsql/bin/pgbench -n -T 120 -c 1 -f script.sql number of transactions actually processed: 19 latency average = 6432.902 ms -- progressive_explain = on -- progressive_explain_interval = 1s (default) -- NO SAMPLING LOGIC postgres@ip-172-31-39-191:~$ /usr/local/pgsql/bin/pgbench -n -T 120 -c 1 -f script.sql number of transactions actually processed: 21 latency average = 5864.820 ms BENCHMARK RESULTS: It definitely needs more testing but preliminary results show that: - From (A) we see that the patch adds negligible overhead when the feature is not used. Enabling globally reduces overall TPS as all queries are spending time printing the plan. The idea is to enable progressive_explain on a per-need basis, only to a subset of sessions that need it. - From (B) we see that using progressive explain slightly increases total execution time. Difference between using progressive_explain_interval set to 1s (plan printed 4 times per query in the test) and to 10ms (plan printed ~460 times per query in the test) is very small. The actual overhead appears when changing progressive_explain_sample_rate. - From (C) we see that progressive_explain_sample_rate with a low value (default 0.01) performs better than not using sampling or using progressive_explain_sample_rate = 1. So the overhead of having the sampling logic is much lower than not sampling at all. TESTS: Currently working on tests for a second version of the patch. DOCUMENTATION: Added documentation for the new view pg_stat_progress_explain, new GUCs and a new item in section 14.1: 14.1. Using EXPLAIN 14.1.1. EXPLAIN Basics 14.1.2. EXPLAIN ANALYZE 14.1.3. Caveats 14.1.4. Progressive EXPLAIN FURTHER DISCUSSION: Considering that this patch introduces a new major feature with several new components (view, GUCs, etc), there is open room for discussion such as: - Do the columns in pg_stat_progress_explain make sense? Are we missing or adding unnecessary columns? - Do the new GUCs make sense and are their default values appropriate? - Do we want progressive explain to print plans of regular queries started without EXPLAIN if progressive_explain is enabled or should the feature be restricted to instrumented queries (EXPLAIN ANALYZE)? - Is the size of explainHash based on max_connections + max_parallel_workers large enough or are there other types of backends that use the executor and will print plans too? Regards, Rafael Castro.
v1-0001-Proposal-for-progressive-explains.patch
Description: Binary data