Re: postgres_fdw fails to see that array type belongs to extension
Hi,

I realized that ALTER EXTENSION foo ADD TYPE _bar does pretty much the same via ExecAlterExtensionContentsStmt(). So the code in the patch seems fine.

On 1/8/24 12:21, David Geier wrote:
The attached patch just adds a second dependency between the array type and the extension, using recordDependencyOnCurrentExtension(). It seems that the other internal dependency on the element type must stay? If that seems reasonable I can add a test to modules/test_extensions. Can extensions in contrib use a test extension in their own tests? It looks like postgres_fdw doesn't test any of the shippability logic.

-- David Geier (ServiceNow)
Re: postgres_fdw fails to see that array type belongs to extension
Hi,

On 12/27/23 18:38, Tom Lane wrote:
Hmm. It seems odd that if an extension defines a type, the type is listed as a member of the extension but the array type is not. That makes it look like the array type is an externally-created thing that happens to depend on the extension, when it's actually part of the extension. I'm surprised we've not run across other misbehaviors traceable to that.

Agreed.

Of course, fixing it like that leads to needing to change the contents of pg_depend, so it wouldn't be back-patchable. But it seems like the best way in the long run.

The attached patch just adds a second dependency between the array type and the extension, using recordDependencyOnCurrentExtension(). It seems that the other internal dependency on the element type must stay? If that seems reasonable I can add a test to modules/test_extensions. Can extensions in contrib use a test extension in their own tests? It looks like postgres_fdw doesn't test any of the shippability logic.

-- David Geier (ServiceNow)

From de23a4e9f1b0620a5204594139568cdcb3d57885 Mon Sep 17 00:00:00 2001
From: David Geier
Date: Mon, 8 Jan 2024 10:58:21 +0100
Subject: [PATCH] Fix dependency of array of type owned by extension

---
 src/backend/catalog/pg_type.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/src/backend/catalog/pg_type.c b/src/backend/catalog/pg_type.c
index d7167108fb..3544a3af1a 100644
--- a/src/backend/catalog/pg_type.c
+++ b/src/backend/catalog/pg_type.c
@@ -724,8 +724,10 @@ GenerateTypeDependencies(HeapTuple typeTuple,
 	if (OidIsValid(typeForm->typelem))
 	{
 		ObjectAddressSet(referenced, TypeRelationId, typeForm->typelem);
-		recordDependencyOn(&myself, &referenced,
-						   isImplicitArray ? DEPENDENCY_INTERNAL : DEPENDENCY_NORMAL);
+		recordDependencyOn(&myself, &referenced, isImplicitArray ? DEPENDENCY_INTERNAL : DEPENDENCY_NORMAL);
+
+		if (makeExtensionDep)
+			recordDependencyOnCurrentExtension(&myself, rebuild);
 	}
 }
-- 
2.39.2
postgres_fdw fails to see that array type belongs to extension
Hi hackers,

We use postgres_fdw to connect two databases. Both DBs have an extension installed which provides a custom string data type. Our extension is known to the FDW as we created the foreign server with our extension listed in the "extensions" option. The filter clause of the query SELECT * FROM test WHERE col = 'foo' OR col = 'bar' is pushed down to the remote, while the filter clause of the semantically equivalent query SELECT * FROM test WHERE col IN ('foo', 'bar') is not. I traced this down to getExtensionOfObject() called from lookup_shippable(). getExtensionOfObject() doesn't recurse but only checks first-level dependencies, and only checks for extension dependencies. However, the IN operator takes an array of our custom data type as argument (the array type's name is typically prefixed with _ in pg_type). This array type depends on our extension only indirectly, via the custom data type, i.e. in two steps, which postgres_fdw doesn't see. Therefore, postgres_fdw doesn't allow push-down of the IN.

Thoughts?

-- David Geier (ServiceNow)
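[Editorial note: the two-step dependency described above can be inspected directly in pg_depend. A sketch, assuming a hypothetical extension "myext" defining a type "mytype" (all names invented for illustration; deptype 'e' marks an extension membership, 'i' an internal dependency):

-- The element type shows a deptype = 'e' row pointing at the extension;
-- the auto-created array type "_mytype" only shows a deptype = 'i' row
-- pointing at the element type, which lookup_shippable() doesn't follow.
SELECT t.typname, d.deptype, d.refclassid::regclass AS refclass, d.refobjid
FROM pg_depend d
JOIN pg_type t ON t.oid = d.objid AND d.classid = 'pg_type'::regclass
WHERE t.typname IN ('mytype', '_mytype');
]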
Fix assertion in autovacuum worker
Hi hackers,

PostgreSQL hit the following assertion during error cleanup, after running out of memory in dsa_allocate0():

void
dshash_detach(dshash_table *hash_table)
{
	ASSERT_NO_PARTITION_LOCKS_HELD_BY_ME(hash_table);

called from pgstat_shutdown_hook(), called from shmem_exit(), called from proc_exit(), called from the exception handler. The partition locks got previously acquired by:

AutoVacWorkerMain()
  pgstat_report_autovac()
    pgstat_get_entry_ref_locked()
      pgstat_get_entry_ref()
        dshash_find_or_insert()
          resize()

resize() locks all partitions so the hash table can safely be resized. Then it calls dsa_allocate0(). If dsa_allocate0() fails to allocate, it errors out. The exception handler calls proc_exit(), which normally calls LWLockReleaseAll() via AbortTransaction(), but only if there's an active transaction. However, pgstat_report_autovac() runs before a transaction got started and hence LWLockReleaseAll() doesn't run before pgstat_shutdown_hook() is called. See the attached patch for an attempt to fix this issue.

-- David Geier (ServiceNow)

From 5580e3680b2211235e4bc2b5dcbfe6b4f5b8eee5 Mon Sep 17 00:00:00 2001
From: David Geier
Date: Tue, 28 Nov 2023 18:52:46 +0100
Subject: [PATCH] Fix autovacuum cleanup on error

---
 src/backend/postmaster/autovacuum.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index f929b62e8a..5de55649d9 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -1584,6 +1584,9 @@ AutoVacWorkerMain(int argc, char *argv[])
 		/* Report the error to the server log */
 		EmitErrorReport();
 
+		/* Make sure all locks are released so assertions don't hit in at-exit callbacks */
+		LWLockReleaseAll();
+
 		/*
 		 * We can now go away.  Note that because we called InitProcess, a
 		 * callback was registered to do ProcKill, which will clean up
-- 
2.39.2
Re: how to do profile for pg?
Hi, This talk from Andres seems to have some relevant information for you: https://www.youtube.com/watch?v=HghP4D72Noc -- David Geier (ServiceNow)
Re: Parallel Bitmap Heap Scan reports per-worker stats in EXPLAIN ANALYZE
Hi Dmitry,

Thanks for looking at the patch and sorry for the long line.

On 3/17/23 21:14, Dmitry Dolgov wrote:
The idea sounds reasonable to me, but I have to admit snapshot_and_stats implementation looks awkward. Maybe it would be better to have a separate structure field for both stats and snapshot, which will be set to point to a corresponding place in the shared FAM e.g. when the worker is getting initialized? shm_toc_allocate mentions BUFFERALIGN to handle possibility of some atomic operations needing it, so I guess that would have to be an alignment in this case as well. Probably another option would be to allocate two separate pieces of shared memory, which resolves questions like proper alignment, but annoyingly will require an extra lookup and a new key.

I considered the other options and it seems to me none of them is particularly superior. All of them have pros and cons, with the cons mostly outweighing the pros. Let me quickly elaborate:

1. Use multiple shm_toc entries: Shared state is split into multiple pieces. Extra pointers in BitmapHeapScanState are needed to point at the split-out data. BitmapHeapScanState already has a shared_info member, which is not a pointer to the shared memory but a pointer to leader-local data allocated to store the instrumentation data from the workers. This is confusing but at least consistent with how it's done in other places (e.g. execSort.c, nodeHash.c, nodeIncrementalSort.c). Having another pointer there which points to the shared memory makes it even more confusing. If we go this way we would have e.g. shared_info_copy and shared_info members in BitmapHeapScanState.

2. Store two extra pointers to the shared FAM entries in BitmapHeapScanState: IMHO, that is the better alternative of (1) as it doesn't need an extra TOC entry but comes with the same confusion of multiple pointers to SharedBitmapHeapScanInfo in BitmapHeapScanState. But maybe that's not too bad?

3. Solution in initial patch (use two functions to obtain pointers where/when needed): Avoids the need for another pointer in BitmapHeapScanState at the cost/ugliness of having to call the helper functions.

Another, not yet discussed, option I can see working is:

4. Allocate a fixed amount of memory for the instrumentation stats based on MAX_PARALLEL_WORKER_LIMIT: MAX_PARALLEL_WORKER_LIMIT is 1024 and used as the limit of the max_parallel_workers GUC. This way MAX_PARALLEL_WORKER_LIMIT * sizeof(BitmapHeapScanInstrumentation) = 1024 * 8 = 8192 bytes would be allocated. To cut this down in half we could additionally change the type of lossy_pages and exact_pages from long to uint32. Only the memory actually needed would have to be initialized; the remaining unused memory would remain untouched so as not to waste cycles.

My first preference is the new option (4). My second preference is option (1). What's your take?

-- David Geier (ServiceNow)
Re: Eliminate redundant tuple visibility check in vacuum
Hi, On 9/1/23 03:25, Peter Geoghegan wrote: On Thu, Aug 31, 2023 at 3:35 PM Melanie Plageman wrote: Any inserting transaction which aborts after heap_page_prune()'s visibility check will now be of no concern to lazy_scan_prune(). Since we don't do the visibility check again, we won't find the tuple HEAPTUPLE_DEAD and thus won't have the problem of adding the tuple to the array of tuples for vacuum to reap. This does mean that we won't reap and remove tuples whose inserting transactions abort right after heap_page_prune()'s visibility check. But, this doesn't seem like an issue. It's definitely not an issue. The loop is essentially a hacky way of getting a consistent picture of which tuples should be treated as HEAPTUPLE_DEAD, and which tuples need to be left behind (consistent at the level of each page and each HOT chain, at least). Making that explicit seems strictly better. They may not be removed until the next vacuum, but ISTM it is actually worse to pay the cost of re-pruning the whole page just to clean up that one tuple. Maybe if that renders the page all visible and we can mark it as such in the visibility map -- but that seems like a relatively narrow use case. The chances of actually hitting the retry are microscopic anyway. It has nothing to do with making sure that dead tuples from aborted tuples get removed for its own sake, or anything. Rather, the retry is all about making sure that all TIDs that get removed from indexes can only point to LP_DEAD stubs. Prior to Postgres 14, HEAPTUPLE_DEAD tuples with storage would very occasionally be left behind, which made life difficult in a bunch of other places -- for no good reason. That makes sense and seems like a much better compromise. Thanks for the explanations. Please update the comment to document the corner case and how we handle it. -- David Geier (ServiceNow)
Re: Eliminate redundant tuple visibility check in vacuum
Hi Melanie,

On 8/31/23 02:59, Melanie Plageman wrote:
I created a large table and then updated a tuple on every page in the relation and vacuumed it. I saw a consistent slight improvement in vacuum execution time. I profiled a bit with perf stat as well. The difference is relatively small for this kind of example, but I can work on a more compelling, realistic example. I think eliminating the retry loop is also useful, as I have heard that users have had to cancel vacuums which were in this retry loop for too long. Just to provide a specific test case, if you create a small table like this

create table foo (a int, b int, c int) with(autovacuum_enabled=false);
insert into foo select i, i, i from generate_series(1, 1000) i;

And then vacuum it. I find that with my patch applied I see a consistent ~9% speedup (averaged across multiple runs).

master: ~533ms
patch: ~487ms

And in the profile, with my patch applied, you notice less time spent in HeapTupleSatisfiesVacuumHorizon()

master:
11.83% postgres postgres [.] heap_page_prune
11.59% postgres postgres [.] heap_prepare_freeze_tuple
 8.77% postgres postgres [.] lazy_scan_prune
 8.01% postgres postgres [.] HeapTupleSatisfiesVacuumHorizon
 7.79% postgres postgres [.] heap_tuple_should_freeze

patch:
13.41% postgres postgres [.] heap_prepare_freeze_tuple
 9.88% postgres postgres [.] heap_page_prune
 8.61% postgres postgres [.] lazy_scan_prune
 7.00% postgres postgres [.] heap_tuple_should_freeze
 6.43% postgres postgres [.] HeapTupleSatisfiesVacuumHorizon

Thanks a lot for providing additional information and the test case. I tried it on a release build and I also see a 10% speed-up. I reset the visibility map between VACUUM runs, see:

CREATE EXTENSION pg_visibility;
CREATE TABLE foo (a INT, b INT, c INT) WITH(autovacuum_enabled=FALSE);
INSERT INTO foo SELECT i, i, i FROM generate_series(1, 1000) i;
VACUUM foo;
SELECT pg_truncate_visibility_map('foo');
VACUUM foo;
SELECT pg_truncate_visibility_map('foo');
VACUUM foo;
...
The first patch, which refactors the code so we can pass the result of the visibility checks to the caller, looks good to me. Regarding the 2nd patch (disclaimer: I'm not too familiar with that area of the code): I don't completely understand why the retry loop is not needed anymore and how you now detect/handle the possible race condition? It can still happen that an aborting transaction changes the state of a row after heap_page_prune() looked at that row. Would that case now not be ignored? -- David Geier (ServiceNow)
Re: Eliminate redundant tuple visibility check in vacuum
Hi Melanie, On 8/29/23 01:49, Melanie Plageman wrote: While working on a set of patches to combine the freeze and visibility map WAL records into the prune record, I wrote the attached patches reusing the tuple visibility information collected in heap_page_prune() back in lazy_scan_prune(). heap_page_prune() collects the HTSV_Result for every tuple on a page and saves it in an array used by heap_prune_chain(). If we make that array available to lazy_scan_prune(), it can use it when collecting stats for vacuum and determining whether or not to freeze tuples. This avoids calling HeapTupleSatisfiesVacuum() again on every tuple in the page. It also gets rid of the retry loop in lazy_scan_prune(). How did you test this change? Could you measure any performance difference? If so could you provide your test case? -- David Geier (ServiceNow)
Re: Let's make PostgreSQL multi-threaded
Hi,

On 8/11/23 14:05, Merlin Moncure wrote:
On Thu, Jul 27, 2023 at 8:28 AM David Geier wrote: Hi, On 6/7/23 23:37, Andres Freund wrote: > I think we're starting to hit quite a few limits related to the process model, particularly on bigger machines. The overhead of cross-process context switches is inherently higher than switching between threads in the same process - and my suspicion is that that overhead will continue to increase. Once you have a significant number of connections we end up spending a *lot* of time in TLB misses, and that's inherent to the process model, because you can't share the TLB across processes. Another problem I haven't seen mentioned yet is the excessive kernel memory usage because every process has its own set of page table entries (PTEs). Without huge pages the amount of wasted memory can be huge if shared buffers are big.

Hm, noted this upthread, but asking again, does this help/benefit interactions with the operating system make oom kill situations less likely? These things are the bane of my existence, and I'm having a hard time finding a solution that prevents them other than running pgbouncer and lowering max_connections, which adds complexity. I suspect I'm not the only one dealing with this. What's really scary about these situations is they come without warning. Here's a pretty typical example per sar -r. The conjecture here is that lots of idle connections make the server appear to have less memory available than it looks, and sudden transient demands can cause it to destabilize.

It does, in the sense that your server will have more memory available in case you have many long-lived connections around. Every connection has less kernel memory overhead, if you will. Of course, even then a runaway query will be able to invoke the OOM killer. The unfortunate thing with the OOM killer is that, in my experience, it often kills the checkpointer.
That's because the checkpointer will touch all of shared buffers over time which makes it likely to get selected by the OOM killer. Have you tried disabling memory overcommit? -- David Geier (ServiceNow)
Re: Let's make PostgreSQL multi-threaded
Hi, On 6/7/23 23:37, Andres Freund wrote: I think we're starting to hit quite a few limits related to the process model, particularly on bigger machines. The overhead of cross-process context switches is inherently higher than switching between threads in the same process - and my suspicion is that that overhead will continue to increase. Once you have a significant number of connections we end up spending a *lot* of time in TLB misses, and that's inherent to the process model, because you can't share the TLB across processes. Another problem I haven't seen mentioned yet is the excessive kernel memory usage because every process has its own set of page table entries (PTEs). Without huge pages the amount of wasted memory can be huge if shared buffers are big. For example with 256 GiB of used shared buffers a single process needs about 256 MiB for the PTEs (for simplicity I ignored the tree structure of the page tables and just took the number of 4k pages times 4 bytes per PTE). With 512 connections, which is not uncommon for machines with many cores, a total of 128 GiB of memory is just spent on page tables. We used non-transparent huge pages to work around this limitation but they come with plenty of provisioning challenges, especially in cloud infrastructures where different services run next to each other on the same server. Transparent huge pages have unpredictable performance disadvantages. Also if some backends only use shared buffers sparsely, memory is wasted for the remaining, unused range inside the huge page. -- David Geier (ServiceNow)
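[Editorial note: the back-of-the-envelope arithmetic above can be checked with a quick query (plain arithmetic only; it keeps the same simplification as the text: ignore the page-table tree and count 4 bytes per PTE for 4 KiB pages):

-- Per-process page-table overhead for 256 GiB of shared buffers:
-- (256 GiB / 4 KiB pages) * 4 bytes per PTE
SELECT pg_size_pretty((256::bigint * 1024 * 1024 * 1024 / 4096) * 4);        -- 256 MB

-- The same, multiplied by 512 backend processes:
SELECT pg_size_pretty((256::bigint * 1024 * 1024 * 1024 / 4096) * 4 * 512);  -- 128 GB
]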
Re: pg_stat_statements and "IN" conditions
Hi,

Seems like supporting only constants is a good starting point. The only thing that is likely confusing for users is that NUMERICs (and potentially constants of other types) are unsupported. Wouldn't it be fairly simple to support them via something like the following? is_const(element) || (is_coercion(element) && is_const(element->child))

It definitely makes sense to implement that, although I don't think it's going to be acceptable to do that via directly listing conditions an element has to satisfy. It probably has to be more flexible, since we would like to extend it in the future. My plan is to address this in a follow-up patch, when the main mechanism is approved. Would you agree with this approach?

I still think it's counterintuitive and I'm pretty sure people would even report this as a bug, because not knowing about the difference in internal representation you would expect NUMERICs to work the same way other constants work. If anything, we would have to document it. Can't we do something pragmatic and have something like IsMergableInElement() which for now only supports constants and later can be extended? Or what exactly do you mean by "more flexible"?

-- David Geier (ServiceNow)
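[Editorial note: a sketch of the user-visible inconsistency being discussed (table and column names are invented; which literal forms merge depends on the patch version, so this is only illustrative):

-- Plain integer literals are bare Const nodes, so IN lists like these
-- would collapse into a single normalized pg_stat_statements entry:
SELECT * FROM t WHERE a IN (1, 2, 3);
SELECT * FROM t WHERE a IN (4, 5, 6, 7);

-- NUMERIC literals are represented differently internally, so the
-- initial version of the patch would keep each variant verbatim:
SELECT * FROM t WHERE b IN (1.5, 2.5);
]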
Re: Parallel Bitmap Heap Scan reports per-worker stats in EXPLAIN ANALYZE
Hi,

On 1/20/23 09:34, David Geier wrote:
EXPLAIN ANALYZE for parallel Bitmap Heap Scans currently only reports the number of heap blocks processed by the leader. It's missing the per-worker stats. The attached patch adds that functionality in the spirit of e.g. Sort or Memoize. Here is a simple test case and the EXPLAIN ANALYZE output with and without the patch:

Attached is a rebased version of the patch. I would appreciate someone taking a look. As background: the change doesn't come out of thin air. We repeatedly drew the wrong conclusions in our query analysis because we assumed that the reported block counts include the workers. If no one objects I would also register the patch at the commit fest. The patch is passing cleanly on CI.

-- David Geier (ServiceNow)

From de03dfb101810f051b883804947707ecddaffceb Mon Sep 17 00:00:00 2001
From: David Geier
Date: Tue, 8 Nov 2022 19:40:31 +0100
Subject: [PATCH v2] Parallel Bitmap Heap Scan reports per-worker stats

Similarly to other nodes (e.g. hash join, sort, memoize), Bitmap Heap
Scan now reports per-worker stats in the EXPLAIN ANALYZE output.
Previously only the heap block stats for the leader were reported,
which was incomplete for parallel scans.
Discussion: https://www.postgresql.org/message-id/flat/b3d80961-c2e5-38cc-6a32-61886cdf766d%40gmail.com
---
 src/backend/commands/explain.c            | 45 ++--
 src/backend/executor/execParallel.c       |  3 +
 src/backend/executor/nodeBitmapHeapscan.c | 88 +++
 src/include/executor/nodeBitmapHeapscan.h |  1 +
 src/include/nodes/execnodes.h             | 23 +-
 5 files changed, 137 insertions(+), 23 deletions(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index e57bda7b62..b5b31a763c 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -3396,26 +3396,57 @@ show_hashagg_info(AggState *aggstate, ExplainState *es)
 static void
 show_tidbitmap_info(BitmapHeapScanState *planstate, ExplainState *es)
 {
+	Assert(es->analyze);
+
 	if (es->format != EXPLAIN_FORMAT_TEXT)
 	{
 		ExplainPropertyInteger("Exact Heap Blocks", NULL,
-							   planstate->exact_pages, es);
+							   planstate->stats.exact_pages, es);
 		ExplainPropertyInteger("Lossy Heap Blocks", NULL,
-							   planstate->lossy_pages, es);
+							   planstate->stats.lossy_pages, es);
 	}
 	else
 	{
-		if (planstate->exact_pages > 0 || planstate->lossy_pages > 0)
+		if (planstate->stats.exact_pages > 0 || planstate->stats.lossy_pages > 0)
 		{
 			ExplainIndentText(es);
 			appendStringInfoString(es->str, "Heap Blocks:");
-			if (planstate->exact_pages > 0)
-				appendStringInfo(es->str, " exact=%ld", planstate->exact_pages);
-			if (planstate->lossy_pages > 0)
-				appendStringInfo(es->str, " lossy=%ld", planstate->lossy_pages);
+			if (planstate->stats.exact_pages > 0)
+				appendStringInfo(es->str, " exact=%ld", planstate->stats.exact_pages);
+			if (planstate->stats.lossy_pages > 0)
+				appendStringInfo(es->str, " lossy=%ld", planstate->stats.lossy_pages);
 			appendStringInfoChar(es->str, '\n');
 		}
 	}
+
+	if (planstate->shared_info != NULL)
+	{
+		for (int n = 0; n < planstate->shared_info->num_workers; n++)
+		{
+			BitmapHeapScanInstrumentation *si = &planstate->shared_info->sinstrument[n];
+
+			if (si->exact_pages == 0 && si->lossy_pages == 0)
+				continue;
+
+			if (es->workers_state)
+				ExplainOpenWorker(n, es);
+
+			if (es->format == EXPLAIN_FORMAT_TEXT)
+			{
+				ExplainIndentText(es);
+				appendStringInfo(es->str, "Heap Blocks: exact=%ld lossy=%ld\n",
+								 si->exact_pages, si->lossy_pages);
+			}
+			else
+			{
+				ExplainPropertyInteger("Exact Heap Blocks", NULL, si->exact_pages, es);
+				ExplainPropertyInteger("Lossy Heap Blocks", NULL, si->lossy_pages, es);
+			}
+
+			if (es->workers_state)
+				ExplainCloseWorker(n, es);
+		}
+	}
 }

 /*
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index aa3f283453..73ef4f0c0f 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -1072,6 +1072,9 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
 		case T_MemoizeState:
 			ExecMemoizeRetrieveInstrumentation((MemoizeState *) planstate);
 			break;
+		case T_BitmapHeapScanState:
+			ExecBitmapHeapRetrieveInstrumentation((BitmapHeapScanState *) planstate);
+			break;
 		default:
 			break;
 	}
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index f35df0b8bf..6d36e7922b 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -239,9 +239,9 @@ BitmapHeapNext(BitmapHeapScanState *node)
 }
Re: Performance issues with parallelism and LIMIT
Hi, On 2/8/23 11:42, Tomas Vondra wrote: On 2/1/23 14:41, David Geier wrote: Yeah, this is a pretty annoying regression. We already can hit poor behavior when matching rows are not distributed uniformly in the tables (which is what LIMIT costing assumes), and this makes it more likely to hit similar issues. A bit like when doing many HTTP requests makes it more likely to hit at least one 99% outlier. Are you talking about the use of ordering vs filtering indexes in queries where there's both an ORDER BY and a filter present (e.g. using an ordering index but then all rows passing the filter are at the end of the table)? If not, can you elaborate a bit more on that and maybe give an example. No opinion on these options, but worth a try. Alternatively, we could try the usual doubling approach - start with a low threshold (and set the latch frequently), and then gradually increase it up to the 1/4. That should work both for queries expecting only few rows and those producing a lot of data. I was thinking about this variant as well. One more alternative would be latching the leader once a worker has produced 1/Nth of the LIMIT where N is the number of workers. Both variants have the disadvantage that there are still corner cases where the latch is set too late; but it would for sure be much better than what we have today. 
I also did some profiling and - at least on my development laptop with 8 physical cores - the original example motivating the batching change is slower than when batching is disabled by commenting out:

if (force_flush || mqh->mqh_send_pending > (mq->mq_ring_size >> 2))

SET parallel_tuple_cost TO 0;
CREATE TABLE b (a int);
INSERT INTO b SELECT generate_series(1, 2);
ANALYZE b;
EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM b;

Gather (cost=1000.00..1200284.61 rows=200375424 width=4) (actual rows=2 loops=1)
  Workers Planned: 7
  Workers Launched: 7
  -> Parallel Seq Scan on b (cost=0.00..1199284.61 rows=28625061 width=4) (actual rows=2500 loops=8)

Always latch: 19055 ms
Batching: 19575 ms

If I find some time, I'll play around a bit more and maybe propose a patch.

... We would need something similar to CHECK_FOR_INTERRUPTS() which returns a NULL slot if a parallel worker is supposed to stop execution (we could e.g. check if the queue got detached). Or could we amend CHECK_FOR_INTERRUPTS() to just stop the worker gracefully if the queue got detached?

That sounds reasonable, but I'm not very familiar with the leader-worker communication, so no opinion on how it should be done.

I think an extra macro that needs to be called from dozens of places to check if parallel execution is supposed to end is the least preferred approach. I'll read up more on how CHECK_FOR_INTERRUPTS() works and if we cannot actively signal the workers that they should stop.

-- David Geier (ServiceNow)
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Hi!

On 2/14/23 13:48, David Geier wrote:
It still fails. I'll get Cirrus-CI working on my own Github fork so I can make sure it really compiles on all platforms before I submit a new version.

It took some time until Cirrus CI allowed me to run tests against my new GitHub account (there's a 3-day freeze to avoid people from getting Cirrus CI nodes to mine bitcoins :-D). Attached is now the latest patch, which passes builds, rebased on latest master. I also reviewed the first two patches a while ago in [1]. I hope we can progress with them to further reduce the size of this patch set. Beyond that: I could work on support for more OSs (e.g. starting with Windows). Is there appetite for that or do we rather want to instead start with a smaller patch?

[1] https://www.postgresql.org/message-id/3ac157f7-085d-e071-45fc-b87cd306360c%40gmail.com

-- David Geier (ServiceNow)

From d03a9be2522b0ef22fd58cbcfc95eb19ca8b2bea Mon Sep 17 00:00:00 2001
From: Andres Freund
Date: Fri, 20 Jan 2023 15:31:54 -0800
Subject: [PATCH v9 1/3] instr_time: Add INSTR_TIME_SET_SECONDS(),
 INSTR_TIME_IS_LT()

INSTR_TIME_SET_SECONDS() is useful to calculate the end of a time-bound
loop without having to convert into time units (which is costly).
INSTR_TIME_IS_LT() can be used to check the loop condition.

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/include/portability/instr_time.h | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
index cc85138e21..aab80effb0 100644
--- a/src/include/portability/instr_time.h
+++ b/src/include/portability/instr_time.h
@@ -15,6 +15,8 @@
  *
  * INSTR_TIME_IS_ZERO(t)			is t equal to zero?
  *
+ * INSTR_TIME_IS_LT(x, y)			x < y
+ *
  * INSTR_TIME_SET_ZERO(t)			set t to zero (memset is acceptable too)
  *
  * INSTR_TIME_SET_CURRENT(t)		set t to current time
@@ -22,6 +24,8 @@
  * INSTR_TIME_SET_CURRENT_LAZY(t)	set t to current time if t is zero,
  *									evaluates to whether t changed
  *
+ * INSTR_TIME_SET_SECONDS(t, s)		set t to s seconds
+ *
  * INSTR_TIME_ADD(x, y)				x += y
  *
  * INSTR_TIME_SUBTRACT(x, y)		x -= y
@@ -122,6 +126,9 @@ pg_clock_gettime_ns(void)
 #define INSTR_TIME_SET_CURRENT(t) \
 	((t) = pg_clock_gettime_ns())

+#define INSTR_TIME_SET_SECONDS(t, s) \
+	((t).ticks = NS_PER_S * (s))
+
 #define INSTR_TIME_GET_NANOSEC(t) \
 	((int64) (t).ticks)

@@ -156,6 +163,9 @@ GetTimerFrequency(void)
 #define INSTR_TIME_SET_CURRENT(t) \
 	((t) = pg_query_performance_counter())

+#define INSTR_TIME_SET_SECONDS(t, s) \
+	((t).ticks = s * GetTimerFrequency())
+
 #define INSTR_TIME_GET_NANOSEC(t) \
 	((int64) ((t).ticks * ((double) NS_PER_S / GetTimerFrequency())))

@@ -168,6 +178,8 @@ GetTimerFrequency(void)
 #define INSTR_TIME_IS_ZERO(t)	((t).ticks == 0)

+#define INSTR_TIME_IS_LT(x, y)	((x).ticks < (y).ticks)
+
 #define INSTR_TIME_SET_ZERO(t)	((t).ticks = 0)
-- 
2.34.1

From 639213b01102a7320a62ba5ed68a8e3d6a05514b Mon Sep 17 00:00:00 2001
From: Andres Freund
Date: Mon, 16 Jan 2023 11:19:11 -0800
Subject: [PATCH v9 2/3] wip: report nanoseconds in pg_test_timing

This commit also updates pg_test_timing's documentation:

- compare EXPLAIN (ANALYZE, TIMING ON/OFF) instead of comparing
  performance of a statement with/without EXPLAIN ANALYZE
- explain the 2x overhead (due to two timestamp acquisitions per row)
- remove old section about old versions of linux - I couldn't update
  the numbers, and it's old enough nobody would care
---
 doc/src/sgml/ref/pgtesttiming.sgml      | 117 ++--
 src/bin/pg_test_timing/pg_test_timing.c |  74 +--
 2 files changed, 95 insertions(+), 96 deletions(-)

diff --git a/doc/src/sgml/ref/pgtesttiming.sgml b/doc/src/sgml/ref/pgtesttiming.sgml
index a5eb3aa25e..7e0266cf58 100644
---
a/doc/src/sgml/ref/pgtesttiming.sgml +++ b/doc/src/sgml/ref/pgtesttiming.sgml @@ -93,28 +93,34 @@ PostgreSQL documentation Good results will show most (>90%) individual timing calls take less than - one microsecond. Average per loop overhead will be even lower, below 100 - nanoseconds. This example from an Intel i7-860 system using a TSC clock - source shows excellent performance: + one microsecond (1000 nanoseconds). Average per loop overhead will be even + lower, below 100 nanoseconds. This example from an Intel i9-9880H system + using a TSC clock source shows excellent performance: - Note that different units are used for the per loop time than the - histogram. The loop can have resolution within a few nanoseconds (ns), - while the individual timing calls can only resolve down to one microsecond - (us). + Note that the accuracy of the histogram entries may be lower than the + per loop time. @@ -125,24 +131,25 @@ Histogram of timing durations: When the query executor is running a statement using EXPLAIN ANALYZE, individual operations are time
Re: pg_stat_statements and "IN" conditions
Hi,

On 2/11/23 13:08, Dmitry Dolgov wrote:
On Sat, Feb 11, 2023 at 11:47:07AM +0100, Dmitry Dolgov wrote: The original version of the patch was doing all of this, i.e. handling numerics, Param nodes, RTE_VALUES. The commentary about find_const_walker in tests is referring to a part of that, that was dealing with evaluation of expression to see if it could be reduced to a constant. Unfortunately there was a significant push back from reviewers because of those features. That's why I've reduced the patch to its minimally useful version, having in mind re-implementing them as follow-up patches in the future. This is the reason as well why I left tests covering all this missing functionality -- as breadcrumbs to already discovered cases, important for the future extensions.

I'd like to elaborate on this a bit and remind about the origins of the patch, as it's lost somewhere in the beginning of the thread. The idea is not pulled out of thin air, everything is coming from our attempts to improve one particular monitoring infrastructure in a real commercial setting. Every covered use case and test in the original proposal was a result of field trials, when some application-side library or ORM was responsible for gigabytes of data in pgss, choking the monitoring agent.

Thanks for the clarifications. I didn't mean to contend the usefulness of the patch and I wasn't aware that you already jumped through the loops of handling Param, etc. Seems like supporting only constants is a good starting point. The only thing that is likely confusing for users is that NUMERICs (and potentially constants of other types) are unsupported. Wouldn't it be fairly simple to support them via something like the following? is_const(element) || (is_coercion(element) && is_const(element->child))

-- David Geier (ServiceNow)
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Hi! On 2/14/23 12:11, David Geier wrote: Hi, I think I fixed the compilation errors. It was due to a few variables being declared under #if defined(__x86_64__) && defined(__linux__) while being used also under non x86 Linux. I also removed again the code to obtain the TSC frequency under hypervisors because the TSC is usually emulated and therefore no faster than clock_gettime() anyways. So we now simply fallback to clock_gettime() on hypervisors when we cannot obtain the frequency via leaf 0x16. Beyond that I reviewed the first two patches a while ago in [1]. I hope we can progress with them to further reduce the size of this patch set. [1] https://www.postgresql.org/message-id/3ac157f7-085d-e071-45fc-b87cd306360c%40gmail.com It still fails. I'll get Cirrus-CI working on my own Github fork so I can make sure it really compiles on all platforms before I submit a new version. -- David Geier (ServiceNow)
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Hi, On 2/7/23 19:12, Andres Freund wrote: This fails to build on several platforms: https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest%2F42%2F3751 I think I fixed the compilation errors. It was due to a few variables being declared under #if defined(__x86_64__) && defined(__linux__) while being used also under non x86 Linux. I also removed again the code to obtain the TSC frequency under hypervisors because the TSC is usually emulated and therefore no faster than clock_gettime() anyways. So we now simply fallback to clock_gettime() on hypervisors when we cannot obtain the frequency via leaf 0x16. Beyond that I reviewed the first two patches a while ago in [1]. I hope we can progress with them to further reduce the size of this patch set. [1] https://www.postgresql.org/message-id/3ac157f7-085d-e071-45fc-b87cd306360c%40gmail.com -- David Geier (ServiceNow) From 36ff7f7ee14bf42ef0fb775cec428180251c3ff9 Mon Sep 17 00:00:00 2001 From: Andres Freund Date: Fri, 20 Jan 2023 15:31:54 -0800 Subject: [PATCH v8 1/3] instr_time: Add INSTR_TIME_SET_SECONDS(), INSTR_TIME_IS_LT() INSTR_TIME_SET_SECONDS() is useful to calculate the end of a time-bound loop without having to convert into time units (which is costly). INSTR_TIME_IS_LT() can be used to check the loop condition. Author: Reviewed-by: Discussion: https://postgr.es/m/ Backpatch: --- src/include/portability/instr_time.h | 12 1 file changed, 12 insertions(+) diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h index cc85138e21..aab80effb0 100644 --- a/src/include/portability/instr_time.h +++ b/src/include/portability/instr_time.h @@ -15,6 +15,8 @@ * * INSTR_TIME_IS_ZERO(t) is t equal to zero? 
* + * INSTR_TIME_IS_LT(x, y) x < y + * * INSTR_TIME_SET_ZERO(t) set t to zero (memset is acceptable too) * * INSTR_TIME_SET_CURRENT(t) set t to current time @@ -22,6 +24,8 @@ * INSTR_TIME_SET_CURRENT_LAZY(t) set t to current time if t is zero, * evaluates to whether t changed * + * INSTR_TIME_SET_SECONDS(t, s) set t to s seconds + * * INSTR_TIME_ADD(x, y)x += y * * INSTR_TIME_SUBTRACT(x, y) x -= y @@ -122,6 +126,9 @@ pg_clock_gettime_ns(void) #define INSTR_TIME_SET_CURRENT(t) \ ((t) = pg_clock_gettime_ns()) +#define INSTR_TIME_SET_SECONDS(t, s) \ + ((t).ticks = NS_PER_S * (s)) + #define INSTR_TIME_GET_NANOSEC(t) \ ((int64) (t).ticks) @@ -156,6 +163,9 @@ GetTimerFrequency(void) #define INSTR_TIME_SET_CURRENT(t) \ ((t) = pg_query_performance_counter()) +#define INSTR_TIME_SET_SECONDS(t, s) \ + ((t).ticks = s * GetTimerFrequency()) + #define INSTR_TIME_GET_NANOSEC(t) \ ((int64) ((t).ticks * ((double) NS_PER_S / GetTimerFrequency( @@ -168,6 +178,8 @@ GetTimerFrequency(void) #define INSTR_TIME_IS_ZERO(t) ((t).ticks == 0) +#define INSTR_TIME_IS_LT(x, y) ((x).ticks < (y).ticks) + #define INSTR_TIME_SET_ZERO(t) ((t).ticks = 0) -- 2.34.1 From 1200bfee5b8ebf7c68d8cefb3771a4c3523c2cea Mon Sep 17 00:00:00 2001 From: Andres Freund Date: Mon, 16 Jan 2023 11:19:11 -0800 Subject: [PATCH v8 2/3] wip: report nanoseconds in pg_test_timing This commit also updates pg_test_timing's documentation: - compare EXPLAIN (ANALYZE, TIMING ON/OFF) instead of comparing performance of of statement with/without EXPLAIN ANALYZE - explain the 2x overhead (due to two timestamp acquisitions per row) - remove old section about old versions of linux - I couldn't update the numbers, and it's old enough nobody would care --- doc/src/sgml/ref/pgtesttiming.sgml | 117 ++-- src/bin/pg_test_timing/pg_test_timing.c | 74 +-- 2 files changed, 95 insertions(+), 96 deletions(-) diff --git a/doc/src/sgml/ref/pgtesttiming.sgml b/doc/src/sgml/ref/pgtesttiming.sgml index a5eb3aa25e..7e0266cf58 100644 --- 
a/doc/src/sgml/ref/pgtesttiming.sgml +++ b/doc/src/sgml/ref/pgtesttiming.sgml @@ -93,28 +93,34 @@ PostgreSQL documentation Good results will show most (>90%) individual timing calls take less than - one microsecond. Average per loop overhead will be even lower, below 100 - nanoseconds. This example from an Intel i7-860 system using a TSC clock - source shows excellent performance: + one microsecond (1000 nanoseconds). Average per loop overhead will be even + lower, below 100 nanoseconds. This example from an Intel i9-9880H system + using a TSC clock source shows excellent performance: - Note that different units are used for the per loop time than the - histogram. The loop can have resolution within a few nanoseconds (ns), - while the individual timing calls can only resolve down to one microsecond - (us). + Note that the accuracy of the histogram entries may be lower than the + per loop time. @@ -125,24 +131,25 @@ Histogram of timing durations: When the query executor is running a statement using EXPLAIN ANALYZE, indi
Re: pg_stat_statements and "IN" conditions
Hi, On 2/9/23 16:02, Dmitry Dolgov wrote: Unfortunately, rebase is needed again due to recent changes in queryjumblefuncs ( 9ba37b2cb6a174b37fc51d0649ef73e56eae27fc ) I reviewed the last patch applied to some commit from Feb. 4th. It seems a little strange to me that with const_merge_threshold = 1, such a test case gives the same result as with const_merge_threshold = 2:

select pg_stat_statements_reset();
set const_merge_threshold to 1;
select * from test where i in (1,2,3);
select * from test where i in (1,2);
select * from test where i in (1);
select query, calls from pg_stat_statements order by query;

                query                | calls
-------------------------------------+-------
 select * from test where i in (...) |     2
 select * from test where i in ($1)  |     1

Probably const_merge_threshold = 1 should produce only "i in (...)"? Well, it's not intentional, probably I need to be more careful with off-by-one. Although I agree to a certain extent with Peter questioning Please add tests for all the corner cases. At least for (1) IN only contains a single element and (2) const_merge_threshold = 1. Beyond that:

- There's a comment about find_const_walker(). I cannot find that function anywhere. What am I missing?
- What about renaming IsConstList() to IsMergableConstList()?
- Don't you intend to use the NUMERIC data column in SELECT * FROM test_merge_numeric WHERE id IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10)? Otherwise, the test is identical to previous test cases and you're not checking for what happens with NUMERICs which are wrapped in FuncExpr because of the implicit coercion.
- Don't we want to extend IsConstList() to allow a list of all implicitly coerced constants? It's inconsistent that otherwise e.g. NUMERICs don't work.
- Typo in /* The firsts merged constant */ (first not firsts)
- Prepared statements are not supported as they contain INs with Param instead of Const nodes. 
While less likely, I've seen applications that use prepared statements in conjunction with queries generated through a UI, which ended up with tons of prepared queries with different numbers of elements in the IN clause. Not necessarily something that must go into this patch but maybe worth thinking about.
- The setting name const_merge_threshold is not very telling without knowing the context. While being a little longer, what about jumble_const_merge_threshold or queryid_const_merge_threshold or similar?
- Why do we actually only want to merge constants? Why don't we ignore the type of element in the IN and merge whatever is there? Is this because the original jumbling logic as of now only has support for constants?
- Ideally we would even remove duplicates. That would also improve cardinality estimation, but I guess we don't want to spend the cycles on doing so in the planner?
- Why did you change palloc() to palloc0() for the clocations array? The length is initialized to 0 and FWICS RecordConstLocation() initializes all members. Seems to me like we don't have to spend these cycles.
- Can't the logic at the end of IsConstList() be simplified to:

foreach(temp, elements)
    if (!IsA(lfirst(temp), Const))
        return false;

// All elements are of type Const
*firstConst = linitial_node(Const, elements);
*lastConst = llast_node(Const, elements);
return true;

-- David Geier (ServiceNow)
Performance issues with parallelism and LIMIT
e on detached queues):

/*
 * If we are not able to send the tuple, we assume the destination
 * has closed and no more tuples can be sent. If that's the case,
 * end the loop.
 */
if (!dest->receiveSlot(slot, dest))
    break;

Reproduction steps for this problem are below. Here the worker getting the first table page will be done right away, but the query takes as long as it takes to scan all pages of the entire table.

CREATE TABLE bar (col INT);
INSERT INTO bar SELECT generate_series(1, 500);
SET max_parallel_workers_per_gather = 8;
EXPLAIN ANALYZE VERBOSE SELECT col FROM bar WHERE col = 1 LIMIT 1;

Limit (cost=0.00..1.10 rows=1 width=4) (actual time=32.289..196.200 rows=1 loops=1)
  Output: col
  ->  Gather (cost=0.00..30939.03 rows=28208 width=4) (actual time=32.278..196.176 rows=1 loops=1)
        Output: col
        Workers Planned: 8
        Workers Launched: 7
        ->  Parallel Seq Scan on public.bar (cost=0.00..30939.03 rows=3526 width=4) (actual time=137.251..137.255 rows=0 loops=7)
              Output: col
              Filter: (bar.col = 1)
              Rows Removed by Filter: 713769
              Worker 0:  actual time=160.177..160.181 rows=0 loops=1
              Worker 1:  actual time=160.111..160.115 rows=0 loops=1
              Worker 2:  actual time=0.043..0.047 rows=1 loops=1
              Worker 3:  actual time=160.040..160.044 rows=0 loops=1
              Worker 4:  actual time=160.167..160.171 rows=0 loops=1
              Worker 5:  actual time=160.018..160.022 rows=0 loops=1
              Worker 6:  actual time=160.201..160.205 rows=0 loops=1
Planning Time: 0.087 ms
Execution Time: 196.247 ms

We would need something similar to CHECK_FOR_INTERRUPTS() which returns a NULL slot if a parallel worker is supposed to stop execution (we could e.g. check if the queue got detached). Or could we amend CHECK_FOR_INTERRUPTS() to just stop the worker gracefully if the queue got detached? Jasper Smit, Spiros Agathos and Dimos Stamatakis helped working on this.

[1] https://www.postgresql.org/message-id/flat/CAFiTN-tVXqn_OG7tHNeSkBbN%2BiiCZTiQ83uakax43y1sQb2OBA%40mail.gmail.com

-- David Geier (ServiceNow)
Re: Lazy JIT IR code generation to increase JIT speed with partitions
Hi Dmitry, On 1/27/23 16:18, Dmitry Dolgov wrote: As I've noted off-list, there was noticeable difference in the dumped bitcode, which I haven't noticed since we were talking mostly about differences between executions of the same query. Thanks for the clarification and also thanks for helping with this effort. -- David Geier (ServiceNow)
Re: Lazy JIT IR code generation to increase JIT speed with partitions
Hi, Thanks for taking a look! On 12/1/22 22:12, Dmitry Dolgov wrote: First to summarize things a bit: from what I understand there are two suggestions on the table, one is about caching modules when doing inlining, the second is about actual lazy jitting. Are those to tightly coupled together, could they be split into two separate patches? It would make it a bit easier to review and test. Yes. I was playing with the caching part of the patch (still have to think about the lazy jitting), which generally seems to be a good idea. From what I see the main concern here is a chance that IRMover will rename types, degrading performance of the generated code in long run. I have to admit, I'm not fully sure mentioned LLVM 13 fix [1] completely eliminates those concerns, somehow its description is formulated in not very committing way ("don't know if this patch fixes ..., but it does fix a few soundness issues that have crept in."). But I haven't found any crashes or false asserts coming from the LLVM side (using v13), running several rounds of regression tests with forced jitting, so a point to the fix. I would be curious to learn how Andres was troubleshooting type renaming issues? Using LLVM 13 from packages, jitting the same query twice and dumping the bitcode out showed some difference in types a-la "%struct.ExprState.142" vs "%struct.ExprState.430" (from what I understood, the whole thing is an identifier, including the number) with the patch, but the very same changes are happening on the main branch as well. Of course, I was inspecting bitcode only for certain queries, it doesn't exclude that some other examples actually feature type renaming. In general, it would be interesting to know how to do some sort of "smoke tests" for the generated code, e.g. in case if LLVM has fixed this particular issue, but they might reappear in the future? +1, especially with my findings described below. 
I did a few performance tests and got numbers similar to those posted in the thread, inlining time being impressively reduced (~10 times) as well as (suspiciously) optimization time (~1.5 times). The approach was a bit different though, I've run the sequence of example queries from the thread using pgbench and checked jit counters from pgss. A few small notes: If caching of modules is safe from LLVM >= 13, I guess it should be wrapped into a corresponding condition on LLVM_VERSION_MAJOR, right? Why was the assert about hasExternalLinkage removed from llvmjit_inline.cpp? For so much discussion about such a small change there is definitely not enough commentary in the code about the dangers of cloning modules. Also, maybe a stupid question, but how big should this cache be? From what I understand it will be cleared on llvm_shutdown; does that mean it can grow unbounded if e.g. a single session keeps firing different queries at the db? The cache doesn't contain the modules generated for a query but the bitcode files with the to-be-inlined functions stored on disk. So the cache would stop growing as soon as a connection process has loaded all of them. Nevertheless, if a workload uses many connections and truly loads all, or at least most, of the bitcode files, that could be quite some memory. Beyond that I made an unfortunate discovery: the inlining time goes down by so much because no inlining is happening anymore :-/. This is because I cloned the module only when passing it to the IRMover, not right after taking it from the cache. However, after the module is taken from the cache it's still modified. llvm_execute_inline_plan() executes

if (valueToImport->hasExternalLinkage())
{
    valueToImport->setLinkage(LinkageTypes::AvailableExternallyLinkage);
}

But when checking if a function is inlinable in function_inlinable() we check

if (F.hasAvailableExternallyLinkage())
    return false;

which makes the code bail out for any function to be inlined. 
It's very curious as to why we didn't really see that when dumping the bitcode. It seems like the bitcode is always different enough to not spot that. So +1 on your point on having a smoke test in place to verify things still work. If I change the code to clone the module right after taking it from the cache and before making the changes to it, the JIT time remains high and seems even higher than when re-reading the file from disk. Profiling showed that llvm::CloneModule() is just super slow, especially for big bitcode files like numeric.bc. I haven't investigated why that is and if we can do something about it. I also don't plan to do so for the moment being. For reference, I attached the patch which only contains the caching code. It's the variant that clones early. -- David Geier (ServiceNow) From 6a13803de83aaa46d34880e1b7c44bf93eb87f35 Mon Sep 17 00:00:00 2001 From: David Geier Date: Mon, 27 Jun 2022 12:28:29 +0200 Subject: [PATCH] Cache modules in JIT inlining --- src/backend/jit/llvm/llvm
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Hi, On 1/23/23 21:30, Andres Freund wrote: That's been the case since my first post in the thread :). Mainly, it seems easier to detect underflow cases during subtraction that way. And the factor of 2 in range doesn't change a whole lot. I just realized it the other day :). If you have time to look at the pg_test_timing part, it'd be appreciated. That's a bit larger, and nobody looked at it yet. So I'm a bit hesitant to push it. I haven't yet pushed the pg_test_timing (nor its small prerequisite) patch. I've attached those two patches. Feel free to include them in your series if you want, then the CF entry (and thus cfbot) makes sense again... I'll include them in my new patch set and also have a careful look at them. I reviewed the prerequisite patch which introduces INSTR_TIME_SET_SECONDS(), as well as the pg_test_timing patch. Here are my comments:

- The prerequisite patch looks good to me.
- By default, the test query in the pg_test_timing doc runs serially. What about adding SET max_parallel_workers_per_gather = 0 to make sure it really always does (e.g. on a system with different settings for parallel_tuple_cost / parallel_setup_cost)? Otherwise, the numbers will be much more flaky.
- Why have you added a case distinction for diff == 0? Have you encountered this case? If so, how? Maybe add a comment.
- To further reduce overhead we could call INSTR_TIME_SET_CURRENT() multiple times. But then again: why do we actually care about the per-loop time? Why not instead sum up diff and divide by the number of iterations to exclude all the overhead in the first place?
- In the computation of the per-loop time in nanoseconds you can now use INSTR_TIME_GET_NANOSEC() instead of INSTR_TIME_GET_DOUBLE() * NS_PER_S.

The rest looks good to me. The rebased patches are part of the patch set I sent out yesterday in reply to another mail in this thread. -- David Geier (ServiceNow)
Re: pg_upgrade from PG-14.5 to PG-15.1 failing due to non-existing function
Hi, On 1/25/23 19:38, Dimos Stamatakis wrote: Hi hackers, I attempted to perform an upgrade from PG-14.5 to PG-15.1 with pg_upgrade and unfortunately it errors out because of a function that does not exist anymore in PG-15.1. The function is ‘pg_catalog.close_lb’ and it exists in 14.5 but not in 15.1. In our scenario we changed the permissions of this function in PG14.5 (via an automated tool) and then pg_upgrade tries to change the permissions in PG15.1 as well. Here [1] is a very similar issue that has been reported in 2019. The patch didn't make it in but it also seems to not fix the issue reported by Dimos. The patch in [1] seems to be concerned with changed function signatures rather than with dropped functions. Maybe [1] could be revived and extended to also ignore dropped functions? [1] https://www.postgresql.org/message-id/flat/f85991ad-bbd4-ad57-fde4-e12f0661dbf0%40postgrespro.ru -- David Geier (ServiceNow)
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Hi, I think at least some should be converted to just accumulate in an instr_time... I think that's for a later patch though? Yep, at least quite similar. OK. I coded it up in the latest version of the patch. Depending on how low we want to keep the error, I don't think we can: If I set the allowed deviation to 10**-9, we end up requiring a shift by 29 for common GHz ranges. Clearly 33 bits isn't an interesting range. But even if you accept a higher error - we don't have *that* much range available. Assuming an uint64, the range is ~584 years. If we want a 10 year range, we end up with math.log(((2**64)-1) / (10 * 365 * 60 * 60 * 24 * 10**9), 2) ~= 5.87 So 5 bits available that we could "use" for multiply/shift. For something like 2.5 GHz, that'd be ~2% error, clearly not acceptable. And even just a year of range ends up allowing a failure of 30796 s (~8.5 hours) over a year, still too high. Thanks for doing the math. Agreed. The latest patch detects overflow and correctly handles it. But I don't think it's really an issue - normally that branch will never be taken (at least within the memory of the branch predictor), which on modern CPUs means it'll just be predicted as not taken. So as long as we tell the compiler what's the likely branch, it should be fine. At least as long as the branch compares with a hardcoded number. Yeah. The overflow detection just compares two int64. The "overflow threshold" is pre-computed.

- the restriction just to Linux, that'll make testing harder for some, and ends up encoding too much OS dependency
- I think we need both the barrier and non-barrier variant, otherwise I suspect we'll end up with inaccuracies we don't want
- needs lots more documentation about why certain cpuid registers are used
- cpu microarch dependencies - isn't there, e.g., the case that the scale on Nehalem has to be different than on later architectures?
- lack of facility to evaluate how well the different time sources work

Makes sense. 
I carried that list over to my latest e-mail which also includes the patch to have some sort of summary of where we are in a single place. -- David Geier (ServiceNow)
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Hi, On 1/23/23 18:41, Andres Freund wrote: If we add it, it probably shouldn't depend on TIMING, but on SUMMARY. Regression test queries showing EXPLAIN ANALYZE output all do something like EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF) the SUMMARY OFF gets rid of the "top-level" "Planning Time" and "Execution Time", whereas the TIMING OFF gets rid of the per-node timing. Those are separate options because per-node timing is problematic performance-wise (right now), but whole-query timing rarely is. Makes sense. I wasn't aware of SUMMARY. Let's keep this option in mind, in case we'll revisit exposing the clock source in the future. Another, independent, thing worth thinking about: I think we might want to expose both rdtsc and rdtscp. For something like InstrStartNode()/InstrStopNode(), avoiding the "one-way barrier" of rdtscp is quite important to avoid changing the query performance. But for measuring whole-query time, we likely want to measure the actual time. It probably won't matter hugely for the whole query time - the out of order window of modern CPUs is large, but not *that* large - but I don't think we can generally assume that. That's what I thought as well. I added INSTR_TIME_SET_CURRENT_FAST() and for now call that variant from InstrStartNode(), InstrEndNode() and pg_test_timing. To do so in InstrEndNode(), I removed INSTR_TIME_SET_CURRENT_LAZY(). Otherwise, two variants of that macro would be needed. INSTR_TIME_SET_CURRENT_LAZY() was only used in a single place and the code is more readable that way. INSTR_TIME_SET_CURRENT() is called from a bunch of places. I still have to go through all of them and see which should be changed to call the _FAST() variant. Attached is v7 of the patch: - Rebased on latest master (most importantly on top of the int64 instr_time commits). - Includes two commits from Andres which introduce INSTR_TIME_SET_SECONDS(), INSTR_TIME_IS_LT() and WIP to report pg_test_timing output in nanoseconds. 
- Converts ticks to nanoseconds only with integer math, while accounting for overflow. - Supports RDTSCP via INSTR_TIME_SET_CURRENT() and introduced INSTR_TIME_SET_CURRENT_FAST() which uses RDTSC. I haven't gotten to the following: - Looking through all calls to INSTR_TIME_SET_CURRENT() and check if they should be replaced by INSTR_TIME_SET_CURRENT_FAST(). - Reviewing Andres commits. Potentially improving on pg_test_timing's output. - Looking at enabling RDTSC on more platforms. Is there a minimum set of platforms we would like support for? Windows should be easy. That would also allow to unify the code a little more. - Add more documentation and do more testing around the calls to CPUID. - Profiling and optimizing the code. A quick test showed about 10% improvement over master with TIMING ON vs TIMING OFF, when using the test-case from Andres' e-mail that started this thread. I hope I'll find time to work on these points during the next days. -- David Geier (ServiceNow) From 0b5ce706bed13c0c242e6ace809d3c37a8064029 Mon Sep 17 00:00:00 2001 From: Andres Freund Date: Fri, 20 Jan 2023 15:31:54 -0800 Subject: [PATCH v7 1/3] instr_time: Add INSTR_TIME_SET_SECONDS(), INSTR_TIME_IS_LT() INSTR_TIME_SET_SECONDS() is useful to calculate the end of a time-bound loop without having to convert into time units (which is costly). INSTR_TIME_IS_LT() can be used to check the loop condition. Author: Reviewed-by: Discussion: https://postgr.es/m/ Backpatch: --- src/include/portability/instr_time.h | 12 1 file changed, 12 insertions(+) diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h index cc85138e21..aab80effb0 100644 --- a/src/include/portability/instr_time.h +++ b/src/include/portability/instr_time.h @@ -15,6 +15,8 @@ * * INSTR_TIME_IS_ZERO(t) is t equal to zero? 
* + * INSTR_TIME_IS_LT(x, y) x < y + * * INSTR_TIME_SET_ZERO(t) set t to zero (memset is acceptable too) * * INSTR_TIME_SET_CURRENT(t) set t to current time @@ -22,6 +24,8 @@ * INSTR_TIME_SET_CURRENT_LAZY(t) set t to current time if t is zero, * evaluates to whether t changed * + * INSTR_TIME_SET_SECONDS(t, s) set t to s seconds + * * INSTR_TIME_ADD(x, y)x += y * * INSTR_TIME_SUBTRACT(x, y) x -= y @@ -122,6 +126,9 @@ pg_clock_gettime_ns(void) #define INSTR_TIME_SET_CURRENT(t) \ ((t) = pg_clock_gettime_ns()) +#define INSTR_TIME_SET_SECONDS(t, s) \ + ((t).ticks = NS_PER_S * (s)) + #define INSTR_TIME_GET_NANOSEC(t) \ ((int64) (t).ticks) @@ -156,6 +163,9 @@ GetTimerFrequency(void) #define INSTR_TIME_SET_CURRENT(t) \ ((t) = pg_query_performance_counter()) +#define INSTR_TIME_SET_SECONDS(t, s) \ + ((t).ticks = s * GetTimerFrequency()) + #define INSTR_TIME_GET_NANOSEC(t) \ ((int64) ((t).ticks * ((double) NS_PER_S / GetTimerFrequency( @@ -168,6 +178,8 @@ GetTimerFrequency(void) #define INSTR_TIME_IS_ZERO(t) ((t).ticks
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Hi, On 1/21/23 06:31, Andres Freund wrote: I pushed the int64-ification commits. Great. I started rebasing. One thing I was wondering about: why did you choose to use a signed instead of an unsigned 64-bit integer for the ticks? If you have time to look at the pg_test_timing part, it'd be appreciated. That's a bit larger, and nobody looked at it yet. So I'm a bit hesitant to push it. I haven't yet pushed the pg_test_timing (nor its small prerequisite) patch. I've attached those two patches. Feel free to include them in your series if you want, then the CF entry (and thus cfbot) makes sense again... I'll include them in my new patch set and also have a careful look at them. -- David Geier (ServiceNow)
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Hi, On 1/21/23 05:12, Andres Freund wrote: We do currently do the conversion quite frequently. Admittedly I was partially motivated by trying to get the per-loop overhead in pg_test_timing down ;) But I think it's a real issue. Places where we do, but shouldn't, convert:

- ExecReScan() - quite painful, we can end up with a lot of those
- InstrStopNode() - adds a good bit of overhead to simple

InstrStopNode() doesn't convert in the general case but only for the first tuple or when async. So it goes somewhat hand in hand with ExecReScan().

- PendingWalStats.wal_write_time - this is particularly bad because it happens within very contended code
- calls to pgstat_count_buffer_read_time(), pgstat_count_buffer_write_time() - they can be very frequent
- pgbench.c, as we already discussed
- pg_stat_statements.c
- ...

These all will get a bit slower when moving to a "variable" frequency. I wonder if we will be able to measure any of them easily. But given that it's many more places than I had realized and given that the optimized code is not too involved, let's give it a try. What was your approach for avoiding the costly operation? I ended up with an integer multiplication + shift approximation for the floating point multiplication (which in turn uses the inverse of the division by the frequency). To allow for sufficient precision while also avoiding overflows, I had to make that branch conditional, with a slow path for large numbers of nanoseconds. It seems like we ended up with the same. I do:

sec = ticks / frequency_hz
ns  = ticks / frequency_hz * 1,000,000,000
ns  = ticks * (1,000,000,000 / frequency_hz)
ns  = ticks * (1,000,000 / frequency_khz)    <-- now in kilohertz

Now, the constant scaling factor in parentheses is typically a floating point number. For example, for a frequency of 2.5 GHz it would be 0.4. 
To work around that we can do something like:

ns = ticks * (1,000,000 * scaler / frequency_khz) / scaler

Where scaler is a power-of-2, big enough to maintain enough precision while allowing for a shift to implement the division. The additional multiplication with scaler makes the maximum range go down, because we must ensure we never overflow. I'm wondering if we cannot pick scaler in such a way that the remaining range of cycles is large enough for our use case and we can therefore live without bothering about the overflow case. What would be "enough"? 1 year? 10 years? ... Otherwise, we indeed need code that cares for the potential overflow. My hunch is that it can be done branchless, but it for sure adds dependent instructions. Maybe in that case a branch that almost certainly will never be taken is better? I'll include the code in the new patch set, which I'll submit tomorrow at the latest. I think it'd be great - but I'm not sure we're there yet, reliability and code-complexity wise. Thanks to your commits, the diff of the new patch set will be already much smaller and easier to review. What's your biggest concern in terms of reliability? I think it might be worth making the rdtsc aspect somewhat measurable. E.g. allowing pg_test_timing to use both at the same time, and have it compare elapsed time with both sources of counters. I haven't yet looked into pg_test_timing. I'll do that while including your patches into the new patch set. -- David Geier (ServiceNow)
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Hi, On 1/21/23 05:14, Andres Freund wrote: The elapsed time is already inherently unstable, so we shouldn't have any test output showing the time. But I doubt showing it in every explain is a good idea - we use instr_time in plenty of other places. Why show it in explain, but not in all those other places? Yeah. I thought it would only be an issue if we showed it unconditionally in EXPLAIN ANALYZE. If we only show it with TIMING ON, we're likely fine with pretty much all regression tests. But given the different opinions, I'll leave it out in the new patch set for the moment being. -- David Geier (ServiceNow)
Parallel Bitmap Heap Scan reports per-worker stats in EXPLAIN ANALYZE
Hi hackers, EXPLAIN ANALYZE for parallel Bitmap Heap Scans currently only reports the number of heap blocks processed by the leader. It's missing the per-worker stats. The attached patch adds that functionality in the spirit of e.g. Sort or Memoize. Here is a simple test case and the EXPLAIN ANALYZE output with and without the patch:

create table foo(col0 int, col1 int);
insert into foo select generate_series(1, 1000, 0.001), generate_series(1000, 2000, 0.001);
create index idx0 on foo(col0);
create index idx1 on foo(col1);
set parallel_tuple_cost = 0;
set parallel_setup_cost = 0;
explain (analyze, costs off, timing off) select * from foo where col0 > 900 or col1 = 1;

With the patch:

Gather (actual rows=99501 loops=1)
  Workers Planned: 2
  Workers Launched: 2
  ->  Parallel Bitmap Heap Scan on foo (actual rows=33167 loops=3)
        Recheck Cond: ((col0 > 900) OR (col1 = 1))
        Heap Blocks: exact=98
        Worker 0:  Heap Blocks: exact=171 lossy=0
        Worker 1:  Heap Blocks: exact=172 lossy=0
        ->  BitmapOr (actual rows=0 loops=1)
              ->  Bitmap Index Scan on idx0 (actual rows=99501 loops=1)
                    Index Cond: (col0 > 900)
              ->  Bitmap Index Scan on idx1 (actual rows=0 loops=1)
                    Index Cond: (col1 = 1)

Without the patch:

Gather (actual rows=99501 loops=1)
  Workers Planned: 2
  Workers Launched: 2
  ->  Parallel Bitmap Heap Scan on foo (actual rows=33167 loops=3)
        Recheck Cond: ((col0 > 900) OR (col1 = 1))
        Heap Blocks: exact=91
        ->  BitmapOr (actual rows=0 loops=1)
              ->  Bitmap Index Scan on idx0 (actual rows=99501 loops=1)
                    Index Cond: (col0 > 900)
              ->  Bitmap Index Scan on idx1 (actual rows=0 loops=1)
                    Index Cond: (col1 = 1)

So in total the parallel Bitmap Heap Scan actually processed 441 heap blocks instead of just 91. Now two variable length arrays (VLA) would be needed, one for the snapshot and one for the stats. As this obviously doesn't work, I now use a single, big VLA and added functions to retrieve pointers to the respective fields. I'm using MAXALIGN() to make sure the latter field is aligned properly. 
Am I doing this correctly? I'm not entirely sure about alignment conventions and the requirements of other platforms.

I couldn't find existing tests that exercise the EXPLAIN ANALYZE output of specific nodes. I could only find a few basic smoke tests for EXPLAIN ANALYZE with parallel nodes in parallel_select.sql. Do we want tests for the changed functionality? If so, I could right away also add tests for EXPLAIN ANALYZE including other parallel nodes.

Thank you for your feedback.

-- David Geier (ServiceNow)

From b2c84fb16e9521d6cfadb0c069e27a213e8e8471 Mon Sep 17 00:00:00 2001
From: David Geier
Date: Tue, 8 Nov 2022 19:40:31 +0100
Subject: [PATCH v1] Parallel Bitmap Heap Scan reports per-worker stats

Similarly to other nodes (e.g. hash join, sort, memoize), Bitmap Heap Scan now reports per-worker stats in the EXPLAIN ANALYZE output. Previously, only the heap block stats for the leader were reported, which was incomplete in parallel scans.
---
 src/backend/commands/explain.c            | 46 ++--
 src/backend/executor/execParallel.c       |  3 +
 src/backend/executor/nodeBitmapHeapscan.c | 88 +++
 src/include/executor/nodeBitmapHeapscan.h |  1 +
 src/include/nodes/execnodes.h             | 23 +-
 5 files changed, 138 insertions(+), 23 deletions(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 5212a64b1e..d532fc4a87 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -3396,26 +3396,58 @@ show_hashagg_info(AggState *aggstate, ExplainState *es)
 static void
 show_tidbitmap_info(BitmapHeapScanState *planstate, ExplainState *es)
 {
+	Assert(es->analyze);
+
 	if (es->format != EXPLAIN_FORMAT_TEXT)
 	{
 		ExplainPropertyInteger("Exact Heap Blocks", NULL,
-							   planstate->exact_pages, es);
+							   planstate->stats.exact_pages, es);
 		ExplainPropertyInteger("Lossy Heap Blocks", NULL,
-							   planstate->lossy_pages, es);
+							   planstate->stats.lossy_pages, es);
 	}
 	else
 	{
-		if (planstate->exact_pages > 0 || planstate->lossy_pages > 0)
+		if (planstate->stats.exact_pages > 0 ||
+			planstate->stats.lossy_pages > 0)
 		{
 			ExplainIndentText(es);
 			appendStringInfoString(es->str, "Heap Blocks:");
-			if (planstate->exact_pages > 0)
-				appendStringInfo(es->str, " exact=%ld", planstate->exact_pages);
-			if (planstate->lossy_pages > 0)
-				appendStringInfo(es->str, " lossy=%ld", planstate->lossy_pages);
+			if (planstate->stats.exact_pages > 0)
+				appendStringInfo(es->str, " exact=%ld", planstate->stats.exact_pages);
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
On 1/18/23 13:52, David Geier wrote:

On 1/16/23 21:39, Pavel Stehule wrote: po 16. 1. 2023 v 21:34 odesílatel Tomas Vondra napsal: Hi, there's minor bitrot in the Mkvcbuild.pm change, making cfbot unhappy. As for the patch, I don't have many comments. I'm wondering if it'd be useful to indicate which timing source was actually used for EXPLAIN ANALYZE, say something like: Planning time: 0.197 ms Execution time: 0.225 ms Timing source: clock_gettime (or tsc) +1

I like the idea of exposing the timing source in the EXPLAIN ANALYZE output. It's a good tradeoff between inspectability and effort, given that RDTSC should always be better to use. If there are no objections, I'll go this way.

Thinking about this a little more made me realize that this will cause different pg_regress output depending on the platform. So if we go this route we would at least need an option for EXPLAIN ANALYZE to disable it. Or rather, have it disabled by default and allow enabling it. Thoughts?

-- David Geier (ServiceNow)
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Hi Andres,

I also couldn't help and hacked a bit on the rdtsc pieces. I did figure out how to do the cycles->nanosecond conversion with integer shift and multiply in the common case, which does show a noticeable speedup. But that's for another day.

I also have code for that here. I decided against integrating it because we don't convert frequently enough to make it matter. Or am I missing something?

I fought a bit with myself about whether to send those patches in this thread, because it'll take over the CF entry. But decided that it's ok, given that David's patches should be rebased over these anyway?

That's alright. Though I would hope we attempt to get both your patch set and the RDTSC patch set in.

-- David Geier (ServiceNow)
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Hi,

@Andres: will you take care of these changes and provide me with an updated patch set so I can rebase the RDTSC changes? Otherwise, I can also apply Tom's suggestions to your patch set and send out the complete patch set.

-- David Geier (ServiceNow)
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
On 1/16/23 18:37, Andres Freund wrote: Hi, On 2023-01-02 14:28:20 +0100, David Geier wrote:

I'm doubtful this is worth the complexity it incurs. By the time we convert out of the instr_time format, the times shouldn't be small enough that the accuracy is affected much.

I don't feel strongly about it, and you have a point that we most likely only convert once we've accumulated a fair amount of cycles.

Looking around, most of the existing uses of INSTR_TIME_GET_MICROSEC() actually accumulate themselves, and should instead keep things in the instr_time format and convert later. We'd win more accuracy / speed that way. I don't think the introduction of pg_time_usec_t was a great idea, but oh well.

Fully agreed. Why not replace pg_time_usec_t with instr_time in a separate patch? I haven't looked carefully enough at whether all occurrences could sanely be replaced, but at least the ones that accumulate time seem good starting points.

Additionally, I initialized a few variables of type instr_time which otherwise resulted in warnings due to use of potentially uninitialized variables. Unless we decide, as I suggested downthread, that we deprecate INSTR_TIME_SET_ZERO(), that's unfortunately not the right fix. I've a similar patch that adds all the necessary INSTR_TIME_SET_ZERO() calls.

I don't feel strongly about it, but like Tom tend towards keeping the initialization macro. Thanks for improving on the first patch and fixing these issues in a better way.

What about renaming INSTR_TIME_GET_DOUBLE() to INSTR_TIME_GET_SECS() so that it's consistent with the _MILLISEC() and _MICROSEC() variants? The INSTR_TIME_GET_MICROSEC() returns a uint64 while the other variants return double. This seems error prone. What about renaming the function or also have the function return a double and cast where necessary at the call site? I think those should be a separate discussion / patch.

OK. I'll propose follow-on patches once we're done with the ones at hand.
I'll then rebase the RDTSC patches on your patch set. -- David Geier (ServiceNow)
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
On 1/16/23 21:39, Pavel Stehule wrote: po 16. 1. 2023 v 21:34 odesílatel Tomas Vondra napsal: Hi, there's minor bitrot in the Mkvcbuild.pm change, making cfbot unhappy. As for the patch, I don't have many comments. I'm wondering if it'd be useful to indicate which timing source was actually used for EXPLAIN ANALYZE, say something like: Planning time: 0.197 ms Execution time: 0.225 ms Timing source: clock_gettime (or tsc) There has been a proposal to expose this as a GUC (or perhaps as explain option), to allow users to pick what timing source to use. I wouldn't go that far - AFAICS this is meant to be universally better when available. But knowing which source was used seems useful. +1

Thanks for looking at the patch. I'll fix the merge conflict.

I like the idea of exposing the timing source in the EXPLAIN ANALYZE output. It's a good tradeoff between inspectability and effort, given that RDTSC should always be better to use. If there are no objections, I'll go this way.

-- David Geier (ServiceNow)
Re: Sampling-based timing for EXPLAIN ANALYZE
Nice idea.

On 1/6/23 10:19, Jelte Fennema wrote: Do you have some performance comparison between TIMING ON and TIMING SAMPLING?

+1 to see some numbers compared to TIMING ON.

Mostly I'm wondering if the sampling-based approach gains us enough to be worth it, once the patch to use RDTSC has hopefully landed (see [1]). I believe that with the RDTSC patch the overhead of TIMING ON is lower than the overhead of using ANALYZE with TIMING OFF in the first place. Hence, to be really useful, it would be great if, on top of TIMING SAMPLING, we could also lower the overhead of ANALYZE itself further (e.g. by using a fast path for the default EXPLAIN (ANALYZE, TIMING ON / SAMPLING)). Currently, InstrStartNode() and InstrStopNode() have a ton of branches, and without all the typically deactivated code the implementation would be very small and could be placed in an inlinable function.

[1] https://www.postgresql.org/message-id/flat/20200612232810.f46nbqkdhbutzqdg%40alap3.anarazel.de

-- David Geier (ServiceNow)
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Hi,

CFBot shows some compilation errors as in [1]; please post an updated version fixing them:

[09:08:12.525] /usr/bin/ld: src/bin/pg_test_timing/pg_test_timing.p/pg_test_timing.c.o: warning: relocation against `cycles_to_sec' in read-only section `.text'
[09:08:12.525] /usr/bin/ld: src/bin/pg_test_timing/pg_test_timing.p/pg_test_timing.c.o: in function `pg_clock_gettime_ref_cycles':
[09:08:12.525] /tmp/cirrus-ci-build/build/../src/include/portability/instr_time.h:119: undefined reference to `use_rdtsc'
[09:08:12.525] /usr/bin/ld: src/bin/pg_test_timing/pg_test_timing.p/pg_test_timing.c.o: in function `test_timing':
[09:08:12.525] /tmp/cirrus-ci-build/build/../src/bin/pg_test_timing/pg_test_timing.c:135: undefined reference to `pg_clock_gettime_initialize_rdtsc'
[09:08:12.525] /usr/bin/ld: /tmp/cirrus-ci-build/build/../src/bin/pg_test_timing/pg_test_timing.c:137: undefined reference to `cycles_to_us'
[09:08:12.525] /usr/bin/ld: /tmp/cirrus-ci-build/build/../src/bin/pg_test_timing/pg_test_timing.c:146: undefined reference to `cycles_to_us'
[09:08:12.525] /usr/bin/ld: /tmp/cirrus-ci-build/build/../src/bin/pg_test_timing/pg_test_timing.c:169: undefined reference to `cycles_to_us'
[09:08:12.525] /usr/bin/ld: /tmp/cirrus-ci-build/build/../src/bin/pg_test_timing/pg_test_timing.c:176: undefined reference to `cycles_to_sec'
[09:08:12.525] /usr/bin/ld: warning: creating DT_TEXTREL in a PIE
[09:08:12.525] collect2: error: ld returned 1 exit status

[1] - https://cirrus-ci.com/task/5375312565895168

Regards,
Vignesh

I fixed the compilation error on CFBot. I missed adding instr_time.c to the Meson makefile. New patch set attached.

-- David Geier (ServiceNow)

From be18633d4735f680c7910fcb4e8ac90c4eada131 Mon Sep 17 00:00:00 2001
From: David Geier
Date: Thu, 17 Nov 2022 10:22:01 +0100
Subject: [PATCH 1/3] Change instr_time to just store nanoseconds, that's cheaper.
---
 src/include/portability/instr_time.h | 62
 1 file changed, 26 insertions(+), 36 deletions(-)

diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
index 9ea1a68bd9..c64f21eb53 100644
--- a/src/include/portability/instr_time.h
+++ b/src/include/portability/instr_time.h
@@ -80,63 +80,53 @@
 #define PG_INSTR_CLOCK CLOCK_REALTIME
 #endif

-typedef struct timespec instr_time;
+typedef int64 instr_time;
+
+#define NS_PER_S INT64CONST(1000000000)
+#define US_PER_S INT64CONST(1000000)
+#define MS_PER_S INT64CONST(1000)

-#define INSTR_TIME_IS_ZERO(t)	((t).tv_nsec == 0 && (t).tv_sec == 0)
+#define NS_PER_MS INT64CONST(1000000)
+#define NS_PER_US INT64CONST(1000)

-#define INSTR_TIME_SET_ZERO(t)	((t).tv_sec = 0, (t).tv_nsec = 0)
+#define INSTR_TIME_IS_ZERO(t)	((t) == 0)

-#define INSTR_TIME_SET_CURRENT(t)	((void) clock_gettime(PG_INSTR_CLOCK, &(t)))
+#define INSTR_TIME_SET_ZERO(t)	((t) = 0)
+
+static inline instr_time pg_clock_gettime_ns(void)
+{
+	struct timespec tmp;
+
+	clock_gettime(PG_INSTR_CLOCK, &tmp);
+
+	return tmp.tv_sec * NS_PER_S + tmp.tv_nsec;
+}
+
+#define INSTR_TIME_SET_CURRENT(t) \
+	(t) = pg_clock_gettime_ns()

 #define INSTR_TIME_ADD(x,y) \
 	do { \
-		(x).tv_sec += (y).tv_sec; \
-		(x).tv_nsec += (y).tv_nsec; \
-		/* Normalize */ \
-		while ((x).tv_nsec >= 1000000000) \
-		{ \
-			(x).tv_nsec -= 1000000000; \
-			(x).tv_sec++; \
-		} \
+		(x) += (y); \
 	} while (0)

 #define INSTR_TIME_SUBTRACT(x,y) \
 	do { \
-		(x).tv_sec -= (y).tv_sec; \
-		(x).tv_nsec -= (y).tv_nsec; \
-		/* Normalize */ \
-		while ((x).tv_nsec < 0) \
-		{ \
-			(x).tv_nsec += 1000000000; \
-			(x).tv_sec--; \
-		} \
+		(x) -= (y); \
 	} while (0)

 #define INSTR_TIME_ACCUM_DIFF(x,y,z) \
 	do { \
-		(x).tv_sec += (y).tv_sec - (z).tv_sec; \
-		(x).tv_nsec += (y).tv_nsec - (z).tv_nsec; \
-		/* Normalize after each add to avoid overflow/underflow of tv_nsec */ \
-		while ((x).tv_nsec < 0) \
-		{ \
-			(x).tv_nsec += 1000000000; \
-			(x).tv_sec--; \
-		} \
-		while ((x).tv_nsec >= 1000000000) \
-		{ \
-			(x).tv_nsec -= 1000000000; \
-			(x).tv_sec++; \
-		} \
+		(x) += (y) - (z); \
 	} while (0)

 #define INSTR_TIME_GET_DOUBLE(t) \
-	(((double) (t).tv_sec) + ((double) (t).tv_nsec) / 1000000000.0)
+	((double) (t) / NS_PER_S)

 #define INSTR_TIME_GET_MILLISEC(t) \
-	(((double) (t).tv_sec * 1000.0) + ((double) (t).tv_nsec) / 1000000.0)
+	((double) (t) / NS_PER_MS)

 #define INSTR_TIME_GET_MICROSEC(t) \
-	(((uint64) (t).tv_sec * (uint64) 1000000) + (uint64) ((t).tv_nsec / 1000))
+	((double) (t) / NS_PER_US)

 #else							/* WIN32 */
--
2.34.1

From 190ca09566eabb017ed25b1512225173ca47fb88 Mon Sep 17 00:00:00 2001
From: David Geier
Date: Thu, 17 Nov 2022 13:03:59 +0100
Subject: [PATCH 2/3] Use CPU reference cycles, via RDTSC, to measure time for instrumentation.

For now this is only enabled on Linux/x86 when the system clocksource is marked tsc as well, as determined at runtime. This way we can rely on the Linux kernel to make a decision whether tsc is invariant and usable on the current CPU architecture.
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Hi Lukas,

On 1/2/23 20:50, Lukas Fittl wrote: Thanks for continuing to work on this patch, and my apologies for silence on the patch.

It would be great if you could review it. Please also share your thoughts around exposing the used clock source as GUC and renaming INSTR_TIME_GET_DOUBLE() to _SECS(). I rebased again on master because of [1]. Patches attached.

It's been hard to make time, and especially so because I typically develop on an ARM-based macOS system where I can't test this directly - hence my tests with virtualized EC2 instances, where I ran into the timing oddities.

That's good and bad: bad for doing the development, but good for testing the implementation on more virtualized setups; given that I also encountered "interesting" behavior on VMWare (see my previous mails).

On Mon, Jan 2, 2023 at 5:28 AM David Geier wrote: The INSTR_TIME_GET_MICROSEC() returns a uint64 while the other variants return double. This seems error prone. What about renaming the function or also have the function return a double and cast where necessary at the call site? Minor note, but in my understanding using a uint64 (where we can) is faster for any simple arithmetic we do with the values.

That's true. So the argument could be that for seconds and milliseconds we want the extra precision, while microseconds are precise enough. Still, we could also make the seconds and milliseconds conversion code integer-only and e.g. return two integers with the value before and after the comma. FWICS, the functions are not used anywhere in performance-critical code, so it doesn't really make a difference performance-wise.

+1, and feel free to carry this patch forward - I'll try to make an effort to review my earlier testing issues again, as well as your later improvements to the patch.

Moved to the current commit fest. Will you become a reviewer?

Also, FYI, I just posted an alternate idea for speeding up EXPLAIN ANALYZE with timing over in [0], using a sampling-based approach to reduce the timing overhead.
Interesting idea. I'll reply with some thoughts on the corresponding thread.

[1] https://www.postgresql.org/message-id/flat/CALDaNm3kRBGPhndujr9JcjjbDCG3anhj0vW8b9YtbXrBDMSvvw%40mail.gmail.com

-- David Geier (ServiceNow)

From f63527c8e4b3b0b71ffacaa111dd93325d973432 Mon Sep 17 00:00:00 2001
From: David Geier
Date: Thu, 17 Nov 2022 10:22:01 +0100
Subject: [PATCH 1/3] Change instr_time to just store nanoseconds, that's cheaper.
---
 src/include/portability/instr_time.h | 62
 1 file changed, 26 insertions(+), 36 deletions(-)

diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
index 9ea1a68bd9..c64f21eb53 100644
--- a/src/include/portability/instr_time.h
+++ b/src/include/portability/instr_time.h
@@ -80,63 +80,53 @@
 #define PG_INSTR_CLOCK CLOCK_REALTIME
 #endif

-typedef struct timespec instr_time;
+typedef int64 instr_time;
+
+#define NS_PER_S INT64CONST(1000000000)
+#define US_PER_S INT64CONST(1000000)
+#define MS_PER_S INT64CONST(1000)

-#define INSTR_TIME_IS_ZERO(t)	((t).tv_nsec == 0 && (t).tv_sec == 0)
+#define NS_PER_MS INT64CONST(1000000)
+#define NS_PER_US INT64CONST(1000)

-#define INSTR_TIME_SET_ZERO(t)	((t).tv_sec = 0, (t).tv_nsec = 0)
+#define INSTR_TIME_IS_ZERO(t)	((t) == 0)

-#define INSTR_TIME_SET_CURRENT(t)	((void) clock_gettime(PG_INSTR_CLOCK, &(t)))
+#define INSTR_TIME_SET_ZERO(t)	((t) = 0)
+
+static inline instr_time pg_clock_gettime_ns(void)
+{
+	struct timespec tmp;
+
+	clock_gettime(PG_INSTR_CLOCK, &tmp);
+
+	return tmp.tv_sec * NS_PER_S + tmp.tv_nsec;
+}
+
+#define INSTR_TIME_SET_CURRENT(t) \
+	(t) = pg_clock_gettime_ns()

 #define INSTR_TIME_ADD(x,y) \
 	do { \
-		(x).tv_sec += (y).tv_sec; \
-		(x).tv_nsec += (y).tv_nsec; \
-		/* Normalize */ \
-		while ((x).tv_nsec >= 1000000000) \
-		{ \
-			(x).tv_nsec -= 1000000000; \
-			(x).tv_sec++; \
-		} \
+		(x) += (y); \
 	} while (0)

 #define INSTR_TIME_SUBTRACT(x,y) \
 	do { \
-		(x).tv_sec -= (y).tv_sec; \
-		(x).tv_nsec -= (y).tv_nsec; \
-		/* Normalize */ \
-		while ((x).tv_nsec < 0) \
-		{ \
-			(x).tv_nsec += 1000000000; \
-			(x).tv_sec--; \
-		} \
+		(x) -= (y); \
 	} while (0)

 #define INSTR_TIME_ACCUM_DIFF(x,y,z) \
 	do { \
-		(x).tv_sec += (y).tv_sec - (z).tv_sec; \
-		(x).tv_nsec += (y).tv_nsec - (z).tv_nsec; \
-		/* Normalize after each add to avoid overflow/underflow of tv_nsec */ \
-		while ((x).tv_nsec < 0) \
-		{ \
-			(x).tv_nsec += 1000000000; \
-			(x).tv_sec--; \
-		} \
-		while ((x).tv_nsec >= 1000000000) \
-		{ \
-			(x).tv_nsec -= 1000000000; \
-			(x).tv_sec++; \
-		} \
+		(x) += (y) - (z); \
 	} while (0)

 #define INSTR_TIME_GET_DOUBLE(t) \
-	(((double) (t).tv_sec) + ((double) (t).tv_nsec) / 1000000000.0)
+	((double) (t) / NS_PER_S)

 #define INSTR_TIME_GET_MILLISEC(t) \
-	(((double) (t).tv_sec * 1000.0) + ((double) (t).tv_nsec) / 1000000.0)
+	((double) (t) / NS_PER_MS)
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Hi,

I re-based again on master and applied the following changes:

- I removed the fallback for obtaining the TSC frequency from /proc/cpu as suggested by Andres. Worst-case we fall back to clock_gettime().

- I added code to obtain the TSC frequency via CPUID when under a hypervisor. I had to use __cpuid() directly instead of __get_cpuid(), because __get_cpuid() returns an error if the leaf is > 0x80000000 (probably the implementation pre-dates the hypervisor timing leaves). Unfortunately, while testing my implementation under VMWare, I found that RDTSC runs awfully slow there (like 30x slower). [1] indicates that we cannot generally rely on RDTSC being actually fast on VMs. However, the same applies to clock_gettime(). It runs as slow as RDTSC on my VMWare setup. Hence, using RDTSC is not at a disadvantage. I'm not entirely sure if there aren't cases where e.g. clock_gettime() is actually faster than RDTSC and it would be advantageous to use clock_gettime(). We could add a GUC so that the user can decide which clock source to use. Any thoughts?

- I also somewhat improved the accuracy of the cycles to milli- and microseconds conversion functions by having two more multipliers with higher precision. For microseconds we could also keep the computation integer-only. I'm wondering what best to do for seconds and milliseconds. I'm currently leaning towards just keeping it as is, because the durations measured and converted are usually long enough that precision shouldn't be a problem.

- In vacuumlazy.c we do (INSTR_TIME_GET_MICROSEC(elapsed) / 1000). I changed that to use INSTR_TIME_GET_MILLISEC() instead.

- Additionally, I initialized a few variables of type instr_time which otherwise resulted in warnings due to use of potentially uninitialized variables.

I also couldn't reproduce the reported timing discrepancy. For me the runtime reported by \timing is just slightly higher than the time reported by EXPLAIN ANALYZE, which is expected.
Beyond that:

- What about renaming INSTR_TIME_GET_DOUBLE() to INSTR_TIME_GET_SECS() so that it's consistent with the _MILLISEC() and _MICROSEC() variants?

- The INSTR_TIME_GET_MICROSEC() returns a uint64 while the other variants return double. This seems error prone. What about renaming the function or also have the function return a double and cast where necessary at the call site?

If no one objects I would also re-register this patch in the commit fest.

[1] https://vmware.com/content/dam/digitalmarketing/vmware/en/pdf/techpaper/Timekeeping-In-VirtualMachines.pdf (page 11 "Virtual TSC")

-- David Geier (ServiceNow)

From 321d00ae5dd1bcffc8fbdb39879b7f5c78e3930f Mon Sep 17 00:00:00 2001
From: David Geier
Date: Thu, 17 Nov 2022 10:22:01 +0100
Subject: [PATCH 1/3] Change instr_time to just store nanoseconds, that's cheaper.
---
 src/include/portability/instr_time.h | 62
 1 file changed, 26 insertions(+), 36 deletions(-)

diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
index 22bcf3d288..4bd555113b 100644
--- a/src/include/portability/instr_time.h
+++ b/src/include/portability/instr_time.h
@@ -80,63 +80,53 @@
 #define PG_INSTR_CLOCK CLOCK_REALTIME
 #endif

-typedef struct timespec instr_time;
+typedef int64 instr_time;
+
+#define NS_PER_S INT64CONST(1000000000)
+#define US_PER_S INT64CONST(1000000)
+#define MS_PER_S INT64CONST(1000)

-#define INSTR_TIME_IS_ZERO(t)	((t).tv_nsec == 0 && (t).tv_sec == 0)
+#define NS_PER_MS INT64CONST(1000000)
+#define NS_PER_US INT64CONST(1000)

-#define INSTR_TIME_SET_ZERO(t)	((t).tv_sec = 0, (t).tv_nsec = 0)
+#define INSTR_TIME_IS_ZERO(t)	((t) == 0)

-#define INSTR_TIME_SET_CURRENT(t)	((void) clock_gettime(PG_INSTR_CLOCK, &(t)))
+#define INSTR_TIME_SET_ZERO(t)	((t) = 0)
+
+static inline instr_time pg_clock_gettime_ns(void)
+{
+	struct timespec tmp;
+
+	clock_gettime(PG_INSTR_CLOCK, &tmp);
+
+	return tmp.tv_sec * NS_PER_S + tmp.tv_nsec;
+}
+
+#define INSTR_TIME_SET_CURRENT(t) \
+	(t) = pg_clock_gettime_ns()

 #define INSTR_TIME_ADD(x,y) \
 	do { \
-		(x).tv_sec += (y).tv_sec; \
-		(x).tv_nsec += (y).tv_nsec; \
-		/* Normalize */ \
-		while ((x).tv_nsec >= 1000000000) \
-		{ \
-			(x).tv_nsec -= 1000000000; \
-			(x).tv_sec++; \
-		} \
+		(x) += (y); \
 	} while (0)

 #define INSTR_TIME_SUBTRACT(x,y) \
 	do { \
-		(x).tv_sec -= (y).tv_sec; \
-		(x).tv_nsec -= (y).tv_nsec; \
-		/* Normalize */ \
-		while ((x).tv_nsec < 0) \
-		{ \
-			(x).tv_nsec += 1000000000; \
-			(x).tv_sec--; \
-		} \
+		(x) -= (y); \
 	} while (0)

 #define INSTR_TIME_ACCUM_DIFF(x,y,z) \
 	do { \
-		(x).tv_sec += (y).tv_sec - (z).tv_sec; \
-		(x).tv_nsec += (y).tv_nsec - (z).tv_nsec; \
-		/* Normalize after each add to avoid overflow/underflow of tv_nsec */ \
-		while ((x).tv_nsec < 0) \
-		{ \
-			(x).tv_nsec += 1000000000; \
-			(x).tv_sec--; \
-		} \
-		while ((x).tv_nsec >= 1000000000) \
-		{ \
-			(x).tv_nsec -= 1000000000; \
Re: Aggregate node doesn't include cost for sorting
Hi David,

Thanks for the explanations, and awesome that this was improved on already. I didn't have this change on my radar.

On 12/8/22 11:40, David Rowley wrote: This output surely must be from a version of PostgreSQL prior to 1349d279? I can't quite figure out why you've added a "SET enable_seqscan = FALSE;". That makes it look like you've used the same version of PostgreSQL to produce both of these plans. The 2nd plan you've shown must be post 1349d279.

Both plans were captured on 14.5, which is indeed prior to 1349d279. I disabled sequential scan to show that there's an alternative plan which is superior to the chosen plan: Index Only Scan is more expensive and takes longer than the Seq Scan, but the subsequent Aggregate runs much faster as it doesn't have to sort, making the plan overall superior.

So, with the assumption that you've used 2 different versions to show this output, for post 1349d279, there does not seem to be much choice on how the aggregate is executed. What's your concern about the costings having to be accurate given there's no other plan choice?

There's another plan choice, which is using the index to get pre-sorted input rows; see previous comment.

The new pre-sorted aggregate code will always request the sort order that suits the largest number of ORDER BY / DISTINCT aggregates. When there are multiple ORDER BY / DISTINCT aggregates and they have different sort requirements then there certainly are competing ways that the aggregation portion of the plan can be executed. I opted to make the choice just based on the number of aggregates that could become presorted. nodeAgg.c is currently not very smart about sharing sorts between multiple aggregates with the same sort requirements. If that was made better, there might be more motivation to have better costing code in make_pathkeys_for_groupagg(). However, the reasons for the reversion of db0d67db seemed to be mainly around the lack of ability to accurately cost multiple competing sort orders.
We'd need to come up with some better way to do that if we were to want to give make_pathkeys_for_groupagg() similar abilities. Also, why does the Aggregate node sort itself? Why don't we instead emit an explicit Sort node in the plan so that it's visible to the user what is happening? As soon as there's also a GROUP BY in the query, a Sort node occurs in the plan. This seems inconsistent. Post 1349d279 we do that, but it can only do it for 1 sort order. There can be any number of aggregate functions which require a sort and they don't all have to have the same sort order requirements. We can't do the same as WindowAgg does by chaining nodes together either because the aggregate node aggregates the results and we'd need all the same input rows to be available at each step. The only other way would be to have it so an Aggregate node could be fed by multiple different input nodes and then it would only work on the aggregates that suit the given input before reading the next input and doing the other aggregates. Making it work like that would cause quite a bit of additional effort during planning (not to mention the executor). We'd have to run the join search once per required order, which is one of the slowest parts of planning. Right now, you could probably make that work by just writing the SQL to have a subquery per sort requirement. Thanks for the explanation! -- David Geier (ServiceNow)
Aggregate node doesn't include cost for sorting
Hi hackers,

The cost of an Aggregate node seems to be the same regardless of the input being pre-sorted or not. If however the input is not sorted, the Aggregate node must additionally perform a sort, which can impact runtime significantly. Here is an example:

CREATE TABLE foo(col0 INT, col1 TEXT);
INSERT INTO foo SELECT generate_series(1, 100000), 'aa' || md5(random()::TEXT);
CREATE INDEX foo_idx ON foo(col1);
SET max_parallel_workers_per_gather = 0;
SET enable_bitmapscan = FALSE;

-- Unsorted input. Aggregate node must additionally sort all rows.
EXPLAIN ANALYZE SELECT COUNT(DISTINCT(col1)) FROM foo;

                                                    QUERY PLAN
---
 Aggregate  (cost=2584.00..2584.01 rows=1 width=8) (actual time=1684.705..1684.809 rows=1 loops=1)
   ->  Seq Scan on foo  (cost=0.00..2334.00 rows=100000 width=71) (actual time=0.018..343.280 rows=100000 loops=1)
 Planning Time: 0.317 ms
 Execution Time: 1685.543 ms

-- Pre-sorted input. Aggregate node doesn't have to sort all rows.
SET enable_seqscan = FALSE;
EXPLAIN ANALYZE SELECT COUNT(DISTINCT(col1)) FROM foo;

                                                    QUERY PLAN
---
 Aggregate  (cost=6756.42..6756.43 rows=1 width=8) (actual time=819.028..819.041 rows=1 loops=1)
   ->  Index Only Scan using foo_idx on foo  (cost=6506.42..6506.42 rows=100000 width=71) (actual time=0.046..404.260 rows=100000 loops=1)
         Heap Fetches: 100000
         Heap Prefetches: 1334
 Planning Time: 0.438 ms
 Execution Time: 819.515 ms

The cost of the Aggregate node is in both cases the same (250.0, i.e. the difference between the Aggregate's total cost and its input's total cost), while its runtime is 1.3s in the unsorted case and 0.4s in the pre-sorted case.

Also, why does the Aggregate node sort itself? Why don't we instead emit an explicit Sort node in the plan so that it's visible to the user what is happening? As soon as there's also a GROUP BY in the query, a Sort node occurs in the plan. This seems inconsistent.

-- David Geier (ServiceNow)
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
I missed attaching the patches. -- David Geier (ServiceNow) From f4e962729ca605498d0c8bfc97d0f42d68a0df06 Mon Sep 17 00:00:00 2001 From: David Geier Date: Thu, 17 Nov 2022 10:22:01 +0100 Subject: [PATCH 1/2] WIP: Change instr_time to just store nanoseconds, that's cheaper. --- src/include/portability/instr_time.h | 62 1 file changed, 26 insertions(+), 36 deletions(-) diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h index 22bcf3d288..4bd555113b 100644 --- a/src/include/portability/instr_time.h +++ b/src/include/portability/instr_time.h @@ -80,63 +80,53 @@ #define PG_INSTR_CLOCK CLOCK_REALTIME #endif -typedef struct timespec instr_time; +typedef int64 instr_time; +#define NS_PER_S INT64CONST(10) +#define US_PER_S INT64CONST(100) +#define MS_PER_S INT64CONST(1000) -#define INSTR_TIME_IS_ZERO(t) ((t).tv_nsec == 0 && (t).tv_sec == 0) +#define NS_PER_MS INT64CONST(100) +#define NS_PER_US INT64CONST(1000) -#define INSTR_TIME_SET_ZERO(t) ((t).tv_sec = 0, (t).tv_nsec = 0) +#define INSTR_TIME_IS_ZERO(t) ((t) == 0) -#define INSTR_TIME_SET_CURRENT(t) ((void) clock_gettime(PG_INSTR_CLOCK, &(t))) +#define INSTR_TIME_SET_ZERO(t) ((t) = 0) + +static inline instr_time pg_clock_gettime_ns(void) +{ + struct timespec tmp; + + clock_gettime(PG_INSTR_CLOCK, ); + + return tmp.tv_sec * NS_PER_S + tmp.tv_nsec; +} + +#define INSTR_TIME_SET_CURRENT(t) \ + (t) = pg_clock_gettime_ns() #define INSTR_TIME_ADD(x,y) \ do { \ - (x).tv_sec += (y).tv_sec; \ - (x).tv_nsec += (y).tv_nsec; \ - /* Normalize */ \ - while ((x).tv_nsec >= 10) \ - { \ - (x).tv_nsec -= 10; \ - (x).tv_sec++; \ - } \ + (x) += (y); \ } while (0) #define INSTR_TIME_SUBTRACT(x,y) \ do { \ - (x).tv_sec -= (y).tv_sec; \ - (x).tv_nsec -= (y).tv_nsec; \ - /* Normalize */ \ - while ((x).tv_nsec < 0) \ - { \ - (x).tv_nsec += 10; \ - (x).tv_sec--; \ - } \ + (x) -= (y); \ } while (0) #define INSTR_TIME_ACCUM_DIFF(x,y,z) \ do { \ - (x).tv_sec += (y).tv_sec - (z).tv_sec; \ - (x).tv_nsec += (y).tv_nsec 
- (z).tv_nsec; \ - /* Normalize after each add to avoid overflow/underflow of tv_nsec */ \ - while ((x).tv_nsec < 0) \ - { \ - (x).tv_nsec += 10; \ - (x).tv_sec--; \ - } \ - while ((x).tv_nsec >= 10) \ - { \ - (x).tv_nsec -= 10; \ - (x).tv_sec++; \ - } \ + (x) += (y) - (z); \ } while (0) #define INSTR_TIME_GET_DOUBLE(t) \ - (((double) (t).tv_sec) + ((double) (t).tv_nsec) / 10.0) + ((double) (t) / NS_PER_S) #define INSTR_TIME_GET_MILLISEC(t) \ - (((double) (t).tv_sec * 1000.0) + ((double) (t).tv_nsec) / 100.0) + ((double) (t) / NS_PER_MS) #define INSTR_TIME_GET_MICROSEC(t) \ - (((uint64) (t).tv_sec * (uint64) 100) + (uint64) ((t).tv_nsec / 1000)) + ((double) (t) / NS_PER_US) #else /* WIN32 */ -- 2.34.1 From 7a6317fdf5b1d82f120a4fd5535cfe40c8165153 Mon Sep 17 00:00:00 2001 From: David Geier Date: Thu, 17 Nov 2022 13:03:59 +0100 Subject: [PATCH 2/2] WIP: Use cpu reference cycles, via rdtsc, to measure time for instrumentation. For now this is only enabled on Linux/x86 when the system clocksource is marked tsc as well, as determined at runtime. This way we can rely on the Linux kernel to make a decision whether tsc is invariant and usable on the current CPU architecture. In all other cases we continue to use the clock_gettime() implementation like before. Note that this intentionally uses rdtsc, not rdtscp, as rdtscp waits for currently running CPU instructions to have finished, and that adds up to noticable latency for little benefit in the typical InstrStartNode() / InstrStopNode() use case. 
--- src/backend/utils/init/postinit.c | 3 + src/bin/pg_test_timing/pg_test_timing.c | 1 + src/bin/pgbench/pgbench.c | 3 + src/bin/psql/startup.c | 4 + src/common/Makefile | 1 + src/common/instr_time.c | 103 src/include/portability/instr_time.h| 50 +--- src/tools/msvc/Mkvcbuild.pm | 2 +- 8 files changed, 157 insertions(+), 10 deletions(-) create mode 100644 src/common/instr_time.c diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c index a990c833c5..c684725af3 100644 --- a/src/backend/utils/init/postinit.c +++ b/src/backend/utils/init/postinit.c @@ -804,6 +804,9 @@ InitPostgres(const char *in_dbname, Oid dboid, /* Initialize portal manager */ EnablePortalManager(); + /* initialize high-precision interval timing */ + INSTR_TIME_INITIALIZE(); + /* Initialize status reporting */ pgstat_beinit(); diff --git a/src/bin/pg_test_timing/pg_test_timing.c b/src/bin/pg_test_timing/pg_test_timing.c index c29d6f8762..0d667ff5a7 100644 --- a/src/bin/pg_test_timing/pg_test_timing.c +++ b/src/bin/pg_test_timing/pg_test_timing.c @@ -132,6 +132,7 @@ test_timi
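The arithmetic of [PATCH 1/2] can be sketched outside the tree as a minimal, self-contained model — using standard C `int64_t` in place of PostgreSQL's `int64`/`INT64CONST` (an assumption made for portability, not the patch's actual code) — showing how the `timespec` normalization loops collapse into plain integer arithmetic:

```c
#include <assert.h>
#include <stdint.h>

typedef int64_t instr_time_ns;          /* stand-in for the patched instr_time */

#define NS_PER_S  INT64_C(1000000000)
#define NS_PER_MS INT64_C(1000000)
#define NS_PER_US INT64_C(1000)

/* With a single integer there is nothing to normalize: add, subtract and
 * accumulate-difference each become one arithmetic operation. */
static inline instr_time_ns ns_add(instr_time_ns x, instr_time_ns y)  { return x + y; }
static inline instr_time_ns ns_sub(instr_time_ns x, instr_time_ns y)  { return x - y; }
static inline instr_time_ns ns_accum_diff(instr_time_ns x, instr_time_ns y,
                                          instr_time_ns z)            { return x + (y - z); }

/* Unit conversions reduce to a single division. */
static inline double ns_to_millisec(instr_time_ns t) { return (double) t / NS_PER_MS; }
static inline double ns_to_sec(instr_time_ns t)      { return (double) t / NS_PER_S; }
```

This also illustrates why the change is cheaper: the `while`-loop normalization of `tv_nsec` (with its branches) disappears entirely from the hot `INSTR_TIME_ACCUM_DIFF` path.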
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
I think it would be great to get this patch committed. Beyond the reasons already mentioned, the significant overhead also tends to skew the reported runtimes in ways that make it difficult to compare them. For example, if two nodes are executed equally often but one needs twice the time to process the rows, EXPLAIN ANALYZE should report timings that are 2x apart. However, currently, the high overhead of clock_gettime() tends to skew the relative runtimes.

On 10/12/22 10:33, Michael Paquier wrote:
No rebased version has been sent since this update, so this patch has been marked as RwF.

I've rebased the patch set on latest master and fixed a few compiler warnings. Beyond that, some findings and thoughts:

You're only using RDTSC if the clock source is 'tsc'. Great idea to not bother caring about a lot of hairy TSC details. Looking at the kernel code this seems to imply that the TSC is frequency invariant. I don't think though that this implies that Linux is not running under a hypervisor; which is good because I assume PostgreSQL is used a lot in VMs. However, when running under a hypervisor (at least with VMware) CPUID leaf 0x16 is not available. In my tests __get_cpuid() indicated success but the returned values were garbage. Instead of using leaf 0x16, we should then use the hypervisor interface to obtain the TSC frequency. Checking if a hypervisor is active can be done via:

bool IsHypervisorActive()
{
	uint32 cpuinfo[4] = {0};
	int res = __get_cpuid(0x1, &cpuinfo[0], &cpuinfo[1], &cpuinfo[2], &cpuinfo[3]);
	return res > 0 && (cpuinfo[2] & (1 << 31));
}

Obtaining the TSC frequency via the hypervisor interface can be done with the following code. See https://lwn.net/Articles/301888/ for more details.

// Under hypervisors (tested with VMware) leaf 0x16 is not available, even though __get_cpuid() succeeds.
// Hence, if running under a hypervisor, use the hypervisor interface to obtain the TSC frequency.
uint32 cpuinfo[4] = {0};
if (IsHypervisorActive() && __get_cpuid(0x40000010, &cpuinfo[0], &cpuinfo[1], &cpuinfo[2], &cpuinfo[3]) > 0)
	cycles_to_sec = 1.0 / ((double) cpuinfo[0] * 1000 * 1000);

Given that we anyway switch between RDTSC and clock_gettime() with a global variable, what about exposing the clock source as a GUC? That way the user can switch back to a working clock source in case we miss a detail around activating or reading the TSC. I'm happy to update the patches accordingly.

--
David Geier
(ServiceNow)
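The hypervisor check quoted above can be written as a self-contained sketch. This is an illustration under stated assumptions — GCC/Clang's `<cpuid.h>`, x86 only, with a fallback of "no hypervisor" on other architectures — not the exact code from the patch set:

```c
#include <assert.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* Returns 1 if CPUID leaf 0x1 reports the hypervisor-present bit
 * (ECX bit 31), 0 otherwise or on non-x86 builds. */
static int is_hypervisor_active(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;

    /* __get_cpuid() returns 0 if the requested leaf is unsupported. */
    if (__get_cpuid(0x1, &eax, &ebx, &ecx, &edx) == 0)
        return 0;
    return (ecx & (1u << 31)) != 0;
#else
    return 0;
#endif
}
```

The result naturally depends on the machine it runs on; the point of the guard is that the subsequent hypervisor-leaf query (0x40000000 range) is only attempted when this bit is set.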
Re: Optimize join selectivity estimation by not reading MCV stats for unique join attributes
On 11/18/22 14:00, Tomas Vondra wrote:
Seems fine. I wonder if we could/should introduce a new constant for 0, similar to ATTSTATSSLOT_NUMBERS/ATTSTATSSLOT_VALUES, instead of using a magic constant. Say, ATTSTATSSLOT_NONE or ATTSTATSSLOT_CHECK.

Good idea. I called it ATTSTATSSLOT_EXISTS. New patch attached.

I don't think you can write a test for this, because there is no change to behavior that can be observed by the user. If one side has no MCV, the only difference is whether we try to load the other MCV or not.

Yeah. I thought along the lines of checking the number of pages read when the pg_stats entry is not in syscache yet. But that seems awfully implementation specific. So no test provided.

--
David Geier
(ServiceNow)

From 5c5c0fb9dd99e79daaa015984c9dda22e4ccda17 Mon Sep 17 00:00:00 2001
From: David Geier
Date: Fri, 18 Nov 2022 09:35:08 +0100
Subject: [PATCH] Don't read MCV stats needlessly in join selectivity
 estimation

Join selectivity estimation only uses MCV stats if they're present on both join attributes. Therefore, if MCV stats are not available on one of the two columns, skip reading MCV stats altogether. Situations in which MCV stats are not available are frequently encountered: unique columns, or columns for which all values have roughly the same frequency and for which ANALYZE therefore decided not to store MCV stats.
---
 src/backend/utils/adt/selfuncs.c    | 17 +++++++++++++++--
 src/backend/utils/cache/lsyscache.c |  4 ++++
 src/include/utils/lsyscache.h       |  3 ++-
 3 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index d597b7e81f..85a29eaeb9 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -2263,6 +2263,7 @@ eqjoinsel(PG_FUNCTION_ARGS)
 	bool		have_mcvs2 = false;
 	bool		join_is_reversed;
 	RelOptInfo *inner_rel;
+	bool		get_mcv_stats;

 	get_join_variables(root, args, sjinfo, &vardata1, &vardata2, &join_is_reversed);
@@ -2275,11 +2276,23 @@ eqjoinsel(PG_FUNCTION_ARGS)
 	memset(&sslot1, 0, sizeof(sslot1));
 	memset(&sslot2, 0, sizeof(sslot2));

+	/*
+	 * There is no use in fetching one side's MCVs if we lack MCVs for the
+	 * other side, so do a quick check to verify that both stats exist.
+	 */
+	get_mcv_stats = (HeapTupleIsValid(vardata1.statsTuple) &&
+					 HeapTupleIsValid(vardata2.statsTuple) &&
+					 get_attstatsslot(&sslot1, vardata1.statsTuple,
+									  STATISTIC_KIND_MCV, InvalidOid, ATTSTATSSLOT_EXISTS) &&
+					 get_attstatsslot(&sslot2, vardata2.statsTuple,
+									  STATISTIC_KIND_MCV, InvalidOid, ATTSTATSSLOT_EXISTS));
+
 	if (HeapTupleIsValid(vardata1.statsTuple))
 	{
 		/* note we allow use of nullfrac regardless of security check */
 		stats1 = (Form_pg_statistic) GETSTRUCT(vardata1.statsTuple);
-		if (statistic_proc_security_check(&vardata1, opfuncoid))
+		if (get_mcv_stats && statistic_proc_security_check(&vardata1, opfuncoid))
 			have_mcvs1 = get_attstatsslot(&sslot1, vardata1.statsTuple,
 										  STATISTIC_KIND_MCV, InvalidOid,
 										  ATTSTATSSLOT_VALUES | ATTSTATSSLOT_NUMBERS);
@@ -2289,7 +2302,7 @@ eqjoinsel(PG_FUNCTION_ARGS)
 	{
 		/* note we allow use of nullfrac regardless of security check */
 		stats2 = (Form_pg_statistic) GETSTRUCT(vardata2.statsTuple);
-		if (statistic_proc_security_check(&vardata2, opfuncoid))
+		if (get_mcv_stats && statistic_proc_security_check(&vardata2, opfuncoid))
 			have_mcvs2 = get_attstatsslot(&sslot2, vardata2.statsTuple,
 										  STATISTIC_KIND_MCV, InvalidOid,
 										  ATTSTATSSLOT_VALUES | ATTSTATSSLOT_NUMBERS);
diff --git a/src/backend/utils/cache/lsyscache.c b/src/backend/utils/cache/lsyscache.c
index a16a63f495..191f68f7b4 100644
--- a/src/backend/utils/cache/lsyscache.c
+++ b/src/backend/utils/cache/lsyscache.c
@@ -3157,6 +3157,10 @@ get_attavgwidth(Oid relid, AttrNumber attnum)
 * reqkind: STAKIND code for desired statistics slot kind.
 * reqop: STAOP value wanted, or InvalidOid if don't care.
 * flags: bitmask of ATTSTATSSLOT_VALUES and/or ATTSTATSSLOT_NUMBERS.
+ *	Passing 0 will only check if the requested slot type exists but not
+ *	extract any content. This is useful in cases where MCV stats are only
+ *	useful if they exist on all involved columns (e.g. during join
+ *	selectivity computation).
 *
 * If a matching slot is found, true is returned, and *sslot is filled thus:
 * staop: receives the actual STAOP value.
diff --git a/src/include/utils/lsyscache.h b/src/include/utils/lsyscache.h
index 50f0288305..d9a3916997 100644
--- a/src/include/utils/lsyscache.h
+++ b/src/include/utils/lsyscache.h
@@ -39,7 +39,8 @@ typedef enum IOFuncSelector
 } IOFuncSelector;

 /* Flag bits for get_attstatsslot */
-#define ATTSTATSSLOT_VALUES	0x01
+#define ATTSTATSSLOT_EXISTS	0x00
+#define ATTSTATSSLOT_VALUES	0x01
 #define ATTSTATSSLOT_NUMBERS	0x02

 /* Result struct for get_attstatsslot */
--
2.34.1
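The semantics of the new flag can be modeled in a few lines. The following toy sketch (all names are made up for illustration; this is not the PostgreSQL API) shows the design choice behind ATTSTATSSLOT_EXISTS: with no extraction flags set, the probe only reports whether the slot is present, while the expensive content extraction (deconstruct_array() in the real code) runs only for the flags that were requested:

```c
#include <assert.h>

#define ATTSTATSSLOT_EXISTS  0x00
#define ATTSTATSSLOT_VALUES  0x01
#define ATTSTATSSLOT_NUMBERS 0x02

struct toy_slot
{
    int values_loaded;          /* stands in for deconstruct_array() having run */
    int numbers_loaded;
};

/* Toy model of get_attstatsslot(): returns whether the slot exists and
 * extracts content only for the requested flags. */
static int toy_get_attstatsslot(struct toy_slot *slot, int slot_present, int flags)
{
    slot->values_loaded = 0;
    slot->numbers_loaded = 0;
    if (!slot_present)
        return 0;
    if (flags & ATTSTATSSLOT_VALUES)
        slot->values_loaded = 1;    /* the expensive part, skipped when flags == 0 */
    if (flags & ATTSTATSSLOT_NUMBERS)
        slot->numbers_loaded = 1;
    return 1;
}
```

This mirrors why the existence probe in eqjoinsel() is cheap: both sides are checked with flags == 0 first, and the full VALUES|NUMBERS extraction happens only when both sides are known to have an MCV slot.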
Re: CREATE UNLOGGED TABLE seq faults when debug_discard_caches=1
Hi Tom,

Back-patching but keeping RelationOpenSmgr() for extensions sounds reasonable.

On a different note: are we frequently running our test suites with debug_discard_caches=1 enabled? It doesn't seem like it. I just ran make check with debug_discard_caches=1 on:
- latest master: everything passes.
- version 14.5: fails in create_index, create_index_spgist, create_view.

So the buggy code path is at least covered by the tests. But it seems like we could have found it earlier by regularly running with debug_discard_caches=1.

--
David Geier
(ServiceNow)

On 11/17/22 18:51, Tom Lane wrote:
I wrote:
I wonder whether we ought to back-patch f10f0ae42. We could leave the RelationOpenSmgr macro in existence to avoid unnecessary breakage of extension code, but stop using it within our own code.

Concretely, about like this for v14 (didn't look at the older branches yet). I'm not sure whether to recommend that outside extensions switch to using RelationGetSmgr in pre-v15 branches. If they do, they run a risk of compile failure should they be built against old back-branch headers. Once compiled, though, they'd work against any minor release (since RelationGetSmgr is static inline, not something in the core backend). So maybe that'd be good enough, and keeping their code in sync with what they need for v15 would be worth something.

regards, tom lane
Re: Assertion failure with barriers in parallel hash join
Thanks! Please let me know if I can help out, e.g. with re-testing. -- David Geier (ServiceNow) On 11/17/22 08:28, Thomas Munro wrote: On Thu, Nov 17, 2022 at 8:01 PM David Geier wrote: Can we make progress with this patch in the current commit fest, or discuss what is still missing to bring this in? Hi David, Sorry for the delay. I'll aim to get this done in the next few days.
Re: Optimize join selectivity estimation by not reading MCV stats for unique join attributes
Thanks everyone for the great feedback and suggestions.

Yes, it is. Using a zero flag would short-cut get_attstatsslot() to just return whether the slot type exists without loading it. Do you think we need to emphasize this use case in the comments for 'flags'? Perhaps, it's not really obvious now.

Comment added.

I wonder whether we need to also check statistic_proc_security_check() when determining if MCVs exist on both sides.

Yeah, I thought about hoisting the statistic_proc_security_check tests up into get_mcv_stats. I don't think it's a great idea though. Again, it'd complicate untangling this if we ever generalize the use of MCVs in this function. Also, I don't think we should be micro-optimizing the case where the security check doesn't pass --- if it doesn't, you're going to be hurting from bad plans a lot more than you are from some wasted cycles here.

Sounds reasonable.

Attached is v2 of the patch. This is basically Tom's version plus a comment for the flags of get_attstatsslot() as suggested by Richard. I couldn't come up with any reasonable way of writing an automated test for that. Any ideas?

--
David Geier
(ServiceNow)

From 66b16792e8e16dad33adb5614a0936cc41002f4c Mon Sep 17 00:00:00 2001
From: David Geier
Date: Fri, 18 Nov 2022 09:35:08 +0100
Subject: [PATCH] Don't read MCV stats needlessly in join selectivity
 estimation

Join selectivity estimation only uses MCV stats if they're present on both join attributes. Therefore, if MCV stats are not available on one of the two columns, skip reading MCV stats altogether. Situations in which MCV stats are not available are frequently encountered: unique columns, or columns for which all values have roughly the same frequency and for which ANALYZE therefore decided not to store MCV stats.
---
 src/backend/utils/adt/selfuncs.c    | 15 +++++++++++++--
 src/backend/utils/cache/lsyscache.c |  4 ++++
 2 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index d597b7e81f..5baab4e3ac 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -2263,6 +2263,7 @@ eqjoinsel(PG_FUNCTION_ARGS)
 	bool		have_mcvs2 = false;
 	bool		join_is_reversed;
 	RelOptInfo *inner_rel;
+	bool		get_mcv_stats;

 	get_join_variables(root, args, sjinfo, &vardata1, &vardata2, &join_is_reversed);
@@ -2275,11 +2276,21 @@ eqjoinsel(PG_FUNCTION_ARGS)
 	memset(&sslot1, 0, sizeof(sslot1));
 	memset(&sslot2, 0, sizeof(sslot2));

+	/*
+	 * There is no use in fetching one side's MCVs if we lack MCVs for the
+	 * other side, so do a quick check to verify that both stats exist.
+	 */
+	get_mcv_stats = (HeapTupleIsValid(vardata1.statsTuple) &&
+					 HeapTupleIsValid(vardata2.statsTuple) &&
+					 get_attstatsslot(&sslot1, vardata1.statsTuple, STATISTIC_KIND_MCV, InvalidOid, 0) &&
+					 get_attstatsslot(&sslot2, vardata2.statsTuple, STATISTIC_KIND_MCV, InvalidOid, 0));
+
 	if (HeapTupleIsValid(vardata1.statsTuple))
 	{
 		/* note we allow use of nullfrac regardless of security check */
 		stats1 = (Form_pg_statistic) GETSTRUCT(vardata1.statsTuple);
-		if (statistic_proc_security_check(&vardata1, opfuncoid))
+		if (get_mcv_stats && statistic_proc_security_check(&vardata1, opfuncoid))
 			have_mcvs1 = get_attstatsslot(&sslot1, vardata1.statsTuple,
 										  STATISTIC_KIND_MCV, InvalidOid,
 										  ATTSTATSSLOT_VALUES | ATTSTATSSLOT_NUMBERS);
@@ -2289,7 +2300,7 @@ eqjoinsel(PG_FUNCTION_ARGS)
 	{
 		/* note we allow use of nullfrac regardless of security check */
 		stats2 = (Form_pg_statistic) GETSTRUCT(vardata2.statsTuple);
-		if (statistic_proc_security_check(&vardata2, opfuncoid))
+		if (get_mcv_stats && statistic_proc_security_check(&vardata2, opfuncoid))
 			have_mcvs2 = get_attstatsslot(&sslot2, vardata2.statsTuple,
 										  STATISTIC_KIND_MCV, InvalidOid,
 										  ATTSTATSSLOT_VALUES | ATTSTATSSLOT_NUMBERS);
diff --git a/src/backend/utils/cache/lsyscache.c b/src/backend/utils/cache/lsyscache.c
index a16a63f495..191f68f7b4 100644
--- a/src/backend/utils/cache/lsyscache.c
+++ b/src/backend/utils/cache/lsyscache.c
@@ -3157,6 +3157,10 @@ get_attavgwidth(Oid relid, AttrNumber attnum)
 * reqkind: STAKIND code for desired statistics slot kind.
 * reqop: STAOP value wanted, or InvalidOid if don't care.
 * flags: bitmask of ATTSTATSSLOT_VALUES and/or ATTSTATSSLOT_NUMBERS.
+ *	Passing 0 will only check if the requested slot type exists but not
+ *	extract any content. This is useful in cases where MCV stats are only
+ *	useful if they exist on all involved columns (e.g. during join
+ *	selectivity computation).
 *
 * If a matching slot is found, true is returned, and *sslot is filled thus:
 * staop: receives the actual STAOP value.
--
2.34.1
Re: Assertion failure with barriers in parallel hash join
Hi Thomas, Can we make progress with this patch in the current commit fest, or discuss what is still missing to bring this in? Thanks! -- David Geier (ServiceNow) On 6/6/22 17:01, David Geier wrote: Hi Thomas, Correct. We're running with disabled parallel leader participation and we have to do so, because another custom plan node we built relies on that. That would be great. Anything else I can help with to get this patch in? Thanks! -- David (ServiceNow) On Fri, 3 Jun 2022 at 00:06, Thomas Munro wrote: On Thu, Jun 2, 2022 at 9:31 PM David Geier wrote: > We recently encountered the same bug in the field. Oleksii Kozlov managed to come up with reproduction steps, which reliably trigger it. Interestingly, the bug does not only manifest as failing assertion, but also as segmentation fault; in builds with disabled and with enabled (!) assertions. So it can crash production environments. We applied the proposed patch v3 from Melanie to the REL_14_3 branch and can confirm that with the patch neither the assertion nor the segmentation fault still occur. Thanks for the report, testing, and for creating the CF entry. I assume you are using parallel_leader_participation=off, and the reason we haven't heard more about this is because few people do that. By coincidence I was just about to restart a bunch of hash join projects and have been paging the topic area back into my brain, so I'll start with another round of testing/analysis of this bug/patch next week.
Re: Optimize join selectivity estimation by not reading MCV stats for unique join attributes
Hi Tom,

There won't *be* any MCV stats for a column that ANALYZE perceives to be unique, so I'm not quite sure where the claimed savings comes from.

We save if one join attribute is unique while the other isn't. In that case stored MCV stats are read for the non-unique attribute but then never used. This is because MCV stats in join selectivity estimation are only used if they're present on both columns.

Please provide a concrete example.

A super simple case already showing a significant speedup is the following. The more ways to join two tables and the more joins overall, the higher the expected gain.

CREATE TABLE bar(col INT UNIQUE);
CREATE TABLE foo (col INT);
INSERT INTO foo SELECT generate_series(1, 1000000, 0.5);
SET default_statistics_target = 10000;
ANALYZE foo, bar;
\timing on
EXPLAIN SELECT * FROM foo, bar WHERE foo.col = bar.col;

Running the above query five times gave me average runtimes of:
- 0.62 ms without the patch and
- 0.48 ms with the patch.

--
David Geier
(ServiceNow)
Optimize join selectivity estimation by not reading MCV stats for unique join attributes
Hi hackers, eqjoinsel() can be optimized by not reading MCV stats if at least one of the two join attributes is unique. As primary keys are implicitly unique this situation can occur frequently. For unique columns no MCV stats are stored and eqjoinsel_inner() and eqjoinsel_semi(), called from eqjoinsel(), only consider MCV stats in the join selectivity computation if they're present on both columns. Attached is a small patch that implements the skipping. With this change we saw some queries improve planning time by more than 2x, especially with larger values for default_statistics_target. That's because get_attstatsslot() deconstructs the array holding the MCV. The size of that array depends on default_statistics_target. Thanks for your consideration! -- David Geier (ServiceNow) From 7a8176b9eb9dd9982662d83b86996c5402378674 Mon Sep 17 00:00:00 2001 From: David Geier Date: Fri, 11 Nov 2022 12:33:11 +0100 Subject: [PATCH] Skip reading MCV stats for unique join attributes MCV stats are not stored for unique columns. If one of the two join attributes is unique, reading the MCV stats can be skipped for both columns because they are only ever used during join selectivity estimation if they're present on both attributes. 
---
 src/backend/utils/adt/selfuncs.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index d597b7e81f..3c9007c8d1 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -2263,6 +2263,7 @@ eqjoinsel(PG_FUNCTION_ARGS)
 	bool		have_mcvs2 = false;
 	bool		join_is_reversed;
 	RelOptInfo *inner_rel;
+	bool		get_mcv_stats;

 	get_join_variables(root, args, sjinfo, &vardata1, &vardata2, &join_is_reversed);
@@ -2270,6 +2271,7 @@ eqjoinsel(PG_FUNCTION_ARGS)
 	nd1 = get_variable_numdistinct(&vardata1, &isdefault1);
 	nd2 = get_variable_numdistinct(&vardata2, &isdefault2);
+	get_mcv_stats = !vardata1.isunique && !vardata2.isunique;

 	opfuncoid = get_opcode(operator);

 	memset(&sslot1, 0, sizeof(sslot1));
@@ -2279,7 +2281,7 @@ eqjoinsel(PG_FUNCTION_ARGS)
 	{
 		/* note we allow use of nullfrac regardless of security check */
 		stats1 = (Form_pg_statistic) GETSTRUCT(vardata1.statsTuple);
-		if (statistic_proc_security_check(&vardata1, opfuncoid))
+		if (get_mcv_stats && statistic_proc_security_check(&vardata1, opfuncoid))
 			have_mcvs1 = get_attstatsslot(&sslot1, vardata1.statsTuple,
 										  STATISTIC_KIND_MCV, InvalidOid,
 										  ATTSTATSSLOT_VALUES | ATTSTATSSLOT_NUMBERS);
@@ -2289,7 +2291,7 @@ eqjoinsel(PG_FUNCTION_ARGS)
 	{
 		/* note we allow use of nullfrac regardless of security check */
 		stats2 = (Form_pg_statistic) GETSTRUCT(vardata2.statsTuple);
-		if (statistic_proc_security_check(&vardata2, opfuncoid))
+		if (get_mcv_stats && statistic_proc_security_check(&vardata2, opfuncoid))
 			have_mcvs2 = get_attstatsslot(&sslot2, vardata2.statsTuple,
 										  STATISTIC_KIND_MCV, InvalidOid,
 										  ATTSTATSSLOT_VALUES | ATTSTATSSLOT_NUMBERS);
--
2.34.1
Re: Add explicit casts in four places to simplehash.h
On 11/3/22 15:50, Tom Lane wrote: Seems reasonable, so done (I fixed one additional spot you missed). Thanks! The bigger picture here is that we do actually endeavor to keep (most of) our headers C++ clean, but our tool cpluspluscheck misses these problems because it doesn't try to use these macros. I wonder whether there is a way to do better. What about having a custom header alongside cpluspluscheck which references all macros we care about? We could start with the really big macros like the ones from simplehash.h and add as we go. I could give this a try if deemed useful. -- David Geier (ServiceNow)
Add explicit casts in four places to simplehash.h
Hi hackers, While working on an extension, I found that simplehash.h is missing explicit casts in four places. Without these casts, compiling code including simplehash.h yields warnings if the code is compiled with -Wc++-compat. PostgreSQL seems to mostly prefer omitting the explicit casts, however there are many places where an explicit cast is actually used. Among many others, see e.g. bool.c: state = (BoolAggState *) MemoryContextAlloc(agg_context, sizeof(BoolAggState)); domains.c: my_extra = (DomainIOData *) MemoryContextAlloc(mcxt, sizeof(DomainIOData)); What about, while not being strictly necessary for PostgreSQL itself, also adding such casts to simplehash.h so that it can be used in code where -Wc++-compat is enabled? Attached is a small patch that adds the aforementioned casts. Thanks for your consideration! -- David Geier (ServiceNow) From d2665b4065227ffa6efd37ee912e4add082869cb Mon Sep 17 00:00:00 2001 From: David Geier Date: Thu, 3 Nov 2022 12:28:08 +0100 Subject: [PATCH] Add explicit casts to simplehash.h Without these casts compilation yields warnings when compiling the code with -Wc++-compat. 
--- src/include/lib/simplehash.h | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/src/include/lib/simplehash.h b/src/include/lib/simplehash.h index 329687c1a5..cdfae537a1 100644 --- a/src/include/lib/simplehash.h +++ b/src/include/lib/simplehash.h @@ -438,7 +438,7 @@ SH_CREATE(MemoryContext ctx, uint32 nelements, void *private_data) #ifdef SH_RAW_ALLOCATOR tb = SH_RAW_ALLOCATOR(sizeof(SH_TYPE)); #else - tb = MemoryContextAllocZero(ctx, sizeof(SH_TYPE)); + tb = (SH_TYPE *) MemoryContextAllocZero(ctx, sizeof(SH_TYPE)); tb->ctx = ctx; #endif tb->private_data = private_data; @@ -448,7 +448,7 @@ SH_CREATE(MemoryContext ctx, uint32 nelements, void *private_data) SH_COMPUTE_PARAMETERS(tb, size); - tb->data = SH_ALLOCATE(tb, sizeof(SH_ELEMENT_TYPE) * tb->size); + tb->data = (SH_ELEMENT_TYPE *) SH_ALLOCATE(tb, sizeof(SH_ELEMENT_TYPE) * tb->size); return tb; } @@ -493,7 +493,7 @@ SH_GROW(SH_TYPE * tb, uint64 newsize) /* compute parameters for new table */ SH_COMPUTE_PARAMETERS(tb, newsize); - tb->data = SH_ALLOCATE(tb, sizeof(SH_ELEMENT_TYPE) * tb->size); + tb->data = (SH_ELEMENT_TYPE *) SH_ALLOCATE(tb, sizeof(SH_ELEMENT_TYPE) * tb->size); newdata = tb->data; @@ -1059,7 +1059,7 @@ SH_STAT(SH_TYPE * tb) double fillfactor; uint32 i; - uint32 *collisions = palloc0(tb->size * sizeof(uint32)); + uint32 *collisions = (uint32 *) palloc0(tb->size * sizeof(uint32)); uint32 total_collisions = 0; uint32 max_collisions = 0; double avg_collisions; -- 2.34.1
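The warning class the patch silences can be reproduced with any `void *`-returning allocator. A minimal example (using plain `calloc` rather than PostgreSQL's allocators, purely for illustration) of the cast-vs-no-cast distinction that -Wc++-compat checks:

```c
#include <assert.h>
#include <stdlib.h>

/* Both forms are valid C, but only the cast form is also valid C++:
 * C++ has no implicit conversion from void * to other object pointer
 * types, which is exactly what -Wc++-compat warns about. */
static unsigned int *alloc_counters(size_t n)
{
    /* unsigned int *c = calloc(n, sizeof(unsigned int));   -- C only */
    unsigned int *c = (unsigned int *) calloc(n, sizeof(unsigned int));

    return c;
}
```

Because simplehash.h is a header-only template included directly into client translation units, any allocation it performs without a cast surfaces as a warning (or error, under a C++ compiler) in every including project, which is why adding the four casts there is worthwhile even though core PostgreSQL compiles as C.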
Re: Reducing planning time on tables with many indexes
On 8/1/22 15:33, David Geier wrote: Hi Tom, On Wed, Jul 27, 2022 at 7:15 PM Tom Lane wrote: I wrote: > Unfortunately, as things stand today, the planner needs more than the > right to look at the indexes' schemas, because it makes physical accesses > to btree indexes to find out their tree height (and I think there are some > comparable behaviors in other AMs). I've never particularly cared for > that implementation, and would be glad to rip out that behavior if we can > find another way. Maybe we could persuade VACUUM or ANALYZE to store that > info in the index's pg_index row, or some such, and then the planner > could use it with no lock? It seems like _bt_getrootheight() first checks if the height is cached and only if it isn't it accesses index meta pages. If the index locks are only taken for the sake of _bt_getrootheight() accessing index meta pages in case they are not cached, maybe the index locks could be taken conditionally. However, postponing the call to where it is really needed sounds even better. A first step here could just be to postpone fetching _bt_getrootheight() until we actually need it during cost estimation. That would avoid the need to do it at all for indexes that indxpath.c discards as irrelevant, which is a decision made on considerably more information than the proposed patch uses. Hi Tom, I gave the idea of moving _bt_getrootheight() into costsize.c and filling IndexOptInfo in get_relation_info() via syscache instead of relcache a try, but didn't get very far. Moving out _bt_getrootheight() was straightforward, and we should do nevertheless. However, it seems like get_relation_info() strongly depends on the index's Relation for quite some stuff. A fair amount of fields I could actually fill from syscache, but there are some that either need data not stored in syscache (e.g. 
estimate_rel_size(), Relation::rd_smgr needed by RelationGetNumberOfBlocksInFork()) or need fields that are cached in the index's Relation and would have to be recomputed otherwise (e.g. Relation::rd_indexprs filled by RelationGetIndexExpressions(), Relation::rd_indpred filled by RelationGetIndexPredicate()). Even if we could somehow obtain the missing info from somewhere, recomputing the otherwise cached fields from Relation would likely cause a significant slowdown in the serial case.

Beyond that I did some off-CPU profiling to precisely track down which lock serializes execution. It turned out to be the MyProc::fpInfoLock lightweight lock. This lock is used in the fast path of the heavyweight lock. In the contending case, fpInfoLock is acquired in LW_EXCLUSIVE mode to (1) check if there is no other process holding a stronger lock, and if not, to reserve a process-local fast path lock slot and (2) to return the fast path lock slots all in one go. To do so, the current implementation always linearly iterates over all lock slots. The corresponding call stacks are:

get_relation_info()
  index_open()
    relation_open()
      LockRelationOid()
        LockAcquireExtended()
          LWLockAcquire()

CommitTransaction()
  ResourceOwnerRelease()
    ResourceOwnerReleaseInternal()
      ProcReleaseLocks()
        LockReleaseAll() <-- called twice from ProcReleaseLocks()

On top of that there are only 16 fast path lock slots. One slot is always taken up by the parent relation, leaving only 15 slots for the indexes. As soon as a session process runs out of slots, it falls back to the normal lock path which has to mess around with the lock table. To do so it also acquires a lightweight lock in LW_EXCLUSIVE mode. This lightweight lock however is partitioned and therefore does not contend. Hence, normal lock acquisition is slower but contends less. To prove the above findings I increased the number of fast path lock slots per connection and optimized FastPathGrantRelationLock() and FastPathUnGrantRelationLock().
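The slot bookkeeping described above can be sketched as follows. This is a made-up illustration of the idea (the real FastPathGrantRelationLock()/FastPathUnGrantRelationLock() live in src/backend/storage/lmgr/lock.c and differ in detail): a per-backend array of slots plus an occupancy bitmask, so that releasing only visits occupied slots instead of linearly scanning all 16 entries. It uses the GCC/Clang `__builtin_ctz` builtin:

```c
#include <assert.h>
#include <stdint.h>

#define FP_LOCK_SLOTS_PER_BACKEND 16

typedef struct
{
    uint32_t relid[FP_LOCK_SLOTS_PER_BACKEND];  /* relation OID per slot */
    uint16_t used;                              /* bitmask of occupied slots */
} FastPathSlots;

/* Grant: take the first free slot; on failure the caller falls back to the
 * shared (partitioned) lock table. */
static int fp_grant(FastPathSlots *fp, uint32_t relid)
{
    for (int i = 0; i < FP_LOCK_SLOTS_PER_BACKEND; i++)
    {
        if ((fp->used & (1u << i)) == 0)
        {
            fp->relid[i] = relid;
            fp->used |= (uint16_t) (1u << i);
            return 1;
        }
    }
    return 0;
}

/* Ungrant: iterate over set bits only, so a mostly-empty slot array is
 * released in O(number of held locks) rather than O(16). */
static int fp_ungrant(FastPathSlots *fp, uint32_t relid)
{
    unsigned int m = fp->used;

    while (m != 0)
    {
        int i = __builtin_ctz(m);       /* index of lowest set bit */

        m &= m - 1;                     /* clear it for the next iteration */
        if (fp->relid[i] == relid)
        {
            fp->used &= (uint16_t) ~(1u << i);
            return 1;
        }
    }
    return 0;
}
```

The same bitmask trick makes the LockReleaseAll() path cheap when a backend holds few fast-path locks, which is where the profiled linear iteration showed up.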
With these changes the lock contention disappeared and the workload scales linearly (the code I tested with also included moving out _bt_getrootheight()):

| Parallel streams | TPS     | TPS / stream |
|------------------|---------|--------------|
| 1                | 5,253   | 5,253        |
| 10               | 51,406  | 5,140        |
| 20               | 101,401 | 5,070        |
| 30               | 152,023 | 5,067        |
| 40               | 200,607 | 5,015        |
| 50               | 245,359 | 4,907        |
| 60               | 302,994 | 5,049        |

However, with the very same setup, the index filtering approach yields 486k TPS with 60 streams and 9,827 TPS with a single stream. The single-stream number shows that this is not because it scales even better, but just because less work is spent during planning. A quick perf session showed that a fair amo
Re: Reducing planning time on tables with many indexes
Hi Tom,

On Wed, Jul 27, 2022 at 6:39 PM Tom Lane wrote:
> David Geier writes:
> > We tracked down the root cause of this slowdown to lock contention in
> > 'get_relation_info()'. The index lock of every single index of every single
> > table used in that query is acquired. We attempted a fix by pre-filtering
> > out all indexes that anyways cannot be used with a certain query, without
> > taking the index locks (credits to Luc Vlaming for idea and
> > implementation). The patch does so by caching the columns present in every
> > index, inside 'struct Relation', similarly to 'rd_indexlist'.
>
> I wonder how much thought you gave to the costs imposed by that extra
> cache space. We have a lot of users who moan about relcache bloat
> already.

The current representation could be compacted (e.g. by storing the column indexes as 16-bit integers, instead of using a Bitmapset). That should make the additional space needed negligible compared to the current size of RelationData. On top of that we could maybe reorder the members of RelationData to save padding bytes. Currently, there is lots of interleaving of members of different sizes. Even when taking cache locality into consideration it looks like a fair amount could be saved, probably accounting for the additional space needed to store the index column data.

> But more to the point, I do not buy the assumption that
> an index's set of columns is a good filter for which indexes are of
> interest. A trivial counterexample from the regression database is
>
> regression=# explain select count(*) from tenk1;
>                                    QUERY PLAN
> ------------------------------------------------------------------------------------------
>  Aggregate  (cost=219.28..219.29 rows=1 width=8)
>    ->  Index Only Scan using tenk1_hundred on tenk1  (cost=0.29..194.28 rows=10000 width=0)
> (2 rows)

Only for queries without index conditions, the current version of the patch simply discards all indexes but the one with the least columns. This is case (3) in s64_IsUnnecessaryIndex().
This is an over-simplification with the goal of picking the index which yields the least I/O. For single-column indexes that works, but it can fall short for multi-column indexes (e.g. an [INT, TEXT] index and an [INT, INT] index both have 2 columns, but the latter would be better suited when there's no other index and we want to read the first integer column). What we should do here instead is to discard indexes based on storage size.

> It looks to me like the patch also makes unwarranted assumptions about
> being able to discard all but the smallest index having a given set
> of columns. This would, for example, possibly lead to dropping the
> index that has the most useful sort order, or that has the operator
> class needed to support a specific WHERE clause.

Why would that be? If we keep all indexes that contain columns that are used by the query, we also keep the indexes which provide a certain sort order or operator class.

> In short, I'm not sure I buy this concept at all. I think it might
> be more useful to attack the locking overhead more directly. I kind
> of wonder why we need per-index locks at all during planning ---
> I think that we already interpret AccessShareLock on the parent table
> as being sufficient to block schema changes on existing indexes.

As I said in the reply to your other mail, there's huge savings also in the serial case where lock contention is not an issue. It seems like pre-filtering saves work down the road. I'll do some perf runs to track down where exactly the savings come from. One source I can think of is only having to consider a subset of all indexes during path creation.

> Unfortunately, as things stand today, the planner needs more than the
> right to look at the indexes' schemas, because it makes physical accesses
> to btree indexes to find out their tree height (and I think there are some
> comparable behaviors in other AMs).
> I've never particularly cared for
> that implementation, and would be glad to rip out that behavior if we can
> find another way. Maybe we could persuade VACUUM or ANALYZE to store that
> info in the index's pg_index row, or some such, and then the planner
> could use it with no lock?

That's another interesting approach, but I would love to save the planner cycles for the sequential case.

--
David Geier
(ServiceNow)
Re: Reducing planning time on tables with many indexes
Hi Tom,

On Wed, Jul 27, 2022 at 7:15 PM Tom Lane wrote:
> I wrote:
> > Unfortunately, as things stand today, the planner needs more than the
> > right to look at the indexes' schemas, because it makes physical accesses
> > to btree indexes to find out their tree height (and I think there are some
> > comparable behaviors in other AMs). I've never particularly cared for
> > that implementation, and would be glad to rip out that behavior if we can
> > find another way. Maybe we could persuade VACUUM or ANALYZE to store that
> > info in the index's pg_index row, or some such, and then the planner
> > could use it with no lock?

It seems like _bt_getrootheight() first checks if the height is cached, and only if it isn't does it access the index meta page. If the index locks are only taken for the sake of _bt_getrootheight() accessing the index meta page when the height is not cached, maybe the index locks could be taken conditionally. However, postponing the call to where it is really needed sounds even better.

> A first step here could just be to postpone fetching _bt_getrootheight()
> until we actually need it during cost estimation. That would avoid the
> need to do it at all for indexes that indxpath.c discards as irrelevant,
> which is a decision made on considerably more information than the
> proposed patch uses.
>
> Having done that, you could look into revising plancat.c to fill the
> IndexOptInfo structs from catcache entries instead of opening the
> index per se. (You'd have to also make sure that the appropriate
> index locks are acquired eventually, for indexes the query does use
> at runtime. I think that's the case, but I'm not sure if anything
> downstream has been optimized on the assumption the planner did it.)

I can give this a try. That way we would get rid of the scalability issues. However, what about the runtime savings observed with a single query stream?
In that case there is no contention, so it seems like having fewer indexes to look at further down the road also yields substantial savings. Any clue where exactly these savings might come from? Or is it actually the calls to _bt_getrootheight()? I can also do a few perf runs to track that down.

> This'd probably get us a large part of the way there. Further
> optimization of acquisition of tree height etc could be an
> optional follow-up.
>
> regards, tom lane
Re: [PoC] Reducing planning time on tables with many indexes
Sorry, by accident I sent this one out twice.

--
David Geier
(ServiceNow)

On Wed, Jul 27, 2022 at 2:42 PM David Geier wrote:
> Hi hackers,
>
> We came across a slowdown in planning, where queries use tables with many
> indexes. In setups with wide tables it is not uncommon to have easily
> 10-100 indexes on a single table. The slowdown is already visible in serial
> workloads with just a handful of indexes, but gets drastically amplified
> when running queries with more indexes in parallel at high throughput.
>
> We measured the TPS and planning time of running parallel streams of
> simple point look-up queries on a single empty table with 60 columns and 60
> indexes. The query used is 'SELECT * FROM synth_table WHERE col5 = 42'. No
> rows are returned because the table is empty. We used a machine with 64
> physical CPU cores. The schema and sysbench script to reproduce these
> numbers are attached. We used the TPS as reported by sysbench and obtained
> the planning time by running 'EXPLAIN ANALYZE' on the same query in a
> separately opened connection. We averaged the planning time of 3 successive
> 'EXPLAIN ANALYZE' runs. sysbench ran on the same machine with varying
> numbers of threads using the following command line:
>
> sysbench repro.lua --db-driver=pgsql --pgsql-host=localhost
> --pgsql-db=postgres --pgsql-port=? --pgsql-user=? --pgsql-password=?
> --report-interval=1 --threads=64 run
>
> The following table shows the results. It is clearly visible that the TPS
> flattens out already at 8 parallel streams, while the planning time
> increases drastically.
>
> Parallel streams | TPS (before) | Planning time (before)
> -----------------|--------------|-----------------------
>  1               |  5,486       | 0.13 ms
>  2               |  8,098       | 0.22 ms
>  4               | 15,013       | 0.19 ms
>  8               | 27,107       | 0.29 ms
> 16               | 30,938       | 0.43 ms
> 32               | 26,330       | 1.68 ms
> 64               | 24,314       | 2.48 ms
>
> We tracked down the root cause of this slowdown to lock contention in
> 'get_relation_info()'.
The index lock of every single index of every single
> table used in that query is acquired. We attempted a fix by pre-filtering
> out all indexes that anyways cannot be used with a certain query, without
> taking the index locks (credits to Luc Vlaming for idea and implementation).
> The patch does so by caching the columns present in every index, inside
> 'struct Relation', similarly to 'rd_indexlist'. Then, before opening
> (= locking) the indexes in 'get_relation_info()', we check if the index can
> actually contribute to the query, and if not it is discarded right away.
> Caching the index info saves considerable work for every query run
> subsequently, because fewer indexes must be inspected and thereby locked.
> This way we also save cycles in any code that later on goes over all
> relation indexes.
>
> The work-in-progress version of the patch is attached. It is still fairly
> rough (e.g. uses a global variable, selects the best index in scans without
> restrictions by column count instead of physical column size, is missing
> some renaming, etc.), but shows the principle.
>
> The following table shows the TPS, planning time and speed-ups after
> applying the patch and rerunning the benchmark described above. Now the
> planning time remains roughly constant and the TPS roughly doubles each time
> the number of parallel streams is doubled. The higher the stream count, the
> more severe the lock contention is and the more pronounced the gained
> speed-up gets. Interestingly, even for a single query stream the speed-up
> in planning time is already very significant. This also applies to lower
> index counts. For example, with just 10 indexes the TPS for a single query
> stream goes from 9,159 to 12,558. We can do more measurements if there is
> interest in details for a lower number of indexes.
> Parallel streams | TPS (after) | Planning time (after) | Speed-up TPS | Speed-up planning
> -----------------|-------------|-----------------------|--------------|------------------
>  1               |  10,344     | 0.046 ms              |  1.9x        |  2.8x
>  2               |  20,140     | 0.045 ms              |  2.5x        |  4.9x
>  4               |  40,349     | 0.047 ms              |  2.7x        |  4.0x
>  8               |  80,121     | 0.046 ms              |  3.0x        |  6.3x
> 16               | 152,632     | 0.051 ms              |  4.9x        |  8.4x
> 32               | 301,359     | 0.052 ms              | 11.4x        | 32.3x
> 64               | 525,115     | 0.062 ms              | 21.6x        | 40.0x
>
> We are happy to receive your feedback and polish up the patch.
>
> --
> David Geier
> (ServiceNow)
[PoC] Reducing planning time on tables with many indexes
Hi hackers,

We came across a slowdown in planning, where queries use tables with many indexes. In setups with wide tables it is not uncommon to have easily 10-100 indexes on a single table. The slowdown is already visible in serial workloads with just a handful of indexes, but gets drastically amplified when running queries with more indexes in parallel at high throughput.

We measured the TPS and planning time of running parallel streams of simple point look-up queries on a single empty table with 60 columns and 60 indexes. The query used is 'SELECT * FROM synth_table WHERE col5 = 42'. No rows are returned because the table is empty. We used a machine with 64 physical CPU cores. The schema and sysbench script to reproduce these numbers are attached. We used the TPS as reported by sysbench and obtained the planning time by running 'EXPLAIN ANALYZE' on the same query in a separately opened connection. We averaged the planning time of 3 successive 'EXPLAIN ANALYZE' runs. sysbench ran on the same machine with varying numbers of threads using the following command line:

sysbench repro.lua --db-driver=pgsql --pgsql-host=localhost --pgsql-db=postgres --pgsql-port=? --pgsql-user=? --pgsql-password=? --report-interval=1 --threads=64 run

The following table shows the results. It is clearly visible that the TPS flattens out already at 8 parallel streams, while the planning time increases drastically.

Parallel streams | TPS (before) | Planning time (before)
-----------------|--------------|-----------------------
 1               |  5,486       | 0.13 ms
 2               |  8,098       | 0.22 ms
 4               | 15,013       | 0.19 ms
 8               | 27,107       | 0.29 ms
16               | 30,938       | 0.43 ms
32               | 26,330       | 1.68 ms
64               | 24,314       | 2.48 ms

We tracked down the root cause of this slowdown to lock contention in 'get_relation_info()'. The index lock of every single index of every single table used in that query is acquired. We attempted a fix by pre-filtering out all indexes that anyways cannot be used with a certain query, without taking the index locks (credits to Luc Vlaming for idea and implementation).
The patch does so by caching the columns present in every index, inside 'struct Relation', similarly to 'rd_indexlist'. Then, before opening (= locking) the indexes in 'get_relation_info()', we check if the index can actually contribute to the query, and if not it is discarded right away. Caching the index info saves considerable work for every query run subsequently, because fewer indexes must be inspected and thereby locked. This way we also save cycles in any code that later on goes over all relation indexes.

The work-in-progress version of the patch is attached. It is still fairly rough (e.g. uses a global variable, selects the best index in scans without restrictions by column count instead of physical column size, is missing some renaming, etc.), but shows the principle.

The following table shows the TPS, planning time and speed-ups after applying the patch and rerunning the benchmark described above. Now the planning time remains roughly constant and the TPS roughly doubles each time the number of parallel streams is doubled. The higher the stream count, the more severe the lock contention is and the more pronounced the gained speed-up gets. Interestingly, even for a single query stream the speed-up in planning time is already very significant. This also applies to lower index counts. For example, with just 10 indexes the TPS for a single query stream goes from 9,159 to 12,558. We can do more measurements if there is interest in details for a lower number of indexes.

Parallel streams | TPS (after) | Planning time (after) | Speed-up TPS | Speed-up planning
-----------------|-------------|-----------------------|--------------|------------------
 1               |  10,344     | 0.046 ms              |  1.9x        |  2.8x
 2               |  20,140     | 0.045 ms              |  2.5x        |  4.9x
 4               |  40,349     | 0.047 ms              |  2.7x        |  4.0x
 8               |  80,121     | 0.046 ms              |  3.0x        |  6.3x
16               | 152,632     | 0.051 ms              |  4.9x        |  8.4x
32               | 301,359     | 0.052 ms              | 11.4x        | 32.3x
64               | 525,115     | 0.062 ms              | 21.6x        | 40.0x

We are happy to receive your feedback and polish up the patch.
--
David Geier
(ServiceNow)

From 60e300b5cac8f527e61483296df81ab783670ac9 Mon Sep 17 00:00:00 2001
From: David Geier
Date: Fri, 15 Jul 2022 11:53:27 +0200
Subject: [PATCH] Index filtering

---
 src/backend/optimizer/path/Makefile          |   3 +-
 src/backend/optimizer/path/index_filtering.c | 220 ++
 src/backend/optimizer/plan/planmain.c        |   7 +
 src/backend/optimizer/util/plancat.c         | 124 ++
 src/backend/utils/cache/relcache.c           |  17 +-
 src/include
Reducing planning time on tables with many indexes
Hi hackers,

We came across a slowdown in planning, where queries use tables with many indexes. In setups with wide tables it is not uncommon to have easily 10-100 indexes on a single table. The slowdown is already visible in serial workloads with just a handful of indexes, but gets drastically amplified when running queries with more indexes in parallel at high throughput.

We measured the TPS and planning time of running parallel streams of simple point look-up queries on a single empty table with 60 columns and 60 indexes. The query used is 'SELECT * FROM synth_table WHERE col5 = 42'. No rows are returned because the table is empty. We used a machine with 64 physical CPU cores. The schema and sysbench script to reproduce these numbers are attached. We used the TPS as reported by sysbench and obtained the planning time by running 'EXPLAIN ANALYZE' on the same query in a separately opened connection. We averaged the planning time of 3 successive 'EXPLAIN ANALYZE' runs. sysbench ran on the same machine with varying numbers of threads using the following command line:

sysbench repro.lua --db-driver=pgsql --pgsql-host=localhost --pgsql-db=postgres --pgsql-port=? --pgsql-user=? --pgsql-password=? --report-interval=1 --threads=64 run

The following table shows the results. It is clearly visible that the TPS flattens out already at 8 parallel streams, while the planning time increases drastically.

Parallel streams | TPS (before) | Planning time (before)
-----------------|--------------|-----------------------
 1               |  5,486       | 0.13 ms
 2               |  8,098       | 0.22 ms
 4               | 15,013       | 0.19 ms
 8               | 27,107       | 0.29 ms
16               | 30,938       | 0.43 ms
32               | 26,330       | 1.68 ms
64               | 24,314       | 2.48 ms

We tracked down the root cause of this slowdown to lock contention in 'get_relation_info()'. The index lock of every single index of every single table used in that query is acquired. We attempted a fix by pre-filtering out all indexes that anyways cannot be used with a certain query, without taking the index locks (credits to Luc Vlaming for idea and implementation).
The patch does so by caching the columns present in every index, inside 'struct Relation', similarly to 'rd_indexlist'. Then, before opening (= locking) the indexes in 'get_relation_info()', we check if the index can actually contribute to the query, and if not it is discarded right away. Caching the index info saves considerable work for every query run subsequently, because fewer indexes must be inspected and thereby locked. This way we also save cycles in any code that later on goes over all relation indexes.

The work-in-progress version of the patch is attached. It is still fairly rough (e.g. uses a global variable, selects the best index in scans without restrictions by column count instead of physical column size, is missing some renaming, etc.), but shows the principle.

The following table shows the TPS, planning time and speed-ups after applying the patch and rerunning the benchmark described above. Now the planning time remains roughly constant and the TPS roughly doubles each time the number of parallel streams is doubled. The higher the stream count, the more severe the lock contention is and the more pronounced the gained speed-up gets. Interestingly, even for a single query stream the speed-up in planning time is already very significant. This also applies to lower index counts. For example, with just 10 indexes the TPS for a single query stream goes from 9,159 to 12,558. We can do more measurements if there is interest in details for a lower number of indexes.

Parallel streams | TPS (after) | Planning time (after) | Speed-up TPS | Speed-up planning
-----------------|-------------|-----------------------|--------------|------------------
 1               |  10,344     | 0.046 ms              |  1.9x        |  2.8x
 2               |  20,140     | 0.045 ms              |  2.5x        |  4.9x
 4               |  40,349     | 0.047 ms              |  2.7x        |  4.0x
 8               |  80,121     | 0.046 ms              |  3.0x        |  6.3x
16               | 152,632     | 0.051 ms              |  4.9x        |  8.4x
32               | 301,359     | 0.052 ms              | 11.4x        | 32.3x
64               | 525,115     | 0.062 ms              | 21.6x        | 40.0x

We are happy to receive your feedback and polish up the patch.
--
David Geier
(ServiceNow)

function thread_init()
    drv = sysbench.sql.driver()
    con = drv:connect()
end

function thread_done()
    con:disconnect()
end

function event()
    con:query('SELECT * FROM synth_table WHERE col05 = 52')
end

schema.sql Description: application/sql

From 60e300b5cac8f527e61483296df81ab783670ac9 Mon Sep 17 00:00:00 2001
From: David Geier
Date: Fri, 15 Jul 2022 11:53:27 +0200
Subject: [PATCH] Index filtering

---
 src/backend/optimizer/path/Makefile          |   3 +-
 src/backend
Re: Lazy JIT IR code generation to increase JIT speed with partitions
Can you elaborate a bit more on how you conclude that? Looking at the numbers I measured in one of my previous e-mails, it looks to me like the overhead of using multiple modules is fairly low and only measurable in queries with dozens of modules. Given that JIT is most useful in queries that process a fair amount of rows, having to spend marginally more time on creating the JIT program while being able to use JIT much more fine-grained seems desirable. For example, the time you lose for handling more modules, you save right away because not the whole plan gets JIT-compiled.

It is a trade-off between optimizing for the best case, where everything in the plan can truly benefit from JITting and hence a single module that has it all is best, vs. the worst case, where almost nothing truly profits from JITting and hence only a small fraction of the plan should actually be JITted. The penalty for the best case seems low though, because (1) the overhead is low in absolute terms, and (2) if the entire plan truly benefits from JITting, spending sub-ms more per node seems negligible because significant time is going to be spent anyway.

--
David Geier
(ServiceNow)

On 7/4/22 22:23, Andres Freund wrote:
Hi,

On 2022-07-04 06:43:00 +, Luc Vlaming Hummel wrote:
Thanks for reviewing this and the interesting examples! Wanted to give a bit of extra insight as to why I'd love to have a system that can lazily emit JIT code and hence creates roughly a module per function: In the end I'm hoping that we can migrate to a system where we only JIT after a configurable cost has been exceeded for this node, as well as a configurable amount of rows has actually been processed. Reason is that this would safeguard against some problematic planning issues wrt JIT (node not being executed, row count being massively off).

I still don't see how it's viable to move to always doing function-by-function emission overhead-wise.
I also want to do JIT in the background, triggered by actual usage. But to me it seems a dead end to require moving to a one-function-per-module model for that.

> If this means we have to invest more in making it cheap(er) to emit modules, I'm all for that.

I think that's just inherently more expensive and thus a no-go.

> @Andres if there's any other things we ought to fix to make this cheap (enough) compared to the previous code I'd love to know your thoughts.

I'm not seeing it.

Greetings,

Andres Freund
Re: Lazy JIT IR code generation to increase JIT speed with partitions
On Mon, Jul 4, 2022 at 10:32 PM Andres Freund wrote:
> Hi,
>
> On 2022-06-27 16:55:55 +0200, David Geier wrote:
> > Indeed, the total JIT time increases the more modules are used. The reason
> > for this to happen is that the inlining pass loads and deserializes all to
> > be inlined modules (.bc files) from disk prior to inlining them via
> > llvm::IRMover. There's already a cache for such modules in the code, but it
> > is currently unused. This is because llvm::IRMover takes the module to be
> > inlined as std::unique_ptr. The by-value argument requires
> > the source module to be moved, which means it cannot be reused afterwards.
> > The code is accounting for that by erasing the module from the cache after
> > inlining it, which in turn requires reloading the module next time a
> > reference to it is encountered.
> >
> > Instead of each time loading and deserializing all to be inlined modules
> > from disk, they can reside in the cache and instead be cloned via
> > llvm::CloneModule() before they get inlined. Key to calling
> > llvm::CloneModule() is fully deserializing the module upfront, instead of
> > loading the module lazily. That is why I changed the call from
> > LLVMGetBitcodeModuleInContext2() (which lazily loads the module via
> > llvm::getOwningLazyBitcodeModule()) to LLVMParseBitCodeInContext2() (which
> > fully loads the module via llvm::parseBitcodeFile()). Beyond that, it seems
> > that prior to LLVM 13, cloning modules could fail with an assertion
> > (not sure though if that would cause problems in a release build without
> > assertions). Andres reported this problem back in the days here [1]. In the
> > meanwhile the issue got discussed in [2] and finally fixed for LLVM 13, see
> > [3].
>
> Unfortunately that doesn't work right now - that's where I had started. The
> problem is that IRMover renames types.
Which, in the case of cloned modules,
> unfortunately means that types used by cloned modules are also renamed in the
> "origin" module. Which then causes problems down the line, because parts of
> the LLVM code match types by type names.
>
> That can then have the effect of drastically decreasing code generation
> quality over time, because e.g. inlining never manages to find signatures
> compatible.

Looking at the LLVM patches, these issues seem to be gone. Dumping the bitcode and running all tests with JIT force-enabled suggests the same. See below.

> > However, curiously the time spent on optimizing is also reduced (95ms
> > instead of 164ms). Could this be because some of the applied optimizations
> > are ending up in the cached module?
>
> I suspect it's more that optimization stops being able to do a lot, due to the
> type renaming issue.
>
> > @Andres: could you provide me with the queries that caused the assertion
> > failure in LLVM?
>
> I don't think I have the concrete query. What I tend to do is to run the whole
> regression tests with forced JITing. I'm fairly certain this triggered the bug
> at the time.
>
> > Have you ever observed a segfault with a non-assert-enabled build?
>
> I think I observed bogus code generation that then could lead to segfaults or
> such.

OK, then using a non-assert-enabled LLVM build seems good enough. Especially given the fixes they merged in [3].

> > I just want to make sure this is truly fixed in LLVM 13. Running 'make
> > check-world' all tests passed.
>
> With jit-ing forced for everything?
>
> One more thing to try is to jit-compile twice and ensure the code is the
> same. It certainly wasn't in the past due to the above issue.

Yes, with JITting force-enabled everything passes (I removed all checks for the various 'jit_above' costs from planner.c).
I also dumped the bitcode via 'SET jit_dump_bitcode = ON' and compared the bitcode output after inlining and after optimization for two successive executions of the very same query. The .bc files after inlining only differ in two bytes. The .bc files after optimization differ in a lot more bytes. However, the same is true for the master branch. How exactly do you conclude whether the compilation output is identical?

--
David Geier
(ServiceNow)
Improving scalability of Parallel Bitmap Heap/Index Scan
Hi hackers,

While debugging some slow queries containing Bitmap Heap/Index Scans (in short BHS / BIS), we observed a few issues regarding scalability:

1. The BIS always only runs in a single process, also when the parent BHS is parallel. The first process arriving in the BHS serves as leader and executes the BIS.

2. As long as execution is "exact" (TIDs are stored instead of page bits), the parallel BHS sorts all TIDs to ensure pages are accessed sequentially. The sort is also performed by just a single worker. Already with a few tens of thousands of pages to scan, the sort time can make up a significant portion of the total runtime. Large page counts and the need for parallelism are not uncommon for a BHS, as one use case is closing the gap between index and sequential scans. The BHS costing seems to not account for that.

3. The BHS does not scale well with an increasing number of parallel workers, even when accounting for the sequential parts of execution. A perf profile shows that the TID list / bitmap iteration code heavily contends on a mutex taken for every single TID / page bit (see LWLockAcquire(>lock, LW_EXCLUSIVE) in tidbitmap.c:1067).

4. The EXPLAIN ANALYZE statistics of the parallel BHS do not include the statistics of the parallel workers. For example, the number of heap pages processed is just what the leader did. Similarly to other parallel plan nodes, we should aggregate statistics across workers.

The EXPLAIN ANALYZE output below shows (1) to (3) happening in action for different numbers of workers. I had to obfuscate the query slightly. The difference between the startup time of the BHS and the BIS is the time it takes to sort the TID list. The self time of the BHS is just the time spent on processing the shared TID list and processing the pages. That part runs in parallel but does not scale.

Workers | Total runtime | Startup time BIS | Startup time BHS | Self time BHS (excl. sorting)
--------|---------------|------------------|------------------|------------------------------
2       | 15322 ms      | 3107 ms          | 5912 ms          | 9269 ms
4       | 13277 ms      | 3094 ms          | 5869 ms          | 7260 ms
8       | 14628 ms      | 3106 ms          | 5882 ms          | 8598 ms

None of this is really new and some of it is even documented. So, what I am more wondering about is why things are the way they are and how hard it would be to change them. I am especially curious about:

- What stops us from extending the BIS to run in parallel? Parallel Bitmap Index Scans are also supported.
- What about reducing the sort time by, e.g.:
  - dividing TIDs across workers, ending up with N parallelly sorted streams,
  - cooperatively sorting the TIDs with multiple workers using barriers for synchronization,
  - optimizing the PagetableEntry data structure for size and using a faster sorting algorithm like e.g. radix sort,
  - a combination of the first three options.
- With separate TID lists per worker process the iteration problem would be solved. Otherwise, we could:
  - optimize the iteration code and thereby minimize the duration of the critical section,
  - have worker processes acquire chunks of TIDs / page bits to reduce locking.

Is there interest in patches improving on the above mentioned shortcomings? If so, which options do you deem best?

--
David Geier
(ServiceNow)

-- 2 workers
Finalize Aggregate (actual time=15228.937..15321.356 rows=1 loops=1)
  Output: count(*)
  -> Gather (actual time=15187.942..15321.345 rows=2 loops=1)
       Output: (PARTIAL count(*))
       Workers Planned: 2
       Workers Launched: 2
       -> Partial Aggregate (actual time=15181.486..15181.488 rows=1 loops=2)
            Output: PARTIAL count(*)
            Worker 0: actual time=15181.364..15181.366 rows=1 loops=1
            Worker 1: actual time=15181.608..15181.610 rows=1 loops=1
            -> Parallel Bitmap Heap Scan on foo (actual time=5912.731..15166.992 rows=269713 loops=2)
                 Filter: ...
                 Rows Removed by Filter: 4020149
                 Worker 0: actual time=5912.498..15166.936 rows=269305 loops=1
                 Worker 1: actual time=5912.963..15167.048 rows=270121 loops=1
                 -> Bitmap Index Scan on foo_idx (actual time=3107.947..3107.948 rows=8579724 loops=1)
                      Index Cond: -
                      Worker 1: actual time=3107.947..3107.948 rows=8579724 loops=1
Planning Time: 0.167 ms
Execution Time: 15322.081 ms

-- 4 workers
Finalize Aggregate (actual time=13175.765..13276.415 rows=1 loops=1)
  Output: count(*)
  -> Gather (actual time=13137.981..13276.403 rows=4 loops=1)
       Output: (PARTIAL count(*))
       Workers Planned: 4
       Workers Launched: 4
       -> Partial Aggregate (a
Re: Lazy JIT IR code generation to increase JIT speed with partitions
Hi Alvaro,

That's a very interesting case and might indeed be fixed or at least improved by this patch. I tried to reproduce this, but at least when running a simple, serial query with increasing numbers of functions, the time spent per function is linear or even slightly sub-linear (same as Tom observed in [1]). I also couldn't reproduce the JIT runtimes you shared when running the attached catalog query. The catalog query ran serially for me with the following JIT stats:

JIT:
  Functions: 169
  Options: Inlining true, Optimization true, Expressions true, Deforming true
  Timing: Generation 12.223 ms, Inlining 17.323 ms, Optimization 388.491 ms, Emission 283.464 ms, Total 701.501 ms

Is it possible that the query ran in parallel for you? For parallel queries, every worker JITs all of the functions it uses. Even though the workers might JIT the functions in parallel, the time reported in the EXPLAIN ANALYZE output is the sum of the time spent by all workers. With this patch applied, the JIT time drops significantly, as many of the generated functions remain unused:

JIT:
  Modules: 15
  Functions: 26
  Options: Inlining true, Optimization true, Expressions true, Deforming true
  Timing: Generation 1.931 ms, Inlining 0.722 ms, Optimization 67.195 ms, Emission 70.347 ms, Total 140.195 ms

Of course, this does not prove that the nonlinearity that you observed went away. Could you share with me how you ran the query so that I can reproduce your numbers on master to then compare them with the patched version? Also, which LLVM version did you run with? I'm currently running with LLVM 13. Thanks!

--
David Geier
(ServiceNow)

On Mon, Jun 27, 2022 at 5:37 PM Alvaro Herrera wrote:
> On 2021-Jan-18, Luc Vlaming wrote:
>
> > I would like this topic to somehow progress and was wondering what other
> > benchmarks / tests would be needed to have some progress? I've so far
> > provided benchmarks for small(ish) queries and some tpch numbers, assuming
> > those would be enough.
> > Hi, some time ago I reported a case[1] where our JIT implementation does > a very poor job and perhaps the changes that you're making could explain > what is going on, and maybe even fix it: > > [1] https://postgr.es/m/20241706.wqq7xoyigwa2@alvherre.pgsql > > The query for which I investigated the problem involved some pg_logical > metadata tables, so I didn't post it anywhere public; but the blog post > I found later contains a link to a query that shows the same symptoms, > and which is luckily still available online: > https://gist.github.com/saicitus/251ba20b211e9e73285af35e61b19580 > I attach it here in case it goes missing sometime. > > -- > Álvaro Herrera 48°01'N 7°57'E — > https://www.EnterpriseDB.com/ >
Re: Lazy JIT IR code generation to increase JIT speed with partitions
les are used. However, curiously the time spent on optimizing is also reduced (95 ms instead of 164 ms). Could this be because some of the applied optimizations end up in the cached module?

With more modules the total time spent on JITting seems to increase slightly (175 ms vs 195 ms). Almost all of that time is spent on optimization and emission; the time for inlining only increases from ~0.2 ms to ~1.1 ms. With improved optimization settings, the final time spent is pretty much on par with the single-module variant (175 ms vs 181 ms). So overall it looks like using many modules would have competitive performance for the worst case (all functions used), but much better performance for cases where a significant amount of the functions remain unused.

@Andres: could you provide me with the queries that caused the assertion failure in LLVM? Have you ever observed a segfault with a non-assert-enabled build? I just want to make sure this is truly fixed in LLVM 13. Running 'make check-world', all tests passed.

The attached patch currently does not yet have a case distinction for LLVM 13, but that would be straightforward to add. Thanks for your consideration.

--
David Geier
(ServiceNow)

[1] https://lists.llvm.org/pipermail/llvm-dev/2018-March/122111.html
[2] https://reviews.llvm.org/D96531
[3] https://reviews.llvm.org/rG22a52dfddcefad4f275eb8ad1cc0e200074c2d8a

On Fri, Jun 24, 2022 at 2:13 PM Luc Vlaming wrote:
> On 18-01-2021 08:47, Luc Vlaming wrote:
> > Hi everyone, Andres,
> >
> > On 03-01-2021 11:05, Luc Vlaming wrote:
> >> On 30-12-2020 14:23, Luc Vlaming wrote:
> >>> On 30-12-2020 02:57, Andres Freund wrote:
> >>>> Hi,
> >>>>
> >>>> Great to see work in this area!
> >
> > I would like this topic to somehow progress and was wondering what other
> > benchmarks / tests would be needed to have some progress? I've so far
> > provided benchmarks for small(ish) queries and some tpch numbers,
> > assuming those would be enough.
> > > >>>> > >>>> On 2020-12-28 09:44:26 +0100, Luc Vlaming wrote: > >>>>> I would like to propose a small patch to the JIT machinery which > >>>>> makes the > >>>>> IR code generation lazy. The reason for postponing the generation > >>>>> of the IR > >>>>> code is that with partitions we get an explosion in the number of JIT > >>>>> functions generated as many child tables are involved, each with > >>>>> their own > >>>>> JITted functions, especially when e.g. partition-aware > >>>>> joins/aggregates are > >>>>> enabled. However, only a fraction of those functions is actually > >>>>> executed > >>>>> because the Parallel Append node distributes the workers among the > >>>>> nodes. > >>>>> With the attached patch we get a lazy generation which makes that > >>>>> this is no > >>>>> longer a problem. > >>>> > >>>> I unfortunately don't think this is quite good enough, because it'll > >>>> lead to emitting all functions separately, which can also lead to very > >>>> substantial increases of the required time (as emitting code is an > >>>> expensive step). Obviously that is only relevant in the cases where > the > >>>> generated functions actually end up being used - which isn't the > >>>> case in > >>>> your example. > >>>> > >>>> If you e.g. look at a query like > >>>>SELECT blub, count(*),sum(zap) FROM foo WHERE blarg = 3 GROUP BY > >>>> blub; > >>>> on a table without indexes, you would end up with functions for > >>>> > >>>> - WHERE clause (including deforming) > >>>> - projection (including deforming) > >>>> - grouping key > >>>> - aggregate transition > >>>> - aggregate result projection > >>>> > >>>> with your patch each of these would be emitted separately, instead of > >>>> one go. Which IIRC increases the required time by a significant > amount, > >>>> especially if inlining is done (where each separate code generation > >>>> ends > >>>> up with copies of the inlined code). 
> >>>> > >>>> As far as I can see you've basically falsified the second part of this > >>>> comment (which you moved): > >>>> > >>>>> + > >>>>> +/* > >>>>>
Re: Assertion failure with barriers in parallel hash join
Hi Thomas, Correct. We're running with disabled parallel leader participation and we have to do so, because another custom plan node we built relies on that. That would be great. Anything else I can help with to get this patch in? Thanks! -- David (ServiceNow) On Fri, 3 Jun 2022 at 00:06, Thomas Munro wrote: > On Thu, Jun 2, 2022 at 9:31 PM David Geier wrote: > > We recently encountered the same bug in the field. Oleksii Kozlov > managed to come up with reproduction steps, which reliably trigger it. > Interestingly, the bug does not only manifest as failing assertion, but > also as segmentation fault; in builds with disabled and with enabled (!) > assertions. So it can crash production environments. We applied the > proposed patch v3 from Melanie to the REL_14_3 branch and can confirm that > with the patch neither the assertion nor the segmentation fault still occur. > > Thanks for the report, testing, and for creating the CF entry. > > I assume you are using parallel_leader_participation=off, and the > reason we haven't heard more about this is because few people do that. > By coincidence I was just about to restart a bunch of hash join > projects and have been paging the topic area back into my brain, so > I'll start with another round of testing/analysis of this bug/patch > next week. >
Re: Assertion failure with barriers in parallel hash join
The following review has been posted through the commitfest application:
make installcheck-world: tested, passed
Implements feature: tested, passed
Spec compliant: tested, passed
Documentation: not tested

Hi all, We recently encountered the same bug in the field. Oleksii Kozlov managed to come up with reproduction steps which reliably trigger it. Interestingly, the bug does not only manifest as a failing assertion, but also as a segmentation fault; in builds with disabled and with enabled (!) assertions. So it can crash production environments. We applied the proposed patch v3 from Melanie to the REL_14_3 branch and can confirm that with the patch neither the assertion failure nor the segmentation fault occurs anymore. I have also glanced at the code and the implementation looks fine. However, I'm not an expert on the fairly involved hash join state machine. There seems to be no need for additional documentation. For completeness, here is the stack trace of the segmentation fault. Like the stack trace from the assertion failure initially shared by Michael and also encountered by us, the stack trace of the segmentation fault also contains ExecParallelHashJoinNewBatch(). 
#9 | Source "/opt/src/backend/executor/execMain.c", line 361, in standard_ExecutorRun
   | Source "/opt/src/backend/executor/execMain.c", line 1551, in ExecutePlan
     Source "/opt/src/include/executor/executor.h", line 257, in ExecProcNode [0x657e4d]
#8 | Source "/opt/src/backend/executor/nodeAgg.c", line 2179, in ExecAgg
     Source "/opt/src/backend/executor/nodeAgg.c", line 2364, in agg_retrieve_direct [0x66ba60]
#7 | Source "/opt/src/backend/executor/nodeAgg.c", line 581, in fetch_input_tuple
     Source "/opt/src/include/executor/executor.h", line 257, in ExecProcNode [0x66d585]
#6 | Source "/opt/src/backend/executor/nodeHashjoin.c", line 607, in ExecParallelHashJoin
   | Source "/opt/src/backend/executor/nodeHashjoin.c", line 560, in ExecHashJoinImpl
     Source "/opt/src/backend/executor/nodeHashjoin.c", line 1132, in ExecParallelHashJoinNewBatch [0x67a89b]
#5 | Source "/opt/src/backend/storage/ipc/barrier.c", line 242, in BarrierAttach
     Source "/opt/src/include/storage/s_lock.h", line 228, in tas [0x7c2a1b]
#4   Object "/lib/x86_64-linux-gnu/libpthread.so.0", at 0x7f4db364841f, in __funlockfile

-- David Geier (ServiceNow)

The new status of this patch is: Ready for Committer
Re: WIP Patch: Precalculate stable functions, infrastructure v1
Hi hackers! I would like to revive this thread. At ServiceNow we recurrently encounter queries that are much slower than they would have to be, because of frequent calls to uncached stable functions with constant arguments (mostly to_date()). We've seen e.g. queries that get more than 8x faster by temporarily changing to_date() from stable to immutable. I would be glad to help bring this effort forward. Was there more work left on the patch than rebasing it on latest master? @Marina: do you have any plans to continue with this? For reference, here are all existing mailing list discussions I could find on this topic:
- [WIP] Caching constant stable expressions per execution (Marti, 2011), https://www.postgresql.org/message-id/flat/CABRT9RC-1wGxZC_Z5mwkdk70fgY2DRX3sLXzdP4voBKuKPZDow%40mail.gmail.com
- Caching for stable expressions with constant arguments v6 (Marti, 2012), https://www.postgresql.org/message-id/flat/CABRT9RA-RomVS-yzQ2wUtZ=m-ev61lcbrl1p1j3jydpsttf...@mail.gmail.com
- WIP Patch: Precalculate stable functions (Marina, 2017), https://www.postgresql.org/message-id/flat/ba261b9fc25dea4069d8ba9a8fcad...@postgrespro.ru
- WIP Patch: Precalculate stable functions, infrastructure v1 (Marina, 2017), https://www.postgresql.org/message-id/flat/da87bb6a014e029176a04f6e50033cfb%40postgrespro.ru
-- David Geier (ServiceNow) On Mon, 23 May 2022 at 17:06, Andres Freund wrote: > On 2018-11-29 18:00:15 +0100, Dmitry Dolgov wrote: > > > On Tue, Oct 2, 2018 at 4:22 AM Michael Paquier > wrote: > > > > > > On Thu, May 24, 2018 at 04:00:33PM +0300, Marina Polyakova wrote: > > > > Here there's a 9-th version of the patches for the precalculation of > stable > > > > or immutable functions, stable or immutable operators and other > nonvolatile > > > > expressions. This is a try to execute cached expressions as > PARAM_EXEC, > > > > thanks to the comments of Tom Lane and Andres Freund [1]. > > > > > > Please note that v9-0004 fails to apply, so a rebase is needed. 
This > > > patch is moved to next CF, waiting on author. > > > > Unfortunately, patch still has some conflicts, could you please post an > updated > > version? > > As nothing has happened since, I'm marking this as returned with > feedback. > > Greetings, > > Andres Freund > > > >
Re: search_plan_tree(): handling of non-leaf CustomScanState nodes causes segfault
Hi, On 18.01.21 23:42, Tom Lane wrote: David Geier writes: On 18.01.21 19:46, Tom Lane wrote: Hm. I agree that we shouldn't simply assume that ss_currentRelation isn't null. However, we cannot make search_plan_tree() descend through non-leaf CustomScan nodes, because we don't know what processing is involved there. We need to find a scan that is guaranteed to return rows that are one-to-one with the cursor output. This is why the function doesn't descend through join or aggregation nodes, and I see no argument by which we should assume we know more about what a customscan node will do than we know about those. That makes sense. Thanks for the explanation. OK, cool. I was afraid you'd argue that you really needed your CustomScan node to be transparent in such cases. We could imagine inventing an additional custom-scan-provider callback to embed the necessary knowledge, but I'd rather not add the complexity until someone has a use-case. I was thinking about that. Generally, having such a possibility would be very useful. Unfortunately, I believe that in my specific case even having such functionality wouldn't allow the query to work with CURRENT OF, because my CSP behaves similarly to a materialize node. My understanding is that only such plans are supported which consist of nodes guaranteeing that the tuple returned by the plan is the unmodified tuple a leaf scan node is currently positioned on? Still, if there's interest I would be happy to draft a patch. Instead of a separate CSP callback, we could also provide an additional flag like CUSTOMPATH_SUPPORT_CURRENT_OF. The advantage of the callback would be that we could delay the decision until execution time, where potentially more information is available. I updated the patch to match your proposal. WFM, will push in a bit. regards, tom lane Best regards, David Swarm64
Re: search_plan_tree(): handling of non-leaf CustomScanState nodes causes segfault
Hi, On 18.01.21 19:46, Tom Lane wrote: David Geier writes: search_plan_tree() assumes that CustomScanState::ScanState::ss_currentRelation is never NULL. In my understanding that only holds for CustomScanState nodes which are at the bottom of the plan and actually read from a relation. CustomScanState nodes which are not at the bottom don't have ss_currentRelation set. I believe for such nodes, instead search_plan_tree() should recurse into CustomScanState::custom_ps. Hm. I agree that we shouldn't simply assume that ss_currentRelation isn't null. However, we cannot make search_plan_tree() descend through non-leaf CustomScan nodes, because we don't know what processing is involved there. We need to find a scan that is guaranteed to return rows that are one-to-one with the cursor output. This is why the function doesn't descend through join or aggregation nodes, and I see no argument by which we should assume we know more about what a customscan node will do than we know about those. That makes sense. Thanks for the explanation. So I'm inclined to think a suitable fix is just - if (RelationGetRelid(sstate->ss_currentRelation) == table_oid) + if (sstate->ss_currentRelation && + RelationGetRelid(sstate->ss_currentRelation) == table_oid) result = sstate; regards, tom lane I updated the patch to match your proposal. Best regards, David Swarm64 diff --git a/src/backend/executor/execCurrent.c b/src/backend/executor/execCurrent.c index c7f909241b..fb3cf9537c 100644 --- a/src/backend/executor/execCurrent.c +++ b/src/backend/executor/execCurrent.c @@ -330,7 +330,8 @@ search_plan_tree(PlanState *node, Oid table_oid, { ScanState *sstate = (ScanState *) node; - if (RelationGetRelid(sstate->ss_currentRelation) == table_oid) + if (sstate->ss_currentRelation && + RelationGetRelid(sstate->ss_currentRelation) == table_oid) result = sstate; break; }
search_plan_tree(): handling of non-leaf CustomScanState nodes causes segfault
Hi hackers, While working with cursors that reference plans with CustomScanState nodes, I encountered a segfault which originates from search_plan_tree(). The query plan is the result of a simple SELECT statement into which I inject a Custom Scan node at the root to do some post-processing before returning rows. This plan is referenced by a second plan with a Tid Scan which originates from a query of the form DELETE FROM foo WHERE CURRENT OF my_cursor; search_plan_tree() assumes that CustomScanState::ScanState::ss_currentRelation is never NULL. In my understanding that only holds for CustomScanState nodes which are at the bottom of the plan and actually read from a relation. CustomScanState nodes which are not at the bottom don't have ss_currentRelation set. I believe that for such nodes search_plan_tree() should instead recurse into CustomScanState::custom_ps. I attached a patch. Any thoughts? Best regards, David Swarm64

diff --git a/src/backend/executor/execCurrent.c b/src/backend/executor/execCurrent.c
index f89319fcd8..0d5f09402b 100644
--- a/src/backend/executor/execCurrent.c
+++ b/src/backend/executor/execCurrent.c
@@ -326,7 +326,6 @@ search_plan_tree(PlanState *node, Oid table_oid,
 		case T_BitmapHeapScanState:
 		case T_TidScanState:
 		case T_ForeignScanState:
-		case T_CustomScanState:
 			{
 				ScanState  *sstate = (ScanState *) node;
@@ -335,6 +334,39 @@ search_plan_tree(PlanState *node, Oid table_oid,
 				break;
 			}
+
+			/*
+			 * Custom scan nodes can be leaf nodes or inner nodes and
+			 * therefore need special treatment.
+			 */
+		case T_CustomScanState:
+			{
+				CustomScanState *css = castNode(CustomScanState, node);
+				ScanState  *sstate = (ScanState *) node;
+
+				if (sstate->ss_currentRelation == NULL) /* inner node */
+				{
+					ListCell   *lc;
+
+					foreach(lc, css->custom_ps)
+					{
+						ScanState  *elem = search_plan_tree((PlanState *) lfirst(lc),
+															table_oid,
+															pending_rescan);
+
+						if (!elem)
+							continue;
+						if (result)
+							return NULL;	/* multiple matches */
+						result = elem;
+					}
+				}
+				else			/* leaf node */
+				{
+					if (RelationGetRelid(sstate->ss_currentRelation) == table_oid)
+						result = sstate;
+				}
+
+				break;
+			}
+
 			/*
 			 * For Append, we must look through the members; watch out for
 			 * multiple matches (possible if it was from UNION ALL)