Re: BitmapHeapScan streaming read user and prelim refactoring
On 2024-May-14, Melanie Plageman wrote:

> On Tue, May 14, 2024 at 4:05 PM Alvaro Herrera wrote:
> > I do wonder how do we _know_ that the test is testing what it wants to
> > test:
>
> We include the explain output (the plan) to ensure it is still using a
> bitmap heap scan. The test exercises the skip fetch optimization in
> bitmap heap scan when not all of the inner tuples are emitted.

I meant -- the query returns an empty resultset, so how do we know it's
the empty resultset that we want and not a different empty resultset
that happens to be identical? (This is probably not a critical point
anyhow.)

> Without the patch, the test fails, so it is protection against someone
> adding back that assert in the future. It is not protection against
> someone deleting the line
>     scan->rs_empty_tuples_pending = 0
> That is, it doesn't verify that the empty unused tuples count is
> discarded. Do you think that is necessary?

I don't think that's absolutely necessary. I suspect there are thousands
of lines that you could delete that would break things, with no test
directly raising red flags.

At this point I would go with the RMT's recommendation of pushing this
now to make sure the bug is fixed for beta1, even if the test is
imperfect. You can always polish the test afterwards if you feel like
it ...

--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
"No necesitamos banderas
 No reconocemos fronteras" (Jorge González)
Re: BitmapHeapScan streaming read user and prelim refactoring
Procedural comment: It's better to get this patch committed with an
imperfect test case than to have it miss beta1.

...Robert
Re: BitmapHeapScan streaming read user and prelim refactoring
On Tue, May 14, 2024 at 4:05 PM Alvaro Herrera wrote:
>
> On 2024-May-14, Tomas Vondra wrote:
>
> I wonder why it resets enable_indexscan at all. I see that this query
> first tries a seqscan, then if you disable that it tries an index only
> scan, and if you disable that you get the expected bitmap indexscan.
> But an indexscan doesn't seem to be in the cards.

Ah, yes. That is true. I think I added that when, in an older version of
the test, I had a query that did try an index scan before bitmap heap
scan. I've removed that guc from the attached v7.

> > IMHO if the test requires a specific plan, it's better to do an actual
> > "explain (rows off, costs off)" to check that.
>
> That's already in the patch, right?

Yep.

> I do wonder how do we _know_ that the test is testing what it wants to
> test:
>
>                       QUERY PLAN
> ─────────────────────────────────────────────────────
>  Nested Loop Anti Join
>    ->  Seq Scan on skip_fetch t1
>    ->  Materialize
>          ->  Bitmap Heap Scan on skip_fetch t2
>                Recheck Cond: (a = 1)
>                ->  Bitmap Index Scan on skip_fetch_a_idx
>                      Index Cond: (a = 1)
>
> Is it because of the shape of the index condition? Maybe it's worth
> explaining in the comments for the tests.

There is a comment in the test that explains what it is exercising and
how. We include the explain output (the plan) to ensure it is still
using a bitmap heap scan. The test exercises the skip fetch optimization
in bitmap heap scan when not all of the inner tuples are emitted.

Without the patch, the test fails, so it is protection against someone
adding back that assert in the future. It is not protection against
someone deleting the line

    scan->rs_empty_tuples_pending = 0

That is, it doesn't verify that the empty unused tuples count is
discarded. Do you think that is necessary?
- Melanie

From 031012be9d77d5f5888b306d546322e6176e527e Mon Sep 17 00:00:00 2001
From: Melanie Plageman
Date: Tue, 14 May 2024 13:36:42 -0400
Subject: [PATCH v7] BitmapHeapScan: Remove incorrect assert and reset field

04e72ed617be pushed the skip fetch optimization (allowing bitmap heap
scans to operate like index-only scans if none of the underlying data is
needed) into heap AM implementations of bitmap table scan callbacks.

04e72ed617be added an assert that all tuples in blocks eligible for the
optimization had been NULL-filled and emitted by the end of the scan.
This assert is incorrect when not all tuples need be scanned to execute
the query; for example: a join in which not all inner tuples need to be
scanned before skipping to the next outer tuple.

Remove the assert and reset the field on which it previously asserted to
avoid incorrectly emitting NULL-filled tuples from a previous scan on
rescan.

Author: Melanie Plageman
Reviewed-by: Tomas Vondra, Michael Paquier, Alvaro Herrera
Reported-by: Melanie Plageman
Reproduced-by: Tomas Vondra, Richard Guo
Discussion: https://postgr.es/m/CAMbWs48orzZVXa7-vP9Nt7vQWLTE04Qy4PePaLQYsVNQgo6qRg%40mail.gmail.com
---
 src/backend/access/heap/heapam.c   |  9 +---
 src/test/regress/expected/join.out | 36 ++
 src/test/regress/sql/join.sql      | 24
 3 files changed, 66 insertions(+), 3 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 4be0dee4de..82bb9cb33b 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1184,7 +1184,12 @@ heap_rescan(TableScanDesc sscan, ScanKey key, bool set_params,
 		scan->rs_vmbuffer = InvalidBuffer;
 	}

-	Assert(scan->rs_empty_tuples_pending == 0);
+	/*
+	 * Reset rs_empty_tuples_pending, a field only used by bitmap heap scan,
+	 * to avoid incorrectly emitting NULL-filled tuples from a previous scan
+	 * on rescan.
+	 */
+	scan->rs_empty_tuples_pending = 0;

 	/*
 	 * The read stream is reset on rescan. This must be done before
@@ -1216,8 +1221,6 @@ heap_endscan(TableScanDesc sscan)
 	if (BufferIsValid(scan->rs_vmbuffer))
 		ReleaseBuffer(scan->rs_vmbuffer);

-	Assert(scan->rs_empty_tuples_pending == 0);
-
 	/*
 	 * Must free the read stream before freeing the BufferAccessStrategy.
 	 */
diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index 0246d56aea..6b16c3a676 100644
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -7924,3 +7924,39 @@ where exists (select 1 from j3
 (13 rows)

 drop table j3;
+-- Exercise the "skip fetch" Bitmap Heap Scan optimization when candidate
+-- tuples are discarded. This may occur when:
+-- 1. A join doesn't require all inner tuples to be scanned for each outer
+--    tuple, and
+-- 2. The inner side is scanned using a bitmap heap scan, and
+-- 3. The bitmap heap scan is eligible for the "skip fetch" optimization.
+-- This optimization is usable when no data from the underlying table is
+-- needed. Use a temp table so it is
Re: BitmapHeapScan streaming read user and prelim refactoring
On 2024-May-14, Alvaro Herrera wrote:

> BTW, I was running the explain while desultorily enabling and disabling
> these GUCs and hit this assertion failure:
>
> #4  0x55e6c72afe28 in ExceptionalCondition (
>         conditionName=conditionName@entry=0x55e6c731a928 "scan->rs_empty_tuples_pending == 0",
>         fileName=fileName@entry=0x55e6c731a3b0 "../../../../../../../../../pgsql/source/master/src/backend/access/heap/heapam.c",
>         lineNumber=lineNumber@entry=1219)
>     at ../../../../../../../../../pgsql/source/master/src/backend/utils/error/assert.c:66

Ah, I see now that this is precisely the assertion that this patch
removes. Nevermind ...

--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
"This is what I like so much about PostgreSQL. Most of the surprises
are of the "oh wow! That's cool" Not the "oh shit!" kind. :)"
Scott Marlowe, http://archives.postgresql.org/pgsql-admin/2008-10/msg00152.php
Re: BitmapHeapScan streaming read user and prelim refactoring
On 2024-May-14, Tomas Vondra wrote:

> On 5/14/24 19:42, Melanie Plageman wrote:
>
> >>> +SET enable_indexonlyscan = off;
> >>> +set enable_indexscan = off;
> >>> +SET enable_seqscan = off;
> >>
> >> Nit: adjusting the casing of the second SET here.
> >
> > I've fixed this. I've also set enable_material off as I mentioned I
> > might in my earlier mail.
>
> I'm not sure this (setting more and more GUCs to prevent hypothetical
> plan changes) is a good practice. Because how do you know the plan does
> not change for some other unexpected reason, possibly in the future?

I wonder why it resets enable_indexscan at all. I see that this query
first tries a seqscan, then if you disable that it tries an index only
scan, and if you disable that you get the expected bitmap indexscan.
But an indexscan doesn't seem to be in the cards.

> IMHO if the test requires a specific plan, it's better to do an actual
> "explain (rows off, costs off)" to check that.

That's already in the patch, right?

I do wonder how do we _know_ that the test is testing what it wants to
test:

                      QUERY PLAN
─────────────────────────────────────────────────────
 Nested Loop Anti Join
   ->  Seq Scan on skip_fetch t1
   ->  Materialize
         ->  Bitmap Heap Scan on skip_fetch t2
               Recheck Cond: (a = 1)
               ->  Bitmap Index Scan on skip_fetch_a_idx
                     Index Cond: (a = 1)

Is it because of the shape of the index condition? Maybe it's worth
explaining in the comments for the tests.
BTW, I was running the explain while desultorily enabling and disabling
these GUCs and hit this assertion failure:

#4  0x55e6c72afe28 in ExceptionalCondition (
        conditionName=conditionName@entry=0x55e6c731a928 "scan->rs_empty_tuples_pending == 0",
        fileName=fileName@entry=0x55e6c731a3b0 "../../../../../../../../../pgsql/source/master/src/backend/access/heap/heapam.c",
        lineNumber=lineNumber@entry=1219)
    at ../../../../../../../../../pgsql/source/master/src/backend/utils/error/assert.c:66
#5  0x55e6c6e2e0c7 in heap_endscan (sscan=0x55e6c7b63e28)
    at ../../../../../../../../../pgsql/source/master/src/backend/access/heap/heapam.c:1219
#6  0x55e6c6fb35a7 in ExecEndPlan (estate=0x55e6c7a7e9d0, planstate=)
    at ../../../../../../../../pgsql/source/master/src/backend/executor/execMain.c:1485
#7  standard_ExecutorEnd (queryDesc=0x55e6c7a736b8)
    at ../../../../../../../../pgsql/source/master/src/backend/executor/execMain.c:501
#8  0x55e6c6f4d9aa in ExplainOnePlan (plannedstmt=plannedstmt@entry=0x55e6c7a735a8,
        into=into@entry=0x0, es=es@entry=0x55e6c7a448b8,
        queryString=queryString@entry=0x55e6c796c210
        "EXPLAIN (analyze, verbose, COSTS OFF) SELECT t1.a FROM skip_fetch t1 LEFT JOIN skip_fetch t2 ON t2.a = 1 WHERE t2.a IS NULL;",
        params=params@entry=0x0, queryEnv=queryEnv@entry=0x0,
        planduration=0x7ffe8a291848, bufusage=0x0, mem_counters=0x0)
    at ../../../../../../../../pgsql/source/master/src/backend/commands/explain.c:770
#9  0x55e6c6f4e257 in standard_ExplainOneQuery (query=, cursorOptions=2048,
        into=0x0, es=0x55e6c7a448b8,
        queryString=0x55e6c796c210
        "EXPLAIN (analyze, verbose, COSTS OFF) SELECT t1.a FROM skip_fetch t1 LEFT JOIN skip_fetch t2 ON t2.a = 1 WHERE t2.a IS NULL;",
        params=0x0, queryEnv=0x0)
    at ../../../../../../../../pgsql/source/master/src/backend/commands/explain.c:502

I couldn't reproduce it again, though -- and for sure I don't know what
it means. All three GUCs are set false in the core.

--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
"Here's a general engineering tip: if the non-fun part is too complex
for you to figure out, that might indicate the fun part is too
ambitious." (John Naylor)
https://postgr.es/m/CAFBsxsG4OWHBbSDM%3DsSeXrQGOtkPiOEOuME4yD7Ce41NtaAD9g%40mail.gmail.com
Re: BitmapHeapScan streaming read user and prelim refactoring
On Tue, May 14, 2024 at 2:44 PM Tomas Vondra wrote:
>
> On 5/14/24 20:40, Melanie Plageman wrote:
> > On Tue, May 14, 2024 at 2:33 PM Tomas Vondra wrote:
> >>
> >> On 5/14/24 19:42, Melanie Plageman wrote:
> >>> I've fixed this. I've also set enable_material off as I mentioned I
> >>> might in my earlier mail.
> >>
> >> I'm not sure this (setting more and more GUCs to prevent hypothetical
> >> plan changes) is a good practice. Because how do you know the plan does
> >> not change for some other unexpected reason, possibly in the future?
> >
> > Sure. So, you think it is better not to have enable_material = false?
>
> Right. Unless it's actually needed to force the necessary plan.

Attached v6 does not use enable_material = false (as it is not needed).

- Melanie

From 6dc674fff206dc62c64d348ff0042b1d20798511 Mon Sep 17 00:00:00 2001
From: Melanie Plageman
Date: Tue, 14 May 2024 13:36:42 -0400
Subject: [PATCH v6] BitmapHeapScan: Remove incorrect assert and reset field

04e72ed617be pushed the skip fetch optimization (allowing bitmap heap
scans to operate like index-only scans if none of the underlying data is
needed) into heap AM implementations of bitmap table scan callbacks.

04e72ed617be added an assert that all tuples in blocks eligible for the
optimization had been NULL-filled and emitted by the end of the scan.
This assert is incorrect when not all tuples need be scanned to execute
the query; for example: a join in which not all inner tuples need to be
scanned before skipping to the next outer tuple.

Remove the assert and reset the field on which it previously asserted to
avoid incorrectly emitting NULL-filled tuples from a previous scan on
rescan.

Author: Melanie Plageman
Reviewed-by: Tomas Vondra, Michael Paquier
Reported-by: Melanie Plageman
Reproduced-by: Tomas Vondra, Richard Guo
Discussion: https://postgr.es/m/CAMbWs48orzZVXa7-vP9Nt7vQWLTE04Qy4PePaLQYsVNQgo6qRg%40mail.gmail.com
---
 src/backend/access/heap/heapam.c   |  9 ---
 src/test/regress/expected/join.out | 38 ++
 src/test/regress/sql/join.sql      | 26
 3 files changed, 70 insertions(+), 3 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 4be0dee4de..82bb9cb33b 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1184,7 +1184,12 @@ heap_rescan(TableScanDesc sscan, ScanKey key, bool set_params,
 		scan->rs_vmbuffer = InvalidBuffer;
 	}

-	Assert(scan->rs_empty_tuples_pending == 0);
+	/*
+	 * Reset rs_empty_tuples_pending, a field only used by bitmap heap scan,
+	 * to avoid incorrectly emitting NULL-filled tuples from a previous scan
+	 * on rescan.
+	 */
+	scan->rs_empty_tuples_pending = 0;

 	/*
 	 * The read stream is reset on rescan. This must be done before
@@ -1216,8 +1221,6 @@ heap_endscan(TableScanDesc sscan)
 	if (BufferIsValid(scan->rs_vmbuffer))
 		ReleaseBuffer(scan->rs_vmbuffer);

-	Assert(scan->rs_empty_tuples_pending == 0);
-
 	/*
 	 * Must free the read stream before freeing the BufferAccessStrategy.
 	 */
diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index 0246d56aea..df358f7605 100644
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -7924,3 +7924,41 @@ where exists (select 1 from j3
 (13 rows)

 drop table j3;
+-- Exercise the "skip fetch" Bitmap Heap Scan optimization when candidate
+-- tuples are discarded. This may occur when:
+-- 1. A join doesn't require all inner tuples to be scanned for each outer
+--    tuple, and
+-- 2. The inner side is scanned using a bitmap heap scan, and
+-- 3. The bitmap heap scan is eligible for the "skip fetch" optimization.
+-- This optimization is usable when no data from the underlying table is
+-- needed. Use a temp table so it is only visible to this backend and
+-- vacuum may reliably mark all blocks in the table all visible in the
+-- visibility map.
+CREATE TEMP TABLE skip_fetch (a INT, b INT) WITH (fillfactor=10);
+INSERT INTO skip_fetch SELECT i % 3, i FROM generate_series(0,30) i;
+CREATE INDEX ON skip_fetch(a);
+VACUUM (ANALYZE) skip_fetch;
+SET enable_indexonlyscan = off;
+SET enable_indexscan = off;
+SET enable_seqscan = off;
+EXPLAIN (COSTS OFF)
+SELECT t1.a FROM skip_fetch t1 LEFT JOIN skip_fetch t2 ON t2.a = 1 WHERE t2.a IS NULL;
+                      QUERY PLAN
+-------------------------------------------------------
+ Nested Loop Anti Join
+   ->  Seq Scan on skip_fetch t1
+   ->  Materialize
+         ->  Bitmap Heap Scan on skip_fetch t2
+               Recheck Cond: (a = 1)
+               ->  Bitmap Index Scan on skip_fetch_a_idx
+                     Index Cond: (a = 1)
+(7 rows)
+
+SELECT t1.a FROM skip_fetch t1 LEFT JOIN skip_fetch t2 ON t2.a = 1 WHERE t2.a IS NULL;
+ a
+---
+(0 rows)
+
+RESET enable_indexonlyscan;
+RESET enable_indexscan;
+RESET enable_seqscan;
diff --git
Re: BitmapHeapScan streaming read user and prelim refactoring
On 5/14/24 20:40, Melanie Plageman wrote:
> On Tue, May 14, 2024 at 2:33 PM Tomas Vondra wrote:
>>
>> On 5/14/24 19:42, Melanie Plageman wrote:
>>> I've fixed this. I've also set enable_material off as I mentioned I
>>> might in my earlier mail.
>>
>> I'm not sure this (setting more and more GUCs to prevent hypothetical
>> plan changes) is a good practice. Because how do you know the plan does
>> not change for some other unexpected reason, possibly in the future?
>
> Sure. So, you think it is better not to have enable_material = false?

Right. Unless it's actually needed to force the necessary plan.

>> IMHO if the test requires a specific plan, it's better to do an actual
>> "explain (rows off, costs off)" to check that.
>
> When you say "rows off", do you mean do something to ensure that it
> doesn't return tuples? Because I don't see a ROWS option for EXPLAIN
> in the docs.

Sorry, I meant to hide the cardinality estimates in the explain, but I
got confused. "COSTS OFF" is enough.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: BitmapHeapScan streaming read user and prelim refactoring
On Tue, May 14, 2024 at 2:33 PM Tomas Vondra wrote:
>
> On 5/14/24 19:42, Melanie Plageman wrote:
> > I've fixed this. I've also set enable_material off as I mentioned I
> > might in my earlier mail.
>
> I'm not sure this (setting more and more GUCs to prevent hypothetical
> plan changes) is a good practice. Because how do you know the plan does
> not change for some other unexpected reason, possibly in the future?

Sure. So, you think it is better not to have enable_material = false?

> IMHO if the test requires a specific plan, it's better to do an actual
> "explain (rows off, costs off)" to check that.

When you say "rows off", do you mean do something to ensure that it
doesn't return tuples? Because I don't see a ROWS option for EXPLAIN in
the docs.

- Melanie
Re: BitmapHeapScan streaming read user and prelim refactoring
On 5/14/24 19:42, Melanie Plageman wrote:
> On Tue, May 14, 2024 at 2:18 AM Michael Paquier wrote:
>>
>> On Mon, May 13, 2024 at 10:05:03AM -0400, Melanie Plageman wrote:
>>> Remove the assert and reset the field on which it previously asserted to
>>> avoid incorrectly emitting NULL-filled tuples from a previous scan on
>>> rescan.
>>
>>> -	Assert(scan->rs_empty_tuples_pending == 0);
>>> +	scan->rs_empty_tuples_pending = 0;
>>
>> Perhaps this should document the reason why the reset is done in these
>> two paths rather than let the reader guess it? And this is about
>> avoiding emitting some tuples from a previous scan.
>
> I've added a comment to heap_rescan() in the attached v5. Doing so
> made me realize that we shouldn't bother resetting it in
> heap_endscan(). Doing so is perhaps more confusing, because it implies
> that field may somehow be used later. I've removed the reset of
> rs_empty_tuples_pending from heap_endscan().

+1

>>> +SET enable_indexonlyscan = off;
>>> +set enable_indexscan = off;
>>> +SET enable_seqscan = off;
>>
>> Nit: adjusting the casing of the second SET here.
>
> I've fixed this. I've also set enable_material off as I mentioned I
> might in my earlier mail.

I'm not sure this (setting more and more GUCs to prevent hypothetical
plan changes) is a good practice. Because how do you know the plan does
not change for some other unexpected reason, possibly in the future?

IMHO if the test requires a specific plan, it's better to do an actual
"explain (rows off, costs off)" to check that.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: BitmapHeapScan streaming read user and prelim refactoring
On Tue, May 14, 2024 at 2:18 AM Michael Paquier wrote:
>
> On Mon, May 13, 2024 at 10:05:03AM -0400, Melanie Plageman wrote:
> > Remove the assert and reset the field on which it previously asserted to
> > avoid incorrectly emitting NULL-filled tuples from a previous scan on
> > rescan.
>
> > -	Assert(scan->rs_empty_tuples_pending == 0);
> > +	scan->rs_empty_tuples_pending = 0;
>
> Perhaps this should document the reason why the reset is done in these
> two paths rather than let the reader guess it? And this is about
> avoiding emitting some tuples from a previous scan.

I've added a comment to heap_rescan() in the attached v5. Doing so made
me realize that we shouldn't bother resetting it in heap_endscan().
Doing so is perhaps more confusing, because it implies that field may
somehow be used later. I've removed the reset of rs_empty_tuples_pending
from heap_endscan().

> > +SET enable_indexonlyscan = off;
> > +set enable_indexscan = off;
> > +SET enable_seqscan = off;
>
> Nit: adjusting the casing of the second SET here.

I've fixed this. I've also set enable_material off as I mentioned I
might in my earlier mail.

- Melanie

From d44679397e3aaa043afad2108ec904f68b7052b3 Mon Sep 17 00:00:00 2001
From: Melanie Plageman
Date: Tue, 14 May 2024 13:36:42 -0400
Subject: [PATCH v5] BitmapHeapScan: Remove incorrect assert and reset field

04e72ed617be pushed the skip fetch optimization (allowing bitmap heap
scans to operate like index-only scans if none of the underlying data is
needed) into heap AM implementations of bitmap table scan callbacks.

04e72ed617be added an assert that all tuples in blocks eligible for the
optimization had been NULL-filled and emitted by the end of the scan.
This assert is incorrect when not all tuples need be scanned to execute
the query; for example: a join in which not all inner tuples need to be
scanned before skipping to the next outer tuple.

Remove the assert and reset the field on which it previously asserted to
avoid incorrectly emitting NULL-filled tuples from a previous scan on
rescan.

Author: Melanie Plageman
Reviewed-by: Tomas Vondra, Michael Paquier
Reported-by: Melanie Plageman
Reproduced-by: Tomas Vondra, Richard Guo
Discussion: https://postgr.es/m/CAMbWs48orzZVXa7-vP9Nt7vQWLTE04Qy4PePaLQYsVNQgo6qRg%40mail.gmail.com
---
 src/backend/access/heap/heapam.c   |  9 ---
 src/test/regress/expected/join.out | 39 ++
 src/test/regress/sql/join.sql      | 28 +
 3 files changed, 73 insertions(+), 3 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 4be0dee4de..82bb9cb33b 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1184,7 +1184,12 @@ heap_rescan(TableScanDesc sscan, ScanKey key, bool set_params,
 		scan->rs_vmbuffer = InvalidBuffer;
 	}

-	Assert(scan->rs_empty_tuples_pending == 0);
+	/*
+	 * Reset rs_empty_tuples_pending, a field only used by bitmap heap scan,
+	 * to avoid incorrectly emitting NULL-filled tuples from a previous scan
+	 * on rescan.
+	 */
+	scan->rs_empty_tuples_pending = 0;

 	/*
 	 * The read stream is reset on rescan. This must be done before
@@ -1216,8 +1221,6 @@ heap_endscan(TableScanDesc sscan)
 	if (BufferIsValid(scan->rs_vmbuffer))
 		ReleaseBuffer(scan->rs_vmbuffer);

-	Assert(scan->rs_empty_tuples_pending == 0);
-
 	/*
 	 * Must free the read stream before freeing the BufferAccessStrategy.
 	 */
diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index 0246d56aea..466be7b439 100644
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -7924,3 +7924,42 @@ where exists (select 1 from j3
 (13 rows)

 drop table j3;
+-- Exercise the "skip fetch" Bitmap Heap Scan optimization when candidate
+-- tuples are discarded. This may occur when:
+-- 1. A join doesn't require all inner tuples to be scanned for each outer
+--    tuple, and
+-- 2. The inner side is scanned using a bitmap heap scan, and
+-- 3. The bitmap heap scan is eligible for the "skip fetch" optimization.
+-- This optimization is usable when no data from the underlying table is
+-- needed. Use a temp table so it is only visible to this backend and
+-- vacuum may reliably mark all blocks in the table all visible in the
+-- visibility map.
+CREATE TEMP TABLE skip_fetch (a INT, b INT) WITH (fillfactor=10);
+INSERT INTO skip_fetch SELECT i % 3, i FROM generate_series(0,30) i;
+CREATE INDEX ON skip_fetch(a);
+VACUUM (ANALYZE) skip_fetch;
+SET enable_indexonlyscan = off;
+SET enable_indexscan = off;
+SET enable_material = off;
+SET enable_seqscan = off;
+EXPLAIN (COSTS OFF)
+SELECT t1.a FROM skip_fetch t1 LEFT JOIN skip_fetch t2 ON t2.a = 1 WHERE t2.a IS NULL;
+                   QUERY PLAN
+-------------------------------------------------
+ Nested Loop Anti Join
+   ->  Seq Scan on skip_fetch t1
+   ->  Bitmap Heap Scan on
Re: BitmapHeapScan streaming read user and prelim refactoring
On Mon, May 13, 2024 at 10:05:03AM -0400, Melanie Plageman wrote:
> Remove the assert and reset the field on which it previously asserted to
> avoid incorrectly emitting NULL-filled tuples from a previous scan on
> rescan.

> -	Assert(scan->rs_empty_tuples_pending == 0);
> +	scan->rs_empty_tuples_pending = 0;

Perhaps this should document the reason why the reset is done in these
two paths rather than let the reader guess it? And this is about
avoiding emitting some tuples from a previous scan.

> +SET enable_indexonlyscan = off;
> +set enable_indexscan = off;
> +SET enable_seqscan = off;

Nit: adjusting the casing of the second SET here.
--
Michael

signature.asc
Description: PGP signature
Re: BitmapHeapScan streaming read user and prelim refactoring
On Sat, May 11, 2024 at 3:18 PM Tomas Vondra wrote:
>
> On 5/10/24 21:48, Melanie Plageman wrote:
> > Attached is v3. I didn't use your exact language because the test
> > wouldn't actually verify that we properly discard the tuples. Whether
> > or not the empty tuples are all emitted, it just resets the counter to
> > 0. I decided to go with "exercise" instead.
>
> I did go over the v3 patch, did a bunch of tests, and I think it's fine
> and ready to go. The one thing that might need some minor tweaks is the
> commit message.
>
> 1) Isn't the subject "Remove incorrect assert" a bit misleading, as the
> patch does not simply remove an assert, but replaces it with a reset of
> the field the assert used to check? (The commit message does not mention
> this either, at least not explicitly.)

I've updated the commit message.

> 2) The "heap AM-specific bitmap heap scan code" sounds a bit strange to
> me, isn't the first "heap" unnecessary?

"bitmap heap scan" has been used to refer to bitmap table scans, as the
name wasn't changed from heap when the table AM API was added (e.g.
BitmapHeapNext() is used by all table AMs doing bitmap table scans).
04e72ed617be specifically pushed the skip fetch optimization into heap
implementations of bitmap table scan callbacks, so it was important to
make this distinction. I've changed the commit message to say "heap AM
implementations of bitmap table scan callbacks".

While looking at the patch again, I wondered if I should set
enable_material = false in the test. It doesn't matter from the
perspective of exercising the correct code; however, I wasn't sure if
disabling materialization would make the test more resilient against
future planner changes which could cause it to incorrectly fail.
- Melanie

From d3628d8d36b2be54b64c18402bdda80e1c8a436f Mon Sep 17 00:00:00 2001
From: Melanie Plageman
Date: Fri, 10 May 2024 14:52:34 -0400
Subject: [PATCH v4] BitmapHeapScan: Replace incorrect assert with
 reinitialization

04e72ed617be pushed the skip fetch optimization (allowing bitmap heap
scans to operate like index-only scans if none of the underlying data is
needed) into heap AM implementations of bitmap table scan callbacks.

04e72ed617be added an assert that all tuples in blocks eligible for the
optimization had been NULL-filled and emitted by the end of the scan.
This assert is incorrect when not all tuples need be scanned to execute
the query; for example: a join in which not all inner tuples need to be
scanned before skipping to the next outer tuple.

Remove the assert and reset the field on which it previously asserted to
avoid incorrectly emitting NULL-filled tuples from a previous scan on
rescan.

Author: Melanie Plageman
Reviewed-by: Tomas Vondra
Reported-by: Melanie Plageman
Reproduced-by: Tomas Vondra, Richard Guo
Discussion: https://postgr.es/m/CAMbWs48orzZVXa7-vP9Nt7vQWLTE04Qy4PePaLQYsVNQgo6qRg%40mail.gmail.com
---
 src/backend/access/heap/heapam.c   |  4 ++--
 src/test/regress/expected/join.out | 38 ++
 src/test/regress/sql/join.sql      | 26
 3 files changed, 66 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 4be0dee4de..8600c22515 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1184,7 +1184,7 @@ heap_rescan(TableScanDesc sscan, ScanKey key, bool set_params,
 		scan->rs_vmbuffer = InvalidBuffer;
 	}

-	Assert(scan->rs_empty_tuples_pending == 0);
+	scan->rs_empty_tuples_pending = 0;

 	/*
 	 * The read stream is reset on rescan. This must be done before
@@ -1216,7 +1216,7 @@ heap_endscan(TableScanDesc sscan)
 	if (BufferIsValid(scan->rs_vmbuffer))
 		ReleaseBuffer(scan->rs_vmbuffer);

-	Assert(scan->rs_empty_tuples_pending == 0);
+	scan->rs_empty_tuples_pending = 0;

 	/*
 	 * Must free the read stream before freeing the BufferAccessStrategy.
diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index 0246d56aea..8829bd76e7 100644
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -7924,3 +7924,41 @@ where exists (select 1 from j3
 (13 rows)

 drop table j3;
+-- Exercise the "skip fetch" Bitmap Heap Scan optimization when candidate
+-- tuples are discarded. This may occur when:
+-- 1. A join doesn't require all inner tuples to be scanned for each outer
+--    tuple, and
+-- 2. The inner side is scanned using a bitmap heap scan, and
+-- 3. The bitmap heap scan is eligible for the "skip fetch" optimization.
+-- This optimization is usable when no data from the underlying table is
+-- needed. Use a temp table so it is only visible to this backend and
+-- vacuum may reliably mark all blocks in the table all visible in the
+-- visibility map.
+CREATE TEMP TABLE skip_fetch (a INT, b INT) WITH (fillfactor=10);
+INSERT INTO skip_fetch SELECT i % 3, i FROM generate_series(0,30) i;
+CREATE INDEX ON skip_fetch(a);
+VACUUM
Re: BitmapHeapScan streaming read user and prelim refactoring
On 5/10/24 21:48, Melanie Plageman wrote: > On Thu, May 2, 2024 at 5:37 PM Tomas Vondra > wrote: >> >> >> >> On 4/24/24 22:46, Melanie Plageman wrote: >>> On Tue, Apr 23, 2024 at 6:43 PM Tomas Vondra >>> wrote: On 4/23/24 18:05, Melanie Plageman wrote: > The patch with a fix is attached. I put the test in > src/test/regress/sql/join.sql. It isn't the perfect location because > it is testing something exercisable with a join but not directly > related to the fact that it is a join. I also considered > src/test/regress/sql/select.sql, but it also isn't directly related to > the query being a SELECT query. If there is a better place for a test > of a bitmap heap scan edge case, let me know. I don't see a problem with adding this to join.sql - why wouldn't this count as something related to a join? Sure, it's not like this code matters only for joins, but if you look at join.sql that applies to a number of other tests (e.g. there are a couple btree tests). >>> >>> I suppose it's true that other tests in this file use joins to test >>> other code. I guess if we limited join.sql to containing tests of join >>> implementation, it would be rather small. I just imagined it would be >>> nice if tests were grouped by what they were testing -- not how they >>> were testing it. >>> That being said, it'd be good to explain in the comment why we're testing this particular plan, not just what the plan looks like. >>> >>> You mean I should explain in the test comment why I included the >>> EXPLAIN plan output? (because it needs to contain a bitmapheapscan to >>> actually be testing anything) >>> >> >> No, I meant that the comment before the test describes a couple >> requirements the plan needs to meet (no need to process all inner >> tuples, bitmapscan eligible for skip_fetch on outer side, ...), but it >> does not explain why we're testing that plan. >> >> I could get to that by doing git-blame to see what commit added this >> code, and then read the linked discussion. 
>> Perhaps that's enough, but maybe the comment could say something like
>> "verify we properly discard tuples on rescans" or something like that?
>
> Attached is v3. I didn't use your exact language because the test
> wouldn't actually verify that we properly discard the tuples. Whether
> or not the empty tuples are all emitted, it just resets the counter to
> 0. I decided to go with "exercise" instead.

I did go over the v3 patch, did a bunch of tests, and I think it's fine
and ready to go. The one thing that might need some minor tweaks is the
commit message.

1) Isn't the subject "Remove incorrect assert" a bit misleading, as the
patch does not simply remove an assert, but replaces it with a reset of
the field the assert used to check? (The commit message does not mention
this either, at least not explicitly.)

2) The "heap AM-specific bitmap heap scan code" sounds a bit strange to
me, isn't the first "heap" unnecessary?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: BitmapHeapScan streaming read user and prelim refactoring
On Thu, May 2, 2024 at 5:37 PM Tomas Vondra wrote: > > > > On 4/24/24 22:46, Melanie Plageman wrote: > > On Tue, Apr 23, 2024 at 6:43 PM Tomas Vondra > > wrote: > >> > >> On 4/23/24 18:05, Melanie Plageman wrote: > >>> The patch with a fix is attached. I put the test in > >>> src/test/regress/sql/join.sql. It isn't the perfect location because > >>> it is testing something exercisable with a join but not directly > >>> related to the fact that it is a join. I also considered > >>> src/test/regress/sql/select.sql, but it also isn't directly related to > >>> the query being a SELECT query. If there is a better place for a test > >>> of a bitmap heap scan edge case, let me know. > >> > >> I don't see a problem with adding this to join.sql - why wouldn't this > >> count as something related to a join? Sure, it's not like this code > >> matters only for joins, but if you look at join.sql that applies to a > >> number of other tests (e.g. there are a couple btree tests). > > > > I suppose it's true that other tests in this file use joins to test > > other code. I guess if we limited join.sql to containing tests of join > > implementation, it would be rather small. I just imagined it would be > > nice if tests were grouped by what they were testing -- not how they > > were testing it. > > > >> That being said, it'd be good to explain in the comment why we're > >> testing this particular plan, not just what the plan looks like. > > > > You mean I should explain in the test comment why I included the > > EXPLAIN plan output? (because it needs to contain a bitmapheapscan to > > actually be testing anything) > > > > No, I meant that the comment before the test describes a couple > requirements the plan needs to meet (no need to process all inner > tuples, bitmapscan eligible for skip_fetch on outer side, ...), but it > does not explain why we're testing that plan. 
>
> I could get to that by doing git-blame to see what commit added this
> code, and then read the linked discussion. Perhaps that's enough, but
> maybe the comment could say something like "verify we properly discard
> tuples on rescans" or something like that?

Attached is v3. I didn't use your exact language because the test
wouldn't actually verify that we properly discard the tuples. Whether
or not the empty tuples are all emitted, it just resets the counter to
0. I decided to go with "exercise" instead.

- Melanie

From 0f10d304fc7dcce99622f348183f36a8062e70c6 Mon Sep 17 00:00:00 2001
From: Melanie Plageman
Date: Fri, 10 May 2024 14:52:34 -0400
Subject: [PATCH v3] BitmapHeapScan: Remove incorrect assert

04e72ed617be pushed the skip fetch optimization (allowing bitmap heap
scans to operate like index-only scans if none of the underlying data
is needed) into heap AM-specific bitmap heap scan code.

04e72ed617be added an assert that all tuples in blocks eligible for the
optimization had been NULL-filled and emitted by the end of the scan.
This assert is incorrect when not all tuples need be scanned to execute
the query; for example: a join in which not all inner tuples need to be
scanned before skipping to the next outer tuple.

Author: Melanie Plageman
Reviewed-by: Richard Guo, Tomas Vondra
Discussion: https://postgr.es/m/CAMbWs48orzZVXa7-vP9Nt7vQWLTE04Qy4PePaLQYsVNQgo6qRg%40mail.gmail.com
---
 src/backend/access/heap/heapam.c   |  4 ++--
 src/test/regress/expected/join.out | 38 ++
 src/test/regress/sql/join.sql      | 26
 3 files changed, 66 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 4be0dee4de..8600c22515 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1184,7 +1184,7 @@ heap_rescan(TableScanDesc sscan, ScanKey key, bool set_params,
 		scan->rs_vmbuffer = InvalidBuffer;
 	}

-	Assert(scan->rs_empty_tuples_pending == 0);
+	scan->rs_empty_tuples_pending = 0;

 	/*
 	 * The read stream is reset on rescan. This must be done before
@@ -1216,7 +1216,7 @@ heap_endscan(TableScanDesc sscan)
 	if (BufferIsValid(scan->rs_vmbuffer))
 		ReleaseBuffer(scan->rs_vmbuffer);

-	Assert(scan->rs_empty_tuples_pending == 0);
+	scan->rs_empty_tuples_pending = 0;

 	/*
 	 * Must free the read stream before freeing the BufferAccessStrategy.

diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index 0246d56aea..8829bd76e7 100644
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -7924,3 +7924,41 @@ where exists (select 1 from j3
 (13 rows)

 drop table j3;
+-- Exercise the "skip fetch" Bitmap Heap Scan optimization when candidate
+-- tuples are discarded. This may occur when:
+-- 1. A join doesn't require all inner tuples to be scanned for each outer
+--    tuple, and
+-- 2. The inner side is scanned using a bitmap heap scan, and
+-- 3. The bitmap heap scan is eligible for the "skip fetch" optimization.
+-- This optimization is
Re: BitmapHeapScan streaming read user and prelim refactoring
On 4/24/24 22:46, Melanie Plageman wrote: > On Tue, Apr 23, 2024 at 6:43 PM Tomas Vondra > wrote: >> >> On 4/23/24 18:05, Melanie Plageman wrote: >>> The patch with a fix is attached. I put the test in >>> src/test/regress/sql/join.sql. It isn't the perfect location because >>> it is testing something exercisable with a join but not directly >>> related to the fact that it is a join. I also considered >>> src/test/regress/sql/select.sql, but it also isn't directly related to >>> the query being a SELECT query. If there is a better place for a test >>> of a bitmap heap scan edge case, let me know. >> >> I don't see a problem with adding this to join.sql - why wouldn't this >> count as something related to a join? Sure, it's not like this code >> matters only for joins, but if you look at join.sql that applies to a >> number of other tests (e.g. there are a couple btree tests). > > I suppose it's true that other tests in this file use joins to test > other code. I guess if we limited join.sql to containing tests of join > implementation, it would be rather small. I just imagined it would be > nice if tests were grouped by what they were testing -- not how they > were testing it. > >> That being said, it'd be good to explain in the comment why we're >> testing this particular plan, not just what the plan looks like. > > You mean I should explain in the test comment why I included the > EXPLAIN plan output? (because it needs to contain a bitmapheapscan to > actually be testing anything) > No, I meant that the comment before the test describes a couple requirements the plan needs to meet (no need to process all inner tuples, bitmapscan eligible for skip_fetch on outer side, ...), but it does not explain why we're testing that plan. I could get to that by doing git-blame to see what commit added this code, and then read the linked discussion. 
Perhaps that's enough, but maybe the comment could say something like
"verify we properly discard tuples on rescans" or something like that?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: BitmapHeapScan streaming read user and prelim refactoring
On 4/30/24 14:07, Daniel Gustafsson wrote:
>> On 26 Apr 2024, at 15:04, Melanie Plageman wrote:
>> If this seems correct to you, are you okay with the rest of the fix
>> and test? We could close this open item once the patch is acceptable.
>
> From reading the discussion and the patch this seems like the right fix
> to me.

I agree.

> Does the test added here aptly cover 04e72ed617be in terms of its
> functionality?

AFAIK the test fails without the fix and works with it, so I believe it
does cover the relevant functionality.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: BitmapHeapScan streaming read user and prelim refactoring
> On 26 Apr 2024, at 15:04, Melanie Plageman wrote:
> If this seems correct to you, are you okay with the rest of the fix
> and test? We could close this open item once the patch is acceptable.

From reading the discussion and the patch this seems like the right fix
to me. Does the test added here aptly cover 04e72ed617be in terms of its
functionality?

--
Daniel Gustafsson
Re: BitmapHeapScan streaming read user and prelim refactoring
On Thu, Apr 25, 2024 at 7:57 PM Tom Lane wrote: > > I wrote: > > Hmm, is that actually true? There's no more reason to think a tuple > > in a temp table is old enough to be visible to all other sessions > > than one in any other table. It could be all right if we had a > > special-case rule for setting all-visible in temp tables. Which > > indeed I thought we had, but I can't find any evidence of that in > > vacuumlazy.c, nor did a trawl of the commit log turn up anything > > promising. Am I just looking in the wrong place? > > Ah, never mind that --- I must be looking in the wrong place. > Direct experimentation proves that VACUUM will set all-visible bits > for temp tables even in the presence of concurrent transactions. If this seems correct to you, are you okay with the rest of the fix and test? We could close this open item once the patch is acceptable. - Melanie
Re: BitmapHeapScan streaming read user and prelim refactoring
I wrote:
> Hmm, is that actually true? There's no more reason to think a tuple
> in a temp table is old enough to be visible to all other sessions
> than one in any other table. It could be all right if we had a
> special-case rule for setting all-visible in temp tables. Which
> indeed I thought we had, but I can't find any evidence of that in
> vacuumlazy.c, nor did a trawl of the commit log turn up anything
> promising. Am I just looking in the wrong place?

Ah, never mind that --- I must be looking in the wrong place.
Direct experimentation proves that VACUUM will set all-visible bits
for temp tables even in the presence of concurrent transactions.

regards, tom lane
Re: BitmapHeapScan streaming read user and prelim refactoring
Melanie Plageman writes: > On Wed, Apr 24, 2024 at 4:46 PM Melanie Plageman > wrote: >> After thinking about it more, I suppose we can't add a test that >> relies on the relation being all visible in the VM in a group in the >> parallel schedule. I'm not sure this edge case is important enough to >> merit its own group or an isolation test. What do you think? > Andres rightly pointed out to me off-list that if I just used a temp > table, the table would only be visible to the testing backend anyway. > I've done that in the attached v2. Now the test is deterministic. Hmm, is that actually true? There's no more reason to think a tuple in a temp table is old enough to be visible to all other sessions than one in any other table. It could be all right if we had a special-case rule for setting all-visible in temp tables. Which indeed I thought we had, but I can't find any evidence of that in vacuumlazy.c, nor did a trawl of the commit log turn up anything promising. Am I just looking in the wrong place? regards, tom lane
Re: BitmapHeapScan streaming read user and prelim refactoring
On Wed, Apr 24, 2024 at 4:46 PM Melanie Plageman wrote: > > On Tue, Apr 23, 2024 at 6:43 PM Tomas Vondra > wrote: > > > > On 4/23/24 18:05, Melanie Plageman wrote: > > > One other note: there is some concurrency effect in the parallel > > > schedule group containing "join" where you won't trip the assert if > > > all the tests in that group in the parallel schedule are run. But, if > > > you would like to verify that the test exercises the correct code, > > > just reduce the group containing "join". > > > > > > > That is ... interesting. Doesn't that mean that most test runs won't > > actually detect the problem? That would make the test a bit useless. > > Yes, I should really have thought about it more. After further > investigation, I found that the reason it doesn't trip the assert when > the join test is run concurrently with other tests is that the SELECT > query doesn't use the skip fetch optimization because the VACUUM > doesn't set the pages all visible in the VM. In this case, it's > because the tuples' xmins are not before VacuumCutoffs->OldestXmin > (which is derived from GetOldestNonRemovableTransactionId()). > > After thinking about it more, I suppose we can't add a test that > relies on the relation being all visible in the VM in a group in the > parallel schedule. I'm not sure this edge case is important enough to > merit its own group or an isolation test. What do you think? Andres rightly pointed out to me off-list that if I just used a temp table, the table would only be visible to the testing backend anyway. I've done that in the attached v2. Now the test is deterministic. 
- Melanie

From c5f28d12f75f5af84ede0db563fc5c0b53295c65 Mon Sep 17 00:00:00 2001
From: Melanie Plageman
Date: Thu, 25 Apr 2024 18:50:14 -0400
Subject: [PATCH v2] BitmapHeapScan: Remove incorrect assert

04e72ed617be pushed the skip fetch optimization (allowing bitmap heap
scans to operate like index-only scans if none of the underlying data
is needed) into heap AM-specific bitmap heap scan code.

04e72ed617be added an assert that all tuples in blocks eligible for the
optimization had been NULL-filled and emitted by the end of the scan.
This assert is incorrect when not all tuples need be scanned to execute
the query; for example: a join in which not all inner tuples need to be
scanned before skipping to the next outer tuple.

Author: Melanie Plageman
Reviewed-by: Richard Guo, Tomas Vondra
Discussion: https://postgr.es/m/CAMbWs48orzZVXa7-vP9Nt7vQWLTE04Qy4PePaLQYsVNQgo6qRg%40mail.gmail.com
---
 src/backend/access/heap/heapam.c   |  4 ++--
 src/test/regress/expected/join.out | 37 ++
 src/test/regress/sql/join.sql      | 25
 3 files changed, 64 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 4a4cf76269d..8906f161320 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1184,7 +1184,7 @@ heap_rescan(TableScanDesc sscan, ScanKey key, bool set_params,
 		scan->rs_vmbuffer = InvalidBuffer;
 	}

-	Assert(scan->rs_empty_tuples_pending == 0);
+	scan->rs_empty_tuples_pending = 0;

 	/*
 	 * The read stream is reset on rescan. This must be done before
@@ -1216,7 +1216,7 @@ heap_endscan(TableScanDesc sscan)
 	if (BufferIsValid(scan->rs_vmbuffer))
 		ReleaseBuffer(scan->rs_vmbuffer);

-	Assert(scan->rs_empty_tuples_pending == 0);
+	scan->rs_empty_tuples_pending = 0;

 	/*
 	 * Must free the read stream before freeing the BufferAccessStrategy.

diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index 8b640c2fc2f..4f0292c7285 100644
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -8956,3 +8956,40 @@ where exists (select 1 from j3
 (13 rows)

 drop table j3;
+-- Check the case when:
+-- 1. A join doesn't require all inner tuples to be scanned for each outer
+--    tuple, and
+-- 2. The inner side is scanned using a bitmap heap scan, and
+-- 3. The bitmap heap scan is eligible for the "skip fetch" optimization.
+-- This optimization is usable when no data from the underlying table is
+-- needed. Use a temp table so it is only visible to this backend and
+-- vacuum may reliably mark all blocks in the table all visible in the
+-- visibility map.
+CREATE TEMP TABLE skip_fetch (a INT, b INT) WITH (fillfactor=10);
+INSERT INTO skip_fetch SELECT i % 3, i FROM generate_series(0,30) i;
+CREATE INDEX ON skip_fetch(a);
+VACUUM (ANALYZE) skip_fetch;
+SET enable_indexonlyscan = off;
+set enable_indexscan = off;
+SET enable_seqscan = off;
+EXPLAIN (COSTS OFF)
+SELECT t1.a FROM skip_fetch t1 LEFT JOIN skip_fetch t2 ON t2.a = 1 WHERE t2.a IS NULL;
+                      QUERY PLAN
+------------------------------------------------
+ Nested Loop Anti Join
+   ->  Seq Scan on skip_fetch t1
+   ->  Materialize
+         ->  Bitmap Heap Scan on skip_fetch t2
+
Re: BitmapHeapScan streaming read user and prelim refactoring
On Tue, Apr 23, 2024 at 6:43 PM Tomas Vondra wrote: > > On 4/23/24 18:05, Melanie Plageman wrote: > > The patch with a fix is attached. I put the test in > > src/test/regress/sql/join.sql. It isn't the perfect location because > > it is testing something exercisable with a join but not directly > > related to the fact that it is a join. I also considered > > src/test/regress/sql/select.sql, but it also isn't directly related to > > the query being a SELECT query. If there is a better place for a test > > of a bitmap heap scan edge case, let me know. > > I don't see a problem with adding this to join.sql - why wouldn't this > count as something related to a join? Sure, it's not like this code > matters only for joins, but if you look at join.sql that applies to a > number of other tests (e.g. there are a couple btree tests). I suppose it's true that other tests in this file use joins to test other code. I guess if we limited join.sql to containing tests of join implementation, it would be rather small. I just imagined it would be nice if tests were grouped by what they were testing -- not how they were testing it. > That being said, it'd be good to explain in the comment why we're > testing this particular plan, not just what the plan looks like. You mean I should explain in the test comment why I included the EXPLAIN plan output? (because it needs to contain a bitmapheapscan to actually be testing anything) I do have a detailed explanation in my test comment of why this particular query exercises the code we want to test. > > One other note: there is some concurrency effect in the parallel > > schedule group containing "join" where you won't trip the assert if > > all the tests in that group in the parallel schedule are run. But, if > > you would like to verify that the test exercises the correct code, > > just reduce the group containing "join". > > > > That is ... interesting. Doesn't that mean that most test runs won't > actually detect the problem? 
> That would make the test a bit useless.

Yes, I should really have thought about it more. After further
investigation, I found that the reason it doesn't trip the assert when
the join test is run concurrently with other tests is that the SELECT
query doesn't use the skip fetch optimization because the VACUUM
doesn't set the pages all visible in the VM. In this case, it's
because the tuples' xmins are not before VacuumCutoffs->OldestXmin
(which is derived from GetOldestNonRemovableTransactionId()).

After thinking about it more, I suppose we can't add a test that
relies on the relation being all visible in the VM in a group in the
parallel schedule. I'm not sure this edge case is important enough to
merit its own group or an isolation test. What do you think?

- Melanie
Re: BitmapHeapScan streaming read user and prelim refactoring
On 4/23/24 18:05, Melanie Plageman wrote: > On Mon, Apr 22, 2024 at 1:01 PM Melanie Plageman > wrote: >> >> On Thu, Apr 18, 2024 at 5:39 AM Tomas Vondra >> wrote: >>> >>> On 4/18/24 09:10, Michael Paquier wrote: On Sun, Apr 07, 2024 at 10:54:56AM -0400, Melanie Plageman wrote: > I've added an open item [1], because what's one open item when you can > have two? (me) And this is still an open item as of today. What's the plan to move forward here? >>> >>> AFAIK the plan is to replace the asserts with actually resetting the >>> rs_empty_tuples_pending field to 0, as suggested by Melanie a week ago. >>> I assume she was busy with the post-freeze AM reworks last week, so this >>> was on a back burner. >> >> yep, sorry. Also I took a few days off. I'm just catching up today. I >> want to pop in one of Richard or Tomas' examples as a test, since it >> seems like it would add some coverage. I will have a patch soon. > > The patch with a fix is attached. I put the test in > src/test/regress/sql/join.sql. It isn't the perfect location because > it is testing something exercisable with a join but not directly > related to the fact that it is a join. I also considered > src/test/regress/sql/select.sql, but it also isn't directly related to > the query being a SELECT query. If there is a better place for a test > of a bitmap heap scan edge case, let me know. > I don't see a problem with adding this to join.sql - why wouldn't this count as something related to a join? Sure, it's not like this code matters only for joins, but if you look at join.sql that applies to a number of other tests (e.g. there are a couple btree tests). That being said, it'd be good to explain in the comment why we're testing this particular plan, not just what the plan looks like. > One other note: there is some concurrency effect in the parallel > schedule group containing "join" where you won't trip the assert if > all the tests in that group in the parallel schedule are run. 
> But, if you would like to verify that the test exercises the correct
> code, just reduce the group containing "join".

That is ... interesting. Doesn't that mean that most test runs won't
actually detect the problem? That would make the test a bit useless.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: BitmapHeapScan streaming read user and prelim refactoring
On Mon, Apr 22, 2024 at 1:01 PM Melanie Plageman wrote: > > On Thu, Apr 18, 2024 at 5:39 AM Tomas Vondra > wrote: > > > > On 4/18/24 09:10, Michael Paquier wrote: > > > On Sun, Apr 07, 2024 at 10:54:56AM -0400, Melanie Plageman wrote: > > >> I've added an open item [1], because what's one open item when you can > > >> have two? (me) > > > > > > And this is still an open item as of today. What's the plan to move > > > forward here? > > > > AFAIK the plan is to replace the asserts with actually resetting the > > rs_empty_tuples_pending field to 0, as suggested by Melanie a week ago. > > I assume she was busy with the post-freeze AM reworks last week, so this > > was on a back burner. > > yep, sorry. Also I took a few days off. I'm just catching up today. I > want to pop in one of Richard or Tomas' examples as a test, since it > seems like it would add some coverage. I will have a patch soon. The patch with a fix is attached. I put the test in src/test/regress/sql/join.sql. It isn't the perfect location because it is testing something exercisable with a join but not directly related to the fact that it is a join. I also considered src/test/regress/sql/select.sql, but it also isn't directly related to the query being a SELECT query. If there is a better place for a test of a bitmap heap scan edge case, let me know. One other note: there is some concurrency effect in the parallel schedule group containing "join" where you won't trip the assert if all the tests in that group in the parallel schedule are run. But, if you would like to verify that the test exercises the correct code, just reduce the group containing "join". 
- Melanie

From 6ad777979c335f6cc16d3936defb634176a44995 Mon Sep 17 00:00:00 2001
From: Melanie Plageman
Date: Tue, 23 Apr 2024 11:45:37 -0400
Subject: [PATCH v1] BitmapHeapScan: Remove incorrect assert

04e72ed617be pushed the skip fetch optimization (allowing bitmap heap
scans to operate like index-only scans if none of the underlying data
is needed) into heap AM-specific bitmap heap scan code.

04e72ed617be added an assert that all tuples in blocks eligible for the
optimization had been NULL-filled and emitted by the end of the scan.
This assert is incorrect when not all tuples need be scanned to execute
the query; for example: a join in which not all inner tuples need to be
scanned before skipping to the next outer tuple.

Author: Melanie Plageman
Reviewed-by: Richard Guo, Tomas Vondra
Discussion: https://postgr.es/m/CAMbWs48orzZVXa7-vP9Nt7vQWLTE04Qy4PePaLQYsVNQgo6qRg%40mail.gmail.com
---
 src/backend/access/heap/heapam.c   |  4 ++--
 src/test/regress/expected/join.out | 36 ++
 src/test/regress/sql/join.sql      | 24
 3 files changed, 62 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 4a4cf76269d..8906f161320 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1184,7 +1184,7 @@ heap_rescan(TableScanDesc sscan, ScanKey key, bool set_params,
 		scan->rs_vmbuffer = InvalidBuffer;
 	}

-	Assert(scan->rs_empty_tuples_pending == 0);
+	scan->rs_empty_tuples_pending = 0;

 	/*
 	 * The read stream is reset on rescan. This must be done before
@@ -1216,7 +1216,7 @@ heap_endscan(TableScanDesc sscan)
 	if (BufferIsValid(scan->rs_vmbuffer))
 		ReleaseBuffer(scan->rs_vmbuffer);

-	Assert(scan->rs_empty_tuples_pending == 0);
+	scan->rs_empty_tuples_pending = 0;

 	/*
 	 * Must free the read stream before freeing the BufferAccessStrategy.

diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index 8b640c2fc2f..ce73939c267 100644
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -8956,3 +8956,39 @@ where exists (select 1 from j3
 (13 rows)

 drop table j3;
+-- Check the case when:
+-- 1. A join doesn't require all inner tuples to be scanned for each outer
+--    tuple, and
+-- 2. The inner side is scanned using a bitmap heap scan, and
+-- 3. The bitmap heap scan is eligible for the "skip fetch" optimization.
+-- This optimization is usable when no data from the underlying table is
+-- needed.
+CREATE TABLE skip_fetch (a INT, b INT) WITH (fillfactor=10);
+INSERT INTO skip_fetch SELECT i % 3, i FROM GENERATE_SERIES(0,30) i;
+CREATE INDEX ON skip_fetch(a);
+VACUUM (ANALYZE) skip_fetch;
+SET enable_indexonlyscan = off;
+set enable_indexscan = off;
+SET enable_seqscan = off;
+EXPLAIN (COSTS OFF)
+SELECT t1.a FROM skip_fetch t1 LEFT JOIN skip_fetch t2 ON t2.a = 1 WHERE t2.a IS NULL;
+                      QUERY PLAN
+------------------------------------------------
+ Nested Loop Anti Join
+   ->  Seq Scan on skip_fetch t1
+   ->  Materialize
+         ->  Bitmap Heap Scan on skip_fetch t2
+               Recheck Cond: (a = 1)
+               ->  Bitmap Index Scan on skip_fetch_a_idx
+                     Index Cond: (a = 1)
+(7
Re: BitmapHeapScan streaming read user and prelim refactoring
On Thu, Apr 18, 2024 at 5:39 AM Tomas Vondra wrote: > > On 4/18/24 09:10, Michael Paquier wrote: > > On Sun, Apr 07, 2024 at 10:54:56AM -0400, Melanie Plageman wrote: > >> I've added an open item [1], because what's one open item when you can > >> have two? (me) > > > > And this is still an open item as of today. What's the plan to move > > forward here? > > AFAIK the plan is to replace the asserts with actually resetting the > rs_empty_tuples_pending field to 0, as suggested by Melanie a week ago. > I assume she was busy with the post-freeze AM reworks last week, so this > was on a back burner. yep, sorry. Also I took a few days off. I'm just catching up today. I want to pop in one of Richard or Tomas' examples as a test, since it seems like it would add some coverage. I will have a patch soon. - Melanie
Re: BitmapHeapScan streaming read user and prelim refactoring
On 4/18/24 09:10, Michael Paquier wrote:
> On Sun, Apr 07, 2024 at 10:54:56AM -0400, Melanie Plageman wrote:
>> I've added an open item [1], because what's one open item when you can
>> have two? (me)
>
> And this is still an open item as of today. What's the plan to move
> forward here?

AFAIK the plan is to replace the asserts with actually resetting the
rs_empty_tuples_pending field to 0, as suggested by Melanie a week ago.
I assume she was busy with the post-freeze AM reworks last week, so this
was on a back burner.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
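[Editorial note: the reset-instead-of-assert behavior discussed above can be sketched abstractly. The following is a toy Python model of the skip-fetch bookkeeping, illustrative only; the class and method names are invented and this is not PostgreSQL code.]

```python
class ToyBitmapScan:
    """Toy model of the skip-fetch bookkeeping (names are illustrative)."""

    def __init__(self):
        self.empty_tuples_pending = 0

    def scan_all_visible_block(self, ntuples):
        # Skip fetching the block's data; remember how many NULL-filled
        # tuples we still owe the executor.
        self.empty_tuples_pending += ntuples

    def next_tuple(self):
        # Emit one of the owed NULL-filled tuples.
        assert self.empty_tuples_pending > 0
        self.empty_tuples_pending -= 1

    def rescan(self):
        # The buggy code asserted empty_tuples_pending == 0 here. That
        # only holds if the executor drained the scan; a join that stops
        # early leaves the counter nonzero, so discard it instead.
        self.empty_tuples_pending = 0


scan = ToyBitmapScan()
scan.scan_all_visible_block(22)         # one all-visible block, 22 tuples
scan.next_tuple()                       # executor pulls one tuple, then stops
assert scan.empty_tuples_pending == 21  # scan abandoned with tuples pending
scan.rescan()                           # would have tripped the old Assert
assert scan.empty_tuples_pending == 0
```

The point of the sketch is that a nonzero pending count at rescan/endscan time is a legitimate state, not a bug, whenever the executor does not consume the whole scan.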
Re: BitmapHeapScan streaming read user and prelim refactoring
On Sun, Apr 07, 2024 at 10:54:56AM -0400, Melanie Plageman wrote:
> I've added an open item [1], because what's one open item when you can
> have two? (me)

And this is still an open item as of today. What's the plan to move
forward here?

--
Michael
Re: BitmapHeapScan streaming read user and prelim refactoring
On Sun, Apr 7, 2024 at 10:42 PM Tomas Vondra wrote:
> create table t (a int, b int) with (fillfactor=10);
> insert into t select mod((i/22),2), (i/22) from generate_series(0,1000) S(i);
> create index on t(a);
> vacuum analyze t;
>
> set enable_indexonlyscan = off;
> set enable_seqscan = off;
> explain (analyze, verbose) select 1 from (values (1)) s(x) where exists
> (select * from t where a = x);
>
> KABOOM!

FWIW, it seems to me that this assert could be triggered in cases where,
during a join, not all inner tuples need to be scanned before skipping
to the next outer tuple. This can happen for 'single_match' or
anti-join. The query provided by Tomas is an example of the
'single_match' case. Here is a query for anti-join that can also
trigger this assert.

explain (analyze, verbose) select t1.a from t t1 left join t t2 on t2.a = 1 where t2.a is null;
server closed the connection unexpectedly

Thanks
Richard
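[Editorial note: Richard's observation that an anti-join stops scanning the inner side as soon as a match is found is exactly the mechanism that abandons the inner scan mid-way. A toy Python sketch of that early exit, illustrative only and not executor code:]

```python
def nested_loop_anti_join(outer, inner_rows, match):
    """Toy nested-loop anti-join: emit outer rows with no matching inner
    row. Each outer row triggers a fresh pass (a "rescan") over the inner
    side, and that pass stops as soon as a match is found; any remaining
    inner tuples are simply never read."""
    result = []
    for o in outer:
        matched = False
        for i in inner_rows:      # rescan of the inner side
            if match(o, i):
                matched = True
                break             # early exit: inner scan abandoned
        if not matched:
            result.append(o)
    return result


# Mirrors "t1 LEFT JOIN t2 ON t2.a = 1 WHERE t2.a IS NULL" in spirit:
# outer rows with a matching inner row are dropped, the rest survive.
rows = nested_loop_anti_join([0, 1, 2], [1, 1, 1], lambda o, i: o == i)
assert rows == [0, 2]
```

If the abandoned inner scan had pending NULL-filled tuples from the skip-fetch optimization, the rescan for the next outer row would have hit the (now removed) assert.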
Re: BitmapHeapScan streaming read user and prelim refactoring
On Sun, Apr 7, 2024 at 10:42 AM Tomas Vondra wrote: > > > > On 4/7/24 16:24, Melanie Plageman wrote: > >>> Thanks! I thought about it a bit more, and I got worried about the > >>> > >>> Assert(scan->rs_empty_tuples_pending == 0); > >>> > >>> in heap_rescan() and heap_endscan(). > >>> > >>> I was worried if we don't complete the scan it could end up tripping > >>> incorrectly. > >>> > >>> I tried to come up with a query which didn't end up emitting all of the > >>> tuples on the page (using a LIMIT clause), but I struggled to come up > >>> with an example that qualified for the skip fetch optimization and also > >>> returned before completing the scan. > >>> > >>> I could work a bit harder tomorrow to try and come up with something. > >>> However, I think it might be safer to just change these to: > >>> > >>> scan->rs_empty_tuples_pending = 0 > >>> > >> > >> Hmmm, good point. I haven't tried, but wouldn't something like "SELECT 1 > >> FROM t WHERE column = X LIMIT 1" do the trick? Probably in a join, as a > >> correlated subquery? > > > > Unfortunately (or fortunately, I guess) that exact thing won't work > > because even constant values in the target list disqualify it for the > > skip fetch optimization. > > > > Being a bit too lazy to look at planner code this morning, I removed > > the target list requirement like this: > > > > - need_tuples = (node->ss.ps.plan->qual != NIL || > > - node->ss.ps.plan->targetlist != NIL); > > + need_tuples = (node->ss.ps.plan->qual != NIL); > > > > And can easily trip the assert with this: > > > > create table foo (a int); > > insert into foo select i from generate_series(1,10)i; > > create index on foo(a); > > vacuum foo; > > select 1 from (select 2 from foo limit 3); > > > > Anyway, I don't know if we could find a query that does actually hit > > this. The only bitmap heap scan queries in the regress suite that meet > > the > >BitmapHeapScanState->ss.ps.plan->targetlist == NIL > > condition are aggregates (all are count(*)). 
> > > > I'll dig a bit more later, but do you think this is worth adding an > > open item for? Even though I don't have a repro yet? > > > > Try this: > > create table t (a int, b int) with (fillfactor=10); > insert into t select mod((i/22),2), (i/22) from generate_series(0,1000) > S(i); > create index on t(a); > vacuum analyze t; > > set enable_indexonlyscan = off; > set enable_seqscan = off; > explain (analyze, verbose) select 1 from (values (1)) s(x) where exists > (select * from t where a = x); > > KABOOM! Ooo fancy query! Good job. > #2 0x78a16ac5fafe in __GI_raise (sig=sig@entry=6) at > ../sysdeps/posix/raise.c:26 > #3 0x78a16ac4887f in __GI_abort () at abort.c:79 > #4 0x00bb2c5a in ExceptionalCondition (conditionName=0xc42ba8 > "scan->rs_empty_tuples_pending == 0", fileName=0xc429c8 "heapam.c", > lineNumber=1090) at assert.c:66 > #5 0x004f68bb in heap_endscan (sscan=0x19af3a0) at heapam.c:1090 > #6 0x0077a94c in table_endscan (scan=0x19af3a0) at > ../../../src/include/access/tableam.h:1001 > > So yeah, this assert is not quite correct. It's not breaking anything at > the moment, so we can fix it now or add it as an open item. I've added an open item [1], because what's one open item when you can have two? (me) - Melanie [1] https://wiki.postgresql.org/wiki/PostgreSQL_17_Open_Items#Open_Issues
Re: BitmapHeapScan streaming read user and prelim refactoring
On 4/7/24 16:24, Melanie Plageman wrote: > On Sun, Apr 7, 2024 at 7:38 AM Tomas Vondra > wrote: >> >> >> >> On 4/7/24 06:17, Melanie Plageman wrote: >>> On Sun, Apr 07, 2024 at 02:27:43AM +0200, Tomas Vondra wrote: On 4/6/24 23:34, Melanie Plageman wrote: > ... >> >> I realized it makes more sense to add a FIXME (I used XXX. I'm not when >> to use what) with a link to the message where Andres describes why he >> thinks it is a bug. If we plan on fixing it, it is good to have a record >> of that. And it makes it easier to put a clear and accurate comment. >> Done in 0009. >> >>> OK, thanks. If think 0001-0008 are ready to go, with some minor tweaks >>> per above (tuple vs. tuples etc.), and the question about the recheck >>> flag. If you can do these tweaks, I'll get that committed today and we >>> can try to get a couple more patches in tomorrow. > > Attached v19 rebases the rest of the commits from v17 over the first > nine patches from v18. All patches 0001-0009 are unchanged from v18. I > have made updates and done cleanup on 0010-0021. > I've pushed 0001-0005, I'll get back to this tomorrow and see how much more we can get in for v17. >>> >>> Thanks! I thought about it a bit more, and I got worried about the >>> >>> Assert(scan->rs_empty_tuples_pending == 0); >>> >>> in heap_rescan() and heap_endscan(). >>> >>> I was worried if we don't complete the scan it could end up tripping >>> incorrectly. >>> >>> I tried to come up with a query which didn't end up emitting all of the >>> tuples on the page (using a LIMIT clause), but I struggled to come up >>> with an example that qualified for the skip fetch optimization and also >>> returned before completing the scan. >>> >>> I could work a bit harder tomorrow to try and come up with something. >>> However, I think it might be safer to just change these to: >>> >>> scan->rs_empty_tuples_pending = 0 >>> >> >> Hmmm, good point. 
I haven't tried, but wouldn't something like "SELECT 1 >> FROM t WHERE column = X LIMIT 1" do the trick? Probably in a join, as a >> correlated subquery? > > Unfortunately (or fortunately, I guess) that exact thing won't work > because even constant values in the target list disqualify it for the > skip fetch optimization. > > Being a bit too lazy to look at planner code this morning, I removed > the target list requirement like this: > > - need_tuples = (node->ss.ps.plan->qual != NIL || > - node->ss.ps.plan->targetlist != NIL); > + need_tuples = (node->ss.ps.plan->qual != NIL); > > And can easily trip the assert with this: > > create table foo (a int); > insert into foo select i from generate_series(1,10)i; > create index on foo(a); > vacuum foo; > select 1 from (select 2 from foo limit 3); > > Anyway, I don't know if we could find a query that does actually hit > this. The only bitmap heap scan queries in the regress suite that meet > the >BitmapHeapScanState->ss.ps.plan->targetlist == NIL > condition are aggregates (all are count(*)). > > I'll dig a bit more later, but do you think this is worth adding an > open item for? Even though I don't have a repro yet? > Try this: create table t (a int, b int) with (fillfactor=10); insert into t select mod((i/22),2), (i/22) from generate_series(0,1000) S(i); create index on t(a); vacuum analyze t; set enable_indexonlyscan = off; set enable_seqscan = off; explain (analyze, verbose) select 1 from (values (1)) s(x) where exists (select * from t where a = x); KABOOM! 
#2 0x78a16ac5fafe in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26 #3 0x78a16ac4887f in __GI_abort () at abort.c:79 #4 0x00bb2c5a in ExceptionalCondition (conditionName=0xc42ba8 "scan->rs_empty_tuples_pending == 0", fileName=0xc429c8 "heapam.c", lineNumber=1090) at assert.c:66 #5 0x004f68bb in heap_endscan (sscan=0x19af3a0) at heapam.c:1090 #6 0x0077a94c in table_endscan (scan=0x19af3a0) at ../../../src/include/access/tableam.h:1001 So yeah, this assert is not quite correct. It's not breaking anything at the moment, so we can fix it now or add it as an open item. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: BitmapHeapScan streaming read user and prelim refactoring
On Sun, Apr 7, 2024 at 10:10 AM Tomas Vondra wrote: > > On 4/7/24 15:11, Melanie Plageman wrote: > > > Also, the iterators in the TableScanDescData might be something I > > could live with in the source code for a couple months before we make > > the rest of the changes in July+. But, adding them does push the > > TableScanDescData->rs_parallel member into the second cacheline, which > > will be true in versions of Postgres people are using for years. I > > didn't perf test, but seems bad. > > > > I haven't though about how it affects cachelines, TBH. I'd expect it to > have minimal impact, because while it makes this struct larger it should > make some other struct (used in essentially the same places) smaller. So > I'd guess this to be a zero sum game, but perhaps I'm wrong. Yea, to be honest, I didn't do extensive analysis. I just ran `pahole -C TableScanDescData` with the patch and on master and further convinced myself the whole thing was a bad idea. > For me the main question was "Is this the right place for this, even if > it's only temporary?" Yep. > > So, yes, unfortunately, I think we should pick up on the BHS saga in a > > few months. Or, actually, we should start focusing on that parallel > > BHS + 0 readahead bug and whether or not we are going to fix it. > > > > Yes, the July CF is a good time to focus on this, early in the cycle. I've pushed the entry to the next CF. Even though some of the patches were committed, I think it makes more sense to leave the CF item open until at least the prefetch table AM violation is removed. Then we could make a new CF entry for the streaming read user. When I get a chance, I'll post a full set of the outstanding patches to this thread -- including the streaming read-related refactoring and user. 
Oh, and, side note, in my previous email [1] about the empty_tuples_pending assert, I should mention that you need set enable_seqscan = off; set enable_indexscan = off; to force a bitmap heap scan and get the plan for my fake repro. - Melanie [1] https://www.postgresql.org/message-id/CAAKRu_Zg8Bj66OZD46Jd-ksh02OGrPR8z0JLPQqEZNEHASi6uw%40mail.gmail.com

Re: BitmapHeapScan streaming read user and prelim refactoring
On Sun, Apr 7, 2024 at 7:38 AM Tomas Vondra wrote: > > > > On 4/7/24 06:17, Melanie Plageman wrote: > > On Sun, Apr 07, 2024 at 02:27:43AM +0200, Tomas Vondra wrote: > >> On 4/6/24 23:34, Melanie Plageman wrote: > >>> ... > > I realized it makes more sense to add a FIXME (I used XXX. I'm not when > to use what) with a link to the message where Andres describes why he > thinks it is a bug. If we plan on fixing it, it is good to have a record > of that. And it makes it easier to put a clear and accurate comment. > Done in 0009. > > > OK, thanks. If think 0001-0008 are ready to go, with some minor tweaks > > per above (tuple vs. tuples etc.), and the question about the recheck > > flag. If you can do these tweaks, I'll get that committed today and we > > can try to get a couple more patches in tomorrow. > >>> > >>> Attached v19 rebases the rest of the commits from v17 over the first > >>> nine patches from v18. All patches 0001-0009 are unchanged from v18. I > >>> have made updates and done cleanup on 0010-0021. > >>> > >> > >> I've pushed 0001-0005, I'll get back to this tomorrow and see how much > >> more we can get in for v17. > > > > Thanks! I thought about it a bit more, and I got worried about the > > > > Assert(scan->rs_empty_tuples_pending == 0); > > > > in heap_rescan() and heap_endscan(). > > > > I was worried if we don't complete the scan it could end up tripping > > incorrectly. > > > > I tried to come up with a query which didn't end up emitting all of the > > tuples on the page (using a LIMIT clause), but I struggled to come up > > with an example that qualified for the skip fetch optimization and also > > returned before completing the scan. > > > > I could work a bit harder tomorrow to try and come up with something. > > However, I think it might be safer to just change these to: > > > > scan->rs_empty_tuples_pending = 0 > > > > Hmmm, good point. I haven't tried, but wouldn't something like "SELECT 1 > FROM t WHERE column = X LIMIT 1" do the trick? 
Probably in a join, as a > correlated subquery? Unfortunately (or fortunately, I guess) that exact thing won't work because even constant values in the target list disqualify it for the skip fetch optimization. Being a bit too lazy to look at planner code this morning, I removed the target list requirement like this: - need_tuples = (node->ss.ps.plan->qual != NIL || - node->ss.ps.plan->targetlist != NIL); + need_tuples = (node->ss.ps.plan->qual != NIL); And can easily trip the assert with this: create table foo (a int); insert into foo select i from generate_series(1,10)i; create index on foo(a); vacuum foo; select 1 from (select 2 from foo limit 3); Anyway, I don't know if we could find a query that does actually hit this. The only bitmap heap scan queries in the regress suite that meet the BitmapHeapScanState->ss.ps.plan->targetlist == NIL condition are aggregates (all are count(*)). I'll dig a bit more later, but do you think this is worth adding an open item for? Even though I don't have a repro yet? - Melanie
Re: BitmapHeapScan streaming read user and prelim refactoring
On 4/7/24 15:11, Melanie Plageman wrote: > On Sun, Apr 7, 2024 at 7:38 AM Tomas Vondra > wrote: >> >> On 4/7/24 06:17, Melanie Plageman wrote: What bothers me on 0006-0008 is that the justification in the commit messages is "future commit will do something". I think it's fine to have separate "preparatory" patches (I really like how you structured the patches this way), but it'd be good to have them right before that "future" commit - I'd like not to have one in v17 and then the "future commit" in v18, because that's an annoying complication for backpatching, (and probably also when implementing the AM?) etc. >>> >>> Yes, I was thinking about this also. >>> >> >> Good we're on the same page. > Having thought about this some more I think we need to stop here for > 17. v20-0001 and v20-0002 both make changes to the table AM API that > seem bizarre and unjustifiable without the other changes. Like, here > we changed all your parameters because someday we are going to do > something! You're welcome! > OK, I think that's essentially the "temporary breakage" that should not span multiple releases, I mentioned ~yesterday. I appreciate that you're careful about this. > Also, the iterators in the TableScanDescData might be something I > could live with in the source code for a couple months before we make > the rest of the changes in July+. But, adding them does push the > TableScanDescData->rs_parallel member into the second cacheline, which > will be true in versions of Postgres people are using for years. I > didn't perf test, but seems bad. > I haven't thought about how it affects cachelines, TBH. I'd expect it to have minimal impact, because while it makes this struct larger it should make some other struct (used in essentially the same places) smaller. So I'd guess this to be a zero-sum game, but perhaps I'm wrong. For me the main question was "Is this the right place for this, even if it's only temporary?" 
> So, yes, unfortunately, I think we should pick up on the BHS saga in a > few months. Or, actually, we should start focusing on that parallel > BHS + 0 readahead bug and whether or not we are going to fix it. > Yes, the July CF is a good time to focus on this, early in the cycle. > Sorry for the about-face. > No problem. I very much prefer this over something that may not be quite ready yet. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: BitmapHeapScan streaming read user and prelim refactoring
On Sun, Apr 7, 2024 at 7:38 AM Tomas Vondra wrote: > > On 4/7/24 06:17, Melanie Plageman wrote: > >> What bothers me on 0006-0008 is that the justification in the commit > >> messages is "future commit will do something". I think it's fine to have > >> a separate "prepareatory" patches (I really like how you structured the > >> patches this way), but it'd be good to have them right before that > >> "future" commit - I'd like not to have one in v17 and then the "future > >> commit" in v18, because that's annoying complication for backpatching, > >> (and probably also when implementing the AM?) etc. > > > > Yes, I was thinking about this also. > > > > Good we're on the same page. Having thought about this some more I think we need to stop here for 17. v20-0001 and v20-0002 both make changes to the table AM API that seem bizarre and unjustifiable without the other changes. Like, here we changed all your parameters because someday we are going to do something! You're welcome! Also, the iterators in the TableScanDescData might be something I could live with in the source code for a couple months before we make the rest of the changes in July+. But, adding them does push the TableScanDescData->rs_parallel member into the second cacheline, which will be true in versions of Postgres people are using for years. I didn't perf test, but seems bad. So, yes, unfortunately, I think we should pick up on the BHS saga in a few months. Or, actually, we should start focusing on that parallel BHS + 0 readahead bug and whether or not we are going to fix it. Sorry for the about-face. - Melanie
Re: BitmapHeapScan streaming read user and prelim refactoring
On 4/7/24 06:17, Melanie Plageman wrote: > On Sun, Apr 07, 2024 at 02:27:43AM +0200, Tomas Vondra wrote: >> On 4/6/24 23:34, Melanie Plageman wrote: >>> ... I realized it makes more sense to add a FIXME (I used XXX. I'm not when to use what) with a link to the message where Andres describes why he thinks it is a bug. If we plan on fixing it, it is good to have a record of that. And it makes it easier to put a clear and accurate comment. Done in 0009. > OK, thanks. If think 0001-0008 are ready to go, with some minor tweaks > per above (tuple vs. tuples etc.), and the question about the recheck > flag. If you can do these tweaks, I'll get that committed today and we > can try to get a couple more patches in tomorrow. >>> >>> Attached v19 rebases the rest of the commits from v17 over the first >>> nine patches from v18. All patches 0001-0009 are unchanged from v18. I >>> have made updates and done cleanup on 0010-0021. >>> >> >> I've pushed 0001-0005, I'll get back to this tomorrow and see how much >> more we can get in for v17. > > Thanks! I thought about it a bit more, and I got worried about the > > Assert(scan->rs_empty_tuples_pending == 0); > > in heap_rescan() and heap_endscan(). > > I was worried if we don't complete the scan it could end up tripping > incorrectly. > > I tried to come up with a query which didn't end up emitting all of the > tuples on the page (using a LIMIT clause), but I struggled to come up > with an example that qualified for the skip fetch optimization and also > returned before completing the scan. > > I could work a bit harder tomorrow to try and come up with something. > However, I think it might be safer to just change these to: > > scan->rs_empty_tuples_pending = 0 > Hmmm, good point. I haven't tried, but wouldn't something like "SELECT 1 FROM t WHERE column = X LIMIT 1" do the trick? Probably in a join, as a correlated subquery? It seemed OK to me and the buildfarm did not turn red, so I'd leave this until after the code freeze. 
>> What bothers me on 0006-0008 is that the justification in the commit >> messages is "future commit will do something". I think it's fine to have >> a separate "prepareatory" patches (I really like how you structured the >> patches this way), but it'd be good to have them right before that >> "future" commit - I'd like not to have one in v17 and then the "future >> commit" in v18, because that's annoying complication for backpatching, >> (and probably also when implementing the AM?) etc. > > Yes, I was thinking about this also. > Good we're on the same page. >> AFAICS for v19, the "future commit" for all three patches (0006-0008) is >> 0012, which introduces the unified iterator. Is that correct? > > Actually, those patches (v19 0006-0008) were required for v19 0009, > which is why I put them directly before it. 0009 eliminates use of the > TBMIterateResult for control flow in BitmapHeapNext(). > Ah, OK. Thanks for the clarification. > I've rephrased the commit messages to not mention future commits and > instead focus on what the changes in the commit are enabling. > > v19-0006 actually squashed very easily with v19-0009 and is actually > probably better that way. It is still easy to understand IMO. > > In v20, I've attached just the functionality from v19 0006-0009 but in > three patches instead of four. > Good. I'll take a look today. >> Also, for 0008 I'm not sure we could even split it between v17 and v18, >> because even if heapam did not use the iterator, what if some other AM >> uses it? Without 0012 it'd be a problem for the AM, no? > > The iterators in the TableScanDescData were introduced in v19-0009. It > is okay for other AMs to use it. In fact, they will want to use it. It > is still initialized and set up in BitmapHeapNext(). They would just > need to call tbm_iterate()/tbm_shared_iterate() on it. 
> > As for how table AMs will cope without the TBMIterateResult passed to > table_scan_bitmap_next_tuple() (which is what v19 0008 did): they can > save the location of the tuples to be scanned somewhere in their scan > descriptor. Heap AM already did this and actually didn't use the > TBMIterateResult at all. > The reason I feel a bit uneasy about putting this in TableScanDescData is I see that struct as a "description of the scan" (~input parameters defining what the scan should do), while for runtime state we have ScanState in execnodes.h. But maybe I'm wrong - I see we have similar runtime state in the other scan descriptors (like xs_itup/xs_hitup, kill prior tuple in index scans etc.). >> Would it make sense to move 0009 before these three patches? That seems >> like a meaningful change on it's own, right? > > Since v19 0009 requires these patches, I don't think we could do that. > I think up to and including 0009 would be an improvement in clarity and > function. Right, thanks for the correction. > > As for all the patches after 0009,
Re: BitmapHeapScan streaming read user and prelim refactoring
On Sun, Apr 07, 2024 at 02:27:43AM +0200, Tomas Vondra wrote: > On 4/6/24 23:34, Melanie Plageman wrote: > > ... > >> > >> I realized it makes more sense to add a FIXME (I used XXX. I'm not when > >> to use what) with a link to the message where Andres describes why he > >> thinks it is a bug. If we plan on fixing it, it is good to have a record > >> of that. And it makes it easier to put a clear and accurate comment. > >> Done in 0009. > >> > >>> OK, thanks. If think 0001-0008 are ready to go, with some minor tweaks > >>> per above (tuple vs. tuples etc.), and the question about the recheck > >>> flag. If you can do these tweaks, I'll get that committed today and we > >>> can try to get a couple more patches in tomorrow. > > > > Attached v19 rebases the rest of the commits from v17 over the first > > nine patches from v18. All patches 0001-0009 are unchanged from v18. I > > have made updates and done cleanup on 0010-0021. > > > > I've pushed 0001-0005, I'll get back to this tomorrow and see how much > more we can get in for v17. Thanks! I thought about it a bit more, and I got worried about the Assert(scan->rs_empty_tuples_pending == 0); in heap_rescan() and heap_endscan(). I was worried if we don't complete the scan it could end up tripping incorrectly. I tried to come up with a query which didn't end up emitting all of the tuples on the page (using a LIMIT clause), but I struggled to come up with an example that qualified for the skip fetch optimization and also returned before completing the scan. I could work a bit harder tomorrow to try and come up with something. However, I think it might be safer to just change these to: scan->rs_empty_tuples_pending = 0 > What bothers me on 0006-0008 is that the justification in the commit > messages is "future commit will do something". 
I think it's fine to have > a separate "prepareatory" patches (I really like how you structured the > patches this way), but it'd be good to have them right before that > "future" commit - I'd like not to have one in v17 and then the "future > commit" in v18, because that's annoying complication for backpatching, > (and probably also when implementing the AM?) etc. Yes, I was thinking about this also. > AFAICS for v19, the "future commit" for all three patches (0006-0008) is > 0012, which introduces the unified iterator. Is that correct? Actually, those patches (v19 0006-0008) were required for v19 0009, which is why I put them directly before it. 0009 eliminates use of the TBMIterateResult for control flow in BitmapHeapNext(). I've rephrased the commit messages to not mention future commits and instead focus on what the changes in the commit are enabling. v19-0006 actually squashed very easily with v19-0009 and is actually probably better that way. It is still easy to understand IMO. In v20, I've attached just the functionality from v19 0006-0009 but in three patches instead of four. > Also, for 0008 I'm not sure we could even split it between v17 and v18, > because even if heapam did not use the iterator, what if some other AM > uses it? Without 0012 it'd be a problem for the AM, no? The iterators in the TableScanDescData were introduced in v19-0009. It is okay for other AMs to use it. In fact, they will want to use it. It is still initialized and set up in BitmapHeapNext(). They would just need to call tbm_iterate()/tbm_shared_iterate() on it. As for how table AMs will cope without the TBMIterateResult passed to table_scan_bitmap_next_tuple() (which is what v19 0008 did): they can save the location of the tuples to be scanned somewhere in their scan descriptor. Heap AM already did this and actually didn't use the TBMIterateResult at all. > Would it make sense to move 0009 before these three patches? That seems > like a meaningful change on it's own, right? 
Since v19 0009 requires these patches, I don't think we could do that. I think up to and including 0009 would be an improvement in clarity and function. As for all the patches after 0009, I've dropped them from this version. We are out of time, and they need more thought. After we decided not to pursue streaming bitmapheapscan for 17, I wanted to make sure we removed the prefetch code table AM violation -- since we weren't deleting that code. So what started out as me looking for a way to clean up one commit ended up becoming a much larger project. Sorry about that last minute code explosion! I do think there is a way to do it right and make it nice. Also that violation would be gone if we figure out how to get streaming bitmapheapscan behaving correctly. So, there's just more motivation to make streaming bitmapheapscan awesome for 18! Given all that, I've only included the three patches I think we are considering (former v19 0006-0008). They are largely the same as you saw them last except for squashing the two commits I mentioned above and updating all of the commit messages. > FWIW I don't think it's very likely I'll commit the
Re: BitmapHeapScan streaming read user and prelim refactoring
On 4/6/24 23:34, Melanie Plageman wrote: > ... >> >> I realized it makes more sense to add a FIXME (I used XXX. I'm not sure when >> to use what) with a link to the message where Andres describes why he >> thinks it is a bug. If we plan on fixing it, it is good to have a record >> of that. And it makes it easier to put a clear and accurate comment. >> Done in 0009. >> >>> OK, thanks. I think 0001-0008 are ready to go, with some minor tweaks >>> per above (tuple vs. tuples etc.), and the question about the recheck >>> flag. If you can do these tweaks, I'll get that committed today and we >>> can try to get a couple more patches in tomorrow. > > Attached v19 rebases the rest of the commits from v17 over the first > nine patches from v18. All patches 0001-0009 are unchanged from v18. I > have made updates and done cleanup on 0010-0021. > I've pushed 0001-0005, I'll get back to this tomorrow and see how much more we can get in for v17. What bothers me on 0006-0008 is that the justification in the commit messages is "future commit will do something". I think it's fine to have separate "preparatory" patches (I really like how you structured the patches this way), but it'd be good to have them right before that "future" commit - I'd like not to have one in v17 and then the "future commit" in v18, because that's an annoying complication for backpatching, (and probably also when implementing the AM?) etc. AFAICS for v19, the "future commit" for all three patches (0006-0008) is 0012, which introduces the unified iterator. Is that correct? Also, for 0008 I'm not sure we could even split it between v17 and v18, because even if heapam did not use the iterator, what if some other AM uses it? Without 0012 it'd be a problem for the AM, no? Would it make sense to move 0009 before these three patches? That seems like a meaningful change on its own, right? FWIW I don't think it's very likely I'll commit the UnifiedTBMIterator stuff. 
I do agree with the idea in general, but I think I'd need more time to think about the details. Sorry about that ... regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: BitmapHeapScan streaming read user and prelim refactoring
On Sat, Apr 06, 2024 at 04:57:51PM +0200, Tomas Vondra wrote: > On 4/6/24 15:40, Melanie Plageman wrote: > > On Sat, Apr 06, 2024 at 02:51:45AM +0200, Tomas Vondra wrote: > >> > >> > >> On 4/6/24 01:53, Melanie Plageman wrote: > >>> On Fri, Apr 05, 2024 at 04:06:34AM -0400, Melanie Plageman wrote: > On Thu, Apr 04, 2024 at 04:35:45PM +0200, Tomas Vondra wrote: > > On 4/4/24 00:57, Melanie Plageman wrote: > >> On Sun, Mar 31, 2024 at 11:45:51AM -0400, Melanie Plageman wrote: > > I'd focus on the first ~8-9 commits or so for now, we can commit more if > > things go reasonably well. > > Sounds good. I will spend cleanup time on 0010-0013 tomorrow but would > love to know if you agree with the direction before I spend more time. > >>> > >>> In attached v16, I've split out 0010-0013 into 0011-0017. I think it is > >>> much easier to understand. > >>> > >> > >> Anyway, I've attached it as .tgz in order to not confuse cfbot. All the > >> review comments are marked with XXX, so grep for that in the patches. > >> There's two separate patches - the first one suggests a code change, so > >> it was better to not merge that with your code. The second has just a > >> couple XXX comments, I'm not sure why I kept it separate. > >> > >> A couple review comments: > >> > >> * I think 0001-0009 are 99% ready to. I reworded some of the commit > >> messages a bit - I realize it's a bit bold, considering you're native > >> speaker and I'm not. If you could check I didn't make it worse, that > >> would be great. > > > > Attached v17 has *only* patches 0001-0009 with these changes. I will > > work on applying the remaining patches, addressing feedback, and adding > > comments next. > > > > I have reviewed and incorporated all of your feedback on these patches. 
> > Attached v17 is your exact patches with 1 or 2 *very* slight tweaks to > > commit messages (comma splice removal and single word adjustments) as > > well as the changes listed below: > > > > I have open questions on the following: > > > > - 0003: should it be SO_NEED_TUPLES and need_tuples (instead of > > SO_NEED_TUPLE and need_tuple)? > > > > I think SO_NEED_TUPLES is more accurate, as we need all tuples from the > block. But either would work. Attached v18 changes it to TUPLES/tuples > > > - 0009 (your 0010) > > - Should I mention in the commit message that we added blockno and > > pfblockno in the BitmapHeapScanState only for validation or is > > that > > too specific? > > > > For the commit message I'd say it's too specific, I'd put it in the > comment before the struct. It is in the comment for the struct > > - I'm worried this comment is vague and/or potentially not totally > > correct. Should we remove it? I don't think we have conclusive > > proof > > that this is true. > > /* > >* Adjusting the prefetch iterator before invoking > >* table_scan_bitmap_next_block() keeps prefetch distance higher across > >* the parallel workers. > >*/ > > > > TBH it's not clear to me what "higher across parallel workers" means. > But it sure shouldn't claim things that we think may not be correct. I > don't have a good idea how to reword it, though. I realized it makes more sense to add a FIXME (I used XXX. I'm not sure when to use what) with a link to the message where Andres describes why he thinks it is a bug. If we plan on fixing it, it is good to have a record of that. And it makes it easier to put a clear and accurate comment. Done in 0009. > OK, thanks. I think 0001-0008 are ready to go, with some minor tweaks > per above (tuple vs. tuples etc.), and the question about the recheck > flag. If you can do these tweaks, I'll get that committed today and we > can try to get a couple more patches in tomorrow. Sounds good. 
- Melanie >From 3e80ba8914dd5876ab921ca39540be3b639236f7 Mon Sep 17 00:00:00 2001 From: Melanie Plageman Date: Mon, 12 Feb 2024 18:50:29 -0500 Subject: [PATCH v18 01/10] BitmapHeapScan: begin scan after bitmap creation It makes more sense for BitmapHeapScan to scan the index, build the bitmap, and only then begin the scan on the underlying table. This is mostly a cosmetic change for now, but later commits will need to pass parameters to table_beginscan_bm() that are unavailable in ExecInitBitmapHeapScan(). Author: Melanie Plageman Reviewed-by: Tomas Vondra, Andres Freund, Heikki Linnakangas Discussion: https://postgr.es/m/CAAKRu_ZwCwWFeL_H3ia26bP2e7HiKLWt0ZmGXPVwPO6uXq0vaA%40mail.gmail.com --- src/backend/executor/nodeBitmapHeapscan.c | 27 +-- 1 file changed, 20 insertions(+), 7 deletions(-) diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c index cee7f45aabe..c8c466e3c5c 100644 --- a/src/backend/executor/nodeBitmapHeapscan.c +++ b/src/backend/executor/nodeBitmapHeapscan.c @@ -178,6 +178,21 @@ BitmapHeapNext(BitmapHeapScanState *node) } #endif /*
Re: BitmapHeapScan streaming read user and prelim refactoring
Melanie Plageman writes: > On Sat, Apr 06, 2024 at 02:51:45AM +0200, Tomas Vondra wrote: >> * The one question I'm somewhat unsure about is why Tom chose to use the >> "wrong" recheck flag in the 2017 commit, when the correct recheck flag >> is readily available. Surely that had a reason, right? But I can't think >> of one ... > See above. Hi, I hadn't been paying attention to this thread, but Melanie pinged me off-list about this question. I think it's just a flat-out oversight in 7c70996eb. Looking at the mailing list thread (particularly [1][2]), it seems that Alexander hadn't really addressed the question of when to prefetch at all, but just skipped prefetch if the current page was skippable: + /* +* If we did not need to fetch the current page, +* we probably will not need to fetch the next. +*/ + return; It looks like I noticed that we could check the appropriate VM bits, but failed to notice that we could easily check the appropriate recheck flag as well. Feel free to change it. regards, tom lane [1] https://www.postgresql.org/message-id/a6434d5c-ed8d-b09c-a7c3-b2d1677e35b3%40postgrespro.ru [2] https://www.postgresql.org/message-id/5974.1509573988%40sss.pgh.pa.us
Re: BitmapHeapScan streaming read user and prelim refactoring
On 4/6/24 15:40, Melanie Plageman wrote: > On Sat, Apr 06, 2024 at 02:51:45AM +0200, Tomas Vondra wrote: >> >> >> On 4/6/24 01:53, Melanie Plageman wrote: >>> On Fri, Apr 05, 2024 at 04:06:34AM -0400, Melanie Plageman wrote: On Thu, Apr 04, 2024 at 04:35:45PM +0200, Tomas Vondra wrote: > On 4/4/24 00:57, Melanie Plageman wrote: >> On Sun, Mar 31, 2024 at 11:45:51AM -0400, Melanie Plageman wrote: > I'd focus on the first ~8-9 commits or so for now, we can commit more if > things go reasonably well. Sounds good. I will spend cleanup time on 0010-0013 tomorrow but would love to know if you agree with the direction before I spend more time. >>> >>> In attached v16, I've split out 0010-0013 into 0011-0017. I think it is >>> much easier to understand. >>> >> >> Anyway, I've attached it as .tgz in order to not confuse cfbot. All the >> review comments are marked with XXX, so grep for that in the patches. >> There's two separate patches - the first one suggests a code change, so >> it was better to not merge that with your code. The second has just a >> couple XXX comments, I'm not sure why I kept it separate. >> >> A couple review comments: >> >> * I think 0001-0009 are 99% ready to. I reworded some of the commit >> messages a bit - I realize it's a bit bold, considering you're native >> speaker and I'm not. If you could check I didn't make it worse, that >> would be great. > > Attached v17 has *only* patches 0001-0009 with these changes. I will > work on applying the remaining patches, addressing feedback, and adding > comments next. > > I have reviewed and incorporated all of your feedback on these patches. 
> Attached v17 is your exact patches with 1 or 2 *very* slight tweaks to > commit messages (comma splice removal and single word adjustments) as > well as the changes listed below: > > I have changed the following: > > - 0003 added an assert that rs_empty_tuples_pending is 0 on rescan and > endscan > OK > - 0004 (your 0005)-- I followed up with Tom, but for now I have just > removed the XXX and also reworded the message a bit > After the exercise I described a couple minutes ago, I think I'm convinced the assumption is unnecessary and we should use the correct recheck. Not that it'd make any difference in practice, considering none of the opclasses ever changes the recheck. Maybe the most prudent thing would be to skip this commit and maybe leave this for later, but I'm not forcing you to do that if it would mean a lot of disruption for the following patches. > - 0006 (your 0007) fixed up the variable name (you changed valid -> > valid_block but it had gotten changed back) > OK > I have open questions on the following: > > - 0003: should it be SO_NEED_TUPLES and need_tuples (instead of > SO_NEED_TUPLE and need_tuple)? > I think SO_NEED_TUPLES is more accurate, as we need all tuples from the block. But either would work. > - 0009 (your 0010) > - Should I mention in the commit message that we added blockno and > pfblockno in the BitmapHeapScanState only for validation or is > that > too specific? > For the commit message I'd say it's too specific, I'd put it in the comment before the struct. > - Should I mention that a future (imminent) commit will remove the > iterators from TableScanDescData and put them in > HeapScanDescData? I > imagine folks don't want those there, but it is easier for the > progression of commits to put them there first and then move > them > I'd try not to mention future commits as justification too often, if we don't know that the future commit lands shortly after. 
> - I'm worried this comment is vague and or potentially not totally > correct. Should we remove it? I don't think we have conclusive > proof > that this is true. > /* >* Adjusting the prefetch iterator before invoking >* table_scan_bitmap_next_block() keeps prefetch distance higher across >* the parallel workers. >*/ > TBH it's not clear to me what "higher across parallel workers" means. But it sure shouldn't claim things that we think may not be correct. I don't have a good idea how to reword it, though. > >> * I'm not sure extra_flags is the right way to pass the flag in 0003. >> The "extra_" name is a bit weird, and no other table AM functions do it >> this way and pass explicit bool flags instead. So my first "review" >> commit does it like that. Do you agree it's better that way? > > Yes. > Cool >> * The one question I'm somewhat unsure about is why Tom chose to use the >> "wrong" recheck flag in the 2017 commit, when the correct recheck flag >> is readily available. Surely that had a reason, right? But I can't think >> of one ... > > See above. > >>> While I was doing that, I realized that I should remove the call to >>> table_rescan() from ExecReScanBitmapHeapScan() and just rely on the
Re: BitmapHeapScan streaming read user and prelim refactoring
On 4/6/24 02:51, Tomas Vondra wrote:
>
> * The one question I'm somewhat unsure about is why Tom chose to use the
> "wrong" recheck flag in the 2017 commit, when the correct recheck flag
> is readily available. Surely that had a reason, right? But I can't think
> of one ...
>

I've been wondering about this a bit more, so I decided to experiment
and try to construct a case for which the current code prefetches the
wrong blocks, and the patch fixes that. But I haven't been very
successful so far :-(

My understanding was that the current code should do the wrong thing if
I alternate all-visible and not-all-visible pages. This understanding is
not correct, as I learned, because the thing that needs to change is the
recheck flag, not visibility :-( I'm still posting what I tried, perhaps
you will have an idea how to alter it to demonstrate the incorrect
behavior with current master.

The test was very simple:

  create table t (a int, b int) with (fillfactor=10);

  insert into t select mod((i/22),2), (i/22)
    from generate_series(0,1000) s(i);

  create index on t (a);

which creates a table with 46 pages, 22 rows per page, column "a"
alternates between 0/1 on pages, column "b" increments on each page (so
"b" identifies page).

and then

  delete from t where mod(b,8) = 0;

which deletes tuples on pages 0, 8, 16, 24, 32, 40, so these pages will
need to be prefetched as not-all-visible by this query

  explain analyze select count(1) from t where a = 0

when forced to do bitmap heap scan. The other even-numbered pages remain
all-visible. I added a bit of logging into BitmapPrefetch(), but even
with master I get this:

  LOG: prefetching block 8 0 current block 6 0
  LOG: prefetching block 16 0 current block 14 0
  LOG: prefetching block 24 0 current block 22 0
  LOG: prefetching block 32 0 current block 30 0
  LOG: prefetching block 40 0 current block 38 0

So it prefetches the correct pages (the other value is the recheck flag
for that block from the iterator result).
Turns out (and I realize the comment about the assumption actually
states that, I just failed to understand it) the thing that would have
to differ for the blocks is the recheck flag. But that can't actually
happen because that's set by the AM/opclass and the built-in ones do
essentially this:

  .../hash.c:    scan->xs_recheck = true;
  .../nbtree.c:  scan->xs_recheck = false;

gist opclasses (e.g. btree_gist):

  /* All cases served by this function are exact */
  *recheck = false;

spgist opclasses (e.g. geo_spgist):

  /* All tests are exact. */
  out->recheck = false;

If there's an opclass that alters the recheck flag, it's well hidden and
I missed it.

Anyway, after this exercise and learning more about the recheck flag, I
think I agree the assumption is unnecessary. It's pretty harmless
because none of the built-in opclasses alters the recheck flag, but the
correct recheck flag is readily available. I'm still a bit puzzled why
the 2017 commit even relied on this assumption, though.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: BitmapHeapScan streaming read user and prelim refactoring
On Sat, Apr 06, 2024 at 02:51:45AM +0200, Tomas Vondra wrote: > > > On 4/6/24 01:53, Melanie Plageman wrote: > > On Fri, Apr 05, 2024 at 04:06:34AM -0400, Melanie Plageman wrote: > >> On Thu, Apr 04, 2024 at 04:35:45PM +0200, Tomas Vondra wrote: > >>> On 4/4/24 00:57, Melanie Plageman wrote: > On Sun, Mar 31, 2024 at 11:45:51AM -0400, Melanie Plageman wrote: > >>> I'd focus on the first ~8-9 commits or so for now, we can commit more if > >>> things go reasonably well. > >> > >> Sounds good. I will spend cleanup time on 0010-0013 tomorrow but would > >> love to know if you agree with the direction before I spend more time. > > > > In attached v16, I've split out 0010-0013 into 0011-0017. I think it is > > much easier to understand. > > > > Anyway, I've attached it as .tgz in order to not confuse cfbot. All the > review comments are marked with XXX, so grep for that in the patches. > There's two separate patches - the first one suggests a code change, so > it was better to not merge that with your code. The second has just a > couple XXX comments, I'm not sure why I kept it separate. > > A couple review comments: > > * I think 0001-0009 are 99% ready to. I reworded some of the commit > messages a bit - I realize it's a bit bold, considering you're native > speaker and I'm not. If you could check I didn't make it worse, that > would be great. Attached v17 has *only* patches 0001-0009 with these changes. I will work on applying the remaining patches, addressing feedback, and adding comments next. I have reviewed and incorporated all of your feedback on these patches. 
Attached v17 is your exact patches with 1 or 2 *very* slight tweaks to
commit messages (comma splice removal and single word adjustments) as
well as the changes listed below:

I have changed the following:

- 0003 added an assert that rs_empty_tuples_pending is 0 on rescan and
  endscan

- 0004 (your 0005) -- I followed up with Tom, but for now I have just
  removed the XXX and also reworded the message a bit

- 0006 (your 0007) fixed up the variable name (you changed valid ->
  valid_block but it had gotten changed back)

I have open questions on the following:

- 0003: should it be SO_NEED_TUPLES and need_tuples (instead of
  SO_NEED_TUPLE and need_tuple)?

- 0009 (your 0010)

  - Should I mention in the commit message that we added blockno and
    pfblockno in the BitmapHeapScanState only for validation or is that
    too specific?

  - Should I mention that a future (imminent) commit will remove the
    iterators from TableScanDescData and put them in HeapScanDescData? I
    imagine folks don't want those there, but it is easier for the
    progression of commits to put them there first and then move them

  - I'm worried this comment is vague and or potentially not totally
    correct. Should we remove it? I don't think we have conclusive proof
    that this is true.

    /*
     * Adjusting the prefetch iterator before invoking
     * table_scan_bitmap_next_block() keeps prefetch distance higher
     * across the parallel workers.
     */

> * I'm not sure extra_flags is the right way to pass the flag in 0003.
> The "extra_" name is a bit weird, and no other table AM functions do it
> this way and pass explicit bool flags instead. So my first "review"
> commit does it like that. Do you agree it's better that way?

Yes.

> * The one question I'm somewhat unsure about is why Tom chose to use the
> "wrong" recheck flag in the 2017 commit, when the correct recheck flag
> is readily available. Surely that had a reason, right? But I can't think
> of one ...

See above.
>> While I was doing that, I realized that I should remove the call to
>> table_rescan() from ExecReScanBitmapHeapScan() and just rely on the new
>> table_rescan_bm() invoked from BitmapHeapNext(). That is done in the
>> attached.
>>
>> 0010-0018 still need comments updated but I focused on getting the split
>> out, reviewable version of them ready. I'll add comments (especially to
>> 0011 table AM functions) tomorrow. I also have to double-check if I
>> should add any asserts for table AMs about having implemented all of the
>> new begin/re/endscan() functions.
>
> I added a couple more comments for those patches (10-12). Chances are
> the split in v16 clarifies some of my questions, but it'll have to wait
> till the morning ...

Will address this in next mail.

- Melanie

>From e65a9baa52c99c93a9a9a281aebacd38b8299721 Mon Sep 17 00:00:00 2001
From: Melanie Plageman
Date: Mon, 12 Feb 2024 18:50:29 -0500
Subject: [PATCH v17 1/9] BitmapHeapScan: begin scan after bitmap creation

It makes more sense for BitmapHeapScan to scan the index, build the
bitmap, and only then begin the scan on the underlying table. This is
mostly a cosmetic change for now, but later commits will need to pass
parameters to table_beginscan_bm() that are
Re: BitmapHeapScan streaming read user and prelim refactoring
On 4/6/24 01:53, Melanie Plageman wrote:
> On Fri, Apr 05, 2024 at 04:06:34AM -0400, Melanie Plageman wrote:
>> On Thu, Apr 04, 2024 at 04:35:45PM +0200, Tomas Vondra wrote:
>>> On 4/4/24 00:57, Melanie Plageman wrote:
>>>> On Sun, Mar 31, 2024 at 11:45:51AM -0400, Melanie Plageman wrote:
>>>>> On Fri, Mar 29, 2024 at 12:05:15PM +0100, Tomas Vondra wrote:
>>>>>> On 3/29/24 02:12, Thomas Munro wrote:
>>>>>>> On Fri, Mar 29, 2024 at 10:43 AM Tomas Vondra wrote:
>>>>>>>> I think there's some sort of bug, triggering this assert in heapam
>>>>>>>>
>>>>>>>>   Assert(BufferGetBlockNumber(hscan->rs_cbuf) == tbmres->blockno);
>>>>>>>
>>>>>>> Thanks for the repro. I can't seem to reproduce it (still trying) but
>>>>>>> I assume this is with Melanie's v11 patch set which had
>>>>>>> v11-0016-v10-Read-Stream-API.patch.
>>>>>>>
>>>>>>> Would you mind removing that commit and instead applying the v13
>>>>>>> stream_read.c patches[1]? v10 stream_read.c was a little confused
>>>>>>> about random I/O combining, which I fixed with a small adjustment to
>>>>>>> the conditions for the "if" statement right at the end of
>>>>>>> read_stream_look_ahead(). Sorry about that. The fixed version, with
>>>>>>> eic=4, with your test query using WHERE a < a, ends its scan with:
>>>>>>
>>>>>> I'll give that a try. Unfortunately the v11 still has the
>>>>>> problem I reported about a week ago:
>>>>>>
>>>>>>   ERROR: prefetch and main iterators are out of sync
>>>>>>
>>>>>> So I can't run the full benchmarks :-( but master vs. streaming read API
>>>>>> should work, I think.
>>>>>
>>>>> Odd, I didn't notice you reporting this ERROR popping up. Now that I
>>>>> take a look, v11 (at least, maybe also v10) had this very silly mistake:
>>>>>
>>>>>   if (scan->bm_parallel == NULL &&
>>>>>       scan->rs_pf_bhs_iterator &&
>>>>>       hscan->pfblockno > hscan->rs_base.blockno)
>>>>>       elog(ERROR, "prefetch and main iterators are out of sync");
>>>>>
>>>>> It errors out if the prefetch block is ahead of the current block --
>>>>> which is the opposite of what we want. I've fixed this in attached v12.
> > This version also has v13 of the streaming read API. I noticed one > mistake in my bitmapheap scan streaming read user -- it freed the > streaming read object at the wrong time. I don't know if this was > causing any other issues, but it at least is fixed in this version. Attached v13 is rebased over master (which includes the streaming read API now). I also reset the streaming read object on rescan instead of creating a new one each time. I don't know how much chance any of this has of going in to 17 now, but I thought I would start looking into the regression repro Tomas provided in [1]. >>> >>> My personal opinion is that we should try to get in as many of the the >>> refactoring patches as possible, but I think it's probably too late for >>> the actual switch to the streaming API. >> >> Cool. In the attached v15, I have dropped all commits that are related >> to the streaming read API and included *only* commits that are >> beneficial to master. A few of the commits are merged or reordered as >> well. >> >> While going through the commits with this new goal in mind (forget about >> the streaming read API for now), I realized that it doesn't make much >> sense to just eliminate the layering violation for the current block and >> leave it there for the prefetch block. I had de-prioritized solving this >> when I thought we would just delete the prefetch code and replace it >> with the streaming read. >> >> Now that we aren't doing that, I've spent the day trying to resolve the >> issues with pushing the prefetch code into heapam.c that I cited in [1]. >> 0010 - 0013 are the result of this. They are not very polished yet and >> need more cleanup and review (especially 0011, which is probably too >> large), but I am happy with the solution I came up with. >> >> Basically, there are too many members needed for bitmap heap scan to put >> them all in the HeapScanDescData (don't want to bloat it). 
So, I've made >> a new BitmapHeapScanDescData and associated begin/rescan/end() functions >> >> In the end, with all patches applied, BitmapHeapNext() loops invoking >> table_scan_bitmap_next_tuple() and table AMs can implement that however >> they choose. >> I'm also not sure if I should try and group the commits into fewer commits now or wait until I have some idea of whether or not the approach in 0013 and 0014 is worth pursuing. >>> >>> You mean whether to pursue the approach in general, or for v17? I think >>> it looks like the right approach, but for v17 see above :-( >>> >>> As for merging, I wouldn't do that. I looked at the commits and while >>> some of them seem somewhat "trivial", I really like how you organized >>> the commits, and kept those that just "move" code around, and those that >>> actually
Re: BitmapHeapScan streaming read user and prelim refactoring
On Sat, Mar 30, 2024 at 12:34 PM Tomas Vondra wrote: > Hmmm. I admit I didn't think about the "always prefetch" flag too much, > but I did imagine it'd only affect some places (e.g. BHS, but not for > sequential scans). If it could be done by lowering the combine limit, > that could work too - in fact, I was wondering if we should have combine > limit as a tablespace parameter too. Good idea! Will add. Planning to commit the basic patch very soon, I'm just thinking about what to do with Heikki's recent optimisation feedback (ie can it be done as follow-up, he thinks so, I'm thinking about that today as time is running short). > But I think adding such knobs should be only the last resort - I myself > don't know how to set these parameters, how could we expect users to > pick good values? Better to have something that "just works". Agreed. > I admit I never 100% understood when exactly the kernel RA kicks in, but > I always thought it's enough for the patterns to be only "close enough" > to sequential. Isn't the problem that this only skips fadvise for 100% > sequential patterns, but keeps prefetching for cases the RA would deal > on it's own? So maybe we should either relax the conditions when to skip > fadvise, or combine even pages that are not perfectly sequential (I'm > not sure if that's possible only for fadvise), though. Yes that might be worth considering, if we know/guess what the OS RA window size is for a tablespace. I will post a patch for that for consideration/testing as a potential follow-up as it's super easy, just for experimentation. I just fear that it's getting into the realms of "hard to explain/understand" but on the other hand I guess we already have the mechanism and have to explain it.
Re: BitmapHeapScan streaming read user and prelim refactoring
On Sat, Mar 30, 2024 at 12:40 PM Tomas Vondra wrote: > Sorry, I meant the prefetch (readahead) built into ZFS. I may be wrong > but I don't think the regular RA (in linux kernel) works for ZFS, right? Right, it separate page cache ("ARC") and prefetch settings: https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/Module%20Parameters.html That's probably why Linux posix_fadvise didn't affect it, well that and the fact, at a wild guess, that Solaris didn't have that system call... > I was wondering if we could use this (posix_fadvise) to improve that, > essentially by issuing fadvise even for sequential patterns. But now > that I think about that, if posix_fadvise works since 2.2, maybe RA > works too now?) It should work fine. I am planning to look into this a bit some day soon -- I think there may be some interesting interactions between systems with big pages/records like ZFS/BTRFS/... and io_combine_limit that might offer interesting optimisation tweak opportunities, but first things first...
Re: BitmapHeapScan streaming read user and prelim refactoring
On 3/29/24 22:39, Thomas Munro wrote:
> ...
>
>> I don't recall seeing a system with disabled readahead, but I'm
>> sure there are cases where it may not really work - it clearly can't
>> work with direct I/O, ...
>
> Right, for direct I/O everything is slow right now including seq scan.
> We need to start asynchronous reads in the background (imagine
> literally just a bunch of background "I/O workers" running preadv() on
> your behalf to get your future buffers ready for you, or equivalently
> Linux io_uring). That's the real goal of this project: restructuring
> so we have the information we need to do that, ie teach every part of
> PostgreSQL to predict the future in a standard and centralised way.
> Should work out better than RA heuristics, because we're not just
> driving in a straight line, we can turn corners too.
>
>> ... but I've also not been very successful with
>> prefetching on ZFS.
>
> posix_fadvise() did not do anything in OpenZFS before 2.2, maybe you
> have an older version?

Sorry, I meant the prefetch (readahead) built into ZFS. I may be wrong
but I don't think the regular RA (in linux kernel) works for ZFS, right?

I was wondering if we could use this (posix_fadvise) to improve that,
essentially by issuing fadvise even for sequential patterns. But now
that I think about that, if posix_fadvise works since 2.2, maybe RA
works too now?)

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: BitmapHeapScan streaming read user and prelim refactoring
On 3/29/24 23:03, Thomas Munro wrote:
> On Sat, Mar 30, 2024 at 10:39 AM Thomas Munro wrote:
>> On Sat, Mar 30, 2024 at 4:53 AM Tomas Vondra wrote:
>>> ... Maybe there should be some flag to force
>>> issuing fadvise even for sequential patterns, perhaps at the tablespace
>>> level? ...
>>
>> Yeah, I've wondered about trying harder to "second guess" the Linux
>> RA. At the moment, read_stream.c detects *exactly* sequential reads
>> (see seq_blocknum) to suppress advice, but if we knew/guessed the RA
>> window size, we could (1) detect it with the same window that Linux
>> will use to detect it, and (2) [new realisation from yesterday's
>> testing] we could even "tickle" it to wake it up in certain cases
>> where it otherwise wouldn't, by temporarily using a smaller
>> io_combine_limit if certain patterns come along. I think that sounds
>> like madness (I suspect that any place where the latter would help is
>> a place where you could turn RA up a bit higher for the same effect
>> without weird kludges), or another way to put it would be to call it
>> "overfitting" to the pre-existing quirks; but maybe it's a future
>> research idea...

I don't know if I'd call this overfitting - yes, we certainly don't want
to tailor this code to only work with the linux RA, but OTOH the RA is
what most systems do. And if we plan to rely on that, we probably have
to "respect" how it works ... Moving to a "clean" approach that however
triggers regressions does not seem like a great thing for users.

I'm not saying the goal has to be "no regressions", that would be rather
impossible. At this point I'm still trying to understand what's causing
this.

BTW are you suggesting that increasing the RA distance could maybe fix
the regressions? I can give it a try, but I was assuming that 128kB
readahead would be enough for combine_limit=8kB.
> I guess I missed a step when responding that suggestion: I don't think
> we could have an "issue advice always" flag, because it doesn't seem
> to work out as well as letting the kernel do it, and a global flag
> like that would affect everything else including sequential scans
> (once the streaming seq scan patch goes in). But suppose we could do
> that, maybe even just for BHS. In my little test yesterday I had to
> issue a lot of them, patched eic=4, to beat the kernel's RA with
> unpatched eic=0:
>
>   eic   unpatched   patched
>   0     4172        9572
>   1     30846       10376
>   2     18435       5562
>   4     18980       3503
>
> So if we forced fadvise to be issued with a GUC, it still wouldn't be
> good enough in this case. So we might need to try to understand what
> exactly is waking the RA up for unpatched but not patched, and try to
> tickle it by doing a little less I/O combining (for example just
> setting io_combine_limit=1 gives the same number for eic=0, a major
> clue), but that seems to be going down a weird path, and tuning such a
> copying algorithm seems too hard.

Hmmm. I admit I didn't think about the "always prefetch" flag too much,
but I did imagine it'd only affect some places (e.g. BHS, but not for
sequential scans). If it could be done by lowering the combine limit,
that could work too - in fact, I was wondering if we should have combine
limit as a tablespace parameter too.

But I think adding such knobs should be only the last resort - I myself
don't know how to set these parameters, how could we expect users to
pick good values? Better to have something that "just works".

I admit I never 100% understood when exactly the kernel RA kicks in, but
I always thought it's enough for the patterns to be only "close enough"
to sequential. Isn't the problem that this only skips fadvise for 100%
sequential patterns, but keeps prefetching for cases the RA would deal
with on its own?
So maybe we should either relax the conditions for when to skip fadvise,
or combine even pages that are not perfectly sequential (I'm not sure if
that's possible only for fadvise), though.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: BitmapHeapScan streaming read user and prelim refactoring
On Sat, Mar 30, 2024 at 10:39 AM Thomas Munro wrote:
> On Sat, Mar 30, 2024 at 4:53 AM Tomas Vondra wrote:
>> ... Maybe there should be some flag to force
>> issuing fadvise even for sequential patterns, perhaps at the tablespace
>> level? ...
>
> Yeah, I've wondered about trying harder to "second guess" the Linux
> RA. At the moment, read_stream.c detects *exactly* sequential reads
> (see seq_blocknum) to suppress advice, but if we knew/guessed the RA
> window size, we could (1) detect it with the same window that Linux
> will use to detect it, and (2) [new realisation from yesterday's
> testing] we could even "tickle" it to wake it up in certain cases
> where it otherwise wouldn't, by temporarily using a smaller
> io_combine_limit if certain patterns come along. I think that sounds
> like madness (I suspect that any place where the latter would help is
> a place where you could turn RA up a bit higher for the same effect
> without weird kludges), or another way to put it would be to call it
> "overfitting" to the pre-existing quirks; but maybe it's a future
> research idea...

I guess I missed a step when responding that suggestion: I don't think
we could have an "issue advice always" flag, because it doesn't seem to
work out as well as letting the kernel do it, and a global flag like
that would affect everything else including sequential scans (once the
streaming seq scan patch goes in). But suppose we could do that, maybe
even just for BHS. In my little test yesterday I had to issue a lot of
them, patched eic=4, to beat the kernel's RA with unpatched eic=0:

  eic   unpatched   patched
  0     4172        9572
  1     30846       10376
  2     18435       5562
  4     18980       3503

So if we forced fadvise to be issued with a GUC, it still wouldn't be
good enough in this case.
So we might need to try to understand what exactly is waking the RA up for unpatched but not patched, and try to tickle it by doing a little less I/O combining (for example just setting io_combine_limit=1 gives the same number for eic=0, a major clue), but that seems to be going down a weird path, and tuning such a copying algorithm seems too hard.
Re: BitmapHeapScan streaming read user and prelim refactoring
On Sat, Mar 30, 2024 at 4:53 AM Tomas Vondra wrote: > Two observations: > > * The combine limit seems to have negligible impact. There's no visible > difference between combine_limit=8kB and 128kB. > > * Parallel queries seem to work about the same as master (especially for > optimal cases, but even for not optimal ones). > > > The optimal plans with kernel readahead (two charts in the first row) > look fairly good. There are a couple regressed cases, but a bunch of > faster ones too. Thanks for doing this! > The optimal plans without kernel read ahead (two charts in the second > row) perform pretty poorly - there are massive regressions. But I think > the obvious reason is that the streaming read API skips prefetches for > sequential access patterns, relying on kernel to do the readahead. But > if the kernel readahead is disabled for the device, that obviously can't > happen ... Right, it does seem that this whole concept is sensitive on the 'borderline' between sequential and random, and this patch changes that a bit and we lose some. It's becoming much clearer to me that master is already exposing weird kinks, and the streaming version is mostly better, certainly on low IOPS systems. I suspect that there must be queries in the wild that would run much faster with eic=0 than eic=1 today due to that, and while the streaming version also loses in some cases, it seems that it mostly loses because of not triggering RA, which can at least be improved by increasing the RA window. On the flip side, master is more prone to running out of IOPS and there is no way to tune your way out of that. > I think the question is how much we can (want to) rely on the readahead > to be done by the kernel. ... We already rely on it everywhere, for basic things like sequential scan. > ... Maybe there should be some flag to force > issuing fadvise even for sequential patterns, perhaps at the tablespace > level? ... Yeah, I've wondered about trying harder to "second guess" the Linux RA. 
At the moment, read_stream.c detects *exactly* sequential reads (see
seq_blocknum) to suppress advice, but if we knew/guessed the RA window
size, we could (1) detect it with the same window that Linux will use to
detect it, and (2) [new realisation from yesterday's testing] we could
even "tickle" it to wake it up in certain cases where it otherwise
wouldn't, by temporarily using a smaller io_combine_limit if certain
patterns come along. I think that sounds like madness (I suspect that
any place where the latter would help is a place where you could turn RA
up a bit higher for the same effect without weird kludges), or another
way to put it would be to call it "overfitting" to the pre-existing
quirks; but maybe it's a future research idea...

> I don't recall seeing a system with disabled readahead, but I'm
> sure there are cases where it may not really work - it clearly can't
> work with direct I/O, ...

Right, for direct I/O everything is slow right now including seq scan.
We need to start asynchronous reads in the background (imagine literally
just a bunch of background "I/O workers" running preadv() on your behalf
to get your future buffers ready for you, or equivalently Linux
io_uring). That's the real goal of this project: restructuring so we
have the information we need to do that, ie teach every part of
PostgreSQL to predict the future in a standard and centralised way.
Should work out better than RA heuristics, because we're not just
driving in a straight line, we can turn corners too.

> ... but I've also not been very successful with
> prefetching on ZFS.

posix_fadvise() did not do anything in OpenZFS before 2.2, maybe you
have an older version?

> I certainly admit the data sets are synthetic and perhaps adversarial.
> My intent was to cover a wide range of data sets, to trigger even less
> common cases.
It's certainly up to debate how serious the regressions on > those data sets are in practice, I'm not suggesting "this strange data > set makes it slower than master, so we can't commit this". Right, yeah. Thanks! Your initial results seemed discouraging, but looking closer I'm starting to feel a lot more positive about streaming BHS.
Re: BitmapHeapScan streaming read user and prelim refactoring
On Sat, Mar 30, 2024 at 12:17 AM Thomas Munro wrote:

> eic  unpatched  patched
> 0         4172     9572
> 1        30846    10376
> 2        18435     5562
> 4        18980     3503
> 8        18980     2680
> 16       18976     3233

... but the patched version gets down to a low number for eic=0 too if you turn up the blockdev --setra so that it also gets Linux RA treatment, making it the clear winner on all eic settings. Unpatched doesn't improve. So, for low IOPS storage at least, when you're on the borderline between random and sequential, ie bitmap with a lot of 1s in it, it seems there are cases where patched doesn't trigger Linux RA but unpatched does, and you can tune your way out of that, and then there are cases where the IOPS limit is reached due to small reads, but patched does better because of larger I/Os that are likely under the same circumstances. Does that make sense?
Re: BitmapHeapScan streaming read user and prelim refactoring
I spent a bit of time today testing Melanie's v11, except with read_stream.c v13, on Linux, ext4, and 3000 IOPS cloud storage. I think I now know roughly what's going on. Here are some numbers, using your random table from above and a simple SELECT * FROM t WHERE a < 100 OR a = 123456. I'll keep parallelism out of this for now. These are milliseconds:

eic  unpatched  patched
0         4172     9572
1        30846    10376
2        18435     5562
4        18980     3503
8        18980     2680
16       18976     3233

So with eic=0, unpatched wins. The reason is that Linux readahead wakes up and scans the table at 150MB/s, because there are enough clusters to trigger it. But patched doesn't look quite so sequential because we removed the sequential accesses by I/O combining... At eic=1, unpatched completely collapses. I'm not sure why exactly. Once you go above eic=1, Linux seems to get out of the way and just do what we asked it to do: iostat shows exactly 3000 IOPS, exactly 8KB avg read size, and (therefore) throughput of 24MB/sec, though you can see the queue depth being exactly what we asked it to do, eg 7.9 or whatever for eic=8, while patched eats it for breakfast because it issues wide requests, averaging around 27KB. It seems more informative to look at the absolute numbers rather than the A/B ratios, because then you can see how the numbers themselves are already completely nuts, sort of interference patterns from interaction with kernel heuristics. On the other hand this might be a pretty unusual data distribution. People who store random numbers or hashes or whatever probably don't really search for ranges of them (unless they're trying to mine bitcoins in SQL). I dunno. Maybe we need more realistic tests, or maybe we're just discovering all the things that are bad about the pre-existing code.
Re: BitmapHeapScan streaming read user and prelim refactoring
On 3/29/24 02:12, Thomas Munro wrote:
> On Fri, Mar 29, 2024 at 10:43 AM Tomas Vondra wrote:
>> I think there's some sort of bug, triggering this assert in heapam
>>
>> Assert(BufferGetBlockNumber(hscan->rs_cbuf) == tbmres->blockno);
>
> Thanks for the repro. I can't seem to reproduce it (still trying) but
> I assume this is with Melanie's v11 patch set which had
> v11-0016-v10-Read-Stream-API.patch.
>
> Would you mind removing that commit and instead applying the v13
> stream_read.c patches[1]? v10 stream_read.c was a little confused
> about random I/O combining, which I fixed with a small adjustment to
> the conditions for the "if" statement right at the end of
> read_stream_look_ahead(). Sorry about that. The fixed version, with
> eic=4, with your test query using WHERE a < a, ends its scan with:

I'll give that a try. Unfortunately the v11 still has the problem I reported about a week ago:

ERROR: prefetch and main iterators are out of sync

So I can't run the full benchmarks :-( but master vs. streaming read API should work, I think.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: BitmapHeapScan streaming read user and prelim refactoring
On Fri, Mar 29, 2024 at 10:43 AM Tomas Vondra wrote: > I think there's some sort of bug, triggering this assert in heapam > > Assert(BufferGetBlockNumber(hscan->rs_cbuf) == tbmres->blockno); Thanks for the repro. I can't seem to reproduce it (still trying) but I assume this is with Melanie's v11 patch set which had v11-0016-v10-Read-Stream-API.patch. Would you mind removing that commit and instead applying the v13 stream_read.c patches[1]? v10 stream_read.c was a little confused about random I/O combining, which I fixed with a small adjustment to the conditions for the "if" statement right at the end of read_stream_look_ahead(). Sorry about that. The fixed version, with eic=4, with your test query using WHERE a < a, ends its scan with: ... posix_fadvise(32,0x28aee000,0x4000,POSIX_FADV_WILLNEED) = 0 (0x0) pread(32,"\0\0\0\0@4\M-5:\0\0\^D\0\M-x\^A"...,40960,0x28acc000) = 40960 (0xa000) posix_fadvise(32,0x28af4000,0x4000,POSIX_FADV_WILLNEED) = 0 (0x0) pread(32,"\0\0\0\0\^XC\M-6:\0\0\^D\0\M-x"...,32768,0x28ad8000) = 32768 (0x8000) posix_fadvise(32,0x28afc000,0x4000,POSIX_FADV_WILLNEED) = 0 (0x0) pread(32,"\0\0\0\0\M-XQ\M-7:\0\0\^D\0\M-x"...,24576,0x28ae4000) = 24576 (0x6000) posix_fadvise(32,0x28b02000,0x8000,POSIX_FADV_WILLNEED) = 0 (0x0) pread(32,"\0\0\0\0\M^@3\M-8:\0\0\^D\0\M-x"...,16384,0x28aee000) = 16384 (0x4000) pread(32,"\0\0\0\0\M-`\M-:\M-8:\0\0\^D\0"...,16384,0x28af4000) = 16384 (0x4000) pread(32,"\0\0\0\0po\M-9:\0\0\^D\0\M-x\^A"...,16384,0x28afc000) = 16384 (0x4000) pread(32,"\0\0\0\0\M-P\M-v\M-9:\0\0\^D\0"...,32768,0x28b02000) = 32768 (0x8000) In other words it's able to coalesce, but v10 was a bit b0rked in that respect and wouldn't do as well at that. Then if you set io_combine_limit = 1, it looks more like master, eg lots of little reads, but not as many fadvises as master because of sequential access: ... 
posix_fadvise(32,0x28af4000,0x2000,POSIX_FADV_WILLNEED) = 0 (0x0) -+
pread(32,...,8192,0x28ae8000) = 8192 (0x2000)  |
pread(32,...,8192,0x28aee000) = 8192 (0x2000)  |
posix_fadvise(32,0x28afc000,0x2000,POSIX_FADV_WILLNEED) = 0 (0x0) ---+
pread(32,...,8192,0x28af) = 8192 (0x2000)  |  |
pread(32,...,8192,0x28af4000) = 8192 (0x2000) <+  |
posix_fadvise(32,0x28b02000,0x2000,POSIX_FADV_WILLNEED) = 0 (0x0) -+
pread(32,...,8192,0x28af6000) = 8192 (0x2000)  |  |
pread(32,...,8192,0x28afc000) = 8192 (0x2000) <--+  |
pread(32,...,8192,0x28afe000) = 8192 (0x2000)  }-- no advice  |
pread(32,...,8192,0x28b02000) = 8192 (0x2000) <+
pread(32,...,8192,0x28b04000) = 8192 (0x2000)  }
pread(32,...,8192,0x28b06000) = 8192 (0x2000)  }-- no advice
pread(32,...,8192,0x28b08000) = 8192 (0x2000)  }

It becomes slightly less eager to start I/Os as soon as io_combine_limit > 1, because when it has hit max_ios, if ... yeah if the average block that it can combine is bigger than 4, an arbitrary number from:

max_pinned_buffers = Max(max_ios * 4, io_combine_limit);

then it can run out of look ahead window before it can reach max_ios (aka eic), so that's a kind of arbitrary/bogus I/O depth constraint, which is another way of saying what I was saying earlier: maybe it just needs more distance. So let's see the average combined I/O length in your test query... for me it works out to 27,169 bytes. But I think there must be times when it runs out of window due to clustering. So you could also try increasing that 4->8 to see what happens to performance.

[1] https://www.postgresql.org/message-id/CA%2BhUKG%2B5UofvseJWv6YqKmuc_%3Drguc7VqKcNEG1eawKh3MzHXQ%40mail.gmail.com
Re: BitmapHeapScan streaming read user and prelim refactoring
On 3/27/24 20:37, Melanie Plageman wrote: > On Mon, Mar 25, 2024 at 12:07:09PM -0400, Melanie Plageman wrote: >> On Sun, Mar 24, 2024 at 06:37:20PM -0400, Melanie Plageman wrote: >>> On Sun, Mar 24, 2024 at 5:59 PM Tomas Vondra >>> wrote: BTW when you say "up to 'Make table_scan_bitmap_next_block() async friendly'" do you mean including that patch, or that this is the first patch that is not one of the independently useful patches. >>> >>> I think the code is easier to understand with "Make >>> table_scan_bitmap_next_block() async friendly". Prior to that commit, >>> table_scan_bitmap_next_block() could return false even when the bitmap >>> has more blocks and expects the caller to handle this and invoke it >>> again. I think that interface is very confusing. The downside of the >>> code in that state is that the code for prefetching is still in the >>> BitmapHeapNext() code and the code for getting the current block is in >>> the heap AM-specific code. I took a stab at fixing this in v9's 0013, >>> but the outcome wasn't very attractive. >>> >>> What I will do tomorrow is reorder and group the commits such that all >>> of the commits that are useful independent of streaming read are first >>> (I think 0014 and 0015 are independently valuable but they are on top >>> of some things that are only useful to streaming read because they are >>> more recently requested changes). I think I can actually do a bit of >>> simplification in terms of how many commits there are and what is in >>> each. Just to be clear, v9 is still reviewable. I am just going to go >>> back and change what is included in each commit. >> >> So, attached v10 does not include the new version of streaming read API. >> I focused instead on the refactoring patches commit regrouping I >> mentioned here. > > Attached v11 has the updated Read Stream API Thomas sent this morning > [1]. No other changes. 
> I think there's some sort of bug, triggering this assert in heapam

Assert(BufferGetBlockNumber(hscan->rs_cbuf) == tbmres->blockno);

I haven't looked for the root cause, and it's not exactly deterministic, but try this:

create table t (a int, b text);
insert into t select 1 * random(), md5(i::text) from generate_series(1,1000) s(i);^C
create index on t (a);
explain analyze select * from t where a = 200;
explain analyze select * from t where a < 200;

and then vary the condition a bit (different values, inequalities, etc.). For me it hits the assert in a couple tries.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

0x79ff48a73d37 in epoll_wait () from /lib64/libc.so.6
Missing separate debuginfos, use: dnf debuginfo-install glibc-2.37-18.fc38.x86_64 libgcc-13.2.1-4.fc38.x86_64 libicu-72.1-2.fc38.x86_64 libstdc++-13.2.1-4.fc38.x86_64 zlib-1.2.13-3.fc38.x86_64
(gdb) c
Continuing.

Program received signal SIGABRT, Aborted.
0x79ff489ef884 in __pthread_kill_implementation () from /lib64/libc.so.6
(gdb) bt
#0  0x79ff489ef884 in __pthread_kill_implementation () from /lib64/libc.so.6
#1  0x79ff4899eafe in raise () from /lib64/libc.so.6
#2  0x79ff4898787f in abort () from /lib64/libc.so.6
#3  0x009ba563 in ExceptionalCondition (conditionName=conditionName@entry=0xa209e0 "BufferGetBlockNumber(hscan->rs_cbuf) == tbmres->blockno", fileName=fileName@entry=0xa205b4 "heapam_handler.c", lineNumber=lineNumber@entry=2221) at assert.c:66
#4  0x0057759f in heapam_scan_bitmap_next_block (exact_pages=, lossy_pages=, recheck=, scan=0x1e73548) at heapam_handler.c:2221
#5  heapam_scan_bitmap_next_tuple (scan=0x1e73548, slot=, recheck=, lossy_pages=, exact_pages=) at heapam_handler.c:2359
#6  0x006f58bb in table_scan_bitmap_next_tuple (exact_pages=, lossy_pages=, recheck=, slot=, scan=) at ../../../src/include/access/tableam.h:2022
#7  BitmapHeapNext (node=0x1e71fc8) at nodeBitmapHeapscan.c:202
#8  0x006e46b8 in ExecProcNodeInstr (node=0x1e71fc8) at execProcnode.c:480
#9  0x006dd6fa in ExecProcNode (node=0x1e71fc8) at ../../../src/include/executor/executor.h:274
#10 ExecutePlan (execute_once=, dest=0xb755e0 , direction=, numberTuples=0, sendTuples=true, operation=CMD_SELECT, use_parallel_mode=, planstate=0x1e71fc8, estate=0x1e71d80) at execMain.c:1644
#11 standard_ExecutorRun (queryDesc=0x1e6c500, direction=, count=0, execute_once=) at execMain.c:363
#12 0x0067af84 in ExplainOnePlan (plannedstmt=plannedstmt@entry=0x1e6c3f0, into=into@entry=0x0, es=es@entry=0x1db3138, queryString=queryString@entry=0x1d87f60 "explain analyze select * from t where a = 200;", params=params@entry=0x0, queryEnv=queryEnv@entry=0x0, planduration=0x7ffcbe111ac8, bufusage=0x0, mem_counters=0x0) at explain.c:645
#13 0x0067b417 in standard_ExplainOneQuery (query=, cursorOptions=2048, into=0x0, es=0x1db3138, queryString=0x1d87f60 "explain analyze select * from t where a = 200;", params=0x0,
Re: BitmapHeapScan streaming read user and prelim refactoring
On Fri, Mar 29, 2024 at 7:01 AM Tomas Vondra wrote: > On 3/28/24 06:20, Thomas Munro wrote: > > With the unexplained but apparently somewhat systematic regression > > patterns on certain tests and settings, I wonder if they might be due > > to read_stream.c trying to form larger reads, making it a bit lazier. > > It tries to see what the next block will be before issuing the > > fadvise. I think that means that with small I/O concurrency settings, > > there might be contrived access patterns where it loses, and needs > > effective_io_concurrency to be set one notch higher to keep up, or > > something like that. > > Yes, I think we've speculated this might be the root cause before, but > IIRC we didn't manage to verify it actually is the problem. Another factor could be the bug in master that allows it to get out of sync -- can it allow *more* concurrency than it intended to? Or fewer hints, but somehow that goes faster because of the stepping-on-kernel-toes problem? > > One way to test that idea would be to run the > > tests with io_combine_limit = 1 (meaning 1 block). It issues advise > > eagerly when io_combine_limit is reached, so I suppose it should be > > exactly as eager as master. The only difference then should be that > > it automatically suppresses sequential fadvise calls. > > Sure, I'll give that a try. What are some good values to test? Perhaps > 32 and 1, i.e. the default and "no coalescing"? Thanks! Yeah. The default is actually 16, computed backwards from 128kB. (Explanation: POSIX requires 16 as minimum IOV_MAX, ie number of vectors acceptable to writev/readv and related functions, though actual acceptable number is usually much higher, and it also seems to be a conservative acceptable number for hardware scatter/gather lists in various protocols, ie if doing direct I/O, the transfer won't be chopped up into more than one physical I/O command because the disk and DMA engine can handle it as a single I/O in theory at least. 
Actual limit on random SSDs might be more like 33, a weird number but that's what I'm seeing; Mr Axboe wrote a nice short article[1] to get some starting points for terminology on that topic on Linux. Also, just anecdotally, returns seem to diminish after that with huge transfers of buffered I/O so it seems like an OK number if you have to pick one; but IDK, YMMV, subject for future research as direct I/O grows in relevance, hence GUC.) > If this turns out to be the problem, does that mean we would consider > using a more conservative default value? Is there some "auto tuning" we > could do? For example, could we reduce the value combine limit if we > start not finding buffers in memory, or something like that? Hmm, not sure... I like that number for seq scans. I also like auto-tuning. But it seems to me that *if* the problem is that we're not allowing ourselves as many concurrent I/Os as master BHS because we're waiting to see if the next block is consecutive, that might indicate that the distance needs to be higher so that we can have a better chance to see the 'edge' (the non-contiguous next block) and start the I/O, not that the io_combine_limit needs to be lower. But I could be way off, given the fuzziness on this problem so far... > Anyway, doesn't the combine limit work against the idea that > effective_io_concurrency is "prefetch distance"? With eic=32 I'd expect > we issue prefetch 32 pages ahead, i.e. if we prefetch page X, we should > then process 32 pages before we actually need X (and we expect the page > to already be in memory, thanks to the gap). But with the combine limit > set to 32, is this still true? Hmm. It's different. (1) Master BHS has prefetch_maximum, which is indeed directly taken from the eic setting, while read_stream.c is prepared to look much ahead further than that (potentially as far as max_pinned_buffers) if it's been useful recently, to find opportunities to coalesce and start I/O. 
(2) Master BHS has prefetch_target to control the look-ahead window, which starts small and ramps up until it hits prefetch_maximum, while read_stream.c has distance which goes up and down according to a more complex algorithm described at the top. > I've tried going through read_stream_* to determine how this will > behave, but read_stream_look_ahead/read_stream_start_pending_read does > not make this very clear. I'll have to experiment with some tracing. I'm going to try to set up something like your experiment here too, and figure out some way to visualise or trace what's going on... The differences come from (1) multi-block I/Os, requiring two separate numbers: how many blocks ahead we're looking, and how many I/Os are running, and (2) being more aggressive about trying to reach the desired I/O level. Let me try to describe the approach again. "distance" is the size of a window that we're searching for opportunities to start I/Os. read_stream_look_ahead() will keep looking ahead until we already have max_ios I/Os running, or
Re: BitmapHeapScan streaming read user and prelim refactoring
On 3/28/24 06:20, Thomas Munro wrote: > With the unexplained but apparently somewhat systematic regression > patterns on certain tests and settings, I wonder if they might be due > to read_stream.c trying to form larger reads, making it a bit lazier. > It tries to see what the next block will be before issuing the > fadvise. I think that means that with small I/O concurrency settings, > there might be contrived access patterns where it loses, and needs > effective_io_concurrency to be set one notch higher to keep up, or > something like that. Yes, I think we've speculated this might be the root cause before, but IIRC we didn't manage to verify it actually is the problem. FWIW I don't think the tests use synthetic data, but I don't think it's particularly contrived. > One way to test that idea would be to run the > tests with io_combine_limit = 1 (meaning 1 block). It issues advise > eagerly when io_combine_limit is reached, so I suppose it should be > exactly as eager as master. The only difference then should be that > it automatically suppresses sequential fadvise calls. Sure, I'll give that a try. What are some good values to test? Perhaps 32 and 1, i.e. the default and "no coalescing"? If this turns out to be the problem, does that mean we would consider using a more conservative default value? Is there some "auto tuning" we could do? For example, could we reduce the value combine limit if we start not finding buffers in memory, or something like that? I recognize this may not be possible with buffered I/O, due to not having any insight into page cache. And maybe it's misguided anyway, because how would we know if the right response is to increase or reduce the combine limit? Anyway, doesn't the combine limit work against the idea that effective_io_concurrency is "prefetch distance"? With eic=32 I'd expect we issue prefetch 32 pages ahead, i.e. 
if we prefetch page X, we should then process 32 pages before we actually need X (and we expect the page to already be in memory, thanks to the gap). But with the combine limit set to 32, is this still true? I've tried going through read_stream_* to determine how this will behave, but read_stream_look_ahead/read_stream_start_pending_read does not make this very clear. I'll have to experiment with some tracing. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: BitmapHeapScan streaming read user and prelim refactoring
With the unexplained but apparently somewhat systematic regression patterns on certain tests and settings, I wonder if they might be due to read_stream.c trying to form larger reads, making it a bit lazier. It tries to see what the next block will be before issuing the fadvise. I think that means that with small I/O concurrency settings, there might be contrived access patterns where it loses, and needs effective_io_concurrency to be set one notch higher to keep up, or something like that. One way to test that idea would be to run the tests with io_combine_limit = 1 (meaning 1 block). It issues advise eagerly when io_combine_limit is reached, so I suppose it should be exactly as eager as master. The only difference then should be that it automatically suppresses sequential fadvise calls.
Re: BitmapHeapScan streaming read user and prelim refactoring
On Sun, Mar 24, 2024 at 5:59 PM Tomas Vondra wrote: > > BTW when you say "up to 'Make table_scan_bitmap_next_block() async > friendly'" do you mean including that patch, or that this is the first > patch that is not one of the independently useful patches. I think the code is easier to understand with "Make table_scan_bitmap_next_block() async friendly". Prior to that commit, table_scan_bitmap_next_block() could return false even when the bitmap has more blocks and expects the caller to handle this and invoke it again. I think that interface is very confusing. The downside of the code in that state is that the code for prefetching is still in the BitmapHeapNext() code and the code for getting the current block is in the heap AM-specific code. I took a stab at fixing this in v9's 0013, but the outcome wasn't very attractive. What I will do tomorrow is reorder and group the commits such that all of the commits that are useful independent of streaming read are first (I think 0014 and 0015 are independently valuable but they are on top of some things that are only useful to streaming read because they are more recently requested changes). I think I can actually do a bit of simplification in terms of how many commits there are and what is in each. Just to be clear, v9 is still reviewable. I am just going to go back and change what is included in each commit. > (I took a quick look at the first couple patches and I appreciate that > you keep separate patches with small cosmetic changes to keep the actual > patch smaller and easier to understand.) Thanks!
Re: BitmapHeapScan streaming read user and prelim refactoring
On 3/24/24 21:12, Melanie Plageman wrote: > On Sun, Mar 24, 2024 at 2:22 PM Tomas Vondra > wrote: >> >> On 3/24/24 18:38, Melanie Plageman wrote: >>> I haven't had a chance yet to reproduce the regressions you saw in the >>> streaming read user patch or to look closely at the performance results. >> >> So you tried to reproduce it and didn't hit the issue? Or didn't have >> time to look into that yet? FWIW with v7 it failed almost immediately >> (only a couple queries until hitting one triggering the issue), but v9 >> that's not the case (hundreds of queries without an error). > > I haven't started trying to reproduce it yet. > >> I however wonder what the plan with these patches is - do we still plan >> to get some of this into v17? It seems to me we're getting uncomfortably >> close to the end of the cycle, with a fairly incomplete idea of how it >> affects performance. >> >> Which is why I've been focusing more on the refactoring patches (up to >> 0015), to make sure those don't cause regressions if committed. And I >> think that's generally true. > > Thank you for testing the refactoring patches with this in mind! Out > of the refactoring patches, I think there is a subset of them that > have independent value without the streaming read user. I think it is > worth committing the first few patches because they remove a table AM > layering violation. IMHO, all of the patches up to "Make > table_scan_bitmap_next_block() async friendly" make the code nicer and > better. And, if folks like the patch "Remove > table_scan_bitmap_next_block()", then I think I could rebase that back > in on top of "Make table_scan_bitmap_next_block() async friendly". > This would mean table AMs would only have to implement one callback > (table_scan_bitmap_next_tuple()) which I also think is a net > improvement and simplification. > > The other refactoring patches may not be interesting without the > streaming read user. 
> I admit not reviewing the individual patches very closely yet, but this matches how I understood them - that at least some are likely an improvement on their own, not just as a refactoring preparing for the switch to streaming reads. We only have ~2 weeks left, so it's probably time to focus on getting at least those improvements committed. I see Heikki was paying way more attention to the patches than me, though ... BTW when you say "up to 'Make table_scan_bitmap_next_block() async friendly'" do you mean including that patch, or that this is the first patch that is not one of the independently useful patches. (I took a quick look at the first couple patches and I appreciate that you keep separate patches with small cosmetic changes to keep the actual patch smaller and easier to understand.) >> But for the main StreamingRead API the situation is very different. > > My intent for the bitmapheapscan streaming read user was to get it > into 17, but I'm not sure that looks likely. The main issues Thomas is > looking into right now are related to regressions for a fully cached > scan (noticeable with the pg_prewarm streaming read user). With all of > these fixed, I anticipate we will still see enough behavioral > differences with the bitmapheap scan streaming read user that it may > not be committable in time. Though, I have yet to work on reproducing > the regressions with the BHS streaming read user mostly because I was > focused on getting the refactoring ready and not as much because the > streaming read API is unstable. > I don't have a very good intuition regarding impact of the streaming API patch on performance. I haven't been following that thread very closely, but AFAICS there wasn't much discussion about that - perhaps it happened offlist, not sure. So who knows, really? Which is why I started looking at this patch instead - it seemed easier to benchmark with a somewhat realistic workload. 
But yeah, there certainly were significant behavior changes, and it's unlikely that whatever Thomas did in v8 made them go away. FWIW I certainly am *not* suggesting there must be no behavior changes, that's simply not possible. I'm not even suggesting no queries must get slower - given the dependence on storage, I think some regressions are pretty much inevitable. But it'd still be good to know the regressions are reasonably rare exceptions rather than the common case, and that's not what I'm seeing ... regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: BitmapHeapScan streaming read user and prelim refactoring
On Sun, Mar 24, 2024 at 2:22 PM Tomas Vondra wrote: > > On 3/24/24 18:38, Melanie Plageman wrote: > > I haven't had a chance yet to reproduce the regressions you saw in the > > streaming read user patch or to look closely at the performance results. > > So you tried to reproduce it and didn't hit the issue? Or didn't have > time to look into that yet? FWIW with v7 it failed almost immediately > (only a couple queries until hitting one triggering the issue), but v9 > that's not the case (hundreds of queries without an error). I haven't started trying to reproduce it yet. > I however wonder what the plan with these patches is - do we still plan > to get some of this into v17? It seems to me we're getting uncomfortably > close to the end of the cycle, with a fairly incomplete idea of how it > affects performance. > > Which is why I've been focusing more on the refactoring patches (up to > 0015), to make sure those don't cause regressions if committed. And I > think that's generally true. Thank you for testing the refactoring patches with this in mind! Out of the refactoring patches, I think there is a subset of them that have independent value without the streaming read user. I think it is worth committing the first few patches because they remove a table AM layering violation. IMHO, all of the patches up to "Make table_scan_bitmap_next_block() async friendly" make the code nicer and better. And, if folks like the patch "Remove table_scan_bitmap_next_block()", then I think I could rebase that back in on top of "Make table_scan_bitmap_next_block() async friendly". This would mean table AMs would only have to implement one callback (table_scan_bitmap_next_tuple()) which I also think is a net improvement and simplification. The other refactoring patches may not be interesting without the streaming read user. > But for the main StreamingRead API the situation is very different. 
My intent for the bitmapheapscan streaming read user was to get it into 17, but I'm not sure that looks likely. The main issues Thomas is looking into right now are related to regressions for a fully cached scan (noticeable with the pg_prewarm streaming read user). With all of these fixed, I anticipate we will still see enough behavioral differences with the bitmapheap scan streaming read user that it may not be committable in time. Though, I have yet to work on reproducing the regressions with the BHS streaming read user mostly because I was focused on getting the refactoring ready and not as much because the streaming read API is unstable. - Melanie
Re: BitmapHeapScan streaming read user and prelim refactoring
On 3/24/24 18:38, Melanie Plageman wrote: > On Sun, Mar 24, 2024 at 01:36:19PM +0100, Tomas Vondra wrote: >> >> >> On 3/23/24 01:26, Melanie Plageman wrote: >>> On Fri, Mar 22, 2024 at 08:22:11PM -0400, Melanie Plageman wrote: On Tue, Mar 19, 2024 at 02:33:35PM +0200, Heikki Linnakangas wrote: > On 18/03/2024 17:19, Melanie Plageman wrote: >> I've attached v7 rebased over this commit. > > If we delayed table_beginscan_bm() call further, after starting the TBM > iterator, we could skip it altogether when the iterator is empty. > > That's a further improvement, doesn't need to be part of this patch set. > Just caught my eye while reading this. Hmm. You mean like until after the first call to tbm_[shared]_iterate()? AFAICT, tbm_begin_iterate() doesn't tell us anything about whether or not the iterator is "empty". Do you mean cases when the bitmap has no blocks in it? It seems like we should be able to tell that from the TIDBitmap. > >> v7-0003-Push-BitmapHeapScan-skip-fetch-optimization-into-.patch > > I suggest to avoid the double negative with SO_CAN_SKIP_FETCH, and call > the > flag e.g. SO_NEED_TUPLE. Agreed. Done in attached v8. Though I wondered if it was a bit weird that the flag is set in the common case and not set in the uncommon case... >>> >>> v8 actually attached this time >> >> I tried to run the benchmarks with v8, but unfortunately it crashes for >> me very quickly (I've only seen 0015 to crash, so I guess the bug is in >> that patch). >> >> The backtrace attached, this doesn't seem right: >> >> (gdb) p hscan->rs_cindex >> $1 = 543516018 > > Thanks for reporting this! I hadn't seen it crash on my machine, so I > didn't realize that I was no longer initializing rs_cindex and > rs_ntuples on the first call to heapam_bitmap_next_tuple() (since > heapam_bitmap_next_block() wasn't being called first). I've done this in > attached v9. > OK, I've restarted the tests with v9. 
> I haven't had a chance yet to reproduce the regressions you saw in the > streaming read user patch or to look closely at the performance results. So you tried to reproduce it and didn't hit the issue? Or didn't have time to look into that yet? FWIW with v7 it failed almost immediately (only a couple queries until hitting one triggering the issue), but v9 that's not the case (hundreds of queries without an error). > I don't anticipate the streaming read user will have any performance > differences in this v9 from v6, since I haven't yet rebased in Thomas' > latest streaming read API changes nor addressed any other potential > regression sources. > OK, understood. It'll be interesting to see the behavior with the new version of Thomas' patch. I however wonder what the plan with these patches is - do we still plan to get some of this into v17? It seems to me we're getting uncomfortably close to the end of the cycle, with a fairly incomplete idea of how it affects performance. Which is why I've been focusing more on the refactoring patches (up to 0015), to make sure those don't cause regressions if committed. And I think that's generally true. But for the main StreamingRead API the situation is very different. > I tried rebasing in Thomas' latest version today and something is > causing a crash that I have yet to figure out. v10 of this patchset will > have his latest version once I get that fixed. I wanted to share this > version with what I think is a bug fix for the crash you saw first. > Understood. I'll let the tests with v9 run for now. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: BitmapHeapScan streaming read user and prelim refactoring
On 3/23/24 01:26, Melanie Plageman wrote: > On Fri, Mar 22, 2024 at 08:22:11PM -0400, Melanie Plageman wrote: >> On Tue, Mar 19, 2024 at 02:33:35PM +0200, Heikki Linnakangas wrote: >>> On 18/03/2024 17:19, Melanie Plageman wrote: I've attached v7 rebased over this commit. >>> >>> If we delayed table_beginscan_bm() call further, after starting the TBM >>> iterator, we could skip it altogether when the iterator is empty. >>> >>> That's a further improvement, doesn't need to be part of this patch set. >>> Just caught my eye while reading this. >> >> Hmm. You mean like until after the first call to tbm_[shared]_iterate()? >> AFAICT, tbm_begin_iterate() doesn't tell us anything about whether or >> not the iterator is "empty". Do you mean cases when the bitmap has no >> blocks in it? It seems like we should be able to tell that from the >> TIDBitmap. >> >>> v7-0003-Push-BitmapHeapScan-skip-fetch-optimization-into-.patch >>> >>> I suggest to avoid the double negative with SO_CAN_SKIP_FETCH, and call the >>> flag e.g. SO_NEED_TUPLE. >> >> Agreed. Done in attached v8. Though I wondered if it was a bit weird >> that the flag is set in the common case and not set in the uncommon >> case... > > v8 actually attached this time I tried to run the benchmarks with v8, but unfortunately it crashes for me very quickly (I've only seen 0015 to crash, so I guess the bug is in that patch). The backtrace attached, this doesn't seem right: (gdb) p hscan->rs_cindex $1 = 543516018 regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL CompanyCore was generated by `postgres: postgres test-100 [local] SELECT '. Program terminated with signal SIGSEGV, Segmentation fault. #0 0x556f99ca7cd0 in heapam_scan_bitmap_next_tuple (scan=0x556f9b9e6960, slot=0x556f9b9ef440, recheck=0x556f9b9eec10, lossy_pages=0x556f9b9eebf8, exact_pages=0x556f9b9eebf0) at heapam_handler.c:2576 warning: Source file is more recent than executable. 
2576        targoffset = hscan->rs_vistuples[hscan->rs_cindex];
(gdb) bt
#0  0x556f99ca7cd0 in heapam_scan_bitmap_next_tuple (scan=0x556f9b9e6960, slot=0x556f9b9ef440, recheck=0x556f9b9eec10, lossy_pages=0x556f9b9eebf8, exact_pages=0x556f9b9eebf0) at heapam_handler.c:2576
#1  0x556f99e17adb in table_scan_bitmap_next_tuple (exact_pages=0x556f9b9eebf0, lossy_pages=0x556f9b9eebf8, recheck=0x556f9b9eec10, slot=0x556f9b9ef440, scan=0x556f9b9e6960) at ../../../src/include/access/tableam.h:2003
#2  BitmapHeapNext (node=0x556f9b9eeb00) at nodeBitmapHeapscan.c:227
#3  0x556f99e1a331 in ExecProcNode (node=0x556f9b9eeb00) at ../../../src/include/executor/executor.h:274
#4  gather_getnext (gatherstate=0x556f9b9ee970) at nodeGather.c:287
#5  ExecGather (pstate=0x556f9b9ee970) at nodeGather.c:222
#6  0x556f99e24ac3 in ExecProcNode (node=0x556f9b9ee970) at ../../../src/include/executor/executor.h:274
#7  ExecLimit (pstate=0x556f9b9ee698) at nodeLimit.c:95
#8  0x556f99e0174a in ExecProcNode (node=0x556f9b9ee698) at ../../../src/include/executor/executor.h:274
#9  ExecutePlan (execute_once=, dest=0x556f9b9f2360, direction=, numberTuples=0, sendTuples=true, operation=CMD_SELECT, use_parallel_mode=, planstate=0x556f9b9ee698, estate=0x556f9b9ee458) at execMain.c:1644
#10 standard_ExecutorRun (queryDesc=0x556f9b944f88, direction=, count=0, execute_once=) at execMain.c:363
#11 0x556f99f9cc7f in PortalRunSelect (portal=portal@entry=0x556f9b997008, forward=forward@entry=true, count=0, count@entry=9223372036854775807, dest=dest@entry=0x556f9b9f2360) at pquery.c:924
#12 0x556f99f9dffa in PortalRun (portal=portal@entry=0x556f9b997008, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true, dest=dest@entry=0x556f9b9f2360, altdest=altdest@entry=0x556f9b9f2360, qc=0x7ffc123a0b60) at pquery.c:768
#13 0x556f99f9a57c in exec_simple_query (query_string=0x556f9b91ab18 "SELECT * FROM linear WHERE (a BETWEEN 2013 AND 8061) OFFSET 100;") at postgres.c:1274
#14 0x556f99f9c051 in PostgresMain (dbname=, username=) at postgres.c:4680
#15 0x556f99f96def in BackendMain (startup_data=, startup_data_len=) at backend_startup.c:101
#16 0x556f99f0c564 in postmaster_child_launch (child_type=child_type@entry=B_BACKEND, startup_data=startup_data@entry=0x7ffc123a0f90 "", startup_data_len=startup_data_len@entry=4, client_sock=client_sock@entry=0x7ffc123a0fb0) at launch_backend.c:265
#17 0x556f99f0fea9 in BackendStartup (client_sock=0x7ffc123a0fb0) at postmaster.c:3593
#18 ServerLoop () at postmaster.c:1674
#19 0x556f99f11b50 in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x556f9b915260) at postmaster.c:1372
#20 0x556f99c5b0c3 in main (argc=3, argv=0x556f9b915260) at main.c:197
(gdb) p hscan->rs_cindex
$1 = 543516018
(gdb) p hscan
$2 = (HeapScanDesc) 0x556f9b9e6960
(gdb) p *hscan
$3 = {rs_base = {rs_rd =
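For illustration, the invariant that bit us here -- per-page state (rs_cindex / rs_ntuples) that was only initialized by the next-block call -- can be modeled in a small stand-alone sketch (Python, hypothetical names, not the actual PostgreSQL API). The next-tuple step has to (re)initialize its page state rather than assume the block step already ran:

```python
class PageScan:
    """Toy model of a two-step scan API: next_block() loads a page,
    next_tuple() returns tuples from the current page."""

    def __init__(self, pages):
        self.pages = pages          # list of lists of tuples
        self.page_no = -1           # no page loaded yet
        self.ntuples = 0            # per-page state: tuples on current page
        self.cindex = 0             # per-page state: current tuple index

    def next_block(self):
        self.page_no += 1
        if self.page_no >= len(self.pages):
            return False
        # (Re)initialize the per-page state for the new page.
        self.ntuples = len(self.pages[self.page_no])
        self.cindex = 0
        return True

    def next_tuple(self):
        # Defensive: if called before any block was loaded, or past the
        # end of the current page, advance to the next block first.
        while self.page_no < 0 or self.cindex >= self.ntuples:
            if not self.next_block():
                return None
        tup = self.pages[self.page_no][self.cindex]
        self.cindex += 1
        return tup

def drain(scan):
    out = []
    while (t := scan.next_tuple()) is not None:
        out.append(t)
    return out
```

With this shape, calling next_tuple() first (as the refactored executor effectively does) is safe; dropping the initialization in next_block() or the guard in next_tuple() reproduces a garbage-index failure analogous to the one in the backtrace.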
Re: BitmapHeapScan streaming read user and prelim refactoring
On Tue, Mar 19, 2024 at 02:33:35PM +0200, Heikki Linnakangas wrote:
> On 18/03/2024 17:19, Melanie Plageman wrote:
> > I've attached v7 rebased over this commit.
>
> If we delayed table_beginscan_bm() call further, after starting the TBM
> iterator, we could skip it altogether when the iterator is empty.
>
> That's a further improvement, doesn't need to be part of this patch set.
> Just caught my eye while reading this.

Hmm. You mean like until after the first call to tbm_[shared]_iterate()? AFAICT, tbm_begin_iterate() doesn't tell us anything about whether or not the iterator is "empty". Do you mean cases when the bitmap has no blocks in it? It seems like we should be able to tell that from the TIDBitmap.

> > v7-0003-Push-BitmapHeapScan-skip-fetch-optimization-into-.patch
>
> I suggest to avoid the double negative with SO_CAN_SKIP_FETCH, and call the
> flag e.g. SO_NEED_TUPLE.

Agreed. Done in attached v8. Though I wondered if it was a bit weird that the flag is set in the common case and not set in the uncommon case...

> As yet another preliminary patch before the streaming read API, it would be
> nice to move the prefetching code to heapam.c too.

I've done this, but I can say it is not very pretty; see 0013. I had to add a bunch of stuff to TableScanDescData and HeapScanDescData which are only used for bitmapheapscans. I don't know if it makes the BHS streaming read user patch easier to review, but I don't think what I have in 0013 is committable to Postgres. Maybe there was another way I could have approached it. Let me know what you think.

In addition to bloating the table descriptors, note that it was difficult to avoid one semantic change -- with 0013, we no longer prefetch or adjust prefetch target when emitting each empty tuple -- though I think this could actually be desirable.

> What's the point of having separate table_scan_bitmap_next_block() and
> table_scan_bitmap_next_tuple() functions anymore? The AM owns the TBM
> iterator now.
The executor node updates the lossy/exact page counts, but > that's the only per-page thing it does now. Oh, interesting. Good point. I've done this in 0015. If you like the way it turned out, I can probably rebase this back into an earlier point in the set and end up dropping some of the other incremental changes (e.g. 0008). > > /* > > * If this is the first scan of the underlying table, create > > the table > > * scan descriptor and begin the scan. > > */ > > if (!scan) > > { > > uint32 extra_flags = 0; > > > > /* > > * We can potentially skip fetching heap pages if we do > > not need > > * any columns of the table, either for checking > > non-indexable > > * quals or for returning data. This test is a bit > > simplistic, as > > * it checks the stronger condition that there's no > > qual or return > > * tlist at all. But in most cases it's probably not > > worth working > > * harder than that. > > */ > > if (node->ss.ps.plan->qual == NIL && > > node->ss.ps.plan->targetlist == NIL) > > extra_flags |= SO_CAN_SKIP_FETCH; > > > > scan = node->ss.ss_currentScanDesc = table_beginscan_bm( > > > > > > node->ss.ss_currentRelation, > > > > > > node->ss.ps.state->es_snapshot, > > > > 0, > > > > NULL, > > > > extra_flags); > > } > > > > scan->tbmiterator = tbmiterator; > > scan->shared_tbmiterator = shared_tbmiterator; > > How about passing the iterator as an argument to table_beginscan_bm()? You'd > then need some other function to change the iterator on rescan, though. Not > sure what exactly to do here, but feels that this part of the API is not > fully thought-out. Needs comments at least, to explain who sets tbmiterator > / shared_tbmiterator and when. For comparison, for a TID scan there's a > separate scan_set_tidrange() table AM function. Maybe follow that example > and introduce scan_set_tbm_iterator(). I've spent quite a bit of time playing around with the code trying to make it less terrible than what I had
Re: BitmapHeapScan streaming read user and prelim refactoring
On 3/18/24 16:19, Melanie Plageman wrote:
> On Mon, Mar 18, 2024 at 02:10:28PM +0200, Heikki Linnakangas wrote:
>> On 14/02/2024 21:42, Andres Freund wrote:
>>> On 2024-02-13 18:11:25 -0500, Melanie Plageman wrote:
>>>> patch 0004 is, I think, a bug fix. see [2].
>>>
>>> I'd not quite call it a bugfix, it's not like it leads to wrong
>>> behaviour. Seems more like an optimization. But whatever :)
>>
>> It sure looks like a bug to me, albeit a very minor one. Certainly not an
>> optimization, it doesn't affect performance in any way, only what EXPLAIN
>> reports. So committed and backported that to all supported branches.
>
> I've attached v7 rebased over this commit.

I've started a new set of benchmarks with v7 (on top of f69319f2f1), but unfortunately that results in about 15% of the queries failing with:

  ERROR:  prefetch and main iterators are out of sync

Reproducing it is pretty simple (at least on my laptop). Simply apply 0001-0011, and then do this:

==
create table test_table (a bigint, b bigint, c text) with (fillfactor = 25);
insert into test_table
  select 1 * random(), i, md5(random()::text)
    from generate_series(1, 100) s(i);
create index on test_table(a);
vacuum analyze;
checkpoint;

set work_mem = '128kB';
set effective_io_concurrency = 8;
set random_page_cost = 2;
set max_parallel_workers_per_gather = 0;

explain select * from test_table where a >= 1 and a <= 512;
explain analyze select * from test_table where a >= 1 and a <= 512;
ERROR:  prefetch and main iterators are out of sync
==

I haven't investigated this, but it seems to get broken by this patch:

  v7-0009-Make-table_scan_bitmap_next_block-async-friendly.patch

I wonder if there are some additional changes aside from the rebase.

regards

-- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: BitmapHeapScan streaming read user and prelim refactoring
On 3/18/24 16:55, Tomas Vondra wrote:
>
> ...
>
> OK, I've restarted the tests for only 0012 and 0014 patches, and I'll
> wait for these to complete - I don't want to be looking for patterns
> until we have enough data to smooth this out.

I now have results for 1M and 10M runs on the two builds (0012 and 0014), attached is a chart for relative performance plotting (0014 timing) / (0012 timing) for "optimal" runs that would pick bitmapscan on their own. There's nothing special about the config - I reduced the random_page_cost to 1.5-2.0 to reflect both machines have flash storage, etc.

Overall, the chart is pretty consistent with what I shared on Sunday. Most of the results are fine (0014 is close to 0012 or faster), but there's a bunch of cases that are much slower. Interestingly enough, almost all of them are on the i5 machine, almost none on the xeon. My guess is this is about the SSD type (SATA vs. NVMe).

Attached is a table of the ~50 worst regressions (by the metric above), and it's interesting that the worst regressions are with eic=0 and eic=1. I decided to look at the first case (eic=0), and the timings are quite stable - there are three runs for each build, with timings close to the average (see below the table).

Attached is a script that reproduces this on both machines, but the difference is much more significant on i5 (~5x) compared to xeon (~2x). I haven't investigated what exactly is happening and why, hopefully the script will allow you to reproduce this independently. I plan to take a look, but I don't know when I'll have time for this.

FWIW if the script does not reproduce this on your machines, I might be able to give you access to the i5 machine. Let me know.
regards

-- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company

create table test_table (a bigint, b bigint, c text) with (fillfactor = 25);
insert into test_table
  select a, b, c from (
    select a, b, c, generate_series(1,24) from (
      select a, b, c from (
        select (416 * random())::int a, i b, md5(random()::text) c
          from generate_series(1, 100/24) s(i)) foo) bar) baz;
create index on test_table (a);
vacuum analyze;
checkpoint;

-- 0012
set effective_io_concurrency = 0;
set work_mem = '4MB';
explain analyze select * from test_table where (a >= 53) AND (a <= 264);

                              QUERY PLAN
---
 Bitmap Heap Scan on test_table  (cost=6030.91..55240.07 rows=502877 width=49) (actual time=21.821..406.664 rows=508440 loops=1)
   Recheck Cond: ((a >= 53) AND (a <= 264))
   Heap Blocks: exact=21185
   ->  Bitmap Index Scan on test_table_a_idx  (cost=0.00..5905.20 rows=502877 width=0) (actual time=18.599..18.599 rows=508440 loops=1)
         Index Cond: ((a >= 53) AND (a <= 264))
 Planning Time: 5.235 ms
 Execution Time: 421.185 ms
(7 rows)

-- 0014
set effective_io_concurrency = 0;
set work_mem = '4MB';
explain analyze select * from test_table where (a >= 53) AND (a <= 264);

                              QUERY PLAN
---
 Bitmap Heap Scan on test_table  (cost=6030.91..55240.07 rows=502877 width=49) (actual time=21.894..2697.074 rows=508440 loops=1)
   Recheck Cond: ((a >= 53) AND (a <= 264))
   Heap Blocks: exact=21185
   ->  Bitmap Index Scan on test_table_a_idx  (cost=0.00..5905.20 rows=502877 width=0) (actual time=18.714..18.715 rows=508440 loops=1)
         Index Cond: ((a >= 53) AND (a <= 264))
 Planning Time: 4.943 ms
 Execution Time: 2730.409 ms
(7 rows)

with data as (
  select machine, build, rows, dataset, workers, wm, eic, matches, caching,
         count(*), round(avg(timing),2) as timing
    from results
   where optimal = 'bitmapscan'
   group by 1, 2, 3, 4, 5, 6, 7, 8, 9
)
select d1.*, d2.timing as timing_0014,
       round(d2.timing / d1.timing,2) AS change
  from data d1 join data d2
    on ((d1.machine, d1.rows, d1.dataset, d1.workers, d1.wm, d1.eic, d1.matches, d1.caching) =
        (d2.machine, d2.rows, d2.dataset, d2.workers, d2.wm, d2.eic, d2.matches, d2.caching))
 where d1.build = 'patched-0012' and d2.build = 'patched-0014'
 order by d2.timing / d1.timing desc;

 machine | build        | rows | dataset       | workers | wm   | eic | matches | caching  | count | timing | timing_0014 | change
---------+--------------+------+---------------+---------+------+-----+---------+----------+-------+--------+-------------+--------
 i5      | patched-0012 | 100  | uniform_pages |       0 | 4096 |   0 |     211 | uncached |     3 | 449.78 |     2650.20 |   5.89
 i5      | patched-0012 | 100  |
Re: BitmapHeapScan streaming read user and prelim refactoring
On 18/03/2024 17:19, Melanie Plageman wrote:
> I've attached v7 rebased over this commit.

Thanks!

> v7-0001-BitmapHeapScan-begin-scan-after-bitmap-creation.patch

If we delayed table_beginscan_bm() call further, after starting the TBM iterator, we could skip it altogether when the iterator is empty.

That's a further improvement, doesn't need to be part of this patch set. Just caught my eye while reading this.

> v7-0003-Push-BitmapHeapScan-skip-fetch-optimization-into-.patch

I suggest to avoid the double negative with SO_CAN_SKIP_FETCH, and call the flag e.g. SO_NEED_TUPLE.

As yet another preliminary patch before the streaming read API, it would be nice to move the prefetching code to heapam.c too.

What's the point of having separate table_scan_bitmap_next_block() and table_scan_bitmap_next_tuple() functions anymore? The AM owns the TBM iterator now. The executor node updates the lossy/exact page counts, but that's the only per-page thing it does now.

>     /*
>      * If this is the first scan of the underlying table, create the table
>      * scan descriptor and begin the scan.
>      */
>     if (!scan)
>     {
>         uint32      extra_flags = 0;
>
>         /*
>          * We can potentially skip fetching heap pages if we do not need
>          * any columns of the table, either for checking non-indexable
>          * quals or for returning data. This test is a bit simplistic, as
>          * it checks the stronger condition that there's no qual or return
>          * tlist at all. But in most cases it's probably not worth working
>          * harder than that.
>          */
>         if (node->ss.ps.plan->qual == NIL &&
>             node->ss.ps.plan->targetlist == NIL)
>             extra_flags |= SO_CAN_SKIP_FETCH;
>
>         scan = node->ss.ss_currentScanDesc = table_beginscan_bm(
>             node->ss.ss_currentRelation,
>             node->ss.ps.state->es_snapshot,
>             0,
>             NULL,
>             extra_flags);
>     }
>
>     scan->tbmiterator = tbmiterator;
>     scan->shared_tbmiterator = shared_tbmiterator;

How about passing the iterator as an argument to table_beginscan_bm()? You'd then need some other function to change the iterator on rescan, though.
Not sure what exactly to do here, but it feels that this part of the API is not fully thought-out. Needs comments at least, to explain who sets tbmiterator / shared_tbmiterator and when. For comparison, for a TID scan there's a separate scan_set_tidrange() table AM function. Maybe follow that example and introduce scan_set_tbm_iterator().

It's a bit awkward to have separate tbmiterator and shared_tbmiterator fields. Could regular and shared iterators be merged, or wrapped under a common interface?

-- Heikki Linnakangas Neon (https://neon.tech)
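As a stand-alone illustration of the skip-fetch idea in the snippet above (a toy Python model, not PostgreSQL code): when the plan has no quals and no targetlist, an all-visible page need not be fetched at all -- the scan only has to emit the right number of empty tuples.

```python
def bitmap_heap_scan(pages, need_tuple):
    """Toy model: pages is a list of (all_visible, ntuples) pairs that the
    bitmap says may contain matches.  Returns (tuples_emitted, pages_fetched).
    When need_tuple is False (no quals, no targetlist), all-visible pages
    are skipped and replaced by empty placeholder tuples."""
    emitted = 0
    fetched = 0
    for all_visible, ntuples in pages:
        if not need_tuple and all_visible:
            # Skip-fetch optimization: the tuple count comes from the index
            # and visibility map; the heap page itself is never read.
            emitted += ntuples
        else:
            fetched += 1
            emitted += ntuples
    return emitted, fetched
```

With need_tuple=False only the pages that are not all-visible get fetched, which is exactly why the flag's polarity (SO_CAN_SKIP_FETCH vs. SO_NEED_TUPLE) is just a naming choice over the same condition.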
Re: BitmapHeapScan streaming read user and prelim refactoring
On 3/18/24 15:47, Melanie Plageman wrote: > On Sun, Mar 17, 2024 at 3:21 PM Tomas Vondra > wrote: >> >> On 3/14/24 22:39, Melanie Plageman wrote: >>> On Thu, Mar 14, 2024 at 5:26 PM Tomas Vondra >>> wrote: On 3/14/24 19:16, Melanie Plageman wrote: > On Thu, Mar 14, 2024 at 03:32:04PM +0200, Heikki Linnakangas wrote: >> ... >> >> Ok, committed that for now. Thanks for looking! > > Attached v6 is rebased over your new commit. It also has the "fix" in > 0010 which moves BitmapAdjustPrefetchIterator() back above > table_scan_bitmap_next_block(). I've also updated the Streaming Read API > commit (0013) to Thomas' v7 version from [1]. This has the update that > we theorize should address some of the regressions in the bitmapheapscan > streaming read user in 0014. > Should I rerun the benchmarks with these new patches, to see if it really helps with the regressions? >>> >>> That would be awesome! >>> >> >> OK, here's a couple charts comparing the effect of v6 patches to master. >> These are from 1M and 10M data sets, same as the runs presented earlier >> in this thread (the 10M is still running, but should be good enough for >> this kind of visual comparison). > > Thanks for doing this! > >> What is even more obvious is that 0014 behaves *VERY* differently. I'm >> not sure if this is a good thing or a problem is debatable/unclear. I'm >> sure we don't want to cause regressions, but perhaps those are due to >> the prefetch issue discussed elsewhere in this thread (identified by >> Andres and Melanie). There are also many cases that got much faster, but >> the question is whether this is due to better efficiency or maybe the >> new code being more aggressive in some way (not sure). > > Are these with the default effective_io_concurrency (1)? If so, the > "effective" prefetch distance in many cases will be higher with the > streaming read code applied. 
With effective_io_concurrency 1, > "max_ios" will always be 1, but the number of blocks prefetched may > exceed this (up to MAX_BUFFERS_PER_TRANSFER) because the streaming > read code is always trying to build bigger IOs. And, if prefetching, > it will prefetch IOs not yet in shared buffers before reading them. > No, it's a mix of runs with random combinations of these parameters: dataset: uniform uniform_pages linear linear_fuzz cyclic cyclic_fuzz workers: 0 4 work_mem: 128kB 4MB 64MB eic: 0 1 8 16 32 selectivity: 0-100% I can either share the data (~70MB of CSV) or generate charts for results with some filter. > It's hard to tell without going into a specific repro why this would > cause some queries to be much slower. In the forced bitmapheapscan, it > would make sense that more prefetching is worse -- which is why a > bitmapheapscan plan wouldn't have been chosen. But in the optimal > cases, it is unclear why it would be worse. > Yes, not sure about the optimal cases. I'll wait for the 10M runs to complete, and then we can look for some patterns. > I don't think there is any way it could be the issue Andres > identified, because there is only one iterator. Nothing to get out of > sync. It could be that the fadvises are being issued too close to the > reads and aren't effective enough at covering up read latency on > slower, older hardware. But that doesn't explain why master would > sometimes be faster. > Ah, right, thanks for the clarification. I forgot the streaming read API does not use the two-iterator approach. > Probably the only thing we can do is get into a repro. It would, of > course, be easiest to do this with a serial query. I can dig into the > scripts you shared earlier and try to find a good repro. Because the > regressions may have shifted with Thomas' new version, it would help > if you shared a category (cyclic/uniform/etc, parallel or serial, eic > value, work mem, etc) where you now see the most regressions. 
> OK, I've restarted the tests for only 0012 and 0014 patches, and I'll wait for these to complete - I don't want to be looking for patterns until we have enough data to smooth this out. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: BitmapHeapScan streaming read user and prelim refactoring
On Sun, Mar 17, 2024 at 3:21 PM Tomas Vondra wrote: > > On 3/14/24 22:39, Melanie Plageman wrote: > > On Thu, Mar 14, 2024 at 5:26 PM Tomas Vondra > > wrote: > >> > >> On 3/14/24 19:16, Melanie Plageman wrote: > >>> On Thu, Mar 14, 2024 at 03:32:04PM +0200, Heikki Linnakangas wrote: > ... > > Ok, committed that for now. Thanks for looking! > >>> > >>> Attached v6 is rebased over your new commit. It also has the "fix" in > >>> 0010 which moves BitmapAdjustPrefetchIterator() back above > >>> table_scan_bitmap_next_block(). I've also updated the Streaming Read API > >>> commit (0013) to Thomas' v7 version from [1]. This has the update that > >>> we theorize should address some of the regressions in the bitmapheapscan > >>> streaming read user in 0014. > >>> > >> > >> Should I rerun the benchmarks with these new patches, to see if it > >> really helps with the regressions? > > > > That would be awesome! > > > > OK, here's a couple charts comparing the effect of v6 patches to master. > These are from 1M and 10M data sets, same as the runs presented earlier > in this thread (the 10M is still running, but should be good enough for > this kind of visual comparison). Thanks for doing this! > What is even more obvious is that 0014 behaves *VERY* differently. I'm > not sure if this is a good thing or a problem is debatable/unclear. I'm > sure we don't want to cause regressions, but perhaps those are due to > the prefetch issue discussed elsewhere in this thread (identified by > Andres and Melanie). There are also many cases that got much faster, but > the question is whether this is due to better efficiency or maybe the > new code being more aggressive in some way (not sure). Are these with the default effective_io_concurrency (1)? If so, the "effective" prefetch distance in many cases will be higher with the streaming read code applied. 
With effective_io_concurrency 1, "max_ios" will always be 1, but the number of blocks prefetched may exceed this (up to MAX_BUFFERS_PER_TRANSFER) because the streaming read code is always trying to build bigger IOs. And, if prefetching, it will prefetch IOs not yet in shared buffers before reading them. It's hard to tell without going into a specific repro why this would cause some queries to be much slower. In the forced bitmapheapscan, it would make sense that more prefetching is worse -- which is why a bitmapheapscan plan wouldn't have been chosen. But in the optimal cases, it is unclear why it would be worse. I don't think there is any way it could be the issue Andres identified, because there is only one iterator. Nothing to get out of sync. It could be that the fadvises are being issued too close to the reads and aren't effective enough at covering up read latency on slower, older hardware. But that doesn't explain why master would sometimes be faster. Probably the only thing we can do is get into a repro. It would, of course, be easiest to do this with a serial query. I can dig into the scripts you shared earlier and try to find a good repro. Because the regressions may have shifted with Thomas' new version, it would help if you shared a category (cyclic/uniform/etc, parallel or serial, eic value, work mem, etc) where you now see the most regressions. - Melanie
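The behaviour described above -- max_ios capped at 1 while adjacent blocks are still combined into larger reads -- can be sketched roughly as follows (toy Python; the coalescing rule and the MAX_BUFFERS_PER_TRANSFER value are simplifications of the real streaming read code, not its API):

```python
MAX_BUFFERS_PER_TRANSFER = 16  # illustrative cap, not the real constant's value

def build_ios(blocks):
    """Coalesce an ascending block-number stream into larger IOs:
    consecutive block numbers are merged into one (start, nblocks) read,
    capped at MAX_BUFFERS_PER_TRANSFER buffers per IO."""
    ios = []
    for b in blocks:
        if (ios
                and b == ios[-1][0] + ios[-1][1]          # adjacent to last IO
                and ios[-1][1] < MAX_BUFFERS_PER_TRANSFER):
            ios[-1] = (ios[-1][0], ios[-1][1] + 1)        # grow the IO
        else:
            ios.append((b, 1))                            # start a new IO
    return ios
```

This is why the "effective" prefetch distance can exceed effective_io_concurrency=1: a single IO may cover many buffers when the bitmap yields dense runs of blocks.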
Re: BitmapHeapScan streaming read user and prelim refactoring
On 14/02/2024 21:42, Andres Freund wrote:
> On 2024-02-13 18:11:25 -0500, Melanie Plageman wrote:
>> patch 0004 is, I think, a bug fix. see [2].
>
> I'd not quite call it a bugfix, it's not like it leads to wrong
> behaviour. Seems more like an optimization. But whatever :)

It sure looks like a bug to me, albeit a very minor one. Certainly not an optimization, it doesn't affect performance in any way, only what EXPLAIN reports. So committed and backported that to all supported branches.

-- Heikki Linnakangas Neon (https://neon.tech)
Re: BitmapHeapScan streaming read user and prelim refactoring
On 3/17/24 20:36, Tomas Vondra wrote:
>
> ...
>
>> Besides a lot of other things, I finally added debugging fprintfs printing the
>> pid, (prefetch, read), block number. Even looking at tiny excerpts of the
>> large amount of output that generates shows that two iterators were out of
>> sync.
>
> Thanks. I did experiment with fprintf, but it's quite cumbersome, so I
> was hoping you came up with some smart way to trace this kind of stuff.
> For example I was wondering if ebpf would be a more convenient way.

FWIW I just realized why I failed to identify this "late prefetch" issue during my investigation. I was experimenting with instrumenting this by adding an LD_PRELOAD library, logging all pread/fadvise calls. But the FilePrefetch call is skipped if the page is already in shared buffers, so this case "disappeared" during processing, which matched the two calls by doing an "inner join".

That being said, I think tracing this using LD_PRELOAD or perf may be a more convenient way to see what's happening. For example I ended up doing this:

  perf record -a -e syscalls:sys_enter_fadvise64 \
                 -e syscalls:sys_exit_fadvise64 \
                 -e syscalls:sys_enter_pread64 \
                 -e syscalls:sys_exit_pread64

  perf script -ns

Alternatively, perf-trace can be used and prints the filename too (but time has ms resolution only).

Processing this seems comparable to the fprintf approach. It still has the issue that some of the fadvise calls may be absent if the prefetch iterator gets too far behind, but I think that can be detected / measured by simply counting the fadvise calls and comparing them to pread calls. We expect these to be about the same, so (#pread - #fadvise) / #fadvise is a measure of how many were "late" and skipped.

It also seems better than fprintf because it traces the actual syscalls, not just calls to glibc wrappers.
For example I saw this:

  postgres 54769 [001] 33768.771524828: syscalls:sys_enter_pread64: ..., pos: 0x30d04000
  postgres 54769 [001] 33768.771526867: syscalls:sys_exit_pread64: 0x2000
  postgres 54820 [000] 33768.771527473: syscalls:sys_enter_fadvise64: ..., offset: 0x30d04000, ...
  postgres 54820 [000] 33768.771528320: syscalls:sys_exit_fadvise64: 0x0

which is clearly a case where we issue fadvise after pread of the same block already completed.

regards

-- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
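The counting suggested above -- comparing fadvise calls to pread calls to estimate how many prefetches were skipped or late -- is easy to script against `perf script` output. A hedged sketch (the line format is abbreviated; only the event name is matched):

```python
import re
from collections import Counter

SYSCALL_RE = re.compile(r'syscalls:sys_enter_(pread64|fadvise64)')

def count_syscalls(perf_script_lines):
    """Count pread64/fadvise64 entries in `perf script` output and return
    (n_pread, n_fadvise, fraction), where fraction is the share of reads
    whose prefetch never happened: (#pread - #fadvise) / #fadvise."""
    counts = Counter()
    for line in perf_script_lines:
        m = SYSCALL_RE.search(line)
        if m:
            counts[m.group(1)] += 1
    pread, fadvise = counts['pread64'], counts['fadvise64']
    if fadvise == 0:
        return pread, fadvise, None
    return pread, fadvise, (pread - fadvise) / fadvise
```

Only the sys_enter_* events are counted, so the matching sys_exit_* lines in the trace do not inflate the totals.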
Re: BitmapHeapScan streaming read user and prelim refactoring
On 3/17/24 17:38, Andres Freund wrote: > Hi, > > On 2024-03-16 21:25:18 +0100, Tomas Vondra wrote: >> On 3/16/24 20:12, Andres Freund wrote: >>> That would address some of the worst behaviour, but it doesn't really seem >>> to >>> address the underlying problem of the two iterators being modified >>> independently. ISTM the proper fix would be to protect the state of the >>> iterators with a single lock, rather than pushing down the locking into the >>> bitmap code. OTOH, we'll only need one lock going forward, so being economic >>> in the effort of fixing this is also important. >>> >> >> Can you share some details about how you identified the problem, counted >> the prefetches that happen too late, etc? I'd like to try to reproduce >> this to understand the issue better. > > There's two aspects. Originally I couldn't reliably reproduce the regression > with Melanie's repro on my laptop. I finally was able to do so after I > a) changed the block device's read_ahead_kb to 0 > b) used effective_io_concurrency=1 > > That made the difference between the BitmapAdjustPrefetchIterator() locations > very significant, something like 2.3s vs 12s. > Interesting. I haven't thought about read_ahead_kb, but in hindsight it makes sense it affects these cases. OTOH I did not set it to 0 on either machine (the 6xSATA RAID0 has it at 12288, for example) and yet that's how we found the regressions. For eic it makes perfect sense that setting it to 1 is particularly vulnerable to this issue - it only takes a small "desynchronization" of the two iterators for the prefetch to "fall behind" and frequently prefetch blocks we already read. > Besides a lot of other things, I finally added debugging fprintfs printing the > pid, (prefetch, read), block number. Even looking at tiny excerpts of the > large amount of output that generates shows that two iterators were out of > sync. > Thanks. 
I did experiment with fprintf, but it's quite cumbersome, so I was hoping you came up with some smart way to trace this kind of stuff. For example I was wondering if ebpf would be a more convenient way.

>> If I understand correctly, what may happen is that a worker reads blocks
>> from the "prefetch" iterator, but before it manages to issue the
>> posix_fadvise, some other worker already did pread. Or can the iterators
>> get "out of sync" in a more fundamental way?
>
> I agree that the current scheme of two shared iterators being used has some
> fairly fundamental raciness. But I suspect there's more than that going on
> right now.
>
> Moving BitmapAdjustPrefetchIterator() to later drastically increases the
> raciness because it means table_scan_bitmap_next_block() happens between
> increasing the "real" and the "prefetch" iterators.
>
> An example scenario that, I think, leads to the iterators being out of sync,
> without there being races between iterator advancement and completing
> prefetching:
>
> start:
>   real -> block 0
>   prefetch -> block 0
>   prefetch_pages = 0
>   prefetch_target = 1
>
> W1: tbm_shared_iterate(real) -> block 0
> W2: tbm_shared_iterate(real) -> block 1
> W1: BitmapAdjustPrefetchIterator() -> tbm_shared_iterate(prefetch) -> 0
> W2: BitmapAdjustPrefetchIterator() -> tbm_shared_iterate(prefetch) -> 1
> W1: read block 0
> W2: read block 1
> W1: BitmapPrefetch() -> prefetch_pages++ -> 1, tbm_shared_iterate(prefetch) -> 2, prefetch block 2
> W2: BitmapPrefetch() -> nothing, as prefetch_pages == prefetch_target
>
> W1: tbm_shared_iterate(real) -> block 2
> W2: tbm_shared_iterate(real) -> block 3
>
> W2: BitmapAdjustPrefetchIterator() -> prefetch_pages--
> W2: read block 3
> W2: BitmapPrefetch() -> prefetch_pages++, tbm_shared_iterate(prefetch) -> 3, prefetch block 3
>
> So afaict here we end up prefetching a block that the *same process* just had
> read.

Uh, that's very weird.
I'd understood if there's some cross-process issue, but if this happens in a single process ... strange. > ISTM that the idea of somehow "catching up" in BitmapAdjustPrefetchIterator(), > separately from advancing the "real" iterator, is pretty ugly for non-parallel > BHS and just straight up broken in the parallel case. > Yeah, I agree with the feeling it's an ugly fix. Definitely seems more like fixing symptoms than the actual problem. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
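Andres' step-by-step scenario can be replayed mechanically. A toy Python model (hypothetical names, greatly simplified: no real locking, prefetch_target fixed at 1 to mimic effective_io_concurrency=1) follows exactly those steps and ends with the same anomaly -- a worker prefetching a block it has itself already read:

```python
def make_state():
    # Shared scan state: one "real" and one "prefetch" iterator position,
    # plus the prefetch distance bookkeeping.
    return {'real': 0, 'prefetch': 0, 'prefetch_pages': 0, 'prefetch_target': 1}

def next_block(s):
    """tbm_shared_iterate() on the "real" iterator."""
    b = s['real']
    s['real'] += 1
    return b

def adjust_prefetch(s):
    """BitmapAdjustPrefetchIterator(): either consume one outstanding
    prefetch, or drag the prefetch iterator forward to keep pace."""
    if s['prefetch_pages'] > 0:
        s['prefetch_pages'] -= 1
    else:
        s['prefetch'] += 1

def do_prefetch(s, who, log):
    """BitmapPrefetch(): issue prefetches up to prefetch_target."""
    while s['prefetch_pages'] < s['prefetch_target']:
        s['prefetch_pages'] += 1
        log.append((who, 'prefetch', s['prefetch']))
        s['prefetch'] += 1

log, s = [], make_state()
# The interleaving from the scenario (two workers W1/W2):
b1 = next_block(s)                    # W1 gets block 0
b2 = next_block(s)                    # W2 gets block 1
adjust_prefetch(s); adjust_prefetch(s)
log.append(('W1', 'read', b1))
log.append(('W2', 'read', b2))
do_prefetch(s, 'W1', log)             # prefetches block 2
do_prefetch(s, 'W2', log)             # no-op: prefetch_pages == prefetch_target
b3 = next_block(s)                    # W1 gets block 2
b4 = next_block(s)                    # W2 gets block 3
adjust_prefetch(s)                    # W2: prefetch_pages--
log.append(('W2', 'read', b4))
do_prefetch(s, 'W2', log)             # prefetches block 3 -- already read by W2
```

The log ends with W2 issuing a prefetch for block 3 after W2's own read of block 3, matching the "same process" observation above.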
Re: BitmapHeapScan streaming read user and prelim refactoring
Hi, On 2024-03-16 21:25:18 +0100, Tomas Vondra wrote: > On 3/16/24 20:12, Andres Freund wrote: > > That would address some of the worst behaviour, but it doesn't really seem > > to > > address the underlying problem of the two iterators being modified > > independently. ISTM the proper fix would be to protect the state of the > > iterators with a single lock, rather than pushing down the locking into the > > bitmap code. OTOH, we'll only need one lock going forward, so being economic > > in the effort of fixing this is also important. > > > > Can you share some details about how you identified the problem, counted > the prefetches that happen too late, etc? I'd like to try to reproduce > this to understand the issue better. There's two aspects. Originally I couldn't reliably reproduce the regression with Melanie's repro on my laptop. I finally was able to do so after I a) changed the block device's read_ahead_kb to 0 b) used effective_io_concurrency=1 That made the difference between the BitmapAdjustPrefetchIterator() locations very significant, something like 2.3s vs 12s. Besides a lot of other things, I finally added debugging fprintfs printing the pid, (prefetch, read), block number. Even looking at tiny excerpts of the large amount of output that generates shows that two iterators were out of sync. > If I understand correctly, what may happen is that a worker reads blocks > from the "prefetch" iterator, but before it manages to issue the > posix_fadvise, some other worker already did pread. Or can the iterators > get "out of sync" in a more fundamental way? I agree that the current scheme of two shared iterators being used has some fairly fundamental raciness. But I suspect there's more than that going on right now. Moving BitmapAdjustPrefetchIterator() to later drastically increases the raciness because it means table_scan_bitmap_next_block() happens between increasing the "real" and the "prefetch" iterators. 
An example scenario that, I think, leads to the iterators being out of sync, without there being races between iterator advancement and completing prefetching:

start:
  real -> block 0
  prefetch -> block 0
  prefetch_pages = 0
  prefetch_target = 1

W1: tbm_shared_iterate(real) -> block 0
W2: tbm_shared_iterate(real) -> block 1
W1: BitmapAdjustPrefetchIterator() -> tbm_shared_iterate(prefetch) -> 0
W2: BitmapAdjustPrefetchIterator() -> tbm_shared_iterate(prefetch) -> 1
W1: read block 0
W2: read block 1
W1: BitmapPrefetch() -> prefetch_pages++ -> 1, tbm_shared_iterate(prefetch)
    -> 2, prefetch block 2
W2: BitmapPrefetch() -> nothing, as prefetch_pages == prefetch_target

W1: tbm_shared_iterate(real) -> block 2
W2: tbm_shared_iterate(real) -> block 3

W2: BitmapAdjustPrefetchIterator() -> prefetch_pages--
W2: read block 3
W2: BitmapPrefetch() -> prefetch_pages++, tbm_shared_iterate(prefetch) -> 3,
    prefetch block 3

So afaict here we end up prefetching a block that the *same process* just had read.

ISTM that the idea of somehow "catching up" in BitmapAdjustPrefetchIterator(), separately from advancing the "real" iterator, is pretty ugly for non-parallel BHS and just straight up broken in the parallel case.

Greetings,

Andres Freund
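The interleaving above can be replayed deterministically with plain counters standing in for the shared TIDBitmap iterators. This is an illustrative sketch, not PostgreSQL code; the `iterate_*` helpers are stand-ins for `tbm_shared_iterate()`, not its actual API:

```python
class SharedState:
    def __init__(self):
        self.real = 0           # next block the "real" iterator hands out
        self.prefetch = 0       # next block the "prefetch" iterator hands out
        self.prefetch_pages = 0
        self.prefetch_target = 1

def iterate_real(s):
    b = s.real
    s.real += 1
    return b

def iterate_prefetch(s):
    b = s.prefetch
    s.prefetch += 1
    return b

s = SharedState()
events = []  # (kind, worker, block), in the order they happen

w1 = iterate_real(s)                    # W1 -> block 0
w2 = iterate_real(s)                    # W2 -> block 1
iterate_prefetch(s)                     # W1 BitmapAdjustPrefetchIterator()
iterate_prefetch(s)                     # W2 BitmapAdjustPrefetchIterator()
events += [("read", "W1", w1), ("read", "W2", w2)]

s.prefetch_pages += 1                   # W1 BitmapPrefetch(): target is 1
events.append(("prefetch", "W1", iterate_prefetch(s)))  # prefetch block 2
# W2 BitmapPrefetch(): nothing, prefetch_pages == prefetch_target

w1 = iterate_real(s)                    # W1 -> block 2
w2 = iterate_real(s)                    # W2 -> block 3
s.prefetch_pages -= 1                   # W2 BitmapAdjustPrefetchIterator()
events.append(("read", "W2", w2))
s.prefetch_pages += 1                   # W2 BitmapPrefetch()
events.append(("prefetch", "W2", iterate_prefetch(s)))  # prefetch block 3!

seen = set()
stale = []
for kind, worker, block in events:
    if kind == "read":
        seen.add(block)
    elif block in seen:
        stale.append((worker, block))
print(stale)  # [('W2', 3)]: W2 prefetches the block it just finished reading
```

Note that no race is needed to reach this state; the replay is single-threaded, which matches the observation that the same process ends up prefetching a block it already read.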
Re: BitmapHeapScan streaming read user and prelim refactoring
On 3/16/24 20:12, Andres Freund wrote: > Hi, > > On 2024-03-15 18:42:29 -0400, Melanie Plageman wrote: >> On Fri, Mar 15, 2024 at 5:14 PM Andres Freund wrote: >>> On 2024-03-14 17:39:30 -0400, Melanie Plageman wrote: >>> I spent a good amount of time looking into this with Melanie. After a bunch >>> of >>> wrong paths I think I found the issue: We end up prefetching blocks we have >>> already read. Notably this happens even as-is on master - just not as >>> frequently as after moving BitmapAdjustPrefetchIterator(). >>> >>> From what I can tell the prefetching in parallel bitmap heap scans is >>> thoroughly broken. I added some tracking of the last block read, the last >>> block prefetched to ParallelBitmapHeapState and found that with a small >>> effective_io_concurrency we end up with ~18% of prefetches being of blocks >>> we >>> already read! After moving the BitmapAdjustPrefetchIterator() to rises to >>> 86%, >>> no wonder it's slower... >>> >>> The race here seems fairly substantial - we're moving the two iterators >>> independently from each other, in multiple processes, without useful >>> locking. >>> >>> I'm inclined to think this is a bug we ought to fix in the backbranches. >> >> Thinking about how to fix this, perhaps we could keep the current max >> block number in the ParallelBitmapHeapState and then when prefetching, >> workers could loop calling tbm_shared_iterate() until they've found a >> block at least prefetch_pages ahead of the current block. They >> wouldn't need to read the current max value from the parallel state on >> each iteration. Even checking it once and storing that value in a >> local variable prevented prefetching blocks after reading them in my >> example repro of the issue. > > That would address some of the worst behaviour, but it doesn't really seem to > address the underlying problem of the two iterators being modified > independently. 
ISTM the proper fix would be to protect the state of the > iterators with a single lock, rather than pushing down the locking into the > bitmap code. OTOH, we'll only need one lock going forward, so being economic > in the effort of fixing this is also important. > Can you share some details about how you identified the problem, counted the prefetches that happen too late, etc? I'd like to try to reproduce this to understand the issue better. If I understand correctly, what may happen is that a worker reads blocks from the "prefetch" iterator, but before it manages to issue the posix_fadvise, some other worker already did pread. Or can the iterators get "out of sync" in a more fundamental way? If my understanding is correct, why would a single lock solve that? Yes, we'd advance the iterators at the same time, but surely we'd not issue the fadvise calls while holding the lock, and the prefetch/fadvise for a particular block could still happen in different workers. I suppose a dirty PoC fix should not be too difficult, and it'd allow us to check if it works. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: BitmapHeapScan streaming read user and prelim refactoring
Hi, On 2024-03-15 18:42:29 -0400, Melanie Plageman wrote: > On Fri, Mar 15, 2024 at 5:14 PM Andres Freund wrote: > > On 2024-03-14 17:39:30 -0400, Melanie Plageman wrote: > > I spent a good amount of time looking into this with Melanie. After a bunch > > of > > wrong paths I think I found the issue: We end up prefetching blocks we have > > already read. Notably this happens even as-is on master - just not as > > frequently as after moving BitmapAdjustPrefetchIterator(). > > > > From what I can tell the prefetching in parallel bitmap heap scans is > > thoroughly broken. I added some tracking of the last block read, the last > > block prefetched to ParallelBitmapHeapState and found that with a small > > effective_io_concurrency we end up with ~18% of prefetches being of blocks > > we > > already read! After moving the BitmapAdjustPrefetchIterator() to rises to > > 86%, > > no wonder it's slower... > > > > The race here seems fairly substantial - we're moving the two iterators > > independently from each other, in multiple processes, without useful > > locking. > > > > I'm inclined to think this is a bug we ought to fix in the backbranches. > > Thinking about how to fix this, perhaps we could keep the current max > block number in the ParallelBitmapHeapState and then when prefetching, > workers could loop calling tbm_shared_iterate() until they've found a > block at least prefetch_pages ahead of the current block. They > wouldn't need to read the current max value from the parallel state on > each iteration. Even checking it once and storing that value in a > local variable prevented prefetching blocks after reading them in my > example repro of the issue. That would address some of the worst behaviour, but it doesn't really seem to address the underlying problem of the two iterators being modified independently. ISTM the proper fix would be to protect the state of the iterators with a single lock, rather than pushing down the locking into the bitmap code. 
OTOH, we'll only need one lock going forward, so being economic in the effort of fixing this is also important. Greetings, Andres Freund
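The single-lock idea could look roughly like the following toy model (an assumption about the shape of the fix, not PostgreSQL code): the read position and the prefetch position are only ever advanced together, inside one critical section, so they cannot drift apart across workers.

```python
import threading

class LockedIterators:
    """Toy model: advance both positions under one lock, keeping the
    prefetch position a fixed distance ahead of the read position."""

    def __init__(self, nblocks, prefetch_target):
        self.lock = threading.Lock()
        self.next_read = 0
        self.next_prefetch = 0
        self.nblocks = nblocks
        self.prefetch_target = prefetch_target

    def next_block(self):
        """Return (block_to_read, blocks_to_prefetch), or (None, []) at end."""
        with self.lock:
            if self.next_read >= self.nblocks:
                return None, []
            block = self.next_read
            self.next_read += 1
            # Advance the prefetch position to prefetch_target blocks ahead
            # of the read position, inside the same critical section.
            goal = min(self.next_read + self.prefetch_target, self.nblocks)
            to_prefetch = list(range(max(self.next_prefetch, self.next_read), goal))
            self.next_prefetch = max(self.next_prefetch, goal)
            return block, to_prefetch

its = LockedIterators(nblocks=6, prefetch_target=2)
log = []

def worker():
    while True:
        block, prefetches = its.next_block()
        if block is None:
            return
        log.append((block, prefetches))

threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Every prefetched block is strictly ahead of the block handed out in the
# same critical section, so no worker can prefetch an already-read block.
assert all(p > b for b, pf in log for p in pf)
print("no stale prefetches across", len(log), "reads")
```

As the discussion notes, this only keeps the iterators in sync; the actual fadvise calls would still happen outside the lock and could still land in a different worker than the corresponding pread.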
Re: BitmapHeapScan streaming read user and prelim refactoring
On Fri, Mar 15, 2024 at 5:14 PM Andres Freund wrote: > > Hi, > > On 2024-03-14 17:39:30 -0400, Melanie Plageman wrote: > > I will soon send out a summary of what we investigated off-list about > > 0010 (though we didn't end up concluding anything). My "fix" (leaving > > BitmapAdjustPrefetchIterator() above table_scan_bitmap_next_block()) > > eliminates the regression in 0010 on the one example that I repro'd > > upthread, but it would be good to know if it eliminates the > > regressions across some other tests. > > I spent a good amount of time looking into this with Melanie. After a bunch of > wrong paths I think I found the issue: We end up prefetching blocks we have > already read. Notably this happens even as-is on master - just not as > frequently as after moving BitmapAdjustPrefetchIterator(). > > From what I can tell the prefetching in parallel bitmap heap scans is > thoroughly broken. I added some tracking of the last block read, the last > block prefetched to ParallelBitmapHeapState and found that with a small > effective_io_concurrency we end up with ~18% of prefetches being of blocks we > already read! After moving the BitmapAdjustPrefetchIterator() to rises to 86%, > no wonder it's slower... > > The race here seems fairly substantial - we're moving the two iterators > independently from each other, in multiple processes, without useful locking. > > I'm inclined to think this is a bug we ought to fix in the backbranches. Thinking about how to fix this, perhaps we could keep the current max block number in the ParallelBitmapHeapState and then when prefetching, workers could loop calling tbm_shared_iterate() until they've found a block at least prefetch_pages ahead of the current block. They wouldn't need to read the current max value from the parallel state on each iteration. Even checking it once and storing that value in a local variable prevented prefetching blocks after reading them in my example repro of the issue. - Melanie
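The skipping idea in the last paragraph might be sketched like this (a simplified guess at the approach, not actual code): workers remember the highest block number any worker has read, and the prefetch loop discards iterator entries at or below it instead of redundantly prefetching them.

```python
def next_prefetch_block(prefetch_iter, max_block_read):
    """Skip shared-prefetch-iterator entries that some worker has already
    read (anything <= max_block_read) and return the first block still
    worth prefetching, or None if the iterator is exhausted."""
    for block in prefetch_iter:
        if block > max_block_read:
            return block
    return None

blocks = iter([0, 1, 2, 3, 4, 5])
# Suppose the highest block any worker has read so far is 3: entries
# 0..3 are skipped rather than prefetched after the fact.
print(next_prefetch_block(blocks, 3))  # 4
print(next_prefetch_block(blocks, 3))  # 5
```

As noted above, even sampling `max_block_read` once into a local variable, rather than rereading shared state on each iteration, was enough to stop the prefetch-after-read pattern in the repro.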
Re: BitmapHeapScan streaming read user and prelim refactoring
Hi,

On 2024-03-14 17:39:30 -0400, Melanie Plageman wrote:
> I will soon send out a summary of what we investigated off-list about
> 0010 (though we didn't end up concluding anything). My "fix" (leaving
> BitmapAdjustPrefetchIterator() above table_scan_bitmap_next_block())
> eliminates the regression in 0010 on the one example that I repro'd
> upthread, but it would be good to know if it eliminates the
> regressions across some other tests.

I spent a good amount of time looking into this with Melanie. After a bunch of wrong paths I think I found the issue: We end up prefetching blocks we have already read. Notably this happens even as-is on master - just not as frequently as after moving BitmapAdjustPrefetchIterator().

From what I can tell the prefetching in parallel bitmap heap scans is thoroughly broken. I added some tracking of the last block read and the last block prefetched to ParallelBitmapHeapState and found that with a small effective_io_concurrency we end up with ~18% of prefetches being of blocks we already read! After moving the BitmapAdjustPrefetchIterator() that rises to 86%, no wonder it's slower...

The race here seems fairly substantial - we're moving the two iterators independently from each other, in multiple processes, without useful locking.

I'm inclined to think this is a bug we ought to fix in the backbranches.

Greetings,

Andres Freund
Re: BitmapHeapScan streaming read user and prelim refactoring
On Thu, Mar 14, 2024 at 5:26 PM Tomas Vondra wrote: > > On 3/14/24 19:16, Melanie Plageman wrote: > > On Thu, Mar 14, 2024 at 03:32:04PM +0200, Heikki Linnakangas wrote: > >> ... > >> > >> Ok, committed that for now. Thanks for looking! > > > > Attached v6 is rebased over your new commit. It also has the "fix" in > > 0010 which moves BitmapAdjustPrefetchIterator() back above > > table_scan_bitmap_next_block(). I've also updated the Streaming Read API > > commit (0013) to Thomas' v7 version from [1]. This has the update that > > we theorize should address some of the regressions in the bitmapheapscan > > streaming read user in 0014. > > > > Should I rerun the benchmarks with these new patches, to see if it > really helps with the regressions? That would be awesome! I will soon send out a summary of what we investigated off-list about 0010 (though we didn't end up concluding anything). My "fix" (leaving BitmapAdjustPrefetchIterator() above table_scan_bitmap_next_block()) eliminates the regression in 0010 on the one example that I repro'd upthread, but it would be good to know if it eliminates the regressions across some other tests. I think it would be worthwhile to run the subset of tests which seemed to fare the worst on 0010 against the patches 0001-0010-- cyclic uncached on your xeon machine with 4 parallel workers, IIRC -- even the 1 million scale would do the trick, I think. And then separately run the subset of tests which seemed to do the worst on 0014. There were several groups of issues across the different tests, but I think that the uniform pages data test would be relevant to use. It showed the regressions with eic 0. As for the other regressions showing with 0014, I think we would want to see at least one with fully-in-shared-buffers and one with fully uncached. Some of the fixes were around pinning fewer buffers when the blocks were already in shared buffers. - Melanie
Re: BitmapHeapScan streaming read user and prelim refactoring
On 3/14/24 19:16, Melanie Plageman wrote: > On Thu, Mar 14, 2024 at 03:32:04PM +0200, Heikki Linnakangas wrote: >> ... >> >> Ok, committed that for now. Thanks for looking! > > Attached v6 is rebased over your new commit. It also has the "fix" in > 0010 which moves BitmapAdjustPrefetchIterator() back above > table_scan_bitmap_next_block(). I've also updated the Streaming Read API > commit (0013) to Thomas' v7 version from [1]. This has the update that > we theorize should address some of the regressions in the bitmapheapscan > streaming read user in 0014. > Should I rerun the benchmarks with these new patches, to see if it really helps with the regressions? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: BitmapHeapScan streaming read user and prelim refactoring
On Fri, Mar 15, 2024 at 3:18 AM Tomas Vondra wrote: > So, IIUC this means (1) the patched code is more aggressive wrt > prefetching (because we prefetch more data overall, because master would > prefetch N pages and patched prefetches N ranges, each of which may be > multiple pages. And (2) it's not easy to quantify how much more > aggressive it is, because it depends on how we happen to coalesce the > pages into ranges. > > Do I understand this correctly? Yes. Parallelism must prevent coalescing here though. Any parallel aware executor node that allocates block numbers to workers without trying to preserve ranges will. That not only hides the opportunity to coalesce reads, it also makes (globally) sequential scans look random (ie locally they are more random), so that our logic to avoid issuing advice for sequential scan won't work, and we'll inject extra useless or harmful (?) fadvise calls. I don't know what to do about that yet, but it seems like a subject for future research. Should we recognise sequential scans with a window (like Linux does), instead of strictly next-block detection (like some other OSes do)? Maybe a shared streaming read that all workers pull blocks from, so it can see what's going on? I think the latter would be strictly more like what the ad hoc BHS prefetching code in master is doing, but I don't know if it'd be over-engineering, or hard to do for some reason. Another aspect of per-backend streaming reads in one parallel query that don't know about each other is that they will all have their own effective_io_concurrency limit. That is a version of a problem that comes up again and again in parallel query, to be solved by the grand unified resource control system of the future.
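Both effects described above (neighbouring blocks coalescing into wider reads, and parallel block allocation defeating that) can be illustrated with a toy range coalescer. This is illustrative Python, not the streaming read machinery itself:

```python
def coalesce(blocks):
    """Merge an ascending sequence of block numbers into (start, length)
    ranges, the way adjacent blocks can be combined into one wider read."""
    ranges = []
    for b in blocks:
        if ranges and b == ranges[-1][0] + ranges[-1][1]:
            start, length = ranges[-1]
            ranges[-1] = (start, length + 1)
        else:
            ranges.append((b, 1))
    return ranges

blocks = list(range(8))            # a globally sequential scan
print(coalesce(blocks))            # [(0, 8)]: one wide read

# Deal the same blocks round-robin to 4 workers, as a parallel-aware node
# might, and each worker's local stream degenerates to single-block reads.
per_worker = [blocks[i::4] for i in range(4)]
print(coalesce(per_worker[0]))     # [(0, 1), (4, 1)]
```

The same fragmentation is what breaks strict next-block sequential detection per worker: globally the scan is sequential, but no individual worker ever sees two adjacent blocks in a row.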
Re: BitmapHeapScan streaming read user and prelim refactoring
On 3/13/24 23:38, Thomas Munro wrote: > On Sun, Mar 3, 2024 at 11:41 AM Tomas Vondra > wrote: >> On 3/2/24 23:28, Melanie Plageman wrote: >>> On Sat, Mar 2, 2024 at 10:05 AM Tomas Vondra >>> wrote: With the current "master" code, eic=1 means we'll issue a prefetch for B and then read+process A. And then issue prefetch for C and read+process B, and so on. It's always one page ahead. >>> >>> Yes, that is what I mean for eic = 1 > > I spent quite a few days thinking about the meaning of eic=0 and eic=1 > for streaming_read.c v7[1], to make it agree with the above and with > master. Here's why I was confused: > > Both eic=0 and eic=1 are expected to generate at most 1 physical I/O > at a time, or I/O queue depth 1 if you want to put it that way. But > this isn't just about concurrency of I/O, it's also about computation. > Duh. > > eic=0 means that the I/O is not concurrent with executor computation. > So, to annotate an excerpt from [1]'s random.txt, we have: > > effective_io_concurrency = 0, range size = 1 > unpatched patched > == > pread(43,...,8192,0x58000) = 8192 pread(82,...,8192,0x58000) = 8192 > *** executor now has page at 0x58000 to work on *** > pread(43,...,8192,0xb) = 8192 pread(82,...,8192,0xb) = 8192 > *** executor now has page at 0xb to work on *** > > eic=1 means that a single I/O is started and then control is returned > to the executor code to do useful work concurrently with the > background read that we assume is happening: > > effective_io_concurrency = 1, range size = 1 > unpatched patched > == > pread(43,...,8192,0x58000) = 8192 pread(82,...,8192,0x58000) = 8192 > posix_fadvise(43,0xb,0x2000,...) posix_fadvise(82,0xb,0x2000,...) > *** executor now has page at 0x58000 to work on *** > pread(43,...,8192,0xb) = 8192 pread(82,...,8192,0xb) = 8192 > posix_fadvise(43,0x108000,0x2000,...) posix_fadvise(82,0x108000,0x2000,...) 
> *** executor now has page at 0xb to work on *** > pread(43,...,8192,0x108000) = 8192 pread(82,...,8192,0x108000) = 8192 > posix_fadvise(43,0x16,0x2000,...) posix_fadvise(82,0x16,0x2000,...) > > In other words, 'concurrency' doesn't mean 'number of I/Os running > concurrently with each other', it means 'number of I/Os running > concurrently with computation', and when you put it that way, 0 and 1 > are different. > Interesting. For some reason I thought with eic=1 we'd issue the fadvise for page #2 before pread of page #1, so that there'd be 2 IO requests in flight at the same time for a bit of time ... it'd give the fadvise more time to actually get the data into page cache. > Note that the first read is a bit special: by the time the consumer is > ready to pull a buffer out of the stream when we don't have a buffer > ready yet, it is too late to issue useful advice, so we don't bother. > FWIW I think even in the AIO future we would have a synchronous read > in that specific place, at least when using io_method=worker, because > it would be stupid to ask another process to read a block for us that > we want right now and then wait for it wake us up when it's done. > > Note that even when we aren't issuing any advice because eic=0 or > because we detected sequential access and we believe the kernel can do > a better job than us, we still 'look ahead' (= call the callback to > see which block numbers are coming down the pipe), but only as far as > we need to coalesce neighbouring blocks. (I deliberately avoid using > the word "prefetch" except in very general discussions because it > means different things to different layers of the code, hence talk of > "look ahead" and "advice".) 
That's how we get this change: > > effective_io_concurrency = 0, range size = 4 > unpatched patched > == > pread(43,...,8192,0x58000) = 8192 pread(82,...,8192,0x58000) = 8192 > pread(43,...,8192,0x5a000) = 8192 preadv(82,...,2,0x5a000) = 16384 > pread(43,...,8192,0x5c000) = 8192 pread(82,...,8192,0x5e000) = 8192 > pread(43,...,8192,0x5e000) = 8192 preadv(82,...,4,0xb) = 32768 > pread(43,...,8192,0xb) = 8192 preadv(82,...,4,0x108000) = 32768 > pread(43,...,8192,0xb2000) = 8192 preadv(82,...,4,0x16) = 32768 > > And then once we introduce eic > 0 to the picture with neighbouring > blocks that can be coalesced, "patched" starts to diverge even more > from "unpatched" because it tracks the number of wide I/Os in > progress, not the number of single blocks. > So, IIUC this means (1) the patched code is more aggressive wrt prefetching (because we prefetch more data overall, because master would
Re: BitmapHeapScan streaming read user and prelim refactoring
On 14/03/2024 12:55, Dilip Kumar wrote: On Thu, Mar 14, 2024 at 4:07 PM Heikki Linnakangas wrote: _SPI_execute_plan() has code to deal with the possibility that the active snapshot is not set. That seems fishy; do we really support SPI without any snapshot? I'm inclined to turn that into an error. I ran the regression tests with an "Assert(ActiveSnapshotSet())" there, and everything worked. IMHO, we can call SPI_Connect() and SPI_Execute() from any C extension, so I don't think there we can guarantee that the snapshot must be set, do we? I suppose, although the things you could do without a snapshot would be pretty limited. The query couldn't access any tables. Could it even look up functions in the parser? Not sure. Maybe for now we can just handle this specific case to remove the snapshot serializing for the BitmapHeapScan as you are doing in the patch. After looking into the code your theory seems correct that we are just copying the ActiveSnapshot while building the query descriptor and from there we are copying into the Estate so logically there should not be any reason for these two to be different. Ok, committed that for now. Thanks for looking! -- Heikki Linnakangas Neon (https://neon.tech)
Re: BitmapHeapScan streaming read user and prelim refactoring
On Thu, Mar 14, 2024 at 9:00 AM Heikki Linnakangas wrote: > The portal code is pretty explicit about it, the ExecutorRun() call in > PortalRunSelect() looks like this: > > PushActiveSnapshot(queryDesc->snapshot); > ExecutorRun(queryDesc, direction, (uint64) count, > portal->run_once); > nprocessed = queryDesc->estate->es_processed; > PopActiveSnapshot(); > > I looked at all the callers of ExecutorRun(), and they all have the > active snapshot equal to queryDesc->snapshot, either because they called > CreateQueryDesc() with the active snapshot before ExecutorRun(), or they > set the active snapshot like above. Well, maybe there's a bunch of code cleanup possible, then. -- Robert Haas EDB: http://www.enterprisedb.com
Re: BitmapHeapScan streaming read user and prelim refactoring
On 14/03/2024 14:34, Robert Haas wrote: On Thu, Mar 14, 2024 at 6:37 AM Heikki Linnakangas wrote: If es_snapshot was different from the active snapshot, things would get weird, even without parallel query. The scans would use es_snapshot for the visibility checks, but any functions you execute in quals would use the active snapshot. Hmm, that's an interesting point. The case where the query is suspended and resumed - i.e. cursors are used - probably needs more analysis. In that case, perhaps there's more room for the snapshots to diverge. The portal code is pretty explicit about it, the ExecutorRun() call in PortalRunSelect() looks like this: PushActiveSnapshot(queryDesc->snapshot); ExecutorRun(queryDesc, direction, (uint64) count, portal->run_once); nprocessed = queryDesc->estate->es_processed; PopActiveSnapshot(); I looked at all the callers of ExecutorRun(), and they all have the active snapshot equal to queryDesc->snapshot, either because they called CreateQueryDesc() with the active snapshot before ExecutorRun(), or they set the active snapshot like above. -- Heikki Linnakangas Neon (https://neon.tech)
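The invariant being verified above can be modeled with a toy active-snapshot stack. Names mirror the C code for readability, but this is a sketch of the pattern, not PostgreSQL's API:

```python
# Toy model: an "active snapshot" stack plus a PortalRunSelect-style
# wrapper that pushes the QueryDesc's snapshot before running, so that
# es_snapshot == the active snapshot for the duration of the run.
active_snapshot_stack = []

def push_active_snapshot(snap):
    active_snapshot_stack.append(snap)

def pop_active_snapshot():
    active_snapshot_stack.pop()

def get_active_snapshot():
    assert active_snapshot_stack, "no active snapshot set"
    return active_snapshot_stack[-1]

class QueryDesc:
    def __init__(self, snapshot):
        self.snapshot = snapshot
        self.es_snapshot = snapshot  # copied at executor startup

def executor_run(query_desc):
    # The property every caller maintains: the executor always runs with
    # the QueryDesc's snapshot as the active one.
    assert get_active_snapshot() is query_desc.es_snapshot
    return "ran"

def portal_run_select(query_desc):
    push_active_snapshot(query_desc.snapshot)
    try:
        return executor_run(query_desc)
    finally:
        pop_active_snapshot()

qd = QueryDesc(snapshot="snap-1")
print(portal_run_select(qd))  # ran
```

Callers that don't push explicitly would instead have created the QueryDesc from the already-active snapshot, which yields the same invariant by a different route.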
Re: BitmapHeapScan streaming read user and prelim refactoring
On Thu, Mar 14, 2024 at 6:37 AM Heikki Linnakangas wrote: > If es_snapshot was different from the active snapshot, things would get > weird, even without parallel query. The scans would use es_snapshot for > the visibility checks, but any functions you execute in quals would use > the active snapshot. Hmm, that's an interesting point. The case where the query is suspended and resumed - i.e. cursors are used - probably needs more analysis. In that case, perhaps there's more room for the snapshots to diverge. -- Robert Haas EDB: http://www.enterprisedb.com
Re: BitmapHeapScan streaming read user and prelim refactoring
On Thu, Mar 14, 2024 at 4:07 PM Heikki Linnakangas wrote: > > > Yeah, that's a very valid point. So I think now Heikki/Melanie might > > have got an answer to their question, about the thought process behind > > serializing the snapshot for each scan node. And the same thing is > > followed for BitmapHeapNode as well. > > I see. Thanks, understanding the thought process helps. > > So when a parallel table or index scan runs in the executor as part of a > query, we could just use the active snapshot. But there are some other > callers of parallel table scans that don't use the executor, namely > parallel index builds. For those it makes sense to pass the snapshot for > the scan independent of the active snapshot. Right > A parallel bitmap heap scan isn't really a parallel scan as far as the > table AM is concerned, though. It's more like an independent bitmap heap > scan in each worker process, nodeBitmapHeapscan.c does all the > coordination of which blocks to scan. So I think that > table_parallelscan_initialize() was the wrong role model, and we should > still remove the snapshot serialization code from nodeBitmapHeapscan.c. I think that seems right. > Digging deeper into the question of whether es_snapshot == > GetActiveSnapshot() is a valid assumption: > > > > es_snapshot is copied from the QueryDesc in standard_ExecutorStart(). > Looking at the callers of ExecutorStart(), they all get the QueryDesc by > calling CreateQueryDesc() with GetActiveSnapshot(). And I don't see any > callers changing the active snapshot between the ExecutorStart() and > ExecutorRun() calls either. In pquery.c, we explicitly > PushActiveSnapshot(queryDesc->snapshot) before calling ExecutorRun(). So > no live bug here AFAICS, es_snapshot == GetActiveSnapshot() holds. > > _SPI_execute_plan() has code to deal with the possibility that the > active snapshot is not set. That seems fishy; do we really support SPI > without any snapshot? I'm inclined to turn that into an error. 
I ran the > regression tests with an "Assert(ActiveSnapshotSet())" there, and > everything worked. IMHO, we can call SPI_Connect() and SPI_Execute() from any C extension, so I don't think there we can guarantee that the snapshot must be set, do we? > If es_snapshot was different from the active snapshot, things would get > weird, even without parallel query. The scans would use es_snapshot for > the visibility checks, but any functions you execute in quals would use > the active snapshot. > > We could double down on that assumption, and remove es_snapshot > altogether and use GetActiveSnapshot() instead. And perhaps add > "PushActiveSnapshot(queryDesc->snapshot)" to ExecutorRun(). > > > > In summary, this es_snapshot stuff is a bit confusing and could use some > cleanup. But for now, I'd like to just add some assertions and a > comments about this, and remove the snapshot serialization from bitmap > heap scan node, to make it consistent with other non-parallel scan nodes > (it's not really a parallel scan as far as the table AM is concerned). > See attached patch, which is the same as previous patch with some extra > assertions. Maybe for now we can just handle this specific case to remove the snapshot serializing for the BitmapHeapScan as you are doing in the patch. After looking into the code your theory seems correct that we are just copying the ActiveSnapshot while building the query descriptor and from there we are copying into the Estate so logically there should not be any reason for these two to be different. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: BitmapHeapScan streaming read user and prelim refactoring
On 14/03/2024 06:54, Dilip Kumar wrote: On Wed, Mar 13, 2024 at 9:25 PM Robert Haas wrote: On Wed, Mar 13, 2024 at 11:39 AM Dilip Kumar wrote: Andres already commented on the snapshot stuff on an earlier patch version, and that's much nicer with this version. However, I don't understand why a parallel bitmap heap scan needs to do anything at all with the snapshot, even before these patches. The parallel worker infrastructure already passes the active snapshot from the leader to the parallel worker. Why does bitmap heap scan code need to do that too? Yeah thinking on this now it seems you are right that the parallel infrastructure is already passing the active snapshot so why do we need it again. Then I checked other low scan nodes like indexscan and seqscan and it seems we are doing the same things there as well. Check for SerializeSnapshot() in table_parallelscan_initialize() and index_parallelscan_initialize() which are being called from ExecSeqScanInitializeDSM() and ExecIndexScanInitializeDSM() respectively. I remember thinking about this when I was writing very early parallel query code. It seemed to me that there must be some reason why the EState has a snapshot, as opposed to just using the active snapshot, and so I took care to propagate that snapshot, which is used for the leader's scans, to the worker scans also. Now, if the EState doesn't need to contain a snapshot, then all of that mechanism is unnecessary, but I don't see how it can be right for the leader to do table_beginscan() using estate->es_snapshot and the worker to use the active snapshot. Yeah, that's a very valid point. So I think now Heikki/Melanie might have got an answer to their question, about the thought process behind serializing the snapshot for each scan node. And the same thing is followed for BitmapHeapNode as well. I see. Thanks, understanding the thought process helps. 
So when a parallel table or index scan runs in the executor as part of a query, we could just use the active snapshot. But there are some other callers of parallel table scans that don't use the executor, namely parallel index builds. For those it makes sense to pass the snapshot for the scan independent of the active snapshot. A parallel bitmap heap scan isn't really a parallel scan as far as the table AM is concerned, though. It's more like an independent bitmap heap scan in each worker process, nodeBitmapHeapscan.c does all the coordination of which blocks to scan. So I think that table_parallelscan_initialize() was the wrong role model, and we should still remove the snapshot serialization code from nodeBitmapHeapscan.c. Digging deeper into the question of whether es_snapshot == GetActiveSnapshot() is a valid assumption: es_snapshot is copied from the QueryDesc in standard_ExecutorStart(). Looking at the callers of ExecutorStart(), they all get the QueryDesc by calling CreateQueryDesc() with GetActiveSnapshot(). And I don't see any callers changing the active snapshot between the ExecutorStart() and ExecutorRun() calls either. In pquery.c, we explicitly PushActiveSnapshot(queryDesc->snapshot) before calling ExecutorRun(). So no live bug here AFAICS, es_snapshot == GetActiveSnapshot() holds. _SPI_execute_plan() has code to deal with the possibility that the active snapshot is not set. That seems fishy; do we really support SPI without any snapshot? I'm inclined to turn that into an error. I ran the regression tests with an "Assert(ActiveSnapshotSet())" there, and everything worked. If es_snapshot was different from the active snapshot, things would get weird, even without parallel query. The scans would use es_snapshot for the visibility checks, but any functions you execute in quals would use the active snapshot. We could double down on that assumption, and remove es_snapshot altogether and use GetActiveSnapshot() instead. 
And perhaps add "PushActiveSnapshot(queryDesc->snapshot)" to ExecutorRun().

In summary, this es_snapshot stuff is a bit confusing and could use some cleanup. But for now, I'd like to just add some assertions and comments about this, and remove the snapshot serialization from the bitmap heap scan node, to make it consistent with other non-parallel scan nodes (it's not really a parallel scan as far as the table AM is concerned). See attached patch, which is the same as the previous patch with some extra assertions.

-- 
Heikki Linnakangas
Neon (https://neon.tech)

From 44eb77162ccc6f18a0a5eaccf0c083d4fefd076f Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas
Date: Thu, 14 Mar 2024 11:56:11 +0200
Subject: [PATCH v2 1/1] Remove redundant snapshot copying from parallel
 leader to workers

The parallel query infrastructure copies the leader backend's active
snapshot to the worker processes. But the BitmapHeapScan node also had
bespoke code to pass the snapshot from leader to the worker. That was
redundant, so remove it. The removed code was analogous to the snapshot serialization
Re: BitmapHeapScan streaming read user and prelim refactoring
On Wed, Mar 13, 2024 at 9:25 PM Robert Haas wrote: > > On Wed, Mar 13, 2024 at 11:39 AM Dilip Kumar wrote: > > > Andres already commented on the snapshot stuff on an earlier patch > > > version, and that's much nicer with this version. However, I don't > > > understand why a parallel bitmap heap scan needs to do anything at all > > > with the snapshot, even before these patches. The parallel worker > > > infrastructure already passes the active snapshot from the leader to the > > > parallel worker. Why does bitmap heap scan code need to do that too? > > > > Yeah thinking on this now it seems you are right that the parallel > > infrastructure is already passing the active snapshot so why do we > > need it again. Then I checked other low scan nodes like indexscan and > > seqscan and it seems we are doing the same things there as well. > > Check for SerializeSnapshot() in table_parallelscan_initialize() and > > index_parallelscan_initialize() which are being called from > > ExecSeqScanInitializeDSM() and ExecIndexScanInitializeDSM() > > respectively. > > I remember thinking about this when I was writing very early parallel > query code. It seemed to me that there must be some reason why the > EState has a snapshot, as opposed to just using the active snapshot, > and so I took care to propagate that snapshot, which is used for the > leader's scans, to the worker scans also. Now, if the EState doesn't > need to contain a snapshot, then all of that mechanism is unnecessary, > but I don't see how it can be right for the leader to do > table_beginscan() using estate->es_snapshot and the worker to use the > active snapshot. Yeah, that's a very valid point. So I think now Heikki/Melanie might have got an answer to their question, about the thought process behind serializing the snapshot for each scan node. And the same thing is followed for BitmapHeapNode as well. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: BitmapHeapScan streaming read user and prelim refactoring
On Sun, Mar 3, 2024 at 11:41 AM Tomas Vondra wrote:
> On 3/2/24 23:28, Melanie Plageman wrote:
> > On Sat, Mar 2, 2024 at 10:05 AM Tomas Vondra wrote:
> >> With the current "master" code, eic=1 means we'll issue a prefetch for B
> >> and then read+process A. And then issue prefetch for C and read+process
> >> B, and so on. It's always one page ahead.
> >
> > Yes, that is what I mean for eic = 1

I spent quite a few days thinking about the meaning of eic=0 and eic=1 for streaming_read.c v7 [1], to make it agree with the above and with master. Here's why I was confused:

Both eic=0 and eic=1 are expected to generate at most 1 physical I/O at a time, or I/O queue depth 1 if you want to put it that way. But this isn't just about concurrency of I/O, it's also about computation. Duh.

eic=0 means that the I/O is not concurrent with executor computation. So, to annotate an excerpt from [1]'s random.txt, we have:

effective_io_concurrency = 0, range size = 1
unpatched                             patched
=============================================
pread(43,...,8192,0x58000) = 8192     pread(82,...,8192,0x58000) = 8192
    *** executor now has page at 0x58000 to work on ***
pread(43,...,8192,0xb) = 8192         pread(82,...,8192,0xb) = 8192
    *** executor now has page at 0xb to work on ***

eic=1 means that a single I/O is started and then control is returned to the executor code to do useful work concurrently with the background read that we assume is happening:

effective_io_concurrency = 1, range size = 1
unpatched                             patched
=============================================
pread(43,...,8192,0x58000) = 8192     pread(82,...,8192,0x58000) = 8192
posix_fadvise(43,0xb,0x2000,...)      posix_fadvise(82,0xb,0x2000,...)
    *** executor now has page at 0x58000 to work on ***
pread(43,...,8192,0xb) = 8192         pread(82,...,8192,0xb) = 8192
posix_fadvise(43,0x108000,0x2000,...) posix_fadvise(82,0x108000,0x2000,...)
    *** executor now has page at 0xb to work on ***
pread(43,...,8192,0x108000) = 8192    pread(82,...,8192,0x108000) = 8192
posix_fadvise(43,0x16,0x2000,...)     posix_fadvise(82,0x16,0x2000,...)
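The two traced patterns above can be condensed into a toy loop. This is an illustrative Python sketch of the described behavior only; "pread" and "fadvise" here are logged strings standing in for the real system calls:

```python
# Toy rendition of the syscall patterns above: each block is pread()
# synchronously, then advice is issued for up to `eic` blocks ahead so
# the kernel can fetch them while the executor works on the current
# block. With eic=0 no advice is ever issued, so I/O never overlaps
# executor computation.

def scan(blocks, eic):
    log, advised = [], set()
    for i, blk in enumerate(blocks):
        log.append("pread " + blk)      # synchronous read of current block
        for nxt in blocks[i + 1 : i + 1 + eic]:
            if nxt not in advised:      # advise each upcoming block once
                log.append("fadvise " + nxt)
                advised.add(nxt)
        log.append("work on " + blk)    # executor computation; with eic>0
                                        # the advised read proceeds meanwhile
    return log

print(scan(["A", "B", "C"], eic=0))
# ['pread A', 'work on A', 'pread B', 'work on B', 'pread C', 'work on C']
print(scan(["A", "B", "C"], eic=1))
# ['pread A', 'fadvise B', 'work on A', 'pread B', 'fadvise C',
#  'work on B', 'pread C', 'work on C']
```

Note how, as in the traces, the very first block is never advised: by the time the consumer wants it, advice is too late to help.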
In other words, 'concurrency' doesn't mean 'number of I/Os running concurrently with each other', it means 'number of I/Os running concurrently with computation', and when you put it that way, 0 and 1 are different.

Note that the first read is a bit special: by the time the consumer is ready to pull a buffer out of the stream when we don't have a buffer ready yet, it is too late to issue useful advice, so we don't bother. FWIW I think even in the AIO future we would have a synchronous read in that specific place, at least when using io_method=worker, because it would be stupid to ask another process to read a block for us that we want right now and then wait for it to wake us up when it's done.

Note that even when we aren't issuing any advice because eic=0 or because we detected sequential access and we believe the kernel can do a better job than us, we still 'look ahead' (= call the callback to see which block numbers are coming down the pipe), but only as far as we need to coalesce neighbouring blocks. (I deliberately avoid using the word "prefetch" except in very general discussions because it means different things to different layers of the code, hence talk of "look ahead" and "advice".) That's how we get this change:

effective_io_concurrency = 0, range size = 4
unpatched                             patched
=============================================
pread(43,...,8192,0x58000) = 8192     pread(82,...,8192,0x58000) = 8192
pread(43,...,8192,0x5a000) = 8192     preadv(82,...,2,0x5a000) = 16384
pread(43,...,8192,0x5c000) = 8192     pread(82,...,8192,0x5e000) = 8192
pread(43,...,8192,0x5e000) = 8192     preadv(82,...,4,0xb) = 32768
pread(43,...,8192,0xb) = 8192         preadv(82,...,4,0x108000) = 32768
pread(43,...,8192,0xb2000) = 8192     preadv(82,...,4,0x16) = 32768

And then once we introduce eic > 0 to the picture with neighbouring blocks that can be coalesced, "patched" starts to diverge even more from "unpatched" because it tracks the number of wide I/Os in progress, not the number of single blocks.
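The "look ahead only as far as needed to coalesce neighbouring blocks" behavior can be sketched as a small merge loop. A hedged illustration, assuming invented block numbers and an invented per-transfer cap (not the real streaming-read implementation):

```python
# Sketch of block coalescing: runs of adjacent block numbers are merged
# into one vectored read, corresponding to the preadv() calls in the
# "patched" column above.

def coalesce(block_numbers, max_blocks_per_transfer=16):
    reads = []                  # list of (start_block, nblocks) I/Os
    run = [block_numbers[0]]
    for b in block_numbers[1:]:
        if b == run[-1] + 1 and len(run) < max_blocks_per_transfer:
            run.append(b)       # neighbouring block: extend current run
        else:
            reads.append((run[0], len(run)))   # flush the finished run
            run = [b]
    reads.append((run[0], len(run)))
    return reads

# Blocks 10..13 become one 4-block read; 20..21 become one 2-block read.
print(coalesce([10, 11, 12, 13, 20, 21]))  # -> [(10, 4), (20, 2)]
```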
[1] https://www.postgresql.org/message-id/ca+hukglji+c5jb3j6uvkgmyhky-qu+lpcsinahugsa5z4dv...@mail.gmail.com
Re: BitmapHeapScan streaming read user and prelim refactoring
On 3/13/24 14:34, Heikki Linnakangas wrote:
> ...
>
> Lots of discussion happening on the performance results but it seems
> that there is no performance impact with the preliminary patches up to
> v5-0013-Streaming-Read-API.patch. I'm focusing purely on those
> preliminary patches now, because I think they're worthwhile cleanups
> independent of the streaming read API.

Not quite true - the comparison I shared on 29/2 [1] shows a serious regression caused by the 0010 patch. We've been investigating this with Melanie off list, but we don't have any clear findings yet (except that it's clearly due to moving BitmapAdjustPrefetchIterator() a bit down).

But if we revert this (and move the BitmapAdjustPrefetchIterator() back), the regression should disappear, and we can merge these preparatory patches. We'll have to deal with the regression (or something very similar) when merging the remaining patches.

regards

[1] https://www.postgresql.org/message-id/91090d58-7d3f-4447-9425-f24ba66e292a%40enterprisedb.com

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: BitmapHeapScan streaming read user and prelim refactoring
On Wed, Mar 13, 2024 at 11:39 AM Dilip Kumar wrote: > > Andres already commented on the snapshot stuff on an earlier patch > > version, and that's much nicer with this version. However, I don't > > understand why a parallel bitmap heap scan needs to do anything at all > > with the snapshot, even before these patches. The parallel worker > > infrastructure already passes the active snapshot from the leader to the > > parallel worker. Why does bitmap heap scan code need to do that too? > > Yeah thinking on this now it seems you are right that the parallel > infrastructure is already passing the active snapshot so why do we > need it again. Then I checked other low scan nodes like indexscan and > seqscan and it seems we are doing the same things there as well. > Check for SerializeSnapshot() in table_parallelscan_initialize() and > index_parallelscan_initialize() which are being called from > ExecSeqScanInitializeDSM() and ExecIndexScanInitializeDSM() > respectively. I remember thinking about this when I was writing very early parallel query code. It seemed to me that there must be some reason why the EState has a snapshot, as opposed to just using the active snapshot, and so I took care to propagate that snapshot, which is used for the leader's scans, to the worker scans also. Now, if the EState doesn't need to contain a snapshot, then all of that mechanism is unnecessary, but I don't see how it can be right for the leader to do table_beginscan() using estate->es_snapshot and the worker to use the active snapshot. -- Robert Haas EDB: http://www.enterprisedb.com
Re: BitmapHeapScan streaming read user and prelim refactoring
On Wed, Mar 13, 2024 at 7:04 PM Heikki Linnakangas wrote: > > (Adding Dilip, the original author of the parallel bitmap heap scan > patch all those years ago, in case you remember anything about the > snapshot stuff below.) > > On 27/02/2024 16:22, Melanie Plageman wrote: > Andres already commented on the snapshot stuff on an earlier patch > version, and that's much nicer with this version. However, I don't > understand why a parallel bitmap heap scan needs to do anything at all > with the snapshot, even before these patches. The parallel worker > infrastructure already passes the active snapshot from the leader to the > parallel worker. Why does bitmap heap scan code need to do that too? Yeah thinking on this now it seems you are right that the parallel infrastructure is already passing the active snapshot so why do we need it again. Then I checked other low scan nodes like indexscan and seqscan and it seems we are doing the same things there as well. Check for SerializeSnapshot() in table_parallelscan_initialize() and index_parallelscan_initialize() which are being called from ExecSeqScanInitializeDSM() and ExecIndexScanInitializeDSM() respectively. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: BitmapHeapScan streaming read user and prelim refactoring
(Adding Dilip, the original author of the parallel bitmap heap scan patch all those years ago, in case you remember anything about the snapshot stuff below.)

On 27/02/2024 16:22, Melanie Plageman wrote:
> On Mon, Feb 26, 2024 at 08:50:28PM -0500, Melanie Plageman wrote:
> > On Fri, Feb 16, 2024 at 12:35:59PM -0500, Melanie Plageman wrote:
> > > In the attached v3, I've reordered the commits, updated some errant
> > > comments, and improved the commit messages.
> > >
> > > I've also made some updates to the TIDBitmap API that seem like a
> > > clarity improvement to the API in general. These also reduce the diff
> > > for GIN when separating the TBMIterateResult from the TBM[Shared]Iterator.
> > > And these TIDBitmap API changes are now all in their own commits
> > > (previously those were in the same commit as adding the BitmapHeapScan
> > > streaming read user).
> > >
> > > The three outstanding issues I see in the patch set are:
> > > 1) the lossy and exact page counters issue described in my previous email
> >
> > I've resolved this. I added a new patch to the set which starts counting
> > even pages with no visible tuples toward lossy and exact pages. After an
> > off-list conversation with Andres, it seems that this omission in master
> > may not have been intentional.
> >
> > Once we have only two types of pages to differentiate between (lossy and
> > exact [no longer have to care about "has no visible tuples"]), it is easy
> > enough to pass a "lossy" boolean parameter to
> > table_scan_bitmap_next_block(). I've done this in the attached v4.
>
> Thomas posted a new version of the Streaming Read API [1], so here is a
> rebased v5. This should make it easier to review as it can be applied on
> top of master.

Lots of discussion happening on the performance results but it seems that there is no performance impact with the preliminary patches up to v5-0013-Streaming-Read-API.patch. I'm focusing purely on those preliminary patches now, because I think they're worthwhile cleanups independent of the streaming read API.
Andres already commented on the snapshot stuff on an earlier patch version, and that's much nicer with this version. However, I don't understand why a parallel bitmap heap scan needs to do anything at all with the snapshot, even before these patches. The parallel worker infrastructure already passes the active snapshot from the leader to the parallel worker. Why does bitmap heap scan code need to do that too?

I disabled that with:

--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -874,7 +874,9 @@ ExecBitmapHeapInitializeWorker(BitmapHeapScanState *node,
 	pstate = shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, false);
 	node->pstate = pstate;
 
+#if 0
 	node->worker_snapshot = RestoreSnapshot(pstate->phs_snapshot_data);
 	Assert(IsMVCCSnapshot(node->worker_snapshot));
 	RegisterSnapshot(node->worker_snapshot);
+#endif
 }

and ran "make check-world". All the tests passed. To be even more sure, I added some code there to assert that the serialized version of node->ss.ps.state->es_snapshot is equal to pstate->phs_snapshot_data, and all the tests passed with that too.

I propose that we just remove the code in BitmapHeapScan to serialize the snapshot, per attached patch.

-- 
Heikki Linnakangas
Neon (https://neon.tech)

From 265d77efa2b56d5cfc3caf7843058b7336524860 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas
Date: Wed, 13 Mar 2024 15:25:07 +0200
Subject: [PATCH 1/1] Remove redundant snapshot copying from parallel leader
 to workers

The parallel query infrastructure copies the leader backend's active
snapshot to the worker processes. But the BitmapHeapScan node also had
bespoke code to pass the snapshot from leader to the worker. That was
redundant, so remove it.
Discussion: XX
---
 src/backend/access/table/tableam.c        | 10 --
 src/backend/executor/nodeBitmapHeapscan.c | 17 ++---
 src/include/access/tableam.h              |  5 -
 src/include/nodes/execnodes.h             |  4
 4 files changed, 2 insertions(+), 34 deletions(-)

diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index 6ed8cca05a1..e57a0b7ea31 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -120,16 +120,6 @@ table_beginscan_catalog(Relation relation, int nkeys, struct ScanKeyData *key)
 					NULL, flags);
 }
 
-void
-table_scan_update_snapshot(TableScanDesc scan, Snapshot snapshot)
-{
-	Assert(IsMVCCSnapshot(snapshot));
-
-	RegisterSnapshot(snapshot);
-	scan->rs_snapshot = snapshot;
-	scan->rs_flags |= SO_TEMP_SNAPSHOT;
-}
-
 /*
  * Parallel table scan related functions.
 
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index 345b67649ea..ca548e44eb4 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -721,7 +721,6 @@
Re: BitmapHeapScan streaming read user and prelim refactoring
On Sat, Mar 2, 2024 at 6:59 PM Tomas Vondra wrote: > > > > On 3/1/24 17:51, Melanie Plageman wrote: > > On Fri, Mar 1, 2024 at 9:05 AM Tomas Vondra > > wrote: > >> > >> On 3/1/24 02:18, Melanie Plageman wrote: > >>> On Thu, Feb 29, 2024 at 6:44 PM Tomas Vondra > >>> wrote: > > On 2/29/24 23:44, Tomas Vondra wrote: > 1) On master there's clear difference between eic=0 and eic=1 cases, but > on the patched build there's literally no difference - for example the > "uniform" distribution is clearly not great for prefetching, but eic=0 > regresses to eic=1 poor behavior). > >>> > >>> Yes, so eic=0 and eic=1 are identical with the streaming read API. > >>> That is, eic 0 does not disable prefetching. Thomas is going to update > >>> the streaming read API to avoid issuing an fadvise for the last block > >>> in a range before issuing a read -- which would mean no prefetching > >>> with eic 0 and eic 1. Not doing prefetching with eic 1 actually seems > >>> like the right behavior -- which would be different than what master > >>> is doing, right? > >> > >> I don't think we should stop doing prefetching for eic=1, or at least > >> not based just on these charts. I suspect these "uniform" charts are not > >> a great example for the prefetching, because it's about distribution of > >> individual rows, and even a small fraction of rows may match most of the > >> pages. It's great for finding strange behaviors / corner cases, but > >> probably not a sufficient reason to change the default. > > > > Yes, I would like to see results from a data set where selectivity is > > more correlated to pages/heap fetches. But, I'm not sure I see how > > that is related to prefetching when eic = 1. > > > >> I think it makes sense to issue a prefetch one page ahead, before > >> reading/processing the preceding one, and it's fairly conservative > >> setting, and I assume the default was chosen for a reason / after > >> discussion. 
> > > > Yes, I suppose the overhead of an fadvise does not compare to the IO > > latency of synchronously reading that block. Actually, I bet the > > regression I saw by accidentally moving BitmapAdjustPrefetchIterator() > > after table_scan_bitmap_next_block() would be similar to the > > regression introduced by making eic = 1 not prefetch. > > > > When you think about IO concurrency = 1, it doesn't imply prefetching > > to me. But, I think we want to do the right thing and have parity with > > master. > > > >> My suggestion would be to keep the master behavior unless not practical, > >> and then maybe discuss changing the details later. The patch is already > >> complicated enough, better to leave that discussion for later. > > > > Agreed. Speaking of which, we need to add back use of tablespace IO > > concurrency for the streaming read API (which is used by > > BitmapHeapScan in master). > > > >>> With very low selectivity, you are less likely to get readahead > >>> (right?) and similarly less likely to be able to build up > 8kB IOs -- > >>> which is one of the main value propositions of the streaming read > >>> code. I imagine that this larger read benefit is part of why the > >>> performance is better at higher selectivities with the patch. This > >>> might be a silly experiment, but we could try decreasing > >>> MAX_BUFFERS_PER_TRANSFER on the patched version and see if the > >>> performance gains go away. > >> > >> Sure, I can do that. Do you have any particular suggestion what value to > >> use for MAX_BUFFERS_PER_TRANSFER? > > > > I think setting it to 1 would be the same as always master -- doing > > only 8kB reads. The only thing about that is that I imagine the other > > streaming read code has some overhead which might end up being a > > regression on balance even with the prefetching if we aren't actually > > using the ranges/vectored capabilities of the streaming read > > interface. 
Maybe if you just run it for one of the very obvious > > performance improvement cases? I can also try this locally. > > > > Here's some results from a build with > > #define MAX_BUFFERS_PER_TRANSFER 1 > > There are three columns: > > - master > - patched (original patches, with MAX_BUFFERS_PER_TRANSFER=128kB) > - patched-single (MAX_BUFFERS_PER_TRANSFER=8kB) > > The color scales are always branch compared to master. > > I think the expectation was that setting the transfer to 1 would make it > closer to master, reducing some of the regressions. But in practice the > effect is the opposite. > > - In "cached" runs, this eliminates the small improvements (light > green), but leaves the regressions behind. For cached runs, I actually would expect that MAX_BUFFERS_PER_TRANSFER would eliminate the regressions. Pinning more buffers will only hurt us for cached workloads. This is evidence that we may need to control the number of pinned buffers differently when there has been a run of fully cached blocks. > - In "uncached" runs, this exacerbates the regressions, particularly for > low
Re: BitmapHeapScan streaming read user and prelim refactoring
On Sat, Mar 2, 2024 at 5:51 PM Tomas Vondra wrote:
>
> On 3/2/24 23:11, Melanie Plageman wrote:
> > On Fri, Mar 1, 2024 at 2:31 PM Melanie Plageman wrote:
> >>
> >> ...
> >>
> >> Hold the phone on this one. I realized why I moved
> >> BitmapAdjustPrefetchIterator after table_scan_bitmap_next_block() in
> >> the first place -- master calls BitmapAdjustPrefetchIterator after the
> >> tbm_iterate() for the current block -- otherwise with eic = 1, it
> >> considers the prefetch iterator behind the current block iterator. I'm
> >> going to go through and figure out what order this must be done in and
> >> fix it.
> >
> > So, I investigated this further, and, as far as I can tell, for
> > parallel bitmapheapscan the timing around when workers decrement
> > prefetch_pages causes the performance differences with patch 0010
> > applied. It makes very little sense to me, but some of the queries I
> > borrowed from your regression examples are up to 30% slower when this
> > code from BitmapAdjustPrefetchIterator() is after
> > table_scan_bitmap_next_block() instead of before it.
> >
> >     SpinLockAcquire(&pstate->mutex);
> >     if (pstate->prefetch_pages > 0)
> >         pstate->prefetch_pages--;
> >     SpinLockRelease(&pstate->mutex);
> >
> > I did some stracing and did see much more time spent in futex/wait
> > with this code after the call to table_scan_bitmap_next_block() vs
> > before it. (table_scan_bitmap_next_block() calls ReadBuffer().)
> >
> > In my branch, I've now moved only the parallel prefetch_pages-- code
> > to before table_scan_bitmap_next_block().
> > https://github.com/melanieplageman/postgres/tree/bhs_pgsr
> > I'd be interested to know if you see the regressions go away with 0010
> > applied (commit message "Make table_scan_bitmap_next_block() async
> > friendly" and sha bfdcbfee7be8e2c461).
>
> I'll give this a try once the runs with MAX_BUFFERS_PER_TRANSFER=1
> complete. But it seems really bizarre that simply moving this code a
> little bit would cause such a regression ...
Yes, it is bizarre. It also might not be a reproducible performance difference on the cases besides the one I was testing (cyclic dataset, uncached, eic=8, matches 16+, distinct=100, rows=1, 4 parallel workers). But even if it only affects that one case, it still had a major, reproducible performance impact to move those 5 lines before and after table_scan_bitmap_next_block().

The same number of reads and fadvises are being issued overall. However, I did notice that the pread calls are skewed when those lines of code are after table_scan_bitmap_next_block() -- fewer of the workers are doing more of the reads. Perhaps this explains what is taking longer. Why those workers would end up doing more of the reads, I don't quite know.

- Melanie
Re: BitmapHeapScan streaming read user and prelim refactoring
On Sat, Mar 2, 2024 at 5:41 PM Tomas Vondra wrote: > > > > On 3/2/24 23:28, Melanie Plageman wrote: > > On Sat, Mar 2, 2024 at 10:05 AM Tomas Vondra > > wrote: > >> > >> Here's a PDF with charts for a dataset where the row selectivity is more > >> correlated to selectivity of pages. I'm attaching the updated script, > >> with the SQL generating the data set. But the short story is all rows on > >> a single page have the same random value, so the selectivity of rows and > >> pages should be the same. > >> > >> The first page has results for the original "uniform", the second page > >> is the new "uniform-pages" data set. There are 4 charts, for > >> master/patched and 0/4 parallel workers. Overall the behavior is the > >> same, but for the "uniform-pages" it's much more gradual (with respect > >> to row selectivity). I think that's expected. > > > > Cool! Thanks for doing this. I have convinced myself that Thomas' > > forthcoming patch which will eliminate prefetching with eic = 0 will > > fix the eic 0 blue line regressions. The eic = 1 with four parallel > > workers is more confusing. And it seems more noticeably bad with your > > randomized-pages dataset. > > > > Regarding your earlier question: > > > >> Just to be sure we're on the same page regarding what eic=1 means, > >> consider a simple sequence of pages: A, B, C, D, E, ... > >> > >> With the current "master" code, eic=1 means we'll issue a prefetch for B > >> and then read+process A. And then issue prefetch for C and read+process > >> B, and so on. It's always one page ahead. > > > > Yes, that is what I mean for eic = 1 > > > >> As for how this is related to eic=1 - I think my point was that these > >> are "adversary" data sets, most likely to show regressions. This applies > >> especially to the "uniform" data set, because as the row selectivity > >> grows, it's more and more likely it's right after to the current one, > >> and so a read-ahead would likely do the trick. 
> > No, I think you're right that eic=1 should prefetch. As you say, with
> > high selectivity, a bitmap plan is likely not the best one anyway, so
> > not prefetching in order to preserve the performance of those cases
> > seems silly.
> >
> > I was just trying to respond to this from an earlier message:
> >
> > > Yes, I would like to see results from a data set where selectivity is
> > > more correlated to pages/heap fetches. But, I'm not sure I see how
> > > that is related to prefetching when eic = 1.
>
> And in that same message you also said "Not doing prefetching with eic 1
> actually seems like the right behavior". Hence my argument we should not
> stop prefetching for eic=1.
>
> But maybe I'm confused - it seems we agree eic=1 should prefetch, and
> that uniform data set may not be a good argument against that.

Yep, we agree. I was being confusing and wrong :) I just wanted to make sure the thread had a clear consensus that, yes, it is the right thing to do to prefetch blocks for bitmap heap scans when effective_io_concurrency = 1.

- Melanie
Re: BitmapHeapScan streaming read user and prelim refactoring
On 3/2/24 23:11, Melanie Plageman wrote:
> On Fri, Mar 1, 2024 at 2:31 PM Melanie Plageman wrote:
>>
>> ...
>>
>> Hold the phone on this one. I realized why I moved
>> BitmapAdjustPrefetchIterator after table_scan_bitmap_next_block() in
>> the first place -- master calls BitmapAdjustPrefetchIterator after the
>> tbm_iterate() for the current block -- otherwise with eic = 1, it
>> considers the prefetch iterator behind the current block iterator. I'm
>> going to go through and figure out what order this must be done in and
>> fix it.
>
> So, I investigated this further, and, as far as I can tell, for
> parallel bitmapheapscan the timing around when workers decrement
> prefetch_pages causes the performance differences with patch 0010
> applied. It makes very little sense to me, but some of the queries I
> borrowed from your regression examples are up to 30% slower when this
> code from BitmapAdjustPrefetchIterator() is after
> table_scan_bitmap_next_block() instead of before it.
>
>     SpinLockAcquire(&pstate->mutex);
>     if (pstate->prefetch_pages > 0)
>         pstate->prefetch_pages--;
>     SpinLockRelease(&pstate->mutex);
>
> I did some stracing and did see much more time spent in futex/wait
> with this code after the call to table_scan_bitmap_next_block() vs
> before it. (table_scan_bitmap_next_block() calls ReadBuffer().)
>
> In my branch, I've now moved only the parallel prefetch_pages-- code
> to before table_scan_bitmap_next_block().
> https://github.com/melanieplageman/postgres/tree/bhs_pgsr
> I'd be interested to know if you see the regressions go away with 0010
> applied (commit message "Make table_scan_bitmap_next_block() async
> friendly" and sha bfdcbfee7be8e2c461).

I'll give this a try once the runs with MAX_BUFFERS_PER_TRANSFER=1 complete. But it seems really bizarre that simply moving this code a little bit would cause such a regression ...

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: BitmapHeapScan streaming read user and prelim refactoring
On 3/2/24 23:28, Melanie Plageman wrote: > On Sat, Mar 2, 2024 at 10:05 AM Tomas Vondra > wrote: >> >> Here's a PDF with charts for a dataset where the row selectivity is more >> correlated to selectivity of pages. I'm attaching the updated script, >> with the SQL generating the data set. But the short story is all rows on >> a single page have the same random value, so the selectivity of rows and >> pages should be the same. >> >> The first page has results for the original "uniform", the second page >> is the new "uniform-pages" data set. There are 4 charts, for >> master/patched and 0/4 parallel workers. Overall the behavior is the >> same, but for the "uniform-pages" it's much more gradual (with respect >> to row selectivity). I think that's expected. > > Cool! Thanks for doing this. I have convinced myself that Thomas' > forthcoming patch which will eliminate prefetching with eic = 0 will > fix the eic 0 blue line regressions. The eic = 1 with four parallel > workers is more confusing. And it seems more noticeably bad with your > randomized-pages dataset. > > Regarding your earlier question: > >> Just to be sure we're on the same page regarding what eic=1 means, >> consider a simple sequence of pages: A, B, C, D, E, ... >> >> With the current "master" code, eic=1 means we'll issue a prefetch for B >> and then read+process A. And then issue prefetch for C and read+process >> B, and so on. It's always one page ahead. > > Yes, that is what I mean for eic = 1 > >> As for how this is related to eic=1 - I think my point was that these >> are "adversary" data sets, most likely to show regressions. This applies >> especially to the "uniform" data set, because as the row selectivity >> grows, it's more and more likely it's right after to the current one, >> and so a read-ahead would likely do the trick. > > No, I think you're right that eic=1 should prefetch. 
> As you say, with high selectivity, a bitmap plan is likely not the best
> one anyway, so not prefetching in order to preserve the performance of
> those cases seems silly.
>
> I was just trying to respond to this from an earlier message:
>
> > Yes, I would like to see results from a data set where selectivity is
> > more correlated to pages/heap fetches. But, I'm not sure I see how
> > that is related to prefetching when eic = 1.

And in that same message you also said "Not doing prefetching with eic 1 actually seems like the right behavior". Hence my argument we should not stop prefetching for eic=1.

But maybe I'm confused - it seems we agree eic=1 should prefetch, and that uniform data set may not be a good argument against that.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: BitmapHeapScan streaming read user and prelim refactoring
On Sat, Mar 2, 2024 at 10:05 AM Tomas Vondra wrote: > > Here's a PDF with charts for a dataset where the row selectivity is more > correlated to selectivity of pages. I'm attaching the updated script, > with the SQL generating the data set. But the short story is all rows on > a single page have the same random value, so the selectivity of rows and > pages should be the same. > > The first page has results for the original "uniform", the second page > is the new "uniform-pages" data set. There are 4 charts, for > master/patched and 0/4 parallel workers. Overall the behavior is the > same, but for the "uniform-pages" it's much more gradual (with respect > to row selectivity). I think that's expected. Cool! Thanks for doing this. I have convinced myself that Thomas' forthcoming patch which will eliminate prefetching with eic = 0 will fix the eic 0 blue line regressions. The eic = 1 with four parallel workers is more confusing. And it seems more noticeably bad with your randomized-pages dataset. Regarding your earlier question: > Just to be sure we're on the same page regarding what eic=1 means, > consider a simple sequence of pages: A, B, C, D, E, ... > > With the current "master" code, eic=1 means we'll issue a prefetch for B > and then read+process A. And then issue prefetch for C and read+process > B, and so on. It's always one page ahead. Yes, that is what I mean for eic = 1 > As for how this is related to eic=1 - I think my point was that these > are "adversary" data sets, most likely to show regressions. This applies > especially to the "uniform" data set, because as the row selectivity > grows, it's more and more likely it's right after to the current one, > and so a read-ahead would likely do the trick. No, I think you're right that eic=1 should prefetch. As you say, with high selectivity, a bitmap plan is likely not the best one anyway, so not prefetching in order to preserve the performance of those cases seems silly. - Melanie
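As a rough illustration of the "uniform-pages" idea above (every row on a heap page shares that page's single random value, so row selectivity and page selectivity coincide), here is a hedged Python sketch. All constants are invented for the example; the real data set is generated by the SQL in the attached script:

```python
import random

# Toy version of a page-correlated data set: one random value per page,
# shared by all rows on that page.
ROWS_PER_PAGE = 100
N_PAGES = 1000
N_DISTINCT = 10_000            # as in the original "uniform" data set

random.seed(42)
table = []                     # (page, value) per row
for page in range(N_PAGES):
    v = random.randrange(N_DISTINCT)                       # one value per page
    table.extend((page, v) for _ in range(ROWS_PER_PAGE))  # shared by its rows

match = {table[0][1]}          # predicate matching one distinct value
rows = sum(1 for _, v in table if v in match)
pages = len({p for p, v in table if v in match})

# Every matching page contributes exactly ROWS_PER_PAGE matching rows,
# so the fraction of matching rows equals the fraction of matching pages.
assert rows / len(table) == pages / N_PAGES
print(rows, pages)
```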
Re: BitmapHeapScan streaming read user and prelim refactoring
On Fri, Mar 1, 2024 at 2:31 PM Melanie Plageman wrote: > On Thu, Feb 29, 2024 at 7:29 PM Melanie Plageman wrote: > > On Thu, Feb 29, 2024 at 5:44 PM Tomas Vondra wrote: > > > On 2/29/24 22:19, Melanie Plageman wrote: > > > > On Thu, Feb 29, 2024 at 7:54 AM Tomas Vondra wrote: > > > >> On 2/29/24 00:40, Melanie Plageman wrote: > > > >>> On Wed, Feb 28, 2024 at 6:17 PM Tomas Vondra wrote: > > > > On 2/28/24 21:06, Melanie Plageman wrote: > > > > On Wed, Feb 28, 2024 at 2:23 PM Tomas Vondra wrote: > > > >> On 2/28/24 15:56, Tomas Vondra wrote: ... > > > >>> Sure, I can do that. It'll take a couple hours to get the results, I'll share them when I have them. > > > >> Here are the results with only patches 0001 - 0012 applied (i.e. without the patch introducing the streaming read API, and the patch switching the bitmap heap scan to use it). The changes in performance don't disappear entirely, but the scale is certainly much smaller - both in the complete results for all runs, and for the "optimal" runs that would actually pick bitmapscan. > > > > Hmm. I'm trying to think how my refactor could have had this impact. It seems like all the most notable regressions are with 4 parallel workers. What do the numeric column labels mean across the top (2,4,8,16...) -- are they related to "matches"? And if so, what does that mean? > > > That's the number of distinct values matched by the query, which should be an approximation of the number of matching rows. The number of distinct values in the data set differs by data set, but for 1M rows it's roughly like this: uniform: 10k, linear: 10k, cyclic: 100. So for example matches=128 means ~1% of rows for uniform/linear, and 100% for cyclic data sets. > > > >>> Ah, thank you for the explanation. I also looked at your script after having sent this email and saw that it is clear in your script what "matches" is. > > > As for the possible cause, I think it's clear most of the difference comes from the last patch that actually switches bitmap heap scan to the streaming read API. That's mostly expected/understandable, although we probably need to look into the regressions or cases with e_i_c=0. > > > >>> Right, I'm mostly surprised about the regressions for patches 0001-0012. Re eic 0: Thomas Munro and I chatted off-list, and you bring up a great point about eic 0. In old bitmapheapscan code eic 0 basically disabled prefetching but with the streaming read API, it will still issue fadvises when eic is 0. That is an easy one line fix. Thomas prefers to fix it by always avoiding an fadvise for the last buffer in a range before issuing a read (since we are about to read it anyway, best not fadvise it too). This will fix eic 0 and also cut one system call from each invocation of the streaming read machinery. > > > To analyze the 0001-0012 patches, maybe it'd be helpful to run tests for individual patches. I can try doing that tomorrow. It'll have to be a limited set of tests, to reduce the time, but might tell us whether it's due to a single patch or multiple patches. > > > >>> Yes, tomorrow I planned to start trying to repro some of the "red" cases myself. Any one of the commits could cause a slight regression but a 3.5x regression is quite surprising, so I might focus on trying to repro that locally and then narrow down which patch causes it. For the non-cached regressions, perhaps the commit to use the correct recheck flag (0004) when prefetching could be the culprit. And for the cached regressions, my money is on the commit which changes the whole control flow of BitmapHeapNext() and the next_block() and next_tuple() functions (0010). > > > >> I do have some partial results, comparing the patches. I only ran one of the more affected workloads (cyclic) on the xeon, attached is a PDF comparing master and the 0001-0014 patches. The percentages are timing vs. the preceding patch (green - faster, red - slower). > > > > Just confirming: the results are for uncached?
Re: BitmapHeapScan streaming read user and prelim refactoring
On Thu, Feb 29, 2024 at 7:29 PM Melanie Plageman wrote: > On Thu, Feb 29, 2024 at 5:44 PM Tomas Vondra wrote: > > On 2/29/24 22:19, Melanie Plageman wrote: > > > On Thu, Feb 29, 2024 at 7:54 AM Tomas Vondra wrote: > > >> On 2/29/24 00:40, Melanie Plageman wrote: > > >>> On Wed, Feb 28, 2024 at 6:17 PM Tomas Vondra wrote: > > > On 2/28/24 21:06, Melanie Plageman wrote: > > > On Wed, Feb 28, 2024 at 2:23 PM Tomas Vondra wrote: > > >> On 2/28/24 15:56, Tomas Vondra wrote: ... > > >>> Sure, I can do that. It'll take a couple hours to get the results, I'll share them when I have them. > > >> Here are the results with only patches 0001 - 0012 applied (i.e. without the patch introducing the streaming read API, and the patch switching the bitmap heap scan to use it). The changes in performance don't disappear entirely, but the scale is certainly much smaller - both in the complete results for all runs, and for the "optimal" runs that would actually pick bitmapscan. > > > Hmm. I'm trying to think how my refactor could have had this impact. It seems like all the most notable regressions are with 4 parallel workers. What do the numeric column labels mean across the top (2,4,8,16...) -- are they related to "matches"? And if so, what does that mean? > > That's the number of distinct values matched by the query, which should be an approximation of the number of matching rows. The number of distinct values in the data set differs by data set, but for 1M rows it's roughly like this: uniform: 10k, linear: 10k, cyclic: 100. So for example matches=128 means ~1% of rows for uniform/linear, and 100% for cyclic data sets. > > >>> Ah, thank you for the explanation. I also looked at your script after having sent this email and saw that it is clear in your script what "matches" is. > > As for the possible cause, I think it's clear most of the difference comes from the last patch that actually switches bitmap heap scan to the streaming read API. That's mostly expected/understandable, although we probably need to look into the regressions or cases with e_i_c=0. > > >>> Right, I'm mostly surprised about the regressions for patches 0001-0012. Re eic 0: Thomas Munro and I chatted off-list, and you bring up a great point about eic 0. In old bitmapheapscan code eic 0 basically disabled prefetching but with the streaming read API, it will still issue fadvises when eic is 0. That is an easy one line fix. Thomas prefers to fix it by always avoiding an fadvise for the last buffer in a range before issuing a read (since we are about to read it anyway, best not fadvise it too). This will fix eic 0 and also cut one system call from each invocation of the streaming read machinery. > > To analyze the 0001-0012 patches, maybe it'd be helpful to run tests for individual patches. I can try doing that tomorrow. It'll have to be a limited set of tests, to reduce the time, but might tell us whether it's due to a single patch or multiple patches. > > >>> Yes, tomorrow I planned to start trying to repro some of the "red" cases myself. Any one of the commits could cause a slight regression but a 3.5x regression is quite surprising, so I might focus on trying to repro that locally and then narrow down which patch causes it. For the non-cached regressions, perhaps the commit to use the correct recheck flag (0004) when prefetching could be the culprit. And for the cached regressions, my money is on the commit which changes the whole control flow of BitmapHeapNext() and the next_block() and next_tuple() functions (0010). > > >> I do have some partial results, comparing the patches. I only ran one of the more affected workloads (cyclic) on the xeon, attached is a PDF comparing master and the 0001-0014 patches. The percentages are timing vs. the preceding patch (green - faster, red - slower). > > > Just confirming: the results are for uncached? > > Yes, cyclic data set, uncached case. I picked this because it seemed like one of the most affected cases. Do you want me to test some other cases too? So, I actually may have found the source of at least part of the regression with 0010. I was able to reproduce the regression with patch 0010 applied for the uncached