Re: First draft of PG 17 release notes

2024-05-21 Thread Masahiko Sawada
Hi,

On Thu, May 9, 2024 at 1:03 PM Bruce Momjian  wrote:
>
> I have committed the first draft of the PG 17 release notes;  you can
> see the results here:
>
> https://momjian.us/pgsql_docs/release-17.html
>
> It will be improved until the final release.  The item count is 188,
> which is similar to recent releases:
>

I found a typo:

s/pg_statstatement/pg_stat_statement/

I've attached a patch to fix it.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


fix_pg_stat_statements.patch
Description: Binary data


Re: PostgreSQL 17 Beta 1 release announcement draft

2024-05-20 Thread Masahiko Sawada
On Mon, May 20, 2024 at 8:47 PM Jonathan S. Katz  wrote:
>
> On 5/20/24 2:58 AM, John Naylor wrote:
> > Hi Jon,
> >
> > Regarding vacuum "has shown up to a 6x improvement in overall time to
> > complete its work" -- I believe I've seen reported numbers close to
> > that only 1) when measuring the index phase in isolation or maybe 2)
> > the entire vacuum of unlogged tables with one, perfectly-correlated
> > index (testing has less variance with WAL out of the picture). I
> > believe tables with many indexes would show a lot of improvement, but
> > I'm not aware of testing that case specifically. Can you clarify where
> > 6x came from?
>
> Sawada-san showed me the original context, but I can't rapidly find it
> in the thread. Sawada-san, can you please share the numbers behind this?
>

I referenced the numbers that I measured during development[1] (test
scripts are here[2]). IIRC I used unlogged tables and indexes, and
these numbers were the entire vacuum execution time, including heap
scanning, index vacuuming, and heap vacuuming.

FYI, today I ran the same script with PG17 and measured the
execution times. Here are the results:

monotonically ordered int column index:
system usage: CPU: user: 1.72 s, system: 0.47 s, elapsed: 2.20 s

uuid column index:
system usage: CPU: user: 3.62 s, system: 0.89 s, elapsed: 4.52 s

int & uuid indexes in parallel:
system usage: CPU: user: 2.24 s, system: 0.44 s, elapsed: 5.01 s

These numbers are better than the ones I measured with the v62 patch
set, as we have since introduced some optimizations into tidstore
(8a1b31e6 and f35bd9b).
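
For readers who want the shape of the test without digging through [2],
here is a minimal sketch (the table name, row count, and parallel
degree are illustrative, not the exact script):

create unlogged table bench (id int, u uuid default gen_random_uuid());
create index bench_id_idx on bench (id);  -- monotonically ordered int column
create index bench_u_idx on bench (u);    -- uuid column
insert into bench (id) select generate_series(1, 10000000);
delete from bench;                        -- make every tuple dead
vacuum (verbose, parallel 2) bench;       -- prints the "system usage" lines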

Regards,

[1] 
https://www.postgresql.org/message-id/CAD21AoBci3Hujzijubomo1tdwH3XtQ9F89cTNQ4bsQijOmqnEw%40mail.gmail.com
[2] 
https://www.postgresql.org/message-id/CANWCAZYqWibTRCWs5mV57mLj1A0nbKX-eV5G%2Bd-KmBOGHTVY-w%40mail.gmail.com

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Pgoutput not capturing the generated columns

2024-05-20 Thread Masahiko Sawada
Hi,

On Wed, May 8, 2024 at 4:14 PM Shubham Khanna
 wrote:
>
> On Wed, May 8, 2024 at 11:39 AM Rajendra Kumar Dangwal
>  wrote:
> >
> > Hi PG Hackers.
> >
> > We are interested in enhancing the functionality of the pgoutput plugin by 
> > adding support for generated columns.
> > Could you please guide us on the necessary steps to achieve this? 
> > Additionally, do you have a platform for tracking such feature requests? 
> > Any insights or assistance you can provide on this matter would be greatly 
> > appreciated.
>
> The attached patch has the changes to support capturing generated
> column data using ‘pgoutput’ and ‘test_decoding’ plugins. Now if the
> ‘include_generated_columns’ option is specified, the generated column
> information and generated column data also will be sent.

As Euler mentioned earlier, I think it was a deliberate decision not
to replicate generated columns, because we don't know whether the
target table on the subscriber has the same expression, and there
could be locale issues even if it looks the same. I can see that a
benefit of this proposal would be to save the cost of computing
generated column values if the user wants the target table on the
subscriber to have exactly the same data as the publisher's. Are there
other benefits or use cases?
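
To illustrate the expression-mismatch concern with a hypothetical
example (not from the thread): if the subscriber defines the generated
column differently, replicating the publisher's computed values would
leave the column inconsistent with the subscriber's own definition:

-- publisher
CREATE TABLE t (a int PRIMARY KEY, b int GENERATED ALWAYS AS (a * 2) STORED);
-- subscriber: same column name, different expression
CREATE TABLE t (a int PRIMARY KEY, b int GENERATED ALWAYS AS (a * 3) STORED);
-- if the publisher's b values were replicated verbatim, the
-- subscriber's b would no longer satisfy its own expression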

>
> Usage from pgoutput plugin:
> CREATE TABLE gencoltable (a int PRIMARY KEY, b int GENERATED ALWAYS AS
> (a * 2) STORED);
> CREATE publication pub1 for all tables;
> SELECT 'init' FROM pg_create_logical_replication_slot('slot1', 'pgoutput');
> SELECT * FROM pg_logical_slot_peek_binary_changes('slot1', NULL, NULL,
> 'proto_version', '1', 'publication_names', 'pub1',
> 'include_generated_columns', 'true');
>
> Usage from test_decoding plugin:
> SELECT 'init' FROM pg_create_logical_replication_slot('slot2', 
> 'test_decoding');
> CREATE TABLE gencoltable (a int PRIMARY KEY, b int GENERATED ALWAYS AS
> (a * 2) STORED);
> INSERT INTO gencoltable (a) VALUES (1), (2), (3);
> SELECT data FROM pg_logical_slot_get_changes('slot2', NULL, NULL,
> 'include-xids', '0', 'skip-empty-xacts', '1',
> 'include_generated_columns', '1');
>
> Currently it is not supported as a subscription option because table
> sync for the generated column is not possible as copy command does not
> support getting data for the generated column. If this feature is
> required we can remove this limitation from the copy command and then
> add it as a subscription option later.
> Thoughts?

I think that if we want to support an option to replicate generated
columns, the initial tablesync should support it too. Otherwise, we
end up filling the target columns with NULL during the initial
tablesync but with replicated data during the streaming of changes.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Lowering the minimum value for maintenance_work_mem

2024-05-19 Thread Masahiko Sawada
On Fri, May 17, 2024 at 5:55 AM Andres Freund  wrote:
>
> Hi,
>
> In the subthread at [1] I needed to trigger multiple rounds of index vacuuming
> within one vacuum.
>
> It turns out that with the new dead tuple implementation, that got actually
> somewhat expensive. Particularly if all tuples on all pages get deleted, the
> representation is just "too dense". Normally that's obviously very good, but
> for testing, not so much:
>
> With the minimum setting of maintenance_work_mem=1024kB, a simple table with
> narrow rows, where all rows are deleted, the first cleanup happens after
> 3697812 dead tids. The table for that has to be > ~128MB.
>
> Needing a ~128MB table to be able to test multiple cleanup passes makes it
> much more expensive to test and consequently will lead to worse test coverage.
>
> I think we should consider lowering the minimum setting of
> maintenance_work_mem to the minimum of work_mem.

+1 for lowering the minimum value of maintenance_work_mem. I've faced
the same situation.

Even if a shared tidstore is empty, TidStoreMemoryUsage() returns
256kB because that is the minimum segment size of DSA, i.e.
DSA_MIN_SEGMENT_SIZE. So, from a vacuum perspective, we can lower the
minimum maintenance_work_mem down to 256kB.
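
As a sketch, assuming the minimum were lowered as discussed, a test
for multiple index vacuum passes could then use a much smaller table
(the sizes here are illustrative):

set maintenance_work_mem = '256kB';  -- assumes the lowered minimum
create table small (i int) with (autovacuum_enabled = off);
create index on small (i);
insert into small select generate_series(1, 2000000);
delete from small;
vacuum (verbose) small;  -- "index scans: N" should report N > 1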

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: pg_trgm comparison bug on cross-architecture replication due to different char implementation

2024-05-14 Thread Masahiko Sawada
On Sat, May 4, 2024 at 7:36 AM Joe Conway  wrote:
>
> On 5/3/24 11:44, Peter Eisentraut wrote:
> > On 03.05.24 16:13, Tom Lane wrote:
> >> Peter Eisentraut  writes:
> >>> On 30.04.24 19:29, Tom Lane wrote:
> >>>> Also, the bigger picture here is the seeming assumption that "if
> >>>> we change pg_trgm then it will be safe to replicate from x86 to
> >>>> arm".  I don't believe that that's a good idea and I'm unwilling
> >>>> to promise that it will work, regardless of what we do about
> >>>> char signedness.  That being the case, I don't want to invest a
> >>>> lot of effort in the signedness issue.  Option (1) is clearly
> >>>> a small change with little if any risk of future breakage.
> >>
> >>> But note that option 1 would prevent some replication that is currently
> >>> working.
> >>
> >> The point of this thread though is that it's working only for small
> >> values of "work".  People are rightfully unhappy if it seems to work
> >> and then later they get bitten by compatibility problems.
> >>
> >> Treating char signedness as a machine property in pg_control would
> >> signal that we don't intend to make it work, and would ensure that
> >> even the most minimal testing would find out that it doesn't work.
> >>
> >> If we do not do that, it seems to me we have to buy into making
> >> it work.  That would mean dealing with the consequences of an
> >> incompatible change in pg_trgm indexes, and then going through
> >> the same dance again the next time(s) similar problems are found.
> >
> > Yes, that is understood.  But anecdotally, replicating between x86-64 and arm64
> > is
> > occasionally used for upgrades or migrations.  In practice, this appears to 
> > have
> > mostly worked.  If we now discover that it won't work with certain index
> > extension modules, it's usable for most users. Even if we say, you have to
> > reindex everything afterwards, it's probably still useful for these 
> > scenarios.
>
> +1

+1

How about extending amcheck to support GIN and GiST indexes so that it
can detect potential data incompatibility due to changing 'char' to
'unsigned char'? I think these new tests would also be useful for
users to check whether they really need to reindex due to such
changes. Also, we could fix pg_trgm so that it uses 'unsigned char' in
PG18. Users who upgrade to PG18 could then run the new amcheck tests
on the primary as well as the standby.
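
A sketch of what that could look like for users; gin_index_check() and
gist_index_check() are hypothetical names here, since amcheck currently
provides only B-tree and heap checks:

CREATE EXTENSION amcheck;
-- hypothetical functions, assuming amcheck gained GIN/GiST support
SELECT gin_index_check('my_gin_trgm_idx'::regclass);
SELECT gist_index_check('my_gist_trgm_idx'::regclass);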

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: First draft of PG 17 release notes

2024-05-14 Thread Masahiko Sawada
On Thu, May 9, 2024 at 10:48 PM Bruce Momjian  wrote:
>
> On Thu, May  9, 2024 at 02:17:12PM +0900, Masahiko Sawada wrote:
> > Hi,
> >
>
> > Also, please consider the following item:
> >
> > - Improve eviction algorithm in ReorderBuffer using max-heap for many
> > subtransactions (5bec1d6bc)
>
> I looked at that item and I don't have a generic "make logical
> replication apply faster" item to merge it into, and many
> subtransactions seemed like enough of an edge-case that I didn't think
> mentioning it make sense.  Can you see a good place to add it?

I think that cases with many subtransactions are no longer edge cases
these days, which is why we needed to improve this, and it might be
helpful for users to mention it. How about the following item, for
example?

Improve logical decoding performance in cases where there are many
subtransactions.

>
> > Finally, should we mention the following commit in the release note?
> > It's not a user-visible change but added a new regression test module.
> >
> > - Add tests for XID wraparound (e255b646a)
>
> I don't normally add testing infrastructure changes unless they are
> major.

I've seen we had such items; for example, in the PG14 release notes:

Add a test module for the regular expression package (Tom Lane)

But if our policy has already changed, I'm okay with not mentioning
the xid_wraparound test in the PG17 release notes.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Fix parallel vacuum buffer usage reporting

2024-05-10 Thread Masahiko Sawada
On Fri, May 10, 2024 at 7:26 PM Nazir Bilal Yavuz  wrote:
>
> Hi,
>
> Thank you for working on this!
>
> On Wed, 1 May 2024 at 06:37, Masahiko Sawada  wrote:
> >
> > Thank you for further testing! I've pushed the patch.
>
> I realized a behaviour change while looking at 'Use pgBufferUsage for
> block reporting in analyze' thread [1]. Since that change applies here
> as well, I thought it is better to mention it here.
>
> Before this commit, VacuumPageMiss did not count the blocks if its
> read was already completed by other backends [2]. Now,
> 'pgBufferUsage.local_blks_read + pgBufferUsage.shared_blks_read'
> counts every block attempted to be read; possibly double counting if
> someone else has already completed the read.

True. IIUC there is such a difference only in HEAD, not in v15 and
v16. The following comment in WaitReadBuffers() says that it's
traditional behavior to count blocks as read even if someone else has
already completed their I/O:

/*
 * We count all these blocks as read by this backend.  This is traditional
 * behavior, but might turn out to be not true if we find that someone
 * else has beaten us and completed the read of some of these blocks.  In
 * that case the system globally double-counts, but we traditionally don't
 * count this as a "hit", and we don't have a separate counter for "miss,
 * but another backend completed the read".
 */

So I think using pgBufferUsage for (parallel) vacuum is legitimate and
makes it more consistent with other parallel operations.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: First draft of PG 17 release notes

2024-05-08 Thread Masahiko Sawada
Hi,

On Thu, May 9, 2024 at 1:03 PM Bruce Momjian  wrote:
>
> I have committed the first draft of the PG 17 release notes;  you can
> see the results here:
>
> https://momjian.us/pgsql_docs/release-17.html

Thank you for working on that!

I'd like to mention some of my work. I think we can add the vacuum
performance improvements made by the following commits:

- Add template for adaptive radix tree (ee1b30f1)
- Add TIDStore, to store sets of TIDs (ItemPointerData) efficiently (30e144287)
- Use TidStore for dead tuple TIDs storage during lazy vacuum (667e65aac)

Also, please consider the following item:

- Improve eviction algorithm in ReorderBuffer using max-heap for many
subtransactions (5bec1d6bc)

Finally, should we mention the following commit in the release note?
It's not a user-visible change but added a new regression test module.

- Add tests for XID wraparound (e255b646a)

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Fix parallel vacuum buffer usage reporting

2024-05-08 Thread Masahiko Sawada
On Fri, May 3, 2024 at 3:41 PM Anthonin Bonnefoy
 wrote:
>
> On Wed, May 1, 2024 at 5:37 AM Masahiko Sawada  wrote:
>>
>> Thank you for further testing! I've pushed the patch.
>
> Thanks!
>
> Here is the rebased version for the follow-up patch removing VacuumPage 
> variables. Though I'm not sure if I should create a dedicated mail thread 
> since the bug was fixed and the follow-up is more of a refactoring. What do 
> you think?

I'd suggest starting a new thread or changing the subject as the
current subject no longer matches what we're discussing.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-05-07 Thread Masahiko Sawada
On Wed, May 1, 2024 at 4:29 PM John Naylor  wrote:
>
> On Thu, Apr 25, 2024 at 8:36 AM Masahiko Sawada  wrote:
> >
> > On Mon, Apr 15, 2024 at 6:12 PM John Naylor  wrote:
>
> > > - RT_KEY_GET_SHIFT is not covered for key=0:
> > >
> > > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L803
> > >
> > > That should be fairly simple to add to the tests.
> >
> > There are two paths to call RT_KEY_GET_SHIFT():
> >
> > 1. RT_SET() -> RT_KEY_GET_SHIFT()
> > 2. RT_SET() -> RT_EXTEND_UP() -> RT_KEY_GET_SHIFT()
> >
> > In both cases, it's called when key > tree->ctl->max_val. Since the
> > minimum value of max_val is 255, RT_KEY_GET_SHIFT() is never called
> > when key=0.
>
> Ah, right, so it is dead code. Nothing to worry about, but it does
> point the way to some simplifications, which I've put together in the
> attached.

Thank you for the patch. It looks good to me.

+   /* compute the smallest shift that will allowing storing the key */
+   start_shift = pg_leftmost_one_pos64(key) / RT_SPAN * RT_SPAN;

The comment was moved from RT_KEY_GET_SHIFT(), but I think s/will
allowing storing/will allow storing/.

>
> > > - RT_DELETE: "if (key > tree->ctl->max_val)" is not covered:
> > >
> > > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L2644
> > >
> > > That should be easy to add.
> >
> > Agreed. The patch is attached.
>
> LGTM
>
> > > - TidStoreCreate* has some memory clamps that are not covered:
> > >
> > > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/backend/access/common/tidstore.c.gcov.html#L179
> > > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/backend/access/common/tidstore.c.gcov.html#L234
> > >
> > > Maybe we could experiment with using 1MB for shared, and something
> > > smaller for local.
> >
> > I've confirmed that the local and shared tidstore with small max sizes
> > such as 4kB and 1MB worked. Currently the max size is hard-coded in
> > test_tidstore.c but if we use work_mem as the max size, we can pass
> > different max sizes for local and shared in the test script.
>
> Seems okay, do you want to try that and see how it looks?

I've attached a simple patch for this. In test_tidstore.sql, we used
to create two local tidstores and one shared tidstore. I thought of
specifying small work_mem values for these three cases, but that would
remove the normal test cases. So I created a separate tidstore for
this test. Also, the new test only checks whether a tidstore can be
created with such a small size; it might be a good idea to add some
TIDs to check that it really works.
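
With that, the script could exercise a tiny tidstore along these lines
(a sketch assuming test_tidstore's existing helper functions):

SET work_mem = '256kB';
SELECT test_create(true);  -- shared tidstore, capped at work_mem
SELECT do_set_block_offsets(1, array[1,2,3]::int2[]);
SELECT check_set_block_offsets();
SELECT test_destroy();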

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


use_work_mem_as_max_bytes.patch
Description: Binary data


Re: Fix parallel vacuum buffer usage reporting

2024-04-30 Thread Masahiko Sawada
On Tue, Apr 30, 2024 at 3:34 PM Anthonin Bonnefoy
 wrote:
>
> I've done some additional tests to validate the reported numbers. Using 
> pg_statio, it's possible to get the minimum number of block hits (Full script 
> attached).
>
> -- Save block hits before vacuum
> SELECT pg_stat_force_next_flush();
> SELECT heap_blks_hit, idx_blks_hit FROM pg_statio_all_tables where 
> relname='vestat' \gset
> vacuum (verbose, index_cleanup on) vestat;
> -- Check the difference
> SELECT pg_stat_force_next_flush();
> SELECT heap_blks_hit - :heap_blks_hit as delta_heap_hit,
>idx_blks_hit - :idx_blks_hit as delta_idx_hit,
>heap_blks_hit - :heap_blks_hit + idx_blks_hit - :idx_blks_hit as sum
> FROM pg_statio_all_tables where relname='vestat';
>
> Output:
> ...
> buffer usage: 14676 hits, 0 misses, 667 dirtied
> buffer usage (new): 16081 hits, 0 misses, 667 dirtied
> ...
>  -[ RECORD 1 ]--+--
> delta_heap_hit | 9747
> delta_idx_hit  | 6325
> sum            | 16072
>
> From pg_statio, we had 16072 blocks for the relation + indexes.
> Pre-patch, we are under reporting with 14676.
> Post-patch, we have 16081. The 9 additional block hits come from vacuum 
> accessing catalog tables like pg_class or pg_class_oid_index.
>

Thank you for further testing! I've pushed the patch.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: pg_trgm comparison bug on cross-architecture replication due to different char implementation

2024-04-30 Thread Masahiko Sawada
On Wed, May 1, 2024 at 2:29 AM Tom Lane  wrote:
>
> Alexander Korotkov  writes:
> > I agree that storing char signedness might seem weird.  But it appears
> > that we already store indexes that depend on char signedness.  So,
> > it's effectively property of bits-on-disk even though it affects
> > indirectly.  Then I see two options to make the picture consistent.
> > 1) Assume that char signedness is somehow a property of bits-on-disk
> > even though it's weird.  Then pg_trgm indexes are correct, but we need
> > to store char signedness in pg_control.
> > 2) Assume that char signedness is not a property of bits-on-disk.
> > Then pg_trgm indexes are buggy and need to be fixed.
> > What do you think?
>
> The problem with option (2) is the assumption that pg_trgm's behavior
> is the only bug of this kind, either now or in the future.  I think
> that's just about an impossible standard to meet, because there's no
> realistic way to test whether char signedness is affecting things.
> (Sure, you can compare results across platforms, but maybe you
> just didn't test the right case.)
>
> Also, the bigger picture here is the seeming assumption that "if
> we change pg_trgm then it will be safe to replicate from x86 to
> arm".  I don't believe that that's a good idea and I'm unwilling
> to promise that it will work, regardless of what we do about
> char signedness.  That being the case, I don't want to invest a
> lot of effort in the signedness issue.

I think that the char signedness issue is also an issue for developers
(and extension authors), since it could lead to confusion and
potential bugs in the future. x86 developers tend to think of char as
always being signed and write code that will misbehave on arm
machines. For example, since logical replication should behave
correctly even in cross-arch replication, all developers need to be
aware of this. I thought of using the -funsigned-char (or
-fsigned-char) compiler flag to avoid this, but it would have a
broader impact, not only on indexes.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: pg_trgm comparison bug on cross-architecture replication due to different char implementation

2024-04-30 Thread Masahiko Sawada
On Tue, Apr 30, 2024 at 12:37 PM Tom Lane  wrote:
>
> Masahiko Sawada  writes:
> > On Tue, Apr 23, 2024 at 11:57 PM Tom Lane  wrote:
> >> Reject as not a bug.  Discourage people from thinking that physical
> >> replication will work across architectures.
>
> > While cross-arch physical replication is not supported, I think having
> > architecture dependent differences is not good and It's legitimate to
> > fix it. FYI the 'char' data type comparison is done as though char is
> > unsigned. I've attached a small patch to fix it. What do you think?
>
> I think this will break existing indexes that are working fine.
> Yeah, it would have been better to avoid the difference, but
> it's too late now.

True. So it will be a PG18 item.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: pg_trgm comparison bug on cross-architecture replication due to different char implementation

2024-04-29 Thread Masahiko Sawada
On Tue, Apr 23, 2024 at 11:57 PM Tom Lane  wrote:
>
> "Guo, Adam"  writes:
> > I would like to report an issue with the pg_trgm extension on
> > cross-architecture replication scenarios. When an x86_64 standby
> > server is replicating from an aarch64 primary server or vice versa,
> > the gist_trgm_ops opclass returns different results on the primary
> > and standby.
>
> I do not think that is a supported scenario.  Hash functions and
> suchlike are not guaranteed to produce the same results on different
> CPU architectures.  As a quick example, I get
>
> regression=# select hashfloat8(34);
>  hashfloat8
> ------------
>    21570837
> (1 row)
>
> on x86_64 but
>
> postgres=# select hashfloat8(34);
>  hashfloat8
> ------------
>  -602898821
> (1 row)
>
> on ppc32 thanks to the endianness difference.
>
> > Given that this has problem has come up before and seems likely to
> > come up again, I'm curious what other broad solutions there might be
> > to resolve it?
>
> Reject as not a bug.  Discourage people from thinking that physical
> replication will work across architectures.

While cross-arch physical replication is not supported, I think having
architecture-dependent differences is not good, and it's legitimate to
fix this. FYI, the 'char' data type comparison is done as though char
is unsigned. I've attached a small patch to fix it. What do you think?

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


fix_signedness_issue_in_pg_trgm.patch
Description: Binary data


Re: Fix parallel vacuum buffer usage reporting

2024-04-29 Thread Masahiko Sawada
On Fri, Apr 26, 2024 at 9:12 PM Alena Rybakina  wrote:
>
> Hi!
>>
>> The same script was run, but using vacuum verbose analyze, and I saw the 
>> difference again in the fifth step:
>> with your patch: buffer usage: 32312 hits, 607 misses, 1566 dirtied
>> master: buffer usage: 32346 hits, 573 misses, 1360 dirtied
>
> Isn't there a chance for the checkpointer to run during this time? That could 
> make the conditions between the two runs slightly different and explain the 
> change in buffer report.
>
> [0] 
> https://github.com/postgres/postgres/blob/8a1b31e6e59631807a08a4e9465134c343bbdf5e/src/backend/access/heap/vacuumlazy.c#L2826-L2831
>
> Looking at the script, you won't trigger the problem.
>
> Thank you for the link; I accounted for it in my next experiments.
>
> I repeated the test without processing checkpoints with a single index, and 
> the number of pages in the buffer used almost matched:
>
> master branch: buffer usage: 32315 hits, 606 misses, 4486 dirtied
>
> with applied patch v4 version: buffer usage: 32315 hits, 606 misses, 4489 
> dirtied
>
> I think you are right - the problem was interference from the checkpoint
> process; by the way, I checked the first version of the patch. To cut a long
> story short, everything is fine now with one index.
>
> Just in case, I'll explain: I considered this case because your patch could
> have influenced it too.
>
> On 25.04.2024 10:17, Anthonin Bonnefoy wrote:
>
>
> On Wed, Apr 24, 2024 at 4:01 PM Alena Rybakina  
> wrote:
>>
>> I tested the main postgres branch with and without your fix using a script 
>> that was written by me. It consists of five scenarios and I made a 
>> comparison in the logs between the original version of the master branch and 
>> the master branch with your patch:
>
>  Hi! Thanks for the tests.
>
>> I have attached a test file (vacuum_check_logs.sql)
>
> The reporting issue will only happen if there's a parallel index vacuum and 
> it will only happen if there's at least 2 indexes [0]. You will need to 
> create an additional index.
>
> Speaking of the problem, I added another index and repeated the test and 
> found a significant difference:
>
> I found it when I committed the transaction (3):
>
> master: 2964 hits, 0 misses, 0 dirtied
>
> with applied patch v4 version: buffer usage: 33013 hits, 0 misses, 3 dirtied
>
> When I deleted all the data from the table and later started vacuum verbose 
> again (4):
>
> master: buffer usage: 51486 hits, 0 misses, 0 dirtied
>
> with applied patch v4 version: buffer usage: 77924 hits, 0 misses, 0 dirtied
>
> when I inserted 1 million rows into the table and updated them (5):
>
> master: buffer usage: 27904 hits, 5021 misses, 1777 dirtied
>
> with applied patch v4 version: buffer usage: 41051 hits, 9973 misses, 2564
> dirtied
>
> As I see, the number of pages is significantly higher than it was in the
> master branch and, frankly, I couldn't fully figure out whether it was a
> mistake or not.

I think that the patch fixes the problem correctly.

I've run pgindent and updated the commit message. I realized that
parallel vacuum was introduced in pg13, but buffer usage reporting in
the VACUUM command was implemented in pg15. Therefore, in pg13 and
pg14, VACUUM (PARALLEL) is available but VACUUM (PARALLEL, VERBOSE)
doesn't show the buffer usage report. Autovacuum does show the buffer
usage report, but parallel autovacuum is not supported. Therefore, we
should backpatch it down to 15, not 13.
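
For context, the report in question is the "buffer usage: ... hits,
... misses, ... dirtied" line that VERBOSE prints on pg15 and later,
for example (the table name is a placeholder):

VACUUM (PARALLEL 2, VERBOSE) mytable;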

I'm going to push the patch down to pg15, barring any objections.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


v5-0001-Fix-parallel-vacuum-buffer-usage-reporting.patch
Description: Binary data


Re: New committers: Melanie Plageman, Richard Guo

2024-04-27 Thread Masahiko Sawada
On Fri, Apr 26, 2024 at 8:54 PM Jonathan S. Katz  wrote:
>
> The Core Team would like to extend our congratulations to Melanie
> Plageman and Richard Guo, who have accepted invitations to become our
> newest PostgreSQL committers.
>
> Please join us in wishing them much success and few reverts!
>

Congratulations to both!

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-04-25 Thread Masahiko Sawada
On Thu, Apr 25, 2024 at 1:38 PM Masahiko Sawada  wrote:
>
> On Thu, Apr 25, 2024 at 12:17 PM John Naylor  wrote:
> >
> > On Thu, Apr 25, 2024 at 9:50 AM Masahiko Sawada  
> > wrote:
> > >
> > > > I saw a SIGSEGV there when using tidstore to write a fix for something 
> > > > else.
> > > > Patch attached.
> > >
> > > Great find, thank you for the patch!
> >
> > +1
> >
> > (This occurred to me a few days ago, but I was far from my computer.)
> >
> > With the purge function that  Noah proposed, I believe we can also get
> > rid of the comment at the top of the .sql test file warning of a
> > maintenance hazard:
> > ..."To avoid adding duplicates,
> > -- each call to do_set_block_offsets() should use different block
> > -- numbers."
>
> Good point. Removed.
>
> >
> > > of do_set_block_offsets() and check_set_block_offsets(). If these are
> > > annoying, we can remove the cases of array[1] and array[1,2].
> >
> > Let's keep those -- 32-bit platforms should also exercise this path.
>
> Agreed.
>
> I've attached a new patch. I'll push it tonight, if there is no further 
> comment.
>

Pushed.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-04-24 Thread Masahiko Sawada
On Thu, Apr 25, 2024 at 12:17 PM John Naylor  wrote:
>
> On Thu, Apr 25, 2024 at 9:50 AM Masahiko Sawada  wrote:
> >
> > > I saw a SIGSEGV there when using tidstore to write a fix for something 
> > > else.
> > > Patch attached.
> >
> > Great find, thank you for the patch!
>
> +1
>
> (This occurred to me a few days ago, but I was far from my computer.)
>
> With the purge function that  Noah proposed, I believe we can also get
> rid of the comment at the top of the .sql test file warning of a
> maintenance hazard:
> ..."To avoid adding duplicates,
> -- each call to do_set_block_offsets() should use different block
> -- numbers."

Good point. Removed.

>
> > of do_set_block_offsets() and check_set_block_offsets(). If these are
> > annoying, we can remove the cases of array[1] and array[1,2].
>
> Let's keep those -- 32-bit platforms should also exercise this path.

Agreed.

I've attached a new patch. I'll push it tonight, if there is no further comment.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


v2-0001-radixtree-Fix-SIGSEGV-at-update-of-embeddable-val.patch
Description: Binary data


Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-04-24 Thread Masahiko Sawada
On Thu, Apr 25, 2024 at 6:03 AM Noah Misch  wrote:
>
> On Mon, Apr 15, 2024 at 04:12:38PM +0700, John Naylor wrote:
> > - Some paths for single-value leaves are not covered:
> >
> > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L904
> > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L954
> > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L2606
> >
> > However, these paths do get regression test coverage on 32-bit
> > machines. 64-bit builds only have leaves in the TID store, which
> > doesn't (currently) delete entries, and doesn't instantiate the tree
> > with the debug option.
> >
> > - In RT_SET "if (found)" is not covered:
> >
> > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L1768
> >
> > That's because we don't yet have code that replaces an existing value
> > with a value of a different length.
>
> I saw a SIGSEGV there when using tidstore to write a fix for something else.
> Patch attached.

Great find, thank you for the patch!

The fix looks good to me. I think we can improve the regression tests
for better coverage. In TidStore on a 64-bit machine, we can store 3
offsets in the header, and these values are embedded in the leaf.
With more than 3 offsets, the value size becomes more than 16 bytes
and it becomes a single-value leaf. Therefore, if we add a test with
array[1,2,3,4,100], we can cover the case of replacing a single-value
leaf with a new single-value leaf of a different size. We now add 9
pairs of do_set_block_offsets() and check_set_block_offsets(). If
these are annoying, we can remove the cases of array[1] and array[1,2].
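
A sketch of the suggested addition, using test_tidstore's existing
helpers:

-- 3 offsets fit in the embedded value; updating to 5 offsets replaces
-- it with a single-value leaf whose size differs
SELECT do_set_block_offsets(1, array[1,2,3]::int2[]);
SELECT do_set_block_offsets(1, array[1,2,3,4,100]::int2[]);
SELECT check_set_block_offsets();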

I've attached a new patch. In addition to the new test case I
mentioned, I've added some new comments and removed an unnecessary
added line in test_tidstore.sql.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


0001-radixtree-Fix-SIGSEGV-at-update-of-embeddable-value-.patch
Description: Binary data


Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-04-24 Thread Masahiko Sawada
On Mon, Apr 15, 2024 at 6:12 PM John Naylor  wrote:
>
> I took a look at the coverage report from [1] and it seems pretty
> good, but there are a couple more tests we could do.

Thank you for checking!

>
> - RT_KEY_GET_SHIFT is not covered for key=0:
>
> https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L803
>
> That should be fairly simple to add to the tests.

There are two paths to call RT_KEY_GET_SHIFT():

1. RT_SET() -> RT_KEY_GET_SHIFT()
2. RT_SET() -> RT_EXTEND_UP() -> RT_KEY_GET_SHIFT()

In both cases, it's called when key > tree->ctl->max_val. Since the
minimum value of max_val is 255, RT_KEY_GET_SHIFT() is never called
when key=0.

>
> - Some paths for single-value leaves are not covered:
>
> https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L904
> https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L954
> https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L2606
>
> However, these paths do get regression test coverage on 32-bit
> machines. 64-bit builds only have leaves in the TID store, which
> doesn't (currently) delete entries, and doesn't instantiate the tree
> with the debug option.

Right.

>
> - In RT_SET "if (found)" is not covered:
>
> https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L1768
>
> That's because we don't yet have code that replaces an existing value
> with a value of a different length.

Noah reported an issue around that. We should incorporate the patch
and cover this code path.

>
> - RT_FREE_RECURSE isn't well covered:
>
> https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L1768
>
> The TID store test is pretty simple as far as distribution of block
> keys, and focuses more on the offset bitmaps. We could try to cover
> all branches here, but it would make the test less readable, and it's
> kind of the wrong place to do that anyway. test_radixtree.c does have
> a commented-out option to use shared memory, but that's for local
> testing and won't be reflected in the coverage report. Maybe it's
> enough.

Agreed.

>
> - RT_DELETE: "if (key > tree->ctl->max_val)" is not covered:
>
> https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L2644
>
> That should be easy to add.

Agreed. The patch is attached.

>
> - RT_DUMP_NODE is not covered, and never called by default anyway:
>
> https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L2804
>
> It seems we could just leave it alone since it's debug-only, but it's
> also a lot of lines. One idea is to use elog with DEBUG5 instead of
> commenting out the call sites, but that would cause a lot of noise.

I think we can leave it alone.

>
> - TidStoreCreate* has some memory clamps that are not covered:
>
> https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/backend/access/common/tidstore.c.gcov.html#L179
> https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/backend/access/common/tidstore.c.gcov.html#L234
>
> Maybe we could experiment with using 1MB for shared, and something
> smaller for local.

I've confirmed that local and shared tidstores with small max sizes
such as 4kB and 1MB work. Currently the max size is hard-coded in
test_tidstore.c, but if we use work_mem as the max size, we can pass
different max sizes for local and shared in the test script.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


improve_code_coverage_radixtree.patch
Description: Binary data


Re: Fix parallel vacuum buffer usage reporting

2024-04-24 Thread Masahiko Sawada
On Mon, Apr 22, 2024 at 5:07 PM Anthonin Bonnefoy
 wrote:
>
> On Sat, Apr 20, 2024 at 2:00 PM Alena Rybakina  
> wrote:
>>
>> Hi, thank you for your work with this subject.
>>
>> While I was reviewing your code, I noticed that your patch conflicts with 
>> another patch [0] that been committed.
>>
>> [0] 
>> https://www.postgresql.org/message-id/flat/CA%2BhUKGJkOiOCa%2Bmag4BF%2BzHo7qo%3Do9CFheB8%3Dg6uT5TUm2gkvA%40mail.gmail.com
>
>
> I've rebased the patch and also split the changes:

Thank you for updating the patch!

> 1: Use pgBufferUsage in Vacuum and Analyze block reporting

I think that if the analyze command doesn't have the same issue, we
don't need to change it. Making vacuum and analyze consistent is a
good point, but I'd like to avoid making unnecessary changes in back
branches. I think the patch set would contain:

(a) make lazy vacuum use BufferUsage instead of
VacuumPage{Hit,Miss,Dirty}. (backpatched down to pg13).
(b) make analyze use BufferUsage and remove VacuumPage{Hit,Miss,Dirty}
variables for consistency and simplicity (only for HEAD, if we agree).

BTW, I realized that VACUUM VERBOSE running on a temp table always
shows the number of dirtied buffers as 0, which seems to be a bug.
Patch (a) will resolve it as well.
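
A quick repro sketch of that temp-table report issue (table and sizes
are illustrative):

CREATE TEMP TABLE tt AS SELECT generate_series(1, 100000) AS i;
DELETE FROM tt;
VACUUM (VERBOSE) tt;  -- "buffer usage" reports 0 dirtied despite
                      -- dirtying local buffers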

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: promotion related handling in pg_sync_replication_slots()

2024-04-23 Thread Masahiko Sawada
On Tue, Apr 23, 2024 at 12:37 PM Amit Kapila  wrote:
>
> On Mon, Apr 22, 2024 at 7:04 PM Masahiko Sawada  wrote:
> >
> > On Mon, Apr 22, 2024 at 9:02 PM shveta malik  wrote:
> > >
> > > Thanks for the patch, the changes look good Amit. Please find the merged 
> > > patch.
> > >
> >
> > I've reviewed the patch and have some comments:
> >
> > ---
> > /*
> > -* Early initialization.
> > +* Register slotsync_worker_onexit() before we register
> > +* ReplicationSlotShmemExit() in BaseInit(), to ensure that during the
> > +* exit of the slot sync worker, ReplicationSlotShmemExit() is called
> > +* first, followed by slotsync_worker_onexit(). The startup process 
> > during
> > +* promotion invokes ShutDownSlotSync() which waits for slot sync to
> > +* finish and it does that by checking the 'syncing' flag. Thus worker
> > +* must be done with the slots' release and cleanup before it marks 
> > itself
> > +* as finished syncing.
> >  */
> >
> > I'm slightly worried that we register the slotsync_worker_onexit()
> > callback before BaseInit(), because it could be a blocker when we want
> > to add more work in the callback, for example sending the stats.
> >
>
> The other possibility is that we do slot release/clean up in the
> slotsync_worker_onexit() call itself and then we can do it after
> BaseInit().

This approach sounds clearer and safer to me. The current approach
relies on the callback registration order of
ReplicationSlotShmemExit(). If it changes in the future, we will
silently have the same problem. All slot-sync-related work should be
done before allowing someone to touch synced slots by clearing the
'syncing' flag.

>
> > ---
> > synchronize_slots(wrconn);
> > +
> > +   /* Cleanup the temporary slots */
> > +   ReplicationSlotCleanup();
> > +
> > +   /* We are done with sync, so reset sync flag */
> > +   reset_syncing_flag();
> >
> > I think it ends up removing other temp slots that are created by the
> > same backend process using
> > pg_create_{physical,logical}_replication_slot() functions, which could
> > be a large side effect of this function for users.
> >
>
> True, I think here we should remove only temporary-and-synced marked
> slots. The other possibility is to create slots as RS_EPHEMERAL
> initially when called from the SQL function, but that doesn't sound
> like a neat approach.
>
> >
>  Also, if users want
> > to have a process periodically calling pg_sync_replication_slots()
> > instead of the slotsync worker, it doesn't support a case where we
> > create a temp not-ready slot and turn it into a persistent slot if
> > it's ready for sync.
> >
>
> True, but eventually the API should be able to directly create the
> persistent slots and anyway this can happen only for the first time
> (till the slots are created and marked persistent) and one who wants
> to use this function periodically should be able to see regular syncs.

I agree that we remove temp-and-synced slots created via the API at
the end of the API call. We end up creating and dropping slots in
every API call, but since pg_sync_replication_slots() is a kind of
debug-purpose function and it will not be common to call it regularly
instead of using the slotsync worker, we can live with such overhead.
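
For reference, that usage pattern is simply invoking the function
periodically on the standby, for example:

-- on the standby, instead of enabling the slotsync worker
SELECT pg_sync_replication_slots();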

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-04-22 Thread Masahiko Sawada
Hi,

On Thu, Apr 4, 2024 at 9:23 PM Bharath Rupireddy
 wrote:
>
> On Thu, Apr 4, 2024 at 4:35 PM Amit Kapila  wrote:
> >
> > > Thanks for the changes. v34-0001 LGTM.
> >
> > I was doing a final review before pushing 0001 and found that
> > 'inactive_since' could be set twice during startup after promotion,
> > once while restoring slots and then via ShutDownSlotSync(). The reason
> > is that ShutDownSlotSync() will be invoked in normal startup on
> > primary though it won't do anything apart from setting inactive_since
> > if we have synced slots. I think you need to check 'StandbyMode' in
> > update_synced_slots_inactive_since() and return if the same is not
> > set. We can't use 'InRecovery' flag as that will be set even during
> > crash recovery.
> >
> > Can you please test this once unless you don't agree with the above theory?
>
> Nice catch. I've verified that update_synced_slots_inactive_since is
> called even for normal server startups/crash recovery. I've added a
> check to exit if the StandbyMode isn't set.
>
> Please find the attached v35 patch.
>

The documentation about both the 'active' and 'inactive_since'
columns of pg_replication_slots says:

---
active bool
True if this slot is currently actively being used

inactive_since timestamptz
The time since the slot has become inactive. NULL if the slot is
currently being used. Note that for slots on the standby that are
being synced from a primary server (whose synced field is true), the
inactive_since indicates the last synchronization (see Section 47.2.3)
time.
---

When reading the description, I thought that if 'active' is true,
'inactive_since' is NULL, but that doesn't seem to apply to temporary
slots. Since we don't reset the active_pid field of temporary slots
when we release them, 'active' is still true in the view but
'inactive_since' is not NULL. Do you think we need to mention it in
the documentation?
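
A sketch of the behavior being described, run from the session that
created the slot:

SELECT pg_create_logical_replication_slot('tmp_slot', 'test_decoding',
                                          temporary := true);
-- the creating function has returned (the slot is released), yet:
SELECT slot_name, active, inactive_since
FROM pg_replication_slots WHERE slot_name = 'tmp_slot';
-- shows active = t together with a non-NULL inactive_since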

As for the timeout-based slot invalidation feature, we could end up
invalidating the temporary slots even if they are shown as active,
which could confuse users. Do we want to somehow deal with it?

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: promotion related handling in pg_sync_replication_slots()

2024-04-22 Thread Masahiko Sawada
On Mon, Apr 22, 2024 at 9:02 PM shveta malik  wrote:
>
> On Mon, Apr 22, 2024 at 5:10 PM Amit Kapila  wrote:
> >
> > On Fri, Apr 19, 2024 at 1:52 PM shveta malik  wrote:
> > >
> > > Please find v9 with the above comments addressed.
> > >
> >
> > I have made minor modifications in the comments and a function name.
> > Please see the attached top-up patch. Apart from this, the patch looks
> > good to me.
>
> Thanks for the patch, the changes look good Amit. Please find the merged 
> patch.
>

I've reviewed the patch and have some comments:

---
/*
-* Early initialization.
+* Register slotsync_worker_onexit() before we register
+* ReplicationSlotShmemExit() in BaseInit(), to ensure that during the
+* exit of the slot sync worker, ReplicationSlotShmemExit() is called
+* first, followed by slotsync_worker_onexit(). The startup process during
+* promotion invokes ShutDownSlotSync() which waits for slot sync to
+* finish and it does that by checking the 'syncing' flag. Thus worker
+* must be done with the slots' release and cleanup before it marks itself
+* as finished syncing.
 */

I'm slightly worried that we register the slotsync_worker_onexit()
callback before BaseInit(), because it could be a blocker when we want
to add more work in the callback, for example sending the stats.

---
synchronize_slots(wrconn);
+
+   /* Cleanup the temporary slots */
+   ReplicationSlotCleanup();
+
+   /* We are done with sync, so reset sync flag */
+   reset_syncing_flag();

I think it ends up removing other temp slots that are created by the
same backend process using
pg_create_{physical,logical}_replication_slot() functions, which could
be a large side effect of this function for users. Also, if users want
to have a process periodically calling pg_sync_replication_slots()
instead of the slotsync worker, it doesn't support a case where we
create a temp not-ready slot and turn it into a persistent slot if
it's ready for sync.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

2024-04-17 Thread Masahiko Sawada
On Wed, Apr 17, 2024 at 4:28 PM torikoshia  wrote:
>
> On 2024-04-16 13:16, Masahiko Sawada wrote:
> > On Tue, Apr 2, 2024 at 7:34 PM torikoshia 
> > wrote:
> >>
> >> On 2024-04-01 11:31, Masahiko Sawada wrote:
> >> > On Fri, Mar 29, 2024 at 11:54 AM torikoshia
> >> >  wrote:
> >> >>
> >> >> On 2024-03-28 21:54, Masahiko Sawada wrote:
> >> >> > On Thu, Mar 28, 2024 at 9:38 PM torikoshia 
> >> >> > 
> >> >> > wrote:
> >> >> >>
> >> >> >> On 2024-03-28 10:20, Masahiko Sawada wrote:
> >> >> >> > Hi,
> >> >> >> >
> >> >> >> > On Thu, Jan 18, 2024 at 5:33 PM Masahiko Sawada 
> >> >> >> > 
> >> >> >> > wrote:
> >> >> >> >>
> >> >> >> >> On Thu, Jan 18, 2024 at 4:59 PM Alexander Korotkov
> >> >> >> >>  wrote:
> >> >> >> >> >
> >> >> >> >> > On Thu, Jan 18, 2024 at 4:16 AM torikoshia 
> >> >> >> >> >  wrote:
> >> >> >> >> > > On 2024-01-18 10:10, jian he wrote:
> >> >> >> >> > > > On Thu, Jan 18, 2024 at 8:57 AM Masahiko Sawada 
> >> >> >> >> > > > 
> >> >> >> >> > > > wrote:
> >> >> >> >> > > >> On Thu, Jan 18, 2024 at 6:38 AM Tom Lane 
> >> >> >> >> > > >>  wrote:
> >> >> >> >> > > >> > Kyotaro-san's suggestion isn't bad, though I might 
> >> >> >> >> > > >> > shorten it to
> >> >> >> >> > > >> > error_action {error|ignore|log} (or perhaps "stop" 
> >> >> >> >> > > >> > instead of "error")?
> >> >> >> >> > > >> > You will need a separate parameter anyway to specify the 
> >> >> >> >> > > >> > destination
> >> >> >> >> > > >> > of "log", unless "none" became an illegal table name 
> >> >> >> >> > > >> > when I wasn't
> >> >> >> >> > > >> > looking.  I don't buy that one parameter that has some 
> >> >> >> >> > > >> > special values
> >> >> >> >> > > >> > while other values could be names will be a good design. 
> >> >> >> >> > > >> >  Moreover,
> >> >> >> >> > > >> > what if we want to support (say) log-to-file along with 
> >> >> >> >> > > >> > log-to-table?
> >> >> >> >> > > >> > Trying to distinguish a file name from a table name 
> >> >> >> >> > > >> > without any other
> >> >> >> >> > > >> > context seems impossible.
> >> >> >> >> > > >>
> >> >> >> >> > > >> I've been thinking we can add more values to this option 
> >> >> >> >> > > >> to log errors
> >> >> >> >> > > >> not only to the server logs but also to the error table 
> >> >> >> >> > > >> (not sure
> >> >> >> >> > > >> details but I imagined an error table is created for each 
> >> >> >> >> > > >> table on
> >> >> >> >> > > >> error), without an additional option for the destination 
> >> >> >> >> > > >> name. The
> >> >> >> >> > > >> values would be like error_action 
> >> >> >> >> > > >> {error|ignore|save-logs|save-table}.
> >> >> >> >> > > >>
> >> >> >> >> > > >
> >> >> >> >> > > > another idea:
> >> >> >> >> > > > on_error {error|ignore|other_future_option}
> >> >> >> >> > > > if not specified then by default ERROR.
> >> >> >> >> > > > You can also specify ERROR or IGNORE for now.

Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

2024-04-15 Thread Masahiko Sawada
On Tue, Apr 2, 2024 at 7:34 PM torikoshia  wrote:
>
> On 2024-04-01 11:31, Masahiko Sawada wrote:
> > On Fri, Mar 29, 2024 at 11:54 AM torikoshia
> >  wrote:
> >>
> >> On 2024-03-28 21:54, Masahiko Sawada wrote:
> >> > On Thu, Mar 28, 2024 at 9:38 PM torikoshia 
> >> > wrote:
> >> >>
> >> >> On 2024-03-28 10:20, Masahiko Sawada wrote:
> >> >> > Hi,
> >> >> >
> >> >> > On Thu, Jan 18, 2024 at 5:33 PM Masahiko Sawada 
> >> >> > 
> >> >> > wrote:
> >> >> >>
> >> >> >> On Thu, Jan 18, 2024 at 4:59 PM Alexander Korotkov
> >> >> >>  wrote:
> >> >> >> >
> >> >> >> > On Thu, Jan 18, 2024 at 4:16 AM torikoshia 
> >> >> >> >  wrote:
> >> >> >> > > On 2024-01-18 10:10, jian he wrote:
> >> >> >> > > > On Thu, Jan 18, 2024 at 8:57 AM Masahiko Sawada 
> >> >> >> > > > 
> >> >> >> > > > wrote:
> >> >> >> > > >> On Thu, Jan 18, 2024 at 6:38 AM Tom Lane  
> >> >> >> > > >> wrote:
> >> >> >> > > >> > Kyotaro-san's suggestion isn't bad, though I might shorten 
> >> >> >> > > >> > it to
> >> >> >> > > >> > error_action {error|ignore|log} (or perhaps "stop" instead 
> >> >> >> > > >> > of "error")?
> >> >> >> > > >> > You will need a separate parameter anyway to specify the 
> >> >> >> > > >> > destination
> >> >> >> > > >> > of "log", unless "none" became an illegal table name when I 
> >> >> >> > > >> > wasn't
> >> >> >> > > >> > looking.  I don't buy that one parameter that has some 
> >> >> >> > > >> > special values
> >> >> >> > > >> > while other values could be names will be a good design.  
> >> >> >> > > >> > Moreover,
> >> >> >> > > >> > what if we want to support (say) log-to-file along with 
> >> >> >> > > >> > log-to-table?
> >> >> >> > > >> > Trying to distinguish a file name from a table name without 
> >> >> >> > > >> > any other
> >> >> >> > > >> > context seems impossible.
> >> >> >> > > >>
> >> >> >> > > >> I've been thinking we can add more values to this option to 
> >> >> >> > > >> log errors
> >> >> >> > > >> not only to the server logs but also to the error table (not 
> >> >> >> > > >> sure
> >> >> >> > > >> details but I imagined an error table is created for each 
> >> >> >> > > >> table on
> >> >> >> > > >> error), without an additional option for the destination 
> >> >> >> > > >> name. The
> >> >> >> > > >> values would be like error_action 
> >> >> >> > > >> {error|ignore|save-logs|save-table}.
> >> >> >> > > >>
> >> >> >> > > >
> >> >> >> > > > another idea:
> >> >> >> > > > on_error {error|ignore|other_future_option}
> >> >> >> > > > if not specified then by default ERROR.
> >> >> >> > > > You can also specify ERROR or IGNORE for now.
> >> >> >> > > >
> >> >> >> > > > I agree, the parameter "error_action" is better than 
> >> >> >> > > > "location".
> >> >> >> > >
> >> >> >> > > I'm not sure whether error_action or on_error is better, but 
> >> >> >> > > either way
> >> >> >> > > "error_action error" and "on_error error" seems a bit odd to me.
> >> >> >> > > I feel "stop" is better for both cases as Tom suggested.
> >> >> >> >

Re: Improve eviction algorithm in ReorderBuffer

2024-04-11 Thread Masahiko Sawada
On Thu, Apr 11, 2024 at 10:46 AM Hayato Kuroda (Fujitsu)
 wrote:
>
> Dear Heikki,
>
> I also prototyped the idea, which has almost the same shape.
> I attached just in case, but we may not have to see.
>
> Few comments based on the experiment.

Thank you for reviewing the patch. I didn't include the following
suggestions since I first wanted to fix just the binaryheap part while
keeping the other parts. If we need these changes, we can do them in
separate commits as fixes.

>
> ```
> +   /* txn_heap is ordered by transaction size */
> +   buffer->txn_heap = pairingheap_allocate(ReorderBufferTXNSizeCompare, 
> NULL);
> ```
>
> I think the pairing heap should be in the same MemoryContext with the buffer.
> Can we add MemoryContextSwithTo()?

pairingheap_allocate() allocates a tiny amount of memory for the
pairing heap, and its memory usage doesn't grow even when adding more
data. And since it's allocated in the logical decoding context, its
lifetime is also fine. So I'm not sure it's worth including it in
rb->context for better memory accountability.

>
> ```
> +   /* Update the max-heap */
> +   if (oldsize != 0)
> +   pairingheap_remove(rb->txn_heap, &txn->txn_node);
> +   pairingheap_add(rb->txn_heap, &txn->txn_node);
> ...
> +   /* Update the max-heap */
> +   pairingheap_remove(rb->txn_heap, &txn->txn_node);
> +   if (txn->size != 0)
> +   pairingheap_add(rb->txn_heap, &txn->txn_node);
> ```
>
> Since the number of stored transactions does not affect the insert
> operation, we may be able
> to add the node while creating ReorderBufferTXN and remove it while
> cleaning it up. This can
> reduce branches in ReorderBufferChangeMemoryUpdate().

I think it also means that we would need to remove the entry while
cleaning up even if it doesn't have any changes, which is done in
O(log n). I feel the current approach, where we don't store
transactions with size 0 in the heap, is better, and I'm not sure that
reducing these branches really contributes to the performance
improvements.


Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Improve eviction algorithm in ReorderBuffer

2024-04-11 Thread Masahiko Sawada
On Thu, Apr 11, 2024 at 11:52 AM Masahiko Sawada  wrote:
>
> We can see 2% ~ 3% performance regressions compared to the current
> HEAD, but it's much smaller than I expected. Given that we can make
> the code simple, I think we can go with this direction.

Pushed the patch and reverted binaryheap changes.


Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Improve eviction algorithm in ReorderBuffer

2024-04-10 Thread Masahiko Sawada
On Thu, Apr 11, 2024 at 10:32 AM Masahiko Sawada  wrote:
>
> Hi,
>
> Sorry for the late reply, I took two days off.
>
> On Thu, Apr 11, 2024 at 6:20 AM Heikki Linnakangas  wrote:
> >
> > On 10/04/2024 08:31, Amit Kapila wrote:
> > > On Wed, Apr 10, 2024 at 11:00 AM Heikki Linnakangas  
> > > wrote:
> > >>
> > >> On 10/04/2024 07:45, Michael Paquier wrote:
> > >>> On Tue, Apr 09, 2024 at 09:16:53PM -0700, Jeff Davis wrote:
> > >>>> On Wed, 2024-04-10 at 12:13 +0900, Michael Paquier wrote:
> > >>>>> Wouldn't the best way forward be to revert
> > >>>>> 5bec1d6bc5e3 and revisit the whole in v18?
> > >>>>
> > >>>> Also consider commits b840508644 and bcb14f4abc.
> > >>>
> > >>> Indeed.  These are also linked.
> > >>
> > >> I don't feel the urge to revert this:
> > >>
> > >> - It's not broken as such, we're just discussing better ways to
> > >> implement it. We could also do nothing, and revisit this in v18. The
> > >> only must-fix issue is some compiler warnings IIUC.
> > >>
> > >> - It's a pretty localized change in reorderbuffer.c, so it's not in the
> > >> way of other patches or reverts. Nothing else depends on the binaryheap
> > >> changes yet either.
> > >>
> > >> - It seems straightforward to repeat the performance tests with whatever
> > >> alternative implementations we want to consider.
> > >>
> > >> My #1 choice would be to write a patch to switch the pairing heap,
> > >> performance test that, and revert the binary heap changes.
> > >>
> > >
> > > +1.
> >
> > To move this forward, here's a patch to switch to a pairing heap. In my
> > very quick testing, with the performance test cases posted earlier in
> > this thread [1] [2], I'm seeing no meaningful performance difference
> > between this and what's in master currently.
> >
> > Sawada-san, what do you think of this? To be sure, if you could also
> > repeat the performance tests you performed earlier, that'd be great. If
> > you agree with this direction, and you're happy with this patch, feel
> > free take it from here and commit this, and also revert commits
> > b840508644 and bcb14f4abc.
> >
>
> Thank you for the patch!
>
> I agree with the direction that we replace binaryheap + index with the
> existing pairing heap and revert the changes for binaryheap. Regarding
> the patch, I'm not sure we can remove the MAX_HEAP_TXN_COUNT_THRESHOLD
> logic because otherwise we need to remove and add the txn node (i.e.
> O(log n)) for every memory update. I'm concerned it could cause some
> performance degradation in a case where there are not many
> transactions being decoded.
>
> I'll do performance tests, and share the results soon.
>

Here are some performance test results.

* test case 1 (many subtransactions)

test script:

create table test (c int);

create or replace function testfn (cnt int) returns void as $$
begin
  for i in 1..cnt loop
begin
  insert into test values (i);
exception when division_by_zero then
  raise notice 'caught error';
  return;
end;
  end loop;
end;
$$
language plpgsql;
select pg_create_logical_replication_slot('s', 'test_decoding');
select testfn(100);
set logical_decoding_work_mem to '4MB';
select from pg_logical_slot_peek_changes('s', null, null);

HEAD:

43128.266 ms
40116.313 ms
38790.666 ms

Patched:

43626.316 ms
44456.234 ms
39899.753 ms


* test case 2 (single big insertion)

test script:

create table test (c int);
select pg_create_logical_replication_slot('s', 'test_decoding');
insert into test select generate_series(1, 1000);
set logical_decoding_work_mem to '10GB'; -- avoid data spill
select from pg_logical_slot_peek_changes('s', null, null);

HEAD:

7996.476 ms
8034.022 ms
8005.583 ms

Patched:

8153.500 ms
8121.588 ms
8121.538 ms


* test case 3 (many small transactions)

test script:

pgbench -i -s 300
psql -c "select pg_create_logical_replication_slot('s', 'test_decoding')";
pgbench -t 10 -c 32
psql -c "set logical_decoding_work_mem to '10GB'; select count(*) from
pg_logical_slot_peek_changes('s', null, null)"

HEAD:

22586.343 ms
22507.905 ms
22504.133 ms

Patched:

23365.142 ms
23110.651 ms
23102.170 ms

We can see 2% ~ 3% performance regressions compared to the current
HEAD, but they are much smaller than I expected. Given that we can
make the code simpler, I think we can go in this direction.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Improve eviction algorithm in ReorderBuffer

2024-04-10 Thread Masahiko Sawada
Hi,

Sorry for the late reply, I took two days off.

On Thu, Apr 11, 2024 at 6:20 AM Heikki Linnakangas  wrote:
>
> On 10/04/2024 08:31, Amit Kapila wrote:
> > On Wed, Apr 10, 2024 at 11:00 AM Heikki Linnakangas  wrote:
> >>
> >> On 10/04/2024 07:45, Michael Paquier wrote:
> >>> On Tue, Apr 09, 2024 at 09:16:53PM -0700, Jeff Davis wrote:
> >>>> On Wed, 2024-04-10 at 12:13 +0900, Michael Paquier wrote:
> >>>>> Wouldn't the best way forward be to revert
> >>>>> 5bec1d6bc5e3 and revisit the whole in v18?
> >>>>
> >>>> Also consider commits b840508644 and bcb14f4abc.
> >>>
> >>> Indeed.  These are also linked.
> >>
> >> I don't feel the urge to revert this:
> >>
> >> - It's not broken as such, we're just discussing better ways to
> >> implement it. We could also do nothing, and revisit this in v18. The
> >> only must-fix issue is some compiler warnings IIUC.
> >>
> >> - It's a pretty localized change in reorderbuffer.c, so it's not in the
> >> way of other patches or reverts. Nothing else depends on the binaryheap
> >> changes yet either.
> >>
> >> - It seems straightforward to repeat the performance tests with whatever
> >> alternative implementations we want to consider.
> >>
> >> My #1 choice would be to write a patch to switch the pairing heap,
> >> performance test that, and revert the binary heap changes.
> >>
> >
> > +1.
>
> To move this forward, here's a patch to switch to a pairing heap. In my
> very quick testing, with the performance test cases posted earlier in
> this thread [1] [2], I'm seeing no meaningful performance difference
> between this and what's in master currently.
>
> Sawada-san, what do you think of this? To be sure, if you could also
> repeat the performance tests you performed earlier, that'd be great. If
> you agree with this direction, and you're happy with this patch, feel
> free to take it from here and commit this, and also revert commits
> b840508644 and bcb14f4abc.
>

Thank you for the patch!

I agree with the direction that we replace binaryheap + index with the
existing pairing heap and revert the changes for binaryheap. Regarding
the patch, I'm not sure we can remove the MAX_HEAP_TXN_COUNT_THRESHOLD
logic because otherwise we need to remove and add the txn node (i.e.
O(log n)) for every memory update. I'm concerned it could cause some
performance degradation in a case where there are not many
transactions being decoded.
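
To spell out the cost being discussed: without the threshold, every
memory accounting update would have to re-position the transaction in
the heap, e.g. (names as in the proposed patch; a sketch, not the
actual code):

static void
txn_size_add(ReorderBuffer *rb, ReorderBufferTXN *txn, Size sz)
{
    pairingheap_remove(rb->txn_heap, &txn->txn_node);   /* O(log n) */
    txn->size += sz;
    rb->size += sz;
    pairingheap_add(rb->txn_heap, &txn->txn_node);
}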

I'll do performance tests, and share the results soon.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: PostgreSQL 17 Release Management Team & Feature Freeze

2024-04-08 Thread Masahiko Sawada
On Tue, Apr 9, 2024 at 12:30 AM Andres Freund  wrote:
>
> Hi,
>
> On 2024-04-08 09:26:09 -0400, Robert Haas wrote:
> > On Sun, Apr 7, 2024 at 6:50 PM Michael Paquier  wrote:
> > And maybe we need to think of a way to further mitigate this crush of
> > last minute commits. e.g. In the last week, you can't have more
> > feature commits, or more lines of insertions in your commits, than you
> > did in the prior 3 weeks combined. I don't know. I think this mad rush
> > of last-minute commits is bad for the project.
>
> Some will just polish commits until the last minute, until the
> the dot's on the i's really shine, others will continue picking up more CF
> entries until the freeze is reached, others will push half baked stuff.

I agree with this part.

Aside from considering how to institute some rules for mitigating the
last-minute rush, it might also be a good idea to consider how to
improve testing the new commits during beta. FWIW in each year, after
feature freeze I personally pick some new features that I didn't get
involved with during the development and do intensive reviews in
April. It might be good if more people did things like that. That
might help finding half baked features earlier and improve the quality
in general. So for example, we list features that could require more
reviews (e.g. because of its volume, complexity, and a span of
influence etc.) and we do intensive reviews for these items. Each item
should be reviewed by other than the author and the committer. We may
want to set aside a specific period for intensive testing.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Improve eviction algorithm in ReorderBuffer

2024-04-08 Thread Masahiko Sawada
For v17, the changes for #2 are smaller, but I'm concerned that the
new API, which requires a hash function in order to use
binaryheap_update_{up|down}, might not be user friendly. In terms of
APIs, I prefer idea #1. And the changes for #1 can make the binaryheap
code simple, although they require adding a variable to
ReorderBufferTXN instead. But overall, idea #1 can remove the hash
table and some functions, so it looks better to me.

When it comes to performance overhead, I mentioned that there is some
regression in the current binaryheap even without indexing. Since
function call overhead contributed to the regression, inlining some
functions reduced it. For example, after inlining set_node() and
replace_node(), the same benchmark test I used showed:

postgres(1:88476)=# select * from generate_series(1,3) x(x), lateral
(select * from bench_load(false, 1000 * (1+x-x)));
 x | cnt  | load_ms | xx_load_ms | old_load_ms
---+------+---------+------------+-------------
 1 | 1000 |     502 |        624 |         427
 2 | 1000 |     503 |        622 |         428
 3 | 1000 |     502 |        621 |         427
(3 rows)
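
For reference, the inlined set_node() looks roughly like this (a
sketch following the indexed-binaryheap code; treat the exact body as
an assumption):

static inline void
set_node(binaryheap *heap, bh_node_type node, int index)
{
    /* put the node into the nodes array */
    heap->bh_nodes[index] = node;

    /* and record where it is, so per-key removal/update stays cheap */
    if (binaryheap_indexed(heap))
    {
        bool              found;
        bh_nodeidx_entry *ent;

        ent = bh_nodeidx_insert(heap->bh_nodeidx, node, &found);
        ent->index = index;
    }
}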

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


01_use_update_index_func_in_binaryheap.patch
Description: Binary data


02_use_hashfunc_in_binaryheap.patch
Description: Binary data


Re: Improve tab completion for ALTER DEFAULT PRIVILEGE and ALTER TABLE

2024-04-07 Thread Masahiko Sawada
On Fri, Apr 5, 2024 at 1:18 AM vignesh C  wrote:
>
> On Tue, 2 Apr 2024 at 13:08, Masahiko Sawada  wrote:
> >
> > On Mon, Apr 1, 2024 at 10:41 PM vignesh C  wrote:
> > >
> > > On Thu, 28 Mar 2024 at 13:05, Masahiko Sawada  
> > > wrote:
> > > >
> > > > Hi,
> > > >
> > > > Thank you for the patch!
> > > >
> > > > On Mon, Jul 3, 2023 at 12:12 AM vignesh C  wrote:
> > > > >
> > > > > Hi,
> > > > >
> > > > > Improved tab completion for "ALTER DEFAULT PRIVILEGE" and "ALTER 
> > > > > TABLE":
> > > > > 1) GRANT, REVOKE and FOR USER keyword was not displayed in tab
> > > > > completion of alter default privileges like the below statement:
> > > > > ALTER DEFAULT PRIVILEGES GRANT INSERT ON tables TO PUBLIC;
> > > > > ALTER DEFAULT PRIVILEGES REVOKE INSERT ON tables FROM PUBLIC;
> > > > > ALTER DEFAULT PRIVILEGES FOR USER vignesh revoke INSERT ON tables 
> > > > > FROM dba1;
> > > >
> > > > +1
> > > >
> > > > >
> > > > > 2) USER was not displayed for "ALTER DEFAULT PRIVILEGES IN SCHEMA
> > > > > public FOR " like in below statement:
> > > > > ALTER DEFAULT PRIVILEGES IN SCHEMA public FOR USER dba1 GRANT INSERT
> > > > > ON TABLES TO PUBLIC;
> > > >
> > > > Since there is no difference between FOR USER and FOR ROLE, I'm not
> > > > sure we really want to support both in tab-completion.
> > >
> > > I have removed this change
> > >
> > > > >
> > > > > 3) "FOR GRANT OPTION" was not display for "ALTER DEFAULT PRIVILEGES
> > > > > REVOKE " like in below statement:
> > > > > alter default privileges revoke grant option for select ON tables 
> > > > > FROM dba1;
> > > >
> > > > +1. But the v3 patch doesn't cover the following case:
> > > >
> > > > =# alter default privileges for role masahiko revoke [tab]
> > > > ALL CREATE  DELETE  EXECUTE INSERT  MAINTAIN
> > > >  REFERENCES  SELECT  TRIGGER TRUNCATEUPDATE  USAGE
> > >
> > > Modified in the updated patch
> > >
> > > > And it doesn't cover MAINTAIN neither:
> > > >
> > > > =# alter default privileges revoke [tab]
> > > > ALL   DELETEGRANT OPTION FOR  REFERENCES
> > > >  TRIGGER   UPDATE
> > > > CREATEEXECUTE   INSERTSELECT
> > > >  TRUNCATE  USAGE
> > >
> > > Modified in the updated patch
> > >
> > > > The patch adds the completions for ALTER DEFAULT PRIVILEGES REVOKE,
> > > > but we handle such case in GRANT and REVOKE part:
> > > >
> > > > (around L3958)
> > > > /*
> > > >  * With ALTER DEFAULT PRIVILEGES, restrict completion to 
> > > > grantable
> > > >  * privileges (can't grant roles)
> > > >  */
> > > > if (HeadMatches("ALTER", "DEFAULT", "PRIVILEGES"))
> > > > COMPLETE_WITH("SELECT", "INSERT", "UPDATE",
> > > >   "DELETE", "TRUNCATE", "REFERENCES", "TRIGGER",
> > > >   "CREATE", "EXECUTE", "USAGE", "MAINTAIN", 
> > > > "ALL");
> > >
> > > The current patch handles the fix from here now.
> > >
> > > > Also, I think we can support WITH GRANT OPTION too. For example,
> > > >
> > > > =# alter default privileges for role masahiko grant all on tables to
> > > > public [tab]
> > >
> > > I have handled this in the updated patch
> > >
> > > > It's already supported in the GRANT statement.
> > > >
> > > > >
> > > > > 4) "DATA TYPE" was missing in "ALTER TABLE table-name ALTER COLUMN
> > > > > column-name SET" like in:
> > > > > ALTER TABLE t1 ALTER COLUMN c1 SET DATA TYPE text;
> > > > >
> > > >
> > > > +1. The patch looks good to me, so pushed.
> > >
> > > Thanks for committing this.
> > >
> > > The updated patch has the changes for the above comments.
> > >
> >
> > Thank you for updating the patch.
> >
> > I think it doesn't work well as "GRANT OPTION FOR" is complemented
> > twice. For example,
> >
> > =# alter default privileges for user masahiko revoke [tab]
> > ALL   DELETEGRANT OPTION FOR  MAINTAIN
> >  SELECTTRUNCATE  USAGE
> > CREATEEXECUTE   INSERTREFERENCES
> >  TRIGGER   UPDATE
> > =# alter default privileges for user masahiko revoke grant option for [tab]
> > ALL   DELETEGRANT OPTION FOR  MAINTAIN
> >  SELECTTRUNCATE  USAGE
> > CREATEEXECUTE   INSERTREFERENCES
> >  TRIGGER   UPDATE
>
> Thanks for finding this issue, the attached v5 version patch has the
> fix for the same.

Thank you for updating the patch! I've pushed it with minor adjustments.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Improve eviction algorithm in ReorderBuffer

2024-04-05 Thread Masahiko Sawada
On Fri, Apr 5, 2024 at 2:55 AM Jeff Davis  wrote:
>
> On Thu, 2024-04-04 at 17:28 +0900, Masahiko Sawada wrote:
> > > Perhaps it's not worth the effort though, if performance is already
> > > good enough?
> >
> > Yeah, it would be better to measure the overhead first. I'll do that.
>
> I have some further comments and I believe changes are required for
> v17.
>
> An indexed binary heap API requires both a comparator and a hash
> function to be specified, and has two different kinds of keys: the heap
> key (mutable) and the hash key (immutable). It provides heap methods
> and hashtable methods, and keep the two internal structures (heap and
> HT) in sync.

IIUC for example in ReorderBuffer, the heap key is transaction size
and the hash key is xid.

>
> The implementation in b840508644 uses the bh_node_type as the hash key,
> which is just a Datum, and it just hashes the bytes. I believe the
> implicit assumption is that the Datum is a pointer -- I'm not sure how
> one would use that API if the Datum were a value. Hashing a pointer
> seems strange to me and, while I see why you did it that way, I think
> it reflects that the API boundaries are not quite right.

I see your point. It assumes that the bh_node_type is a pointer or at
least unique. So it cannot work with Datum being a value.

>
> One consequence of using the pointer as the hash key is that you need
> to find the pointer first: you can't change or remove elements based on
> the transaction ID, you have to get the ReorderBufferTXN pointer by
> finding it in another structure, first. Currently, that's being done by
> searching ReorderBuffer->by_txn. So we actually have two hash tables
> for essentially the same purpose: one with xid as the key, and the
> other with the pointer as the key. That makes no sense -- let's have a
> proper indexed binary heap to look things up by xid (the internal HT)
> or by transaction size (using the internal heap).
>
> I suggest:
>
>   * Make a proper indexed binary heap API that accepts a hash function
> and provides both heap methods and HT methods that operate based on
> values (transaction size and transaction ID, respectively).
>   * Get rid of ReorderBuffer->by_txn and use the indexed binary heap
> instead.
>
> This will be a net simplification in reorderbuffer.c, which is good,
> because that file makes use of a *lot* of data strucutres.
>

It sounds like a data structure that combines the hash table and the
binary heap, and we would use it as the main storage (e.g. for
ReorderBufferTXN) instead of using the binary heap as a secondary data
structure. IIUC, with your idea, the indexed binary heap has a hash
table storing the elements, each of which carries its index within the
heap's node array. I guess it's better to create it as a new data
structure rather than extending the existing binaryheap, since the
APIs could be very different. I might be missing something, though.
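
For concreteness, the hash entry in the committed code looks roughly
like this (a simplified sketch):

typedef struct bh_nodeidx_entry
{
    bh_node_type key;    /* the value stored in the heap; for
                          * ReorderBuffer this is the ReorderBufferTXN
                          * pointer, hence the hashed-pointer problem */
    char         status; /* simplehash bookkeeping */
    int          index;  /* current position of key in bh_nodes[] */
} bh_nodeidx_entry;

The indexed-binary-heap idea above would instead key this table by xid
and make the combined structure the primary store for the transactions.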

On Fri, Apr 5, 2024 at 3:55 AM Jeff Davis  wrote:
>
> On Thu, 2024-04-04 at 10:55 -0700, Jeff Davis wrote:
> >   * Make a proper indexed binary heap API that accepts a hash
> > function
> > and provides both heap methods and HT methods that operate based on
> > values (transaction size and transaction ID, respectively).
> >   * Get rid of ReorderBuffer->by_txn and use the indexed binary heap
> > instead.
>
> An alternative idea:
>
> * remove the hash table from binaryheap.c
>
> * supply a new callback to the binary heap with type like:
>
>   typedef void (*binaryheap_update_index)(
> bh_node_type node,
> int new_element_index);
>
> * make the remove, update_up, and update_down methods take the element
> index rather than the pointer
>
> reorderbuffer.c would then do something like:
>
>   void
>   txn_update_heap_index(ReorderBufferTXN *txn, int new_element_index)
>   {
>  txn->heap_element_index = new_element_index;
>   }
>
>   ...
>
>   txn_heap = binaryheap_allocate(..., txn_update_heap_index, ...);
>
> and then binaryheap.c would effectively maintain txn-
> >heap_element_index, so reorderbuffer.c can pass that to the APIs that
> require the element index.

Thank you for the idea. I was thinking of the same idea when
considering your previous comment. With this idea, we still use the
binaryheap for ReorderBuffer as a secondary data structure. Since this
idea can be implemented with relatively small changes to the current
binaryheap, I've implemented it and measured the performance.

I've attached a patch that adds an extension for benchmarking
binaryheap implementations. binaryheap_bench.c is the main test
module. To make the comparison between different binaryheap
implementations, the extension includes two different binaryheap
implementations. Therefore, binaryheap_ben

Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-04-04 Thread Masahiko Sawada
On Thu, Apr 4, 2024 at 5:36 PM Amit Kapila  wrote:
>
> On Thu, Apr 4, 2024 at 1:32 PM Masahiko Sawada  wrote:
> >
> > On Thu, Apr 4, 2024 at 1:34 PM Bharath Rupireddy
> >  wrote:
> > >
> > > On Thu, Apr 4, 2024 at 9:42 AM Masahiko Sawada  
> > > wrote:
> > > >
> > > > @@ -1368,6 +1416,7 @@ ShutDownSlotSync(void)
> > > > if (SlotSyncCtx->pid == InvalidPid)
> > > > {
> > > > SpinLockRelease(&SlotSyncCtx->mutex);
> > > > +   update_synced_slots_inactive_since();
> > > > return;
> > > > }
> > > > SpinLockRelease(&SlotSyncCtx->mutex);
> > > > @@ -1400,6 +1449,8 @@ ShutDownSlotSync(void)
> > > > }
> > > >
> > > > SpinLockRelease(&SlotSyncCtx->mutex);
> > > > +
> > > > +   update_synced_slots_inactive_since();
> > > >  }
> > > >
> > > > Why do we want to update all synced slots' inactive_since values at
> > > > shutdown in spite of updating the value every time when releasing the
> > > > slot? It seems to contradict the fact that inactive_since is updated
> > > > when releasing or restoring the slot.
> > >
> > > It is to get the inactive_since right for the cases where the standby
> > > is promoted without a restart similar to when a standby is promoted
> > > with restart in which case the inactive_since is set to current time
> > > in RestoreSlotFromDisk.
> > >
> > > Imagine the slot is synced last time at time t1 and then a few hours
> > > passed, the standby is promoted without a restart. If we don't set
> > > inactive_since to current time in this case in ShutDownSlotSync, the
> > > inactive timeout invalidation mechanism can kick in immediately.
> > >
> >
> > Thank you for the explanation! I understood the needs.
> >
>
> Do you want to review the v34_0001* further or shall I proceed with
> the commit of the same?

Thanks for asking. The v34-0001 patch looks good to me.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Improve eviction algorithm in ReorderBuffer

2024-04-04 Thread Masahiko Sawada
On Thu, Apr 4, 2024 at 1:54 PM Jeff Davis  wrote:
>
> On Thu, 2024-04-04 at 09:31 +0900, Masahiko Sawada wrote:
> > IIUC, with your suggestion, sift_{up|down} needs to update the
> > heap_index field as well. Does it mean that the caller needs to pass
> > the address of heap_index down to sift_{up|down}?
>
> I'm not sure quite how binaryheap should be changed. Bringing the heap
> implementation into reorderbuffer.c would obviously work, but that
> would be more code.

Right.

>  Another option might be to make the API of
> binaryheap look a little more like simplehash, where some #defines
> control optional behavior and can tell the implementation where to find
> fields in the structure.

Interesting idea.

>
> Perhaps it's not worth the effort though, if performance is already
> good enough?

Yeah, it would be better to measure the overhead first. I'll do that.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Synchronizing slots from primary to standby

2024-04-04 Thread Masahiko Sawada
On Wed, Apr 3, 2024 at 7:06 PM Amit Kapila  wrote:
>
> On Wed, Apr 3, 2024 at 11:13 AM Amit Kapila  wrote:
> >
> > On Wed, Apr 3, 2024 at 9:36 AM Bharath Rupireddy
> >  wrote:
> >
> > > I quickly looked at v8, and have a nit, rest all looks good.
> > >
> > > +if (DecodingContextReady(ctx) && found_consistent_snapshot)
> > > +*found_consistent_snapshot = true;
> > >
> > > Can the found_consistent_snapshot be checked first to help avoid the
> > > function call DecodingContextReady() for pg_replication_slot_advance
> > > callers?
> > >
> >
> > Okay, changed. Additionally, I have updated the comments and commit
> > message. I'll push this patch after some more testing.
> >
>
> Pushed!

While testing this change, I realized that the server logs can end up
flooded with the following logical decoding messages, which are
written every 200 ms:

2024-04-04 16:15:19.270 JST [3838739] LOG:  starting logical decoding
for slot "test_sub"
2024-04-04 16:15:19.270 JST [3838739] DETAIL:  Streaming transactions
committing after 0/50006F48, reading WAL from 0/50006F10.
2024-04-04 16:15:19.270 JST [3838739] LOG:  logical decoding found
consistent point at 0/50006F10
2024-04-04 16:15:19.270 JST [3838739] DETAIL:  There are no running
transactions.
2024-04-04 16:15:19.477 JST [3838739] LOG:  starting logical decoding
for slot "test_sub"
2024-04-04 16:15:19.477 JST [3838739] DETAIL:  Streaming transactions
committing after 0/50006F48, reading WAL from 0/50006F10.
2024-04-04 16:15:19.477 JST [3838739] LOG:  logical decoding found
consistent point at 0/50006F10
2024-04-04 16:15:19.477 JST [3838739] DETAIL:  There are no running
transactions.

For example, I could reproduce it with the following steps:

1. create the primary and start.
2. run "pgbench -i -s 100" on the primary.
3. run pg_basebackup to create the standby.
4. configure slotsync setup on the standby and start.
5. create a publication for all tables on the primary.
6. create the subscriber and start.
7. run "pgbench -i -Idtpf" on the subscriber.
8. create a subscription on the subscriber (initial data copy will start).

The logical decoding logs were written every 200 ms during the initial
data synchronization.

Looking at the new changes for update_local_synced_slot():

if (remote_slot->confirmed_lsn != slot->data.confirmed_flush ||
remote_slot->restart_lsn != slot->data.restart_lsn ||
remote_slot->catalog_xmin != slot->data.catalog_xmin)
{
/*
 * We can't directly copy the remote slot's LSN or xmin unless there
 * exists a consistent snapshot at that point. Otherwise, after
 * promotion, the slots may not reach a consistent point before the
 * confirmed_flush_lsn which can lead to a data loss. To avoid data
 * loss, we let slot machinery advance the slot which ensures that
 * snapbuilder/slot statuses are updated properly.
 */
if (SnapBuildSnapshotExists(remote_slot->restart_lsn))
{
/*
 * Update the slot info directly if there is a serialized snapshot
 * at the restart_lsn, as the slot can quickly reach consistency
 * at restart_lsn by restoring the snapshot.
 */
SpinLockAcquire(&slot->mutex);
slot->data.restart_lsn = remote_slot->restart_lsn;
slot->data.confirmed_flush = remote_slot->confirmed_lsn;
slot->data.catalog_xmin = remote_slot->catalog_xmin;
slot->effective_catalog_xmin = remote_slot->catalog_xmin;
SpinLockRelease(&slot->mutex);

if (found_consistent_snapshot)
*found_consistent_snapshot = true;
}
else
{
LogicalSlotAdvanceAndCheckSnapState(remote_slot->confirmed_lsn,
found_consistent_snapshot);
}

ReplicationSlotsComputeRequiredXmin(false);
ReplicationSlotsComputeRequiredLSN();

slot_updated = true;

We call LogicalSlotAdvanceAndCheckSnapState() if any of confirmed_lsn,
restart_lsn, or catalog_xmin differs between the remote slot and the
local slot. In my test case, while the initial sync was being
performed, only catalog_xmin differed and there was no serialized
snapshot at restart_lsn, so the slotsync worker called
LogicalSlotAdvanceAndCheckSnapState(). However, no slot properties
were changed even after that call, yet it still set slot_updated =
true. So the next slot synchronization starts after 200 ms.

It seems to me that we can skip calling
LogicalSlotAdvanceAndCheckSnapState() at least when the remote and
local have the same restart_lsn and confirmed_lsn.
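
As a sketch only (hypothetical and untested), the guard could look like:

    if (remote_slot->restart_lsn != slot->data.restart_lsn ||
        remote_slot->confirmed_lsn != slot->data.confirmed_flush)
        LogicalSlotAdvanceAndCheckSnapState(remote_slot->confirmed_lsn,
                                            found_consistent_snapshot);
    /*
     * else: only catalog_xmin differs; how to refresh it without
     * restarting logical decoding is left open here.
     */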

I'm not sure whether there are other scenarios, but is it worth fixing
this symptom?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-04-04 Thread Masahiko Sawada
On Thu, Apr 4, 2024 at 1:34 PM Bharath Rupireddy
 wrote:
>
> On Thu, Apr 4, 2024 at 9:42 AM Masahiko Sawada  wrote:
> >
> > @@ -1368,6 +1416,7 @@ ShutDownSlotSync(void)
> > if (SlotSyncCtx->pid == InvalidPid)
> > {
> > SpinLockRelease(&SlotSyncCtx->mutex);
> > +   update_synced_slots_inactive_since();
> > return;
> > }
> > SpinLockRelease(&SlotSyncCtx->mutex);
> > @@ -1400,6 +1449,8 @@ ShutDownSlotSync(void)
> > }
> >
> > SpinLockRelease(&SlotSyncCtx->mutex);
> > +
> > +   update_synced_slots_inactive_since();
> >  }
> >
> > Why do we want to update all synced slots' inactive_since values at
> > shutdown in spite of updating the value every time when releasing the
> > slot? It seems to contradict the fact that inactive_since is updated
> > when releasing or restoring the slot.
>
> It is to get the inactive_since right for the cases where the standby
> is promoted without a restart similar to when a standby is promoted
> with restart in which case the inactive_since is set to current time
> in RestoreSlotFromDisk.
>
> Imagine the slot is synced last time at time t1 and then a few hours
> passed, the standby is promoted without a restart. If we don't set
> inactive_since to current time in this case in ShutDownSlotSync, the
> inactive timeout invalidation mechanism can kick in immediately.
>

Thank you for the explanation! I understood the needs.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-04-03 Thread Masahiko Sawada
On Wed, Apr 3, 2024 at 11:58 PM Bharath Rupireddy
 wrote:
>
>
> Please find the attached v33 patches.

@@ -1368,6 +1416,7 @@ ShutDownSlotSync(void)
if (SlotSyncCtx->pid == InvalidPid)
{
SpinLockRelease(&SlotSyncCtx->mutex);
+   update_synced_slots_inactive_since();
return;
}
SpinLockRelease(&SlotSyncCtx->mutex);
@@ -1400,6 +1449,8 @@ ShutDownSlotSync(void)
}

SpinLockRelease(&SlotSyncCtx->mutex);
+
+   update_synced_slots_inactive_since();
 }

Why do we want to update all synced slots' inactive_since values at
shutdown in spite of updating the value every time when releasing the
slot? It seems to contradict the fact that inactive_since is updated
when releasing or restoring the slot.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Improve eviction algorithm in ReorderBuffer

2024-04-03 Thread Masahiko Sawada
Hi,

On Thu, Apr 4, 2024 at 2:32 AM Jeff Davis  wrote:
>
> On Wed, 2024-04-03 at 01:45 -0700, Jeff Davis wrote:
> > I suggest that you add a "heap_index" field to ReorderBufferTXN that
> > would point to the index into the heap's array (the same as
> > bh_nodeidx_entry.index in your patch). Each time an element moves
> > within the heap array, just follow the pointer to the
> > ReorderBufferTXN
> > object and update the heap_index -- no hash lookup required.
>
> It looks like my email was slightly too late, as the work was already
> committed.

Thank you for the suggestions! I should have let you know earlier.

>
> My suggestion is not required for 17, and so it's fine if this waits
> until the next CF. If it turns out to be a win we can consider
> backporting to 17 just to keep the code consistent, otherwise it can go
> in 18.

IIUC, with your suggestion, sift_{up|down} needs to update the
heap_index field as well. Does it mean that the caller needs to pass
the address of heap_index down to sift_{up|down}?

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Improve tab completion for ALTER DEFAULT PRIVILEGE and ALTER TABLE

2024-04-02 Thread Masahiko Sawada
On Mon, Apr 1, 2024 at 10:41 PM vignesh C  wrote:
>
> On Thu, 28 Mar 2024 at 13:05, Masahiko Sawada  wrote:
> >
> > Hi,
> >
> > Thank you for the patch!
> >
> > On Mon, Jul 3, 2023 at 12:12 AM vignesh C  wrote:
> > >
> > > Hi,
> > >
> > > Improved tab completion for "ALTER DEFAULT PRIVILEGE" and "ALTER TABLE":
> > > 1) GRANT, REVOKE and FOR USER keyword was not displayed in tab
> > > completion of alter default privileges like the below statement:
> > > ALTER DEFAULT PRIVILEGES GRANT INSERT ON tables TO PUBLIC;
> > > ALTER DEFAULT PRIVILEGES REVOKE INSERT ON tables FROM PUBLIC;
> > > ALTER DEFAULT PRIVILEGES FOR USER vignesh revoke INSERT ON tables FROM 
> > > dba1;
> >
> > +1
> >
> > >
> > > 2) USER was not displayed for "ALTER DEFAULT PRIVILEGES IN SCHEMA
> > > public FOR " like in below statement:
> > > ALTER DEFAULT PRIVILEGES IN SCHEMA public FOR USER dba1 GRANT INSERT
> > > ON TABLES TO PUBLIC;
> >
> > Since there is no difference between FOR USER and FOR ROLE, I'm not
> > sure we really want to support both in tab-completion.
>
> I have removed this change
>
> > >
> > > 3) "FOR GRANT OPTION" was not display for "ALTER DEFAULT PRIVILEGES
> > > REVOKE " like in below statement:
> > > alter default privileges revoke grant option for select ON tables FROM 
> > > dba1;
> >
> > +1. But the v3 patch doesn't cover the following case:
> >
> > =# alter default privileges for role masahiko revoke [tab]
> > ALL CREATE  DELETE  EXECUTE INSERT  MAINTAIN
> >  REFERENCES  SELECT  TRIGGER TRUNCATEUPDATE  USAGE
>
> Modified in the updated patch
>
> > And it doesn't cover MAINTAIN neither:
> >
> > =# alter default privileges revoke [tab]
> > ALL   DELETEGRANT OPTION FOR  REFERENCES
> >  TRIGGER   UPDATE
> > CREATEEXECUTE   INSERTSELECT
> >  TRUNCATE  USAGE
>
> Modified in the updated patch
>
> > The patch adds the completions for ALTER DEFAULT PRIVILEGES REVOKE,
> > but we handle such case in GRANT and REVOKE part:
> >
> > (around L3958)
> > /*
> >  * With ALTER DEFAULT PRIVILEGES, restrict completion to grantable
> >  * privileges (can't grant roles)
> >  */
> > if (HeadMatches("ALTER", "DEFAULT", "PRIVILEGES"))
> > COMPLETE_WITH("SELECT", "INSERT", "UPDATE",
> >   "DELETE", "TRUNCATE", "REFERENCES", "TRIGGER",
> >   "CREATE", "EXECUTE", "USAGE", "MAINTAIN", "ALL");
>
> The current patch handles the fix from here now.
>
> > Also, I think we can support WITH GRANT OPTION too. For example,
> >
> > =# alter default privileges for role masahiko grant all on tables to
> > public [tab]
>
> I have handled this in the updated patch
>
> > It's already supported in the GRANT statement.
> >
> > >
> > > 4) "DATA TYPE" was missing in "ALTER TABLE table-name ALTER COLUMN
> > > column-name SET" like in:
> > > ALTER TABLE t1 ALTER COLUMN c1 SET DATA TYPE text;
> > >
> >
> > +1. The patch looks good to me, so pushed.
>
> Thanks for committing this.
>
> The updated patch has the changes for the above comments.
>

Thank you for updating the patch.

I think it doesn't work well as "GRANT OPTION FOR" is complemented
twice. For example,

=# alter default privileges for user masahiko revoke [tab]
ALL   DELETEGRANT OPTION FOR  MAINTAIN
 SELECTTRUNCATE  USAGE
CREATEEXECUTE   INSERTREFERENCES
 TRIGGER   UPDATE
=# alter default privileges for user masahiko revoke grant option for [tab]
ALL   DELETEGRANT OPTION FOR  MAINTAIN
 SELECTTRUNCATE  USAGE
CREATEEXECUTE   INSERTREFERENCES
 TRIGGER   UPDATE

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-04-01 Thread Masahiko Sawada
ensure whether we need
> > to update the slot or not.

If we use such pre-checks, another problem might arise: they cannot
handle a case where the slot is acquired on the primary but its LSNs
don't move forward. Imagine a logical replication conflict happens on
the subscriber and logical replication enters its retry loop. In this
case, the remote slot's inactive_since gets updated on every retry,
but the slot looks inactive from the standby since its LSNs don't
change. Therefore, only the local slot could be invalidated due to the
timeout, but we probably don't want to regard such a slot as
"inactive".

Another idea I came up with is that the slotsync worker updates the
local slot's inactive_since to the local timestamp only when the
remote slot might have become inactive. If the remote slot is acquired
by someone, the local slot's inactive_since is also NULL. If the
remote slot becomes inactive, the slotsync worker sets the local
timestamp in the local slot's inactive_since. Since the remote slot
could be acquired and released before the slotsync worker fetches the
remote slot data again, the worker updates the local value whenever
the remote slot's inactive_since is later than the local slot's. IOW,
we detect whether the remote slot was acquired and released since the
last synchronization by checking the remote slot's inactive_since.
This idea seems to handle the cases I mentioned, unless I'm missing
something, but it requires the slotsync worker to update
inactive_since differently from other parameters.
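
In pseudo-code, that idea might look like this (hypothetical; it also
assumes the primary's and the standby's clocks are reasonably in sync):

    if (remote_slot->inactive_since == 0)
    {
        /* the remote slot is currently acquired; mirror that locally */
        slot->inactive_since = 0;
    }
    else if (remote_slot->inactive_since > slot->inactive_since)
    {
        /*
         * The remote slot was acquired and released since the last
         * sync; stamp the local slot with the local time.
         */
        slot->inactive_since = GetCurrentTimestamp();
    }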

Or a simpler solution would be for the slotsync worker to update
inactive_since as it does for non-synced slots, and to disable
timeout-based slot invalidation for synced slots.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Add new error_action COPY ON_ERROR "log"

2024-04-01 Thread Masahiko Sawada
On Mon, Apr 1, 2024 at 10:03 AM Masahiko Sawada  wrote:
>
> On Sat, Mar 30, 2024 at 11:05 PM Bharath Rupireddy
>  wrote:
> >
> > On Thu, Mar 28, 2024 at 6:31 PM Masahiko Sawada  
> > wrote:
> > >
> > > That is,
> > > since the LOG_VERBOSITY option is an enum parameter, it might make
> > > more sense to require the value, instead of making the value optional.
> > > For example, the following command might not be obvious to users:
> > >
> > > COPY test FROM stdin (ON_ERROR ignore, LOG_VERBOSITY);
> >
> > Agreed. Please see the attached v14 patch.
>
> Thank you for updating the patch!
>
> >  The LOG_VERBOSITY now needs
> > a value to be specified. Note that I've not added any test for this
> > case as there seemed to be no such tests so far generating "ERROR:
> > <> requires a parameter". I don't mind adding one for
> > LOG_VERBOSITY though.
>
> +1
>
> One minor point:
>
>  ENCODING 'encoding_name'
> +LOG_VERBOSITY [ mode ]
>  
>
> '[' and ']' are not necessary because the value is no longer optional.
>
> I've attached the updated patch. I'll push it, barring any objections.
>

Pushed.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Improve eviction algorithm in ReorderBuffer

2024-03-31 Thread Masahiko Sawada
On Mon, Apr 1, 2024 at 11:26 AM Masahiko Sawada  wrote:
>
> On Fri, Mar 29, 2024 at 7:37 PM Amit Kapila  wrote:
> >
> > On Fri, Mar 29, 2024 at 12:13 PM Masahiko Sawada  
> > wrote:
> > >
> > > On Fri, Mar 29, 2024 at 2:09 PM vignesh C  wrote:
> > > >
> > > > On Tue, 26 Mar 2024 at 10:05, Masahiko Sawada  
> > > > wrote:
> > > > >
> > > > > On Thu, Mar 14, 2024 at 12:02 PM Masahiko Sawada 
> > > > >  wrote:
> > > > > >
> > > > > >
> > > > > > I've attached new version patches.
> > > > >
> > > > > Since the previous patch conflicts with the current HEAD, I've
> > > > > attached the rebased patches.
> > > >
> > > > Thanks for the updated patch.
> > > > One comment:
> > > > I felt we can mention the improvement where we update memory
> > > > accounting info at transaction level instead of per change level which
> > > > is done in ReorderBufferCleanupTXN, ReorderBufferTruncateTXN, and
> > > > ReorderBufferSerializeTXN also in the commit message:
> > >
> > > Agreed.
> > >
> > > I think the patch is in good shape. I'll push the patch with the
> > > suggestion next week, barring any objections.
> > >
> >
> > Few minor comments:
> > 1.
> > @@ -3636,6 +3801,8 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
> >   Assert(txn->nentries_mem == 0);
> >   }
> >
> > + ReorderBufferMaybeResetMaxHeap(rb);
> > +
> >
> > Can we write a comment about why this reset is required here?
> > Otherwise, the reason is not apparent.
>
> Yes, added.
>
> >
> > 2.
> > Although using max-heap to select the largest
> > + * transaction is effective when there are many transactions being decoded,
> > + * there is generally no need to use it as long as all transactions being
> > + * decoded are top-level transactions. Therefore, we use MaxConnections as 
> > the
> > + * threshold so we can prevent switching to the state unless we use
> > + * subtransactions.
> > + */
> > +#define MAX_HEAP_TXN_COUNT_THRESHOLD MaxConnections
> >
> > Isn't using max-heap equally effective in finding the largest
> > transaction whether there are top-level or top-level plus
> > subtransactions? This comment indicates it is only effective when
> > there are subtransactions.
>
> You're right. Updated the comment.
>
> I've attached the updated patches.
>

While reviewing the patches, I realized that the comment of
binaryheap_allocate() should also be updated. So I've attached the
new patches.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


v12-0001-Make-binaryheap-enlargeable.patch
Description: Binary data


v12-0002-Add-functions-to-binaryheap-for-efficient-key-re.patch
Description: Binary data


v12-0003-Improve-eviction-algorithm-in-Reorderbuffer-usin.patch
Description: Binary data


Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-31 Thread Masahiko Sawada
On Fri, Mar 29, 2024 at 4:21 PM John Naylor  wrote:
>
> On Thu, Mar 28, 2024 at 12:55 PM Masahiko Sawada  
> wrote:
> > I think the patch is in good shape. Do you have other comments or
> > suggestions, John?
>
> --- a/doc/src/sgml/config.sgml
> +++ b/doc/src/sgml/config.sgml
> @@ -1918,11 +1918,6 @@ include_dir 'conf.d'
>  too high.  It may be useful to control for this by separately
>  setting .
> 
> -   
> -Note that for the collection of dead tuple identifiers,
> -VACUUM is only able to utilize up to a maximum of
> -1GB of memory.
> -   
>
>   
>
> This is mentioned twice for two different GUCs -- need to remove the
> other one, too.

Good catch, removed.

> Other than that, I just have minor nits:
>
> - * The major space usage for vacuuming is storage for the array of dead TIDs
> + * The major space usage for vacuuming is TID store, a storage for dead TIDs
>
> I think I've helped edit this sentence before, but I still don't quite
> like it. I'm thinking now "is storage for the dead tuple IDs".
>
> - * set upper bounds on the number of TIDs we can keep track of at once.
> + * set upper bounds on the maximum memory that can be used for keeping track
> + * of dead TIDs at once.
>
> I think "maximum" is redundant with "upper bounds".

Fixed.

>
> I also feel the commit message needs more "meat" -- we need to clearly
> narrate the features and benefits. I've attached how I would write it,
> but feel free to use what you like to match your taste.

Well, that's much better than mine.

>
> I've marked it Ready for Committer.

Thank you! I've attached the patch that I'm going to push tomorrow.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


v82-0001-Use-TidStore-for-dead-tuple-TIDs-storage-during-.patch
Description: Binary data


Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

2024-03-31 Thread Masahiko Sawada
On Fri, Mar 29, 2024 at 11:54 AM torikoshia  wrote:
>
> On 2024-03-28 21:54, Masahiko Sawada wrote:
> > On Thu, Mar 28, 2024 at 9:38 PM torikoshia 
> > wrote:
> >>
> >> On 2024-03-28 10:20, Masahiko Sawada wrote:
> >> > Hi,
> >> >
> >> > On Thu, Jan 18, 2024 at 5:33 PM Masahiko Sawada 
> >> > wrote:
> >> >>
> >> >> On Thu, Jan 18, 2024 at 4:59 PM Alexander Korotkov
> >> >>  wrote:
> >> >> >
> >> >> > On Thu, Jan 18, 2024 at 4:16 AM torikoshia 
> >> >> >  wrote:
> >> >> > > On 2024-01-18 10:10, jian he wrote:
> >> >> > > > On Thu, Jan 18, 2024 at 8:57 AM Masahiko Sawada 
> >> >> > > > 
> >> >> > > > wrote:
> >> >> > > >> On Thu, Jan 18, 2024 at 6:38 AM Tom Lane  
> >> >> > > >> wrote:
> >> >> > > >> > Kyotaro-san's suggestion isn't bad, though I might shorten it 
> >> >> > > >> > to
> >> >> > > >> > error_action {error|ignore|log} (or perhaps "stop" instead of 
> >> >> > > >> > "error")?
> >> >> > > >> > You will need a separate parameter anyway to specify the 
> >> >> > > >> > destination
> >> >> > > >> > of "log", unless "none" became an illegal table name when I 
> >> >> > > >> > wasn't
> >> >> > > >> > looking.  I don't buy that one parameter that has some special 
> >> >> > > >> > values
> >> >> > > >> > while other values could be names will be a good design.  
> >> >> > > >> > Moreover,
> >> >> > > >> > what if we want to support (say) log-to-file along with 
> >> >> > > >> > log-to-table?
> >> >> > > >> > Trying to distinguish a file name from a table name without 
> >> >> > > >> > any other
> >> >> > > >> > context seems impossible.
> >> >> > > >>
> >> >> > > >> I've been thinking we can add more values to this option to log 
> >> >> > > >> errors
> >> >> > > >> not only to the server logs but also to the error table (not sure
> >> >> > > >> details but I imagined an error table is created for each table 
> >> >> > > >> on
> >> >> > > >> error), without an additional option for the destination name. 
> >> >> > > >> The
> >> >> > > >> values would be like error_action 
> >> >> > > >> {error|ignore|save-logs|save-table}.
> >> >> > > >>
> >> >> > > >
> >> >> > > > another idea:
> >> >> > > > on_error {error|ignore|other_future_option}
> >> >> > > > if not specified then by default ERROR.
> >> >> > > > You can also specify ERROR or IGNORE for now.
> >> >> > > >
> >> >> > > > I agree, the parameter "error_action" is better than "location".
> >> >> > >
> >> >> > > I'm not sure whether error_action or on_error is better, but either 
> >> >> > > way
> >> >> > > "error_action error" and "on_error error" seems a bit odd to me.
> >> >> > > I feel "stop" is better for both cases as Tom suggested.
> >> >> >
> >> >> > OK.  What about this?
> >> >> > on_error {stop|ignore|other_future_option}
> >> >> > where other_future_option might be compound like "file 'copy.log'" or
> >> >> > "table 'copy_log'".
> >> >>
> >> >> +1
> >> >>
> >> >
> >> > I realized that ON_ERROR syntax synoposis in the documentation is not
> >> > correct. The option doesn't require the value to be quoted and the
> >> > value can be omitted. The attached patch fixes it.
> >> >
> >> > Regards,
> >>
> >> Thanks!
> >>
> >> Attached patch fixes the doc, but I'm wondering perhaps it might be
> >> better to modify the codes to prohibit abbreviation of the value.
> >>
> >> When seeing the query which abbreviates ON_ERROR value, I feel it's
> >> not
> >> obvious what happens compared to other options which tolerates
> >> abbreviation of the value such as FREEZE or HEADER.
> >>
> >>COPY t1 FROM stdin WITH (ON_ERROR);
> >>
> >> What do you think?
> >
> > Indeed. Looking at the options of other commands such as VACUUM and
> > EXPLAIN, I can see that we can omit a boolean value, but non-boolean
> > parameters require a value. The HEADER option is not a pure boolean
> > parameter, but we can omit its value; that seems to be for backward
> > compatibility, since it used to be a boolean parameter. I agree that
> > the above example would confuse users.
> >
> > Regards,
>
> Thanks for your comment!
>
> Attached a patch which modifies the code to prohibit omission of its
> value.
>
> I was a little unsure about adding a regression test for this, but I
> have not added it since other COPY option doesn't test the omission of
> its value.

Perhaps we should change the doc as well, since the ON_ERROR value
doesn't necessarily need to be single-quoted?

The rest looks good to me.

Alexander, what do you think about this change as you're the committer
of this feature?

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Improve eviction algorithm in ReorderBuffer

2024-03-31 Thread Masahiko Sawada
On Fri, Mar 29, 2024 at 8:48 PM Hayato Kuroda (Fujitsu)
 wrote:
>
> Dear Sawada-san,
>
> > Agreed.
> >
> > I think the patch is in good shape. I'll push the patch with the
> > suggestion next week, barring any objections.
>
> Thanks for working on this. Agreed it is committable.
> Few minor comments:

Thank you for the comments!

>
> ```
> + * Either txn or change must be non-NULL at least. We update the memory
> + * counter of txn if it's non-NULL, otherwise change->txn.
> ```
>
> IIUC no one checks the restriction. Should we add Assert() for it, e.g,:
> Assert(txn || change)?

Agreed to add it.

>
> ```
> +/* make sure enough space for a new node */
> ...
> +/* make sure enough space for a new node */
> ```
>
> Should be started with upper case?

I don't think we need to change it. There are other one-line comments
in the same file that start with lowercase.

I've just submitted the updated patches[1].

Regards,

[1] 
https://www.postgresql.org/message-id/CAD21AoA6%3D%2BtL%3DbtB_s9N%2BcZK7tKz1W%3DPQyNq72nzjUcdyE%2BwZw%40mail.gmail.com

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Improve eviction algorithm in ReorderBuffer

2024-03-31 Thread Masahiko Sawada
On Fri, Mar 29, 2024 at 7:37 PM Amit Kapila  wrote:
>
> On Fri, Mar 29, 2024 at 12:13 PM Masahiko Sawada  
> wrote:
> >
> > On Fri, Mar 29, 2024 at 2:09 PM vignesh C  wrote:
> > >
> > > On Tue, 26 Mar 2024 at 10:05, Masahiko Sawada  
> > > wrote:
> > > >
> > > > On Thu, Mar 14, 2024 at 12:02 PM Masahiko Sawada 
> > > >  wrote:
> > > > >
> > > > >
> > > > > I've attached new version patches.
> > > >
> > > > Since the previous patch conflicts with the current HEAD, I've
> > > > attached the rebased patches.
> > >
> > > Thanks for the updated patch.
> > > One comment:
> > > I felt we can mention the improvement where we update memory
> > > accounting info at transaction level instead of per change level which
> > > is done in ReorderBufferCleanupTXN, ReorderBufferTruncateTXN, and
> > > ReorderBufferSerializeTXN also in the commit message:
> >
> > Agreed.
> >
> > I think the patch is in good shape. I'll push the patch with the
> > suggestion next week, barring any objections.
> >
>
> Few minor comments:
> 1.
> @@ -3636,6 +3801,8 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
>   Assert(txn->nentries_mem == 0);
>   }
>
> + ReorderBufferMaybeResetMaxHeap(rb);
> +
>
> Can we write a comment about why this reset is required here?
> Otherwise, the reason is not apparent.

Yes, added.

>
> 2.
> Although using max-heap to select the largest
> + * transaction is effective when there are many transactions being decoded,
> + * there is generally no need to use it as long as all transactions being
> + * decoded are top-level transactions. Therefore, we use MaxConnections as 
> the
> + * threshold so we can prevent switching to the state unless we use
> + * subtransactions.
> + */
> +#define MAX_HEAP_TXN_COUNT_THRESHOLD MaxConnections
>
> Isn't using max-heap equally effective in finding the largest
> transaction whether there are top-level or top-level plus
> subtransactions? This comment indicates it is only effective when
> there are subtransactions.

You're right. Updated the comment.

I've attached the updated patches.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


v11-0002-Add-functions-to-binaryheap-for-efficient-key-re.patch
Description: Binary data


v11-0001-Make-binaryheap-enlargeable.patch
Description: Binary data


v11-0003-Improve-eviction-algorithm-in-Reorderbuffer-usin.patch
Description: Binary data


Re: Add new error_action COPY ON_ERROR "log"

2024-03-31 Thread Masahiko Sawada
On Sat, Mar 30, 2024 at 11:05 PM Bharath Rupireddy
 wrote:
>
> On Thu, Mar 28, 2024 at 6:31 PM Masahiko Sawada  wrote:
> >
> > That is,
> > since the LOG_VERBOSITY option is an enum parameter, it might make
> > more sense to require the value, instead of making the value optional.
> For example, the following command might not be obvious to users:
> >
> > COPY test FROM stdin (ON_ERROR ignore, LOG_VERBOSITY);
>
> Agreed. Please see the attached v14 patch.

Thank you for updating the patch!

>  The LOG_VERBOSITY now needs
> a value to be specified. Note that I've not added any test for this
> case as there seemed to be no such tests so far generating "ERROR:
> <> requires a parameter". I don't mind adding one for
> LOG_VERBOSITY though.

+1

One minor point:

 ENCODING 'encoding_name'
+LOG_VERBOSITY [ mode ]
 

'[' and ']' are not necessary because the value is no longer optional.

I've attached the updated patch. I'll push it, barring any objections.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


v15-0001-Add-new-COPY-option-LOG_VERBOSITY.patch
Description: Binary data


Re: Improve eviction algorithm in ReorderBuffer

2024-03-29 Thread Masahiko Sawada
On Fri, Mar 29, 2024 at 2:09 PM vignesh C  wrote:
>
> On Tue, 26 Mar 2024 at 10:05, Masahiko Sawada  wrote:
> >
> > On Thu, Mar 14, 2024 at 12:02 PM Masahiko Sawada  
> > wrote:
> > >
> > >
> > > I've attached new version patches.
> >
> > Since the previous patch conflicts with the current HEAD, I've
> > attached the rebased patches.
>
> Thanks for the updated patch.
> One comment:
> I felt we can mention the improvement where we update memory
> accounting info at transaction level instead of per change level which
> is done in ReorderBufferCleanupTXN, ReorderBufferTruncateTXN, and
> ReorderBufferSerializeTXN also in the commit message:

Agreed.

I think the patch is in good shape. I'll push the patch with the
suggestion next week, barring any objections.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-29 Thread Masahiko Sawada
On Thu, Mar 28, 2024 at 6:15 PM John Naylor  wrote:
>
> On Thu, Mar 28, 2024 at 12:55 PM Masahiko Sawada  
> wrote:
> >
> > Pushed the refactoring patch.
> >
> > I've attached the rebased vacuum improvement patch for cfbot. I
> > mentioned in the commit message that this patch eliminates the 1GB
> > limitation.
> >
> > I think the patch is in good shape. Do you have other comments or
> > suggestions, John?
>
> I'll do another pass tomorrow, but first I wanted to get in another
> slightly-challenging in-situ test.

Thanks!

>
> About the "tuples missed" -- I didn't expect contention during this
> test. I believe that's completely unrelated behavior, but wanted to
> mention it anyway, since I found it confusing.

I haven't investigated it enough, but bgwriter might be related to the
contention.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Add new error_action COPY ON_ERROR "log"

2024-03-28 Thread Masahiko Sawada
On Thu, Mar 28, 2024 at 5:28 PM Bharath Rupireddy
 wrote:
>
> On Thu, Mar 28, 2024 at 1:43 PM torikoshia  wrote:
> >
> > Attached patch fixes the doc,
>
> May I know which patch you are referring to? And, what do you mean by
> "fixes the doc"?
>
> > but I'm wondering perhaps it might be
> > better to modify the codes to prohibit abbreviation of the value.
>
> Please help me understand the meaning here.
>
> > When seeing the query which abbreviates ON_ERROR value, I feel it's not
> > obvious what happens compared to other options which tolerates
> > abbreviation of the value such as FREEZE or HEADER.
> >
> >COPY t1 FROM stdin WITH (ON_ERROR);
> >
> > What do you think?
>
> So, do you mean to prohibit ON_ERROR being specified without any value
> like in COPY t1 FROM stdin WITH (ON_ERROR);? If yes, I think all the
> other options do allow that [1].
>
> Even if we were to do something like this, shall we discuss this separately?
>
> Having said that, what do you think of the v13 patch posted upthread?
>

This topic accidentally came up in this thread, but it made me think
that the same might be true for the LOG_VERBOSITY option. That is,
since the LOG_VERBOSITY option is an enum parameter, it might make
more sense to require the value, instead of making the value optional.
For example, the following command might not be obvious to users:

COPY test FROM stdin (ON_ERROR ignore, LOG_VERBOSITY);

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

2024-03-28 Thread Masahiko Sawada
On Thu, Mar 28, 2024 at 9:38 PM torikoshia  wrote:
>
> On 2024-03-28 10:20, Masahiko Sawada wrote:
> > Hi,
> >
> > On Thu, Jan 18, 2024 at 5:33 PM Masahiko Sawada 
> > wrote:
> >>
> >> On Thu, Jan 18, 2024 at 4:59 PM Alexander Korotkov
> >>  wrote:
> >> >
> >> > On Thu, Jan 18, 2024 at 4:16 AM torikoshia  
> >> > wrote:
> >> > > On 2024-01-18 10:10, jian he wrote:
> >> > > > On Thu, Jan 18, 2024 at 8:57 AM Masahiko Sawada 
> >> > > > 
> >> > > > wrote:
> >> > > >> On Thu, Jan 18, 2024 at 6:38 AM Tom Lane  wrote:
> >> > > >> > Kyotaro-san's suggestion isn't bad, though I might shorten it to
> >> > > >> > error_action {error|ignore|log} (or perhaps "stop" instead of 
> >> > > >> > "error")?
> >> > > >> > You will need a separate parameter anyway to specify the 
> >> > > >> > destination
> >> > > >> > of "log", unless "none" became an illegal table name when I wasn't
> >> > > >> > looking.  I don't buy that one parameter that has some special 
> >> > > >> > values
> >> > > >> > while other values could be names will be a good design.  
> >> > > >> > Moreover,
> >> > > >> > what if we want to support (say) log-to-file along with 
> >> > > >> > log-to-table?
> >> > > >> > Trying to distinguish a file name from a table name without any 
> >> > > >> > other
> >> > > >> > context seems impossible.
> >> > > >>
> >> > > >> I've been thinking we can add more values to this option to log 
> >> > > >> errors
> >> > > >> not only to the server logs but also to the error table (not sure
> >> > > >> details but I imagined an error table is created for each table on
> >> > > >> error), without an additional option for the destination name. The
> >> > > >> values would be like error_action 
> >> > > >> {error|ignore|save-logs|save-table}.
> >> > > >>
> >> > > >
> >> > > > another idea:
> >> > > > on_error {error|ignore|other_future_option}
> >> > > > if not specified then by default ERROR.
> >> > > > You can also specify ERROR or IGNORE for now.
> >> > > >
> >> > > > I agree, the parameter "error_action" is better than "location".
> >> > >
> >> > > I'm not sure whether error_action or on_error is better, but either way
> >> > > "error_action error" and "on_error error" seems a bit odd to me.
> >> > > I feel "stop" is better for both cases as Tom suggested.
> >> >
> >> > OK.  What about this?
> >> > on_error {stop|ignore|other_future_option}
> >> > where other_future_option might be compound like "file 'copy.log'" or
> >> > "table 'copy_log'".
> >>
> >> +1
> >>
> >
> > I realized that ON_ERROR syntax synoposis in the documentation is not
> > correct. The option doesn't require the value to be quoted and the
> > value can be omitted. The attached patch fixes it.
> >
> > Regards,
>
> Thanks!
>
> Attached patch fixes the doc, but I'm wondering perhaps it might be
> better to modify the codes to prohibit abbreviation of the value.
>
> When seeing the query which abbreviates ON_ERROR value, I feel it's not
> obvious what happens compared to other options which tolerates
> abbreviation of the value such as FREEZE or HEADER.
>
>COPY t1 FROM stdin WITH (ON_ERROR);
>
> What do you think?

Indeed. Looking at the options of other commands such as VACUUM and
EXPLAIN, I can see that we can omit a boolean value, but non-boolean
parameters require a value. The HEADER option is not a pure boolean
parameter, but we can omit its value; that seems to be for backward
compatibility, since it used to be a boolean parameter. I agree that
the above example would confuse users.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Improve tab completion for ALTER DEFAULT PRIVILEGE and ALTER TABLE

2024-03-28 Thread Masahiko Sawada
Hi,

Thank you for the patch!

On Mon, Jul 3, 2023 at 12:12 AM vignesh C  wrote:
>
> Hi,
>
> Improved tab completion for "ALTER DEFAULT PRIVILEGE" and "ALTER TABLE":
> 1) GRANT, REVOKE and FOR USER keyword was not displayed in tab
> completion of alter default privileges like the below statement:
> ALTER DEFAULT PRIVILEGES GRANT INSERT ON tables TO PUBLIC;
> ALTER DEFAULT PRIVILEGES REVOKE INSERT ON tables FROM PUBLIC;
> ALTER DEFAULT PRIVILEGES FOR USER vignesh revoke INSERT ON tables FROM dba1;

+1

>
> 2) USER was not displayed for "ALTER DEFAULT PRIVILEGES IN SCHEMA
> public FOR " like in below statement:
> ALTER DEFAULT PRIVILEGES IN SCHEMA public FOR USER dba1 GRANT INSERT
> ON TABLES TO PUBLIC;

Since there is no difference between FOR USER and FOR ROLE, I'm not
sure we really want to support both in tab-completion.

>
> 3) "FOR GRANT OPTION" was not display for "ALTER DEFAULT PRIVILEGES
> REVOKE " like in below statement:
> alter default privileges revoke grant option for select ON tables FROM dba1;

+1. But the v3 patch doesn't cover the following case:

=# alter default privileges for role masahiko revoke [tab]
ALL         CREATE      DELETE      EXECUTE     INSERT      MAINTAIN
REFERENCES  SELECT      TRIGGER     TRUNCATE    UPDATE      USAGE

And it doesn't cover MAINTAIN either:

=# alter default privileges revoke [tab]
ALL               DELETE            GRANT OPTION FOR  REFERENCES        TRIGGER           UPDATE
CREATE            EXECUTE           INSERT            SELECT            TRUNCATE          USAGE

The patch adds the completions for ALTER DEFAULT PRIVILEGES REVOKE,
but we handle such cases in the GRANT and REVOKE part:

(around L3958)
/*
 * With ALTER DEFAULT PRIVILEGES, restrict completion to grantable
 * privileges (can't grant roles)
 */
if (HeadMatches("ALTER", "DEFAULT", "PRIVILEGES"))
COMPLETE_WITH("SELECT", "INSERT", "UPDATE",
  "DELETE", "TRUNCATE", "REFERENCES", "TRIGGER",
  "CREATE", "EXECUTE", "USAGE", "MAINTAIN", "ALL");

Also, I think we can support WITH GRANT OPTION too. For example,

=# alter default privileges for role masahiko grant all on tables to
public [tab]

It's already supported in the GRANT statement.
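For reference, here are sketches of the statements the missing
completions correspond to (illustrative only, assuming a role alice):

    ALTER DEFAULT PRIVILEGES FOR ROLE alice REVOKE MAINTAIN ON TABLES FROM PUBLIC;
    ALTER DEFAULT PRIVILEGES REVOKE GRANT OPTION FOR SELECT ON TABLES FROM alice;
    ALTER DEFAULT PRIVILEGES FOR ROLE alice GRANT ALL ON TABLES TO PUBLIC WITH GRANT OPTION;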

>
> 4) "DATA TYPE" was missing in "ALTER TABLE table-name ALTER COLUMN
> column-name SET" like in:
> ALTER TABLE t1 ALTER COLUMN c1 SET DATA TYPE text;
>

+1. The patch looks good to me, so pushed.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-27 Thread Masahiko Sawada
On Wed, Mar 27, 2024 at 5:43 PM Masahiko Sawada  wrote:
>
> On Wed, Mar 27, 2024 at 9:25 AM John Naylor  wrote:
> >
> > On Mon, Mar 25, 2024 at 8:07 PM Masahiko Sawada  
> > wrote:
> > >
> > > On Mon, Mar 25, 2024 at 3:25 PM John Naylor  
> > > wrote:
> > > >
> > > > On Fri, Mar 22, 2024 at 12:20 PM Masahiko Sawada 
> > > >  wrote:
> >
> > > > - * remaining LP_DEAD line pointers on the page in the dead_items
> > > > - * array. These dead items include those pruned by lazy_scan_prune()
> > > > - * as well we line pointers previously marked LP_DEAD.
> > > > + * remaining LP_DEAD line pointers on the page in the dead_items.
> > > > + * These dead items include those pruned by lazy_scan_prune() as well
> > > > + * we line pointers previously marked LP_DEAD.
> > > >
> > > > Here maybe "into dead_items".
> >
> > - * remaining LP_DEAD line pointers on the page in the dead_items.
> > + * remaining LP_DEAD line pointers on the page into the dead_items.
> >
> > Let me explain. It used to be "in the dead_items array." It is not an
> > array anymore, so it was changed to "in the dead_items". dead_items is
> > a variable name, and names don't take "the". "into dead_items" seems
> > most natural to me, but there are other possible phrasings.
>
> Thanks for the explanation. I was distracted. Fixed in the latest patch.
>
> >
> > > > > > Did you try it with 1MB m_w_m?
> > > > >
> > > > > I've incorporated the above comments and test results look good to me.
> > > >
> > > > Could you be more specific about what the test was?
> > > > Does it work with 1MB m_w_m?
> > >
> > > If m_w_m is 1MB, both the initial and maximum segment sizes are 256kB.
> > >
> > > FYI other test cases I tested were:
> > >
> > > * m_w_m = 2199023254528 (maximum value)
> > > initial: 1MB
> > > max: 128GB
> > >
> > > * m_w_m = 64MB (default)
> > > initial: 1MB
> > > max: 8MB
> >
> > If the test was a vacuum, how big a table was needed to hit 128GB?
>
> I just checked how TidStoreCreateLocal() calculated the initial and
> max segment sizes while changing m_w_m, so I didn't check how big the
> actually allocated segments are in the maximum-value test case.
>
> >
> > > > The existing comment slipped past my radar, but max_bytes is not a
> > > > limit, it's a hint. Come to think of it, it never was a limit in the
> > > > normal sense, but in earlier patches it was the criteria for reporting
> > > > "I'm full" when asked.
> > >
> > > Updated the comment.
> >
> > + * max_bytes is not a limit; it's used to choose the memory block sizes of
> > + * a memory context for TID storage in order for the total memory 
> > consumption
> > + * not to be overshot a lot. The caller can use the max_bytes as the 
> > criteria
> > + * for reporting whether it's full or not.
> >
> > This is good information. I suggest this edit:
> >
> > "max_bytes" is not an internally-enforced limit; it is used only as a
> > hint to cap the memory block size of the memory context for TID
> > storage. This reduces space wastage due to over-allocation. If the
> > caller wants to monitor memory usage, it must compare its limit with
> > the value reported by TidStoreMemoryUsage().
> >
> > Other comments:
>
> Thanks for the suggestion!
>
> >
> > v79-0002 looks good to me.
> >
> > v79-0003:
> >
> > "With this commit, when creating a shared TidStore, a dedicated DSA
> > area is created for TID storage instead of using the provided DSA
> > area."
> >
> > This is very subtle, but "the provided..." implies there still is one.
> > -> "a provided..."
> >
> > + * Similar to TidStoreCreateLocal() but create a shared TidStore on a
> > + * DSA area. The TID storage will live in the DSA area, and a memory
> > + * context rt_context will have only meta data of the radix tree.
> >
> > -> "the memory context"
>
> Fixed in the latest patch.
>
> >
> > I think you can go ahead and commit 0002 and 0003/4.
>
> I've pushed the 0002 (dsa init and max segment size) patch, and will
> push the attached 0001 patch next.

Pushed the refactoring patch.

I've attached the rebased vacuum improvement patch for cfbot. I
mentioned in the commit message that this patch eliminates the 1GB
limitation.

I think the patch is in good shape. Do you have other comments or
suggestions, John?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


v81-0001-Use-TidStore-for-dead-tuple-TIDs-storage-during-.patch
Description: Binary data


Re: Fix parallel vacuum buffer usage reporting

2024-03-27 Thread Masahiko Sawada
Hi,

Thank you for the report.

On Fri, Feb 9, 2024 at 6:10 PM Anthonin Bonnefoy
 wrote:
>
> Hi,
>
> With a db setup with pgbench, we add an additional index:
> CREATE INDEX ON pgbench_accounts(abalance)
>
> And trigger several updates and vacuum to reach a stable amount of
> dirtied pages:
> UPDATE pgbench_accounts set abalance = abalance + 1 WHERE aid=1; CHECKPOINT;
> VACUUM (VERBOSE, INDEX_CLEANUP ON) pgbench_accounts
>
> The vacuum will report the following:
> INFO:  vacuuming "postgres.public.pgbench_accounts"
> INFO:  launched 1 parallel vacuum worker for index vacuuming (planned: 1)
> INFO:  finished vacuuming "postgres.public.pgbench_accounts": index scans: 1
> ...
> buffer usage: 122 hits, 165 misses, 4 dirtied
>
> 4 pages were reported dirtied. However, we have 5 dirtied blocks at
> the end of the vacuum when looking at pg_buffercache:
>
> SELECT c.relname, b.relfilenode
>   FROM pg_buffercache b
>   LEFT JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid)
>  WHERE isdirty = true;
>             relname            | relfilenode
> -------------------------------+-------------
>  pg_class                      |        1259
>  pgbench_accounts              |       16400
>  pgbench_accounts              |       16400
>  pgbench_accounts_pkey         |       16408
>  pgbench_accounts_abalance_idx |       16480
>
> The missing dirty block comes from the parallel worker vacuuming the
> abalance index. Running vacuum with parallel disabled will give the
> correct result.
>
> Vacuum uses dedicated VacuumPage{Hit,Miss,Dirty} globals to track
> buffer usage. However, those values are not collected at the end of
> parallel vacuum workers, leading to incorrect buffer count.

True. I think we should also fix it on back branches.

>
> Those vacuum specific globals are redundant with the existing
> pgBufferUsage and only used in the verbose output. This patch removes
> them and replaces them by pgBufferUsage which is already correctly
> collected at the end of parallel workers, fixing the buffer count.

It seems to make sense to remove VacuumPageHit and friends, only on
the master branch, if we can use BufferUsage instead.

As for the proposed patch, the following part should handle the temp tables too:

        appendStringInfo(&buf, _("avg read rate: %.3f MB/s, avg write rate: %.3f MB/s\n"),
                         read_rate, write_rate);
        appendStringInfo(&buf, _("buffer usage: %lld hits, %lld misses, %lld dirtied\n"),
-                        (long long) AnalyzePageHit,
-                        (long long) AnalyzePageMiss,
-                        (long long) AnalyzePageDirty);
+                        (long long) bufferusage.shared_blks_hit,
+                        (long long) bufferusage.shared_blks_read,
+                        (long long) bufferusage.shared_blks_dirtied);
        appendStringInfo(&buf, _("system usage: %s"), pg_rusage_show(&ru0));
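For temporary tables, BufferUsage tracks the corresponding local_blks_*
counters, so presumably the shared and local counts would need to be
summed; a rough sketch of that part (my guess, not the final patch):

    appendStringInfo(&buf, _("buffer usage: %lld hits, %lld misses, %lld dirtied\n"),
                     (long long) (bufferusage.shared_blks_hit + bufferusage.local_blks_hit),
                     (long long) (bufferusage.shared_blks_read + bufferusage.local_blks_read),
                     (long long) (bufferusage.shared_blks_dirtied + bufferusage.local_blks_dirtied));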

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

2024-03-27 Thread Masahiko Sawada
Hi,

On Thu, Jan 18, 2024 at 5:33 PM Masahiko Sawada  wrote:
>
> On Thu, Jan 18, 2024 at 4:59 PM Alexander Korotkov  
> wrote:
> >
> > On Thu, Jan 18, 2024 at 4:16 AM torikoshia  
> > wrote:
> > > On 2024-01-18 10:10, jian he wrote:
> > > > On Thu, Jan 18, 2024 at 8:57 AM Masahiko Sawada 
> > > > wrote:
> > > >> On Thu, Jan 18, 2024 at 6:38 AM Tom Lane  wrote:
> > > >> > Kyotaro-san's suggestion isn't bad, though I might shorten it to
> > > >> > error_action {error|ignore|log} (or perhaps "stop" instead of 
> > > >> > "error")?
> > > >> > You will need a separate parameter anyway to specify the destination
> > > >> > of "log", unless "none" became an illegal table name when I wasn't
> > > >> > looking.  I don't buy that one parameter that has some special values
> > > >> > while other values could be names will be a good design.  Moreover,
> > > >> > what if we want to support (say) log-to-file along with log-to-table?
> > > >> > Trying to distinguish a file name from a table name without any other
> > > >> > context seems impossible.
> > > >>
> > > >> I've been thinking we can add more values to this option to log errors
> > > >> not only to the server logs but also to an error table (not sure
> > > >> about the details, but I imagined an error table being created for each
> > > >> table on error), without an additional option for the destination name. The
> > > >> values would be like error_action {error|ignore|save-logs|save-table}.
> > > >>
> > > >
> > > > another idea:
> > > > on_error {error|ignore|other_future_option}
> > > > if not specified then by default ERROR.
> > > > You can also specify ERROR or IGNORE for now.
> > > >
> > > > I agree, the parameter "error_action" is better than "location".
> > >
> > > I'm not sure whether error_action or on_error is better, but either way
> > > "error_action error" and "on_error error" seems a bit odd to me.
> > > I feel "stop" is better for both cases as Tom suggested.
> >
> > OK.  What about this?
> > on_error {stop|ignore|other_future_option}
> > where other_future_option might be compound like "file 'copy.log'" or
> > "table 'copy_log'".
>
> +1
>

I realized that the ON_ERROR syntax synopsis in the documentation is not
correct. The option doesn't require the value to be quoted and the
value can be omitted. The attached patch fixes it.
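For reference, after the fix the synopsis line should read roughly as
follows (the exact markup in copy.sgml may differ), i.e. without quotes
around the value:

    ON_ERROR error_action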

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


0001-doc-Fix-COPY-ON_ERROR-option-syntax-synopsis.patch
Description: Binary data


Re: Add new error_action COPY ON_ERROR "log"

2024-03-27 Thread Masahiko Sawada
On Thu, Mar 28, 2024 at 2:49 AM Bharath Rupireddy
 wrote:
>
> On Wed, Mar 27, 2024 at 7:42 AM Masahiko Sawada  wrote:
> >
> > I think that there are two options to handle it:
> >
> > 1. change the COPY grammar to accept DEFAULT as an option value.
> > 2. change the tab-completion to complete 'DEFAULT' instead of DEFAULT,
> > and update the doc too.
> >
> > As for the documentation, we can add single-quotes as follows:
> >
> >  ENCODING 'encoding_name'
> > +LOG_VERBOSITY [ 'mode' ]
> >
> > I thought option (2) was better, but there seems to be no precedent for
> > completing a single-quoted string other than file names. So option (1)
> > could be clearer.
> >
> > What do you think?
>
> There is another option to change log_verbosity to {none, verbose} or
> {none, skip_row_info} (discussed here
> https://www.postgresql.org/message-id/Zelrqq-pkfkvsjPn%40paquier.xyz
> that we might extend this option to other use-cases in the future). I tend
> to agree with you that log_verbosity should be settable to default without
> quotes, just to be consistent with other commands that allow that [1].
> And, thanks for quickly identifying where to change in the gram.y.
> With that change, now I have changed all the new tests added to use
> log_verbosity default without quotes.
>
> FWIW, a recent commit [2] did the same. Therefore, I don't see a
> problem supporting it that way for COPY log_verbosity.
>
> Please find the attached v13 patch with the change.

Thank you for updating the patch quickly, and for sharing the reference.

I think {default, verbose} is a good start and is more consistent with
other similar features. We can add other modes later.

Regarding the syntax change, since the copy_generic_opt_arg rule is used
only for COPY options, the change doesn't affect any other syntax. I've
confirmed that the tab-completion works fine.
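In other words, with the grammar change both of the following now work
(previously only the quoted form did):

    COPY test FROM '/tmp/dump.data' WITH (LOG_VERBOSITY default);
    COPY test FROM '/tmp/dump.data' WITH (LOG_VERBOSITY 'default');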

I'll push the patch, barring any objections.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Skip collecting decoded changes of already-aborted transactions

2024-03-27 Thread Masahiko Sawada
On Wed, Mar 27, 2024 at 8:49 PM Ajin Cherian  wrote:
>
>
>
> On Mon, Mar 18, 2024 at 7:50 PM Masahiko Sawada  wrote:
>>
>>
>> In addition to these changes, I've made some changes to the latest
>> patch. Here is the summary:
>>
>> - Use txn_flags field to record the transaction status instead of two
>> 'committed' and 'aborted' flags.
>> - Add regression tests.
>> - Update commit message.
>>
>> Regards,
>>
>
> Hi Sawada-san,
>
> Thanks for the updated patch. Some comments:

Thank you for the review comments!

>
> 1.
> + * already aborted, we discards all changes accumulated so far and ignore
> + * future changes, and return true. Otherwise return false.
>
> we discards/we discard

Will fix it.

>
> 2. In function ReorderBufferCheckTXNAbort(): I haven't tested this, but I
> wonder how prepared transactions would be considered; they are neither
> committed nor in progress.

A transaction that is prepared but not yet resolved is considered
in-progress.
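For example (a minimal illustration, assuming a table t exists and
max_prepared_transactions > 0):

    BEGIN;
    INSERT INTO t VALUES (1);
    PREPARE TRANSACTION 'p1';
    -- until COMMIT PREPARED 'p1' or ROLLBACK PREPARED 'p1' is executed,
    -- the transaction is treated as in-progress by this check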

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-27 Thread Masahiko Sawada
On Wed, Mar 27, 2024 at 9:25 AM John Naylor  wrote:
>
> On Mon, Mar 25, 2024 at 8:07 PM Masahiko Sawada  wrote:
> >
> > On Mon, Mar 25, 2024 at 3:25 PM John Naylor  wrote:
> > >
> > > On Fri, Mar 22, 2024 at 12:20 PM Masahiko Sawada  
> > > wrote:
>
> > > - * remaining LP_DEAD line pointers on the page in the dead_items
> > > - * array. These dead items include those pruned by lazy_scan_prune()
> > > - * as well we line pointers previously marked LP_DEAD.
> > > + * remaining LP_DEAD line pointers on the page in the dead_items.
> > > + * These dead items include those pruned by lazy_scan_prune() as well
> > > + * we line pointers previously marked LP_DEAD.
> > >
> > > Here maybe "into dead_items".
>
> - * remaining LP_DEAD line pointers on the page in the dead_items.
> + * remaining LP_DEAD line pointers on the page into the dead_items.
>
> Let me explain. It used to be "in the dead_items array." It is not an
> array anymore, so it was changed to "in the dead_items". dead_items is
> a variable name, and names don't take "the". "into dead_items" seems
> most natural to me, but there are other possible phrasings.

Thanks for the explanation. I was distracted. Fixed in the latest patch.

>
> > > > > Did you try it with 1MB m_w_m?
> > > >
> > > > I've incorporated the above comments and test results look good to me.
> > >
> > > Could you be more specific about what the test was?
> > > Does it work with 1MB m_w_m?
> >
> > If m_w_m is 1MB, both the initial and maximum segment sizes are 256kB.
> >
> > FYI other test cases I tested were:
> >
> > * m_w_m = 2199023254528 (maximum value)
> > initial: 1MB
> > max: 128GB
> >
> > * m_w_m = 64MB (default)
> > initial: 1MB
> > max: 8MB
>
> If the test was a vacuum, how big a table was needed to hit 128GB?

I just checked how TidStoreCreateLocal() calculated the initial and
max segment sizes while changing m_w_m, so I didn't check how big the
actually allocated segments are in the maximum-value test case.

>
> > > The existing comment slipped past my radar, but max_bytes is not a
> > > limit, it's a hint. Come to think of it, it never was a limit in the
> > > normal sense, but in earlier patches it was the criteria for reporting
> > > "I'm full" when asked.
> >
> > Updated the comment.
>
> + * max_bytes is not a limit; it's used to choose the memory block sizes of
> + * a memory context for TID storage in order for the total memory consumption
> + * not to be overshot a lot. The caller can use the max_bytes as the criteria
> + * for reporting whether it's full or not.
>
> This is good information. I suggest this edit:
>
> "max_bytes" is not an internally-enforced limit; it is used only as a
> hint to cap the memory block size of the memory context for TID
> storage. This reduces space wastage due to over-allocation. If the
> caller wants to monitor memory usage, it must compare its limit with
> the value reported by TidStoreMemoryUsage().
>
> Other comments:

Thanks for the suggestion!

>
> v79-0002 looks good to me.
>
> v79-0003:
>
> "With this commit, when creating a shared TidStore, a dedicated DSA
> area is created for TID storage instead of using the provided DSA
> area."
>
> This is very subtle, but "the provided..." implies there still is one.
> -> "a provided..."
>
> + * Similar to TidStoreCreateLocal() but create a shared TidStore on a
> + * DSA area. The TID storage will live in the DSA area, and a memory
> + * context rt_context will have only meta data of the radix tree.
>
> -> "the memory context"

Fixed in the latest patch.

>
> I think you can go ahead and commit 0002 and 0003/4.

I've pushed the 0002 (dsa init and max segment size) patch, and will
push the attached 0001 patch next.

>
> v79-0005:
>
> - bypass = (vacrel->lpdead_item_pages < threshold &&
> -   vacrel->lpdead_items < MAXDEADITEMS(32L * 1024L * 1024L));
> + bypass = (vacrel->lpdead_item_pages < threshold) &&
> + TidStoreMemoryUsage(vacrel->dead_items) < (32L * 1024L * 1024L);
>
> The parentheses look strange, and the first line shouldn't change
> without a good reason.

Fixed.

>
> - /* Set dead_items space */
> - dead_items = (VacDeadItems *) shm_toc_lookup(toc,
> - PARALLEL_VACUUM_KEY_DEAD_ITEMS,
> - false);
> + /* Set dead items */
> + dead_items = TidStoreAttach(shared->dead_items_dsa_handle,
> + shared->dead_items_handle);
>
> I feel ambivalent about this comment change. The original is not very
> descriptive to begin with. If we need to change at all, maybe "find
> dead_items in shared memory"?

Agreed.

>
> v79-0005: As I said earlier, Dilip Kumar reviewed an earlier version.
>
> v79-0006:
>
> vac_work_mem should also go back to being an int.

Fixed.

I've attached the latest patches.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


v80-0001-Rethink-create-and-attach-APIs-of-shared-TidStor.patch
Description: Binary data


v80-0002-Use-TidStore-for-dead-tuple-TIDs-storage-during-.patch
Description: Binary data


Re: Add new error_action COPY ON_ERROR "log"

2024-03-26 Thread Masahiko Sawada
On Tue, Mar 26, 2024 at 6:36 PM Bharath Rupireddy
 wrote:
>
> On Tue, Mar 26, 2024 at 1:46 PM Masahiko Sawada  wrote:
> >
> > Thank you for updating the patch! Looks good to me.
> >
> > Please find the attached patch. I've made some changes for the
> > documentation and the commit message. I'll push it, barring any
> > objections.
>
> Thanks. v12 patch LGTM.
>

While testing the new option, I realized that the tab-completion
completes DEFAULT as the value, but the command doesn't work without single quotes:

postgres(1:2179134)=# copy test from '/tmp/dump.data' with
(log_verbosity default );
ERROR:  syntax error at or near "default"
LINE 1: ...py test from '/tmp/dump.data' with (log_verbosity default );
 ^
postgres(1:2179134)=# copy test from '/tmp/dump.data' with
(log_verbosity 'default' );
COPY 0

Whereas VERBOSE works even without single-quotes:

postgres(1:2179134)=# copy test from '/tmp/dump.data' with
(log_verbosity verbose );
COPY 0

postgres(1:2179134)=# copy test from '/tmp/dump.data' with
(log_verbosity 'verbose' );
COPY 0

This could confuse users. It happens because DEFAULT is a reserved
keyword, and the COPY grammar doesn't accept it as an option value.

I think that there are two options to handle it:

1. change the COPY grammar to accept DEFAULT as an option value.
2. change the tab-completion to complete 'DEFAULT' instead of DEFAULT,
and update the doc too.

As for the documentation, we can add single-quotes as follows:

 ENCODING 'encoding_name'
+LOG_VERBOSITY [ 'mode' ]

I thought option (2) was better, but there seems to be no precedent for
completing a single-quoted string other than file names. So option (1)
could be clearer.

What do you think?

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Add new error_action COPY ON_ERROR "log"

2024-03-26 Thread Masahiko Sawada
On Tue, Mar 26, 2024 at 3:04 PM Bharath Rupireddy
 wrote:
>
> On Tue, Mar 26, 2024 at 9:56 AM Masahiko Sawada  wrote:
> >
> > > > errmsg("data type incompatibility at line %llu for column %s: \"%s\"",
> >
> > > > I guess it would be better to make the log message clearer to convey
> > > > what we did for the malformed row. For example, how about something
> > > > like "skipping row due to data type incompatibility at line %llu for
> > > > column %s: \"s\""?
> > >
> > > The summary message which gets printed at the end says that "NOTICE:
> > > 6 rows were skipped due to data type incompatibility". Isn't this
> > > enough? If someone is using ON_ERROR 'ignore', it's quite natural that
> > > such rows get skipped softly and the summary message can help them,
> > > no?
> >
> > I think that in the main log message we should mention what happened
> > (or is happening) or what we did (or are doing). If the message "data
> > type incompatibility ..." was in the DETAIL message with the main
> > message saying something like "skipping row at line %llu for column
> > %s: ...", it would make sense to me. But the current message seems not
> > to be clear to me and consistent with other NOTICE messages. Also, the
> > last summary line would not be written if the user cancelled, and
> > someone other than person who used ON_ERROR 'ignore' might check the
> > server logs later.
>
> Agree. I changed the NOTICE message to what you've suggested. Thanks.
>

Thank you for updating the patch! Looks good to me.

Please find the attached patch. I've made some changes for the
documentation and the commit message. I'll push it, barring any
objections.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


v12-0001-Add-new-COPY-option-LOG_VERBOSITY.patch
Description: Binary data


Re: Improve eviction algorithm in ReorderBuffer

2024-03-25 Thread Masahiko Sawada
On Thu, Mar 14, 2024 at 12:02 PM Masahiko Sawada  wrote:
>
>
> I've attached new version patches.

Since the previous patch conflicts with the current HEAD, I've
attached the rebased patches.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


v10-0001-Make-binaryheap-enlargeable.patch
Description: Binary data


v10-0003-Improve-eviction-algorithm-in-Reorderbuffer-usin.patch
Description: Binary data


v10-0002-Add-functions-to-binaryheap-for-efficient-key-re.patch
Description: Binary data


Re: Add new error_action COPY ON_ERROR "log"

2024-03-25 Thread Masahiko Sawada
On Tue, Mar 26, 2024 at 12:23 PM Bharath Rupireddy
 wrote:
>
> On Tue, Mar 26, 2024 at 7:16 AM Masahiko Sawada  wrote:
> >
> > > Please see the attached v9 patch set.
> >
> > Thank you for updating the patch! The patch mostly looks good to me.
> > Here are some minor comments:
>
> Thanks for looking into this.
>
> > ---
> >  /* non-export function prototypes */
> > -static char *limit_printout_length(const char *str);
> > -
> > static void ClosePipeFromProgram(CopyFromState cstate);
> >
> > Now that we have only one function we should replace "prototypes" with
> > "prototype".
>
> Well, no. We might add a few more (you never know). A quick look around
> the GUCs under /* GUCs */ tells me that the plural form is used there
> even when just one GUC is defined (xlogprefetcher.c, for instance).

Understood.

>
> > ---
> > +ereport(NOTICE,
> > +        errmsg("data type incompatibility at line %llu for column %s: \"%s\"",
> > +               (unsigned long long) cstate->cur_lineno,
> > +               cstate->cur_attname,
> > +               attval));
> >
> > I guess it would be better to make the log message clearer to convey
> > what we did for the malformed row. For example, how about something
> > like "skipping row due to data type incompatibility at line %llu for
> > column %s: \"s\""?
>
> The summary message which gets printed at the end says that "NOTICE:
> 6 rows were skipped due to data type incompatibility". Isn't this
> enough? If someone is using ON_ERROR 'ignore', it's quite natural that
> such rows get skipped softly and the summary message can help them,
> no?

I think that in the main log message we should mention what happened
(or is happening) or what we did (or are doing). If the message "data
type incompatibility ..." was in the DETAIL message with the main
message saying something like "skipping row at line %llu for column
%s: ...", it would make sense to me. But the current message seems not
to be clear to me and consistent with other NOTICE messages. Also, the
last summary line would not be written if the user cancelled, and
someone other than person who used ON_ERROR 'ignore' might check the
server logs later.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Add new error_action COPY ON_ERROR "log"

2024-03-25 Thread Masahiko Sawada
On Mon, Mar 25, 2024 at 8:21 PM Bharath Rupireddy
 wrote:
>
> On Mon, Mar 25, 2024 at 10:42 AM Masahiko Sawada  
> wrote:
> >
> > The current approach, eliminating the duplicated information in
> > CONTEXT, seems good to me.
>
> Thanks for looking into it.
>
> > One question about the latest (v8) patch:
> >
> > +   else
> > +   ereport(NOTICE,
> > +           errmsg("data type incompatibility at line %llu for column %s: null input",
> > +                  (unsigned long long) cstate->cur_lineno,
> > +                  cstate->cur_attname));
> > +
> >
> > How can we reach this path? It seems it isn't covered by the tests.
>
> Tests don't cover that part, but it can be hit with something like
> [1]. I've added a test for this.
>
> Note the use of domain to provide an indirect way of providing null
> constraint check. Otherwise, COPY FROM fails early in
> CopyFrom->ExecConstraints if the NOT NULL constraint is directly
> provided next to the column in the table [2].
>
> Please see the attached v9 patch set.
>

Thank you for updating the patch! The patch mostly looks good to me.
Here are some minor comments:

---
 /* non-export function prototypes */
-static char *limit_printout_length(const char *str);
-
static void ClosePipeFromProgram(CopyFromState cstate);

Now that we have only one function we should replace "prototypes" with
"prototype".

---
+ereport(NOTICE,
+        errmsg("data type incompatibility at line %llu for column %s: \"%s\"",
+               (unsigned long long) cstate->cur_lineno,
+               cstate->cur_attname,
+               attval));

I guess it would be better to make the log message clearer to convey
what we did for the malformed row. For example, how about something
like "skipping row due to data type incompatibility at line %llu for
column %s: \"s\""?
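With that wording, the client would see something like this (the line
number and values here are illustrative):

    NOTICE:  skipping row due to data type incompatibility at line 2 for column a: "foo"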

---
 extern void CopyFromErrorCallback(void *arg);
+extern char *limit_printout_length(const char *str);

I don't disagree with exposing the limit_printout_length() function
but I think it's better to rename it for consistency with other
exposed COPY command functions. Only this function is snake-case. How
about CopyLimitPrintoutLength() or similar?

FWIW, I'm going to merge the two patches before pushing.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-25 Thread Masahiko Sawada
On Mon, Mar 25, 2024 at 3:25 PM John Naylor  wrote:
>
> On Fri, Mar 22, 2024 at 12:20 PM Masahiko Sawada  
> wrote:
> >
> > On Thu, Mar 21, 2024 at 7:48 PM John Naylor  wrote:
>
> > > v77-0001
> > >
> > > - dead_items = (VacDeadItems *) 
> > > palloc(vac_max_items_to_alloc_size(max_items));
> > > - dead_items->max_items = max_items;
> > > - dead_items->num_items = 0;
> > > + vacrel->dead_items = TidStoreCreate(vac_work_mem, NULL, 0);
> > > +
> > > + dead_items_info = (VacDeadItemsInfo *) palloc(sizeof(VacDeadItemsInfo));
> > > + dead_items_info->max_bytes = vac_work_mem * 1024L;
> > >
> > > This is confusing enough that it looks like a bug:
> > >
> > > [inside TidStoreCreate()]
> > > /* choose the maxBlockSize to be no larger than 1/16 of max_bytes */
> > > while (16 * maxBlockSize > max_bytes * 1024L)
> > > maxBlockSize >>= 1;
> > >
> > > This was copied from CreateWorkExprContext, which operates directly on
> > > work_mem -- if the parameter is actually bytes, we can't "* 1024"
> > > here. If we're passing something measured in kilobytes, the parameter
> > > is badly named. Let's use convert once and use bytes everywhere.
> >
> > True. The attached 0001 patch fixes it.
>
> v78-0001 and 02 are fine, but for 0003 there is a consequence that I
> didn't see mentioned:

I think that the fix done in the 0001 patch can be merged into the 0003 patch.

>  vac_work_mem now refers to bytes, where before
> it referred to kilobytes. It seems pretty confusing to use a different
> convention from elsewhere, especially if it has the same name but
> different meaning across versions. Worse, this change is buried inside
> a moving-stuff-around diff, making it hard to see. Maybe "convert only
> once" is still possible, but I was actually thinking of
>
> + dead_items_info->max_bytes = vac_work_mem * 1024L;
> + vacrel->dead_items = TidStoreCreate(dead_items_info->max_bytes, NULL, 0);
>
> That way it's pretty obvious that it's correct. That may require a bit
> of duplication and moving around for shmem, but there is some of that
> already.

Agreed.

>
> More on 0003:
>
> - * The major space usage for vacuuming is storage for the array of dead TIDs
> + * The major space usage for vacuuming is TidStore, a storage for dead TIDs
>
> + * autovacuum_work_mem) memory space to keep track of dead TIDs.  If the
> + * TidStore is full, we must call lazy_vacuum to vacuum indexes (and to 
> vacuum
>
> I wonder if the comments here should refer to it using a more natural
> spelling, like "TID store".
>
> - * items in the dead_items array for later vacuuming, count live and
> + * items in the dead_items for later vacuuming, count live and
>
> Maybe "the dead_items area", or "the dead_items store" or "in dead_items"?
>
> - * remaining LP_DEAD line pointers on the page in the dead_items
> - * array. These dead items include those pruned by lazy_scan_prune()
> - * as well we line pointers previously marked LP_DEAD.
> + * remaining LP_DEAD line pointers on the page in the dead_items.
> + * These dead items include those pruned by lazy_scan_prune() as well
> + * we line pointers previously marked LP_DEAD.
>
> Here maybe "into dead_items".
>
> Also, "we line pointers" seems to be a pre-existing typo.
>
> - (errmsg("table \"%s\": removed %lld dead item identifiers in %u pages",
> - vacrel->relname, (long long) index, vacuumed_pages)));
> + (errmsg("table \"%s\": removed " INT64_FORMAT "dead item identifiers
> in %u pages",
> + vacrel->relname, vacrel->dead_items_info->num_items, vacuumed_pages)));
>
> This is a translated message, so let's keep the message the same.
>
> /*
>  * Allocate dead_items (either using palloc, or in dynamic shared memory).
>  * Sets dead_items in vacrel for caller.
>  *
>  * Also handles parallel initialization as part of allocating dead_items in
>  * DSM when required.
>  */
> static void
> dead_items_alloc(LVRelState *vacrel, int nworkers)
>
> This comment didn't change at all. It's not wrong, but let's consider
> updating the specifics.

Fixed above comments.

> v78-0005:
>
> "Although commit XXX
> allowed specifying the initial and maximum DSA segment sizes, callers
> still needed to clamp their own limits, which was not consistent and
> user-friendly."
>
> Perhaps s/still needed/would have needed/ ..., since we're preventing
> that necessity.
>
> > > Did you try it with 1MB m_w_m?
&

Re: Add new error_action COPY ON_ERROR "log"

2024-03-24 Thread Masahiko Sawada
On Wed, Mar 13, 2024 at 11:02 PM Bharath Rupireddy
 wrote:
>
> On Wed, Mar 13, 2024 at 11:09 AM Michael Paquier  wrote:
> >
> > Hmm.  This NOTICE is really bugging me.  It is true that the clients
> > would get more information, but the information is duplicated on the
> > server side because the error context provides the same information as
> > the NOTICE itself:
> > NOTICE:  data type incompatibility at line 1 for column "a"
> > CONTEXT:  COPY aa, line 1, column a: "a"
> > STATEMENT:  copy aa from stdin with (on_error ignore, log_verbosity 
> > verbose);
>
> Yes, if wanted, clients can also get the CONTEXT - for instance, using
> '\set SHOW_CONTEXT always' in psql.
>
> I think we can enhance the NOTICE message to include the column value
> (just like CONTEXT message is showing) and leverage relname_only to
> emit only the relation name in the CONTEXT message.
>
> /*
>  * We suppress error context information other than the relation name,
>  * if one of the operations below fails.
>  */
> Assert(!cstate->relname_only);
> cstate->relname_only = true;
>
> I'm attaching the v8 patch set implementing the above idea. With this,
> [1] is sent to the client, [2] is sent to the server log. This
> approach not only reduces the duplicate info in the NOTICE and CONTEXT
> messages, but also makes it easy for users to get all the necessary
> info in the NOTICE message without having to set extra parameters to
> get CONTEXT message.
>
> Another idea is to move even the table name to NOTICE message and hide
> the context with errhidecontext when we emit the new NOTICE messages.
>
> Thoughts?
>

The current approach, eliminating the duplicated information in
CONTEXT, seems good to me.

One question about the latest (v8) patch:

+   else
+       ereport(NOTICE,
+               errmsg("data type incompatibility at line %llu for column %s: null input",
+                      (unsigned long long) cstate->cur_lineno,
+                      cstate->cur_attname));
+

How can we reach this path? It seems it isn't covered by the tests.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-24 Thread Masahiko Sawada
On Mon, Mar 25, 2024 at 10:13 AM Tom Lane  wrote:
>
> Masahiko Sawada  writes:
> > On Mon, Mar 25, 2024 at 1:53 AM Tom Lane  wrote:
> >> I think the point here is that if you start with an arbitrary
> >> non-negative shift value, the preceding loop may in fact decrement it
> >> down to something less than zero before exiting, in which case we
> >> would indeed have trouble.  I suspect that the code is making
> >> undocumented assumptions about the possible initial values of shift.
> >> Maybe some Asserts would be good?  Also, if we're effectively assuming
> >> that shift must be exactly zero here, why not let the compiler
> >> hard-code that?
>
> > Sounds like a good solution. I've attached the patch for that.
>
> Personally I'd put the Assert immediately after the loop, because
> it's not related to the "Reserve slot for the value" comment.
> Seems reasonable otherwise.
>

Thanks. Pushed the fix after moving the Assert.


Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-24 Thread Masahiko Sawada
On Mon, Mar 25, 2024 at 1:53 AM Tom Lane  wrote:
>
> John Naylor  writes:
> > Done. I pushed this with a few last-minute cosmetic adjustments. This
> > has been a very long time coming, but we're finally in the home
> > stretch!

Thank you for the report.

>
> I'm not sure why it took a couple weeks for Coverity to notice
> ee1b30f12, but it saw it today, and it's not happy:

Hmm, I've also run Coverity Scan during development, but I wasn't able
to see this one for some reason...

>
> /srv/coverity/git/pgsql-git/postgresql/src/include/lib/radixtree.h: 1621 in local_ts_extend_down()
> 1615            node = child;
> 1616            shift -= RT_SPAN;
> 1617    }
> 1618
> 1619    /* Reserve slot for the value. */
> 1620    n4 = (RT_NODE_4 *) node.local;
> >>> CID 1594658:  Integer handling issues  (BAD_SHIFT)
> >>> In expression "key >> shift", shifting by a negative amount has
> >>> undefined behavior.  The shift amount, "shift", is as little as -7.
> 1621    n4->chunks[0] = RT_GET_KEY_CHUNK(key, shift);
> 1622    n4->base.count = 1;
> 1623
> 1624    return &n4->children[0];
> 1625 }
> 1626
>
> I think the point here is that if you start with an arbitrary
> non-negative shift value, the preceding loop may in fact decrement it
> down to something less than zero before exiting, in which case we
> would indeed have trouble.  I suspect that the code is making
> undocumented assumptions about the possible initial values of shift.
> Maybe some Asserts would be good?  Also, if we're effectively assuming
> that shift must be exactly zero here, why not let the compiler
> hard-code that?
>
> -   n4->chunks[0] = RT_GET_KEY_CHUNK(key, shift);
> +   n4->chunks[0] = RT_GET_KEY_CHUNK(key, 0);

Sounds like a good solution. I've attached the patch for that.
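For the record, the attached fix is essentially the following (a sketch;
per the follow-up upthread, the committed version placed the Assert
immediately after the preceding loop):

    /* Reserve slot for the value. */
    n4 = (RT_NODE_4 *) node.local;
    Assert(shift == 0);
    n4->chunks[0] = RT_GET_KEY_CHUNK(key, 0);
    n4->base.count = 1;

    return &n4->children[0];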

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


fix_radixtree.patch
Description: Binary data


Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-21 Thread Masahiko Sawada
On Thu, Mar 21, 2024 at 7:48 PM John Naylor  wrote:
>
> On Thu, Mar 21, 2024 at 4:03 PM Masahiko Sawada  wrote:
> >
> > I've looked into this idea further. Overall, it looks clean and I
> > don't see any problem so far in terms of integration with lazy vacuum.
> > I've attached three patches for discussion and tests.
>
> Seems okay in the big picture, it's the details we need to be careful of.
>
> v77-0001
>
> - dead_items = (VacDeadItems *) 
> palloc(vac_max_items_to_alloc_size(max_items));
> - dead_items->max_items = max_items;
> - dead_items->num_items = 0;
> + vacrel->dead_items = TidStoreCreate(vac_work_mem, NULL, 0);
> +
> + dead_items_info = (VacDeadItemsInfo *) palloc(sizeof(VacDeadItemsInfo));
> + dead_items_info->max_bytes = vac_work_mem * 1024L;
>
> This is confusing enough that it looks like a bug:
>
> [inside TidStoreCreate()]
> /* choose the maxBlockSize to be no larger than 1/16 of max_bytes */
> while (16 * maxBlockSize > max_bytes * 1024L)
> maxBlockSize >>= 1;
>
> This was copied from CreateWorkExprContext, which operates directly on
> work_mem -- if the parameter is actually bytes, we can't "* 1024"
> here. If we're passing something measured in kilobytes, the parameter
> is badly named. Let's use convert once and use bytes everywhere.

True. The attached 0001 patch fixes it.

>
> v77-0002:
>
> +#define dsa_create(tranch_id) \
> + dsa_create_ext(tranch_id, DSA_INITIAL_SEGMENT_SIZE, DSA_MAX_SEGMENT_SIZE)
>
> Since these macros are now referring to defaults, maybe their name
> should reflect that. Something like DSA_DEFAULT_INIT_SEGMENT_SIZE
> (*_MAX_*)

It makes sense to rename DSA_INITIAL_SEGMENT_SIZE, but since
DSA_MAX_SEGMENT_SIZE is the theoretical maximum size, the current name
also makes sense to me.

>
> +/* The minimum size of a DSM segment. */
> +#define DSA_MIN_SEGMENT_SIZE ((size_t) 1024)
>
> That's a *lot* smaller than it is now. Maybe 256kB? We just want 1MB
> m_w_m to work correctly.

Fixed.

>
> v77-0003:
>
> +/* Public APIs to create local or shared TidStore */
> +
> +TidStore *
> +TidStoreCreateLocal(size_t max_bytes)
> +{
> + return tidstore_create_internal(max_bytes, false, 0);
> +}
> +
> +TidStore *
> +TidStoreCreateShared(size_t max_bytes, int tranche_id)
> +{
> + return tidstore_create_internal(max_bytes, true, tranche_id);
> +}
>
> I don't think these operations have enough in common to justify
> sharing even an internal implementation. Choosing aset block size is
> done for both memory types, but it's pointless to do it for shared
> memory, because the local context is then only used for small
> metadata.
>
> + /*
> + * Choose the DSA initial and max segment sizes to be no longer than
> + * 1/16 and 1/8 of max_bytes, respectively.
> + */
>
> I'm guessing the 1/8 here because the number of segments is limited? I
> know these numbers are somewhat arbitrary, but readers will wonder why
> one has 1/8 and the other has 1/16.
>
> + if (dsa_init_size < DSA_MIN_SEGMENT_SIZE)
> + dsa_init_size = DSA_MIN_SEGMENT_SIZE;
> + if (dsa_max_size < DSA_MAX_SEGMENT_SIZE)
> + dsa_max_size = DSA_MAX_SEGMENT_SIZE;
>
> The second clamp seems against the whole point of this patch -- it
> seems they should all be clamped bigger than the DSA_MIN_SEGMENT_SIZE?
> Did you try it with 1MB m_w_m?

I've incorporated the above comments and test results look good to me.

I've attached several patches:

- 0002 is a minor fix for tidstore that I found.
- 0005 changes the create APIs of tidstore.
- 0006 updates the vacuum improvement patch to use the new
TidStoreCreateLocal/Shared() APIs.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


v78-0005-Rethink-create-and-attach-APIs-of-shared-TidStor.patch
Description: Binary data


v78-0004-Allow-specifying-initial-and-maximum-segment-siz.patch
Description: Binary data


v78-0003-Use-TidStore-for-dead-tuple-TIDs-storage-during-.patch
Description: Binary data


v78-0006-Adjust-the-vacuum-improvement-patch-to-new-TidSt.patch
Description: Binary data


v78-0002-Fix-an-inconsistent-function-prototype-with-the-.patch
Description: Binary data


v78-0001-Fix-a-calculation-in-TidStoreCreate.patch
Description: Binary data


Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-21 Thread Masahiko Sawada
On Thu, Mar 21, 2024 at 4:35 PM John Naylor  wrote:
>
> On Thu, Mar 21, 2024 at 1:11 PM Masahiko Sawada  wrote:
>
> > Or we can have a new function for dsa.c to set the initial and max
> > segment size (or either one) to the existing DSA area so that
> > TidStoreCreate() can specify them at creation.
>
> I didn't like this very much, because it's splitting an operation
> across an API boundary. The caller already has all the information it
> needs when it creates the DSA. Straw man proposal: it could do the
> same for local memory, then they'd be more similar. But if we made
> local contexts the responsibility of the caller, that would cause
> duplication between creating and resetting.

Fair point.

>
> > In shared TidStore
> > cases, since all memory required by shared radix tree is allocated in
> > the passed-in DSA area and the memory usage is the total segment size
> > allocated in the DSA area
>
> ...plus apparently some overhead, I just found out today, but that's
> beside the point.
>
> On Thu, Mar 21, 2024 at 2:02 PM Masahiko Sawada  wrote:
> >
> > Yet another idea is that TidStore creates its own DSA area in
> > TidStoreCreate(). That is, in TidStoreCreate() we create a DSA area
> > (using dsa_create()) and pass it to RT_CREATE(). Also, we need a new
> > API to get the DSA area. The caller (e.g. parallel vacuum) gets the
> > dsa_handle of the DSA area and stores it in shared memory (e.g. in
> > PVShared). TidStoreAttach() will take two arguments: a dsa_handle for
> > the DSA area and a dsa_pointer for the shared radix tree. This idea
> > still requires controlling the min/max segment sizes, since dsa_create()
> > uses 1MB as the initial segment size. But TidStoreCreate()
> > would be more user-friendly.
>
> This seems like an overall simplification, aside from future size
> configuration, so +1 to continue looking into this. If we go this
> route, I'd like to avoid a boolean parameter and cleanly separate
> TidStoreCreateLocal() and TidStoreCreateShared(). Every operation
> after that can introspect, but it's a bit awkward to force these cases
> into the same function. It always was a little bit, but this change
> makes it more so.

I've looked into this idea further. Overall, it looks clean and I
don't see any problem so far in terms of integration with lazy vacuum.
I've attached three patches for discussion and tests.

- 0001 patch makes lazy vacuum use of tidstore.
- 0002 patch makes DSA init/max segment size configurable (borrowed
from another thread).
- 0003 patch makes TidStore create its own DSA area with init/max DSA
segment adjustment (PoC patch).

One thing unclear to me is whether this idea will be usable even when we
want to use the tidstore for parallel bitmap scan. Currently, we
create a shared tidbitmap on a DSA area in ParallelExecutorInfo. This
DSA area is used not only for tidbitmap but also for parallel hash
etc. If the tidstore created its own DSA area, parallel bitmap scan
would have to use the tidstore's DSA in addition to the DSA area in
ParallelExecutorInfo. I'm not sure if there are any differences
between these usages in terms of the resource manager etc. It seems to
be no problem, but I might be missing something.

Regards,
-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


v77-0003-PoC-Make-shared-TidStore-create-its-own-DSA-area.patch
Description: Binary data


v77-0002-Make-DSA-initial-and-maximum-segment-size-config.patch
Description: Binary data


v77-0001-Use-TidStore-for-dead-tuple-TIDs-storage-during-.patch
Description: Binary data


Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-21 Thread Masahiko Sawada
On Thu, Mar 21, 2024 at 3:10 PM Masahiko Sawada  wrote:
>
> On Thu, Mar 21, 2024 at 12:40 PM John Naylor  wrote:
> >
> > On Thu, Mar 21, 2024 at 9:37 AM Masahiko Sawada  
> > wrote:
> > >
> > > On Wed, Mar 20, 2024 at 11:19 PM John Naylor  
> > > wrote:
> > > > Are they (the blocks to be precise) really out of order? The VALUES
> > > > statement is ordered, but after inserting it does not output that way.
> > > > I wondered if this is platform independent, but CI and our dev
> > > > machines haven't failed this test, and I haven't looked into what
> > > > determines the order. It's easy enough to hide the blocks if we ever
> > > > need to, as we do elsewhere...
> > >
> > > It seems unnecessary, as such a test is already covered by
> > > test_radixtree. I've changed the query to hide the output blocks.
> >
> > Okay.
> >
> > > The buildfarm has been all-green so far.
> >
> > Great!
> >
> > > I've attached the latest vacuum improvement patch.
> > >
> > > I just remembered that the tidstore still cannot be used for parallel
> > > vacuum with the minimum maintenance_work_mem. Even when the shared
> > > tidstore is empty, its memory usage reports 1056768 bytes, a bit above
> > > 1MB (1048576 bytes). We need something discussed on another thread[1]
> > > in order to make it work.
> >
> > For exactly this reason, we used to have a clamp on max_bytes when it
> > was internal to tidstore, so that it never reported full when first
> > created, so I guess that got thrown away when we got rid of the
> > control object in shared memory. Forcing callers to clamp their own
> > limits seems pretty unfriendly, though.
>
> Or we can have a new function for dsa.c to set the initial and max
> segment size (or either one) to the existing DSA area so that
> TidStoreCreate() can specify them at creation. In shared TidStore
> cases, since all memory required by shared radix tree is allocated in
> the passed-in DSA area and the memory usage is the total segment size
> allocated in the DSA area, the user will have to prepare a DSA area
> only for the shared tidstore. So we might be able to expect that the
> DSA area passed in to TidStoreCreate() is empty and its segment sizes can
> be adjusted.

Yet another idea is that TidStore creates its own DSA area in
TidStoreCreate(). That is, in TidStoreCreate() we create a DSA area
(using dsa_create()) and pass it to RT_CREATE(). Also, we need a new
API to get the DSA area. The caller (e.g. parallel vacuum) gets the
dsa_handle of the DSA area and stores it in shared memory (e.g. in
PVShared). TidStoreAttach() will take two arguments: a dsa_handle for
the DSA area and a dsa_pointer for the shared radix tree. This idea
still requires controlling the min/max segment sizes, since dsa_create()
uses 1MB as the initial segment size. But TidStoreCreate()
would be more user-friendly.

I've attached a PoC patch for discussion.
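Roughly, the shape of the PoC is as follows (the names here are
assumptions for illustration; see the attached patch for the actual
code):

    /* TidStoreCreate() creates and owns its DSA area */
    ts->area = dsa_create(tranche_id);
    ts->tree.shared = shared_ts_create(ts->rt_context, ts->area, tranche_id);

    /* the caller (e.g. parallel vacuum) publishes both handles in shared memory */
    shared->dead_items_dsa_handle = dsa_get_handle(TidStoreGetDSA(ts));
    shared->dead_items_handle = TidStoreGetHandle(ts);

    /* and workers attach using the two handles */
    TidStore *ts2 = TidStoreAttach(shared->dead_items_dsa_handle,
                                   shared->dead_items_handle);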

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


tidstore_creates_dsa.patch.nocfbot
Description: Binary data


Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-21 Thread Masahiko Sawada
On Thu, Mar 21, 2024 at 12:40 PM John Naylor  wrote:
>
> On Thu, Mar 21, 2024 at 9:37 AM Masahiko Sawada  wrote:
> >
> > On Wed, Mar 20, 2024 at 11:19 PM John Naylor  
> > wrote:
> > > Are they (the blocks to be precise) really out of order? The VALUES
> > > statement is ordered, but after inserting it does not output that way.
> > > I wondered if this is platform independent, but CI and our dev
> > > machines haven't failed this test, and I haven't looked into what
> > > determines the order. It's easy enough to hide the blocks if we ever
> > > need to, as we do elsewhere...
> >
> > It seems unnecessary, as such a test is already covered by
> > test_radixtree. I've changed the query to hide the output blocks.
>
> Okay.
>
> > The buildfarm has been all-green so far.
>
> Great!
>
> > I've attached the latest vacuum improvement patch.
> >
> > I just remembered that the tidstore still cannot be used for parallel
> > vacuum with the minimum maintenance_work_mem. Even when the shared
> > tidstore is empty, its memory usage reports 1056768 bytes, a bit above
> > 1MB (1048576 bytes). We need something discussed on another thread[1]
> > in order to make it work.
>
> For exactly this reason, we used to have a clamp on max_bytes when it
> was internal to tidstore, so that it never reported full when first
> created, so I guess that got thrown away when we got rid of the
> control object in shared memory. Forcing callers to clamp their own
> limits seems pretty unfriendly, though.

Or we can have a new function for dsa.c to set the initial and max
segment size (or either one) to the existing DSA area so that
TidStoreCreate() can specify them at creation. In shared TidStore
cases, since all memory required by shared radix tree is allocated in
the passed-in DSA area and the memory usage is the total segment size
allocated in the DSA area, the user will have to prepare a DSA area
only for the shared tidstore. So we might be able to expect that the
DSA area passed in to TidStoreCreate() is empty and its segment sizes can
be adjusted.

>
> The proposals in that thread are pretty simple. If those don't move
> forward soon, a hackish workaround would be to round down the number
> we get from dsa_get_total_size to the nearest megabyte. Then
> controlling min/max segment size would be a nice-to-have for PG17, not
> a prerequisite.

Interesting idea.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-20 Thread Masahiko Sawada
On Wed, Mar 20, 2024 at 11:19 PM John Naylor  wrote:
>
> On Wed, Mar 20, 2024 at 8:30 PM Masahiko Sawada  wrote:
> > I forgot to report the results. Yes, I did some tests where I inserted
> > many TIDs to make the tidstore use several GB of memory. I tested two cases:
> >
> > 1. insert 100M blocks of TIDs with an offset of 100.
> > 2. insert 10M blocks of TIDs with an offset of 2048.
> >
> > The tidstore used about 4.8GB and 5.2GB, respectively, and all lookup
> > and iteration results were as expected.
>
> Thanks for confirming!
>
> > While reviewing the codes again, the following two things caught my eyes:
> >
> > in check_set_block_offset() function, we don't take a lock on the
> > tidstore while checking all possible TIDs. I'll add
> > TidStoreLockShare() and TidStoreUnlock() as follows:
> >
> > +   TidStoreLockShare(tidstore);
> >     if (TidStoreIsMember(tidstore, &tid))
> >         ItemPointerSet(&lookup_tids[num_lookup_tids++],
> >                        blkno, offset);
> > +   TidStoreUnlock(tidstore);
>
> In one sense, all locking in the test module is useless since there is
> only a single process. On the other hand, it seems good to at least
> run what we have written to run it trivially, and serve as an example
> of usage. We should probably be consistent, and document at the top
> that the locks are pro-forma only.

Agreed.

>
> > Regarding TidStoreMemoryUsage(), IIUC the caller doesn't need to take
> > a lock on the shared tidstore since dsa_get_total_size() (called by
> > RT_MEMORY_USAGE()) does appropriate locking. I think we can mention it
> > in the comment as follows:
> >
> > -/* Return the memory usage of TidStore */
> > +/*
> > + * Return the memory usage of TidStore.
> > + *
> > + * In shared TidStore cases, since shared_ts_memory_usage() does 
> > appropriate
> > + * locking, the caller doesn't need to take a lock.
> > + */
> >
> > What do you think?
>
> That duplicates the underlying comment on the radix tree function that
> this calls, so I'm inclined to leave it out. At this level it's
> probably best to document when a caller _does_ need to take an action.

Okay, I didn't change it.

>
> One thing I forgot to ask about earlier:
>
> +-- Add tids in out of order.
>
> Are they (the blocks to be precise) really out of order? The VALUES
> statement is ordered, but after inserting it does not output that way.
> I wondered if this is platform independent, but CI and our dev
> machines haven't failed this test, and I haven't looked into what
> determines the order. It's easy enough to hide the blocks if we ever
> need to, as we do elsewhere...

It seems unnecessary, as such a test is already covered by
test_radixtree. I've changed the query to hide the output blocks.

I've pushed the tidstore patch after incorporating the above changes.
In addition to that, I've added the following changes before the push:

- Added src/test/modules/test_tidstore/.gitignore file.
- Removed unnecessary #include from tidstore.c.

The buildfarm has been all-green so far.

I've attached the latest vacuum improvement patch.

I just remembered that the tidstore still cannot be used for parallel
vacuum with the minimum maintenance_work_mem. Even when the shared
tidstore is empty, its memory usage reports 1056768 bytes, a bit above
1MB (1048576 bytes). We need something discussed on another thread[1]
in order to make it work.

Regards,

[1] 
https://www.postgresql.org/message-id/CAD21AoCVMw6DSmgZY9h%2BxfzKtzJeqWiwxaUD2T-FztVcV-XibQ%40mail.gmail.com

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


v76-0001-Use-TidStore-for-dead-tuple-TIDs-storage-during-.patch
Description: Binary data


Re: Have pg_basebackup write "dbname" in "primary_conninfo"?

2024-03-20 Thread Masahiko Sawada
 URIs
> case 11:
> ./pg_basebackup -D test10  -X s -P  -R -d "postgresql://localhost:5431"
> primary_conninfo will not have dbname
>
> case 12:
> ./pg_basebackup -D test10  -p 5431 -X s -P  -R -d
> "postgresql://localhost/db3:5431"
> primary_conninfo will have dbname=''db3:5431'' like below:
> primary_conninfo = 'user=vignesh passfile=''/home/vignesh/.pgpass''
> channel_binding=prefer host=localhost port=5431 sslmode=prefer
> sslcompression=0 sslcertmode=allow sslsni=1
> ssl_min_protocol_version=TLSv1.2 gssencmode=disable
> krbsrvname=postgres gssdelegation=0 target_session_attrs=any
> load_balance_hosts=disable dbname=''db3:5431'''
>
> case 13:
> ./pg_basebackup -D test10  -p 5431 -X s -P  -R -d "postgresql://localhost/db3"
> primary_conninfo will have dbname=db3
>
> case 14:
> ./pg_basebackup -D test10   -X s -P  -R -d "postgresql://localhost:5431/db3"
> primary_conninfo will have dbname=db3
>
> case 15:
> ./pg_basebackup -D test10   -X s -P  -R -d
> "postgresql://localhost:5431/db4,127.0.0.1:5431/db5"
> primary_conninfo will have dbname=''db4,127.0.0.1:5431/db5'' like below:
> primary_conninfo = 'user=vignesh passfile=''/home/vignesh/.pgpass''
> channel_binding=prefer host=localhost port=5431 sslmode=prefer
> sslcompression=0 sslcertmode=allow sslsni=1
> ssl_min_protocol_version=TLSv1.2 gssencmode=disable
> krbsrvname=postgres gssdelegation=0 target_session_attrs=any
> load_balance_hosts=disable dbname=''db4,127.0.0.1:5431/db5'''
>
> case 16:
> ./pg_basebackup -D test10   -X s -P  -R -d
> "postgresql://localhost:5431,127.0.0.1:5431/db5"
> primary_conninfo will have dbname=db5
>
> case 17:
> ./pg_basebackup -D test10   -X s -P  -R -d
> "postgresql:///db6?host=localhost=5431"
> primary_conninfo will have dbname=db6
>
> case 18:
>  ./pg_basebackup -D test10 -p 5431  -X s -P  -R -d
> "postgresql:///db7?host=/home/vignesh/postgres/inst/bin"
>  primary_conninfo will have dbname=db7
>
> case 19:
> ./pg_basebackup -D test10 -p 5431  -X s -P  -R -d
> "postgresql:///db8?host=%2Fhome%2Fvignesh%2Fpostgres%2Finst%2Fbin"
>  primary_conninfo will have dbname=db8
>
> In these cases, the database name specified will be written to the
> conf file. The test results look good to me.

Thank you for the tests! These results look good to me too.
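
For anyone who wants to reproduce how a given URI resolves, libpq's
PQconninfoParse() reports the parsed keywords directly. A minimal
sketch using only standard libpq API (the URI is case 13 above):

#include <stdio.h>
#include <string.h>
#include <libpq-fe.h>

int
main(void)
{
    char       *errmsg = NULL;
    PQconninfoOption *opts = PQconninfoParse("postgresql://localhost/db3",
                                             &errmsg);

    if (opts == NULL)
    {
        fprintf(stderr, "parse error: %s\n", errmsg);
        PQfreemem(errmsg);
        return 1;
    }

    for (PQconninfoOption *o = opts; o->keyword != NULL; o++)
    {
        if (strcmp(o->keyword, "dbname") == 0 && o->val != NULL)
            printf("dbname=%s\n", o->val);   /* prints "dbname=db3" */
    }

    PQconninfoFree(opts);
    return 0;
}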

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Have pg_basebackup write "dbname" in "primary_conninfo"?

2024-03-20 Thread Masahiko Sawada
On Tue, Mar 19, 2024 at 8:48 PM Hayato Kuroda (Fujitsu)
 wrote:
>
> Dear Sawada-san,
>
> Thanks for giving comments!
>
> > This behavior makes sense to me. But do we want to handle the case of
> > using environment variables too?
>
> Yeah, v5 does not consider which libpq parameters are specified by environment
> variables. Such a variable should be used when the dbname is not expressly 
> written
> in the connection string.
> Such a path was added in the v6 patch. If the dbname is not determined after
> parsing the connection string, we call PQconndefaults() to get settings from
> environment variables and service files [1], then start to search dbname 
> again.
> Below shows an example.

Thank you for updating the patch!

>
> ```
> PGPORT=5431 PGUSER=kuroda PGDATABASE=postgres pg_basebackup -D data_N2 -R -v
> ->
> primary_conninfo = 'user=kuroda ... port=5431 ... dbname=postgres ... '
> ```
>
> > IIUC,
> >
> > pg_basebackup -D tmp -d "user=masahiko dbname=test_db"
> >
> > is equivalent to:
> >
> > PGDATABASE="user=masahiko dbname=test_db" pg_basebackup -D tmp
>
> The case won't work. I think you assumed that expand_dbname as in
> PQconnectdbParams() [2] can be used for environment variables, but that
> is not correct
> - the value won't be parsed as a connection string again.
>
> In the libpq layer, connection parameters are parsed in 
> PQconnectStartParams()->conninfo_array_parse().
> When expand_dbname is specified, the entry "dbname" is firstly checked and
> parsed its value. They are done at fe-connect.c:5846.
>
> The environment variables are checked and parsed in conninfo_add_defaults(), 
> which
> is called from conninfo_array_parse(). However, it is done at 
> fe-connect.c:5956 - the
> expand_dbname has already been done at that time. This means there is no 
> chance
> that PGDATABASE is parsed as an expanded style.
>

Thank you for pointing it out. I tested the use of PGDATABASE with
pg_basebackup and somehow missed the fact you explained.
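
To restate the distinction in code: expand_dbname applies only to a
dbname entry passed in the keyword/value arrays of PQconnectdbParams();
a value arriving via the PGDATABASE environment variable is taken
literally. A sketch using only standard libpq API:

const char *keywords[] = {"dbname", NULL};
const char *values[]   = {"user=masahiko dbname=test_db", NULL};
PGconn     *conn;

/*
 * With expand_dbname = 1, the dbname value above is re-parsed as a
 * connection string, so both user and dbname take effect. Putting the
 * same string into PGDATABASE instead would make libpq treat it as a
 * literal (and bogus) database name.
 */
conn = PQconnectdbParams(keywords, values, /* expand_dbname */ 1);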

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-20 Thread Masahiko Sawada
On Wed, Mar 20, 2024 at 3:48 PM John Naylor  wrote:
>
> On Thu, Mar 14, 2024 at 12:06 PM Masahiko Sawada  
> wrote:
> >
> > On Thu, Mar 14, 2024 at 1:29 PM John Naylor  wrote:
> > > Locally (not CI), we should try big inputs to make sure we can
> > > actually go up to many GB -- it's easier and faster this way than
> > > having vacuum give us a large data set.
> >
> > I'll do these tests.
>
> I just remembered this -- did any of this kind of testing happen? I
> can do it as well.

I forgot to report the results. Yes, I did some tests where I inserted
many TIDs to make the tidstore use several GB of memory. I did two cases:

1. insert TIDs for 100M blocks, with 100 offsets per block.
2. insert TIDs for 10M blocks, with 2048 offsets per block.

The tidstore used about 4.8GB and 5.2GB, respectively, and all lookup
and iteration results were as expected.
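
For reference, the insertion loop for such a test is small. A rough
sketch of case 1, assuming the TidStoreSetBlockOffsets() API from this
patch series (offsets must be unique and ascending):

OffsetNumber offsets[100];

for (int i = 0; i < 100; i++)
    offsets[i] = i + 1;         /* sorted, as the API requires */

for (BlockNumber blkno = 0; blkno < 100 * 1000 * 1000; blkno++)
    TidStoreSetBlockOffsets(ts, blkno, offsets, 100);

/* then verify with TidStoreIsMember() and a full iteration pass */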

>
> > Thank you. I've incorporated all the comments above. I've attached the
> > latest patches, and am going to push them (one by one) after
> > self-review again.
>
> One more cosmetic thing in 0001 that caught my eye:
>
> diff --git a/src/backend/access/common/Makefile
> b/src/backend/access/common/Makefile
> index b9aff0ccfd..67b8cc6108 100644
> --- a/src/backend/access/common/Makefile
> +++ b/src/backend/access/common/Makefile
> @@ -27,6 +27,7 @@ OBJS = \
>   syncscan.o \
>   toast_compression.o \
>   toast_internals.o \
> + tidstore.o \
>   tupconvert.o \
>   tupdesc.o
>
> diff --git a/src/backend/access/common/meson.build
> b/src/backend/access/common/meson.build
> index 725041a4ce..a02397855e 100644
> --- a/src/backend/access/common/meson.build
> +++ b/src/backend/access/common/meson.build
> @@ -15,6 +15,7 @@ backend_sources += files(
>'syncscan.c',
>'toast_compression.c',
>'toast_internals.c',
> +  'tidstore.c',
>'tupconvert.c',
>'tupdesc.c',
>  )
>
> These aren't in alphabetical order.

Good catch. I'll fix them before the push.

While reviewing the codes again, the following two things caught my eyes:

In the check_set_block_offset() function, we don't take a lock on the
tidstore while checking all possible TIDs. I'll add
TidStoreLockShare() and TidStoreUnlock() as follows:

+   TidStoreLockShare(tidstore);
if (TidStoreIsMember(tidstore, ))
ItemPointerSet(_tids[num_lookup_tids++],
blkno, offset);
+   TidStoreUnlock(tidstore);

---
Regarding TidStoreMemoryUsage(), IIUC the caller doesn't need to take
a lock on the shared tidstore since dsa_get_total_size() (called by
RT_MEMORY_USAGE()) does appropriate locking. I think we can mention it
in the comment as follows:

-/* Return the memory usage of TidStore */
+/*
+ * Return the memory usage of TidStore.
+ *
+ * In shared TidStore cases, since shared_ts_memory_usage() does appropriate
+ * locking, the caller doesn't need to take a lock.
+ */

What do you think?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-19 Thread Masahiko Sawada
On Tue, Mar 19, 2024 at 6:40 PM John Naylor  wrote:
>
> On Tue, Mar 19, 2024 at 10:24 AM Masahiko Sawada  
> wrote:
> >
> > On Tue, Mar 19, 2024 at 8:35 AM John Naylor  wrote:
> > >
> > > On Mon, Mar 18, 2024 at 11:12 AM Masahiko Sawada  
> > > wrote:
> > > >
> > > > On Sun, Mar 17, 2024 at 11:46 AM John Naylor  
> > > > wrote:
>
> > > It might also be worth reducing the number of blocks in the random
> > > test -- multiple runs will have different offsets anyway.
> >
> > Yes. Reducing the number of blocks from 1000 to 100 changes the
> > regression test time on my environment as follows:
> >
> > 1000 blocks : 516 ms
> > 100 blocks   : 228 ms
>
> Sounds good.
>
> > Removed some unnecessary variables in 0002 patch.
>
> Looks good.
>
> > So the MaxBlocktableEntrySize calculation would be as follows?
> >
> > #define MaxBlocktableEntrySize \
> > offsetof(BlocktableEntry, words) + \
> > (sizeof(bitmapword) * \
> > WORDS_PER_PAGE(Min(MaxOffsetNumber, \
> >BITS_PER_BITMAPWORD * PG_INT8_MAX - 1)))
> >
> > I've made this change in the 0003 patch.
>
> This is okay, but one side effect is that we have both an assert and
> an elog, for different limits. I think we'll need a separate #define
> to help. But for now, I don't want to hold up tidstore further with
> this because I believe almost everything else in v74 is in pretty good
> shape. I'll save this for later as a part of the optimization I
> proposed.
>
> Remaining things I noticed:
>
> +#define RT_PREFIX local_rt
> +#define RT_PREFIX shared_rt
>
> Prefixes for simplehash, for example, don't have "sh" -- maybe 
> "local/shared_ts"
>
> + /* MemoryContext where the radix tree uses */
>
> s/where/that/
>
> +/*
> + * Lock support functions.
> + *
> + * We can use the radix tree's lock for shared TidStore as the data we
> + * need to protect is only the shared radix tree.
> + */
> +void
> +TidStoreLockExclusive(TidStore *ts)
>
> Talking about multiple things, so maybe a blank line after the comment.
>
> With those, I think you can go ahead and squash all the tidstore
> patches except for 0003 and commit it.
>
> > While reviewing the vacuum patch, I realized that we always pass
> > LWTRANCHE_SHARED_TIDSTORE to RT_CREATE(), and the wait event related
> > to the tidstore is therefore always the same. I think it would be
> > better to make the caller of TidStoreCreate() specify the tranche_id
> > and pass it to RT_CREATE(). That way, the caller can specify their own
> > wait event for tidstore. The 0008 patch tried this idea. dshash.c does
> > the same idea.
>
> Sounds reasonable. I'll just note that src/include/storage/lwlock.h
> still has an entry for LWTRANCHE_SHARED_TIDSTORE.

Thank you. I've incorporated all the comments above. I've attached the
latest patches, and am going to push them (one by one) after
self-review again.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


v75-0001-Add-TIDStore-to-store-sets-of-TIDs-ItemPointerDa.patch
Description: Binary data


v75-0002-Use-TidStore-for-dead-tuple-TIDs-storage-during-.patch
Description: Binary data


Re: New Table Access Methods for Multi and Single Inserts

2024-03-18 Thread Masahiko Sawada
Hi,

On Fri, Mar 8, 2024 at 7:37 PM Bharath Rupireddy
 wrote:
>
> On Sat, Mar 2, 2024 at 12:02 PM Bharath Rupireddy
>  wrote:
> >
> > On Mon, Jan 29, 2024 at 5:16 PM Bharath Rupireddy
> >  wrote:
> > >
> > > > Please find the attached v9 patch set.
> >
> > I've had to rebase the patches due to commit 874d817, please find the
> > attached v11 patch set.
>
> Rebase needed. Please see the v12 patch set.
>

I've not reviewed the patches in depth yet, but I ran performance tests
for CREATE MATERIALIZED VIEW. The test scenario is:

-- setup
create unlogged table test (c int);
insert into test select generate_series(1, 1000);

-- run
create materialized view test_mv as select * from test;

Here are the results:

* HEAD
3775.221 ms
3744.039 ms
3723.228 ms

* v12 patch
6289.972 ms
5880.674 ms
7663.509 ms

I can see a performance regression, and the perf report says that most
CPU time was spent extending the ResourceOwner's array while copying
the buffer-heap tuple:

- 52.26% 0.18% postgres postgres [.] intorel_receive
52.08% intorel_receive
table_multi_insert_v2 (inlined)
- heap_multi_insert_v2
- 51.53% ExecCopySlot (inlined)
tts_buffer_heap_copyslot
tts_buffer_heap_store_tuple (inlined)
 - IncrBufferRefCount
 - ResourceOwnerEnlarge
 ResourceOwnerAddToHash (inlined)

Is there any reason why we copy a buffer-heap tuple to another
buffer-heap tuple? This results in incrementing the buffer refcount and
registering it with the ResourceOwner for every tuple. I guess the
destination tuple slot doesn't necessarily have to be a buffer-heap
slot, and we could use a VirtualTupleTableSlot instead, which would in
turn require copying the heap tuple. I might be missing something, but
it improved the performance at least in my environment. The change I
made was:

-   dstslot = table_slot_create(state->rel, NULL);
+   //dstslot = table_slot_create(state->rel, NULL);
+   dstslot = MakeTupleTableSlot(RelationGetDescr(state->rel),
+                                &TTSOpsVirtual);
+

And the execution times are:
1588.984 ms
1591.618 ms
1582.519 ms

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-18 Thread Masahiko Sawada
On Tue, Mar 19, 2024 at 8:35 AM John Naylor  wrote:
>
> On Mon, Mar 18, 2024 at 11:12 AM Masahiko Sawada  
> wrote:
> >
> > On Sun, Mar 17, 2024 at 11:46 AM John Naylor  
> > wrote:
>
> > > Random offsets is what I was thinking of (if made distinct and
> > > ordered), but even there the code is fairy trivial, so I don't have a
> > > strong feeling about it.
> >
> > Agreed.
>
> Looks good.
>
> A related thing I should mention is that the tests which look up all
> possible offsets are really expensive with the number of blocks we're
> using now (assert build):
>
> v70 0.33s
> v72 1.15s
> v73 1.32
>
> To trim that back, I think we should give up on using shared memory
> for the is-full test: We can cause aset to malloc a new block with a
> lot fewer entries. In the attached, this brings it back down to 0.43s.

Looks good. Agreed with this change.

> It might also be worth reducing the number of blocks in the random
> test -- multiple runs will have different offsets anyway.

Yes. Reducing the number of blocks from 1000 to 100 changes the
regression test time on my environment as follows:

1000 blocks : 516 ms
100 blocks   : 228 ms

>
> > > I think we can stop including the debug-tid-store patch for CI now.
> > > That would allow getting rid of some unnecessary variables.
> >
> > Agreed.
>
> Okay, all that remains here is to get rid of those variables (might be
> just one).

Removed some unnecessary variables in 0002 patch.

>
> > > + * Scan the TidStore and return a pointer to TidStoreIterResult that has 
> > > TIDs
> > > + * in one block. We return the block numbers in ascending order and the 
> > > offset
> > > + * numbers in each result is also sorted in ascending order.
> > > + */
> > > +TidStoreIterResult *
> > > +TidStoreIterateNext(TidStoreIter *iter)
> > >
> > > The wording is a bit awkward.
> >
> > Fixed.
>
> - * Scan the TidStore and return a pointer to TidStoreIterResult that has TIDs
> - * in one block. We return the block numbers in ascending order and the 
> offset
> - * numbers in each result is also sorted in ascending order.
> + * Scan the TidStore and return the TIDs of the next block. The returned 
> block
> + * numbers is sorted in ascending order, and the offset numbers in each 
> result
> + * is also sorted in ascending order.
>
> Better, but it's still not very clear. Maybe "The offsets in each
> iteration result are ordered, as are the block numbers over all
> iterations."

Thanks, fixed.

>
> > > +/* Extract TIDs from the given key-value pair */
> > > +static void
> > > +tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key,
> > > BlocktableEntry *page)
> > >
> > > This is a leftover from the old encoding scheme. This should really
> > > take a "BlockNumber blockno" not a "key", and the only call site
> > > should probably cast the uint64 to BlockNumber.
> >
> > Fixed.
>
> This part looks good. I didn't notice earlier, but this comment has a
> similar issue
>
> @@ -384,14 +391,15 @@ TidStoreIterateNext(TidStoreIter *iter)
>   return NULL;
>
>   /* Collect TIDs extracted from the key-value pair */
> - tidstore_iter_extract_tids(iter, key, page);
> + tidstore_iter_extract_tids(iter, (BlockNumber) key, page);
>
> ..."extracted" was once a separate operation. I think just removing
> that one word is enough to update it.

Fixed.

>
> Some other review on code comments:
>
> v73-0001:
>
> + /* Enlarge the TID array if necessary */
>
> It's "arrays" now.
>
> v73-0005:
>
> +-- Random TIDs test. We insert TIDs for 1000 blocks. Each block has
> +-- different randon 100 offset numbers each other.
>
> The numbers are obvious from the query. Maybe just mention that the
> offsets are randomized and must be unique and ordered.
>
> + * The caller is responsible for release any locks.
>
> "releasing"

Fixed.

>
> > > +typedef struct BlocktableEntry
> > > +{
> > > + uint16 nwords;
> > > + bitmapword words[FLEXIBLE_ARRAY_MEMBER];
> > > +} BlocktableEntry;
> > >
> > > In my WIP for runtime-embeddable offsets, nwords needs to be one byte.
>
> I should be more clear here: nwords fitting into one byte allows 3
> embedded offsets (1 on 32-bit platforms, which is good for testing at
> least). With uint16 nwords that reduces to 2 (none on 32-bit
> platforms). Further, after the current patch series is fully
> committed, I plan to split the embedded-offset patch into two parts:
> The first w

Re: Have pg_basebackup write "dbname" in "primary_conninfo"?

2024-03-18 Thread Masahiko Sawada
On Thu, Mar 14, 2024 at 11:46 PM vignesh C  wrote:
>
> On Thu, 14 Mar 2024 at 15:49, Amit Kapila  wrote:
> >
> > On Thu, Mar 14, 2024 at 1:45 PM Masahiko Sawada  
> > wrote:
> > >
> > > On Thu, Mar 14, 2024 at 2:27 PM Amit Kapila  
> > > wrote:
> > > >
> > > > On Thu, Mar 14, 2024 at 5:57 AM Masahiko Sawada  
> > > > wrote:
> > > > >
> > > > > This fact makes me think that the slotsync worker might be able to
> > > > > accept the primary_conninfo value even if there is no dbname in the
> > > > > value. That is, if there is no dbname in the primary_conninfo, it uses
> > > > > the username in accordance with the specs of the connection string.
> > > > > Currently, the slotsync worker connects to the local database first
> > > > > and then establishes the connection to the primary server. But if we
> > > > > can reverse the two steps, it can get the dbname that has actually
> > > > > been used to establish the remote connection and use it for the local
> > > > > connection too. That way, the primary_conninfo generated by
> > > > > pg_basebackup could work even without the patch. For example, if the
> > > > > OS user executing pg_basebackup is 'postgres', the slotsync worker
> > > > > would connect to the postgres database. Given the 'postgres' database
> > > > > is created by default and 'postgres' OS user is used in common, I
> > > > > guess it could cover many cases in practice actually.
> > > > >
> > > >
> > > > I think this is worth investigating but I suspect that in most cases
> > > > users will end up using a replication connection without specifying
> > > > the user name and we may not be able to give a meaningful error
> > > > message when slotsync worker won't be able to connect. The same will
> > > > be true even when the dbname same as the username would be used.
> > >
> > > What do you mean by not being able to give a meaningful error message?
> > >
> > > If the slotsync worker uses the user name as the dbname, and such a
> > > database doesn't exist, the error message the user will get is
> > > "database "test_user" does not exist". ISTM the same is true when the
> > > user specifies the wrong database in the primary_conninfo.
> > >
> >
> > Right, the exact error message as mentioned by Shveta will be:
> > ERROR:  could not connect to the primary server: connection to server
> > at "127.0.0.1", port 5433 failed: FATAL: database "bckp_user" does not
> > exist
> >
> > Now, without this idea, the ERROR message will be:
> >  ERROR:  slot synchronization requires dbname to be specified in
> > primary_conninfo
> >
> > I am not sure how much this matters but the second message sounds more 
> > useful.
> >
> > > >
> > > > > Having said that, even with (or without) the above change, we might
> > > > > want to change the pg_basebackup so that it writes the dbname to the
> > > > > primary_conninfo if -R option is specified. Since the database where
> > > > > the slotsync worker connects cannot be dropped while the slotsync
> > > > > worker is running, the user might want to change the database to
> > > > > connect, and it would be useful if they can do that using
> > > > > pg_basebackup instead of modifying the configuration file manually.
> > > > >
> > > > > While the current approach makes sense to me, I'm a bit concerned that
> > > > > we might end up having the pg_basebackup search the actual database
> > > > > name (e.g. 'dbname=template1') from the .pgpass file instead of
> > > > > 'dbname=replication'. As far as I tested on my environment, suppose
> > > > > that I execute:
> > > > >
> > > > > pg_basebackup -D tmp -d "dbname=testdb" -R
> > > > >
> > > > > The pg_basebackup established a replication connection but looked for
> > > > > the password of the 'testdb' database. This could be another
> > > > > inconvenience for the existing users who want to use the slot
> > > > > synchronization.
> > > > >
> > > >
> > > > This is true because it is internally using logical replication
> > > > connection (as we will set set replication=database).
> > >

Re: Skip collecting decoded changes of already-aborted transactions

2024-03-18 Thread Masahiko Sawada
On Fri, Mar 15, 2024 at 1:21 PM Ajin Cherian  wrote:
>
>
>
> On Fri, Mar 15, 2024 at 3:17 PM Masahiko Sawada  wrote:
>>
>>
>> I resumed working on this item. I've attached the new version patch.
>>
>> I rebased the patch to the current HEAD and updated comments and
>> commit messages. The patch is straightforward and I'm somewhat
>> satisfied with it, but I'm thinking of adding some tests for it.
>>
>> Regards,
>>
>> --
>> Masahiko Sawada
>> Amazon Web Services: https://aws.amazon.com
>
>
> I just had a look at the patch, the patch no longer applies because of a 
> removal of a header in a recent commit. Overall the patch looks fine, and I 
> didn't find any issues. Some cosmetic comments:

Thank you for your review comments.

> in ReorderBufferCheckTXNAbort()
> + /* Quick return if we've already knew the transaction status */
> + if (txn->aborted)
> + return true;
>
> knew/know

Maybe it should be "known"?

>
> /*
> + * If logical_replication_mode is "immediate", we don't check the
> + * transaction status so the caller always process this transaction.
> + */
> + if (debug_logical_replication_streaming == 
> DEBUG_LOGICAL_REP_STREAMING_IMMEDIATE)
> + return false;
>
> /process/processes
>

Fixed.

In addition to these changes, I've made some changes to the latest
patch. Here is the summary:

- Use txn_flags field to record the transaction status instead of two
'committed' and 'aborted' flags.
- Add regression tests.
- Update commit message.
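
To illustrate the txn_flags approach from the first item above, here is
a sketch with hypothetical bit values (the surrounding flag bits already
live in reorderbuffer.h; these two names and values are illustrative
only, not the committed ones):

#define RBTXN_IS_COMMITTED      0x0100
#define RBTXN_IS_ABORTED        0x0200

#define rbtxn_is_committed(txn) (((txn)->txn_flags & RBTXN_IS_COMMITTED) != 0)
#define rbtxn_is_aborted(txn)   (((txn)->txn_flags & RBTXN_IS_ABORTED) != 0)

/* in ReorderBufferCheckTXNAbort(): quick return if the status is known */
if (rbtxn_is_aborted(txn))
    return true;                /* already known to be aborted */
if (rbtxn_is_committed(txn))
    return false;               /* already known to be committed */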

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


v4-0001-Skip-logical-decoding-of-already-aborted-transact.patch
Description: Binary data


Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-17 Thread Masahiko Sawada
On Sun, Mar 17, 2024 at 11:46 AM John Naylor  wrote:
>
> On Fri, Mar 15, 2024 at 9:17 PM Masahiko Sawada  wrote:
> >
> > On Fri, Mar 15, 2024 at 4:36 PM John Naylor  wrote:
> > >
> > > On Thu, Mar 14, 2024 at 7:04 PM Masahiko Sawada  
> > > wrote:
>
> > > > Given TidStoreSetBlockOffsets() is designed to always set (i.e.
> > > > overwrite) the value, I think we should not expect that found is
> > > > always false.
> > >
> > > I find that a puzzling statement, since 1) it was designed for
> > > insert-only workloads, not actual overwrite IIRC and 2) the tests will
> > > now fail if the same block is set twice, since we just switched the
> > > tests to use a remnant of vacuum's old array. Having said that, I
> > > don't object to removing artificial barriers to using it for purposes
> > > not yet imagined, as long as test_tidstore.sql warns against that.
> >
> > I think that if it supports only insert-only workloads and expects each
> > block to be set only once, it should raise an error rather than an
> > assertion. It's odd to me that the function fails only in assert-enabled
> > builds even though it actually works fine in that case.
>
> After thinking some more, I think you're right -- it's too
> heavy-handed to throw an error/assert and a public function shouldn't
> make assumptions about the caller. It's probably just a matter of
> documenting the function (and it's lack of generality), and the tests
> (which are based on the thing we're replacing).

Removed 'found' in 0003 patch.

>
> > As for test_tidstore, you're right that the test code doesn't handle
> > the case where the same block is set twice. I think that there is no
> > problem in the fixed-TIDs tests, but we would need something for
> > random-TIDs tests so that we don't set the same block twice. I guess
> > it could be trivial since we can use SQL queries to generate TIDs. I'm
> > not sure how the random-TIDs tests would be like, but I think we can
> > use SELECT DISTINCT to eliminate the duplicates of block numbers to
> > use.
>
> Also, I don't think we need random blocks, since the radix tree tests
> excercise that heavily already.
>
> Random offsets is what I was thinking of (if made distinct and
> ordered), but even there the code is fairy trivial, so I don't have a
> strong feeling about it.

Agreed.

>
> > > Given the above two things, I think this function's comment needs
> > > stronger language about its limitations. Perhaps even mention that
> > > it's intended for, and optimized for, vacuum. You and I have long
> > > known that tidstore would need a separate, more complex, function to
> > > add or remove individual tids from existing entries, but it might be
> > > good to have that documented.
> >
> > Agreed.
>
> How about this:
>
>  /*
> - * Set the given TIDs on the blkno to TidStore.
> + * Create or replace an entry for the given block and array of offsets
>   *
> - * NB: the offset numbers in offsets must be sorted in ascending order.
> + * NB: This function is designed and optimized for vacuum's heap scanning
> + * phase, so has some limitations:
> + * - The offset numbers in "offsets" must be sorted in ascending order.
> + * - If the block number already exists, the entry will be replaced --
> + *   there is no way to add or remove offsets from an entry.
>   */
>  void
>  TidStoreSetBlockOffsets(TidStore *ts, BlockNumber blkno, OffsetNumber 
> *offsets,

Looks good.

>
> I think we can stop including the debug-tid-store patch for CI now.
> That would allow getting rid of some unnecessary variables.

Agreed.

>
> + * Prepare to iterate through a TidStore. Since the radix tree is locked 
> during
> + * the iteration, TidStoreEndIterate() needs to be called when finished.
>
> + * Concurrent updates during the iteration will be blocked when inserting a
> + * key-value to the radix tree.
>
> This is outdated. Locking is optional. The remaining real reason now
> is that TidStoreEndIterate needs to free memory. We probably need to
> say something about locking, too, but not this.

Fixed.

>
> + * Scan the TidStore and return a pointer to TidStoreIterResult that has TIDs
> + * in one block. We return the block numbers in ascending order and the 
> offset
> + * numbers in each result is also sorted in ascending order.
> + */
> +TidStoreIterResult *
> +TidStoreIterateNext(TidStoreIter *iter)
>
> The wording is a bit awkward.

Fixed.

>
> +/*
> + * Finish an iteration over TidStore. This needs to be called after finishing
> +

Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-15 Thread Masahiko Sawada
On Fri, Mar 15, 2024 at 4:36 PM John Naylor  wrote:
>
> On Thu, Mar 14, 2024 at 7:04 PM Masahiko Sawada  wrote:
> >
> > On Thu, Mar 14, 2024 at 6:55 PM John Naylor  wrote:
> > >
> > > On Thu, Mar 14, 2024 at 12:06 PM Masahiko Sawada  
> > > wrote:
> > > >
> > > > On Thu, Mar 14, 2024 at 1:29 PM John Naylor  
> > > > wrote:
> > > > > Okay, here's an another idea: Change test_lookup_tids() to be more
> > > > > general and put the validation down into C as well. First we save the
> > > > > blocks from do_set_block_offsets() into a table, then with all those
> > > > > blocks lookup a sufficiently-large range of possible offsets and save
> > > > > found values in another array. So the static items structure would
> > > > > have 3 arrays: inserts, successful lookups, and iteration (currently
> > > > > the iteration output is private to check_set_block_offsets(). Then
> > > > > sort as needed and check they are all the same.
> > > >
> > > > That's a promising idea. We can use the same mechanism for randomized
> > > > tests too. If you're going to work on this, I'll do other tests on my
> > > > environment in the meantime.
> > >
> > > Some progress on this in v72 -- I tried first without using SQL to
> > > save the blocks, just using the unique blocks from the verification
> > > array. It seems to work fine.
> >
> > Thanks!
>
> Seems I forgot the attachment last time...there's more stuff now
> anyway, based on discussion.

Thank you for updating the patches!

The idea of using three TID arrays for the lookup test and iteration
test looks good to me. I think we can add random-TIDs tests on top of
it.

>
> > > - Since there are now three arrays we should reduce max bytes to
> > > something smaller.
> >
> > Agreed.
>
> I went further than this, see below.
>
> > > - Further on that, I'm not sure if the "is full" test is telling us
> > > much. It seems we could make max bytes a static variable and set it to
> > > the size of the empty store. I'm guessing it wouldn't take much to add
> > > enough tids so that the contexts need to allocate some blocks, and
> > > then it would appear full and we can test that. I've made it so all
> > > arrays repalloc when needed, just in case.
> >
> > How about using work_mem as max_bytes instead of having it as a static
> > variable? In test_tidstore.sql we set work_mem before creating the
> > tidstore. It would make the tidstore more controllable by SQL queries.
>
> My complaint is that the "is full" test is trivial, and also strange
> in that max_bytes is used for two unrelated things:
>
> - the initial size of the verification arrays, which was always larger
> than necessary, and now there are three of them
> - the hint to TidStoreCreate to calculate its max block size / the
> threshold for being "full"
>
> To make the "is_full" test slightly less trivial, my idea is to save
> the empty store size and later add enough tids so that it has to
> allocate new blocks/DSA segments, which is not that many, and then it
> will appear full. I've done this and also separated the purpose of
> various sizes in v72-0009/10.

I see your point and the changes look good to me.

> Using actual work_mem seems a bit more difficult to make this work.

Agreed.

>
>
> > ---
> > +   if (TidStoreIsShared(ts))
> > +   found = shared_rt_set(ts->tree.shared, blkno, page);
> > +   else
> > +   found = local_rt_set(ts->tree.local, blkno, page);
> > +
> > +   Assert(!found);
> >
> > Given TidStoreSetBlockOffsets() is designed to always set (i.e.
> > overwrite) the value, I think we should not expect that found is
> > always false.
>
> I find that a puzzling statement, since 1) it was designed for
> insert-only workloads, not actual overwrite IIRC and 2) the tests will
> now fail if the same block is set twice, since we just switched the
> tests to use a remnant of vacuum's old array. Having said that, I
> don't object to removing artificial barriers to using it for purposes
> not yet imagined, as long as test_tidstore.sql warns against that.

I think that if it supports only insert-only workloads and expects each
block to be set only once, it should raise an error rather than an
assertion. It's odd to me that the function fails only in assert-enabled
builds even though it actually works fine in that case.

As for test_tidstore you're right that the test code doesn't handle
the case where

Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-14 Thread Masahiko Sawada
On Thu, Mar 14, 2024 at 9:03 PM Masahiko Sawada  wrote:
>
> On Thu, Mar 14, 2024 at 6:55 PM John Naylor  wrote:
> >
> > On Thu, Mar 14, 2024 at 12:06 PM Masahiko Sawada  
> > wrote:
> > >
> > > On Thu, Mar 14, 2024 at 1:29 PM John Naylor  
> > > wrote:
> > > > Okay, here's an another idea: Change test_lookup_tids() to be more
> > > > general and put the validation down into C as well. First we save the
> > > > blocks from do_set_block_offsets() into a table, then with all those
> > > > blocks lookup a sufficiently-large range of possible offsets and save
> > > > found values in another array. So the static items structure would
> > > > have 3 arrays: inserts, successful lookups, and iteration (currently
> > > > the iteration output is private to check_set_block_offsets(). Then
> > > > sort as needed and check they are all the same.
> > >
> > > That's a promising idea. We can use the same mechanism for randomized
> > > tests too. If you're going to work on this, I'll do other tests on my
> > > environment in the meantime.
> >
> > Some progress on this in v72 -- I tried first without using SQL to
> > save the blocks, just using the unique blocks from the verification
> > array. It seems to work fine.
>
> Thanks!
>
> >
> > - Since there are now three arrays we should reduce max bytes to
> > something smaller.
>
> Agreed.
>
> > - Further on that, I'm not sure if the "is full" test is telling us
> > much. It seems we could make max bytes a static variable and set it to
> > the size of the empty store. I'm guessing it wouldn't take much to add
> > enough tids so that the contexts need to allocate some blocks, and
> > then it would appear full and we can test that. I've made it so all
> > arrays repalloc when needed, just in case.
>
> How about using work_mem as max_bytes instead of having it as a static
> variable? In test_tidstore.sql we set work_mem before creating the
> tidstore. It would make the tidstore more controllable by SQL queries.
>
> > - Why are we switching to TopMemoryContext? It's not explained -- the
> > comment only tells what the code is doing (which is obvious), but not
> > why.
>
> This is because the tidstore needs to live across the transaction
> boundary. We can use TopMemoryContext or CacheMemoryContext.
>
> > - I'm not sure it's useful to keep test_lookup_tids() around. Since we
> > now have a separate lookup test, the only thing it can tell us is that
> > lookups fail on an empty store. I arranged it so that
> > check_set_block_offsets() works on an empty store. Although that's
> > even more trivial, it's just reusing what we already need.
>
> Agreed.
>

I have two questions on tidstore.c:

+/*
+ * Set the given TIDs on the blkno to TidStore.
+ *
+ * NB: the offset numbers in offsets must be sorted in ascending order.
+ */

Do we need some assertions to check that the given offset numbers are
sorted as expected?
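
For concreteness, such a check could look like this sketch, using the
function's parameter names and firing only in assert-enabled builds:

#ifdef USE_ASSERT_CHECKING
    /* offsets must be unique and in ascending order */
    for (int i = 1; i < num_offsets; i++)
        Assert(offsets[i] > offsets[i - 1]);
#endif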

---
+   if (TidStoreIsShared(ts))
+   found = shared_rt_set(ts->tree.shared, blkno, page);
+   else
+   found = local_rt_set(ts->tree.local, blkno, page);
+
+   Assert(!found);

Given TidStoreSetBlockOffsets() is designed to always set (i.e.
overwrite) the value, I think we should not expect that found is
always false.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-14 Thread Masahiko Sawada
On Thu, Mar 14, 2024 at 6:55 PM John Naylor  wrote:
>
> On Thu, Mar 14, 2024 at 12:06 PM Masahiko Sawada  
> wrote:
> >
> > On Thu, Mar 14, 2024 at 1:29 PM John Naylor  wrote:
> > > Okay, here's an another idea: Change test_lookup_tids() to be more
> > > general and put the validation down into C as well. First we save the
> > > blocks from do_set_block_offsets() into a table, then with all those
> > > blocks lookup a sufficiently-large range of possible offsets and save
> > > found values in another array. So the static items structure would
> > > have 3 arrays: inserts, successful lookups, and iteration (currently
> > > the iteration output is private to check_set_block_offsets(). Then
> > > sort as needed and check they are all the same.
> >
> > That's a promising idea. We can use the same mechanism for randomized
> > tests too. If you're going to work on this, I'll do other tests on my
> > environment in the meantime.
>
> Some progress on this in v72 -- I tried first without using SQL to
> save the blocks, just using the unique blocks from the verification
> array. It seems to work fine.

Thanks!

>
> - Since there are now three arrays we should reduce max bytes to
> something smaller.

Agreed.

> - Further on that, I'm not sure if the "is full" test is telling us
> much. It seems we could make max bytes a static variable and set it to
> the size of the empty store. I'm guessing it wouldn't take much to add
> enough tids so that the contexts need to allocate some blocks, and
> then it would appear full and we can test that. I've made it so all
> arrays repalloc when needed, just in case.

How about using work_mem as max_bytes instead of having it as a static
variable? In test_tidstore.sql we set work_mem before creating the
tidstore. It would make the tidstore more controllable by SQL queries.

> - Why are we switching to TopMemoryContext? It's not explained -- the
> comment only tells what the code is doing (which is obvious), but not
> why.

This is because the tidstore needs to live across the transaction
boundary. We can use TopMemoryContext or CacheMemoryContext.
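
A sketch of what that looks like in the test module, assuming the
TidStoreCreate(max_bytes, dsa, tranche_id) signature from this patch
series, with a NULL dsa_area meaning a backend-local store:

    MemoryContext oldcontext;

    /* keep the store alive across transaction boundaries */
    oldcontext = MemoryContextSwitchTo(TopMemoryContext);
    tidstore = TidStoreCreate(max_bytes, NULL, 0);
    MemoryContextSwitchTo(oldcontext);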

> - I'm not sure it's useful to keep test_lookup_tids() around. Since we
> now have a separate lookup test, the only thing it can tell us is that
> lookups fail on an empty store. I arranged it so that
> check_set_block_offsets() works on an empty store. Although that's
> even more trivial, it's just reusing what we already need.

Agreed.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Have pg_basebackup write "dbname" in "primary_conninfo"?

2024-03-14 Thread Masahiko Sawada
On Thu, Mar 14, 2024 at 2:27 PM Amit Kapila  wrote:
>
> On Thu, Mar 14, 2024 at 5:57 AM Masahiko Sawada  wrote:
> >
> > This fact makes me think that the slotsync worker might be able to
> > accept the primary_conninfo value even if there is no dbname in the
> > value. That is, if there is no dbname in the primary_conninfo, it uses
> > the username in accordance with the specs of the connection string.
> > Currently, the slotsync worker connects to the local database first
> > and then establishes the connection to the primary server. But if we
> > can reverse the two steps, it can get the dbname that has actually
> > been used to establish the remote connection and use it for the local
> > connection too. That way, the primary_conninfo generated by
> > pg_basebackup could work even without the patch. For example, if the
> > OS user executing pg_basebackup is 'postgres', the slotsync worker
> > would connect to the postgres database. Given the 'postgres' database
> > is created by default and 'postgres' OS user is used in common, I
> > guess it could cover many cases in practice actually.
> >
>
> I think this is worth investigating but I suspect that in most cases
> users will end up using a replication connection without specifying
> the user name and we may not be able to give a meaningful error
> message when slotsync worker won't be able to connect. The same will
> be true even when the dbname same as the username would be used.

What do you mean by not being able to give a meaningful error message?

If the slotsync worker uses the user name as the dbname, and such a
database doesn't exist, the error message the user will get is
"database "test_user" does not exist". ISTM the same is true when the
user specifies the wrong database in the primary_conninfo.
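
That fallback is ordinary libpq behavior and easy to demonstrate with
standard API; PQdb() reports the database name the connection options
resolved to (the host and user below are made-up examples):

PGconn *conn = PQconnectdb("host=primary user=test_user"); /* no dbname */

/* libpq substituted the user name, so this prints "test_user" */
printf("resolved dbname: %s\n", PQdb(conn));
PQfinish(conn);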

>
> > Having said that, even with (or without) the above change, we might
> > want to change the pg_basebackup so that it writes the dbname to the
> > primary_conninfo if -R option is specified. Since the database where
> > the slotsync worker connects cannot be dropped while the slotsync
> > worker is running, the user might want to change the database to
> > connect, and it would be useful if they can do that using
> > pg_basebackup instead of modifying the configuration file manually.
> >
> > While the current approach makes sense to me, I'm a bit concerned that
> > we might end up having the pg_basebackup search the actual database
> > name (e.g. 'dbname=template1') from the .pgpass file instead of
> > 'dbname=replication'. As far as I tested on my environment, suppose
> > that I execute:
> >
> > pg_basebackup -D tmp -d "dbname=testdb" -R
> >
> > The pg_basebackup established a replication connection but looked for
> > the password of the 'testdb' database. This could be another
> > inconvenience for the existing users who want to use the slot
> > synchronization.
> >
>
> This is true because it is internally using logical replication
> connection (as we will set set replication=database).

Did you mean that pg_basebackup uses a logical replication connection
in this case? As far as I tested, even if we specify a dbname in the -d
option of pg_basebackup, it uses a physical replication connection. For
example, it can take a backup even if I specify a non-existent database
name.

> > A random idea I came up with is, we add a new option to the
> > pg_basebackup to overwrite the full or some portion of the connection
> > string that is eventually written in the primary_conninfo in
> > postgresql.auto.conf. For example, the command:
> >
> > pg_basebackup -D tmp -d "host=1.1.1.1 port=" -R
> --primary-conninfo-ext "host=2.2.2.2 dbname=postgres"
> >
> > will produce the connection string that is based on -d option value
> > but is overwritten by --primary-conninfo-ext option value, which will
> > be like:
> >
> > host=2.2.2.2 dbname=postgres port=
> >
> > This option might help not only for users who want to use the slotsync
> > worker but also for users who want to take a basebackup from a standby
> > but have the new standby connect to the primary.
> >
>
> Agreed, this could be another way though it would be good to get some
> inputs from users or otherwise about the preferred way to specify
> dbname. One can also imagine using the Alter System for this purpose.

Agreed.

>
> > But it's still just an idea and I might be missing something. And
> > given we're getting closer to the feature freeze, it would be a PG18
> > item.
> >
>
> +1. At this stage, it is important to discu

Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-13 Thread Masahiko Sawada
On Thu, Mar 14, 2024 at 1:29 PM John Naylor  wrote:
>
> On Thu, Mar 14, 2024 at 8:53 AM Masahiko Sawada  wrote:
> >
> > On Thu, Mar 14, 2024 at 9:59 AM John Naylor  wrote:
> > > > BTW do we still want to test the tidstore by using a combination of
> > > > SQL functions? We might no longer need to input TIDs via a SQL
> > > > function.
> > >
> > > I'm not sure. I stopped short of doing that to get feedback on this
> > > much. One advantage with SQL functions is we can use generate_series
> > > to easily input lists of blocks with different numbers and strides,
> > > and array literals for offsets are a bit easier. What do you think?
> >
> > While I'm not a fan of the following part, I agree that it makes sense
> > to use SQL functions for test data generation:
> >
> > -- Constant values used in the tests.
> > \set maxblkno 4294967295
> > -- The maximum number of heap tuples (MaxHeapTuplesPerPage) in 8kB block is 
> > 291.
> > -- We use a higher number to test tidstore.
> > \set maxoffset 512
>
> I'm not really a fan of these either, and could be removed a some
> point if we've done everything else nicely.
>
> > It would also be easier for developers to test the tidstore with their
> > own data set. So I agreed with the current approach; use SQL functions
> > for data generation and do the actual tests inside C functions.
>
> Okay, here's an another idea: Change test_lookup_tids() to be more
> general and put the validation down into C as well. First we save the
> blocks from do_set_block_offsets() into a table, then with all those
> blocks lookup a sufficiently-large range of possible offsets and save
> found values in another array. So the static items structure would
> have 3 arrays: inserts, successful lookups, and iteration (currently
> the iteration output is private to check_set_block_offsets(). Then
> sort as needed and check they are all the same.

That's a promising idea. We can use the same mechanism for randomized
tests too. If you're going to work on this, I'll do other tests on my
environment in the meantime.

>
> Further thought: We may not really need to test block numbers that
> vigorously, since the radix tree tests should cover keys/values pretty
> well.

Agreed. Probably boundary block numbers: 0, 1, MaxBlockNumber - 1, and
MaxBlockNumber, would be sufficient.
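
That is, something along these lines in the C test function, a sketch
reusing TidStoreSetBlockOffsets() and whatever offsets array the test
already has at hand:

static const BlockNumber test_blocks[] = {
    0, 1, MaxBlockNumber - 1, MaxBlockNumber
};

for (int i = 0; i < lengthof(test_blocks); i++)
    TidStoreSetBlockOffsets(ts, test_blocks[i], offsets, num_offsets);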

>  The difference here is using bitmaps of tids and that should be
> well covered.

Right. We would need to test offset numbers vigorously instead.

>
> Locally (not CI), we should try big inputs to make sure we can
> actually go up to many GB -- it's easier and faster this way than
> having vacuum give us a large data set.

I'll do these tests.

>
> > Is it
> > convenient for developers if we have functions like generate_tids()
> > and generate_random_tids() to generate TIDs so that they can pass them
> > to do_set_block_offsets()?
>
> I guess I don't see the advantage of adding a layer of indirection at
> this point, but it could be useful at a later time.

Agreed.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Improve eviction algorithm in ReorderBuffer

2024-03-13 Thread Masahiko Sawada
On Wed, Mar 13, 2024 at 11:23 AM Peter Smith  wrote:
>
> On Wed, Mar 13, 2024 at 12:48 PM Masahiko Sawada  
> wrote:
> >
> > On Wed, Mar 13, 2024 at 10:15 AM Peter Smith  wrote:
> > >
> > > On Tue, Mar 12, 2024 at 4:23 PM Masahiko Sawada  
> > > wrote:
> > > >
> > > > On Fri, Mar 8, 2024 at 12:58 PM Peter Smith  
> > > > wrote:
> > > > >
> > > ...
> > > > > > > 5.
> > > > > > > + *
> > > > > > > + * If 'indexed' is true, we create a hash table to track of each 
> > > > > > > node's
> > > > > > > + * index in the heap, enabling to perform some operations such 
> > > > > > > as removing
> > > > > > > + * the node from the heap.
> > > > > > >   */
> > > > > > >  binaryheap *
> > > > > > > -binaryheap_allocate(int capacity, binaryheap_comparator compare, 
> > > > > > > void *arg)
> > > > > > > +binaryheap_allocate(int capacity, binaryheap_comparator compare,
> > > > > > > + bool indexed, void *arg)
> > > > > > >
> > > > > > > BEFORE
> > > > > > > ... enabling to perform some operations such as removing the node 
> > > > > > > from the heap.
> > > > > > >
> > > > > > > SUGGESTION
> > > > > > > ... to help make operations such as removing nodes more efficient.
> > > > > > >
> > > > > >
> > > > > > But these operations literally require the indexed binary heap as we
> > > > > > have an assertion:
> > > > > >
> > > > > > void
> > > > > > binaryheap_remove_node_ptr(binaryheap *heap, bh_node_type d)
> > > > > > {
> > > > > > bh_nodeidx_entry *ent;
> > > > > >
> > > > > > Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
> > > > > > Assert(heap->bh_indexed);
> > > > > >
> > > > >
> > > > > I didn’t quite understand -- the operations mentioned are "operations
> > > > > such as removing the node", but binaryheap_remove_node() also removes
> > > > > a node from the heap. So I still felt the comment wording of the patch
> > > > > is not quite correct.
> > > >
> > > > Now I understand your point. That's a valid point.
> > > >
> > > > >
> > > > > Now, if the removal of a node from an indexed heap can *only* be done
> > > > > using binaryheap_remove_node_ptr() then:
> > > > > - the other removal functions (binaryheap_remove_*) probably need some
> > > > > comments to make sure nobody is tempted to call them directly for an
> > > > > indexed heap.
> > > > > - maybe some refactoring and assertions are needed to ensure those
> > > > > *cannot* be called directly for an indexed heap.
> > > > >
> > > >
> > > > If the 'index' is true, the caller can not only use the existing
> > > > functions but also newly added functions such as
> > > > binaryheap_remove_node_ptr() and binaryheap_update_up() etc. How about
> > > > something like below?
> > > >
> > >
> > > You said: "can not only use the existing functions but also..."
> > >
> > > Hmm. Is that right? IIUC those existing "remove" functions should NOT
> > > be called directly if the heap was "indexed" because they'll delete
> > > the node from the heap OK, but any corresponding index for that
> > > deleted node will be left lying around -- i.e. everything gets out of
> > > sync. This was the reason for my original concern.
> > >
> >
> > All existing binaryheap functions should be available even if the
> > binaryheap is 'indexed'. For instance, with the patch,
> > binaryheap_remote_node() is:
> >
> > void
> > binaryheap_remove_node(binaryheap *heap, int n)
> > {
> > int cmp;
> >
> > Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
> > Assert(n >= 0 && n < heap->bh_size);
> >
> > /* compare last node to the one that is being removed */
> > cmp = heap->bh_compare(heap->bh_nodes[--heap->b

Re: Improve eviction algorithm in ReorderBuffer

2024-03-13 Thread Masahiko Sawada
orderBuffer *rb)
>
> /We will run a heap assembly step at the end, which is more
> efficient./The heap assembly step is deferred until the end, for
> efficiency./

Fixed.

>
> ~~~
>
> 8. ReorderBufferLargestTXN
>
> + if (hash_get_num_entries(rb->by_txn) < REORDER_BUFFER_MEM_TRACK_THRESHOLD)
> + {
> + HASH_SEQ_STATUS hash_seq;
> + ReorderBufferTXNByIdEnt *ent;
> +
> + hash_seq_init(_seq, rb->by_txn);
> + while ((ent = hash_seq_search(_seq)) != NULL)
> + {
> + ReorderBufferTXN *txn = ent->txn;
> +
> + /* if the current transaction is larger, remember it */
> + if ((!largest) || (txn->size > largest->size))
> + largest = txn;
> + }
> +
> + Assert(largest);
> + }
>
> That Assert(largest) seems redundant because there is anyway another
> Assert(largest) immediately after this code.

Removed.

>
> ~~~
>
> 9.
> + /* Get the largest transaction from the max-heap */
> + if (rb->memtrack_state == REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP)
> + {
> + Assert(binaryheap_size(rb->txn_heap) > 0);
> + largest = (ReorderBufferTXN *)
> + DatumGetPointer(binaryheap_first(rb->txn_heap));
>   }
> Assert(binaryheap_size(rb->txn_heap) > 0); seemed like slightly less
> readable way of saying:
>
> Assert(!binaryheap_empty(rb->txn_heap));

Fixed.

>
> ~~~
>
> 10.
> +
> +/*
> + * Compare between sizes of two transactions. This is for a binary heap
> + * comparison function.
> + */
> +static int
> +ReorderBufferTXNSizeCompare(Datum a, Datum b, void *arg)
>
> 10a.
> /Compare between sizes of two transactions./Compare two transactions by size./

Fixed.

>
> ~~~
>
> 10b.
> IMO this comparator function belongs just before the
> ReorderBufferAllocate() function since that is the only place where it
> is used.

I think it's better to place it close to the new max-heap-related functions.

>
> ==
> src/include/replication/reorderbuffer.h
>
> 11.
> +/* State of how to track the memory usage of each transaction being decoded 
> */
> +typedef enum ReorderBufferMemTrackState
> +{
> + /*
> + * We don't update max-heap while updating the memory counter. The
> + * max-heap is built before use.
> + */
> + REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP,
> +
> + /*
> + * We also update the max-heap when updating the memory counter so the
> + * heap property is always preserved.
> + */
> + REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP,
> +} ReorderBufferMemTrackState;
> +
>
> In my GENERAL review comment #0, I suggested the removal of this
> entire enum. e.g. It could be replaced with a boolean field
> 'track_txn_sizes'
>
> TBH, I think there is a better way to handle this "state". IIUC
> - the txn_heap is always allocated up-front.
> - you only "build" it when > threshold and
> - when it drops < 0.9 x threshold you reset it.
>
> Therefore, AFAICT you do not need to maintain any “switch states” at
> all; you simply need to check binaryheap_empty(txn_heap), right?
> * If the heap is empty…. It means you are NOT tracking, so don’t use it
> * If the heap is NOT empty …. It means you ARE tracking, so use it.
>
> ~
>
> Using my idea to remove the state flag will have the side effect of
> simplifying many other parts of this patch. For example
>
> BEFORE
> +static void
> +ReorderBufferMaybeChangeNoMaxHeap(ReorderBuffer *rb)
> +{
> + if (rb->memtrack_state == REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP)
> + return;
> +
> ...
> + if (binaryheap_size(rb->txn_heap) < REORDER_BUFFER_MEM_TRACK_THRESHOLD * 
> 0.9)
> + {
> + rb->memtrack_state = REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP;
> + binaryheap_reset(rb->txn_heap);
> + }
> +}
>
> AFTER
> +static void
> +ReorderBufferMaybeChangeNoMaxHeap(ReorderBuffer *rb)
> +{
> + if (binaryheap_empty(rb->txn_heap))
> + return;
> +
> ...
> + if (binaryheap_size(rb->txn_heap) < REORDER_BUFFER_MEM_TRACK_THRESHOLD * 
> 0.9)
> + binaryheap_reset(rb->txn_heap);
> +}

Agreed. I removed the enum and changed the logic.

>
> ~~~
>
> 12. struct ReorderBuffer
>
> + /* Max-heap for sizes of all top-level and sub transactions */
> + ReorderBufferMemTrackState memtrack_state;
> + binaryheap *txn_heap;
> +
>
> 12a.
> Why is this being referred to in the commit message and code comments
> as "max-heap" when the field is not called by that same name? Won't it
> be better to give the field a better name -- e.g. "txn_maxheap" or
> similar?

I'm not sure that helps readability. Other code that uses binaryheap
puts neither "max" nor "min" in the field name.

>
> ~
>
> 12b.
> This comment should also say that the heap is ordered by tx size --
> (e.g. the comparator is ReorderBufferTXNSizeCompare)

It seems to me the comment "/* Max-heap for sizes of all top-level and
sub transactions */" already mentions that, no? I'm not sure we need
to refer to the actual function name here.

I've attached new version patches.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


v9-0001-Make-binaryheap-enlargeable.patch
Description: Binary data


v9-0002-Add-functions-to-binaryheap-for-efficient-key-rem.patch
Description: Binary data


v9-0003-Improve-eviction-algorithm-in-Reorderbuffer-using.patch
Description: Binary data


Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-13 Thread Masahiko Sawada
On Thu, Mar 14, 2024 at 9:59 AM John Naylor  wrote:
>
> On Wed, Mar 13, 2024 at 9:29 PM Masahiko Sawada  wrote:
> >
> > On Wed, Mar 13, 2024 at 8:05 PM John Naylor  wrote:
> > >
> > > On Wed, Mar 13, 2024 at 8:39 AM Masahiko Sawada  
> > > wrote:
> > >
> > > > As I mentioned above, if we implement the test cases in C, we can use
> > > > the debug-build array in the test code. And we won't use it in AND/OR
> > > > operations tests in the future.
> > >
> > > That's a really interesting idea, so I went ahead and tried that for
> > > v71. This seems like a good basis for testing larger, randomized
> > > inputs, once we decide how best to hide that from the expected output.
> > > The tests use SQL functions do_set_block_offsets() and
> > > check_set_block_offsets(). The latter does two checks against a tid
> > > array, and replaces test_dump_tids().
> >
> > Great! I think that's a very good starter.
> >
> > The lookup_test() (and test_lookup_tids()) do also test that the
> > IsMember() function returns false as expected if the TID doesn't exist
> > in it, and probably we can do these tests in a C function too.
> >
> > BTW do we still want to test the tidstore by using a combination of
> > SQL functions? We might no longer need to input TIDs via a SQL
> > function.
>
> I'm not sure. I stopped short of doing that to get feedback on this
> much. One advantage with SQL functions is we can use generate_series
> to easily input lists of blocks with different numbers and strides,
> and array literals for offsets are a bit easier. What do you think?

While I'm not a fan of the following part, I agree that it makes sense
to use SQL functions for test data generation:

-- Constant values used in the tests.
\set maxblkno 4294967295
-- The maximum number of heap tuples (MaxHeapTuplesPerPage) in 8kB block is 291.
-- We use a higher number to test tidstore.
\set maxoffset 512

It would also be easier for developers to test the tidstore with their
own data set. So I agreed with the current approach; use SQL functions
for data generation and do the actual tests inside C functions. Is it
convenient for developers if we have functions like generate_tids()
and generate_random_tids() to generate TIDs so that they can pass them
to do_set_block_offsets()? Then they call check_set_block_offsets()
and others for actual data lookup and iteration tests.
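
For reference, the C side of such a setup stays small. A sketch of a
do_set_block_offsets()-style SQL-callable function, loosely following
the test module's shape (argument handling simplified, not the exact
committed code):

Datum
do_set_block_offsets(PG_FUNCTION_ARGS)
{
    BlockNumber blkno = PG_GETARG_INT64(0);
    ArrayType  *ta = PG_GETARG_ARRAYTYPE_P_COPY(1);
    OffsetNumber *offs = (OffsetNumber *) ARR_DATA_PTR(ta);
    int         noffs = ArrayGetNItems(ARR_NDIM(ta), ARR_DIMS(ta));

    /* offsets are assumed unique and sorted; see the API contract */
    TidStoreSetBlockOffsets(tidstore, blkno, offs, noffs);

    PG_RETURN_INT64(blkno);
}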

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Have pg_basebackup write "dbname" in "primary_conninfo"?

2024-03-13 Thread Masahiko Sawada
On Fri, Feb 23, 2024 at 3:05 PM Amit Kapila  wrote:
>
> On Wed, Feb 21, 2024 at 7:46 AM Hayato Kuroda (Fujitsu)
>  wrote:
> >
> > > > Just FYI - here is an extreme case. And note that I have applied 
> > > > proposed patch.
> > > >
> > > > When `pg_basebackup -D data_N2 -R` is used:
> > > > ```
> > > > primary_conninfo = 'user=hayato ... dbname=hayato ...
> > > > ```
> > > >
> > > > But when `pg_basebackup -d "" -D data_N2 -R` is used:
> > > > ```
> > > > primary_conninfo = 'user=hayato ... dbname=replication
> > > > ```
> > >
> > > It seems like maybe somebody should look into why this is happening,
> > > and perhaps fix it.
> >
> > I think this is caused by the below part [1] in GetConnection().
> >
> > If both dbname and connection_string are NULL, we will enter the else
> > part and NULL would be substituted - the {"dbname", NULL} key-value pair
> > is generated only here.
> >
> > Then, in PQconnectdbParams()->PQconnectStartParams->pqConnectOptions2(),
> > the strange part would be found and replaced with the username [2].
> >
> > I think if both the connection string and the dbname are NULL, it should be
> > considered a physical replication connection. Here is a patch to fix
> > it.
> >
>
> When dbname is NULL or not given, it defaults to username. This
> follows the specs of the connection string.

This fact makes me think that the slotsync worker might be able to
accept the primary_conninfo value even if there is no dbname in the
value. That is, if there is no dbname in the primary_conninfo, it uses
the username in accordance with the specs of the connection string.
Currently, the slotsync worker connects to the local database first
and then establishes the connection to the primary server. But if we
can reverse the two steps, it can get the dbname that has actually
been used to establish the remote connection and use it for the local
connection too. That way, the primary_conninfo generated by
pg_basebackup could work even without the patch. For example, if the
OS user executing pg_basebackup is 'postgres', the slotsync worker
would connect to the postgres database. Given that the 'postgres'
database is created by default and the 'postgres' OS user is commonly
used, I guess it could cover many cases in practice.
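
At the libpq level, the fallback behavior this idea relies on can be
illustrated like below (a minimal sketch only, with a hypothetical host
name; this is not the slotsync worker's actual code):

PGconn     *conn = PQconnectdb("host=primary user=postgres");

if (PQstatus(conn) == CONNECTION_OK)
{
    /* dbname was omitted, so libpq fell back to the user name */
    const char *dbname = PQdb(conn);    /* returns "postgres" */

    /* ... reuse dbname for the local connection ... */
}
PQfinish(conn);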

Having said that, even with (or without) the above change, we might
want to change pg_basebackup so that it writes the dbname to
primary_conninfo if the -R option is specified. Since the database the
slotsync worker connects to cannot be dropped while the slotsync
worker is running, the user might want to change the database to
connect to, and it would be useful if they could do that using
pg_basebackup instead of modifying the configuration file manually.

While the current approach makes sense to me, I'm a bit concerned that
we might end up having pg_basebackup search for the actual database
name (e.g. 'dbname=template1') in the .pgpass file instead of
'dbname=replication'. As far as I tested in my environment, suppose
that I execute:

pg_basebackup -D tmp -d "dbname=testdb" -R

pg_basebackup established a replication connection but looked for
the password of the 'testdb' database. This could be another
inconvenience for existing users who want to use slot
synchronization.

A random idea I came up with is, we add a new option to the
pg_basebackup to overwrite the full or some portion of the connection
string that is eventually written in the primary_conninfo in
postgresql.auto.conf. For example, the command:

pg_basebackup -D tmp -d "host=1.1.1.1 port=" -R
--primary-conninfo-ext "host=2.2.2.2 dbname=postgres"

will produce a connection string that is based on the -d option value
but overwritten by the --primary-conninfo-ext option value, which will
be like:

host=2.2.2.2 dbname=postgres port=

This option might help not only users who want to use the slotsync
worker but also users who want to take a base backup from a standby
but have the new standby connect to the primary.
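
If we went this way, the option could be implemented on top of libpq's
PQconninfoParse(); here is a rough sketch of the merging step (the
function and its error handling are hypothetical, not an actual patch):

static void
merge_conninfo(PQconninfoOption *base_opts, const char *ext)
{
    char       *err = NULL;
    PQconninfoOption *ext_opts = PQconninfoParse(ext, &err);

    if (ext_opts == NULL)
        pg_fatal("could not parse connection string: %s", err);

    /* for each keyword set in "ext", overwrite the value from "base" */
    for (PQconninfoOption *e = ext_opts; e->keyword != NULL; e++)
    {
        if (e->val == NULL)
            continue;
        for (PQconninfoOption *b = base_opts; b->keyword != NULL; b++)
        {
            if (strcmp(b->keyword, e->keyword) == 0)
            {
                free(b->val);
                b->val = strdup(e->val);
                break;
            }
        }
    }

    /* ... serialize base_opts into primary_conninfo ... */
    PQconninfoFree(ext_opts);
}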

But it's still just an idea and I might be missing something. And
given we're getting closer to the feature freeze, it would be a PG18
item.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-13 Thread Masahiko Sawada
On Wed, Mar 13, 2024 at 8:05 PM John Naylor  wrote:
>
> On Wed, Mar 13, 2024 at 8:39 AM Masahiko Sawada  wrote:
>
> > As I mentioned above, if we implement the test cases in C, we can use
> > the debug-build array in the test code. And we won't use it in AND/OR
> > operations tests in the future.
>
> That's a really interesting idea, so I went ahead and tried that for
> v71. This seems like a good basis for testing larger, randomized
> inputs, once we decide how best to hide that from the expected output.
> The tests use SQL functions do_set_block_offsets() and
> check_set_block_offsets(). The latter does two checks against a tid
> array, and replaces test_dump_tids().

Great! I think that's a very good starting point.

The lookup_test() (and test_lookup_tids()) also tests that the
IsMember() function returns false as expected if the TID doesn't exist
in the store, and probably we can do these tests in a C function too.

BTW do we still want to test the tidstore by using a combination of
SQL functions? We might no longer need to input TIDs via a SQL
function.

> Funnily enough, the debug array
> itself gave false failures when using a similar array in the test
> harness, because it didn't know all the places where the array should
> have been sorted -- it only worked by chance before because of what
> order things were done.

Good catch, thanks.

> I squashed everything from v70 and also took the liberty of switching
> on shared memory for tid store tests. The only reason we didn't do
> this with the radix tree tests is that the static attach/detach
> functions would raise warnings since they are not used.

Agreed to test the tidstore on shared memory.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Improve eviction algorithm in ReorderBuffer

2024-03-12 Thread Masahiko Sawada
On Wed, Mar 13, 2024 at 10:15 AM Peter Smith  wrote:
>
> On Tue, Mar 12, 2024 at 4:23 PM Masahiko Sawada  wrote:
> >
> > On Fri, Mar 8, 2024 at 12:58 PM Peter Smith  wrote:
> > >
> ...
> > > > > 5.
> > > > > + *
> > > > > + * If 'indexed' is true, we create a hash table to track of each 
> > > > > node's
> > > > > + * index in the heap, enabling to perform some operations such as 
> > > > > removing
> > > > > + * the node from the heap.
> > > > >   */
> > > > >  binaryheap *
> > > > > -binaryheap_allocate(int capacity, binaryheap_comparator compare, 
> > > > > void *arg)
> > > > > +binaryheap_allocate(int capacity, binaryheap_comparator compare,
> > > > > + bool indexed, void *arg)
> > > > >
> > > > > BEFORE
> > > > > ... enabling to perform some operations such as removing the node 
> > > > > from the heap.
> > > > >
> > > > > SUGGESTION
> > > > > ... to help make operations such as removing nodes more efficient.
> > > > >
> > > >
> > > > But these operations literally require the indexed binary heap as we
> > > > have an assertion:
> > > >
> > > > void
> > > > binaryheap_remove_node_ptr(binaryheap *heap, bh_node_type d)
> > > > {
> > > > bh_nodeidx_entry *ent;
> > > >
> > > > Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
> > > > Assert(heap->bh_indexed);
> > > >
> > >
> > > I didn’t quite understand -- the operations mentioned are "operations
> > > such as removing the node", but binaryheap_remove_node() also removes
> > > a node from the heap. So I still felt the comment wording of the patch
> > > is not quite correct.
> >
> > Now I understand your point. That's a valid point.
> >
> > >
> > > Now, if the removal of a node from an indexed heap can *only* be done
> > > using binaryheap_remove_node_ptr() then:
> > > - the other removal functions (binaryheap_remove_*) probably need some
> > > comments to make sure nobody is tempted to call them directly for an
> > > indexed heap.
> > > - maybe some refactoring and assertions are needed to ensure those
> > > *cannot* be called directly for an indexed heap.
> > >
> >
> > If 'indexed' is true, the caller can use not only the existing
> > functions but also the newly added functions such as
> > binaryheap_remove_node_ptr() and binaryheap_update_up(). How about
> > something like below?
> >
>
> You said: "can use not only the existing functions but also..."
>
> Hmm. Is that right? IIUC those existing "remove" functions should NOT
> be called directly if the heap was "indexed" because they'll delete
> the node from the heap OK, but any corresponding index for that
> deleted node will be left lying around -- i.e. everything gets out of
> sync. This was the reason for my original concern.
>

All existing binaryheap functions should be available even if the
binaryheap is 'indexed'. For instance, with the patch,
binaryheap_remove_node() is:

void
binaryheap_remove_node(binaryheap *heap, int n)
{
    int         cmp;

    Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
    Assert(n >= 0 && n < heap->bh_size);

    /* compare last node to the one that is being removed */
    cmp = heap->bh_compare(heap->bh_nodes[--heap->bh_size],
                           heap->bh_nodes[n],
                           heap->bh_arg);

    /* remove the last node, placing it in the vacated entry */
    replace_node(heap, n, heap->bh_nodes[heap->bh_size]);

    /* sift as needed to preserve the heap property */
    if (cmp > 0)
        sift_up(heap, n);
    else if (cmp < 0)
        sift_down(heap, n);
}

The replace_node(), sift_up() and sift_down() functions update the
node's index as well if the binaryheap is indexed. When a node is
deleted from the binaryheap, its index is also deleted from the hash
table.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-12 Thread Masahiko Sawada
On Tue, Mar 12, 2024 at 7:34 PM John Naylor  wrote:
>
> On Mon, Mar 11, 2024 at 3:13 PM Masahiko Sawada  wrote:
> >
> > On Mon, Mar 11, 2024 at 12:20 PM John Naylor  
> > wrote:
> > >
> > > On Thu, Mar 7, 2024 at 10:35 PM Masahiko Sawada  
> > > wrote:
>
> > > + ts->context = CurrentMemoryContext;
> > >
> > > As far as I can tell, this member is never accessed again -- am I
> > > missing something?
> >
> > You're right. It was used to re-create the tidstore in the same
> > context again while resetting it, but we no longer support the reset
> > API. Considering it again, would it be better to allocate the iterator
> > struct in the same context as we store the tidstore struct?
>
> That makes sense.
>
> > > + /* DSA for tidstore will be detached at the end of session */
> > >
> > > No other test module pins the mapping, but that doesn't necessarily
> > > mean it's wrong. Is there some advantage over explicitly detaching?
> >
> > One small benefit of not explicitly detaching dsa_area in
> > tidstore_destroy() would be simplicity; IIUC if we want to do that, we
> > need to remember the dsa_area using (for example) a static variable,
> > and free it if it's non-NULL. I've implemented this idea in the
> > attached patch.
>
> Okay, I don't have a strong preference at this point.

I'll keep that change, then.

>
> > > +-- Add tids in random order.
> > >
> > > I don't see any randomization here. I do remember adding row_number to
> > > remove whitespace in the output, but I don't remember a random order.
> > > On that subject, the row_number was an easy trick to avoid extra
> > > whitespace, but maybe we should just teach the setting function to
> > > return blocknumber rather than null?
> >
> > Good idea, fixed.
>
> + test_set_block_offsets
> +
> + 2147483647
> +  0
> + 4294967294
> +  1
> + 4294967295
>
> Hmm, was the earlier comment about randomness referring to this? I'm
> not sure what other regression tests do in these cases, or how
> reliable this is. If this is a problem we could simply insert this
> result into a temp table so it's not output.

I didn't address the comment about randomness.

I think that we will have both random TID tests and fixed TID tests in
test_tidstore as we discussed, and probably we can do both with
similar steps: insert TIDs into both a temp table and the tidstore,
and check if the tidstore returns the expected results by comparing
them to the temp table. We could have a common pl/pgsql function that
does the check and raises a WARNING or an ERROR. But given that this
is very similar to what we did in test_radixtree, do we really want to
implement it using a pl/pgsql function? When we discussed it before, I
found the current way made sense. But now that we're adding more tests
and will add more in the future, doing the tests in C will be more
maintainable and faster. Also, I think we can do the debug-build array
stuff in the test_tidstore code instead.
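
To give an idea of what such a C-side check could look like (assuming
the iteration API from the patch and a sorted reference array collected
while setting TIDs; the function name is made up):

static void
check_iteration(TidStore *ts, ItemPointerData *expected, int nexpected)
{
    TidStoreIter *iter = TidStoreBeginIterate(ts);
    TidStoreIterResult *res;
    int         i = 0;

    while ((res = TidStoreIterateNext(iter)) != NULL)
    {
        for (int j = 0; j < res->num_offsets; j++)
        {
            ItemPointerData tid;

            ItemPointerSet(&tid, res->blkno, res->offsets[j]);
            if (i >= nexpected ||
                ItemPointerCompare(&tid, &expected[i]) != 0)
                elog(ERROR, "tidstore iteration mismatch at %d", i);
            i++;
        }
    }
    TidStoreEndIterate(iter);

    if (i != nexpected)
        elog(ERROR, "tidstore returned %d TIDs but expected %d",
             i, nexpected);
}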

>
> > > +Datum
> > > +tidstore_create(PG_FUNCTION_ARGS)
> > > +{
> > > ...
> > > + tidstore = TidStoreCreate(max_bytes, dsa);
> > >
> > > +Datum
> > > +tidstore_set_block_offsets(PG_FUNCTION_ARGS)
> > > +{
> > > 
> > > + TidStoreSetBlockOffsets(tidstore, blkno, offs, noffs);
> > >
> > > These names are too similar. Maybe the test module should do
> > > s/tidstore_/test_/ or similar.
> >
> > Agreed.
>
> Mostly okay, although a couple look a bit generic now. I'll leave it
> up to you if you want to tweak things.
>
> > > In general, the .sql file is still very hard-coded. Functions are
> > > created that contain a VALUES statement. Maybe it's okay for now, but
> > > wanted to mention it. Ideally, we'd have some randomized tests,
> > > without having to display it. That could be in addition to (not
> > > replacing) the small tests we have that display input. (see below)
> > >
> >
> > Agreed to add randomized tests in addition to the existing tests.
>
> I'll try something tomorrow.
>
> > Sounds like a good idea. In fact, if there are some bugs in tidstore, it's
> > likely that even initdb would fail in practice. However, it's a very
> > good idea that we can test the tidstore anyway with such a check
> > without a debug-build array.
> >
> > Or as another idea, I wonder if we could keep the debug-build array in
> > some form.

Re: Improve eviction algorithm in ReorderBuffer

2024-03-11 Thread Masahiko Sawada
On Fri, Mar 8, 2024 at 12:58 PM Peter Smith  wrote:
>
> On Thu, Mar 7, 2024 at 2:16 PM Masahiko Sawada  wrote:
> >
> > On Tue, Mar 5, 2024 at 3:28 PM Peter Smith  wrote:
> > >
>
> > > 4a.
> > > The comment in simplehash.h says
> > >  *   The following parameters are only relevant when SH_DEFINE is defined:
> > >  *   - SH_KEY - ...
> > >  *   - SH_EQUAL(table, a, b) - ...
> > >  *   - SH_HASH_KEY(table, key) - ...
> > >  *   - SH_STORE_HASH - ...
> > >  *   - SH_GET_HASH(tb, a) - ...
> > >
> > > So maybe it is nicer to reorder the #defines in that same order?
> > >
> > > SUGGESTION:
> > > +#define SH_PREFIX bh_nodeidx
> > > +#define SH_ELEMENT_TYPE bh_nodeidx_entry
> > > +#define SH_KEY_TYPE bh_node_type
> > > +#define SH_SCOPE extern
> > > +#ifdef FRONTEND
> > > +#define SH_RAW_ALLOCATOR pg_malloc0
> > > +#endif
> > >
> > > +#define SH_DEFINE
> > > +#define SH_KEY key
> > > +#define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(bh_node_type)) == 0)
> > > +#define SH_HASH_KEY(tb, key) \
> > > + hash_bytes((const unsigned char *) &key, sizeof(bh_node_type))
> > > +#include "lib/simplehash.h"
> >
> > I'm really not sure it helps increase readability. For instance, for
> > me it's more readable if SH_DEFINE and SH_DECLARE come last, before
> > the #include, since it's more obvious whether we want to declare,
> > define, or both. Other simplehash.h users also do so.
> >
>
> OK.
>
> > > 5.
> > > + *
> > > + * If 'indexed' is true, we create a hash table to track of each node's
> > > + * index in the heap, enabling to perform some operations such as 
> > > removing
> > > + * the node from the heap.
> > >   */
> > >  binaryheap *
> > > -binaryheap_allocate(int capacity, binaryheap_comparator compare, void 
> > > *arg)
> > > +binaryheap_allocate(int capacity, binaryheap_comparator compare,
> > > + bool indexed, void *arg)
> > >
> > > BEFORE
> > > ... enabling to perform some operations such as removing the node from 
> > > the heap.
> > >
> > > SUGGESTION
> > > ... to help make operations such as removing nodes more efficient.
> > >
> >
> > But these operations literally require the indexed binary heap as we
> > have an assertion:
> >
> > void
> > binaryheap_remove_node_ptr(binaryheap *heap, bh_node_type d)
> > {
> > bh_nodeidx_entry *ent;
> >
> > Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
> > Assert(heap->bh_indexed);
> >
>
> I didn’t quite understand -- the operations mentioned are "operations
> such as removing the node", but binaryheap_remove_node() also removes
> a node from the heap. So I still felt the comment wording of the patch
> is not quite correct.

Now I understand your point. That's a valid point.

>
> Now, if the removal of a node from an indexed heap can *only* be done
> using binaryheap_remove_node_ptr() then:
> - the other removal functions (binaryheap_remove_*) probably need some
> comments to make sure nobody is tempted to call them directly for an
> indexed heap.
> - maybe some refactoring and assertions are needed to ensure those
> *cannot* be called directly for an indexed heap.
>

If 'indexed' is true, the caller can use not only the existing
functions but also the newly added functions such as
binaryheap_remove_node_ptr() and binaryheap_update_up(). How about
something like below?

 * If 'indexed' is true, we create a hash table to track each node's
 * index in the heap, enabling to perform some operations such as
 * binaryheap_remove_node_ptr() etc.

>
> And, here are some review comments for v8-0002.
>
> ==
> 1. delete_nodeidx
>
> +/*
> + * Remove the node's index from the hash table if the heap is indexed.
> + */
> +static bool
> +delete_nodeidx(binaryheap *heap, bh_node_type node)
> +{
> + if (!binaryheap_indexed(heap))
> + return false;
> +
> + return bh_nodeidx_delete(heap->bh_nodeidx, node);
> +}
>
> 1a.
> In v8 this function was changed to now return bool, so, I think the
> function comment should explain the meaning of that return value.
>
> ~
>
> 1b.
> I felt the function body is better expressed positively: "If this then
> do that", instead of "If not this then do nothing otherwise do that"
>
> SUGGESTION
> if (binaryheap_indexed(heap))
>   return bh_nodeidx_delete(heap->bh_nodeidx, node);
>
return false;

Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-11 Thread Masahiko Sawada
On Mon, Mar 11, 2024 at 5:13 PM Masahiko Sawada  wrote:
>
> In the latest (v69) patch:
>
> - squashed v68-0005 and v68-0006 patches.
> - removed most of the changes in v68-0007 patch.
> - addressed above review comments in v69-0002 patch.
> - v69-0003, 0004, and 0005 are miscellaneous updates.

Since v69 conflicts with the current HEAD, I've rebased the patches.
In addition, v70-0008 is a new patch that cleans up the vacuum
integration patch.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


v70-ART.tar.gz
Description: GNU Zip compressed data


Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-11 Thread Masahiko Sawada
On Mon, Mar 11, 2024 at 12:20 PM John Naylor  wrote:
>
> On Thu, Mar 7, 2024 at 10:35 PM Masahiko Sawada  wrote:
> >
> > I've attached the remaining patches for CI. I've made some minor
> > changes in separate patches and drafted the commit message for
> > tidstore patch.
> >
> > While reviewing the tidstore code, I thought that it would be more
> > appropriate to place tidstore.c under src/backend/lib instead of
> > src/backend/common/access since the tidstore is no longer implemented
> > only for heap or other access methods, and it might also be used by
> > executor nodes in the future. What do you think?
>
> That's a heck of a good question. I don't think src/backend/lib is
> right -- it seems that's for general-purpose data structures.
> Something like backend/utils is also too general.
> src/backend/access/common has things for tuple descriptors, toast,
> sessions, and I don't think tidstore is out of place here. I'm not
> sure there's a better place, but I could be convinced otherwise.

Yeah, I agree that src/backend/lib doesn't seem to be the place for
tidstore. Let's keep it in src/backend/access/common. If others think
differently, we can move it later.

>
> v68-0001:
>
> I'm not sure if commit messages are much a subject of review, and it's
> up to the committer, but I'll share a couple comments just as
> something to think about, not something I would ask you to change: I
> think it's a bit distracting that the commit message talks about the
> justification to use it for vacuum. Let's save that for the commit
> with actual vacuum changes. Also, I suspect saying there are a "wide
> range" of uses is over-selling it a bit, and that paragraph is a bit
> awkward aside from that.

Thank you for the comment, and I agreed. I've updated the commit message.

>
> + /* Collect TIDs extracted from the key-value pair */
> + result->num_offsets = 0;
> +
>
> This comment has nothing at all to do with this line. If the comment
> is for several lines following, some of which are separated by blank
> lines, there should be a blank line after the comment. Also, why isn't
> tidstore_iter_extract_tids() responsible for setting that to zero?

Agreed, fixed.

I also updated this part so we set result->blkno in
tidstore_iter_extract_tids() too, which seems more readable.

>
> + ts->context = CurrentMemoryContext;
>
> As far as I can tell, this member is never accessed again -- am I
> missing something?

You're right. It was used to re-create the tidstore in the same
context again while resetting it, but we no longer support the reset
API. Considering it again, would it be better to allocate the iterator
struct in the same context as we store the tidstore struct?

>
> + /* DSA for tidstore will be detached at the end of session */
>
> No other test module pins the mapping, but that doesn't necessarily
> mean it's wrong. Is there some advantage over explicitly detaching?

One small benefit of not explicitly detaching dsa_area in
tidstore_destroy() would be simplicity; IIUC if we want to do that, we
need to remember the dsa_area using (for example) a static variable,
and free it if it's non-NULL. I've implemented this idea in the
attached patch.
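
The shape of the idea is roughly the following (a sketch only, with
assumed names, not the attached patch itself):

/* remember the DSA area so that we can detach it explicitly */
static dsa_area *test_dsa = NULL;

static void
test_destroy(void)
{
    if (test_dsa != NULL)
    {
        dsa_detach(test_dsa);
        test_dsa = NULL;
    }
}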

>
> +-- Add tids in random order.
>
> I don't see any randomization here. I do remember adding row_number to
> remove whitespace in the output, but I don't remember a random order.
> On that subject, the row_number was an easy trick to avoid extra
> whitespace, but maybe we should just teach the setting function to
> return blocknumber rather than null?

Good idea, fixed.

>
> +Datum
> +tidstore_create(PG_FUNCTION_ARGS)
> +{
> ...
> + tidstore = TidStoreCreate(max_bytes, dsa);
>
> +Datum
> +tidstore_set_block_offsets(PG_FUNCTION_ARGS)
> +{
> 
> + TidStoreSetBlockOffsets(tidstore, blkno, offs, noffs);
>
> These names are too similar. Maybe the test module should do
> s/tidstore_/test_/ or similar.

Agreed.

>
> +/* Sanity check if we've called tidstore_create() */
> +static void
> +check_tidstore_available(void)
> +{
> + if (tidstore == NULL)
> + elog(ERROR, "tidstore is not initialized");
> +}
>
> I don't find this very helpful. If a developer wiped out the create
> call, wouldn't the test crash and burn pretty obviously?

Removed.

>
> In general, the .sql file is still very hard-coded. Functions are
> created that contain a VALUES statement. Maybe it's okay for now, but
> wanted to mention it. Ideally, we'd have some randomized tests,
> without having to display it. That could be in addition to (not
> replacing) the small tests we have that display input. (see below)
>

Agreed to add randomized tests in addition to the existing tests.


Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-07 Thread Masahiko Sawada
On Fri, Mar 8, 2024 at 10:04 AM John Naylor  wrote:
>
> On Thu, Mar 7, 2024 at 11:15 PM Masahiko Sawada  wrote:
> >
> > It looks like it requires a link with pgport_srv, but I'm not sure. It
> > seems that the recent commit 1f1d73a8b also breaks the "Windows - Server
> > 2019, VS 2019 - Meson & ninja" CI task.
>
> Unfortunately, none of the Windows animals happened to run both after
> the initial commit and before removing the (seemingly useless on our
> daily platforms) link. I'll confirm on my own CI branch in a few
> minutes.

Yesterday I confirmed that something like the below fixes the
problem that happened in the Windows CI:

--- a/src/test/modules/test_radixtree/meson.build
+++ b/src/test/modules/test_radixtree/meson.build
@@ -12,6 +12,7 @@ endif

 test_radixtree = shared_module('test_radixtree',
   test_radixtree_sources,
+  link_with: host_system == 'windows' ? pgport_srv : [],
   kwargs: pg_test_mod_args,
 )
 test_install_libs += test_radixtree

But I'm not sure it's the right fix, especially because I guess it
could raise an "AddressSanitizer: odr-violation" error on Windows.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-07 Thread Masahiko Sawada
On Thu, Mar 7, 2024 at 8:06 PM John Naylor  wrote:
>
> On Thu, Mar 7, 2024 at 4:47 PM Masahiko Sawada  wrote:
> >
> > On Thu, Mar 7, 2024 at 6:37 PM John Naylor  wrote:
>
> > > $ git grep 'link_with: pgport_srv'
> > > src/test/modules/test_radixtree/meson.build:  link_with: pgport_srv,
> > >
> > > No other test module uses this directive, and indeed, removing this
> > > still builds fine for me. Thoughts?
> >
> > Yeah, it could be the culprit. The test_radixtree/meson.build is the
> > sole extension that explicitly specifies a link with pgport_srv. I
> > think we can get rid of it, as I've also confirmed the build is still
> > fine even without it.
>
> olingo and grassquit have turned green, so that must have been it.

fairywren is complaining about another build failure:

[1931/2156] "gcc"  -o
src/test/modules/test_radixtree/test_radixtree.dll
src/test/modules/test_radixtree/test_radixtree.dll.p/win32ver.obj
src/test/modules/test_radixtree/test_radixtree.dll.p/test_radixtree.c.obj
"-Wl,--allow-shlib-undefined" "-shared" "-Wl,--start-group"
"-Wl,--out-implib=src/test\\modules\\test_radixtree\\test_radixtree.dll.a"
"-Wl,--stack,4194304" "-Wl,--allow-multiple-definition"
"-Wl,--disable-auto-import" "-fvisibility=hidden"
"C:/tools/nmsys64/home/pgrunner/bf/root/HEAD/pgsql.build/src/backend/libpostgres.exe.a"
"-pthread" "C:/tools/nmsys64/ucrt64/bin/../lib/libssl.dll.a"
"C:/tools/nmsys64/ucrt64/bin/../lib/libcrypto.dll.a"
"C:/tools/nmsys64/ucrt64/bin/../lib/libz.dll.a" "-lws2_32" "-lm"
"-lkernel32" "-luser32" "-lgdi32" "-lwinspool" "-lshell32" "-lole32"
"-loleaut32" "-luuid" "-lcomdlg32" "-ladvapi32" "-Wl,--end-group"
FAILED: src/test/modules/test_radixtree/test_radixtree.dll
"gcc"  -o src/test/modules/test_radixtree/test_radixtree.dll
src/test/modules/test_radixtree/test_radixtree.dll.p/win32ver.obj
src/test/modules/test_radixtree/test_radixtree.dll.p/test_radixtree.c.obj
"-Wl,--allow-shlib-undefined" "-shared" "-Wl,--start-group"
"-Wl,--out-implib=src/test\\modules\\test_radixtree\\test_radixtree.dll.a"
"-Wl,--stack,4194304" "-Wl,--allow-multiple-definition"
"-Wl,--disable-auto-import" "-fvisibility=hidden"
"C:/tools/nmsys64/home/pgrunner/bf/root/HEAD/pgsql.build/src/backend/libpostgres.exe.a"
"-pthread" "C:/tools/nmsys64/ucrt64/bin/../lib/libssl.dll.a"
"C:/tools/nmsys64/ucrt64/bin/../lib/libcrypto.dll.a"
"C:/tools/nmsys64/ucrt64/bin/../lib/libz.dll.a" "-lws2_32" "-lm"
"-lkernel32" "-luser32" "-lgdi32" "-lwinspool" "-lshell32" "-lole32"
"-loleaut32" "-luuid" "-lcomdlg32" "-ladvapi32" "-Wl,--end-group"
C:/tools/nmsys64/ucrt64/bin/../lib/gcc/x86_64-w64-mingw32/12.2.0/../../../../x86_64-w64-mingw32/bin/ld.exe:
src/test/modules/test_radixtree/test_radixtree.dll.p/test_radixtree.c.obj:test_radixtree:(.rdata$.refptr.pg_popcount64[.refptr.pg_popcount64]+0x0):
undefined reference to `pg_popcount64'

It looks like it requires a link with pgport_srv, but I'm not sure. It
seems that the recent commit 1f1d73a8b also breaks the "Windows - Server
2019, VS 2019 - Meson & ninja" CI task.

Regards,

[1] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2024-03-07%2012%3A53%3A20

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



