RE: Using per-transaction memory contexts for storing decoded tuples

2024-09-20 Thread Hayato Kuroda (Fujitsu)
Dear Sawada-san,

> Thank you for your interest in this patch. I've just shared some
> benchmark results (with a patch) that could be different depending on
> the environment[1]. I would appreciate it if you could also run similar
> tests and share the results.

Okay, I ran similar tests; the attached script is the test runner.
rb_mem_block_size was varied from 8kB to 8MB. The table below shows the results
(in milliseconds). Each cell is the average of 5 runs.

==
block size  avg. time [ms]
8kB         12877.4
16kB        12829.1
32kB        11793.3
64kB        13134.4
128kB       13353.1
256kB       11664.0
512kB       12603.4
1MB         13443.8
2MB         12469.0
4MB         12651.4
8MB         12381.4
==

The standard deviation of the measurements was 100-500 ms, so there were no
noticeable differences in my environment either.

Also, I've checked the statistics of the generation context and confirmed that
the number of allocated blocks is roughly 1000x higher with 8kB blocks than
with 8MB blocks. [1] shows the output from MemoryContextStats(), just in case.
IIUC, the difference in actual used space comes from the header of each block:
every block carries some management attributes, so the total overhead grows
with the number of blocks.

[1]
8kB
Tuples: 724959232 total in 88496 blocks (1000 chunks); 3328 free (0 chunks); 724955904 used
Grand total: 724959232 bytes in 88496 blocks; 3328 free (0 chunks); 724955904 used

8MB
Tuples: 721420288 total in 86 blocks (1000 chunks); 1415344 free (0 chunks); 720004944 used
Grand total: 721420288 bytes in 86 blocks; 1415344 free (0 chunks); 720004944 used
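
As a rough cross-check of the header-overhead explanation (my own
back-of-the-envelope arithmetic, not taken from the thread), the gap in "used"
bytes divided by the gap in block counts gives a plausible per-block header
size:

```
-- Extra "used" bytes in the 8kB run, spread over the extra blocks, give the
-- apparent per-block management overhead.
SELECT 724955904 - 720004944                  AS used_diff_bytes,
       88496 - 86                             AS block_diff,
       (724955904 - 720004944) / (88496 - 86) AS approx_header_bytes_per_block;
-- => 4950960 | 88410 | 56
```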

Best regards,
Hayato Kuroda
FUJITSU LIMITED



test.sh
Description: test.sh


RE: [Proposal] Add foreign-server health checks infrastructure

2024-09-19 Thread Hayato Kuroda (Fujitsu)
Dear members,

(This mail is just a wrap-up)

I found that the final patch was pushed 2 days ago [1] and the BF animals are
OK so far. Therefore, I've closed the CF entry as "committed". We can extend
the feature to other platforms, but I think that can be done in another thread
later.

Thanks, everyone, for all the effort!

[1]: 
https://github.com/postgres/postgres/commit/4f08ab55457751308ffd8d33e82155758cd0e304

Best regards,
Hayato Kuroda
FUJITSU LIMITED



RE: [Proposal] Add foreign-server health checks infrastructure

2024-09-16 Thread Hayato Kuroda (Fujitsu)
Dear Fujii-san,

Thanks for reviewing!

> I made a couple of small adjustments and attached the updated version.
> If that's ok, I'll go ahead and commit it.
> 
> + Name of the local user mapped to the foreign server of this
> + connection, or "public" if a public mapping is used. If the user
> 
> I enclosed "public" with the <literal> tag, i.e., <literal>public</literal>.

Right, it should be. I grepped the sgml files just in case, and the other
occurrences are tagged with <literal> as well.

> > I did not do that because either server_name or user_name would be NULL,
> > and it might look strange. But yes, the example should have more
> > information. Based on that, I added a tuple so that the example contains
> > the rows below. Thoughts?
> >
> > loopback1 - user is "postgres", valid
> > loopback2 - user is "public", valid
> > loopback3 - user is NULL, invalid
> 
> LGTM.
> Also I added the following to the example for clarity:
> 
> postgres=# SELECT * FROM postgres_fdw_get_connections(true);

+1.

Best regards,
Hayato Kuroda
FUJITSU LIMITED



RE: Using per-transaction memory contexts for storing decoded tuples

2024-09-16 Thread Hayato Kuroda (Fujitsu)
Hi,

> We have several reports that logical decoding uses memory much more
> than logical_decoding_work_mem[1][2][3]. For instance in one of the
> reports[1], even though users set logical_decoding_work_mem to
> '256MB', a walsender process was killed by OOM because of using more
> than 4GB memory.

I appreciate your work on logical replication and am interested in the thread.
I've heard about this issue from others, and it has been a barrier to adopting
logical replication. Please let me know if I can help with benchmarking, other
measurements, etc.

Best regards,
Hayato Kuroda
FUJITSU LIMITED



RE: Conflict detection and logging in logical replication

2024-08-20 Thread Hayato Kuroda (Fujitsu)
Dear Hou,

Thanks for updating the patch! I think the patch is mostly good.
Here are minor comments.

0001:

01.
```
+<screen>
+LOG:  conflict detected on relation "schemaname.tablename": conflict=conflict_type
+DETAIL:  detailed explanation.
...
+</screen>
```

I don't think the <screen> tag is correct here. <screen> should be used for the
actual example output, not for explaining the format. I checked several files
like amcheck.sgml, auto-explain.sgml, and func.sgml, and they seem to follow
that rule.

02.
```
+ 
+  The key section in the second sentence of the
...
```

I would prefer that the section name be quoted.

0002:

03.
```
-#include "replication/logicalrelation.h"
```

Just to confirm: this removal is not related to the feature itself but is just
an improvement, right?


Best regards,
Hayato Kuroda
FUJITSU LIMITED


RE: [bug fix] prepared transaction might be lost when max_prepared_transactions is zero on the subscriber

2024-08-09 Thread Hayato Kuroda (Fujitsu)
Dear Amit, Shveta, Hou,

Thanks for giving many comments! I've updated the patch.

> @@ -4409,6 +4409,14 @@ start_apply(XLogRecPtr origin_startpos)
>   }
>   PG_CATCH();
>   {
> + /*
> + * Reset the origin data to prevent the advancement of origin progress
> + * if the transaction failed to apply.
> + */
> + replorigin_session_origin = InvalidRepOriginId;
> + replorigin_session_origin_lsn = InvalidXLogRecPtr;
> + replorigin_session_origin_timestamp = 0;
> 
> Can't we call replorigin_reset() instead here?

I didn't use the function because the arguments at the call site looked
strange, but ideally I can use it. Fixed.

> + /*
> + * Register a callback to reset the origin state before aborting the
> + * transaction in ShutdownPostgres(). This is to prevent the advancement
> + * of origin progress if the transaction failed to apply.
> + */
> + before_shmem_exit(replorigin_reset, (Datum) 0);
> 
> I think we need this despite resetting the origin-related variables in
> PG_CATCH block to handle FATAL error cases, right? If so, can we use
> PG_ENSURE_ERROR_CLEANUP() instead of PG_CATCH()?

There are two reasons to add a shmem-exit callback. One is to handle FATAL
errors; the other is to handle the case where the user requests a shutdown
while changes are being applied. In that case, I think ShutdownPostgres() can
be called, so the session origin may advance.

However, I think we cannot use the
PG_ENSURE_ERROR_CLEANUP()/PG_END_ENSURE_ERROR_CLEANUP() macros here. According
to the code, they assume that no other before-shmem-exit callbacks are
registered within the enclosed block, because the cleanup function is
registered and canceled inside the macros. LogicalRepApplyLoop() can register a
callback while handling a COMMIT PREPARED message, which breaks that rule.

Best regards,
Hayato Kuroda
FUJITSU LIMITED



v3-0001-Prevent-origin-progress-advancement-if-failed-to-.patch
Description:  v3-0001-Prevent-origin-progress-advancement-if-failed-to-.patch


RE: Found issues related with logical replication and 2PC

2024-08-08 Thread Hayato Kuroda (Fujitsu)
Dear Amit,

>
> The code changes look mostly good to me. I have changed/added a few
> comments in the attached modified version.
>

Thanks for updating the patch! LGTM. I've tested your patch and confirmed it
does not cause data loss. I used a source tree with v3 applied plus an
additional fix to visualize the replication commands [1].

Method
==

1. Construct a logical replication system with two_phase = true and
   synchronous_commit = false.
2. Attach a debugger to the subscriber's walwriter to stop the process.
3. Start a transaction and prepare it on the publisher.
4. Wait until the worker replies to the publisher.
5. Stop the subscriber.
6. Restart the subscriber.
7. Run COMMIT PREPARED.

The attached script can construct the same situation.

Result
==

After step 5, I ran pg_waldump and confirmed the PREPARE record existed on
the subscriber.

```
$ pg_waldump data_sub/pg_wal/00010001
...
rmgr: Transaction len..., desc: PREPARE gid pg_gid_16389_741: ...
rmgr: XLOGlen..., desc: CHECKPOINT_SHUTDOWN ...
```

Also, after step 7, I confirmed that only the COMMIT PREPARED record was sent,
because the log contained the line below. "75" is the ASCII code for the
character 'K', which indicates that the replication message corresponds to
COMMIT PREPARED.
```
LOG:  XXX got message 75
```



Additionally, I did another test, which is basically the same as above except
that 1) XLogFlush() in EndPrepare() was commented out and 2) kill -9 was used
at step 5 to emulate a crash. Since the prepared transaction cannot survive on
the subscriber in this case, the COMMIT PREPARED command on the publisher
causes an ERROR on the subscriber.
```
ERROR:  prepared transaction with identifier "pg_gid_16389_741" does not exist
CONTEXT:  processing remote data for replication origin "pg_16389" during
message type "COMMIT PREPARED" in transaction 741, finished at 0/15463C0
```
I think this shows that the backend process ensures the WAL is persisted, so
data loss won't occur.


[1]:
```
@@ -3297,6 +3297,8 @@ apply_dispatch(StringInfo s)
 saved_command = apply_error_callback_arg.command;
 apply_error_callback_arg.command = action;
 
+elog(LOG, "XXX got message %d", action);
```

Best regards,
Hayato Kuroda
FUJITSU LIMITED



test_0809.sh
Description: test_0809.sh


RE: Found issues related with logical replication and 2PC

2024-08-08 Thread Hayato Kuroda (Fujitsu)
Dear Amit, Shveta,

Thanks for discussing!

I reported the issue because 1) I feared the risk of data loss and 2) the
coding simply looked incorrect. However, per the discussion, I now understand
that it wouldn't lead to data loss, and that adding a global variable was
unacceptable in this case. I reworked the patch completely.

The attached patch avoids using LastCommitLSN as the local_lsn while applying
PREPARE. get_flush_position() was not changed. Also, it contains changes that
have not been discussed yet:

- Set last_commit_end to InvalidXLogRecPtr in the PREPARE case.
  This produces the same result as when the stream option is not "parallel".
- XactLastCommitEnd is replaced even in the ROLLBACK PREPARED case.
  Since the ROLLBACK PREPARED record is flushed in
  RecordTransactionAbortPrepared(), there is no need to ensure the WAL is sent.


Best regards,
Hayato Kuroda
FUJITSU LIMITED



v2-0001-Not-to-store-the-flush-position-of-the-PREPARE-re.patch
Description:  v2-0001-Not-to-store-the-flush-position-of-the-PREPARE-re.patch


RE: Found issues related with logical replication and 2PC

2024-08-07 Thread Hayato Kuroda (Fujitsu)
Dear Amit,

> Can we start a separate thread to issue 2? I understand that this one
> is also related to two_phase but since both are different issues it is
> better to discuss in separate threads. This will also help us refer to
> the discussion in future if required.

You are right, we should discuss one topic per thread. Forked: [1].

> BTW, why did the 0002 patch change the below code:
> --- a/src/include/replication/worker_internal.h
> +++ b/src/include/replication/worker_internal.h
> @@ -164,7 +164,8 @@ typedef struct ParallelApplyWorkerShared
> 
>   /*
>   * XactLastCommitEnd or XactLastPrepareEnd from the parallel apply worker.
> - * This is required by the leader worker so it can update the lsn_mappings.
> + * This is required by the leader worker so it can update the
> + * lsn_mappings.
>   */
>   XLogRecPtr last_commit_end;
>

Oops. The fixed version is posted in [1].

[1]: 
https://www.postgresql.org/message-id/TYAPR01MB5692FAC23BE40C69DA8ED4AFF5B92%40TYAPR01MB5692.jpnprd01.prod.outlook.com

Best regards,
Hayato Kuroda
FUJITSU LIMITED



[bug fix] prepared transaction might be lost when max_prepared_transactions is zero on the subscriber

2024-08-07 Thread Hayato Kuroda (Fujitsu)
Dear hackers,

This thread forks from [1] and can be used to discuss the second item. The part
below repeats the statements written in [1]; I copied them just in case. The
attached patch is almost the same but slightly modified based on a comment from
Amit [2]: an unrelated change has been removed.

Found issue
=
When the subscriber enables two-phase commit but doesn't set
max_prepared_transactions > 0, and a transaction is prepared on the publisher,
the apply worker reports an ERROR on the subscriber. After that, the prepared
transaction is not replayed, which means it's lost forever. The attached script
can emulate the situation; a minimal inline sketch follows the error below.

--
ERROR:  prepared transactions are disabled
HINT:  Set "max_prepared_transactions" to a nonzero value.
--
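
A minimal inline sketch of the scenario (my own abbreviation of the attached
script; the connection string and names are placeholders):

```
-- On the publisher:
CREATE TABLE tab (a int PRIMARY KEY);
CREATE PUBLICATION pub FOR TABLE tab;

-- On the subscriber (max_prepared_transactions is 0, the default):
CREATE TABLE tab (a int PRIMARY KEY);
CREATE SUBSCRIPTION sub CONNECTION 'dbname=postgres port=5432'
    PUBLICATION pub WITH (two_phase = true);

-- Back on the publisher:
BEGIN;
INSERT INTO tab VALUES (1);
PREPARE TRANSACTION 'gid1';
-- The apply worker on the subscriber now fails with the ERROR above, and the
-- prepared transaction is never replayed.
```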

The reason is that we advance the origin progress when aborting the transaction
as well (RecordTransactionAbort -> replorigin_session_advance). So, after
replorigin_session_origin_lsn is set, if any ERROR happens while preparing the
transaction, the transaction aborts, which incorrectly advances the origin LSN.

The easiest fix is to reset the session replication origin before calling
RecordTransactionAbort(). I think this can happen when 1) LogicalRepApplyLoop()
raises an ERROR or 2) the apply worker exits. The attached patch fixes the
issue on HEAD.

[1]: 
https://www.postgresql.org/message-id/TYAPR01MB5692FA4926754B91E9D7B5F0F5AA2%40TYAPR01MB5692.jpnprd01.prod.outlook.com
[2]: 
https://www.postgresql.org/message-id/CAA4eK1L-r8OKGdBwC6AeXSibrjr9xKsg8LjGpX_PDR5Go-A9TA%40mail.gmail.com

Best regards,
Hayato Kuroda
FUJITSU LIMITED



v2-0001-Prevent-origin-progress-advancement-if-failed-to-.patch
Description:  v2-0001-Prevent-origin-progress-advancement-if-failed-to-.patch


test_2pc.sh
Description: test_2pc.sh


RE: [Proposal] Add foreign-server health checks infrastructure

2024-08-07 Thread Hayato Kuroda (Fujitsu)
Dear Fujii-san,

Thanks for reviewing! PSA new version.

> 
>  postgres_fdw_get_connections(
>IN check_conn boolean DEFAULT false, OUT server_name text,
>OUT valid boolean, OUT used_in_xact boolean, OUT closed boolean)
>returns setof record
> 
> In the documentation, this part should be updated to include the user_name 
> output
> column.

Right, fixed.

> 
> 
> +user_name
> +text
> +
> + The local user name of this connection. If the user mapping is
> + dropped but the connection remains open (i.e., marked as
> + invalid), this will be NULL.
> 
> How about changing the first description to "Name of the local user mapped to 
> the
> foreign server of this connection, or "public" if a public mapping is used." 
> for more
> precision?

Added. I ran Grammarly and it said OK.

> - server_name | valid | used_in_xact | closed
> --+---+--+
> - loopback1   | t | t|
> - loopback2   | f | t|
> + server_name | user_name | valid | used_in_xact | closed
> +-+---+---+--+
> + loopback1   | postgres  | t | t|
> + loopback2   | postgres  | t | t|
> 
> How about displaying the record with loopback2 and valid=false like the 
> previous
> usage example?

I did not do that because either server_name or user_name would be NULL, and it
might look strange. But yes, the example should have more information. Based on
that, I added a tuple so that the example contains the rows below. Thoughts?

loopback1 - user is "postgres", valid
loopback2 - user is "public", valid
loopback3 - user is NULL, invalid

> 
> 
> +UserMapping *
> +GetUserMappingByOid(Oid umid, bool missing_ok)
> 
> postgres_fdw doesn't need a generic function to return UserMapping. How about
> simplifying the function by removing unnecessary code, e.g., as follows?
> 
> --
> tp = SearchSysCache1(USERMAPPINGOID, ObjectIdGetDatum(umid));
> if (!HeapTupleIsValid(tp))
>  nulls[i++] = true;
> else
> {
>  Oid userid =  ((Form_pg_user_mapping) GETSTRUCT(tp))->userid;
>  values[i++] = CStringGetTextDatum(MappingUserName(userid));
>  ReleaseSysCache(tp);
> }
> --

Largely agreed, but some comments and an Assert() may be needed. Done.

> -ForeignTable *
> -GetForeignTable(Oid relid);
> -
> -
> - This function returns a ForeignTable object for
> - the foreign table with the given OID.  A
> - ForeignTable object contains properties of the
> - foreign table (see foreign/foreign.h for details).
> -
> -
> -
> -
> 
> Why did you remove this code? Just a mistake?

Oh, my fault. I tried to remove GetUserMappingByOid() and this entry was also
removed at that time. Restored.

Best regards,
Hayato Kuroda
FUJITSU LIMITED



v3-0001-Extend-postgres_fdw_get_connections-to-return-use.patch
Description:  v3-0001-Extend-postgres_fdw_get_connections-to-return-use.patch


RE: Conflict detection and logging in logical replication

2024-08-07 Thread Hayato Kuroda (Fujitsu)
Dear Hou,

While playing with the 0003 patch (which may not be ready yet), I found that
when an insert_exists event occurred, both apply_error_count and
insert_exists_count were incremented.

```
-- insert a tuple on the subscriber
subscriber =# INSERT INTO tab VALUES (1);

-- insert the same tuple on the publisher, which causes insert_exists conflict
publisher =# INSERT INTO tab VALUES (1);

-- after some time...
subscriber =# SELECT * FROM pg_stat_subscription_stats;
-[ RECORD 1 ]+--
subid| 16389
subname  | sub
apply_error_count| 16
sync_error_count | 0
insert_exists_count  | 16
update_differ_count  | 0
update_exists_count  | 0
update_missing_count | 0
delete_differ_count  | 0
delete_missing_count | 0
stats_reset  |
```

Not tested, but I think this could also happen in the update_exists_count case,
and sync_error_count may be incremented when the tablesync worker detects the
conflict.

IIUC, the reason is that pgstat_report_subscription_error() is called in the
PG_CATCH part in start_apply() even after ReportApplyConflict(ERROR) is called.

What do you think of the current behavior? I don't like that the same
phenomenon is counted as several events. E.g., in the case of vacuum, the
statistics entries seem to be separated based on whether the work is done by
backends or by autovacuum. The spec feels odd in that only insert_exists and
update_exists are double-counted together with apply_error_count.

An easy fix is to introduce a global variable that is turned on when a conflict
is found.

Thoughts?

Best regards,
Hayato Kuroda
FUJITSU LIMITED



RE: Conflict detection and logging in logical replication

2024-08-06 Thread Hayato Kuroda (Fujitsu)
Dear Hou,

> 
> Here is the V11 patch set which addressed above and Kuroda-san[1] comments.
>

Thanks for updating the patch. I read 0001 again and have no critical comments
for now. I found some cosmetic issues (e.g., lines should be shorter than 80
columns) and attached a fixup patch. Feel free to include it in the next
version.

Best regards,
Hayato Kuroda
FUJITSU LIMITED



fixes_for_v11.patch
Description: fixes_for_v11.patch


RE: [Proposal] Add foreign-server health checks infrastructure

2024-08-01 Thread Hayato Kuroda (Fujitsu)
Dear Fujii-san,

> Thanks for updating the patch!
> 
> > - Changed the name of new API from `GetUserMappingFromOid` to
> `GetUserMappingByOid`
> >to keep the name consistent with others.
> 
> If we expose this function as an FDW helper function, it should return
> a complete UserMapping object, including umoptions.
> 
> However, if postgres_fdw_get_connections() is the only user of this function,
> I'm not inclined to expose it as an FDW helper.

One reason to expose it is that the function does not handle any
postgres_fdw-specific data; however, there are no users or requests from other
projects. Based on that, OK, we can move it to connection.c. If needed, we can
export it again.

> Instead, we can either get
> the user ID by user mapping OID directly in connection.c using 
> SearchSysCache(),
> or add the user ID to ConnCacheEntry and use it in
> postgres_fdw_get_connections().
> Thought?

I moved the function to connection.c; it uses SearchSysCache1().

I've tried both ways, and they worked well. One difference is that when we use
the extended ConnCacheEntry approach and the entry has been invalidated, we
cannot distinguish the reason. For example, in the case below, the entry is
invalidated, so the user_name of the output record will be NULL, whereas the
user mapping is actually still valid. We may be able to record the reason for
the invalidation, but I'm not very motivated to modify that part.

```
BEGIN;
SELECT 1 FROM ft1 LIMIT 1; -- ft1 is at server "loopback"
...
ALTER SERVER loopback OPTIONS (ADD use_remote_estimate 'off');
...
SELECT * FROM postgres_fdw_get_connections() ORDER BY 1;
-> {"loopback", NULL, } will be returned
```

Another reason is that we can keep the code consistent with the server_name
case: that part does not read data from the entry either, and we can follow
the same pattern.

Best regards,
Hayato Kuroda
FUJITSU LIMITED



v2-0001-Extend-postgres_fdw_get_connections-to-return-use.patch
Description:  v2-0001-Extend-postgres_fdw_get_connections-to-return-use.patch


RE: Conflict detection and logging in logical replication

2024-08-01 Thread Hayato Kuroda (Fujitsu)
Dear Hou,

Let me contribute to this great feature. I have read only the 0001 patch so
far; here are my initial comments.

01. logical-replication.sgml

track_commit_timestamp needs to be set only on the subscriber, but this is not
clarified. Can you write that down? (A minimal sketch follows.)
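
For example, a sketch of what the documentation could spell out (subscriber
side only; note that track_commit_timestamp requires a server restart because
it is a postmaster-level parameter):

```
-- On the subscriber only; the publisher does not need this setting.
ALTER SYSTEM SET track_commit_timestamp = on;
-- Restart the server for the change to take effect, then verify:
SHOW track_commit_timestamp;
```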

02. logical-replication.sgml

I felt that the ordering of {exists, differ, missing} should be consistent, but
it is not. For update, "differ" is listed after "missing", but for delete,
"differ" comes before "missing". The inconsistency exists in the source code
as well.

03. conflict.h

The copyright seems wrong. 2012 is not needed.

04. general

According to the documentation [1], there is another constraint "exclude", which
can cause another type of conflict. But this pattern cannot be logged in detail.
I tested below workload as an example.

=
publisher=# create table tab (a int, EXCLUDE (a WITH =));
publisher=# create publication pub for all tables;

subscriber=# create table tab (a int, EXCLUDE (a WITH =));
subscriber=# create subscription sub...;
subscriber=# insert into tab values (1);

publisher=# insert into tab values (1);

-> Got a conflict with the log lines below:
```
ERROR:  conflicting key value violates exclusion constraint "tab_a_excl"
DETAIL:  Key (a)=(1) conflicts with existing key (a)=(1).
CONTEXT:  processing remote data for replication origin "pg_16389" during
message type "INSERT" for replication target relation "public.tab" in
transaction 740, finished at 0/1543940
```
=

Can we support this type of conflict?

[1]: 
https://www.postgresql.org/docs/devel/sql-createtable.html#SQL-CREATETABLE-EXCLUDE

Best regards,
Hayato Kuroda
FUJITSU LIMITED



RE: [BUG?] check_exclusion_or_unique_constraint false negative

2024-07-31 Thread Hayato Kuroda (Fujitsu)
Dear Michail,

Thanks for pointing out the issue!

>* RelationFindReplTupleByIndex
>
>Amit, this is why I've included you in this previously solo thread :)
>RelationFindReplTupleByIndex uses DirtySnapshot and may not find some records
>if they are updated by a parallel transaction. This could lead to lost
>deletes/updates, especially in the case of streaming=parallel mode. 
>I'm not familiar with how parallel workers apply transactions, so maybe this
>isn't possible.

IIUC, the issue can happen when two concurrent transactions using DirtySnapshot
access the same tuples, which is not specific to parallel apply. Consider that
two subscriptions exist and the publishers modify the same tuple of the same
table. In this case, two workers access the tuple, so one of the changes may be
missed via the scenario you described. I feel we do not need special treatment
for parallel apply.

Best regards,
Hayato Kuroda
FUJITSU LIMITED



RE: Remove duplicate table scan in logical apply worker and code refactoring

2024-07-31 Thread Hayato Kuroda (Fujitsu)
Dear Hou,

> Thanks for reviewing the patch, and your understanding is correct.
> 
> Here is the updated patch 0001. I removed the comments as suggested by Amit.
> 
> Since 0002 patch is only refactoring the code and I need some time to review
> the comments for it, I will hold it until the 0001 is committed.

Thanks for updating the patch. I did performance testing with v2-0001.

Before: 15.553 [s]
After:  7.593 [s]

I used the attached script for setup, with almost the same settings as before;
synchronous replication was used.

[machine]
CPU(s):             120
Model name:         Intel(R) Xeon(R) CPU E7-4890 v2 @ 2.80GHz
Core(s) per socket: 15
Socket(s):          4

Best regards,
Hayato Kuroda
FUJITSU LIMITED



test.sh
Description: test.sh


RE: make pg_createsubscriber option names more consistent

2024-07-31 Thread Hayato Kuroda (Fujitsu)
Dear Peter,

> I propose to rename the pg_createsubscriber option --socket-directory to
> --socketdir.  This would make it match the equivalent option in
> pg_upgrade.  (It even has the same short option '-s'.)
> pg_createsubscriber and pg_upgrade have a lot of common terminology and
> a similar operating mode, so it would make sense to keep this consistent.

+1. If so, should we say "default current dir." instead of "default current
directory" in usage(), since pg_upgrade words it that way?

Best regards,
Hayato Kuroda
FUJITSU LIMITED
 


RE: Remove duplicate table scan in logical apply worker and code refactoring

2024-07-31 Thread Hayato Kuroda (Fujitsu)
Dear Hou,

Thanks for creating a patch!

> When reviewing the code in logical/worker.c, I noticed that when applying a
> cross-partition update action, it scans the old partition twice.
> I am attaching the patch 0001 to remove this duplicate table scan.

Just to clarify: you meant that FindReplTupleInLocalRel() is called in both
apply_handle_tuple_routing() and
apply_handle_tuple_routing()->apply_handle_delete_internal(), each of which
requires an index or sequential scan, right? LGTM.

> Apart from above, I found there are quite a few duplicate codes related to 
> partition
> handling(e.g. apply_handle_tuple_routing), so I tried to extract some
> common logic to simplify the codes. Please see 0002 for this refactoring.

IIUC, you wanted to remove the apply code from apply_handle_tuple_routing() and
keep only the partition-detection part. Is that right? Anyway, here are my
comments.

01. apply_handle_insert()

```
+targetRelInfo = edata->targetRelInfo;
+
 /* For a partitioned table, insert the tuple into a partition. */
 if (rel->localrel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
-apply_handle_tuple_routing(edata,
-   remoteslot, NULL, CMD_INSERT);
-else
-apply_handle_insert_internal(edata, edata->targetRelInfo,
- remoteslot);
+remoteslot = apply_handle_tuple_routing(edata, CMD_INSERT, remoteslot,
+&targetRelInfo);
+
+/* For a partitioned table, insert the tuple into a partition. */
+apply_handle_insert_internal(edata, targetRelInfo, remoteslot);
```

This part contains the same comment twice, and there is no need for the
substitution in the case of normal tables. How about:

```
-/* For a partitioned table, insert the tuple into a partition. */
+/*
+ * Find the actual target table if the table is partitioned. Otherwise, use
+ * the same table as the remote one.
+ */
 if (rel->localrel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
-apply_handle_tuple_routing(edata,
-   remoteslot, NULL, CMD_INSERT);
+remoteslot = apply_handle_tuple_routing(edata, CMD_INSERT, remoteslot,
+&targetRelInfo);
 else
-apply_handle_insert_internal(edata, edata->targetRelInfo,
- remoteslot);
+targetRelInfo = edata->targetRelInfo;
+
+/* Insert a tuple to the target table */
+apply_handle_insert_internal(edata, targetRelInfo, remoteslot);
```

02. apply_handle_tuple_routing()

```
 /*
- * This handles insert, update, delete on a partitioned table.
+ * Determine the partition in which the tuple in slot is to be inserted, and
...
```

But this function is also called from apply_handle_delete_internal(). How about
"Determine the partition to which the tuple in the slot belongs."?

03. apply_handle_tuple_routing()

Is there a reason why this returns `TupleTableSlot *` rather than
`ResultRelInfo *`? I'm not sure, but returning the relation info seems more
natural to me because this is a routing function.

04. apply_handle_update()

```
+targetRelInfo = edata->targetRelInfo;
+targetrel = rel;
+remoteslot_root = remoteslot;
```

Here I can say the same thing as 1.

05. apply_handle_update_internal()

It looks like the ordering of the function implementations has changed. Is it
intentional?

before

apply_handle_update
apply_handle_update_internal
apply_handle_delete
apply_handle_delete_internal
FindReplTupleInLocalRel
apply_handle_tuple_routing

after

apply_handle_update
apply_handle_delete
apply_handle_delete_internal
FindReplTupleInLocalRel
apply_handle_tuple_routing
apply_handle_update_internal

06. apply_handle_delete_internal()

```
+targetRelInfo = edata->targetRelInfo;
+targetrel = rel;
+
```

Here I can say the same thing as 1.

Best regards,
Hayato Kuroda
FUJITSU LIMITED





RE: speed up a logical replica setup

2024-07-30 Thread Hayato Kuroda (Fujitsu)
Dear Euler,

I applied your patch and confirmed it could fix the issue [1].

METHOD
===
1. shortened the snapshot interval to 3ms, and
```
-#define LOG_SNAPSHOT_INTERVAL_MS 15000
+#define LOG_SNAPSHOT_INTERVAL_MS 3
```
2. ran test 040_pg_createsubscriber.pl many times.
   Before the patch, I could reproduce the failure in under an hour of runs.
   After the patch, I ran for more than an hour but could not reproduce it.

Since this is a timing issue that I could not reliably reproduce, I'd like
others to test it as well.

[1]: 
https://www.postgresql.org/message-id/68de6498-0449-a113-dd03-e198dded0bac%40gmail.com

Best regards,
Hayato Kuroda
FUJITSU LIMITED





RE: speed up a logical replica setup

2024-07-29 Thread Hayato Kuroda (Fujitsu)
Dear Euler,

> Hayato already mentioned one of the solution in a previous email [2].
> AFAICS we can use any solution that creates a WAL record. I tested the
> following options:

Yes, my point was that we should add an arbitrary record after the 
recovery_target_lsn.

> (a) temporary replication slot: requires an additional replication slot.
> small payload. it is extremely slow in comparison with the other
> options.
> (b) logical message: can be consumed by logical replication when/if it
> is supported some day. big payload. fast.
> (c) snapshot of running txn:  small payload. fast.
> (d) named restore point: biggest payload. fast.

Another point of view is whether each option triggers a flush of the generated
WAL record. You've tested pg_logical_emit_message() with flush=false, which
means the WAL is not flushed, so wait_for_end_recovery() has to wait; IIUC it
may wait up to 15 seconds, which is the interval at which the bgwriter works.
If flush is set to true, the execution time will be longer.
pg_create_restore_point() also does not flush the generated record, so it may
be problematic. Moreover, the restore point name might conflict with one that
users want to use.

Overall, I can agree with using pg_log_standby_snapshot(). It flushes the WAL
asynchronously but ensures the timing is not delayed much.
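
For reference, here is how I understand the tested options map to SQL calls
(slot/restore-point names and payloads are placeholders; the flush argument of
pg_logical_emit_message() is the one discussed above):

```
-- (a) temporary replication slot (slow per the measurements)
SELECT pg_create_logical_replication_slot('tmp_marker', 'pgoutput', true);

-- (b) logical message; flush => false was the tested setting
SELECT pg_logical_emit_message(false, 'pg_createsubscriber', 'marker', false);

-- (c) snapshot of running txns; flushed asynchronously
SELECT pg_log_standby_snapshot();

-- (d) named restore point; the record is not flushed here either
SELECT pg_create_restore_point('pg_createsubscriber_marker');
```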

Best regards,
Hayato Kuroda
FUJITSU LIMITED



RE: [Proposal] Add foreign-server health checks infrastructure

2024-07-28 Thread Hayato Kuroda (Fujitsu)
Dear Fujii-san,

> > IIUC, the patch which adds the user_name attribute to get_connection() can
> > be discussed at a later stage, is that right?
> 
> No, let's work on the patch at this stage :)

OK, here is a rebased patch.

- Changed the name of new API from `GetUserMappingFromOid` to 
`GetUserMappingByOid`
  to keep the name consistent with others.
- Comments and docs were updated.

Best regards,
Hayato Kuroda
FUJITSU LIMITED



0001-Extend-postgres_fdw_get_connections-to-return-user-n.patch
Description:  0001-Extend-postgres_fdw_get_connections-to-return-user-n.patch


RE: [Proposal] Add foreign-server health checks infrastructure

2024-07-28 Thread Hayato Kuroda (Fujitsu)
Dear Fujii-san,

Thanks for pushing and analyzing the failure!

> The regression.diffs shows that pgfdw_conn_check returned 0 even though
> pgfdw_conn_checkable()
> returned true. This can happen if the "revents" from poll() indicates 
> something
> other than
> POLLRDHUP. I think that "revents" could indicate POLLHUP, POLLERR, or
> POLLNVAL. Therefore,
> IMO pgfdw_conn_check() should be updated as follows. I will test this change.
> 
> -   return (input_fd.revents & POLLRDHUP) ? 1 : 0;
> +   return (input_fd.revents &
> +   (POLLRDHUP | POLLHUP | POLLERR |
> POLLNVAL)) ? 1 : 0;

I think you are right.
According to the man page, the input socket is invalid or disconnected if 
revents
has such bits. So such cases should be also regarded as *failure*.
After the fix, the animal said OK. Great works!

Best regards,
Hayato Kuroda
FUJITSU LIMITED



RE: [Proposal] Add foreign-server health checks infrastructure

2024-07-26 Thread Hayato Kuroda (Fujitsu)
Dear Fujii-san,

Just in case: based on the agreement in [1], I updated the patches to keep them
consistent, so we can work from the same picture in further discussions.

IIUC, the patch which adds the user_name attribute to get_connection() can be
discussed at a later stage, is that right?

[1]: 
https://www.postgresql.org/message-id/768032ee-fb57-494b-b674-1ccb65b6f969%40oss.nttdata.com

Best regards,
Hayato Kuroda
FUJITSU LIMITED



v44-0001-doc-Enhance-documentation-for-postgres_fdw_get_c.patch
Description:  v44-0001-doc-Enhance-documentation-for-postgres_fdw_get_c.patch


v44-0002-postgres_fdw-Add-used_in_xact-column-to-postgres.patch
Description:  v44-0002-postgres_fdw-Add-used_in_xact-column-to-postgres.patch


v44-0003-postgres_fdw-Add-connection-status-check-to-post.patch
Description:  v44-0003-postgres_fdw-Add-connection-status-check-to-post.patch


RE: [Proposal] Add foreign-server health checks infrastructure

2024-07-25 Thread Hayato Kuroda (Fujitsu)
I apologize for posting an incomplete message; here is the correct version.

Dear Fujii-san,

> Thanks for updating the patches!
> 
> I’ve created a new base patch and split the v42-0001 patch into two parts
> to implement the feature and improvements step by step. After pushing these
> patches, I’ll focus on the v42-0002 patch next.

+1.

> I’ve attached the three patches.
> 
> v43-0001:
> This new patch enhances documentation for postgres_fdw_get_connections()
> output columns. The output columns were documented in text format,
> which is manageable for the current two columns. However, upcoming patches
> will add new columns, making text descriptions less readable.
> This patch updates the documentation to use a table format,
> making it easier for users to understand each output column.

Agreed on adding the table. I ran a proofreading tool, and it pointed out the
items below. Please apply them if they are acceptable.

```
+  This function returns information about all open connections that
```
-> "that" can be removed.

```
+ the current transaction but its foreign server or
```
-> can add comma before "but".

> I’ve also made several code improvements, for example adding a typedef for
> pgfdwVersion to simplify its usage, and updated typedefs.list.
> 
> +enum pgfdwVersion
> +{
> + PGFDW_V1_0 = 0,
> + PGFDW_V1_2,
> +}pgfdwVersion;

It was intentionally not added, because while developing pg_createsubscriber,
I got a comment that local-use data does not have to be typedef'ed [1]:
```
- src/bin/pg_basebackup/pg_createsubscriber.c
+typedef struct CreateSubscriberOptions
+typedef struct LogicalRepInfo
I think these kinds of local-use struct don't need to be typedef'ed.
(Then you also don't need to update typdefs.list.)
```

A comment for 0002. 

```
+Datum
+postgres_fdw_get_connections_1_2(PG_FUNCTION_ARGS)
+{
+postgres_fdw_get_connections_internal(fcinfo, PGFDW_V1_2);
+
+return (Datum) 0;
+}
```

I know they have the same meaning, but can you clarify why (Datum) 0 is
returned instead of PG_RETURN_VOID()?

[1]: 
https://www.postgresql.org/message-id/3ee79f2c-e8b3-4342-857c-a31b87e1afda%40eisentraut.org

Best regards,
Hayato Kuroda
FUJITSU LIMITED



RE: [Proposal] Add foreign-server health checks infrastructure

2024-07-25 Thread Hayato Kuroda (Fujitsu)
Dear Fujii-san,

> 
> Thanks for updating the patches!
> 
> I’ve created a new base patch and split the v42-0001 patch into two parts
> to implement the feature and improvements step by step. After pushing these
> patches, I’ll focus on the v42-0002 patch next.

Best regards,
Hayato Kuroda
FUJITSU LIMITED
> 
> I’ve attached the three patches.
> 
> v43-0001:
> This new patch enhances documentation for postgres_fdw_get_connections()
> output columns. The output columns were documented in text format,
> which is manageable for the current two columns. However, upcoming patches
> will add new columns, making text descriptions less readable.
> This patch updates the documentation to use a table format,
> making it easier for users to understand each output column.
> 
> v43-0002:
> This patch adds the "used_in_xact" column to postgres_fdw_get_connections().
> It separates this change from the original v42-0001 patch for clarity.
> 
> v43-0003
> This patch adds the "closed" column to postgres_fdw_get_connections().
> 
> I’ve also made several code improvements, for example adding a typedef for
> pgfdwVersion to simplify its usage, and updated typedefs.list.
> 
> +enum pgfdwVersion
> +{
> + PGFDW_V1_0 = 0,
> + PGFDW_V1_2,
> +}pgfdwVersion;
> 
> Regards,
> 
> --
> Fujii Masao
> Advanced Computing Technology Center
> Research and Development Headquarters
> NTT DATA CORPORATION


RE: Parallel heap vacuum

2024-07-25 Thread Hayato Kuroda (Fujitsu)
Dear Sawada-san,

> Thank you for the test!
> 
> I could reproduce this issue and it's a bug; it skipped even
> non-all-visible pages. I've attached the new version patch.
> 
> BTW since we compute the number of parallel workers for the heap scan
> based on the table size, it's possible that we launch multiple workers
> even if most blocks are all-visible. It seems to be better if we
> calculate it based on (relpages - relallvisible).

Thanks for updating the patch. I applied it and confirmed that all pages are
scanned. I used almost the same script (just changing
max_parallel_maintenance_workers; see the sketch after the numbers) and got the
results below. I think the tendency is the same as yours.

```
parallel 0: 61114.369 ms
parallel 1: 34870.316 ms
parallel 2: 23598.115 ms
parallel 3: 17732.363 ms
parallel 4: 15203.271 ms
parallel 5: 13836.025 ms
```
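
A sketch of how I varied the knob per run (the table name is a placeholder;
this assumes the patched VACUUM derives its heap-scan worker count from this
setting):

```
-- Repeated for settings 0 through 5, one VACUUM per run.
SET max_parallel_maintenance_workers = 4;
VACUUM (VERBOSE) large_tbl;
```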

I started to read your code, but it is taking some time because the area is new
to me... The part below contains my initial comments.

1.
This patch cannot be built when debug mode is enabled. See [1].
IIUC, this is because NewRelminMxid was moved from struct LVRelState to
PHVShared. So the code should be updated to something like
"vacrel->counters->NewRelminMxid".

2.
I compared this with parallel heap scan and found that it does not have a
compute_worker API. Can you clarify why there is an inconsistency?
(I suspect it is intentional because the calculation logic seems to depend on
the heap structure; should we add the API for table scans as well?)

[1]:
```
vacuumlazy.c: In function ‘lazy_scan_prune’:
vacuumlazy.c:1666:34: error: ‘LVRelState’ {aka ‘struct LVRelState’} has no member named ‘NewRelminMxid’
  Assert(MultiXactIdIsValid(vacrel->NewRelminMxid));
          ^~

```

Best regards,
Hayato Kuroda
FUJITSU LIMITED



RE: [Proposal] Add foreign-server health checks infrastructure

2024-07-24 Thread Hayato Kuroda (Fujitsu)
Dear Fujii-san,

> On 2024/07/18 19:49, Hayato Kuroda (Fujitsu) wrote:
> >> Shouldn't this test also check if the returned user_name is valid?
> >
> > You meant to say that we should print the user_name, right? Done.
> 
> Yes, I think it's better to test if the value in the user_name column is as 
> expected.

But this might cause a test failure on cfbot. See the comment below.

> > - I found an inconsistency of name between source and doc,
> >so I unified to postgres_fdw_can_verify_connection().
> 
> I'm unsure about the necessity of introducing a standalone function to check
> POLLRDHUP availability. Instead of providing
> postgres_fdw_can_verify_connection(),
> could we modify postgres_fdw_get_connections() to return NULL in the "closed"
> column on platforms where POLLRDHUP isn't supported?

Initially I felt that users might not be able to distinguish whether 1) the
connection has not been established yet or 2) the check cannot be done on this
platform. But on further consideration, such servers are not listed by the
function anyway, so I modified it like that.

> > - Also, test patch (0002) was combined into this.
> > 0002:
> > - user_name was added after the `valid` to preserve ordering of the old 
> > version.
> 
> Do we really need to keep this ordering? Wouldn't it be more intuitive to
> have the user_name column next to server_name? In pg_stat_statements,
> for example, the ordering isn't always preserved, as seen with
> WAL-related columns being added in the middle.

I also prefer that style, so I changed it accordingly.

> > Thanks for reviewing the patch! PSA new version.
> 
> Thanks for updating the patches!
> 
> 
> The regression test for postgres_fdw failed with the following diff.
> 
> --
>   SELECT * FROM postgres_fdw_get_connections() ORDER BY 1;
>server_name | valid | user_name | used_in_xact | closed
>   -+---+---+--+
> - loopback| f | hayato| t|
> + loopback| f | runner| t|
>| f |   | t|
>   (2 rows)
> --

This is because the user_name depends heavily on the environment. It does not
look great, but one approach is to print `user_name = CURRENT_USER` to test the
feature (see the sketch below). For loopback3, the user_name is set to NULL, so
the column will be NULL as well. What do you think? Do you have a better idea?
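
A sketch of what I have in mind for the regression test (written against the
v42 columns; the output would then be environment-independent):

```
-- Compare against CURRENT_USER instead of printing the raw name.
SELECT server_name,
       user_name = CURRENT_USER AS user_name_is_current_user,
       valid, used_in_xact, closed
  FROM postgres_fdw_get_connections() ORDER BY 1;
```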

> +  * If requested and the connection is not invalidated,
> check the
> +  * status of the remote connection from the backend
> process and
> +  * return the result. Otherwise returns NULL.
> +  */
> + if (require_verify && !entry->invalidated &&
> entry->conn)
> 
> Should we also consider checking invalidated connections? Even though
> a connection is marked as invalidated, it can still be used until
> the current transaction ends. Therefore, if an invalidated connection
> is used in this transaction (i.e., used_in_xact = true) and has
> already been closed (closed = true), users might want to consider
> rolling back the transaction promptly. Thought?

I confirmed the meaning of the `invalidated` attribute. IIUC:

- It is set to true when the server or user mapping is altered, but
- the connection has already been opened within the transaction.

If the entry is invalidated, the server_name returned by
postgres_fdw_get_connections is set to NULL (because the entry may have been
modified). Also, the connection is discarded when the transaction ends.
Based on that understanding, yes, the connection should also be checked. One
concern is that the user may not recognize which connection is lost (because
the column may be blank).

> + -- Set client_min_messages to ERROR temporary because the
> following
> + -- function only throws a WARNING on the supported platform.
> 
> Is this still true? From my reading of the code, it doesn't appear
> that the function throws a WARNING.

Good finding, removed.

The attached patches contain the above fixes plus comment improvements
suggested by GPT-4o.

Best regards,
Hayato Kuroda
FUJITSU LIMITED



v42-0001-postgres_fdw-Allow-postgres_fdw_get_connections-.patch
Description:  v42-0001-postgres_fdw-Allow-postgres_fdw_get_connections-.patch


v42-0002-Extend-postgres_fdw_get_connections-to-return-us.patch
Description:  v42-0002-Extend-postgres_fdw_get_connections-to-return-us.patch


Found issues related with logical replication and 2PC

2024-07-23 Thread Hayato Kuroda (Fujitsu)
Hi hackers,

While creating a patch which allows ALTER SUBSCRIPTION SET (two_phase) [1],
we found some issues related to logical replication and two_phase. I think
these can happen not only on HEAD but also on PG14+, but for now I've shared
patches for HEAD.

Issue #1

When handling a PREPARE message, the subscriber mistakes the wrong LSN (the end
position of the last commit) for the end position of the current prepare. This
can be fixed by adding a new global variable that records the end position of
the last prepare. The 0001 patch fixes the issue.

Issue #2

When the subscriber enables two-phase commit but doesn't set
max_prepared_transactions > 0, and a transaction is prepared on the publisher,
the apply worker reports an ERROR on the subscriber. After that, the prepared
transaction is not replayed, which means it's lost forever. The attached script
can emulate the situation.

--
ERROR:  prepared transactions are disabled
HINT:  Set "max_prepared_transactions" to a nonzero value.
--

The reason is that we advance the origin progress when aborting the transaction
as well (RecordTransactionAbort -> replorigin_session_advance). So, after
replorigin_session_origin_lsn is set, if any ERROR happens while preparing the
transaction, the transaction aborts, which incorrectly advances the origin LSN.

The easiest fix is to reset the session replication origin before calling
RecordTransactionAbort(). I think this can happen when 1) LogicalRepApplyLoop()
raises an ERROR or 2) the apply worker exits. The 0002 patch fixes the issue.


What do you think?

[1]: 
https://www.postgresql.org/message-id/flat/8fab8-65d74c80-1-2f28e880@39088166

Best regards,
Hayato Kuroda
FUJITSU LIMITED



test_2pc.sh
Description: test_2pc.sh


0001-Add-XactLastPrepareEnd-to-indicate-the-last-PREPARE-.patch
Description:  0001-Add-XactLastPrepareEnd-to-indicate-the-last-PREPARE-.patch


0002-Prevent-origin-progress-advancement-if-failed-to-app.patch
Description:  0002-Prevent-origin-progress-advancement-if-failed-to-app.patch


RE: pg_upgrade and logical replication

2024-07-22 Thread Hayato Kuroda (Fujitsu)
Dear Amit, Michael,

> > I am not sure to get the reason why get_old_cluster_logical_slot_infos()
> > could not be optimized, TBH.  LogicalReplicationSlotHasPendingWal()
> > uses the fast forward mode where no changes are generated, hence there
> > should be no need for a dependency to a connection to a specific
> > database :)
> >
> > Combined to a hash table based on the database name and/or OID to know
> > to which dbinfo to attach the information of a slot, then it should be
> > possible to use one query, making the slot info gathering closer to
> > O(N) rather than the current O(N^2).
> >
> 
> The point is that unlike subscriptions logical slots are not
> cluster-level objects. So, this needs more careful design decisions
> rather than a fix-up patch for PG-17. One more thing after collecting
> slot-level, we also want to consider the creation of slots which again
> are created at per-database level.

I also considered the combination with the optimization (parallelization) of
pg_upgrade [1]. IIUC, that patch connects to several databases in parallel and
runs commands. The current style of create_logical_replication_slots() can
easily be adapted to it because the tasks are divided per database.

However, if we change get_old_cluster_logical_slot_infos() to work in a single
pass, we would have to turn LogicalSlotInfoArr into cluster-wide data and store
the database name in LogicalSlotInfo. Also, in
create_logical_replication_slots(), we would have to look up the database for
every slot and connect to the appropriate one. These changes would make it
difficult to parallelize the operation.

[1]: 
https://www.postgresql.org/message-id/flat/20240516211638.GA1688936@nathanxps13

Best regards,
Hayato Kuroda
FUJITSU LIMITED


RE: Slow catchup of 2PC (twophase) transactions on replica in LR

2024-07-21 Thread Hayato Kuroda (Fujitsu)
Dear Amit,

> + /*
> + * Do not allow changing the option if the subscription is enabled. This
> + * is because both failover and two_phase options of the slot on the
> + * publisher cannot be modified if the slot is currently acquired by the
> + * existing walsender.
> + */
> + if (sub->enabled)
> + ereport(ERROR,
> + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> + errmsg("cannot set %s for enabled subscription",
> + option)));
> 
> As per my understanding, the above comment is not true when we are
> changing 'two_phase' option from 'false' to 'true' because in that
> case, the existing walsender will only change it. So, ideally, we can
> allow toggling two_phase from 'false' to 'true' without the above
> restriction.

Hmm, yes. In the "false" -> "true" case, the slot parameter is not changed by
the backend process. In this case, subtwophasestate is changed to PENDING
first, and then the walsender changes it to ENABLED based on the worker's
request.
> If this is correct then we don't even need to error for the case
> "cannot alter two_phase when logical replication worker is still
> running" when 'two_phase' option is changed from 'false' to 'true'.

Basically right; one note is that there is an Assert in
maybe_reread_subscription() that should also be modified.

> Now, assuming the above observations are correct, we may still want to
> have the same behavior when toggling two_phase option but we can at
> least note down that in the comments so that if required the same can
> be changed when toggling 'two_phase' option from 'false' to 'true' in
> future.
> 
> Thoughts?

+1 for adding comments in CheckAlterSubOption(). How about the draft below?

```
@@ -1089,6 +1089,12 @@ CheckAlterSubOption(Subscription *sub, const char 
*option,
  * is because both failover and two_phase options of the slot on the
  * publisher cannot be modified if the slot is currently acquired by the
  * existing walsender.
+ *
+ * XXX: when toggling two_phase from "false" to "true", the slot parameter
+ * is not modified by the backend process, so no lock conflict can
+ * occur. The restarted walsender will do the alteration. Therefore, we
+ * could allow the switch without this restriction. This can be changed in
+ * the future based on the requirements.
```


Best regards,
Hayato Kuroda
FUJITSU LIMITED



RE: [Proposal] Add foreign-server health checks infrastructure

2024-07-18 Thread Hayato Kuroda (Fujitsu)
Dear Fujii-san,

Hi, long time no see :-).
Thanks for reviewing the patch! PSA new version.

> I've just started reviewing them.
> 
> Here are the review comments for 0001 patch:
> 
> Do we really need postgres_fdw_verify_connection_all()? The proposed feature
> aims to check if all postgres_fdw connections used within the transaction
> are still open. If any of those connections are closed, the transaction
> can't be committed successfully, so users can roll back immediately upon
> detecting a closed connection.
> 
> However, postgres_fdw_verify_connection_all() checks all connections made in
> the session, not just those used in the current transaction. This means
> users can't determine if they should roll back the transaction based on
> its return value. Therefore, I'm concerned that
> postgres_fdw_verify_connection_all() might not be very useful. Thoughts?

Right. My primary motivation is to detect disconnection from remote servers and
abort the current transaction as soon as possible. For that purpose, as I
posted in [1], ...all() is not so helpful. I agree to remove the function from
the patch set. Done.

> Considering the purpose of this feature, wouldn't it be better to extend
> postgres_fdw_get_connections() to include a "used_in_xact" column
> (indicating whether the connection has been used in the current transaction)
> and a "closed" column (indicating whether the connection has been closed)?
> This approach might be more effective than introducing a new function
> like the postgres_fdw_verify_connection family?
> 
> If it's too much to check if the connection is closed by default whenever
> calling postgres_fdw_get_connections(), we could modify it to accept
> an argument indicating whether to perform this check. Thoughts?

Sounds interesting. If we can accept changing the definition of a pre-existing
function, that seems better. To keep the default behavior, an input parameter
should be added. The attached patch implements this.
> Here are the review comments for 0003 patch:
> 
> The source comment in postgres_fdw_get_connections() should mention
> the return value user_name.

Document was updated.

> We also need to handle the case where the user mapping used by
> the connection cache has been dropped. Otherwise, this could
> lead to an error.
> 
> -
> =# BEGIN;
> =*# SELECT * FROM postgres_fdw_get_connections();
>   server_name | user_name | valid
> -+---+---
>   loopback| public| t
> (1 row)
> 
> =*# DROP USER MAPPING FOR public SERVER loopback ;
> =*# SELECT * FROM postgres_fdw_get_connections();
> ERROR:  cache lookup failed for user mapping 16409
> -

Fixed.

> -SELECT * FROM postgres_fdw_get_connections() ORDER BY 1;
> +SELECT server_name, valid FROM postgres_fdw_get_connections() ORDER BY
> 1;
> 
> Shouldn't this test also check if the returned user_name is valid?

You meant to say that we should print the user_name, right? Done.

> + server_name | user_name | validvalid
> +-+
> + loopback1   | postgres  | t
> + loopback2   |   | f
> 
> The column name "validvalid" should be "valid".

Right, fixed.

> How can we cause the record with server_name != NULL but user_name = NULL?

Now this can happen when the user mapping is dropped; the example was updated
accordingly.

Below is a summary of the changes.
0001:
- Instead of adding new functions, postgres_fdw_get_connections() was extended.
  Some tricks were added to support old versions; I followed
  pg_stat_statements.c. Attributes were added after `valid` to preserve the
  column ordering of the old version.
- I found an inconsistency in naming between the source and the docs,
  so I unified it to postgres_fdw_can_verify_connection().
- Also, the test patch (0002) was merged into this one.
0002:
- user_name was added after `valid` to preserve the column ordering of the old
  version.
- GetUserMappingFromOid() is allowed to miss a tuple.


[1]: 
https://www.postgresql.org/message-id/TYAPR01MB58668728393648C2F7DC7C85F5399%40TYAPR01MB5866.jpnprd01.prod.outlook.com

Best regards,
Hayato Kuroda
FUJITSU LIMITED



v41-0001-postgres_fdw-Allow-postgres_fdw_get_connections-.patch
Description:  v41-0001-postgres_fdw-Allow-postgres_fdw_get_connections-.patch


v41-0002-Extend-postgres_fdw_get_connections-to-return-us.patch
Description:  v41-0002-Extend-postgres_fdw_get_connections-to-return-us.patch


RE: Slow catchup of 2PC (twophase) transactions on replica in LR

2024-07-17 Thread Hayato Kuroda (Fujitsu)
Dear Peter,

Thanks for the comments! PSA new version.
I think most of the comments were addressed, and I ran pgindent/pgperltidy
again.

Regarding CheckAlterSubOption(), the ordering is still preserved because I
preferred to keep the existing behavior, but I agree that your approach makes
the code simpler.

BTW, I've started to think the patches could be merged in a future version,
because they must be committed together and the code is not so long. Thoughts?

Best regards,
Hayato Kuroda
FUJITSU LIMITED



v20-0001-Allow-altering-of-two_phase-option-of-a-SUBSCRIP.patch
Description:  v20-0001-Allow-altering-of-two_phase-option-of-a-SUBSCRIP.patch


v20-0002-Alter-slot-option-two_phase-only-when-altering-t.patch
Description:  v20-0002-Alter-slot-option-two_phase-only-when-altering-t.patch


v20-0003-Notify-users-to-roll-back-prepared-transactions.patch
Description:  v20-0003-Notify-users-to-roll-back-prepared-transactions.patch


RE: speed up a logical replica setup

2024-07-17 Thread Hayato Kuroda (Fujitsu)
Dear Amit,

> Your analysis sounds correct to me.

Okay, so we now have the same picture...

> > IIUC, the root cause is that pg_create_logical_replication_slot() returns a 
> > LSN
> > which is not generated yet. So, I think both mine [1] and Euler's approach 
> > [2]
> > can solve the issue. My proposal was to add an extra WAL record after the 
> > final
> > slot creation, and Euler's one was to use a restart_lsn as the
> recovery_target_lsn.
> >
> 
> I don't think it is correct to set restart_lsn as consistent_lsn point
> because the same is used to set replication origin progress. Later
> when we start the subscriber, the system will use that LSN as a
> start_decoding_at point which is the point after which all the commits
> will be replicated. So, we will end up incorrectly using restart_lsn
> (LSN from where we start reading the WAL) as start_decoding_at point.
> How could that be correct?

I didn't say we could use restart_lsn as the consistent point of logical
replication, but I agree the approach has issues.

> Now, even if we use restart_lsn as recovery_target_lsn and the LSN
> returned by pg_create_logical_replication_slot() as consistent LSN to
> set replication progress, that also could lead to data loss because
> the subscriber may never get data between restart_lsn value and
> consistent LSN value.

You are considering the case where, e.g., tuples are inserted just after
restart_lsn but before the RUNNING_XACT record? In that case, yes, streaming
replication finishes before those tuples are replicated, and logical
replication will skip them. Euler's approach cannot be used as-is.

Best regards,
Hayato Kuroda
FUJITSU LIMITED



RE: speed up a logical replica setup

2024-07-17 Thread Hayato Kuroda (Fujitsu)
Dear Alexander, Euler, Amit,

I also analyzed this failure; let me share it. I think the events occurred in
the following order.

1. Backend created a publication on $db2,
2. BGWriter generated RUNNING_XACT record, then
3. Backend created a replication slot on $db2.

In this case, the recovery_target_lsn is ahead of the RUNNING_XACT record
generated at step 3. Also, since both the bgwriter and slot creation mark the
record as an *UNIMPORTANT* one, the bgwriter won't log a new standby snapshot
even after LOG_SNAPSHOT_INTERVAL_MS. The rule is written in BackgroundWriterMain():

```
			/*
			 * Only log if enough time has passed and interesting records have
			 * been inserted since the last snapshot.  Have to compare with <=
			 * instead of < because GetLastImportantRecPtr() points at the
			 * start of a record, whereas last_snapshot_lsn points just past
			 * the end of the record.
			 */
			if (now >= timeout &&
				last_snapshot_lsn <= GetLastImportantRecPtr())
			{
				last_snapshot_lsn = LogStandbySnapshot();
				last_snapshot_ts = now;
			}
```

Therefore, pg_createsubscriber waited until a new record was replicated, but no
activity was recorded, causing a timeout. Since this is a timing issue, Alexander
could reproduce the failure with a shorter duration and parallel runs.

IIUC, the root cause is that pg_create_logical_replication_slot() returns an LSN
that has not been generated yet. So, I think both my approach [1] and Euler's [2]
can solve the issue. My proposal was to add an extra WAL record after the final
slot creation, and Euler's was to use the restart_lsn as the recovery_target_lsn.
On the primary server, the restart_lsn is set to the latest WAL insert position,
and the RUNNING_XACT record is generated after that.
What do you think?

[1]: 
https://www.postgresql.org/message-id/osbpr01mb25521b15bf950d2523bbe143f5...@osbpr01mb2552.jpnprd01.prod.outlook.com
[2]: 
https://www.postgresql.org/message-id/b1f0f8c7-8f01-4950-af77-339df3dc4684%40app.fastmail.com

Best regards,
Hayato Kuroda
FUJITSU LIMITED



RE: Slow catchup of 2PC (twophase) transactions on replica in LR

2024-07-16 Thread Hayato Kuroda (Fujitsu)
Dear Hou, Peter,

Thanks for the comments! PSA new version.
Almost all comments were addressed.
What's new:
0001 - IsTwoPhaseTransactionGidForSubid() was updated per the comment from
  Hou-san [1]. Some nitpicks were accepted.
0002 - An argument in CheckAlterSubOption() was renamed to "slot_needs_update".
  Some nitpicks were accepted.
0003 - Some nitpicks were accepted.

The part below explains why I rejected some comments.

> CommonChecksForFailoverAndTwophase:
> nitpick - added Assert for the generic-looking "option" parameter name

That style looks strange to me; using multiple strcmp() calls is more
straightforward. I added it like that.
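
A minimal sketch of what was added, assuming the parameter is named "option"
as in the nitpick:

```c
/* Sketch only: guard against unexpected option names */
Assert(strcmp(option, "failover") == 0 ||
	   strcmp(option, "two_phase") == 0);
```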

> 1c.
> If the error checks can be moved to be done up-front, then all the 
> 'needs_update'
> can be combined. Avoiding multiple checks to 'needs_update' will make this 
> function simpler.

This style was needed to preserve the error conditions for the failover option.
Not changed.

[1]: 
https://www.postgresql.org/message-id/OS3PR01MB571834FBD3E6D3804484038F94A32%40OS3PR01MB5718.jpnprd01.prod.outlook.com

Best regards,
Hayato Kuroda
FUJITSU LIMITED



v19-0001-Allow-altering-of-two_phase-option-of-a-SUBSCRIP.patch
Description:  v19-0001-Allow-altering-of-two_phase-option-of-a-SUBSCRIP.patch


v19-0002-Alter-slot-option-two_phase-only-when-altering-t.patch
Description:  v19-0002-Alter-slot-option-two_phase-only-when-altering-t.patch


v19-0003-Notify-users-to-roll-back-prepared-transactions.patch
Description:  v19-0003-Notify-users-to-roll-back-prepared-transactions.patch


RE: Slow catchup of 2PC (twophase) transactions on replica in LR

2024-07-15 Thread Hayato Kuroda (Fujitsu)
Dear Amit, Hou,

Thanks for the comments! PSA new versions.
What's new:

0001: included Hou's patch [1] so as not to overwrite slot options.
  Some other comments were also addressed.
0002: not much changed, just rebased.
0003: a typo was fixed, s/Alter/After/.

[1]: 
https://www.postgresql.org/message-id/OS3PR01MB57184E0995521300AC06CB4B94A72%40OS3PR01MB5718.jpnprd01.prod.outlook.com
 

Best regards,
Hayato Kuroda
FUJITSU LIMITED



v18-0001-Allow-altering-of-two_phase-option-of-a-SUBSCRIP.patch
Description:  v18-0001-Allow-altering-of-two_phase-option-of-a-SUBSCRIP.patch


v18-0002-Alter-slot-option-two_phase-only-when-altering-t.patch
Description:  v18-0002-Alter-slot-option-two_phase-only-when-altering-t.patch


v18-0003-Notify-users-to-roll-back-prepared-transactions.patch
Description:  v18-0003-Notify-users-to-roll-back-prepared-transactions.patch


RE: Slow catchup of 2PC (twophase) transactions on replica in LR

2024-07-09 Thread Hayato Kuroda (Fujitsu)
> 0001 - Codes for SUBOPT_TWOPHASE_COMMIT are moved per requirement [1].
>Also, checks for failover and two_phase are unified into one function.
> 0002 - updated accordingly. An argument for the check function is added.
> 0003 - this contains documentation changes required in [2].

The previous patch set could not be accepted due to a missed initialization.
PSA new version.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



v17-0001-Allow-altering-of-two_phase-option-of-a-SUBSCRIP.patch
Description:  v17-0001-Allow-altering-of-two_phase-option-of-a-SUBSCRIP.patch


v17-0002-Alter-slot-option-two_phase-only-when-altering-t.patch
Description:  v17-0002-Alter-slot-option-two_phase-only-when-altering-t.patch


v17-0003-Notify-users-to-roll-back-prepared-transactions.patch
Description:  v17-0003-Notify-users-to-roll-back-prepared-transactions.patch


RE: Slow catchup of 2PC (twophase) transactions on replica in LR

2024-07-09 Thread Hayato Kuroda (Fujitsu)
Dear Amit,

> > I see that in 0003/0004, the patch first aborts pending prepared
> > transactions, update's catalog, and then change slot's property via
> > walrcv_alter_slot. What if there is any ERROR (say the remote node is
> > not reachable or there is an error while updating the catalog) after
> > we abort the pending prepared transaction? Won't we end up with lost
> > prepared transactions in such a case?

Yes, the case can happen with v16, because FinishPreparedTransaction() itself is
not a transactional operation. In the example below, the subscription was altered
after stopping the publisher. You can see that the prepared transactions were
rolled back.

```
subscriber=# SELECT gid FROM pg_prepared_xacts ;
       gid        
------------------
 pg_gid_16390_741
 pg_gid_16390_742
(2 rows)
subscriber=# ALTER SUBSCRIPTION sub SET (TWO_PHASE = off, FORCE_ALTER = on);
NOTICE:  requested altering to two_phase = false but there are prepared 
transactions done by the subscription
DETAIL:  Such transactions are being rollbacked.
ERROR:  could not connect to the publisher: connection to server on socket 
"/tmp/.s.PGSQL.5431" failed: No such file or directory
Is the server running locally and accepting connections on that socket?
subscriber=# SELECT gid FROM pg_prepared_xacts ;
 gid 
-----
(0 rows)
```

> Considering the above is a problem the other possibility I thought of
> is to change the order like abort prepared xacts after slot update.
> That is also dangerous because any failure while aborting could make a
> slot change permanent whereas the subscription option will still be
> old value. Now, because the slot's two_phase property is off, at
> commit, it can resend the entire transaction which can create a
> problem because the corresponding prepared transaction will already be
> present.

I feel it is a rare case but still possible, e.g., a race condition around
TwoPhaseStateLock, OOM, disk failures, and so on.
And since prepared transactions hold locks, the duplicated arrival of
transactions may cause table-lock failures.

> One more thing to think about in this regard is what if we fail after
> aborting a few prepared transactions and not all?

It's a bit hard to emulate, but I imagine some of the prepared transactions would remain.

> At this stage, I am not able to think of a good solution for these
> problems. So, if we don't get a solution for these, we can document
> that users can first manually abort prepared transactions and then
> switch off the two_phase option using Alter Subscription command.

I'm also not sure what we should do. Ideally, it would be nice to make
FinishPreparedTransaction() transactional, but I'm not sure that is realistic. So
the changes for aborting prepared txns were removed, and a documentation patch
was added instead.

Here is a summary of the updates. The dropping-prepared-transactions patch
was removed for now.

0001 - Codes for SUBOPT_TWOPHASE_COMMIT are moved per requirement [1].
   Also, checks for failover and two_phase are unified into one function.
0002 - updated accordingly. An argument for the check function is added.
0003 - this contains documentation changes required in [2].

[1]: 
https://www.postgresql.org/message-id/CAA4eK1%2BFRrL_fLWLsWQGHZRESg39ixzDX_S9hU8D7aFtU%2Ba8uQ%40mail.gmail.com
[2]: 
https://www.postgresql.org/message-id/CAA4eK1Khy_YWFoQ1HOF_tGtiixD8YoTg86coX1-ckxt8vK3U%3DQ%40mail.gmail.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



v17-0001-Allow-altering-of-two_phase-option-of-a-SUBSCRIP.patch
Description:  v17-0001-Allow-altering-of-two_phase-option-of-a-SUBSCRIP.patch


v17-0002-Alter-slot-option-two_phase-only-when-altering-t.patch
Description:  v17-0002-Alter-slot-option-two_phase-only-when-altering-t.patch


v17-0003-Notify-users-to-roll-back-prepared-transactions.patch
Description:  v17-0003-Notify-users-to-roll-back-prepared-transactions.patch


RE: Slow catchup of 2PC (twophase) transactions on replica in LR

2024-07-09 Thread Hayato Kuroda (Fujitsu)
Dear Amit,

Thanks for the comments! Here I wanted to reply to one of them.

> What should be the behavior if one tries to set slot_name to NONE and
> also tries to toggle two_pahse option?

You mean a case like the one below, right?

```
ALTER SUBSCRIPTION sub SET (two_phase = false, slot_name = NONE);
```

For now, we accept such a command. The replication slot that was previously
specified is altered. As you know, this behavior is the same as for the
failover option.

> I feel both options together
> don't makes sense because there is no use in changing two_phase for
> some slot which we are disassociating the subscription from. The same
> could be said for the failover option as well, so if we agree with
> some different behavior here, we can follow the same for failover
> option as well.

While considering this further, I started to think the combination of slot_name
and two_phase should not be allowed. Even if both of them are altered at the
same time, the *old* slot will be modified by the backend process. I feel this
inconsistency should not happen. The check will be added in the next patch.
I also think the failover option should be fixed, but it is not touched here.
Let's keep the scope narrow.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



RE: Parallel heap vacuum

2024-07-05 Thread Hayato Kuroda (Fujitsu)
Dear Sawada-san,

> The parallel vacuum we have today supports only for index vacuuming.
> Therefore, while multiple workers can work on different indexes in
> parallel, the heap table is always processed by the single process.
> I'd like to propose $subject, which enables us to have multiple
> workers running on the single heap table. This would be helpful to
> speedup vacuuming for tables without indexes or tables with
> INDEX_CLENAUP = off.

Sounds great. IIUC, vacuuming is still one of the main weak points of postgres.

> I've attached a PoC patch for this feature. It implements only
> parallel heap scans in lazyvacum. We can extend this feature to
> support parallel heap vacuum as well in the future or in the same
> patch.

Before diving in deeply, I tested your PoC but found an unclear point.
When vacuuming is requested with parallel > 0 on almost the same workload
as yours, only the first page is scanned and cleaned up.

When parallel was set to zero, I got:
```
INFO:  vacuuming "postgres.public.test"
INFO:  finished vacuuming "postgres.public.test": index scans: 0
pages: 0 removed, 2654868 remain, 2654868 scanned (100.00% of total)
tuples: 12000 removed, 48000 remain, 0 are dead but not yet removable
removable cutoff: 752, which was 0 XIDs old when operation ended
new relfrozenxid: 739, which is 1 XIDs ahead of previous value
frozen: 0 pages from table (0.00% of total) had 0 tuples frozen
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item 
identifiers removed
avg read rate: 344.639 MB/s, avg write rate: 344.650 MB/s
buffer usage: 2655045 hits, 2655527 misses, 2655606 dirtied
WAL usage: 1 records, 1 full page images, 937 bytes
system usage: CPU: user: 39.45 s, system: 20.74 s, elapsed: 60.19 s
```

This means that all pages were indeed scanned and the dead tuples were removed.
However, when parallel was set to one, I got a different result:

```
INFO:  vacuuming "postgres.public.test"
INFO:  launched 1 parallel vacuum worker for table scanning (planned: 1)
INFO:  finished vacuuming "postgres.public.test": index scans: 0
pages: 0 removed, 2654868 remain, 1 scanned (0.00% of total)
tuples: 12 removed, 0 remain, 0 are dead but not yet removable
removable cutoff: 752, which was 0 XIDs old when operation ended
frozen: 0 pages from table (0.00% of total) had 0 tuples frozen
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item 
identifiers removed
avg read rate: 92.952 MB/s, avg write rate: 0.845 MB/s
buffer usage: 96 hits, 660 misses, 6 dirtied
WAL usage: 1 records, 1 full page images, 937 bytes
system usage: CPU: user: 0.05 s, system: 0.00 s, elapsed: 0.05 s
```

It looks like only one page was scanned and only 12 tuples were removed.
This seems very strange to me...

The attached script emulates my test. IIUC it is almost the same as yours,
except that the instance was restarted before vacuuming.

Can you reproduce this and see the reason? I can provide further information
on request.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



parallel_test.sh
Description: parallel_test.sh


RE: Slow catchup of 2PC (twophase) transactions on replica in LR

2024-07-05 Thread Hayato Kuroda (Fujitsu)
Dear Amit,

Sorry, I forgot to mention one thing.

> > But that is not a good reason for this operation to stop workers
> > first. Instead, we should prohibit this operation if any worker is
> > present. The reason is that there is always a chance that if any
> > worker is alive, it can prepare a new transaction after we have
> > checked for the presence of any prepared transactions.
> 
> I used the function because it internally waits until all workers are exited.
> But OK, I modified like you suggested (logicalrep_workers_find() is used).

For that reason, after the above modification, the test code prior to v14
sometimes failed because the backend could execute ALTER SUBSCRIPTION ... SET
(two_phase) while workers were still alive. So I added lines to the test code
to poll until the workers have exited, e.g.:

```
+# Alter subscription two_phase to false
+$node_subscriber->safe_psql('postgres',
+	"ALTER SUBSCRIPTION tap_sub_copy DISABLE;");
+$node_subscriber->poll_query_until('postgres',
+	"SELECT count(*) = 0 FROM pg_stat_activity WHERE backend_type = 'logical replication worker'"
+);
+$node_subscriber->safe_psql(
+	'postgres', "
+    ALTER SUBSCRIPTION tap_sub_copy SET (two_phase = false);
+    ALTER SUBSCRIPTION tap_sub_copy ENABLE;");
```

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



RE: speed up a logical replica setup

2024-07-03 Thread Hayato Kuroda (Fujitsu)
Dear Amit,

> Your analysis looks correct to me. The test could fail due to
> autovacuum. See the following comment in
> 040_standby_failover_slots_sync.
> 
> # Disable autovacuum to avoid generating xid during stats update as otherwise
> # the new XID could then be replicated to standby at some random point making
> # slots at primary lag behind standby during slot sync.
> $publisher->append_conf('postgresql.conf', 'autovacuum = off');
>

Oh, I had not found that comment. I felt it should be added in
040_pg_createsubscriber.pl as well. Done.

> > # Descriptions for attached files
> >
> > An attached script can be used to reproduce the first failure without
> pg_createsubscriber.
> > It requires to modify the code like [1].
> 
> > 0003 patch disables autovacuum for node_p and node_s. I think node_p is
> enough, but did
> > like that just in case. This fixes a second failure.
> >
> 
> Disabling on the primary node should be sufficient. Let's do the
> minimum required to stabilize this test.

+1, removed.

PSA new version. 0001 has not been changed yet. A comment was added
in 0002 to clarify why we must wait. For 0003, a comment was added and
the setting for the standby was reverted.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



v3-0001-emit-dummy-message-while-setting-up-the-publisher.patch
Description:  v3-0001-emit-dummy-message-while-setting-up-the-publisher.patch


v3-0002-wait-until-RUNNING_XACT-is-replicated.patch
Description: v3-0002-wait-until-RUNNING_XACT-is-replicated.patch


v3-0003-disable-autovacuum-while-testing.patch
Description: v3-0003-disable-autovacuum-while-testing.patch


RE: speed up a logical replica setup

2024-07-02 Thread Hayato Kuroda (Fujitsu)
   This operation is done in
   CreateInitDecodingContext()->GetOldestSafeDecodingContext().
5. After that, the transaction from step 2 reaches the standby node and
   updates its nextXid.
6. Finally, pg_sync_replication_slots() runs on the standby. It finds a failover
   slot on the primary and tries to create it on the standby. However, the
   catalog_xmin on the primary (743) is older than the nextXid of the standby
   (744), so the slot creation is skipped.

To avoid the issue, we can disable autovacuum while testing.
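
A sketch of the change, assuming the node handle is $node_p as in the test file:

```perl
# Sketch: disable autovacuum on the primary so that no background XID
# consumption can advance the standby's nextXid during the test.
$node_p->append_conf('postgresql.conf', 'autovacuum = off');
$node_p->restart;
```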

# Descriptions for attached files

The attached script can be used to reproduce the first failure without
pg_createsubscriber. It requires modifying the code as in [1].

The 0001 patch reduces the wait time of the test; see [3]. I know the approach
is a bit hacky, but it worked.
The 0002 patch waits until the WAL record is replicated. This fixes the first
failure.
The 0003 patch disables autovacuum for node_p and node_s. I think node_p alone
is enough, but I did it that way just in case. This fixes the second failure.


[1]: 
https://www.postgresql.org/message-id/0dffca12-bf17-4a7a-334d-225569de5e6e%40gmail.com
[2]: 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=adder&dt=2024-07-02%2008%3A45%3A39
[3]: 
https://www.postgresql.org/message-id/OSBPR01MB25521B15BF950D2523BBE143F5D32%40OSBPR01MB2552.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



v2-0001-emit-dummy-message-while-setting-up-the-publisher.patch
Description:  v2-0001-emit-dummy-message-while-setting-up-the-publisher.patch


v2-0002-wait-until-RUNNING_XACT-is-replicated.patch
Description: v2-0002-wait-until-RUNNING_XACT-is-replicated.patch


v2-0003-disable-autovacuum-while-testing.patch
Description: v2-0003-disable-autovacuum-while-testing.patch


repro_skipping_slot_sync.sh
Description: repro_skipping_slot_sync.sh


RE: speed up a logical replica setup

2024-07-02 Thread Hayato Kuroda (Fujitsu)
Dear Amit,

> IIUC, the problem is that the consistent_lsn value returned by
> setup_publisher() is the "end +1" location of the required LSN whereas
> the recovery_target_lsn used in wait_for_end_recovery() expects the
> LSN value to be "start" location of required LSN.

Yeah, right. That matches my understanding.

> This sounds like an ugly hack to me and don't know if we can use it.

I also think it is hacky, but I could not find a better solution.

> The ideal way to fix this is to get the start_lsn from the
> create_logical_slot functionality or have some parameter like
> recover_target_end_lsn but I don't know if this is a good time to
> extend such a functionality.

I felt that such an approach might work for HEAD, but it is not suitable for
PG17. An alternative approach I came up with was to insert a tuple while
waiting for the promotion. That would generate a WAL record so that the standby
can finish recovery after applying it. But I'm not sure how to do that, and it
seems to introduce an additional timing issue. Also, it does not improve the
behavior of the command - a normal user may still have to wait for some time.

Do you have any other ideas?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



RE: speed up a logical replica setup

2024-07-01 Thread Hayato Kuroda (Fujitsu)
Dear Tom,

> I have a different but possibly-related complaint: why is
> 040_pg_createsubscriber.pl so miserably slow?  On my machine it
> runs for a bit over 19 seconds, which seems completely out of line
> (for comparison, 010_pg_basebackup.pl takes 6 seconds, and the
> other test scripts in this directory take much less).  It looks
> like most of the blame falls on this step:
> 
> [12:47:22.292](14.534s) ok 28 - run pg_createsubscriber on node S
> 
> AFAICS the amount of data being replicated is completely trivial,
> so that it doesn't make any sense for this to take so long --- and
> if it does, that suggests that this tool will be impossibly slow
> for production use.  But I suspect there is a logic flaw causing
> this.

I analyzed the issue. My elog() debugging showed that time was wasted in
wait_for_end_recovery(). This was caused by the recovery target not being reached.

We are setting recovery_target_lsn to the return value of
pg_create_logical_replication_slot(), which returns the end of the RUNNING_XACT
record. If we use the returned value as recovery_target_lsn as-is, however, we
must wait for additional WAL generation because the parameter requires the
replicated WAL to overtake that point. On my env, the function waited until the
bgwriter emitted the next XLOG_RUNNING_XACTS record.

One simple solution is to add an additional WAL record at the end of the
publisher setup. IIUC, an arbitrary WAL insertion can reduce the waiting time.
The attached patch inserts a small XLOG_LOGICAL_MESSAGE record, which greatly
reduced the execution time on my environment.

```
BEFORE
(13.751s) ok 30 - run pg_createsubscriber on node S
AFTER
(0.749s) ok 30 - run pg_createsubscriber on node S
```
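
For reference, the SQL-level equivalent of what the patch does is roughly the
following (pg_logical_emit_message() generates an XLOG_LOGICAL_MESSAGE record):

```sql
-- Sketch: emit one small, non-transactional WAL record so that the
-- recovery_target_lsn becomes reachable without waiting for the bgwriter.
SELECT pg_logical_emit_message(false, 'pg_createsubscriber', '');
```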

However, even after the modification, the reported failure [1] was not resolved
on my env.

What do you think?

[1]: 
https://www.postgresql.org/message-id/0dffca12-bf17-4a7a-334d-225569de5e6e%40gmail.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 


emit_dummy_message.diff
Description: emit_dummy_message.diff


RE: pg_createsubscriber: drop pre-existing subscriptions from the converted node

2024-06-30 Thread Hayato Kuroda (Fujitsu)
Dear Amit,

Thanks for the comments! PSA new version.

> Thanks, this is a better approach. I have changed a few comments and
> made some other cosmetic changes. See attached.

I checked your attached patch and it LGTM. Based on that, I added some changes
as below:

- Made dbname escaped while listing pre-existing subscriptions.
  The previous version could not pass the tests added by recent commits.
- Skipped dropping subscriptions in dry_run mode.
  I found the issue while porting the test to 040_pg_createsubscriber.pl.
- Added info-level output to follow the other drop_XXX functions.

> BTW, why have you created a separate test file for this test? I think
> we should add a new test to one of the existing tests in
> 040_pg_createsubscriber.

I had separated the test file just for confirmation purposes and planned to
merge it. The new patch set does that.

> You can create a dummy subscription on node_p
> and do a test similar to what we are doing in "# Create failover slot
> to test its removal".

Your approach looks better than mine, so I followed it.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



v4-0001-pg_createsubscriber-remove-unused-attribute.patch
Description: v4-0001-pg_createsubscriber-remove-unused-attribute.patch


v4-0002-pg_createsubscriber-Drop-pre-existing-subscriptio.patch
Description:  v4-0002-pg_createsubscriber-Drop-pre-existing-subscriptio.patch


RE: pg_createsubscriber: drop pre-existing subscriptions from the converted node

2024-06-26 Thread Hayato Kuroda (Fujitsu)
Dear Amit, Vingesh,

Thanks for the comments!

> It seems disabling subscriptions on the primary can make the primary
> stop functioning for some duration of time. I feel we need some
> solution where after converting to subscriber, we disable and drop
> pre-existing subscriptions. One idea could be that we use the list of
> new subscriptions created by the tool such that any subscription not
> existing in that list will be dropped.

Previously I avoided code like yours, because there is a chance that the
converted node could connect to another publisher. But per an off-list
discussion, we can prevent that by setting max_logical_replication_workers = 0.
I refactored with that approach.
Note that the GUC is checked at the verification phase, so an attribute was
added to start_standby_server() to select the workload.

Most of the comments by Vignesh were invalidated by the code change, but I
checked that the issues you raised were not reproduced. Also, 0001 was created
to remove an unused attribute.

> Shouldn't this be an open item for PG17?

Added this thread to the wiki page.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 


v2-0001-pg_createsubscriber-remove-unused-attribute.patch
Description: v2-0001-pg_createsubscriber-remove-unused-attribute.patch


v2-0002-pg_createsubscriber-Drop-pre-existing-subscriptio.patch
Description:  v2-0002-pg_createsubscriber-Drop-pre-existing-subscriptio.patch


RE: speed up a logical replica setup

2024-06-25 Thread Hayato Kuroda (Fujitsu)
Dear Amit, Euler,

Thanks for the comments! 0001 was modified per the suggestions.

> Yeah, it is a good idea to add a new option for two_phase but that
> should be done in the next version. For now, I suggest updating the
> docs and probably raising a warning (if max_prepared_transactions !=
> 0) as suggested by Noah. This WARNING is useful because one could
> expect that setting max_prepared_transactions != 0 means apply will
> happen at prepare time after the subscriber is created by this tool.
> The WARNING will be useful even if we support two_phase option as the
> user may have set the non-zero value of max_prepared_transactions but
> didn't use two_phase option.

I also think it should be tunable in PG18+. 0002 adds a description to the docs.
Also, the executable now warns if max_prepared_transactions != 0 on the
publisher. The subscriber side is not checked for now because I don't think it
is helpful, but it can easily be added. I did not add a test for that, because
the current test code does not check outputs.
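
The check is roughly along these lines (a sketch; names are illustrative, not
the exact patch code):

```c
/*
 * Sketch: fetch the publisher's setting and warn if prepared transactions
 * are possible there, since two_phase is disabled by this tool.
 */
res = PQexec(conn, "SHOW max_prepared_transactions");
if (strcmp(PQgetvalue(res, 0, 0), "0") != 0)
	pg_log_warning("two_phase option will be disabled");
PQclear(res);
```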

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



v2-0001-pg_createsubscriber-Fix-cases-which-connection-pa.patch
Description:  v2-0001-pg_createsubscriber-Fix-cases-which-connection-pa.patch


v2-0002-pg_createsubscriber-Warn-the-two-phase-is-disable.patch
Description:  v2-0002-pg_createsubscriber-Warn-the-two-phase-is-disable.patch


RE: Pgoutput not capturing the generated columns

2024-06-25 Thread Hayato Kuroda (Fujitsu)
Dear Shlok,

Thanks for updating the patches! Below are my comments, mostly for 0002.

01. General

IIUC, we have not discussed why ALTER SUBSCRIPTION ... SET
include_generated_columns is prohibited. Previously it seemed okay because there
were exclusive options, but now such restrictions are gone. Do you have a reason
in mind, or has it just not been considered yet?

02. General

According to the docs, we allow altering a column to a non-generated one via the
ALTER TABLE ... ALTER COLUMN ... DROP EXPRESSION command. I'm not sure what
should happen when that command is executed on the subscriber while the data is
being copied. Should we continue the copy or restart it? What do you think?
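
For reference, a minimal reproduction of the scenario I mean (the table name is
made up):

```sql
-- Hypothetical: the column stops being generated while the subscriber may
-- still be running the initial COPY.
CREATE TABLE tab (a int PRIMARY KEY,
                  b int GENERATED ALWAYS AS (a * 2) STORED);
ALTER TABLE tab ALTER COLUMN b DROP EXPRESSION;  -- b becomes a plain column
```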

03. Test code

IIUC, REFRESH PUBLICATION can also trigger table synchronization. Can you add
a test for that?

04. Test code (maybe for 0001)

Please test the combination with the ALTER TABLE ... ALTER COLUMN ... DROP
EXPRESSION command.

05. logicalrep_rel_open

```
+		/*
+		 * In case 'include_generated_columns' is 'false', we should skip the
+		 * check of missing attrs for generated columns.
+		 * In case 'include_generated_columns' is 'true', we should check if
+		 * corresponding column for the generated column in publication column
+		 * list is present in the subscription table.
+		 */
+		if (!MySubscription->includegencols && attr->attgenerated)
+		{
+			entry->attrmap->attnums[i] = -1;
+			continue;
+		}
```

This comment is not very clear to me, because here we do not skip anything.
Can you clarify why attnums[i] is set to -1 and how it will be used?

06. make_copy_attnamelist

```
+gencollist = palloc0(MaxTupleAttributeNumber * sizeof(bool));
```

I think this array is too large. Can we reduce the size to
(desc->natts * sizeof(bool))? Also, it should be freed afterward.
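
Something along these lines is what I have in mind (a sketch, not tested):

```c
/* Sketch: size the array by the actual attribute count, and free it later */
bool	   *gencollist = (bool *) palloc0(desc->natts * sizeof(bool));

/* ... fill and use gencollist ... */

pfree(gencollist);
```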

07. make_copy_attnamelist

```
+/* Loop to handle subscription table generated columns. */
+for (int i = 0; i < desc->natts; i++)
```

IIUC, the loop is needed to find generated columns on the subscriber side,
right? Can you clarify that in a comment?

08. copy_table

```
+/*
+ * Regular table with no row filter and 'include_generated_columns'
+ * specified as 'false' during creation of subscription.
+ */
```

I think this comment is not correct. After patching, every tablesync command
becomes COPY (SELECT ...) if include_generated_columns is set to true. Is that
right? Can we restrict this to only the tables that have generated columns?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



RE: Slow catchup of 2PC (twophase) transactions on replica in LR

2024-06-25 Thread Hayato Kuroda (Fujitsu)
> Dear hackers,
> 
> I found that v12 patch set could not be accepted by the cfbot. PSA new 
> version.

To make this more trackable for others, I'm sharing the changes just in case.
All failures occurred in the pg_dump code. I added an attribute to
pg_subscription and modified the pg_dump code, but it was wrong: the
constructed SQL was incomplete. I.e., in [1]:

```
pg_dump: error: query failed: ERROR:  syntax error at or near "."
LINE 15:  s.subforcealter
   ^
pg_dump: detail: Query was: SELECT s.tableoid, s.oid, s.subname,
 s.subowner,
 s.subconninfo, s.subslotname, s.subsynccommit,
 s.subpublications,
 s.subbinary,
 s.substream,
 s.subtwophasestate,
 s.subdisableonerr,
 s.subpasswordrequired,
 s.subrunasowner,
 s.suborigin,
 NULL AS suboriginremotelsn,
 false AS subenabled,
 s.subfailover
 s.subforcealter
FROM pg_subscription s
WHERE s.subdbid = (SELECT oid FROM pg_database
   WHERE datname = current_database())
```

Based on that, I just added a comma in the 0004 patch.

[1]: https://cirrus-ci.com/task/6710166165389312

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



RE: speed up a logical replica setup

2024-06-23 Thread Hayato Kuroda (Fujitsu)
Dear Noah,

> pg_createsubscriber fails on a dbname containing a space.  Use
> appendConnStrVal() here and for other params in get_sub_conninfo().  See the
> CVE-2016-5424 commits for more background.  For one way to test this
> scenario,
> see generate_db() in the pg_upgrade test suite.

Thanks for pointing that out. I made a fix patch. The test code was also
modified accordingly.
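
The core of the fix is a sketch like the following, assuming a PQExpBuffer
named "buf" is being built in get_sub_conninfo():

```c
/* Sketch: quote the value so spaces and quotes in dbname survive */
appendPQExpBufferStr(buf, " dbname=");
appendConnStrVal(buf, dbinfo->dbname);
```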

> > +static char *
> > +create_logical_replication_slot(PGconn *conn, struct LogicalRepInfo 
> > *dbinfo)
> > +{
> > +   PQExpBuffer str = createPQExpBuffer();
> > +   PGresult   *res = NULL;
> > +   const char *slot_name = dbinfo->replslotname;
> > +   char   *slot_name_esc;
> > +   char   *lsn = NULL;
> > +
> > +   Assert(conn != NULL);
> > +
> > +   pg_log_info("creating the replication slot \"%s\" on database \"%s\"",
> > +   slot_name, dbinfo->dbname);
> > +
> > +   slot_name_esc = PQescapeLiteral(conn, slot_name, strlen(slot_name));
> > +
> > +   appendPQExpBuffer(str,
> > + "SELECT lsn FROM
> pg_catalog.pg_create_logical_replication_slot(%s, 'pgoutput', false, false, 
> false)",
> 
> This is passing twophase=false, but the patch does not mention prepared
> transactions.  Is the intent to not support workloads containing prepared
> transactions?  If so, the documentation should say that, and the tool likely
> should warn on startup if max_prepared_transactions != 0.

IIUC, we decided that because it is the default behavior of logical replication.
See [1]. +1 for improving the documentation, but I'm not sure adding output is
helpful. I want to hear opinions from others.

> > +static void
> > +create_publication(PGconn *conn, struct LogicalRepInfo *dbinfo)
> > +{
> 
> > +   appendPQExpBuffer(str, "CREATE PUBLICATION %s FOR ALL TABLES",
> > + ipubname_esc);
> 
> This tool's documentation says it "guarantees that no transaction will be
> lost."  I tried to determine whether achieving that will require something
> like the fix from
> https://postgr.es/m/flat/de52b282-1166-1180-45a2-8d8917ca74c6@enterprise
> db.com.
> (Not exactly the fix from that thread, since that thread has not discussed the
> FOR ALL TABLES version of its race condition.)  I don't know.  On the one
> hand, pg_createsubscriber benefits from creating a logical slot after creating
> the publication.  That snapbuild.c process will wait for running XIDs.  On the
> other hand, an INSERT/UPDATE/DELETE acquires its RowExclusiveLock and
> builds
> its relcache entry before assigning an XID, so perhaps the snapbuild.c process
> isn't enough to prevent that thread's race condition.  What do you think?

IIUC, the documentation just intended to say that the type of replication will
be switched from streaming to logical at a certain point. Please give me some
time to analyze this.

[1]: 
https://www.postgresql.org/message-id/270ad9b8-9c46-40c3-b6c5-3d25b91d3a7d%40app.fastmail.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/

 


0001-pg_createsubscriber-Fix-cases-which-connection-param.patch
Description:  0001-pg_createsubscriber-Fix-cases-which-connection-param.patch


RE: Pgoutput not capturing the generated columns

2024-06-22 Thread Hayato Kuroda (Fujitsu)
Hi Shubham,

Thanks for sharing the new patch! You shared it as v9, but it should be v10,
right? Also, since there was no commitfest entry, I registered one [1]. You can
rename the title as needed. Currently the CFbot says OK.

Anyway, below are my comments.

01. General
Your patch contains unnecessary changes. Please remove all of them. E.g., 

```
 " s.subpublications,\n");
-
```
And
```
appendPQExpBufferStr(query, " o.remote_lsn AS 
suboriginremotelsn,\n"
-" s.subenabled,\n");
+   " s.subenabled,\n");
```

02. General
Again, please run the pgindent/pgperltidy.

03. test_decoding
Previously I suggested that the default value of include_generated_columns
should be true, and you modified it like that. However, Peter suggested the
opposite [3] and you just revised it accordingly. I think either way might be
okay, but at least you should clarify why you now prefer the default to be
false.

04. decoding_into_rel.sql
According to the comment atop this file, this test should insert the results
into a table. But the added case does not - we should put it somewhere else,
i.e., create another file.

05. decoding_into_rel.sql
```
+-- when 'include-generated-columns' is not set
```
Can you clarify the expected behavior in a comment?

06. getSubscriptions
```
+	else
+		appendPQExpBufferStr(query,
+							 " false AS subincludegencols,\n");
```
I think the trailing comma is not needed.
Also, this error means that you did not test pg_dump against instances prior to
PG16. Please verify that we can dump subscriptions and restore them accordingly.

[1]: https://commitfest.postgresql.org/48/5068/
[2]: 
https://www.postgresql.org/message-id/OSBPR01MB25529997E012DEABA8E15A02F5E52%40OSBPR01MB2552.jpnprd01.prod.outlook.com
[3]: 
https://www.postgresql.org/message-id/CAHut%2BPujrRQ63ju8P41tBkdjkQb4X9uEdLK_Wkauxum1MVUdfA%40mail.gmail.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



pg_createsubscriber: drop pre-existing subscriptions from the converted node

2024-06-21 Thread Hayato Kuroda (Fujitsu)
Dear Hackers,

This is a follow-up thread for pg_createsubscriber [1]. I started a new thread
since there has been no recent activity there.

## Problem

Assume there is a cascading replication setup like below:

node A --(logical replication)--> node B --(streaming replication)--> node C

In this case, subscriptions exist even on node C, but node C does not try to
connect to node A because the logical replication launcher/workers won't be
launched there. After the conversion, node C becomes a subscriber of node B, and
the subscription toward node A remains. Therefore, another worker that tries to
connect to node A will be launched, raising an ERROR [2]. This failure may occur
even during the conversion.

## Solution

The easiest solution is to drop the pre-existing subscriptions from the
converted node. To avoid establishing connections during the conversion,
slot_name is set to NONE on the primary first, and the subscriptions are then
dropped on the standby. The setting is restored on the primary node afterward,
roughly as sketched below.
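
At the SQL level, the idea corresponds roughly to the following (the
subscription name is just an example):

```sql
-- On the primary, before the conversion (the change reaches the standby
-- via physical replication):
ALTER SUBSCRIPTION pre_existing_sub DISABLE;
ALTER SUBSCRIPTION pre_existing_sub SET (slot_name = NONE);

-- On the converted node: now safe to drop, no remote slot is touched.
DROP SUBSCRIPTION pre_existing_sub;

-- On the primary, restore the original state:
ALTER SUBSCRIPTION pre_existing_sub SET (slot_name = 'pre_existing_sub');
ALTER SUBSCRIPTION pre_existing_sub ENABLE;
```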
The attached patch implements the idea. A test script is also included, but I'm
not sure it should go into HEAD.

BTW, I found that LogicalRepInfo.oid is not used anywhere. If needed, I can
create another patch to remove the attribute.

What do you think?

[1]: 
https://www.postgresql.org/message-id/CAA4eK1J22UEfrqx222h5j9DQ7nxGrTbAa_BC%2B%3DmQXdXs-RCsew%40mail.gmail.com
[2]: 
https://www.postgresql.org/message-id/CANhcyEWvimA1-f6hSrA%3D9qkfR5SonFb56b36M%2B%2BvT%3DLiFj%3D76g%40mail.gmail.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



0001-pg_createsubscriber-Drop-pre-existing-subscriptions-.patch
Description:  0001-pg_createsubscriber-Drop-pre-existing-subscriptions-.patch


RE: 001_rep_changes.pl fails due to publisher stuck on shutdown

2024-06-18 Thread Hayato Kuroda (Fujitsu)
Dear Horiguchi-san,

Thanks for sharing the patch! I agree this approach (ensuring WAL records are
flushed) is more appropriate than the others.

I have one unclear point. According to the comment atop GetInsertRecPtr(), it
just returns an approximated value - the position of the last full WAL page [1].
If there is a continuation WAL record that crosses a page boundary, will it
return the halfway point of that record (the end of the first WAL page)? If so,
the proposed fix seems insufficient; we have to point at the exact end of the
record.

[1]:
/*
 * GetInsertRecPtr -- Returns the current insert position.
 *
 * NOTE: The value *actually* returned is the position of the last full
 * xlog page. It lags behind the real insert position by at most 1 page.
 * For that, we don't need to scan through WAL insertion locks, and an
 * approximation is enough for the current usage of this function.
 */
XLogRecPtr
GetInsertRecPtr(void)

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 





RE: speed up a logical replica setup

2024-06-07 Thread Hayato Kuroda (Fujitsu)
Dear Euler,

Thanks for making the follow-up patch! I was looking forward to your updates.
I think this patch set is the solution for the found buildfarm error. However,
there are remained claims raised by others. You should reply what you think for
them. At least:

1) There are some misleading messages [1]. I think the v3-0005 patch can solve
   the issue.
2) pg_createsubscriber may fail if the primary has subscriptions [2]. IIUC the
   possible approaches are A) "keep subscriptions disabled at the end",
   B) "drop the pre-existing subscriptions by default",
   C) "do nothing, just document the risk".


> Before sending this email I realized that I did nothing about physical
> replication slots on the standby. I think we should also remove them too
> unconditionally.

I also considered this, but it might be difficult to predict what users expect.
Can we say for sure that such a slot is not intentional? Regarding failover
slots, it is OK because they are meaningful only on the standby, but I'm not
sure about other slots. I personally think we can keep the current spec, but
what do others think?


The parts below are comments on each patch.

0001
Basically LGTM. I was a bit confused because the default timeout is not set,
but it seems to follow the suggestion by Tomas [3].

0002
If you want to improve the commit message, please add that 
sync_replication_slots
is disabled during the conversion.

0003
Confirmed it followed the discussion [4].

0004
Basically LGTM.

Other minor comments are included in the attached diff file. It contains changes
to follow conventions and the results of pgindent/pgperltidy.

[1]: 
https://www.postgresql.org/message-id/CAA4eK1J2fAvsJ2HihbWJ_GxETd6sdqSMrZdCVJEutRZRpm1MEQ%40mail.gmail.com
[2]: 
https://www.postgresql.org/message-id/CANhcyEWvimA1-f6hSrA%3D9qkfR5SonFb56b36M%2B%2BvT%3DLiFj%3D76g%40mail.gmail.com
[3]: 
https://www.postgresql.org/message-id/5d5dd4cd-6359-4109-88e8-c8e13035ae16%40enterprisedb.com
[4]: 
https://www.postgresql.org/message-id/CAA4eK1LZxYxcbeiOn3Q5hjXVtZKhJWj-fQtndAeTCvZrPev8BA%40mail.gmail.com


Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/global/ 



minor_fix_by_kuroda.diff
Description: minor_fix_by_kuroda.diff


RE: Pgoutput not capturing the generated columns

2024-06-05 Thread Hayato Kuroda (Fujitsu)
Dear Shlok and Shubham,

Thanks for updating the patch!

I briefly checked v5-0002. IIUC, your patch copies generated columns
unconditionally. I think that behavior affects many people, so it will be hard
to get agreement on it.

Can we add a new option like `GENERATED_COLUMNS [boolean]`? If the default is
off, we can keep the current specification.

Thoughts?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



RE: State of pg_createsubscriber

2024-05-22 Thread Hayato Kuroda (Fujitsu)
Dear Amit, Robert,

> So, we have the following options: (a) by default drop the
> pre-existing subscriptions, (b) by default disable the pre-existing
> subscriptions, and add a Note in the docs that users can take
> necessary actions to enable or drop them. Now, we can even think of
> providing a switch to retain the pre-existing subscriptions or
> publications as the user may have some use case where it can be
> helpful for her. For example, retaining publications can help in
> creating a bi-directional setup.

Another point we should consider is replication slots. If the standby server has
slots that were forgotten, WAL files won't be discarded, so a disk-full failure
will eventually happen. v2-0004 proposed in [1] drops replication slots whose
failover option is true. This can partially solve the issue, but what should be
done about the other slots?

[1]: 
https://www.postgresql.org/message-id/CANhcyEV6q1Vhd37i1axUeScLi0UAGVxta1LDa0BV0Eh--TcPMg%40mail.gmail.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



RE: Pgoutput not capturing the generated columns

2024-05-22 Thread Hayato Kuroda (Fujitsu)
Dear Shubham,

Thanks for updating the patch! I checked your patches briefly. Here are my 
comments.

01. API

Since the option for test_decoding is enabled by default, I think it should be
renamed, e.g., to "skip-generated-columns" or something similar.

02. ddl.sql

```
+-- check include-generated-columns option with generated column
+CREATE TABLE gencoltable (a int PRIMARY KEY, b int GENERATED ALWAYS AS (a * 2) STORED);
+INSERT INTO gencoltable (a) VALUES (1), (2), (3);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'include-generated-columns', '1');
+                             data
+--------------------------------------------------------------
+ BEGIN
+ table public.gencoltable: INSERT: a[integer]:1 b[integer]:2
+ table public.gencoltable: INSERT: a[integer]:2 b[integer]:4
+ table public.gencoltable: INSERT: a[integer]:3 b[integer]:6
+ COMMIT
+(5 rows)
```

We should also test the non-default case, in which generated columns are not output.
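
I.e., something like the following (a sketch based on the test above):

```sql
SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL,
    'include-xids', '0', 'skip-empty-xacts', '1',
    'include-generated-columns', '0');
```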

03. ddl.sql

I'm not sure the new tests are in the correct place. Do we have to add a new
file and move the tests there? Thoughts?

04. protocol.sgml

Please keep the format of the sgml file.

05. protocol.sgml

The option is implemented as a streaming option of the pgoutput plugin, so it
should be located under the "Logical Streaming Replication Parameters" section.

06. AlterSubscription

```
+			if (IsSet(opts.specified_opts, SUBOPT_GENERATED_COLUMN))
+			{
+				ereport(ERROR,
+						(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+						 errmsg("toggling generated_column option is not allowed.")));
+			}
```

If you don't want to support altering the option, you can remove the
SUBOPT_GENERATED_COLUMN macro from the function. But can you clarify why you
don't want to support it?

07. logicalrep_write_tuple

```
-		if (!column_in_column_list(att->attnum, columns))
+		if (!column_in_column_list(att->attnum, columns) && !att->attgenerated)
+			continue;
+
+		if (att->attgenerated && !publish_generated_column)
 			continue;
```

I think the changes from v2 were reverted or wrongly merged.

08. test code

Can you add tests verifying that generated columns are replicated by logical
replication?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



RE: Slow catchup of 2PC (twophase) transactions on replica in LR

2024-05-15 Thread Hayato Kuroda (Fujitsu)
Dear Peter,

> You wrote "Fixed" for that patch v9-0004 suggestion but I don't think
> anything was changed at all. Accidentally missed?

Sorry, I forgot to do `git add` after the revision.
The change is included in the new patch [1].

[1]: 
https://www.postgresql.org/message-id/OSBPR01MB25522052F9F3E3AAD3BA2A8CF5ED2%40OSBPR01MB2552.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



RE: Fix src/test/subscription/t/029_on_error.pl test when wal_debug is enabled

2024-05-15 Thread Hayato Kuroda (Fujitsu)
Dear Amit, Ian,

> I guess it could be more work if we want to enhance the test for
> ERRORs other than the primary key violation. One simple fix is to
> update the log_offset to the location of the LOG after successful
> replication of un-conflicted data. For example, the Log location after
> we execute the below line in the test:
> 
> # Check replicated data
>my $res =
>  $node_subscriber->safe_psql('postgres', "SELECT
> count(*) FROM tbl");
>is($res, $expected, $msg);

I made a patch for confirmation purposes. It worked well on my environment.
Ian, how about you?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



fix_029.diff
Description: fix_029.diff


RE: Slow catchup of 2PC (twophase) transactions on replica in LR

2024-05-14 Thread Hayato Kuroda (Fujitsu)
Dear Peter,

Thanks for reviewing! A new patch is available in [1].

> I'm having second thoughts about how these patches mention the option
> values "on|off". These are used in the ALTER SUBSCRIPTION document
> page for 'two_phase' and 'failover' parameters, and then those
> "on|off" get propagated to the code comments, error messages, and
> tests...
> 
> Now I see that on the CREATE SUBSCRIPTION page [1], every boolean
> parameter (even including 'two_phase' and 'failover') is described in
> terms of "true|false" (not "on|off").

Hmm. But I could find sentences like "The default value is off, ...". Also, the
"on|off" notation has already been used in alter_subscription.sgml. I'm not
sure, but I feel there are no strict rules here.

> In hindsight, it is probably better to refer only to true|false
> everywhere for these boolean parameters, instead of sometimes using
> different values like on|off.
> 
> What do you think?

It's OK for me to make the messages/code comments consistent. I'm not sure
about the documentation, but I changed only my part accordingly.

[1]: 
https://www.postgresql.org/message-id/OSBPR01MB2552F66463EFCFD654E87C09F5E32%40OSBPR01MB2552.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



RE: Slow catchup of 2PC (twophase) transactions on replica in LR

2024-05-13 Thread Hayato Kuroda (Fujitsu)
/'force_alter' option is not specified as 
> true./

Fixed.

> 
> 7.
> +# Verify the prepared transaction still exists
> +$result = $node_subscriber->safe_psql('postgres',
> +"SELECT count(*) FROM pg_prepared_xacts;");
> +is($result, q(1), "prepared transaction still exits");
> +
> 
> TYPO: /exits/exists/

Fixed.

> 
> ~~~
> 
> 8.
> +# Alter the two_phase with the force_alter option. Apart from the above, the
> +# command will abort the prepared transaction and succeed.
> +$node_subscriber->safe_psql('postgres',
> +"ALTER SUBSCRIPTION regress_sub SET (two_phase = off, force_alter
> = true);");
> +$node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION
> regress_sub ENABLE;");
> +
> 
> What does "Apart from the above" mean? Be more explicit.

Clarified like "Apart from the last ALTER SUBSCRIPTION command...".

> 9.
> +# Verify the prepared transaction are aborted
>  $result = $node_subscriber->safe_psql('postgres',
>  "SELECT count(*) FROM pg_prepared_xacts;");
>  is($result, q(0), "prepared transaction done by worker is aborted");
> 
> /transaction are aborted/transaction was aborted/

Fixed.

[1]: 
https://www.postgresql.org/message-id/OSBPR01MB2552FEA48D265EA278AA9F7AF5E22%40OSBPR01MB2552.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



RE: Slow catchup of 2PC (twophase) transactions on replica in LR

2024-05-13 Thread Hayato Kuroda (Fujitsu)
Dear Peter,

Thanks for reviewing! The patch is available in [1].

> ==
> src/sgml/ref/alter_subscription.sgml
> 
> 1.
> + 
> +  The two_phase parameter can only be altered when
> the
> +  subscription is disabled. When altering the parameter from
> on
> +  to off, the backend process checks prepared
> +  transactions done by the logical replication worker and aborts them.
> + 
> 
> The text may be OK as-is, but I was wondering if it might be better to
> give a more verbose explanation.
> 
> BEFORE
> ... the backend process checks prepared transactions done by the
> logical replication worker and aborts them.
> 
> SUGGESTION
> ... the backend process checks for any incomplete prepared
> transactions done by the logical replication worker (from when
> two_phase parameter was still "on") and, if any are found, those are
> aborted.
>

Fixed.

> ==
> src/backend/commands/subscriptioncmds.c
> 
> 2. AlterSubscription
> 
> - /*
> - * Since the altering two_phase option of subscriptions
> - * also leads to the change of slot option, this command
> - * cannot be rolled back. So prevent we are in the
> - * transaction block.
> + * If two_phase was enabled, there is a possibility the
> + * transactions has already been PREPARE'd. They must be
> + * checked and rolled back.
>   */
> 
> BEFORE
> ... there is a possibility the transactions has already been PREPARE'd.
> 
> SUGGESTION
> ... there is a possibility that transactions have already been PREPARE'd.

Fixed.

> 3. AlterSubscription
> + /*
> + * Since the altering two_phase option of subscriptions
> + * (especially on->off case) also leads to the
> + * change of slot option, this command cannot be rolled
> + * back. So prevent we are in the transaction block.
> + */
>   PreventInTransactionBlock(isTopLevel,
> "ALTER SUBSCRIPTION ... SET (two_phase = off)");
> 
> 
> This comment is a bit vague and includes some typos, but IIUC these
> problems will already be addressed by the 0002 patch changes.AFAIK
> patch 0003 is only moving the 0002 comment.

Yeah, the comment was updated accordingly.

> 
> ==
> src/test/subscription/t/099_twophase_added.pl
> 
> 4.
> +# Check the case that prepared transactions exist on the subscriber node
> +#
> +# If the two_phase is altering from "on" to "off" and there are prepared
> +# transactions on the subscriber, they must be aborted. This test checks it.
> +
> 
> Similar to the comment that I gave for v8-0002. I think there should
> be  comment for the major test comment to
> distinguish it from comments for the sub-steps.

Added.

> 5.
> +# Verify the prepared transaction are aborted because two_phase is changed to
> +# "off".
> +$result = $node_subscriber->safe_psql('postgres',
> +"SELECT count(*) FROM pg_prepared_xacts;");
> +is($result, q(0), "prepared transaction done by worker is aborted");
> +
> 
> /the prepared transaction are aborted/any prepared transactions are aborted/

Fixed.

[1]: 
https://www.postgresql.org/message-id/OSBPR01MB2552FEA48D265EA278AA9F7AF5E22%40OSBPR01MB2552.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



RE: Improving the latch handling between logical replication launcher and worker processes.

2024-05-09 Thread Hayato Kuroda (Fujitsu)
Dear Vignesh,

Thanks for raising this idea!

> a) Introduce a new latch to handle worker attach and exit.

Just to confirm - there are three wait events for launchers, so I feel we may be
able to create a latch per wait event. Is there a reason to introduce only a
single latch?

> b) Add a new GUC launcher_retry_time which gives more flexibility to
> users as suggested by Amit at [1]. Before 5a3a953, the
> wal_retrieve_retry_interval plays a similar role as the suggested new
> GUC launcher_retry_time, e.g. even if a worker is launched, the
> launcher only wait wal_retrieve_retry_interval time before next round.

Hmm. My concern is how users would estimate the value. Maybe the default will be
3 min, but should users change it? If so, to what? I think even if it becomes
tunable, users cannot control it well.

> c) Don't reset the latch at worker attach and allow launcher main to
> identify and handle it. For this there is a patch v6-0002 available at
> [2].

Do you mean that you want to remove ResetLatch() from
WaitForReplicationWorkerAttach()? If so, what about the following scenario?

1) The launcher is waiting for the worker to attach in
WaitForReplicationWorkerAttach(), and 2) a subscription is created before the
attach completes. In this case, the launcher becomes unable to sleep because the
latch is set but never reset. It may waste CPU time.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 


RE: Slow catchup of 2PC (twophase) transactions on replica in LR

2024-05-09 Thread Hayato Kuroda (Fujitsu)
Dear Peter,

> 
> ==
> Commit message
> 
> 1.
> A detailed commit message is needed to describe the purpose and
> details of this patch.

Added.

> ==
> doc/src/sgml/ref/alter_subscription.sgml
> 
> 2. CREATE SUBSCRIPTION
> 
> Shouldn't there be an entry for "force_alter" parameter in the CREATE
> SUBSCRIPTION "parameters" section, instead of just vaguely mentioning
> it in passing when describing the "two_phase" in ALTER SUBSCRIPTION?
>
> 3. ALTER SUBSCRIPTION - alterable parameters
> 
> And shouldn't this new option also be named in the ALTER SUBSCRIPTION
> list: "The parameters that can be altered are..."

Hmm, but the parameter cannot be used with CREATE SUBSCRIPTION. Should we modify
it to accept the option and add the description to the docs? Currently it is not
accepted there.

> ==
> src/backend/commands/subscriptioncmds.c
> 
> 4.
>   XLogRecPtr lsn;
> + bool twophase_force;
>  } SubOpts;
> 
> IMO this field ought to be called 'force_alter' to be the same as the
> option name. Sure, now it is only relevant for 'two_phase', but that
> might not always be the case in the future.

Modified.

> 5. AlterSubscription
> 
> + /*
> + * Abort prepared transactions if force option is also
> + * specified. Otherwise raise an ERROR.
> + */
> + if (!opts.twophase_force)
> + ereport(ERROR,
> + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> + errmsg("cannot alter %s when there are prepared transactions",
> + "two_phase = false")));
> +
> 
> 5a.
> /if force option is also specified/only if the 'force_alter' option is true/

Modified.

> 
> 5b.
> "two_phase = false" -- IMO that should say "two_phase = off"

Modified.

> 5c.
> IMO this ereport should include a errhint to tell the user they can
> use 'force_alter = true' to avoid getting this error.

A hint was added.
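
The message now looks roughly like this (the wording in the posted patch may
differ slightly):

```c
ereport(ERROR,
		(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
		 errmsg("cannot alter %s when there are prepared transactions",
				"two_phase = off"),
		 errhint("Resolve these transactions and try again, or set %s.",
				 "force_alter = true")));
```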

> 6.
> 
> + /* force_alter cannot be used standalone */
> + if (IsSet(opts.specified_opts, SUBOPT_FORCE_ALTER) &&
> + !IsSet(opts.specified_opts, SUBOPT_TWOPHASE_COMMIT))
> + ereport(ERROR,
> + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> + errmsg("%s must be specified with %s",
> + "force_alter", "two_phase")));
> +
> 
> IMO this rule is not necessary so the code should be removed. I think
> using 'force_alter' standalone doesn't do anything at all (certainly,
> it does no harm) so why add more complications (more rules, more code,
> more tests) just for the sake of it?

Removed. So a standalone 'force_alter' is now a no-op.

> src/test/subscription/t/099_twophase_added.pl
> 
> 7.
> +$node_subscriber->safe_psql('postgres',
> +"ALTER SUBSCRIPTION sub SET (two_phase = off, force_alter = on);");
> 
> "force" is a verb, so it is better to say 'force_alter = true' instead
> of 'force_alter = on'.

Fixed. Actually, I'm not sure it is better because I'm not a native speaker.

> 8.
>  $result = $node_subscriber->safe_psql('postgres',
>  "SELECT count(*) FROM pg_prepared_xacts;");
>  is($result, q(0), "prepared transaction done by worker is aborted");
> 
> +$node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION sub
> ENABLE;");
> +
> 
> I felt the ENABLE statement should be above the SELECT statement so
> that the code is more like it was before applying the patch.

Fixed.

Please see the attached patch set.


Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



v8-0001-Allow-altering-of-two_phase-option-of-a-SUBSCRIPT.patch
Description:  v8-0001-Allow-altering-of-two_phase-option-of-a-SUBSCRIPT.patch


v8-0002-Alter-slot-option-two_phase-only-when-altering-on.patch
Description:  v8-0002-Alter-slot-option-two_phase-only-when-altering-on.patch


v8-0003-Abort-prepared-transactions-while-altering-two_ph.patch
Description:  v8-0003-Abort-prepared-transactions-while-altering-two_ph.patch


v8-0004-Add-force_alter-option-for-ALTER-SUBSCIRPTION-.-S.patch
Description:  v8-0004-Add-force_alter-option-for-ALTER-SUBSCIRPTION-.-S.patch


RE: Slow catchup of 2PC (twophase) transactions on replica in LR

2024-05-09 Thread Hayato Kuroda (Fujitsu)
Dear Peter,

> Commit Message
> 
> 1.
> The patch needs a commit message to describe the purpose and highlight
> any limitations and other details.

Added.

> ==
> doc/src/sgml/ref/alter_subscription.sgml
> 
> 2.
> +
> + 
> +  The two_phase parameter can only be altered when
> the
> +  subscription is disabled. When altering the parameter from
> true
> +  to false, the backend process checks prepared
> +  transactions done by the logical replication worker and aborts them.
> + 
> 
> Here, the para is referring to "true" and "false" but earlier on this
> page it talks about "twophase = off". IMO it is better to use a
> consistent terminology like "on|off" everywhere instead of randomly
> changing the way it is described each time.

I checked the contents and changed them to "on|off".

> ==
> src/backend/commands/subscriptioncmds.c
> 
> 3. AlterSubscription
> 
>   if (IsSet(opts.specified_opts, SUBOPT_TWOPHASE_COMMIT))
>   {
> + List *prepared_xacts = NIL;
> 
> This 'prepared_xacts' can be declared at a lower scrope because it is
> only used if (!opts.twophase).
> 
> Furthermore, IIUC you don't need to assign NIL in the declaration
> because there is no chance for it to be unassigned anyway.

Made the scope narrower and removed the initialization.

> ~~~
> 
> 4. AlterSubscription
> 
> + /*
> + * The changed two_phase option (true->false) of the
> + * slot can't be rolled back.
> + */
>   PreventInTransactionBlock(isTopLevel,
> "ALTER SUBSCRIPTION ... SET (two_phase = off)");
> 
> Here is another example of inconsistent mixing of the terminology
> where the comment says "true"/"false" but the message says "off".
> Let's keep everything consistent. (I prefer on|off).

Modified.

> ~~~
> 
> 5.
> + if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED &&
> + (prepared_xacts = GetGidListBySubid(subid)) != NIL)
> + {
> + ListCell *cell;
> +
> + /* Abort all listed transactions */
> + foreach(cell, prepared_xacts)
> + FinishPreparedTransaction((char *) lfirst(cell),
> +   false);
> +
> + list_free(prepared_xacts);
> + }
> 
> 5A.
> IIRC there is a cleaner way to write this loop without needing
> ListCell variable -- e.g. foreach_ptr() macro?

Changed.

> 5B.
> Shouldn't this be using list_free_deep() so the pstrdup gid gets freed too?

Yeah, fixed.

> ==
> src/test/subscription/t/099_twophase_added.pl
> 
> 6.
> +##
> +# Check the case that prepared transactions exist on subscriber node
> +##
> +
> 
> Give some more detailed comments here similar to the review comment of
> patch v7-0002 for the other part of this TAP test.
> 
> ~~~
> 
> 7. TAP test - comments
> 
> Same as for my v7-0002 review comments, I think this test case also
> needs a few more one-line comments to describe the sub-steps. e.g.:
> 
> # prepare a transaction to insert some rows to the table
> 
> # verify the prepared tx is replicated to the subscriber (because
> 'two_phase = on')
> 
> # toggle the two_phase to 'off' *before* the COMMIT PREPARED
> 
> # verify the prepared tx got aborted
> 
> # do the COMMIT PREPARED (note that now two_phase is 'off')
> 
> # verify the inserted rows got replicated ok

They were fixed based on your previous comments.

> 
> 8. TAP test - subscription name
> 
> It's better to rename the SUBSCRIPTION in this TAP test so you can
> avoid getting log warnings like:
> 
> psql::4: WARNING:  subscriptions created by regression test
> cases should have names starting with "regress_"
> psql::4: NOTICE:  created replication slot "sub" on publisher

Modified, but it was included in 0001.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 




RE: Slow catchup of 2PC (twophase) transactions on replica in LR

2024-05-09 Thread Hayato Kuroda (Fujitsu)
> + appendStringInfo(&cmd, "ALTER_REPLICATION_SLOT %s ( ",
> + quote_identifier(slotname));
> +
> + if (failover)
> + appendStringInfo(&cmd, "FAILOVER %s ",
> + (*failover) ? "true" : "false");
> +
> + if (two_phase)
> + appendStringInfo(&cmd, "TWO_PHASE %s%s ",
> + (*two_phase) ? "true" : "false",
> + failover ? ", " : "");
> +
> + appendStringInfoString(&cmd, ");");
> 
> 4a.
> IIUC the comma logic here was broken in v7 when you swapped the order.
> Anyway, IMO it will be better NOT to try combining that comma logic
> with the existing appendStringInfo. Doing it separately is both easier
> and less error-prone.
> 
> Furthermore, the parentheses like "(*two_phase)" instead of just
> "*two_phase" seemed a bit overkill.
> 
> SUGGESTION:
> + if (failover)
> + appendStringInfo(&cmd, "FAILOVER %s",
> + *failover ? "true" : "false");
> +
> +   if (failover && two_phase)
> +   appendStringInfo(&cmd, ", ");
> +
> + if (two_phase)
> + appendStringInfo(&cmd, "TWO_PHASE %s",
> + *two_phase ? "true" : "false");
> +
> + appendStringInfoString(&cmd, " );");

Fixed.

> 4b.
> Like I said above, IMO the current separator logic in v7 is broken. So
> it is a bit concerning the tests all passed anyway. How did that
> happen? I think this indicates that there needs to be an additional
> test scenario where both 'failover' and 'two_phase' get altered at the
> same time so this code gets exercised properly.

Right, it was added.

> ==
> src/test/subscription/t/099_twophase_added.pl
> 
> 5.
> +# Define pre-existing tables on both nodes
> 
> Why say they are "pre-existing"? They are not pre-existing because you
> are creating them right here!

Removed the word.

> 6.
> +##
> +# Check the case that prepared transactions exist on publisher node
> +##
> 
> I think this needs a slightly more detailed comment.
> 
> SUGGESTION (this is just an example, but you can surely improve it)
> 
> # Check the case that prepared transactions exist on the publisher node.
> #
> # Since two_phase is "off", then normally this PREPARE will do nothing until
> # the COMMIT PREPARED, but in this test, we toggle the two_phase to "on" again
> # before the COMMIT PREPARED happens.

Changed with adjustments.

> 7.
> Maybe this test case needs a few more one-line comments for each of
> the sub-steps. e.g.:
> 
> # prepare a transaction to insert some rows to the table
> 
> # verify the prepared tx is not yet replicated to the subscriber
> (because 'two_phase = off')
> 
> # toggle the two_phase to 'on' *before* the COMMIT PREPARED
> 
> # verify the inserted rows got replicated ok

Modified like yours, with changes based on suggestions from Grammarly.

> 8.
> IIUC this test will behave the same even if you DON'T do the toggle
> 'two_phase = on'. So I wonder is there something more you can do to
> test this scenario more convincingly?

I found an indicator. When the apply worker starts, it outputs the current status
of the two_phase option. I added wait_for_log() to ensure the message below
appeared. Thoughts?

```
ereport(DEBUG1,
        (errmsg_internal("logical replication apply worker for subscription \"%s\" two_phase is %s",
                         MySubscription->name,
                         MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_DISABLED ? "DISABLED" :
                         MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING ? "PENDING" :
                         MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED ? "ENABLED" :
                         "?")));
```
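
On the test side, the check would look roughly like this (a sketch; it assumes
log_min_messages is at least debug1 on the subscriber, because the message is
emitted at DEBUG1):

```
# Remember the current end of the subscriber's log file
my $offset = -s $node_subscriber->logfile;

# ... enable the subscription, then wait for the status message
$node_subscriber->wait_for_log(
	qr/logical replication apply worker for subscription "tap_sub_copy" two_phase is ENABLED/,
	$offset);
```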
Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



RE: Slow catchup of 2PC (twophase) transactions on replica in LR

2024-05-09 Thread Hayato Kuroda (Fujitsu)
Dear Peter,

Thanks for reviewing! The patch will be posted in an upcoming message.

> ==
> src/backend/access/transam/twophase.c
> 
> 1. IsTwoPhaseTransactionGidForSubid
> 
> +/*
> + * IsTwoPhaseTransactionGidForSubid
> + * Check whether the given GID is formed by TwoPhaseTransactionGid.
> + */
> +static bool
> +IsTwoPhaseTransactionGidForSubid(Oid subid, char *gid)
> 
> I think the function comment should mention something about 'subid'.
> 
> SUGGESTION
> Check whether the given GID (as formed by TwoPhaseTransactionGid) is
> for the specified 'subid'.

Fixed.

> src/backend/commands/subscriptioncmds.c
> 
> 2. AlterSubscription
> 
> + if (!opts.twophase &&
> + form->subtwophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED
> &&
> + LookupGXactBySubid(subid))
> + /* Add error message */
> + ereport(ERROR,
> + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> + errmsg("cannot disable two_phase when uncommitted prepared
> transactions present"),
> + errhint("Resolve these transactions and try again")));
> 
> The comment "/* Add error message */" seems unnecessary.

Yeah, that was an internal note to myself. Removed.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



RE: Pgoutput not capturing the generated columns

2024-05-08 Thread Hayato Kuroda (Fujitsu)
Dear Shubham,

Thanks for creating a patch! Here are high-level comments.

1.
Please document the feature. If it is hard to describe, we should change the 
API.

2.
Currently, the option is implemented as a streaming option. Are there any reasons
to choose this way? Another approach is to implement it as a slot option, like
failover and temporary.

3.
You said that the subscription option is not supported for now. Does it mean that
the logical replication feature cannot be used for generated columns? If so, the
restriction won't be acceptable. If the combination of this and the initial sync
is problematic, can't we exclude them in CreateSubscription and
AlterSubscription?
E.g., the create_slot option cannot be set if slot_name is NONE.
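
For reference, CreateSubscription() already rejects inconsistent combinations up
front, so a similar check could be added for this option:

```
CREATE SUBSCRIPTION regress_sub CONNECTION 'dbname=postgres' PUBLICATION pub
    WITH (slot_name = NONE, create_slot = true);
ERROR:  slot_name = NONE and create_slot = true are mutually exclusive options
```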

4.
Regarding the test_decoding plugin, it has already been able to decode generated
columns. So, in the first place, is the proposed option really needed for the
plugin? Why do you include it?
If you want to add the option anyway, the default value should be on, which keeps
the current behavior.

5.
Assuming that the feature becomes usable for logical replication, should we
change the protocol version at that time? Nodes prior to PG17 may not want to
receive values for generated columns. Can we control this only by the option?

6. logicalrep_write_tuple()

```
-if (!column_in_column_list(att->attnum, columns))
+if (!column_in_column_list(att->attnum, columns) && !att->attgenerated)
+continue;
```

Hmm, does the above mean that generated columns are decoded even if they are not
in the column list? If so, why? I think such columns should not be sent.

7.

Some functions refer to data->publish_generated_column many times. Can we store
the value in a variable?

The comments below are for the test_decoding part, but they may not be needed.

=

a. pg_decode_startup()

```
+else if (strcmp(elem->defname, "include_generated_columns") == 0)
```

Other options for test_decoding do not use underscores. It should be
"include-generated-columns".

b. pg_decode_change()

data->include_generated_columns is referred to four times in the function.
Can you store the value in a variable?


c. pg_decode_change()

```
-true);
+true, data->include_generated_columns );
```

Please remove the extra space.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



RE: Slow catchup of 2PC (twophase) transactions on replica in LR

2024-05-08 Thread Hayato Kuroda (Fujitsu)
> There are double blank lines after the first if-block.

Removed.

> ==
> src/test/regress/sql/subscription.sql
> 
> 16.
> I know you do this already in the TAP test, but doesn't the test case
> to demonstrate that 'two-phase' option can be altered when the
> subscription is disabled actually belong here in the regression
> instead?

Actually, it cannot be done in the main regression test, because altering
two_phase requires a connection between pub/sub, which is not established in the
subscription.sql file. The success case for altering failover has not been tested
either, and I think for the same reason.

> src/test/subscription/t/021_twophase.pl
> 
> 17.
> +# Disable the subscription and alter it to two_phase = false,
> +# verify that the altered subscription reflects the two_phase option.
> 
> /verify/then verify/

Fixed.

> 18.
> +# Now do a prepare on publisher and make sure that it is not replicated.
> +$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
> +$node_publisher->safe_psql(
> +   'postgres', qq{
> +BEGIN;
> +INSERT INTO tab_copy VALUES (100);
> +PREPARE TRANSACTION 'newgid';
> + });
> +
> 
> 18a.
> /on publisher/on the publisher/

Fixed.

> 18b.
> What is that "DROP SUBSCRIPTION tap_sub" doing here? It seems
> misplaced under this comment.

The subscription must be dropped because it also prepares a transaction.
I moved it before the test case and added comments.

> 19.
> +# Make sure that there is 0 prepared transaction on the subscriber
> +$result = $node_subscriber->safe_psql('postgres',
> + "SELECT count(*) FROM pg_prepared_xacts;");
> +is($result, qq(0), 'transaction is prepared on subscriber');
> 
> 19a.
> SUGGESTION
> Make sure there are no prepared transactions on the subscriber

Fixed.

> 19b.
> /'transaction is prepared on subscriber'/'should be no prepared
> transactions on subscriber'/

Replaced.

> 20.
> +# Made sure that the commited transaction is replicated.
> 
> /Made sure/Make sure/
> 
> /commited/committed/

Fixed.

> 21.
> +# Make sure that the two-phase is enabled on the subscriber
> +$result = $node_subscriber->safe_psql('postgres',
> + "SELECT subtwophasestate FROM pg_subscription WHERE subname =
> 'tap_sub_copy';"
> +);
> +is($result, qq(e), 'two-phase is disabled');
> 
> The 'two-phase is disabled' is the identical message used in the
> opposite case earlier, so something is amiss. Maybe this one should
> say 'two-phase should be enabled' and the earlier counterpart should
> say 'two-phase should be disabled'.

Both of them were fixed.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 




v7-0001-Allow-altering-of-two_phase-option-of-a-SUBSCRIPT.patch
Description:  v7-0001-Allow-altering-of-two_phase-option-of-a-SUBSCRIPT.patch


v7-0002-Alter-slot-option-two_phase-only-when-altering-tr.patch
Description:  v7-0002-Alter-slot-option-two_phase-only-when-altering-tr.patch


v7-0003-Abort-prepared-transactions-while-altering-two_ph.patch
Description:  v7-0003-Abort-prepared-transactions-while-altering-two_ph.patch


v7-0004-Add-force_alter-option.patch
Description: v7-0004-Add-force_alter-option.patch


RE: Slow catchup of 2PC (twophase) transactions on replica in LR

2024-04-23 Thread Hayato Kuroda (Fujitsu)
Dear hackers,

Per recent commit (b29cbd3da), our patch needed to be rebased.
Here is an updated version.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 
 


v6-0001-Allow-altering-of-two_phase-option-of-a-SUBSCRIPT.patch
Description:  v6-0001-Allow-altering-of-two_phase-option-of-a-SUBSCRIPT.patch


v6-0002-Alter-slot-option-two_phase-only-when-altering-tr.patch
Description:  v6-0002-Alter-slot-option-two_phase-only-when-altering-tr.patch


v6-0003-Abort-prepared-transactions-while-altering-two_ph.patch
Description:  v6-0003-Abort-prepared-transactions-while-altering-two_ph.patch


v6-0004-Add-force_alter-option.patch
Description: v6-0004-Add-force_alter-option.patch


RE: Slow catchup of 2PC (twophase) transactions on replica in LR

2024-04-22 Thread Hayato Kuroda (Fujitsu)
> Dear Vitaly,
> 
> > I looked at the patch and realized that I can't try it easily in the near 
> > future
> > because the solution I'm working on is based on PG16 or earlier. This patch 
> > is
> > not easily applicable to the older releases. I have to port my solution to 
> > the
> > master, which is not done yet.
> 
> We also tried to port our patch for PG16, but the largest barrier was that a
> replication command ALTER_SLOT is not supported. Since the slot option
> two_phase
> can't be modified, it is difficult to skip decoding PREPARE command even when
> altering the option from true to false.
> IIUC, Adding a new feature (e.g., replication command) for minor updates is
> generally
> prohibited
> 
> We must consider another approach for backpatching, but we do not have
> solutions
> for now.

The attached patch set is a ported version for PG16, which breaks ABI. It can be
used for testing purposes, but it won't be pushed to REL_16_STABLE.
At least, this patch set passes my GitHub CI.

Can you apply it and check whether your issue is solved?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 
 


REL_16_0001-Allow-altering-of-two_phase-option-of-a-SUBSCRIPTION.patch
Description:  REL_16_0001-Allow-altering-of-two_phase-option-of-a-SUBSCRIPTION.patch


REL_16_0002-Alter-slot-option-two_phase-only-when-altering-true-.patch
Description:  REL_16_0002-Alter-slot-option-two_phase-only-when-altering-true-.patch


REL_16_0003-Abort-prepared-transactions-while-altering-two_phase.patch
Description:  REL_16_0003-Abort-prepared-transactions-while-altering-two_phase.patch


REL_16_0004-Add-force_alter-option.patch
Description: REL_16_0004-Add-force_alter-option.patch


RE: Slow catchup of 2PC (twophase) transactions on replica in LR

2024-04-22 Thread Hayato Kuroda (Fujitsu)
Dear hackers,

> Looking at this a bit more, maybe rolling back all prepared transactions on 
> the
> subscriber when toggling two_phase from true to false might not be desirable
> for the customer. Maybe we should have an option for customers to control
> whether transactions should be rolled back or not. Maybe transactions should
> only be rolled back if a "force" option is also set. What do people think?

And here is a patch that adds a new option "force_alter" (a better name is very
welcome). It can be used only when altering the two_phase option. Let me share
examples.

Assume there is a logical replication system with two_phase = on, and there are
prepared transactions:

```
subscriber=# SELECT * FROM pg_prepared_xacts ;
 transaction |       gid        |           prepared            |  owner   | database 
-------------+------------------+-------------------------------+----------+----------
         741 | pg_gid_16390_741 | 2024-04-22 08:02:34.727913+00 | postgres | postgres
         742 | pg_gid_16390_742 | 2024-04-22 08:02:34.729486+00 | postgres | postgres
(2 rows)
```

At that time, altering two_phase to false alone will fail:

```
subscriber=# ALTER SUBSCRIPTION sub DISABLE ;
ALTER SUBSCRIPTION
subscriber=# ALTER SUBSCRIPTION sub SET (two_phase = off);
ERROR:  cannot alter two_phase = false when there are prepared transactions
```

It succeeds if force_alter is also explicitly set. Prepared transactions are
aborted at that time.

```
subscriber=# ALTER SUBSCRIPTION sub SET (two_phase = off, force_alter = on);
ALTER SUBSCRIPTION
subscriber=# SELECT * FROM pg_prepared_xacts ;
 transaction | gid | prepared | owner | database 
-------------+-----+----------+-------+----------
(0 rows)
```

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/global/ 



v5-0001-Allow-altering-of-two_phase-option-of-a-SUBSCRIPT.patch
Description:  v5-0001-Allow-altering-of-two_phase-option-of-a-SUBSCRIPT.patch


v5-0002-Alter-slot-option-two_phase-only-when-altering-tr.patch
Description:  v5-0002-Alter-slot-option-two_phase-only-when-altering-tr.patch


v5-0003-Abort-prepared-transactions-while-altering-two_ph.patch
Description:  v5-0003-Abort-prepared-transactions-while-altering-two_ph.patch


v5-0004-Add-force_alter-option.patch
Description: v5-0004-Add-force_alter-option.patch


RE: Slow catchup of 2PC (twophase) transactions on replica in LR

2024-04-22 Thread Hayato Kuroda (Fujitsu)
Dear Vitaly,

> I looked at the patch and realized that I can't try it easily in the near 
> future
> because the solution I'm working on is based on PG16 or earlier. This patch is
> not easily applicable to the older releases. I have to port my solution to the
> master, which is not done yet.

We also tried to port our patch for PG16, but the largest barrier was that a
replication command ALTER_SLOT is not supported. Since the slot option two_phase
can't be modified, it is difficult to skip decoding PREPARE command even when
altering the option from true to false.
IIUC, adding a new feature (e.g., a replication command) in minor updates is
generally prohibited.

We must consider another approach for backpatching, but we do not have solutions
for now.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/global/ 

 


RE: Disallow changing slot's failover option in transaction block

2024-04-18 Thread Hayato Kuroda (Fujitsu)
Dear Hou,

Thanks for updating the patch! Let me confirm the content.

In your patch, pg_dump.c was updated. IIUC the main reason is that pg_restore
executes some queries as single transactions, so ALTER SUBSCRIPTION cannot be
done, right?
Also, failover was synchronized only when we were in upgrade mode, but your
patch seems to remove that condition. Can you clarify the reason?

Other than that, the patch LGTM.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



RE: Disallow changing slot's failover option in transaction block

2024-04-18 Thread Hayato Kuroda (Fujitsu)
Dear Shveta,

> When I said that option 2 aligns with ALTER-SUB documented behaviour,
> I meant the doc described in [1] stating "When altering the slot_name,
> the failover and two_phase property values of the named slot may
> differ from the counterpart failover and two_phase parameters
> specified in the subscription"
> 
> [1]:
> https://www.postgresql.org/docs/devel/sql-altersubscription.html#SQL-ALTER
> SUBSCRIPTION-PARAMS-SET

I see, thanks for the clarification. Agreed that the description does not
conflict with option 2.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



RE: Disallow changing slot's failover option in transaction block

2024-04-17 Thread Hayato Kuroda (Fujitsu)
Dear Shveta,

Sorry for the delayed response. I missed your post.

> +1.
> 
> Similar to ALTER SUB, CREATE SUB also needs to be fixed. This is
> because we call alter_replication_slot in CREATE SUB as well, for the
> case when slot_name is provided and create_slot=false. But the tricky
> part is we call alter_replication_slot() when creating subscription
> for both failover=true and false. That means if we want to fix it on
> the similar line of ALTER SUB, we have to disallow user from executing
> the CREATE SUBSCRIPTION (slot_name = xx) in a txn block, which seems
> to break some existing use cases. (previously, user can execute such a
> command inside a txn block).
> 
> So, we need to think if there are better ways to fix it.  After
> discussion with Hou-San offlist, here are some ideas:
> 1. do not alter replication slot's failover option when CREATE
> SUBSCRIPTION   WITH failover=false. This means we alter replication
> slot only when failover is set to true. And thus we can disallow
> CREATE SUB WITH (slot_name =xx, failover=true, create_slot=false)
> inside a txn block.
> 
> This option allows user to run CREATE-SUB(create_slot=false) with
> failover=false in txn block like earlier. But on the downside, it
> makes the behavior inconsistent for otherwise simpler option like
> failover,  i.e. with failover=true, CREATE SUB is not allowed in txn
> block while with failover=false, it is allowed. It makes it slightly
> complex to be understood by user.
> 
> 2. let's not disallow CREATE SUB in txn block as earlier, just don't
> perform internal alter-failover during CREATE SUB for existing
> slots(create_slot=false, slot_name=xx)  i.e. when create_slot is
> false, we will ignore failover parameter of CREATE SUB and it is
> user's responsibility to set it appropriately using ALTER SUB cmd. For
> create_slot=true, the behaviour of CREATE-SUB remains same as earlier.
> 
> This option does not add new restriction for CREATE SUB wrt txn block.
> In context of failover with create_slot=false, we already have a
> similar restriction (documented one) for ALTER SUB, i.e. with 'ALTER
> SUBSCRIPTION SET(slot_name = new)', user needs to alter the new slot's
> failover by himself. CREAT SUB can also be documented in similar way.
> This seems simpler to be understood considering existing ALTER SUB's
> behavior as well. Plus, this will make CREATE-SUB code slightly
> simpler and thus easily manageable.
> 
> 3. add a alter_slot option for CREATE SUBSCRIPTION, we can only alter
> the  slot's failover if alter_slot=true. And so we can disallow CREATE
> SUB WITH (slot_name =xx, alter_slot=true) inside a txn block.
> 
> This seems a clean way, as everything will be as per user's consent
> based on alter_slot parameter. But on the downside, this will need
> introducing additional parameter and also adding new restriction of
> running CREATE-sub in txn  block for a specific case.
> 
> 4. Don't alter replication in subscription DDLs. Instead, try to alter
> replication slot's failover in the apply worker. This means we need to
> execute alter_replication_slot each time before starting streaming in
> the apply worker.
> 
> This does not seem appealing to execute alter_replication_slot
> everytime the apply worker starts. But if others think it as a better
> option, it can be further analyzed.

Thanks for describing them. I also prefer 2, because it seems a bit strange that
a CREATE statement leads to an ALTER.

> Currently, our preference is option 2, as that looks a clean solution
> and also aligns with ALTER-SUB behavior which is already documented.
> Thoughts?
> 
> 
> Note that we could not refer to the design of two_phase here, because
> two_phase can be considered as a streaming option, so it's fine to
> change the two_phase along with START_REPLICATION command. (the
> two_phase is not changed in subscription DDLs, but get changed in
> START_REPLICATION command). But the failover is closely related to a
> replication slot itself.
> 

Sorry, I cannot find the statements. Which part were you referring to?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



RE: Disallow changing slot's failover option in transaction block

2024-04-16 Thread Hayato Kuroda (Fujitsu)
Dear Hou,

> Kuroda-San reported an issue off-list that:
> 
> If user execute ALTER SUBSCRIPTION SET (failover) command inside a txn block
> and rollback, only the subscription option change can be rolled back, while 
> the
> replication slot's failover change is preserved.
> 
> This is because ALTER SUBSCRIPTION SET (failover) command internally
> executes
> the replication command ALTER_REPLICATION_SLOT to change the replication
> slot's
> failover property, but this replication command execution cannot be
> rollback.
> 
> To fix it, I think we can prevent user from executing ALTER SUBSCRIPTION set
> (failover) inside a txn block, which is also consistent to the ALTER
> SUBSCRIPTION REFRESH/DROP SUBSCRIPTION command. Attach a small
> patch to address this.

Thanks for posting the patch; the fix matches my expectation.
Also, should we add the restriction to the doc? I feel [1] can be updated.


[1]:https://www.postgresql.org/docs/devel/sql-altersubscription.html#SQL-ALTERSUBSCRIPTION-PARAMS-SET

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 





RE: Slow catchup of 2PC (twophase) transactions on replica in LR

2024-04-15 Thread Hayato Kuroda (Fujitsu)
Dear Amit,

> > FYI - We also considered the idea which walsender waits until all prepared
> transactions
> > are resolved before decoding and sending changes, but it did not work well
> > - the restarted walsender sent only COMMIT PREPARED record for
> transactions which
> > have been prepared before disabling the subscription. This happened because
> > 1) if the two_phase option of slots is false, the confirmed_flush can be 
> > ahead of
> >PREPARE record, and
> > 2) after the altering and restarting, start_decoding_at becomes same as
> >confirmed_flush and records behind this won't be decoded.
> >
> 
> I don't understand the exact problem you are facing. IIUC, if the
> commit is after start_decoding_at point and prepare was before it, we
> expect to send the entire transaction followed by a commit record. The
> restart_lsn should be before the start of such a transaction and we
> should have recorded the changes in the reorder buffer.

This behavior is right for the two_phase = false case. But if the parameter is
altered between PREPARE and COMMIT PREPARED, there is a possibility that only
COMMIT PREPARED is sent. In the first place, the executed workload is below.

1. created a subscription with (two_phase = false)
2. prepared a transaction on publisher
3. disabled the subscription once
4. altered the subscription to two_phase = true
5. enabled the subscription again
6. did COMMIT PREPARED on the publisher

-> The apply worker would raise an ERROR while applying the COMMIT PREPARED record:
ERROR:  prepared transaction with identifier "pg_gid_XXX_YYY" does not exist
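
In SQL terms, the workload is roughly the following (a sketch; object names are
made up, and step 4 assumes the patched ALTER SUBSCRIPTION):

```
-- 1. subscriber
CREATE SUBSCRIPTION sub CONNECTION 'dbname=postgres port=5431'
    PUBLICATION pub WITH (two_phase = false);

-- 2. publisher
BEGIN;
INSERT INTO tbl VALUES (1);
PREPARE TRANSACTION 'gid1';

-- 3.-5. subscriber
ALTER SUBSCRIPTION sub DISABLE;
ALTER SUBSCRIPTION sub SET (two_phase = on);
ALTER SUBSCRIPTION sub ENABLE;

-- 6. publisher
COMMIT PREPARED 'gid1';    -- the apply worker then fails with the ERROR above
```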

The part below describes why the ERROR occurs.

==

### Regarding 1): the confirmed_flush can be ahead of the PREPARE record

If two_phase is off, as you might know, confirmed_flush can be ahead of the
PREPARE record due to the keepalive mechanism.

The walsender sometimes sends a keepalive message in WalSndKeepalive(). Here the
LSN of the lastly decoded record is written. Since the PREPARE record is skipped
(just handled by ReorderBufferProcessXid()), the LSN written in the message can
sometimes be ahead of the PREPARE record. If the WAL records are aligned like
below, the LSN can point to CHECKPOINT_ONLINE.

...
INSERT
PREPARE txn1
CHECKPOINT_ONLINE
...

On the worker side, when it receives the keepalive, it compares the LSN in the
message with the lastly received LSN and advances last_received. Then, the worker
replies to the walsender, reporting that the last_received record has been
flushed on the subscriber. See send_feedback().
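
That part of the apply loop is shaped roughly like this (trimmed from
LogicalRepApplyLoop(); error handling omitted):

```
else if (c == 'k')		/* keepalive from the walsender */
{
	XLogRecPtr	end_lsn = pq_getmsgint64(&s);	/* can be past the skipped
												 * PREPARE record */
	TimestampTz timestamp = pq_getmsgint64(&s);	/* unused in this sketch */
	bool		reply_requested = pq_getmsgbyte(&s);

	if (last_received < end_lsn)
		last_received = end_lsn;

	/* reports last_received back as written and flushed */
	send_feedback(last_received, reply_requested, false);
}
```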
 
On the publisher, when the walsender receives the message from the subscriber,
it reads the message and advances confirmed_flush to the written value. If the
walsender sends an LSN located ahead of PREPARE, confirmed_flush is updated
accordingly.

### Regarding 2): after the altering, records behind confirmed_flush are not decoded

Then, at the decoding phase, the snapshot builder determines the point where
decoding is resumed, as start_decoding_at. After the restart, the value is the
same as the slot's confirmed_flush. Since confirmed_flush is ahead of the PREPARE
record, start_decoding_at is ahead as well, so the whole prepared transaction is
not decoded.

==

The attached zip file contains the PoC and the script used. You can refer to what
I actually did.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



RE: Slow catchup of 2PC (twophase) transactions on replica in LR

2024-04-15 Thread Hayato Kuroda (Fujitsu)
Dear Amit,

> Vitaly, does the minimal solution provided by the proposed patch
> (Allow to alter two_phase option of a subscriber provided no
> uncommitted
> prepared transactions are pending on that subscription.) address your use 
> case?

I think we do not have to handle cases where there are prepared transactions on
the publisher/subscriber, as the first step. That leads to additional complexity,
and we do not have smarter solutions, especially for problem 2.
IIUC it meets Vitaly's condition, right?

> > 1. While toggling two_phase from true to false, we could probably get a 
> > list of
> prepared transactions for this subscriber id and rollback/abort the prepared
> transactions. This will allow the transactions to be re-applied like a normal
> transaction when the commit comes. Alternatively, if this isn't appropriate 
> doing it
> in the ALTER SUBSCRIPTION context, we could store the xids of all prepared
> transactions of this subscription in a list and when the corresponding xid is 
> being
> committed by the apply worker, prior to commit, we make sure the previously
> prepared transaction is rolled back. But this would add the overhead of 
> checking
> this list every time a transaction is committed by the apply worker.
> >
> 
> In the second solution, if you check at the time of commit whether
> there exists a prior prepared transaction then won't we end up
> applying the changes twice? I think we can first try to achieve it at
> the time of Alter Subscription because the other solution can add
> overhead at each commit?

Yeah, at least the second solution might be problematic. I prototyped the first
one and it worked well. However, to make the feature more consistent, having
prepared transactions on the subscriber is prohibited for now.
We can ease the restriction based on requirements.

> > 2. No solution yet.
> >
> 
> One naive idea is that on the publisher we can remember whether the
> prepare has been sent and if so then only send commit_prepared,
> otherwise send the entire transaction. On the subscriber-side, we
> somehow, need to ensure before applying the first change whether the
> corresponding transaction is already prepared and if so then skip the
> changes and just perform the commit prepared. One drawback of this
> approach is that after restart, the prepare flag wouldn't be saved in
> the memory and we end up sending the entire transaction again. One way
> to avoid this overhead is that the publisher before sending the entire
> transaction checks with subscriber whether it has a prepared
> transaction corresponding to the current commit. I understand that
> this is not a good idea even if it works but I don't have any better
> ideas. What do you think?

I considered it, but I am not sure it is good to add such a mechanism. Your idea
requires an additional wait-loop, which might lead to bugs and unexpected
behavior. And it may degrade performance depending on the network environment.
As for the other solution (the worker sends a list of prepared transactions), it
is also not so good because the list of prepared transactions may be huge.

Based on the above, I think we can reject the case for now.

FYI - we also considered the idea that the walsender waits until all prepared
transactions are resolved before decoding and sending changes, but it did not
work well - the restarted walsender sent only the COMMIT PREPARED record for
transactions which had been prepared before disabling the subscription. This
happened because
1) if the two_phase option of the slot is false, confirmed_flush can be ahead of
   the PREPARE record, and
2) after the altering and restarting, start_decoding_at becomes the same as
   confirmed_flush and records behind it won't be decoded.

> > 3. We could mandate that the altering of two_phase state only be done after
> disabling the subscription, just like how it is handled for failover option.
> >
> 
> makes sense.

OK, this spec was added.

According to the above, I updated the patch with Ajin.
0001 - extends the ALTER SUBSCRIPTION statement. A tab-completion was added.
0002 - mandates that the subscription has been disabled. Since there is no need
       to change AtEOXact_ApplyLauncher(), that change was reverted.
       If there are no objections, this can be included in 0001.
0003 - checks whether there are transactions prepared by the worker. If found,
       rejects the ALTER SUBSCRIPTION command.
0004 - checks whether there are transactions prepared on the publisher. The
       backend connects to the publisher and confirms it. If found, rejects the
       ALTER SUBSCRIPTION command.
0005 - adds a TAP test for it.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



v3-0001-Allow-altering-of-two_phase-option-of-a-SUBSCRIPT.patch
Description:  v3-0001-Allow-altering-of-two_phase-option-of-a-SUBSCRIPT.patch


v3-0002-Mandate-the-subscription-h

RE: Slow catchup of 2PC (twophase) transactions on replica in LR

2024-04-10 Thread Hayato Kuroda (Fujitsu)
Dear Amit,

> One naive idea is that on the publisher we can remember whether the
> prepare has been sent and if so then only send commit_prepared,
> otherwise send the entire transaction. On the subscriber-side, we
> somehow, need to ensure before applying the first change whether the
> corresponding transaction is already prepared and if so then skip the
> changes and just perform the commit prepared. One drawback of this
> approach is that after restart, the prepare flag wouldn't be saved in
> the memory and we end up sending the entire transaction again. One way
> to avoid this overhead is that the publisher before sending the entire
> transaction checks with subscriber whether it has a prepared
> transaction corresponding to the current commit. I understand that
> this is not a good idea even if it works but I don't have any better
> ideas. What do you think?

An alternative idea is that the worker passes a list of prepared transactions as
a new option in START_REPLICATION. This can reduce the number of inter-node
communications, but sometimes the list may be huge.
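
Just to sketch the idea (the prepared_xacts option is hypothetical, not an
existing one):

```
START_REPLICATION SLOT "sub" LOGICAL 0/0
    (proto_version '4', publication_names '"pub"',
     prepared_xacts 'pg_gid_16390_741,pg_gid_16390_742')
```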

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 


RE: Improve eviction algorithm in ReorderBuffer

2024-04-10 Thread Hayato Kuroda (Fujitsu)
Dear Heikki,

I also prototyped the idea, which has almost the same shape.
I attached it just in case, but we may not have to look at it.

A few comments based on the experiment.

```
+   /* txn_heap is ordered by transaction size */
+   buffer->txn_heap = pairingheap_allocate(ReorderBufferTXNSizeCompare, NULL);
```

I think the pairing heap should be in the same MemoryContext as the buffer.
Can we add MemoryContextSwitchTo()?
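
Something like this is what I had in mind (a sketch; it assumes buffer->context
is the ReorderBuffer's own context set up earlier in ReorderBufferAllocate()):

```
/* Allocate the max-heap in the ReorderBuffer's memory context */
oldcontext = MemoryContextSwitchTo(buffer->context);

/* txn_heap is ordered by transaction size */
buffer->txn_heap = pairingheap_allocate(ReorderBufferTXNSizeCompare, NULL);

MemoryContextSwitchTo(oldcontext);
```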

```
+   /* Update the max-heap */
+   if (oldsize != 0)
+       pairingheap_remove(rb->txn_heap, &txn->txn_node);
+   pairingheap_add(rb->txn_heap, &txn->txn_node);
...
+   /* Update the max-heap */
+   pairingheap_remove(rb->txn_heap, &txn->txn_node);
+   if (txn->size != 0)
+       pairingheap_add(rb->txn_heap, &txn->txn_node);
```

Since the number of stored transactions does not affect the insert operation, we
may be able to add the node while creating the ReorderBufferTXN and remove it
while cleaning up. This can reduce branches in
ReorderBufferChangeMemoryUpdate().

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



RE: Is this a problem in GenericXLogFinish()?

2024-04-09 Thread Hayato Kuroda (Fujitsu)
Dear Michael,

> On Fri, Apr 05, 2024 at 06:22:58AM +0000, Hayato Kuroda (Fujitsu) wrote:
> > Thanks for pointing out. Yes, we fixed a behavior by the commit aa5edbe37,
> > but we missed the redo case. I made a fix patch based on the suggestion [1].
> 
> +   bool        mod_buf = false;
> 
> Perhaps you could name that mod_wbuf, to be consistent with the WAL
> insert path.

Right, fixed.

> I'm slightly annoyed by the fact that there is no check on
> is_prim_bucket_same_wrt for buffer 1 in the BLK_NEEDS_REDO case to
> show the symmetry between the insert and replay paths.  Shouldn't
> there be at least an assert for that in the branch when there are no
> tuples (imagine an else to cover xldata->ntups == 0)?  I mean between
> just before updating the hash page's opaque area when
> is_prev_bucket_same_wrt.

Indeed, I added an Assert() in the else part. Is it what you expected?

> I've been thinking about ways to make such cases detectable in an
> automated fashion.  The best choice would be
> verifyBackupPageConsistency(), just after RestoreBlockImage() on the
> copy of the block image stored in WAL before applying the page masking
> which would mask the LSN.  A problem, unfortunately, is that we would
> need to transfer the knowledge of REGBUF_NO_CHANGE as a new BKPIMAGE_*
> flag so we would be able to track if the block rebuilt from the record
> has the *same* LSN as the copy used for the consistency check.  So
> this edge consistency case would come at a cost, I am afraid, and the
> 8 bits of bimg_info are precious :/

I could not decide whether the change has more benefit than cost, so I did
nothing for it.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



v2_add_modbuf_flag.diff
Description: v2_add_modbuf_flag.diff


RE: Is this a problem in GenericXLogFinish()?

2024-04-04 Thread Hayato Kuroda (Fujitsu)
Dear Michael,

> There is still some divergence between the code path of
> _hash_freeovflpage() and the replay in hash_xlog_squeeze_page() when
> squeezing a page: we do not set the LSN of the write buffer if
> (xlrec.ntups <= 0 && xlrec.is_prim_bucket_same_wrt &&
> !xlrec.is_prev_bucket_same_wrt) when generating the squeeze record,
> but at replay we call PageSetLSN() on the write buffer and mark it
> dirty in this case.  Isn't that incorrect?  It seems to me that we
> should be able to make the conditional depending on the state of the
> xl_hash_squeeze_page record, no?

Thanks for pointing this out. Yes, we fixed the behavior in commit aa5edbe37,
but we missed the redo case. I made a fix patch based on the suggestion [1].

The part below contains my analysis and how I reproduced the issue.
I posted it to clarify the issue for others.


=

## Pointed issue

Assume the case where
- xlrec.ntups is 0,
- xlrec.is_prim_bucket_same_wrt is true, and
- xlrec.is_prev_bucket_same_wrt is false.
This means that there are several overflow pages and the tail dead page is being
removed. In this case, the primary page does not have to be modified.

During normal operation, the removal is done in _hash_freeovflpage().
If the above condition is met, mod_wbuf is not set to true, so PageSetLSN() is
skipped.

During recovery, the squeeze and removal are done in hash_xlog_squeeze_page().
In this function, PageSetLSN() is called unconditionally.
He said that in this case the PageSetLSN() should be avoided as well.

## Analysis

IIUC there is the same issue as pointed out by [2].
He said that PageSetLSN() should be called only when the page is really modified.
In the case under discussion, wbuf is not modified (just removing the tail
entry), so there is no need to assign an LSN to it. However, we missed updating
the redo case as well.

## How to reproduce

I could reproduce the case with the steps below.
1. Added the debug log like [3].
2. Constructed a physical replication setup.
3. Ran hash_index.sql.
4. Found the added debug log in the server log.

[1]: 
https://www.postgresql.org/message-id/CAA4eK1%2BayneM-8mSRC0iWpDMnm39EwDoqgiOCSqrrMLcdnUnAA%40mail.gmail.com
[2]: https://www.postgresql.org/message-id/ZbyVVG_7eW3YD5-A%40paquier.xyz
[3]:
```
--- a/src/backend/access/hash/hash_xlog.c
+++ b/src/backend/access/hash/hash_xlog.c
@@ -713,6 +713,11 @@ hash_xlog_squeeze_page(XLogReaderState *record)
 		writeopaque->hasho_nextblkno = xldata->nextblkno;
 	}
 
+	if (xldata->ntups == 0 &&
+		xldata->is_prim_bucket_same_wrt &&
+		!xldata->is_prev_bucket_same_wrt)
+		elog(LOG, "XXX no need to set PageSetLSN");
+
```

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



add_modbuf_flag.diff
Description: add_modbuf_flag.diff


RE: Improve eviction algorithm in ReorderBuffer

2024-03-29 Thread Hayato Kuroda (Fujitsu)
Dear Sawada-san,

> Agreed.
> 
> I think the patch is in good shape. I'll push the patch with the
> suggestion next week, barring any objections.

Thanks for working on this. Agreed it is committable.
A few minor comments:

```
+ * Either txn or change must be non-NULL at least. We update the memory
+ * counter of txn if it's non-NULL, otherwise change->txn.
```

IIUC no one checks the restriction. Should we add an Assert() for it, e.g.:
Assert(txn || change)?

```
+/* make sure enough space for a new node */
...
+/* make sure enough space for a new node */
```

Should they start with an upper-case letter?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



RE: Synchronizing slots from primary to standby

2024-03-28 Thread Hayato Kuroda (Fujitsu)
Dear Hou,

Thanks for updating the patch! Here is a comment for it.

```
+/*
+ * By advancing the restart_lsn, confirmed_lsn, and xmin using
+ * fast-forward logical decoding, we can verify whether a consistent
+ * snapshot can be built. This process also involves saving necessary
+ * snapshots to disk during decoding, ensuring that logical decoding
+ * efficiently reaches a consistent point at the restart_lsn without
+ * the potential loss of data during snapshot creation.
+ */
+pg_logical_replication_slot_advance(remote_slot->confirmed_lsn,
+found_consistent_point);
+ReplicationSlotsComputeRequiredLSN();
+updated_lsn = true;
```

You added these like pg_replication_slot_advance(), but that function also calls
ReplicationSlotsComputeRequiredXmin(false) at that time. According to the related
commit b48df81 and discussions [1], I know it is needed only for physical slots,
but it would be more consistent to call ReplicationSlotsComputeRequiredXmin() as
well, per [2]:

```
This may be a waste if no advancing is done, but it could also be an
advantage to enforce a recalculation of the thresholds for each
function call.  And that's more consistent with the slot copy, drop
and creation.
```

What do you think?
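
Concretely, I am suggesting something like this (a sketch):

```
pg_logical_replication_slot_advance(remote_slot->confirmed_lsn,
                                    found_consistent_point);
ReplicationSlotsComputeRequiredLSN();
ReplicationSlotsComputeRequiredXmin(false);    /* suggested addition */
updated_lsn = true;
```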

[1]: 
https://www.postgresql.org/message-id/20200609171904.kpltxxvjzislidks%40alap3.anarazel.de
[2]: https://www.postgresql.org/message-id/20200616072727.GA2361%40paquier.xyz

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



RE: pg_upgrade and logical replication

2024-03-27 Thread Hayato Kuroda (Fujitsu)
Dear Vignesh,

> 
> Recently there was a failure in 004_subscription tap test at [1].
> In this failure, the tab_upgraded1 table was expected to have 51
> records but has only 50 records. Before the upgrade both publisher and
> subscriber have 50 records.

Good catch!

> After the upgrade we have inserted one record in the publisher, now
> tab_upgraded1 will have 51 records in the publisher. Then we start the
> subscriber after changing max_logical_replication_workers so that
> apply workers get started and apply the changes received. After
> starting we enable regress_sub5, wait for sync of regress_sub5
> subscription and check for tab_upgraded1 and tab_upgraded2 table data.
> In a few random cases the one record that was inserted into
> tab_upgraded1 table will not get replicated as we have not waited for
> regress_sub4 subscription to apply the changes from the publisher.
> The attached patch has changes to wait for regress_sub4 subscription
> to apply the changes from the publisher before verifying the data.
> 
> [1] -
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&dt=2024-03-
> 26%2004%3A23%3A13

Yeah, I think it is an oversight in f17529. Previously, subscriptions receiving
changes were confirmed to be caught up; I missed adding that line while
restructuring the script. +1 for your fix.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



RE: speed up a logical replica setup

2024-03-25 Thread Hayato Kuroda (Fujitsu)
Dear Amit, Euler,

> 
> This only drops the publications created by this tool, not the
> pre-existing ones that we discussed in the link provided.

Another concern around here is the case where the primary subscribes to changes
from others. After the conversion, the new subscriber also tries to connect to
the other publisher - this may lead to conflicts. This happens because both the
launcher and workers start after recovery finishes. So, based on Ashutosh's
point, should we remove such replication objects?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



RE: speed up a logical replica setup

2024-03-25 Thread Hayato Kuroda (Fujitsu)
Dear Bharath, Peter,

> Looks like BF animals aren't happy, please check -
> > https://buildfarm.postgresql.org/cgi-bin/show_failures.pl.
> 
> Looks like sanitizer failures.  There were a few messages about that
> recently, but those were all just about freeing memory after use, which
> we don't necessarily require for client programs.  So maybe something else.

It seems that there are several kinds of failures, [1] and [2].

## Analysis for failure 1

The failure is caused by a time lag between when the walreceiver finishes and
when pg_is_in_recovery() starts returning false.

According to the output [1], it seems that the tool failed at
wait_for_end_recovery() with the message "standby server disconnected from the
primary". Also, the lines "redo done at..." and "terminating walreceiver process
due to administrator command" mean that the walreceiver was requested to shut
down by XLogShutdownWalRcv().

According to the source, the walreceiver is shut down in
StartupXLOG()->FinishWalRecovery()->XLogShutdownWalRcv(). Also,
SharedRecoveryState is changed to RECOVERY_STATE_DONE (which makes
pg_is_in_recovery() return false) in the latter part of StartupXLOG().

So, if there is a delay between FinishWalRecovery() and changing the state, the
check in wait_for_end_recovery() can fail during that time. Since we allow
missing the walreceiver 10 times and it is checked once per second, the failure
occurs if the time lag is longer than 10 seconds.

I do not have a good way to fix it. One approach is to make NUM_CONN_ATTEMPTS
larger, but that is not a fundamental solution.
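
For reference, the loop in question behaves roughly like this (a simplified
sketch with hypothetical helper names, not the actual code):

```
for (;;)
{
	if (!server_is_in_recovery(conn))
		break;					/* recovery finished: success */

	/*
	 * The walreceiver was already shut down by XLogShutdownWalRcv(), so
	 * this check fails while SharedRecoveryState is still not DONE.
	 */
	if (!standby_is_connected_to_primary(conn))
		missed++;

	if (missed >= NUM_CONN_ATTEMPTS)	/* 10 attempts */
		pg_fatal("standby server disconnected from the primary");

	pg_usleep(WAIT_INTERVAL * USEC_PER_SEC);	/* checked once per second */
}
```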

## Analysis for failure 2

According to [2], the physical replication slot specified as primary_slot_name
was not used by the walsender process. At that time, the walsender did not exist
yet.

```
...
pg_createsubscriber: publisher: current wal senders: 0
pg_createsubscriber: command is: SELECT 1 FROM pg_catalog.pg_replication_slots 
WHERE active AND slot_name = 'physical_slot'
pg_createsubscriber: error: could not obtain replication slot information: got 
0 rows, expected 1 row
...
```

Currently, the standby must be stopped before the command, and the current code
does not block the flow to ensure the replication has started. So there is a
possibility that the check runs before the walsender is launched.

One possible approach is to wait until the replication starts. An alternative is
to ease the condition.

What do you think?

[1]: 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=serinus&dt=2024-03-25%2013%3A03%3A07
[2]: 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=culicidae&dt=2024-03-25%2013%3A53%3A58

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 


RE: speed up a logical replica setup

2024-03-25 Thread Hayato Kuroda (Fujitsu)
> IIUC, added options were inspired by Tomas. And he said the limitation
> (pub/sub/slot
> name cannot be specified) was acceptable for the first version. I agree with 
> him.
> (To be honest, I feel that options should be fewer for the first release)

Just to confirm - I don't think it is completely unneeded. If we can agree on a
specification at some point, it's OK for me to add them. If you ask me, I still
prefer Tomas's approach.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



RE: speed up a logical replica setup

2024-03-24 Thread Hayato Kuroda (Fujitsu)
Dear Amit, Euler,

I also want to share my opinion, just in case.

> On Thu, Mar 21, 2024 at 8:00 PM Euler Taveira  wrote:
> >
> > On Thu, Mar 21, 2024, at 10:33 AM, vignesh C wrote:
> >
> > If we commit this we might not be able to change the way the option
> > behaves once the customers starts using it. How about removing these
> > options in the first version and adding it in the next version after
> > more discussion.
> >
> >
> > We don't need to redesign this one if we want to add a format string in a 
> > next
> > version. A long time ago, pg_dump started to accept pattern for tables 
> > without
> > breaking or deprecating the -t option. If you have 100 databases and you 
> > don't
> > want to specify the options or use a script to generate it for you, you also
> > have the option to let pg_createsubscriber generate the object names for 
> > you.
> > Per my experience, it will be a rare case.
> >
> 
> But, why go with some UI in the first place which we don't think is a
> good one, or at least don't have a broader agreement that it is a good
> one? The problem with self-generated names for users could be that
> they won't be able to make much out of it. If one has to always use
> those internally then probably that would be acceptable. I would
> prefer what Tomas proposed a few emails ago as compared to this one as
> that has fewer options to be provided by users but still, they can
> later identify objects. But surely, we should discuss if you or others
> have better alternatives.

IIUC, the added options were inspired by Tomas. And he said the limitation
(pub/sub/slot names cannot be specified) was acceptable for the first version. I
agree with him.
(To be honest, I feel that there should be fewer options for the first release.)

> The user choosing to convert a physical standby to a subscriber would
> in some cases be interested in converting it for all the databases
> (say for the case of upgrade [1]) and giving options for each database
> would be cumbersome for her.

+1 for the primary use case.

> > Currently dry-run will do the check and might fail on identifying a
> > few failures like after checking subscriber configurations. Then the
> > user will have to correct the configuration and re-run then fix the
> > next set of failures. Whereas the suggest-config will display all the
> > optimal configuration for both the primary and the standby in a single
> > shot. This is not a must in the first version, it can be done as a
> > subsequent enhancement.
> >
> >
> > Do you meant publisher, right? Per order, check_subscriber is done before
> > check_publisher and it checks all settings on the subscriber before 
> > exiting. In
> > v30, I changed the way it provides the required settings. In a previous 
> > version,
> > it fails when it found a wrong setting; the current version, check all 
> > settings
> > from that server before providing a suitable error.
> >
> > pg_createsubscriber: checking settings on publisher
> > pg_createsubscriber: primary has replication slot "physical_slot"
> > pg_createsubscriber: error: publisher requires wal_level >= logical
> > pg_createsubscriber: error: publisher requires 2 replication slots, but 
> > only 0
> remain
> > pg_createsubscriber: hint: Consider increasing max_replication_slots to at 
> > least
> 3.
> > pg_createsubscriber: error: publisher requires 2 wal sender processes, but 
> > only
> 0 remain
> > pg_createsubscriber: hint: Consider increasing max_wal_senders to at least 
> > 3.
> >
> > If you have such an error, you will fix them all and rerun using dry run 
> > mode
> > again to verify everything is ok. I don't have a strong preference about 
> > it. It
> > can be changed easily (unifying the check functions or providing a return 
> > for
> > each of the check functions).
> >
> 
> We can unify the checks but not sure if it is worth it. I am fine
> either way. It would have been better if we provided a way for a user
> to know the tool's requirement rather than letting him know via errors
> but I think it should be okay to extend it later as well.

Both ways are OK, but I slightly prefer to unify the checks. The number of
working modes in the same executable should be reduced as much as possible.

Also, I feel the current specification is acceptable. pg_upgrade checks one
cluster at a time and exits early in bad cases. If both the old and new clusters
have issues, the first run only reports the old one's issues. After the DBA
fixes them and runs again, issues on the new cluster are reported.

[1]: 
https://www.postgresql.org/message-id/8d52c226-7e34-44f7-a919-759bf8d81541%40enterprisedb.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



RE: Have pg_basebackup write "dbname" in "primary_conninfo"?

2024-03-19 Thread Hayato Kuroda (Fujitsu)
Dear Sawada-san,

Thanks for the comments!

> This behavior makes sense to me. But do we want to handle the case of
> using environment variables too? 

Yeah, v5 does not consider libpq parameters specified by environment variables.
Such a variable should be used when the dbname is not explicitly written in the
connection string.
Such a path was added in the v6 patch. If the dbname is not determined after
parsing the connection string, we call PQconndefaults() to get settings from
environment variables and service files [1], then search for the dbname again.
Below is an example.

```
PGPORT=5431 PGUSER=kuroda PGDATABASE=postgres pg_basebackup -D data_N2 -R -v
->
primary_conninfo = 'user=kuroda ... port=5431 ... dbname=postgres ... '
```
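
The fallback is essentially the following (a simplified sketch; the actual patch
differs in details):

```
/*
 * dbname was not found in the connection string; consult the defaults,
 * which include environment variables and service files.
 */
PQconninfoOption *defs = PQconndefaults();

for (PQconninfoOption *opt = defs; opt && opt->keyword; opt++)
{
	if (strcmp(opt->keyword, "dbname") == 0 && opt->val && opt->val[0])
		dbname = pg_strdup(opt->val);
}
PQconninfoFree(defs);
```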

> IIUC,
>
> pg_basebackup -D tmp -d "user=masahiko dbname=test_db"
>
> is equivalent to:
>
> PGDATABASE="user=masahiko dbname=test_db" pg_basebackup -D tmp

That case won't work. I think you assumed that expand_dbname, as in
PQconnectdbParams() [2], can be used for environment variables, but that is not
correct - the value won't be parsed as a connection string again.

In the libpq layer, connection parameters are parsed in 
PQconnectStartParams()->conninfo_array_parse().
When expand_dbname is specified, the "dbname" entry is checked first and its
value is parsed. This is done at fe-connect.c:5846.

The environment variables are checked and parsed in conninfo_add_defaults(), 
which
is called from conninfo_array_parse(). However, that happens at
fe-connect.c:5956 - the expand_dbname handling has already finished by that
time. This means there is no chance that PGDATABASE is parsed in the expanded
style.

For example, if the pg_basebackup runs like below:

PGDATABASE="user=kuroda dbname=postgres" pg_basebackup -D data_N2 -R -v

The primary_conninfo written in the file will be:

primary_conninfo = 'user=hayato ... dbname=''user=kuroda dbname=postgres'''

[1]: https://www.postgresql.org/docs/devel/libpq-pgservice.html
[2]: 
https://www.postgresql.org/docs/devel/libpq-connect.html#LIBPQ-PQCONNECTDBPARAMS

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



pg_basebackup-write-dbname.v0006.patch
Description: pg_basebackup-write-dbname.v0006.patch


RE: speed up a logical replica setup

2024-03-18 Thread Hayato Kuroda (Fujitsu)
Dear Peter,

Thanks for the comments. I would like to reply to some of them.

> > 17.
> >
> > "specifies promote"
> >
> > We can do double-quote for the word promote.
> 
> The v30 patch has promote, which I think is adequate.

Oops. I was actually looking at the v29 patch during my first review. Sorry for the noise.

> 
> > 20.
> >
> > New options must be documented as well. This helps not only users but
> > also reviewers. (Sometimes we cannot tell whether the implementation is
> > intentional or not.)
> 
> Which ones are missing?

In v29, the newly added options (publication/subscription/replication-slot)
were not documented. Since they are documented now, please ignore.

> > 21.
> >
> > Also, not sure the specification is good. I preferred to specify them by 
> > format
> > string. Because it can reduce the number of arguments and I cannot find use
> cases
> > which user want to control the name of objects.
> >
> > However, your approach has a benefit which users can easily identify the
> generated
> > objects by pg_createsubscriber. How do other think?
> 
> I think listing them explicitly is better for the first version.  It's
> simpler to implement and more flexible.

OK.

> > 22.
> >
> > ```
> > #define BASE_OUTPUT_DIR	"pg_createsubscriber_output.d"
> > ```
> >
> > No one refers the define.
> 
> This is gone in v30.

I wrote that for the same reason as above. Please ignore...

> 
> > 31.
> >
> > ```
> > /* Create replication slot on publisher */
> > if (lsn)
> > pg_free(lsn);
> > ```
> >
> > I think allocating/freeing memory is not so efficient.
> > Can we add a flag to create_logical_replication_slot() for controlling the
> > return value (NULL or a duplicated string)? We can use the condition
> > (i == num_dbs-1) as the flag.
> 
> Nothing is even using the return value of
> create_logical_replication_slot().  I think this can be removed altogether.

> > 37.
> >
> > ```
> > /* Register a function to clean up objects in case of failure */
> > atexit(cleanup_objects_atexit);
> > ```
> >
> > Sorry if we have already discussed this. I think the registration can be
> > moved to just before the standby is started. Before that, the callback is
> > a no-op.
> 
> But it can also stay where it is.  What is the advantage of moving it later?

I thought we could reduce the risk of bugs. Previously, some bugs were reported
because the registration was done too early. However, this is not a strong opinion.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



RE: PostgreSQL 17 Release Management Team & Feature Freeze

2024-03-18 Thread Hayato Kuroda (Fujitsu)
Dear Michael,

> If you think that this is OK, and as far as I can see this looks OK on
> the thread, then this open item should be moved under "resolved before
> 17beta1", mentioning the commit involved in the fix.  Please see [1]
> for examples.

OK, I understand that I should wait for your check. Thanks.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 





RE: speed up a logical replica setup

2024-03-17 Thread Hayato Kuroda (Fujitsu)
here another rule?

23.

```
static int  num_dbs = 0;
static int  num_pubs = 0;
static int  num_subs = 0;
static int  num_replslots = 0;
```

I think the names are a bit confusing. The number of generated
publications/subscriptions/replication slots is always the same as the number
of databases; these variables just indicate how many were explicitly
specified.

My idea is num_custom_pubs or something similar. Thoughts?

24.

```
/* standby / subscriber data directory */
static char *subscriber_dir = NULL;
```

It is a bit strange that only subscriber_dir is a global variable. Every caller
requires the CreateSubscriberOptions as an argument, except
cleanup_objects_atexit() and main(). So, how about making
`CreateSubscriberOptions opt` a global as well?

25.

```
 * Replication slots, publications and subscriptions are created. Depending on
 * the step it failed, it should remove the already created objects if it is
 * possible (sometimes it won't work due to a connection issue).
```

I think it should be stated here that subscriptions won't be removed, along
with the reason.

26.

```

/*
 * If the server is promoted, there is no way to use the current setup
 * again. Warn the user that a new replication setup should be done before
 * trying again.
 */
```

Per comment 25, we can add a reference like "See comments atop the function"

27.

usage() was not updated based on recent changes.

28.

```
if (strcmp(conn_opt->keyword, "dbname") == 0 && conn_opt->val != NULL)
{
	if (dbname)
		*dbname = pg_strdup(conn_opt->val);
	continue;
}
```

There is a memory leak if multiple dbname entries are specified in the conninfo.
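
A minimal sketch of one possible fix, assuming the caller initializes *dbname
to NULL (pg_free() is a no-op for NULL pointers):

```
if (strcmp(conn_opt->keyword, "dbname") == 0 && conn_opt->val != NULL)
{
	if (dbname)
	{
		/* discard the copy made for a previous "dbname" entry, if any */
		pg_free(*dbname);
		*dbname = pg_strdup(conn_opt->val);
	}
	continue;
}
```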

29.

```
pg_prng_seed(&prng_state, (uint64) (getpid() ^ time(NULL)));
```

No need to initialize the seed every time. Can you reuse pg_prng_state?
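
A minimal sketch of what I have in mind (the prng_seeded flag is hypothetical):

```
static pg_prng_state prng_state;
static bool prng_seeded = false;
...
/* seed the generator only on the first call */
if (!prng_seeded)
{
	pg_prng_seed(&prng_state, (uint64) (getpid() ^ time(NULL)));
	prng_seeded = true;
}
```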

30.

```
if (num_replslots == 0)
	dbinfo[i].replslotname = pg_strdup(genname);
```

I think the straightforward way is to use the name of the subscription if no
slot name is specified. This follows the default rule of CREATE SUBSCRIPTION.
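
A minimal sketch of the idea, assuming dbinfo has a subname field holding the
subscription name:

```
if (num_replslots == 0)
	dbinfo[i].replslotname = pg_strdup(dbinfo[i].subname);
```

This matches CREATE SUBSCRIPTION, whose slot_name defaults to the subscription
name.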

31.

```
/* Create replication slot on publisher */
if (lsn)
pg_free(lsn);
```

I think allocating/freeing memory like this is not very efficient.
Can we add a flag to create_logical_replication_slot() for controlling the
return value (NULL or a duplicated string)? We can use the condition
(i == num_dbs-1) as the flag.

32.

```
/*
 * Close the connection. If exit_on_error is true, it has an undesired
 * condition and it should exit immediately.
 */
static void
disconnect_database(PGconn *conn, bool exit_on_error)
```

In the case of disconnect_database(), the second argument should have a
different name: if it is true, the process exits unconditionally.
Also, the comments atop the function must be fixed.


33.

```
wal_level = strdup(PQgetvalue(res, 0, 0));
```

pg_strdup should be used here.

34.

```
{"config-file", required_argument, NULL, 1},
{"publication", required_argument, NULL, 2},
{"replication-slot", required_argument, NULL, 3},
{"subscription", required_argument, NULL, 4},
```

The ordering looks strange to me. Following pg_upgrade and pg_basebackup,
options which do not have a short form are listed last.

35.

```
opt.sub_port = palloc(16);
```

Per other lines, pg_malloc() should be used.

36.

```
pg_free(opt.sub_port);
```

You said that leaks are not a concern here. If so, why does only 'p' have
pg_free()?

37.

```
/* Register a function to clean up objects in case of failure */
atexit(cleanup_objects_atexit);
```

Sorry if we have already discussed this. I think the registration can be moved
to just before the standby is started. Before that, the callback is a no-op.

38.

```
/* Subscriber PID file */
snprintf(pidfile, MAXPGPATH, "%s/postmaster.pid", subscriber_dir);

/*
 * If the standby server is running, stop it. Some parameters (that can
 * only be set at server start) are informed by command-line options.
 */
if (stat(pidfile, &statbuf) == 0)
```

Hmm. pidfile is used only here, but it is declared in main(). Can this be
separated into another function like is_standby_started()?
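
A minimal sketch based on the quoted code:

```
static bool
is_standby_started(const char *datadir)
{
	char		pidfile[MAXPGPATH];
	struct stat statbuf;

	/* the postmaster creates postmaster.pid while it is running */
	snprintf(pidfile, MAXPGPATH, "%s/postmaster.pid", datadir);
	return (stat(pidfile, &statbuf) == 0);
}
```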

39.

Or, we may be able to introduce "restart_standby_if_needed" or something.

40.

```
 * XXX this code was extracted from BootStrapXLOG().
```

So, can we extract the common part somewhere? Since the system identifier is
related to the controldata file, I think it could be located in
controldata_util.c.

41.

You said the following in [1], but I could not find the related fix. Can you
clarify?

> That's a good point. We should state in the documentation that GUCs specified 
> in
> the command-line options are ignored during the execution.

[1]: 
https://www.postgresql.org/message-id/40595e73-c7e1-463a-b8be-49792e870007%40app.fastmail.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/global/ 



RE: PostgreSQL 17 Release Management Team & Feature Freeze

2024-03-17 Thread Hayato Kuroda (Fujitsu)
Dear Michael,

> We are pleased to announce the Release Management Team (RMT) (cc'd)
> for the PostgreSQL 17 release:
> - Robert Haas
> - Heikki Linnakangas
> - Michael Paquier

Thanks for managing the PostgreSQL release and keeping it headed in the right direction!

> You can track open items for the PostgreSQL 17 release here:
> https://wiki.postgresql.org/wiki/PostgreSQL_17_Open_Items

I think the entry can be closed:

```
pg_upgrade with --link mode failing upgrade of publishers
Commit: 29d0a77fa660
Owner: Amit Kapila
```

The reported failure was only related to the test script, not the feature.
The issue has already been analyzed and the fix was pushed as f17529b710.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 





RE: speed up a logical replica setup

2024-03-10 Thread Hayato Kuroda (Fujitsu)
Dear Vignesh,

Thanks for updating the patch, but cfbot is still unhappy [1].
Note that the two containers (autoconf and meson) failed at different places,
so I do not think it is a random failure. It seems that there may be a bug
related to the 32-bit build.

We should investigate and fix it as soon as possible.

[1]: http://cfbot.cputube.org/highlights/all.html#4637

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



RE: speed up a logical replica setup

2024-03-08 Thread Hayato Kuroda (Fujitsu)
Dear Tomas, Euler,

Thanks for starting to read the thread! Since I'm not the original author,
I'll reply to only some of the points.

> I decided to take a quick look on this patch today, to see how it works
> and do some simple tests. I've only started to get familiar with it, so
> I have only some comments / questions regarding usage, not on the code.
> It's quite possible I didn't understand some finer points, or maybe it
> was already discussed earlier in this very long thread, so please feel
> free to push back or point me to the past discussion.
> 
> Also, some of this is rather opinionated, but considering I didn't see
> this patch before, my opinions may easily be wrong ...

I felt your comments were quite valuable.

> 1) SGML docs
> 
> It seems the SGML docs are more about explaining how this works on the
> inside, rather than how to use the tool. Maybe that's intentional, but
> as someone who didn't work with pg_createsubscriber before I found it
> confusing and not very helpful.
>
> For example, the first half of the page is prerequisites+warning, and
> sure those are useful details, but prerequisites are checked by the
> tool (so I can't really miss this) and warnings go into a lot of details
> about different places where things may go wrong. Sure, worth knowing
> and including in the docs, but maybe not right at the beginning, before
> I learn how to even run the tool?

Hmm, right. I considered the improvements below. Tomas and Euler, what do you
think?

* Add more descriptions to the "Description" section.
* Move the prerequisites+warning to the "Notes" section.
* Add a "Usage" section which describes the setup starting from a single node.

> I'm not sure FOR ALL TABLES is a good idea. Or said differently, I'm
> sure it won't work for a number of use cases. I know large databases
> it's common to create "work tables" (not necessarily temporary) as part
> of a batch job, but there's no need to replicate those tables.

Indeed, the documentation does not state that all tables in the database
will be included in the publication.

> I do understand that FOR ALL TABLES is the simplest approach, and for v1
> it may be an acceptable limitation, but maybe it'd be good to also
> support restricting which tables should be replicated (e.g. blacklist or
> whitelist based on table/schema name?).

This may not be directly related, but we considered accepting such options as
a next step [1].

> Note: I now realize this might fall under the warning about DDL, which
> says this:
> 
> Executing DDL commands on the source server while running
> pg_createsubscriber is not recommended. If the target server has
> already been converted to logical replica, the DDL commands must
> not be replicated so an error would occur.

Yeah, they would not be replicated, but they do not lead to an ERROR.
So should we say something like "Creating tables on the source server..."?

> 5) slot / publication / subscription name
> 
> I find it somewhat annoying it's not possible to specify names for
> objects created by the tool - replication slots, publication and
> subscriptions. If this is meant to be a replica running for a while,
> after a while I'll have no idea what pg_createsubscriber_569853 or
> pg_createsubscriber_459548_2348239 was meant for.
> 
> This is particularly annoying because renaming these objects later is
> either not supported at all (e.g. for replication slots), or may be
> quite difficult (e.g. publications).
> 
> I do realize there are challenges with custom names (say, if there are
> multiple databases to replicate), but can't we support some simple
> formatting with basic placeholders? So we could specify
> 
> --slot-name "myslot_%d_%p"
> 
> or something like that?

Not sure we can do it in the first version, but it looks nice. One concern is
that I cannot find client applications which accept escape strings like
log_line_prefix. (It may just be because we do not have a use case.) Do you
know of examples?

> BTW what will happen if we convert multiple standbys? Can't they all get
> the same slot name (they all have the same database OID, and I'm not
> sure how much entropy the PID has)?

I tested it, and the second try did not work. The primary reason was the
publication name - pg_createsubscriber_%u (oid).
FYI - previously we could reuse the same publications, but based on my comment
[2] that feature was removed. It might have been too optimistic.

[1]: 
https://www.postgresql.org/message-id/TY3PR01MB9889CCBD4D9DAF8BD2F18541F56F2%40TY3PR01MB9889.jpnprd01.prod.outlook.com
[2]: 
https://www.postgresql.org/message-id/TYCPR01MB12077756323B79042F29DDAEDF54C2%40TYCPR01MB12077.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



RE: speed up a logical replica setup

2024-03-06 Thread Hayato Kuroda (Fujitsu)
nvironment
 * variable sets it. If not, obtain the operating system name of the user
 * running it.
 */
```
There is an unnecessary blank line.

08. main
```
char   *errstr = NULL;
```
 
This declaration can be moved into the else-branch.

09. main

Also, in the first place, do we have to get the username if it is not specified?
I feel libpq can handle that case if we just skip passing the info.

10. main
```
appendPQExpBuffer(sub_conninfo_str,
				  "host=%s port=%u user=%s fallback_application_name=%s",
				  opt.socket_dir, opt.sub_port, opt.sub_username, progname);
sub_base_conninfo = get_base_conninfo(sub_conninfo_str->data, NULL);
```

Is it really necessary to call get_base_conninfo()? I think there is no need
to define sub_base_conninfo.

11. main

```
/*
 * In dry run mode, the server is restarted with the provided command-line
 * options so validation can be applied in the target server. In order to
 * preserve the initial state of the server (running), start it without
 * the command-line options.
 */
if (dry_run)
	start_standby_server(&opt, pg_ctl_path, NULL, false);
```

I think the initial state of the server may also be stopped; currently both
states are allowed. And I think it is not good to leave the logfile
unspecified.

12. others

As Peter E pointed out [1], the main function is still huge: it has more than
400 lines. I think all functions should have fewer than 100 lines to keep them
readable.

I considered a separation like the one below. Note that this may require
changing the ordering. What do you think?

* add parse_command_options() which accepts user options and verifies them
* add verification_phase() or something which checks the system identifier and
  calls check_XXX
* add catchup_phase() or something which creates a temporary slot, writes
  recovery parameters, and waits until the end of recovery
* add cleanup_phase() or something which removes primary_slot and modifies the
  system identifier
* stop/start server can be combined into one wrapper.

The attached txt file is a proof of concept.

13. others

PQresultStatus(res) is called 17 times in this source file, which seems
redundant. I think we can introduce a function like executeQueryOrDie() and
gather the checks in one place.
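
A minimal sketch of such a helper (the name and error handling are
illustrative; it reuses disconnect_database() from the patch):

```
static PGresult *
executeQueryOrDie(PGconn *conn, const char *query)
{
	PGresult   *res = PQexec(conn, query);
	ExecStatusType status = PQresultStatus(res);

	if (status != PGRES_TUPLES_OK && status != PGRES_COMMAND_OK)
	{
		pg_log_error("query failed: %s", PQresultErrorMessage(res));
		pg_log_error_detail("Query was: %s", query);
		PQclear(res);
		disconnect_database(conn, true);	/* does not return */
	}
	return res;
}
```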

14. others

I found that pg_createsubscriber does not refer to functions declared in
other files. Is there a possibility of using them, e.g., streamutil.h?

15. others 

While reading the old discussions [2], I saw that Amit suggested keeping the
comment and avoiding the creation of a temporary slot. You said "Got it", but
the temporary slot still exists. Is there a reason? Can you clarify your
opinion?

16. others

While reading [2] and [3], I was confused by the decision. You and Amit
discussed the combination of pg_createsubscriber and slot sync, and how to
handle slots on the physical standby. You seemed to agree to remove such
slots, and Amit also suggested raising an ERROR. However, you said in [8] that
such handling is not mandatory, so a WARNING should be raised in dry_run mode.
I am quite confused. Am I missing something?

17. others

Per the discussion around [4], we might have to consider the case where
options like data_directory and config_file were initially specified for the
standby server. Another easy approach is to allow users to specify options
like -o in pg_upgrade [5], which is similar to your idea. Thoughts?


18. others

How do you handle the reported failure [6]?

19. main
```
char   *pub_base_conninfo = NULL;
char   *sub_base_conninfo = NULL;
char   *dbname_conninfo = NULL;
```

No need to initialize pub_base_conninfo and sub_base_conninfo.
Also, these variables are never free'd.

20. others

IIUC, slot creation will not finish if there are prepared transactions.
Should we detect this in the verification phase and raise an ERROR?
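
A hypothetical check for the verification phase (pg_prepared_xacts is the
system view listing prepared transactions):

```
res = PQexec(conn, "SELECT count(*) FROM pg_prepared_xacts");
if (PQresultStatus(res) == PGRES_TUPLES_OK &&
	strcmp(PQgetvalue(res, 0, 0), "0") != 0)
	pg_log_error("slot creation cannot finish while prepared transactions exist");
PQclear(res);
```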

21. others

As I said in [7], the catch-up will not finish if a long
recovery_min_apply_delay is used. Should we override it during the catch-up?
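
A hypothetical way to do that while writing the recovery parameters
(recoveryconfcontents is illustrative, following pg_basebackup's
GenerateRecoveryConfig()):

```
/* neutralize any configured apply delay during the catch-up */
appendPQExpBufferStr(recoveryconfcontents,
					 "recovery_min_apply_delay = 0\n");
```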

22. pg_createsubscriber.sgml
```

 Check
 Write recovery parameters into the target data...
```

Not sure, but "Check" seems unnecessary.

[1]: 
https://www.postgresql.org/message-id/b9aa614c-84ba-a869-582f-8d5e3ab57424%40enterprisedb.com
[2]: 
https://www.postgresql.org/message-id/9fd3018d-0e5f-4507-aee6-efabfb5a4440%40app.fastmail.com
[3]: 
https://www.postgresql.org/message-id/CAA4eK1L%2BE-bdKaOMSw-yWizcuprKMyeejyOwWjq_57%3DUqh-f%2Bg%40mail.gmail.com
[4]: 
https://www.postgresql.org/message-id/TYCPR01MB12077B63D81B49E9DFD323661F55A2%40TYCPR01MB12077.jpnprd01.prod.outlook.com
[5]: 
https://www.postgresql.org/docs/devel/pgupgrade.html#:~:text=options%20to%20be%20passed%20directly%20to%20the%20old%20postgres%20command%3B%20multiple%20option%20invocations%20are%20appended
[6]: 
https://www.postgresql.org/message-id/CAHv8Rj%2B5mzK9Jt%2B7ECogJzfm5czvDCCd5jO1_rCx0bTEYpBE5g%40mail.gmail.com
[7]: 
https:
