Hello,

At Materialize we observed strange behavior of walsender during logical
replication: it sends a flood of keepalive messages to the subscriber when
the database the walsender is running on is itself syncing a large table
from another, unrelated database.

The setup that reproduces the problem requires three database instances: A,
B, and C. Performing the following actions leads to B sending a flood of
keepalive messages to C:

First, in db A, we create a large table:

CREATE TABLE large (a int);
INSERT INTO large (SELECT generate_series(1, 100000000));
CREATE PUBLICATION large_pub FOR TABLE large;

Then, in db B, we create a tiny table:

CREATE TABLE tiny (a int);
INSERT INTO tiny VALUES (1);
CREATE PUBLICATION tiny_pub FOR TABLE tiny;

Then, in db C, we subscribe to tiny_pub:

CREATE TABLE tiny(a int);
CREATE SUBSCRIPTION tiny_sub CONNECTION 'host=B' PUBLICATION tiny_pub;

At this point db C receives a keepalive message only occasionally, at
intervals governed by the wal_sender_timeout parameter.
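For reference, the keepalives in question are the "Primary keepalive
message" ('k') of the streaming replication protocol. A minimal sketch of
decoding one in Python (the message layout is taken from the protocol
documentation; the sample bytes below are made up):

```python
import struct
from datetime import datetime, timedelta, timezone

# PostgreSQL timestamps in the replication protocol count microseconds
# since 2000-01-01 00:00:00 UTC.
PG_EPOCH = datetime(2000, 1, 1, tzinfo=timezone.utc)

def parse_keepalive(msg: bytes) -> dict:
    """Decode a primary keepalive message from the replication stream.

    Layout (big-endian): Byte1 'k', Int64 walEnd, Int64 sendTime
    (microseconds since the PG epoch), Byte1 replyRequested.
    """
    kind, wal_end, send_time_us, reply_requested = struct.unpack("!cqqB", msg)
    assert kind == b"k", "not a primary keepalive message"
    return {
        "wal_end": wal_end,
        "send_time": PG_EPOCH + timedelta(microseconds=send_time_us),
        "reply_requested": bool(reply_requested),
    }

# Fabricated sample: walEnd 0/16B3748, sent at the PG epoch, no reply needed.
sample = struct.pack("!cqqB", b"k", 0x16B3748, 0, 0)
info = parse_keepalive(sample)
```

When replyRequested is set, the subscriber is expected to answer as soon as
possible, which is what makes a flood of these messages costly downstream.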

Finally, in db B, we subscribe to large_pub:

CREATE TABLE large(a int);
CREATE SUBSCRIPTION large_sub CONNECTION 'host=A' PUBLICATION large_pub;

This triggers a flood of keepalive messages from B to C, even though C
doesn't need to learn anything about the large table, nor does it seem to
perform any actions with the knowledge transferred by the keepalive
messages.

I used the patch attached to this message to log every received keepalive
in order to count them, and observed a rate of about 20 keepalives per
second, sustained for multiple minutes.
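The rate was computed from the timestamps of the logged keepalives; as a
sketch (the timestamps below are hypothetical, standing in for the real
log):

```python
from datetime import datetime, timedelta

def keepalive_rate(timestamps: list) -> float:
    """Given sorted datetimes of logged keepalive messages, return the
    average messages-per-second rate over the observed span."""
    if len(timestamps) < 2:
        return 0.0
    span = (timestamps[-1] - timestamps[0]).total_seconds()
    return (len(timestamps) - 1) / span if span > 0 else float("inf")

# Hypothetical example: 21 keepalives spaced 50 ms apart, i.e. 20 per second.
start = datetime(2025, 1, 1)
ts = [start + timedelta(milliseconds=50 * i) for i in range(21)]
```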

We identified the code sending these keepalives as this if statement:

https://github.com/postgres/postgres/blob/REL_18_STABLE/src/backend/replication/walsender.c#L1895-L1906

The comment on this if statement seems to imply that these keepalives are
relevant to synchronous replication and shutdown, but neither of those is
actually happening in the reproduction. There is another section of
walsender with a similar-looking comment which does have an explicit check
for synchronous replication:
https://github.com/postgres/postgres/blob/REL_18_STABLE/src/backend/replication/walsender.c#L1695-L1697

Is this behavior expected? Does the identified if statement also need a
synchronous replication guard?

For full context, the real system this was observed on was a Materialize
instance, which supports importing tables from PostgreSQL databases using
logical replication. In our implementation, receiving a keepalive triggers
a non-trivial amount of work, and the flood of keepalives caught us by
surprise, causing high CPU usage.

Best,
Petros

Attachment: 0001-logical-worker-log-received-keepalive-messages.patch