Hi all,

We recently ran into a memory leak in a production logical-replication WAL-sender process. A simplified reproduction script is attached.

If you run the script and then call MemoryContextStats(TopMemoryContext), you will see something like:

"logical replication cache context: 562044928 total in 77 blocks;"

meaning “cachectx” has grown to ~500 MB, and it keeps growing as the number of tables increases.

The workload can be summarised as follows:

1. CREATE PUBLICATION FOR ALL TABLES
2. CREATE SUBSCRIPTION
3. Repeatedly CREATE TABLE and DROP TABLE

cachectx is used mainly for the entry->old_slot, entry->new_slot and entry->attrmap allocations. When a DROP TABLE causes an invalidation, we only set entry->replicate_valid = false; we do not free those allocations immediately. They are freed only if the same entry is used again. In some workloads an entry may never be reused, or it may be reused briefly and then become unreachable forever. (The WAL sender may still need to decode WAL records for tables that have already been dropped while it is processing the invalidation.)

Given the current design I don’t see a simple fix. Perhaps RelationSyncCache needs some kind of eviction/cleanup policy to prevent this memory growth in such scenarios.

Does anyone have ideas or suggestions?
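
To make the mechanism concrete, here is a toy Python model of the behaviour described above. It is not PostgreSQL code: the class ToyRelationSyncCache, the SLOT_BYTES constant, and the sweep_invalid() cleanup pass are all hypothetical names invented for this sketch, and the byte counts are arbitrary. It only assumes what is stated above, namely that an invalidation clears replicate_valid without freeing anything, and that stale allocations are released only when the same entry is looked up again.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    # Stand-in for a cache entry; allocated_bytes models the memory held
    # by entry->old_slot, entry->new_slot and entry->attrmap.
    replicate_valid: bool = False
    allocated_bytes: int = 0


class ToyRelationSyncCache:
    """Hypothetical model of the lazy-free pattern, not the real cache."""

    SLOT_BYTES = 8192  # arbitrary per-entry allocation size for the model

    def __init__(self):
        self.entries = {}        # relid -> Entry
        self.cachectx_bytes = 0  # models the size of cachectx

    def get_entry(self, relid):
        entry = self.entries.setdefault(relid, Entry())
        if not entry.replicate_valid:
            # Stale allocations are released only here, on reuse.
            self.cachectx_bytes -= entry.allocated_bytes
            entry.allocated_bytes = self.SLOT_BYTES
            self.cachectx_bytes += entry.allocated_bytes
            entry.replicate_valid = True
        return entry

    def invalidate(self, relid):
        entry = self.entries.get(relid)
        if entry is not None:
            # Only the flag changes; nothing is freed.
            entry.replicate_valid = False

    def sweep_invalid(self):
        # Hypothetical cleanup pass: drop every invalid entry and
        # return the number of bytes released.
        stale = [r for r, e in self.entries.items() if not e.replicate_valid]
        freed = sum(self.entries.pop(r).allocated_bytes for r in stale)
        self.cachectx_bytes -= freed
        return freed


def run_workload(cache, n_tables):
    # Step 3 of the workload: each table is created, decoded once, dropped.
    for relid in range(n_tables):
        cache.get_entry(relid)   # decoding a change for the new table
        cache.invalidate(relid)  # invalidation caused by DROP TABLE
    return cache.cachectx_bytes
```

In this model the context size after run_workload() grows linearly with the number of tables, because no dropped table's entry is ever looked up again. The sweep_invalid() pass is only one possible shape for the eviction/cleanup policy asked about above; when such a pass could safely run in the real WAL sender is exactly the open question.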
100_cachectx_oom.pl