On Tue, Oct 13, 2020 at 11:05 AM Tom Lane <t...@sss.pgh.pa.us> wrote:
>
> Amit Kapila <amit.kapil...@gmail.com> writes:
> >> It is possible that MAXALIGN stuff is playing a role here and or the
> >> background transaction stuff. I think if we go with the idea of
> >> testing spill_txns and spill_count being positive then the results
> >> will be stable. I'll write a patch for that.
>
> Here's our first failure on a MAXALIGN-8 machine:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grison&dt=2020-10-13%2005%3A00%3A08
>
> So this is just plain not stable. It is odd though. I can
> easily think of mechanisms that would cause the WAL volume
> to occasionally be *more* than the "typical" case. What
> would cause it to be *less*, if MAXALIGN is ruled out?
>

The original theory I gave above [1], which is an interleaved
autovacuum transaction. Let me try to explain in a bit more detail.

Say that while transaction T-1 is performing the Insert

  INSERT INTO stats_test SELECT 'serialize-topbig--1:'||g.i FROM generate_series(1, 5000) g(i);

a parallel autovacuum transaction occurs. The problem seen in the
buildfarm will happen when the autovacuum transaction happens after 80%
or more of the Insert is done. In that situation we start decoding the
Insert first and need to spill multiple times, because the amount of
changes exceeds the logical_decoding_work_mem threshold. Then, before we
encounter the Commit of the transaction that performed the Insert (and
probably some more changes from that transaction), we encounter a small
transaction (the autovacuum transaction). Decoding that small
transaction sends the stats collected till then, which covers only the
part of the Insert decoded so far and can therefore be less than in the
typical case. That leads to the problem shown in the buildfarm.

[1] - https://www.postgresql.org/message-id/CAA4eK1Jo0U1oSJyxrdA7i-bOOTh0Hue-NQqdG-CEqwGtDZPjyw%40mail.gmail.com

--
With Regards,
Amit Kapila.
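
P.S. To make the ordering argument above concrete, here is a rough toy
model in Python. It is only a sketch under stated assumptions and not
how the reorder buffer is actually implemented: ToyDecoder, run(), the
"spill every 1000 changes" rule, and the interleave point of 4000 are
all made up for illustration. The only behaviour it borrows from the
explanation above is that the commit of any transaction, however small,
reports the spill stats collected so far.

```python
# Toy model only: names and numbers are illustrative, not PostgreSQL internals.
from dataclasses import dataclass, field


@dataclass
class ToyDecoder:
    work_mem_changes: int                          # stand-in for logical_decoding_work_mem
    buffered: dict = field(default_factory=dict)   # xid -> changes currently held in memory
    spilled_xids: set = field(default_factory=set)
    spill_count: int = 0
    reports: list = field(default_factory=list)    # (spill_txns, spill_count) snapshots

    def change(self, xid):
        self.buffered[xid] = self.buffered.get(xid, 0) + 1
        if self.buffered[xid] >= self.work_mem_changes:
            # Assumed rule: once the limit is hit, spill this transaction to disk.
            self.buffered[xid] = 0
            self.spilled_xids.add(xid)
            self.spill_count += 1

    def commit(self, xid):
        # Any commit, even of a tiny transaction, reports the stats so far.
        self.buffered.pop(xid, None)
        self.reports.append((len(self.spilled_xids), self.spill_count))


def run(total_changes, interleave_at=None):
    dec = ToyDecoder(work_mem_changes=1000)
    for i in range(1, total_changes + 1):
        dec.change(xid=1)              # the big Insert (T-1)
        if i == interleave_at:
            dec.change(xid=2)          # small interleaved transaction
            dec.commit(xid=2)          # its commit is decoded before T-1's commit
    dec.commit(xid=1)
    return dec.reports


typical = run(5000)
interleaved = run(5000, interleave_at=4000)   # 80% of the Insert already decoded
print("typical:    ", typical)
print("interleaved:", interleaved)
```

In this model the first report in the interleaved run carries a smaller
spill_count than the single report in the typical run, so a test that
reads the stats as soon as they show up can see the smaller value.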