On Fri, Nov 8, 2019 at 10:05 AM vignesh C <vignes...@gmail.com> wrote:
> On Thu, Nov 7, 2019 at 10:01 PM Andres Freund <and...@anarazel.de> wrote:
> >
> > Hi,
> >
> > On 2019-11-07 17:03:44 +0530, Amit Kapila wrote:
> > > On Thu, Nov 7, 2019 at 4:48 PM Tomas Vondra
> > > <tomas.von...@2ndquadrant.com> wrote:
> > > >
> > > > I'm a bit confused - does this happen only with the logical_work_mem
> > > > patches, or with clean master too?
> > > >
> > >
> > > This occurs with the clean master.  This is a base code problem
> > > revealed while doing stress testing of logical_work_mem patches.
> >
> > As far as I can tell there are no repro steps included? Any chance to
> > get those?
> >

I think it will be a bit tricky to write a test case, but the call
stack indicates the problem.  For example, for issue-1, see the below
part of the call stack:

#2  0x0000000000ec514d in ExceptionalCondition (conditionName=0x10eab34 "!dlist_is_empty(head)", errorType=0x10eab24, fileName=0x10eab00 "../../../../src/include/lib/ilist.h", lineNumber=458) at assert.c:54
#3  0x0000000000b4fd13 in dlist_tail_element_off (head=0x338fe60, off=48) at ../../../../src/include/lib/ilist.h:458
#4  0x0000000000b547b7 in ReorderBufferAbortOld (rb=0x32ae7a0, oldestRunningXid=895) at reorderbuffer.c:1910
#5  0x0000000000b3cb5e in DecodeStandbyOp (ctx=0x33424b0, buf=0x7fff7e7b1e40) at decode.c:332

The assertion failure indicates that the changes list is empty when we
are trying to get the 'lsn' of the last change.  I think this is
possible when the server got restarted immediately after serializing
the transaction's changes (say it crashed, or somebody switched it off
and restarted it).  The reason is that after serializing the changes,
the changes list will be empty and the serialized flag for the txn
will be set.  I am not sure how we can set the final_lsn from the
changes list if it is empty; one idea could be to remember the last
change's LSN at the time the changes are serialized.

Similarly, if we see the call stack of issue-2, the problem is clear.

#2  0x0000000000ec4e1d in ExceptionalCondition (conditionName=0x10ead30 "txn->final_lsn != InvalidXLogRecPtr", errorType=0x10ea284 "FailedAssertion", fileName=0x10ea2d0 "reorderbuffer.c", lineNumber=3052) at assert.c:54
#3  0x0000000000b577e0 in ReorderBufferRestoreCleanup (rb=0x2ae36b0, txn=0x2bafb08) at reorderbuffer.c:3052
#4  0x0000000000b52b1c in ReorderBufferCleanupTXN (rb=0x2ae36b0, txn=0x2bafb08) at reorderbuffer.c:1318
#5  0x0000000000b5279d in ReorderBufferCleanupTXN (rb=0x2ae36b0, txn=0x2b9d778) at reorderbuffer.c:1257
#6  0x0000000000b5475c in ReorderBufferAbortOld (rb=0x2ae36b0, oldestRunningXid=3835) at reorderbuffer.c:1973

I think this has a symptom similar to the prior issue, but for
sub-transactions.  ReorderBufferAbortOld() sets the final_lsn of the
toplevel transaction but not of its subtransactions; later in that
path, ReorderBufferRestoreCleanup() expects it to be set even for
subtransactions.  Is that Assert in ReorderBufferRestoreCleanup()
required, given that immediately after the assert we anyway set the
value of final_lsn?

> This problem does not occur consistently. I found the crashes while
> reviewing and testing "logical streaming for large in-progress
> transactions" [1].
> This issue does not occur directly: it appears only after some other
> crash occurs in a postgres process (not from the reorderbuffer code
> but due to some other issue); after the original non-reorderbuffer
> crash, this reorderbuffer crash appears.
> To simplify the reorderbuffer crash, I used the following steps:
> 1) Make replication setup with publisher/subscriber for some table
> 2) Prepare a sql file with the below:
>    begin;
>    4096 insert statements;
>    select pg_sleep(120);
> 3) Execute the above script.
> 4) Attach the postgres process when pg_sleep is in progress.
> 5) call abort() from attached gdb.

Isn't it important to call this abort immediately after the changes
are serialized as explained above?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com