> On 2 Oct 2017, at 19:59, Petr Jelinek <petr.jeli...@2ndquadrant.com> wrote:
> 
>> 
>> Program terminated with signal SIGABRT, Aborted.
>> #0  0x00007f3608f8ec37 in __GI_raise (sig=sig@entry=6) at
>> ../nptl/sysdeps/unix/sysv/linux/raise.c:56
>> 56    ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory.
>> (gdb) bt
>> #0  0x00007f3608f8ec37 in __GI_raise (sig=sig@entry=6) at
>> ../nptl/sysdeps/unix/sysv/linux/raise.c:56
>> #1  0x00007f3608f92028 in __GI_abort () at abort.c:89
>> #2  0x00000000009f5740 in ExceptionalCondition (conditionName=0xbf6b30
>> "!(((xid) != ((TransactionId) 0)))",
>>     errorType=0xbf69af "FailedAssertion", fileName=0xbf69a8 "lmgr.c",
>> lineNumber=582) at assert.c:54
>> #3  0x000000000086ac1d in XactLockTableWait (xid=0, rel=0x0, ctid=0x0,
>> oper=XLTW_None) at lmgr.c:582
>> #4  0x000000000081c9f7 in SnapBuildWaitSnapshot (running=0x2749198,
>> cutoff=898498) at snapbuild.c:1400
>> #5  0x000000000081c7c7 in SnapBuildFindSnapshot (builder=0x2807910,
>> lsn=798477760, running=0x2749198) at snapbuild.c:1311
>> #6  0x000000000081c339 in SnapBuildProcessRunningXacts
>> (builder=0x2807910, lsn=798477760, running=0x2749198) at snapbuild.c:1106
> 
> 
> Hmm this would suggest that XLOG_RUNNING_XACTS record contains invalid
> transaction ids but we specifically skip those in
> GetRunningTransactionData(). Can you check pg_waldump output for that
> record (lsn is shown in the backtrace)?

  I investigated this case and it seems that XactLockTableWait() in 
SnapBuildWaitSnapshot()
not always work as expected. XactLockTableWait() waits on LockAcquire() for xid 
to be 
completed and if we finally got this lock but transaction is still in progress 
then such xid
assumed to be a subxid. However LockAcquire/LockRelease cycle can happen after 
transaction
set xid, but before XactLockTableInsert().

Namely following history happened for xid 5225 and lead to crash:

[backend] LOG:  AssignTransactionId: XactTopTransactionId = 5225
   [walsender] LOG:  LogCurrentRunningXacts: Wrote RUNNING_XACTS xctn=1, 
xid[0]=5225
   [walsender] LOG:  XactLockTableWait: LockAcquire 5225
   [walsender] LOG:  XactLockTableWait: LockRelease 5225
[backend] LOG:  AssignTransactionId: LockAcquire ExclusiveLock 5225
   [walsender] LOG:  TransactionIdIsInProgress: SVC->latestCompletedXid=5224 < 
xid=5225 => true
[backend] LOG:  CommitTransaction: ProcArrayEndTransaction xid=5225, ipw=0
[backend] LOG:  CommitTransaction: ResourceOwnerRelease locks xid=5225

I’ve quickly checked other usages of XactLockTableWait() and it seems that it 
is used mostly with
xids from heap so tx definetly set it lock somewhere in the past.

Not sure what it the best approach to handle that. May be write running xacts 
only if they already
set their lock?

Also attaching pgbench script that can reproduce failure. I run it over usual 
pgbench database
in 10 threads. It takes several minutes to crash.

Attachment: reload.pgb
Description: Binary data

Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to