On Tue, Nov 20, 2012 at 8:22 PM, Andres Freund <and...@2ndquadrant.com>wrote:
> Those aren't unexpected. Perhaps I should not make it a warning then... > > A short explanation: > > We can only decode tuples we see in the WAL when we already have a > timetravel catalog snapshot before that transaction started. To build > such a snapshot we need to collect information about committed which > changed the catalog. Unfortunately we can't diagnose whether a txn > changed the catalog without a snapshot so we just assume all committed > ones do - it just costs a bit of memory. Thats the background of the > "forced to assume catalog changes for ..." message. > > The reason for the ABORTs is related but different. We start out in the > "SNAPBUILD_START" state when we try to build a snapshot. When we find > initial information about running transactions (i.e. xl_running_xacts) > we switch to the "SNAPBUILD_FULL_SNAPSHOT" state which means we can > decode all changes in transactions that start *after* the current > lsn. Earlier transactions might have tuples on a catalog state we can't > query. > Only when all transactions we observed as running before the > FULL_SNAPSHOT state have finished we switch to SNAPBUILD_CONSISTENT. > As we want a consistent/reproducible set of transactions to produce > output via the logstream we only pass transactions to the output plugin > if they commit *after* CONSISTENT (they can start earlier though!). This > allows us to produce a pg_dump compatible snapshot in the moment we get > consistent that contains exactly the changes we won't stream out. > > Makes sense? > > > 3) Assertion failure while running pgbench, I was just curious to see > how > > it reacted when logical replication was put under a little bit of load. > > TRAP: FailedAssertion("!(((xid) >= ((TransactionId) 3)) && > > ((snapstate->xmin_running) >= ((TransactionId) 3)))", File: > "snapbuild.c", > > Line: 877) > > => pgbench -i postgres; pgbench -T 500 -c 8 postgres > > Can you reproduce this one? I would be interested in log output. Because > I did run pgbench for quite some time and I haven't seen that one after > fixing some issues last week. > > It implies that snapstate->nrrunning has lost touch with reality... > Yes, I can reproduce in 10-20 seconds in one of my linux boxes. I haven't outputted anything in the logs, but here is the backtrace of the core file produced. #2 0x0000000000865145 in ExceptionalCondition (conditionName=0xa15100 "!(((xid) >= ((TransactionId) 3)) && ((snapstate->xmin_running) >= ((TransactionId) 3)))", errorType=0xa14f3b "FailedAssertion", fileName=0xa14ed0 "snapbuild.c", lineNumber=877) at assert.c:54 #3 0x000000000070c409 in SnapBuildTxnIsRunning (snapstate=0x19e4f10, xid=0) at snapbuild.c:877 #4 0x000000000070b8e4 in SnapBuildProcessChange (reorder=0x19e4e80, snapstate=0x19e4f10, xid=0, buf=0x1a0a368, relfilenode=0x1a0a450) at snapbuild.c:388 #5 0x000000000070c088 in SnapBuildDecodeCallback (reorder=0x19e4e80, snapstate=0x19e4f10, buf=0x1a0a368) at snapbuild.c:732 #6 0x00000000007080b9 in DecodeRecordIntoReorderBuffer (reader=0x1a08300, state=0x19e4e20, buf=0x1a0a368) at decode.c:84 #7 0x0000000000708cad in replay_finished_record (state=0x1a08300, buf=0x1a0a368) at logicalfuncs.c:54 #8 0x00000000004d8033 in XLogReaderRead (state=0x1a08300) at xlogreader.c:965 #9 0x000000000070f7c3 in XLogSendLogical (caughtup=0x7fffb22c35fb "") at walsender.c:1721 #10 0x000000000070ea05 in WalSndLoop (send_data=0x70f6e2 <XLogSendLogical>) at walsender.c:1184 #11 0x000000000070e0eb in StartLogicalReplication (cmd=0x190eaa8) at walsender.c:726 #12 0x000000000070e3ac in exec_replication_command (cmd_string=0x19a65c8 "START_LOGICAL_REPLICATION 'id-0' 0/7E1855C") at walsender.c:853 #13 0x0000000000753ee0 in PostgresMain (argc=2, argv=0x18f63d8, username=0x18f62a8 "michael") at postgres.c:3974 #14 0x00000000006f13ea in BackendRun (port=0x1912600) at postmaster.c:3668 #15 0x00000000006f0b76 in BackendStartup (port=0x1912600) at postmaster.c:3352 #16 0x00000000006ed900 in ServerLoop () at postmaster.c:1431 #17 0x00000000006ed208 in PostmasterMain (argc=13, argv=0x18f40a0) at postmaster.c:1180 #18 0x0000000000657517 in main (argc=13, argv=0x18f40a0) at main.c:197 I'm keeping this core and the binary btw. -- Michael Paquier http://michael.otacoo.com