As it turns out, it was prepared transactions at fault. I restored my test bed, forced a rollback and the locks were gone. At least now I can connect those dots. :-)
On Fri, Feb 22, 2013 at 1:15 AM, Albe Laurenz <laurenz.a...@wien.gv.at>wrote: > Ned Wolpert wrote: > > I'm doing a postmortem on a corruption event we had. I have an idea on > what happened, but not sure. > > I figure I'd share what happened and see if I'm close to right here. > > > > Event: Running 9.1.6 with hot-standby, archiving 4 months of wal > files, and even a nightly pg_dump > > all. 50G database. Trying to update or delete a row in a small (21 row, > but heavily used table) would > > lock up completely. Never finish. Removed all clients, restarted the db > instance, no joy. Check > > pg_stat_activity, and nothing that wasn't idle.... run the delete, > locked up. > > > > Ran (SELECT*FROM pg_class JOIN pg_locks ON pg_locks.relation = > pg_class.oid;) with no clients > > touching this row, fresh restart of the db, and saw virtualtransactions > against this same table where > > the values would be -1/nnnnn were nnnnn was a huge number. Turns out we > had about 159 entries from > > different tables in the database. Checked hot-standby and, of course, > no locks or anything. Switched > > to hot-standby. > > > > Hot-standby instantly gained these locks, Also noticed that 2 prepared > transactions migrated to the > > hot-standby. Binary upgraded to 9.1.8, locks still existed. Ended up > executing the one thing we knew > > would work. Take the site down, pg_dumpall to fresh instance. Everything > is fine. > > > > A little more background: We were running 9.1.4 back when 9.1.6 came > out. We saw there was possible > > corruption issues and did a binary upgrade and reindexing. Everything > seemed to be fine, but I wonder > > if we really had problems back then. We rebuilt the hot-standby after > the binary upgrade via normal > > restore and wal-file replays. I should also note that this row that had > the lock on it that would not > > go away, was created by an app server that was killed (via -9) since it > was non-responsive, and the > > row 'create date' (in db and logs) is the exact time the app server was > killed. > > > > I was wondering if a) these virtualtransactions that start with '-1/' > indicate a problem, b) if this > > could have happened from pre 9.1.6 corruption that was fixed in 9.1.6. > Or, could this have occurred > > when we killed that app server? Or.... am I looking in the wrong place. > > > > I do still have the old data directories so I can start them up and > check out the dataset. Any advice? > > What do you get for > SELECT * FROM pg_prepared_xacts; > > Have you tried to ROLLBACK PREPARED those prepared transactions? > > The "-1" is quite alright: > > BEGIN; > INSERT INTO addr VALUES (10, '127.0.0.1'); > PREPARE TRANSACTION 'test'; > > test=> SELECT * FROM pg_prepared_xacts; > transaction | gid | prepared | owner | database > -------------+------+-------------------------------+---------+---------- > 2275 | test | 2013-02-22 09:09:04.627137+01 | laurenz | test > > test=> SELECT database, relation, transactionid, virtualtransaction, mode > FROM pg_locks; > database | relation | transactionid | virtualtransaction | mode > > ----------+----------+---------------+--------------------+------------------ > 16421 | 11069 | | 2/15 | AccessShareLock > | | | 2/15 | ExclusiveLock > | | 2275 | -1/2275 | ExclusiveLock > 16421 | 24711 | | -1/2275 | > RowExclusiveLock > (4 rows) > > The last two rows are the prepared transaction. > > ROLLBACK PREPARED 'test'; > > Gets rid of them. > > Yours, > Laurenz Albe > -- Virtually, Ned Wolpert "Settle thy studies, Faustus, and begin..." --Marlowe