My bad, I'll re-post this to pgsql-general with more data on my running environment....
On Thu, Feb 21, 2013 at 11:16 AM, Ned Wolpert <ned.wolp...@imemories.com>wrote: > Folks- > > I'm doing a postmortem on a corruption event we had. I have an idea on > what happened, but not sure. I figure I'd share what happened and see if > I'm close to right here. > > Event: Running 9.1.6 with hot-standby, archiving 4 months of wal files, > and even a nightly pg_dump all. 50G database. Trying to update or delete a > row in a small (21 row, but heavily used table) would lock up completely. > Never finish. Removed all clients, restarted the db instance, no joy. Check > pg_stat_activity, and nothing that wasn't idle.... run the delete, locked > up. > > Ran (SELECT*FROM pg_class JOIN pg_locks ON pg_locks.relation = > pg_class.oid;) with no clients touching this row, fresh restart of the db, > and saw virtualtransactions against this same table where the values would > be -1/nnnnn were nnnnn was a huge number. Turns out we had about 159 > entries from different tables in the database. Checked hot-standby and, of > course, no locks or anything. Switched to hot-standby. > > Hot-standby instantly gained these locks, Also noticed that 2 prepared > transactions migrated to the hot-standby. Binary upgraded to 9.1.8, locks > still existed. Ended up executing the one thing we knew would work. Take > the site down, pg_dumpall to fresh instance. Everything is fine. > > A little more background: We were running 9.1.4 back when 9.1.6 came out. > We saw there was possible corruption issues and did a binary upgrade and > reindexing. Everything seemed to be fine, but I wonder if we really had > problems back then. We rebuilt the hot-standby after the binary upgrade via > normal restore and wal-file replays. I should also note that this row that > had the lock on it that would not go away, was created by an app server > that was killed (via -9) since it was non-responsive, and the row 'create > date' (in db and logs) is the exact time the app server was killed. > > I was wondering if a) these virtualtransactions that start with '-1/' > indicate a problem, b) if this could have happened from pre 9.1.6 > corruption that was fixed in 9.1.6. Or, could this have occurred when we > killed that app server? Or.... am I looking in the wrong place. > > I do still have the old data directories so I can start them up and check > out the dataset. Any advice? > > -- > Virtually, Ned Wolpert > > "Settle thy studies, Faustus, and begin..." --Marlowe > -- Virtually, Ned Wolpert "Settle thy studies, Faustus, and begin..." --Marlowe