Good evening,

2018-05-22 23:19 GMT+02:00 Andres Freund <[email protected]>:
> Hi,
>
> On 2018-05-22 21:43:01 +0200, Paolo Crosato wrote:
> > > Could you report the result of
> > > select ctid, xmin, xmax from pg_authid ;
> >
> > This is the result:
> >
> > postgres=# select ctid, xmin, xmax from pg_authid ;
> >  (0,16) | 3031994631 | 0
> >
> >  16 | 6496 | 1 | 144 | 3031994631 | 0 | 0 | (0,16) | 32779 | 10507 | 32 | 1111111111000000 | 675851 | \x6e6167696f73000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000001000000ffffffff496d64356333633236616163636439616665346437383061396239613464663634653639
> >
> > postgres=# select relfrozenxid from pg_class where relname='pg_authid';
> >  relfrozenxid
> > --------------
> >     400011439
>
> That tuple looks, to me, like it indeed shouldn't exist, given the
> relfrozenxid. Decoding infomask (10507 / 0x290B), if I did it right,
> shows:
>
> HEAP_HASNULL
> HEAP_HASVARWIDTH
> HEAP_HASOID
> HEAP_XMIN_COMMITTED
> HEAP_XMAX_INVALID
> HEAP_UPDATED
>
> so it's not frozen. That suggests there's either a bug, or you had
> corruption in your cluster.
>
> Could you give a bit of "history" about that postgres instance? What
> version of postgres was it run on earliest? Was it ever pg_upgraded?
> Were there any OS crashes? Other hardware issues? Was the cluster ever
> used with fsync=off or full_page_writes=off?
>
> Greetings,
>
> Andres Freund

The cluster is made of a primary master instance and a secondary slave in hot standby, with streaming replication. There is a barman server for the backups. The first version it ran on was 10.2; the cluster was promoted to production in mid-February. It was never pg_upgraded. We upgraded it yesterday to 10.4 and restarted, in the hope that this would resolve the issue; the restart was performed by the upgrade process. There was never any crash or hardware issue, and the instance has run without any problem. We never enabled fsync=off or full_page_writes=off. This is the only real issue so far.
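As a sanity check on Andres's decoding of infomask 10507 (0x290B), the bits can be expanded with a short script. This is only a sketch; the bit values and the frozen-tuple rule (both HEAP_XMIN_COMMITTED and HEAP_XMIN_INVALID set) are copied from PostgreSQL's src/include/access/htup_details.h.

```python
# Sketch: decode a heap tuple's t_infomask the way Andres did above.
# Bit values copied from PostgreSQL's src/include/access/htup_details.h.
INFOMASK_BITS = {
    0x0001: "HEAP_HASNULL",
    0x0002: "HEAP_HASVARWIDTH",
    0x0004: "HEAP_HASEXTERNAL",
    0x0008: "HEAP_HASOID",
    0x0010: "HEAP_XMAX_KEYSHR_LOCK",
    0x0020: "HEAP_COMBOCID",
    0x0040: "HEAP_XMAX_EXCL_LOCK",
    0x0080: "HEAP_XMAX_LOCK_ONLY",
    0x0100: "HEAP_XMIN_COMMITTED",
    0x0200: "HEAP_XMIN_INVALID",
    0x0400: "HEAP_XMAX_COMMITTED",
    0x0800: "HEAP_XMAX_INVALID",
    0x1000: "HEAP_XMAX_IS_MULTI",
    0x2000: "HEAP_UPDATED",
    0x4000: "HEAP_MOVED_OFF",
    0x8000: "HEAP_MOVED_IN",
}

# A tuple is frozen only when both of these xmin bits are set.
HEAP_XMIN_FROZEN = 0x0100 | 0x0200

def decode_infomask(infomask):
    """Return the flag names set in a t_infomask value, low bit first."""
    return [name for bit, name in sorted(INFOMASK_BITS.items()) if infomask & bit]

def is_frozen(infomask):
    return (infomask & HEAP_XMIN_FROZEN) == HEAP_XMIN_FROZEN

# 10507 is the t_infomask column from the page dump above.
print(decode_infomask(10507))
print("frozen:", is_frozen(10507))
```

Running it reproduces exactly the six flags Andres listed, and confirms the tuple is not frozen (only HEAP_XMIN_COMMITTED is set, not HEAP_XMIN_INVALID).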
We have many Nagios alerts monitoring the instance. The only problem is that we are creating many temporary files, most of them caused by a query that doesn't overflow work_mem when run alone with explain analyze buffers, which is weird. I should mention that we had a major creation of temp files in the first hours of the cluster's history, but that was quickly resolved.

I managed to recover the log of the first time we ran into the issue; the error was the same, but on template1:

May  8 11:26:46 xxx postgres[32543]: [1154-1] user=,db=,client= ERROR:  found xmin 2600758304 from before relfrozenxid 400011439
May  8 11:26:46 xxx postgres[32543]: [1154-2] user=,db=,client= CONTEXT:  automatic vacuum of table "template1.pg_catalog.pg_authid"

Deleting the row worked, but, as you can see, the problem with 400011439 was already there. This happened when we added a user; shortly afterwards these errors started popping up in the log. We found the issue because the autovacuums stopped working: they were blocked on the pg_authid table. At the same time we found a long-running transaction, a multi-session kill that had somehow gotten stuck. Killing the transaction and deleting the offending row was enough for the autovacuums to restart.

We also had a materialized view that was created months before and that nobody used. Since I found that in the past there was a similar error related to materialized views, I dropped it.

The instance is doing a lot of transactions every day, probably around 100 million; I would need to do more checks to provide an accurate measure. I still have the logs since the first time the error appeared, so I can provide further details if needed.

Thanks,

Best Regards

Paolo Crosato
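For anyone puzzled that xmin 2600758304 is numerically larger than relfrozenxid 400011439 yet reported as "before" it: transaction IDs compare modulo 2^32. A minimal sketch of the signed-difference test (modeled on TransactionIdPrecedes in PostgreSQL's transam code; this is an illustration, not the exact source) shows the xmin from the log really does precede the relfrozenxid:

```python
# Sketch of PostgreSQL's circular XID comparison: interpret the 32-bit
# difference as a signed number; a negative result means id1 precedes id2.
def xid_precedes(id1, id2):
    diff = (id1 - id2) & 0xFFFFFFFF
    if diff >= 0x80000000:          # top bit set => negative as int32
        diff -= 0x100000000
    return diff < 0

# xmin and relfrozenxid from the error message above
print(xid_precedes(2600758304, 400011439))
```

This prints True: in wraparound terms the tuple's xmin is roughly 2.1 billion XIDs "in the past" relative to relfrozenxid, which is why vacuum treats the unfrozen tuple as an error.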
