On Fri, Mar 9, 2018 at 10:43 AM, Jeremy Finzel <finz...@gmail.com> wrote:
> Hello - > > Here is our cluster setup: > > cluster_a 9.5.11 Ubuntu 16.04.4 LTS > --> cluster_b (streamer) 9.5.11 Ubuntu 16.04.4 LTS > --> cluster_c (streamer) 9.5.11 Ubuntu 16.04.4 LTS > > Very recently, we started seeing these errors when running a query on a > specific table on the streamer: > > 2018-03-09 08:28:16.280 CST,"uname","foo",18692,"0.0.0.0:0 > ",5aa29292.4904,4,"SELECT",2018-03-09 07:56:34 CST,18/15992,0,*ERROR* > ,58P01,"*could not access status of transaction 1035047007*","*Could not > open file ""pg_commit_ts/9A45*"": No such file or directory." > > A little history on the cluster: > > - The most recent change we made was a point release upgrade > from 9.5.5 to 9.5.11 on the master, and 9.5.9 to 9.5.11 for the 2 streamers > - It is a very high WAL traffic reporting system. > - We actually have synchronous_commit set to off. It's possible this > could have bitten us and we are just now seeing issues, however there have > been no crashes since the table in question was created. > - We have run pg_repack on many tables on this cluster, but that also > has not happened since over a month > - We had a similar error of missing pg_commit_ts file over a year ago > after an actual crash. We had serious issues getting the cluster to start, > and had to resort to recreating the missing pg_commit_ts with null > bytes (IIRC, we had a snapshot of the system which still showed the file), > which worked but left us questioning what really caused the issue. > > > The table that is causing the error has been in production and used fine > since 2/15/2018 when it was created. It is fed by pglogical replication (v. > 2.1.1 on subscriber) from a system running 9.6.1 and pglogical v. 1.2.1. > The point release upgrade from earlier 9.5 did take place *after* this. > > However, we *only* just started seeing errors in the past 12 hours. The > table was autovacuumed on master at 2018-03-08 18:18:15.532137-06, which > was about 3 hours before the first user query errored, however, I saw that > 2 hours after the autovac, there was another user query that worked > successfully on the table. Not sure if related? > > Any insight/ideas would be much appreciated! > > Thanks, > Jeremy > UPDATE: what is actually failing is a call to pg_xact_commit_timestamp(xmin) on a given table under the view. We still think we must have some corruption though with pg_commit_ts.