On Fri, Mar 9, 2018 at 10:43 AM, Jeremy Finzel <finz...@gmail.com> wrote:

> Hello -
>
> Here is our cluster setup:
>
> cluster_a 9.5.11 Ubuntu 16.04.4 LTS
>  --> cluster_b (streamer) 9.5.11 Ubuntu 16.04.4 LTS
>  --> cluster_c (streamer) 9.5.11 Ubuntu 16.04.4 LTS
>
> Very recently, we started seeing these errors when running a query on a
> specific table on the streamer:
>
> 2018-03-09 08:28:16.280 CST,"uname","foo",18692,"0.0.0.0:0
> ",5aa29292.4904,4,"SELECT",2018-03-09 07:56:34 CST,18/15992,0,*ERROR*
> ,58P01,"*could not access status of transaction 1035047007*","*Could not
> open file ""pg_commit_ts/9A45*"": No such file or directory."
>
> A little history on the cluster:
>
>    - The most recent change we made was a point release upgrade
>    from 9.5.5 to 9.5.11 on the master, and 9.5.9 to 9.5.11 for the 2 streamers
>    - It is a very high WAL traffic reporting system.
>    - We actually have synchronous_commit set to off.  It's possible this
>    could have bitten us and we are just now seeing issues, however there have
>    been no crashes since the table in question was created.
>    - We have run pg_repack on many tables on this cluster, but not in
>    over a month
>    - We had a similar error of missing pg_commit_ts file over a year ago
>    after an actual crash.  We had serious issues getting the cluster to start,
>    and had to resort to recreating the missing pg_commit_ts with null
>    bytes (IIRC, we had a snapshot of the system which still showed the file),
>    which worked but left us questioning what really caused the issue.
>
>
> The table that is causing the error has been in production and used fine
> since 2/15/2018 when it was created. It is fed by pglogical replication (v.
> 2.1.1 on subscriber) from a system running 9.6.1 and pglogical v. 1.2.1.
> The point release upgrade from earlier 9.5 did take place *after* this.
>
> However, we *only* just started seeing errors in the past 12 hours.  The
> table was autovacuumed on master at 2018-03-08 18:18:15.532137-06, about
> 3 hours before the first user query errored.  However, 2 hours after the
> autovac, another user query on the table succeeded.  Not sure if this is
> related?
>
> Any insight/ideas would be much appreciated!
>
> Thanks,
> Jeremy
>

UPDATE: what is actually failing is a call to
pg_xact_commit_timestamp(xmin) on a table underlying the view.  We still
think there must be some corruption in pg_commit_ts, though.
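
For reference, the segment file named in the error can be derived from the
xid. Below is a minimal sketch in Python, assuming a default build (8 kB
blocks, 10-byte commit timestamp entries, 32 SLRU pages per segment). The
constants and the helper name commit_ts_segment are assumptions for
illustration, not something taken from our configuration.

```python
# Assumed constants for a default PostgreSQL 9.5 build; not taken from this thread.
BLCKSZ = 8192                 # default block size in bytes
COMMIT_TS_ENTRY_SIZE = 10     # 8-byte timestamp + 2-byte replication origin id
SLRU_PAGES_PER_SEGMENT = 32   # pages per SLRU segment file


def commit_ts_segment(xid: int) -> str:
    """Return the pg_commit_ts segment file name expected to hold xid (hypothetical helper)."""
    xacts_per_page = BLCKSZ // COMMIT_TS_ENTRY_SIZE
    xacts_per_segment = xacts_per_page * SLRU_PAGES_PER_SEGMENT
    # Segment files are named with the segment number as 4 uppercase hex digits.
    return format(xid // xacts_per_segment, "04X")


print(commit_ts_segment(1035047007))  # 9A45
```

Under those assumptions, xid 1035047007 maps to 9A45, the same file the
error names. That would suggest the xid-to-file lookup is behaving as
expected and the segment itself is what is absent on the streamer.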
