Re: [HACKERS] emergency outage requiring database restart

Merlin Moncure Tue, 25 Oct 2016 12:14:15 -0700

On Tue, Oct 25, 2016 at 12:57 PM, Alvaro Herrera
<[email protected]> wrote:
> Merlin Moncure wrote:
>
>> After last night, I rebuilt the cluster, turning on checksums, turning
>> on synchronous commit (it was off) and added a standby replica.  This
>> should help narrow the problem down should it re-occur; if storage is
>> bad (note, other database on same machine is doing 10x write activity
>> and is fine) or something is scribbling on shared memory (my guess
>> here)  then checksums should be popped, right?
>
> Not really sure about that.  As I recall we compute the CRC on the
> buffer's way out, based on the then-current contents, so if something
> scribbles on the buffer while it's waiting to be evicted, the CRC
> computation would include the new (corrupted) bytes rather than the
> original ones -- see FlushBuffer.


Huh. I have a new theory on this.  Dealing with the reconstituted
database, I'm finding more things -- functions and such, that are
simply gone and had to be rebuilt -- they escaped notice as they were
not in primary code paths.  Recall that the original outage came
manifested as queries getting stuck, possibly on spinlock (we don't
know for sure).  After that, things started to randomly disappear,
possibly from system catalogs (but now need to go back and verify
older data, I think).  There were three autovac processes running.

What if the subsequent dataloss was in fact a symptom of the first
outage?  Is in theory possible for data to appear visible but then be
eaten up as the transactions making the data visible get voided out by
some other mechanic?  I had to pull a quick restart the first time and
everything looked ok -- or so I thought.   What I think was actually
happening is that data started to slip into the void.   It's like
randomly sys catalogs were dropping off. I bet other data was, too.  I
can pull older backups and verify that.  It's as if some creeping xmin
was snuffing everything out.

The confirmation of this should be obvious -- if that's indeed the
case, the backup and restored cluster should no longer present data
loss. Given that I was getting that every 1-2 days, we should be able
to figure that out pretty soon.

merlin


-- 
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] emergency outage requiring database restart

Reply via email to