On Mon, Oct 24, 2016 at 6:01 PM, Merlin Moncure <mmonc...@gmail.com> wrote:
> On Thu, Oct 20, 2016 at 1:52 PM, Merlin Moncure <mmonc...@gmail.com> wrote:
>> On Wed, Oct 19, 2016 at 2:39 PM, Merlin Moncure <mmonc...@gmail.com> wrote:
>>> On Wed, Oct 19, 2016 at 9:56 AM, Bruce Momjian <br...@momjian.us> wrote:
>>>> On Wed, Oct 19, 2016 at 08:54:48AM -0500, Merlin Moncure wrote:
>>>>> > Yeah. Believe me -- I know the drill. Most or all the damage seemed
>>>>> > to be to the system catalogs with at least two critical tables dropped
>>>>> > or inaccessible in some fashion. A lot of the OIDs seemed to be
>>>>> > pointing at the wrong thing. Couple more datapoints here.
>>>>> >
>>>>> > *) This database is OLTP, doing ~ 20 tps avg (but very bursty)
>>>>> > *) Another database on the same cluster was not impacted. However
>>>>> > it's more olap style and may not have been written to during the
>>>>> > outage
>>>>> >
>>>>> > Now, this infrastructure running this system is running maybe 100ish
>>>>> > postgres clusters and maybe 1000ish sql server instances with
>>>>> > approximately zero unexplained data corruption issues in the 5 years
>>>>> > I've been here. Having said that, this definitely smells and feels
>>>>> > like something on the infrastructure side. I'll follow up if I have
>>>>> > any useful info.
>>>>>
>>>>> After a thorough investigation I now have credible evidence the source
>>>>> of the damage did not originate from the database itself.
>>>>> Specifically, this database is mounted on the same volume as the
>>>>> operating system (I know, I know) and something non database driven
>>>>> sucked up disk space very rapidly and exhausted the volume -- fast
>>>>> enough that sar didn't pick it up. Oh well :-) -- thanks for the help
>>>>
>>>> However, disk space exhaustion should not lead to corruption unless the
>>>> underlying layers lied in some way.
>>>
>>> I agree -- however I'm sufficiently separated from the things doing
>>> the things that I can't verify that in any real way. In the meantime
>>> I'm going to take standard precautions (enable checksums/dedicated
>>> volume/replication). Low disk space also does not explain the bizarre
>>> outage I had last friday.
>>
>> ok, data corruption struck again. This time disk space is ruled out,
>> and access to the database is completely denied:
>>
>> postgres=# \c castaging
>> WARNING: leaking still-referenced relcache entry for
>> "pg_index_indexrelid_index"
>
> Corruption struck again.
> This time got another case of view busted -- attempting to create
> gives missing 'type' error.
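For what it's worth, when the catalogs are still readable, a rough sanity
check along these lines should list any relation whose row type has gone
missing from pg_type (only the stock catalogs are used here, nothing
specific to this cluster):

    -- rough catalog sanity check: relations whose composite row type
    -- no longer has a matching pg_type entry (reltype = 0 is normal
    -- for some relkinds, so those are excluded)
    SELECT c.oid, c.relname, c.relkind, c.reltype
    FROM pg_class c
    LEFT JOIN pg_type t ON t.oid = c.reltype
    WHERE c.reltype <> 0
      AND t.oid IS NULL;
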
Call it a hunch -- I think the problem is in pl/sh.

merlin