Re: [HACKERS] emergency outage requiring database restart

Merlin Moncure Mon, 24 Oct 2016 16:06:55 -0700

On Thu, Oct 20, 2016 at 1:52 PM, Merlin Moncure <[email protected]> wrote:
> On Wed, Oct 19, 2016 at 2:39 PM, Merlin Moncure <[email protected]> wrote:
>> On Wed, Oct 19, 2016 at 9:56 AM, Bruce Momjian <[email protected]> wrote:
>>> On Wed, Oct 19, 2016 at 08:54:48AM -0500, Merlin Moncure wrote:
>>>> > Yeah.  Believe me -- I know the drill.  Most or all the damage seemed
>>>> > to be to the system catalogs with at least two critical tables dropped
>>>> > or inaccessible in some fashion.  A lot of the OIDs seemed to be
>>>> > pointing at the wrong thing.  Couple more datapoints here.
>>>> >
>>>> > *) This database is OLTP, doing ~ 20 tps avg (but very bursty)
>>>> > *) Another database on the same cluster was not impacted.  However
>>>> > it's more olap style and may not have been written to during the
>>>> > outage
>>>> >
>>>> > Now, this infrastructure running this system is running maybe 100ish
>>>> > postgres clusters and maybe 1000ish sql server instances with
>>>> > approximately zero unexplained data corruption issues in the 5 years
>>>> > I've been here.  Having said that, this definitely smells and feels
>>>> > like something on the infrastructure side.  I'll follow up if I have
>>>> > any useful info.
>>>>
>>>> After a thorough investigation I now have credible evidence the source
>>>> of the damage did not originate from the database itself.
>>>> Specifically, this database is mounted on the same volume as the
>>>> operating system (I know, I know) and something non database driven
>>>> sucked up disk space very rapidly and exhausted the volume -- fast
>>>> enough that sar didn't pick it up.  Oh well :-) -- thanks for the help
>>>
>>> However, disk space exhaustion should not lead to corruption unless the
>>> underlying layers lied in some way.
>>
>> I agree -- however I'm sufficiently separated from the things doing
>> the things that I can't verify that in any real way.   In the meantime
>> I'm going to take standard precautions (enable checksums/dedicated
>> volume/replication).  Low disk space also does not explain the bizarre
>> outage I had last friday.
>
> ok, data corruption struck again.  This time disk space is ruled out,
> and access to the database is completely denied:
> postgres=# \c castaging
> WARNING:  leaking still-referenced relcache entry for
> "pg_index_indexrelid_index"


Corruption struck again.
This time I hit another case of a busted view -- attempting to create
it gives a missing 'type' error.

merlin


-- 
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
