On Thu, Oct 20, 2016 at 1:52 PM, Merlin Moncure <mmonc...@gmail.com> wrote:
> On Wed, Oct 19, 2016 at 2:39 PM, Merlin Moncure <mmonc...@gmail.com> wrote:
>> On Wed, Oct 19, 2016 at 9:56 AM, Bruce Momjian <br...@momjian.us> wrote:
>>> On Wed, Oct 19, 2016 at 08:54:48AM -0500, Merlin Moncure wrote:
>>>> > Yeah. Believe me -- I know the drill. Most or all the damage seemed
>>>> > to be to the system catalogs, with at least two critical tables dropped
>>>> > or inaccessible in some fashion. A lot of the OIDs seemed to be
>>>> > pointing at the wrong thing. A couple more datapoints here.
>>>> >
>>>> > *) This database is OLTP, doing ~20 tps avg (but very bursty)
>>>> > *) Another database on the same cluster was not impacted. However,
>>>> > it's more OLAP-style and may not have been written to during the
>>>> > outage
>>>> >
>>>> > Now, the infrastructure running this system is running maybe 100-ish
>>>> > postgres clusters and maybe 1000-ish SQL Server instances, with
>>>> > approximately zero unexplained data corruption issues in the 5 years
>>>> > I've been here. Having said that, this definitely smells and feels
>>>> > like something on the infrastructure side. I'll follow up if I have
>>>> > any useful info.
>>>>
>>>> After a thorough investigation I now have credible evidence the source
>>>> of the damage did not originate from the database itself.
>>>> Specifically, this database is mounted on the same volume as the
>>>> operating system (I know, I know), and something non-database-driven
>>>> sucked up disk space very rapidly and exhausted the volume -- fast
>>>> enough that sar didn't pick it up. Oh well :-) -- thanks for the help
>>>
>>> However, disk space exhaustion should not lead to corruption unless the
>>> underlying layers lied in some way.
>>
>> I agree -- however, I'm sufficiently separated from the things doing
>> the things that I can't verify that in any real way.
>> In the meantime I'm going to take standard precautions (enable
>> checksums / dedicated volume / replication). Low disk space also does
>> not explain the bizarre outage I had last Friday.
>
> ok, data corruption struck again. This time disk space is ruled out,
> and access to the database is completely denied:
>
> postgres=# \c castaging
> WARNING:  leaking still-referenced relcache entry for
> "pg_index_indexrelid_index"
single user mode dumps core :(

bash-4.1$ postgres --single -D /var/lib/pgsql/9.5/data castaging
LOG:  00000: could not change directory to "/root": Permission denied
LOCATION:  resolve_symlinks, exec.c:293
Segmentation fault (core dumped)

Core was generated by `postgres --single -D /var/lib/pgsql/9.5/data castaging'.
Program terminated with signal 11, Segmentation fault.
#0  0x0000000000797d6f in ?? ()
Missing separate debuginfos, use: debuginfo-install
postgresql95-server-9.5.2-1PGDG.rhel6.x86_64

(gdb) bt
#0  0x0000000000797d6f in ?? ()
#1  0x000000000079acf1 in RelationCacheInitializePhase3 ()
#2  0x00000000007b35c5 in InitPostgres ()
#3  0x00000000006b9b53 in PostgresMain ()
#4  0x00000000005f30fb in main ()
(gdb)

merlin

--
Sent via pgsql-hackers mailing list (firstname.lastname@example.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
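[Editor's note] For anyone reproducing this kind of crash analysis: the unresolved `?? ()` at frame #0 is exactly what the `Missing separate debuginfos` hint from gdb refers to. A minimal sketch of the follow-up steps on a RHEL 6-era system, assuming the PGDG package name quoted in the session above, the standard PGDG binary path `/usr/pgsql-9.5/bin/postgres`, and an illustrative core-file name:

```shell
# Install debug symbols matching the exact server build
# (package name taken from gdb's hint in the session above).
debuginfo-install -y postgresql95-server-9.5.2-1PGDG.rhel6.x86_64

# Re-examine the core file; with symbols installed, frame #0 should
# resolve to a real function name in the call path under
# RelationCacheInitializePhase3. The core-file name here is illustrative.
gdb /usr/pgsql-9.5/bin/postgres /var/lib/pgsql/9.5/data/core.12345 \
    -ex 'bt full' -ex 'quit'
```

This is an ops sketch, not a tested recipe; on systems without yum-utils, the equivalent is installing the matching `-debuginfo` RPM by hand before re-running gdb.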