Does anyone else have any suggestions about what I can do to diagnose this?
Do I need to re-initdb, or can I reasonably keep running with the existing db?

A

On Mon, Jun 30, 2008 at 7:20 PM, Andrew Hammond <[EMAIL PROTECTED]> wrote:
> On Fri, Jun 27, 2008 at 8:14 PM, Tom Lane <[EMAIL PROTECTED]> wrote:
>> "Andrew Hammond" <[EMAIL PROTECTED]> writes:
>>> (I thought this line was interesting)
>>> Jun 27 15:54:31 qadb2 postgres[92519]: [44-1] PANIC: could not open relation 1663/16386/679439393: No such file or directory
>>
>>> I googled to find out what the numbers 1663/16386/679439393 from the PANIC message mean, but no luck.
>>
>> tablespaceOID/databaseOID/relfilenode. Looks like just some random user
>> table. Not clear why this would be a crash, *especially* since WAL
>> recovery is generally willing to create nonexistent files. Is this
>> reproducible?
>
> Yes, both when I just tried to restart the recovery:
>
> Jun 30 16:04:43 qadb2 postgres[20797]: [27-1] DEBUG: invoking IpcMemoryCreate(size=92938240)
> Jun 30 16:04:43 qadb2 postgres[20797]: [28-1] DEBUG: max_safe_fds = 983, usable_fds = 1000, already_open = 7
> Jun 30 16:04:43 qadb2 postgres[20798]: [29-1] LOG: database system was interrupted while in recovery at 2008-06-27 15:54:31 PDT
> Jun 30 16:04:43 qadb2 postgres[20798]: [29-2] HINT: This probably means that some data is corrupted and you will have to use the last backup for recovery.
> Jun 30 16:04:43 qadb2 postgres[20798]: [30-1] LOG: starting archive recovery
> Jun 30 16:04:43 qadb2 postgres[20798]: [31-1] LOG: restore_command = "cp -p /usr/tmp/2008-06-25_wals/%f %p"
> Jun 30 16:04:43 qadb2 postgres[20798]: [32-1] DEBUG: executing restore command "cp -p /usr/tmp/2008-06-25_wals/00000001.history pg_xlog/RECOVERYHISTORY"
> Jun 30 16:04:43 qadb2 postgres[20798]: [33-1] DEBUG: could not restore file "00000001.history" from archive: return code 256
> Jun 30 16:04:43 qadb2 postgres[20798]: [34-1] DEBUG: executing restore command "cp -p /usr/tmp/2008-06-25_wals/00000001000001D600000078.0055F0B8.backup
> Jun 30 16:04:43 qadb2 postgres[20798]: [34-2] pg_xlog/RECOVERYHISTORY"
> Jun 30 16:04:43 qadb2 postgres[20798]: [35-1] LOG: restored log file "00000001000001D600000078.0055F0B8.backup" from archive
> Jun 30 16:04:43 qadb2 postgres[20798]: [36-1] DEBUG: executing restore command "cp -p /usr/tmp/2008-06-25_wals/00000001000001D600000078 pg_xlog/RECOVERYXLOG"
> Jun 30 16:04:44 qadb2 postgres[20797]: [29-1] DEBUG: forked new backend, pid=20805 socket=8
> Jun 30 16:04:44 qadb2 postgres[20805]: [29-1] LOG: connection received: host=[local]
> Jun 30 16:04:44 qadb2 postgres[20805]: [30-1] FATAL: the database system is starting up
> Jun 30 16:04:44 qadb2 postgres[20805]: [31-1] DEBUG: proc_exit(0)
> Jun 30 16:04:44 qadb2 postgres[20805]: [32-1] DEBUG: shmem_exit(0)
> Jun 30 16:04:44 qadb2 postgres[20805]: [33-1] DEBUG: exit(0)
> Jun 30 16:04:44 qadb2 postgres[20797]: [30-1] DEBUG: reaping dead processes
> Jun 30 16:04:44 qadb2 postgres[20797]: [31-1] DEBUG: server process (PID 20805) exited with exit code 0
> Jun 30 16:04:44 qadb2 postgres[20798]: [37-1] LOG: restored log file "00000001000001D600000078" from archive
> Jun 30 16:04:44 qadb2 postgres[20798]: [38-1] LOG: checkpoint record is at 1D6/7855F0B8
> Jun 30 16:04:44 qadb2 postgres[20798]: [39-1] LOG: redo record is at 1D6/7855F0B8; undo record is at 0/0; shutdown FALSE
> Jun 30 16:04:44 qadb2 postgres[20798]: [40-1] LOG: next transaction ID: 397171279; next OID: 679516596
> Jun 30 16:04:44 qadb2 postgres[20798]: [41-1] LOG: next MultiXactId: 857318; next MultiXactOffset: 1718141
> Jun 30 16:04:44 qadb2 postgres[20798]: [42-1] LOG: automatic recovery in progress
> Jun 30 16:04:44 qadb2 postgres[20798]: [43-1] LOG: redo starts at 1D6/7855F108
>
> Jun 30 16:04:45 qadb2 postgres[20798]: [44-1] PANIC: could not open relation 1663/16386/679439393: No such file or directory
>
> Jun 30 16:04:45 qadb2 postgres[20797]: [32-1] DEBUG: reaping dead processes
> Jun 30 16:04:45 qadb2 postgres[20797]: [33-1] LOG: startup process (PID 20798) was terminated by signal 6
> Jun 30 16:04:45 qadb2 postgres[20797]: [34-1] LOG: aborting startup due to startup process failure
> Jun 30 16:04:45 qadb2 postgres[20797]: [35-1] DEBUG: proc_exit(1)
> Jun 30 16:04:45 qadb2 postgres[20797]: [36-1] DEBUG: shmem_exit(1)
> Jun 30 16:04:45 qadb2 postgres[20797]: [37-1] DEBUG: exit(1)
>
> And also when I tried to wipe the slate clean and recover it freshly.
>
> Jun 30 19:11:59 qadb2 postgres[23091]: [1-1] DEBUG: postmaster: PostmasterMain: initial environ dump:
> Jun 30 19:11:59 qadb2 postgres[23091]: [2-1] DEBUG: -----------------------------------------
> Jun 30 19:11:59 qadb2 postgres[23091]: [3-1] DEBUG: USER=pgsql
> Jun 30 19:11:59 qadb2 postgres[23091]: [4-1] DEBUG: MAIL=/var/mail/pgsql
> Jun 30 19:11:59 qadb2 postgres[23091]: [5-1] DEBUG: LD_LIBRARY_PATH=:/usr/local/adecn/lib
> Jun 30 19:11:59 qadb2 postgres[23091]: [6-1] DEBUG: HOME=/usr/local/pgsql
> Jun 30 19:11:59 qadb2 postgres[23091]: [7-1] DEBUG: PGLIB=/usr/local/lib
> Jun 30 19:11:59 qadb2 postgres[23091]: [8-1] DEBUG: PS1=[QA2] [EMAIL PROTECTED]:\w\$
> Jun 30 19:11:59 qadb2 postgres[23091]: [9-1] DEBUG: BLOCKSIZE=K
> Jun 30 19:11:59 qadb2 postgres[23091]: [10-1] DEBUG: TERM=xterm
> Jun 30 19:11:59 qadb2 postgres[23091]: [11-1] DEBUG: PGSYSCONFDIR=/usr/local/etc/postgresql
> Jun 30 19:11:59 qadb2 postgres[23091]: [12-1] DEBUG: PGLOCALEDIR=/usr/local/share/locale
> Jun 30 19:11:59 qadb2 postgres[23091]: [13-1] DEBUG:
> Jun 30 19:11:59 qadb2 postgres[23091]: [13-2] PATH=/usr/local/adecn/bin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/usr/games:/usr/local/sbin:/usr/local/bin:/usr/X11R6/bin
> Jun 30 19:11:59 qadb2 postgres[23091]: [13-3] :/usr/local/pgsql/bin
>
> (snip junk about environment)
>
> Jun 30 19:11:59 qadb2 postgres[23091]: [19-1] DEBUG: PGDATA=/var/db/adecn/adecndb
> Jun 30 19:11:59 qadb2 postgres[23091]: [20-1] DEBUG: LC_COLLATE=C
> Jun 30 19:11:59 qadb2 postgres[23091]: [21-1] DEBUG: LC_CTYPE=C
> Jun 30 19:11:59 qadb2 postgres[23091]: [22-1] DEBUG: LC_MESSAGES=C
> Jun 30 19:11:59 qadb2 postgres[23091]: [23-1] DEBUG: LC_MONETARY=C
> Jun 30 19:11:59 qadb2 postgres[23091]: [24-1] DEBUG: LC_NUMERIC=C
> Jun 30 19:11:59 qadb2 postgres[23091]: [25-1] DEBUG: LC_TIME=C
> Jun 30 19:11:59 qadb2 postgres[23091]: [26-1] DEBUG: -----------------------------------------
> Jun 30 19:11:59 qadb2 postgres[23092]: [27-1] DEBUG: invoking IpcMemoryCreate(size=92938240)
> Jun 30 19:11:59 qadb2 postgres[23092]: [28-1] DEBUG: max_safe_fds = 983, usable_fds = 1000, already_open = 7
> Jun 30 19:11:59 qadb2 postgres[23093]: [29-1] LOG: database system was interrupted at 2008-06-25 03:01:02 PDT
> Jun 30 19:11:59 qadb2 postgres[23093]: [30-1] LOG: starting archive recovery
> Jun 30 19:11:59 qadb2 postgres[23093]: [31-1] LOG: restore_command = "cp -p /usr/tmp/2008-06-25_wals/%f %p"
> Jun 30 19:11:59 qadb2 postgres[23093]: [32-1] DEBUG: executing restore command "cp -p /usr/tmp/2008-06-25_wals/00000001.history pg_xlog/RECOVERYHISTORY"
> Jun 30 19:11:59 qadb2 postgres[23093]: [33-1] DEBUG: could not restore file "00000001.history" from archive: return code 256
> Jun 30 19:11:59 qadb2 postgres[23093]: [34-1] DEBUG: executing restore command "cp -p /usr/tmp/2008-06-25_wals/00000001000001D600000078.0055F0B8.backup
> Jun 30 19:11:59 qadb2 postgres[23093]: [34-2] pg_xlog/RECOVERYHISTORY"
> Jun 30 19:11:59 qadb2 postgres[23093]: [35-1] LOG: restored log file "00000001000001D600000078.0055F0B8.backup" from archive
> Jun 30 19:11:59 qadb2 postgres[23093]: [36-1] DEBUG: executing restore command "cp -p /usr/tmp/2008-06-25_wals/00000001000001D600000078 pg_xlog/RECOVERYXLOG"
> Jun 30 19:12:00 qadb2 postgres[23093]: [37-1] LOG: restored log file "00000001000001D600000078" from archive
> Jun 30 19:12:00 qadb2 postgres[23093]: [38-1] LOG: checkpoint record is at 1D6/7855F0B8
> Jun 30 19:12:00 qadb2 postgres[23093]: [39-1] LOG: redo record is at 1D6/7855F0B8; undo record is at 0/0; shutdown FALSE
> Jun 30 19:12:00 qadb2 postgres[23093]: [40-1] LOG: next transaction ID: 397171279; next OID: 679516596
> Jun 30 19:12:00 qadb2 postgres[23093]: [41-1] LOG: next MultiXactId: 857318; next MultiXactOffset: 1718141
> Jun 30 19:12:00 qadb2 postgres[23093]: [42-1] LOG: automatic recovery in progress
> Jun 30 19:12:00 qadb2 postgres[23093]: [43-1] LOG: redo starts at 1D6/7855F108
>
> Jun 30 19:12:00 qadb2 postgres[23093]: [44-1] PANIC: could not open relation 1663/16386/679439393: No such file or directory
>
> Jun 30 19:12:00 qadb2 postgres[23092]: [29-1] DEBUG: reaping dead processes
> Jun 30 19:12:00 qadb2 postgres[23092]: [30-1] LOG: startup process (PID 23093) was terminated by signal 6
> Jun 30 19:12:00 qadb2 postgres[23092]: [31-1] LOG: aborting startup due to startup process failure
> Jun 30 19:12:00 qadb2 postgres[23092]: [32-1] DEBUG: proc_exit(1)
> Jun 30 19:12:00 qadb2 postgres[23092]: [33-1] DEBUG: shmem_exit(1)
> Jun 30 19:12:00 qadb2 postgres[23092]: [34-1] DEBUG: exit(1)
>
> So yes, I believe this is entirely repeatable.
>
>>> (On Thursday night)
>>> vacuumdb: vacuuming of database "adecndb" failed: ERROR: could not write block 209610 of relation 1663/16386/236356665: No space left on device
>>> CONTEXT: writing block 209610 of relation 1663/16386/236356665
>>
>> That's pretty frickin' odd as well, because as a rule we make sure that
>> backing store exists for each table page before we open it up for
>> normal writing. Do you have a way to find out what relation
>> 1663/16386/236356665 is?
>
> 1663 = pg_default tablespace
> 16386 = adecndb, the only database (aside from postgres, template1 & 0
> on this backend).
>
> SELECT relname FROM pg_class WHERE relfilenode = 236356665; -- got nothing
>
> But this is on the production database where a vacuum has already succeeded.
>
>> What filesystem is this database sitting on?
>
> FreeBSD 6.2 ufs. Here are the options from /etc/fstab.
>
> /dev/da1 /xraid ufs rw,noatime 2 2
>
>>> Now, the first message is very strange since we have monitoring on the
>>> file system used by the database and it's been hovering at about 18%
>>> space used for the last month. So I can't figure out why we'd get "No
>>> space left on device", assuming the device is actually the disk (which
>>> seems reasonable given the context) and not shared memory.
>>
>> Yeah, this is definitely a case of ENOSPC being returned by write().
>> If you're sure the disk wasn't out of space, maybe some per-user quota
>> was getting in the way?
>
> Monitoring is using df to determine space available and runs every 5
> minutes. It will alarm at 70% usage and critical at 80% usage. We
> received no alarms or other notifications. We do not yet have tracking
> in place to store the history of this, so I can't say definitively
> that we didn't spike over that usage level. However, we are only at 18%
> currently, so it seems unlikely. Nothing else uses this disk.
>
> I checked /etc/rc.conf and did not find quotas turned on. Neither are
> they enabled in the fstab. I confirmed using
>     quota -v pgsql
> It says "none".
>
> Andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
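For readers following along, the decoding Tom describes (tablespaceOID/databaseOID/relfilenode) can be checked with catalog queries along these lines. This is only a sketch, run while connected to the affected cluster, reusing the OIDs from the PANIC message above; the last query must be run while connected to the database in question, and (as seen in the thread) an empty result means no current catalog entry points at that file:

```sql
-- tablespace OID -> tablespace name (1663 is normally pg_default)
SELECT spcname FROM pg_tablespace WHERE oid = 1663;

-- database OID -> database name
SELECT datname FROM pg_database WHERE oid = 16386;

-- relfilenode -> relation name; run this connected to that database
SELECT relname FROM pg_class WHERE relfilenode = 679439393;
```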