Does anyone else have any suggestions about what I can do to diagnose this?
Do I need to re-initdb, or can I reasonably keep running with the existing db?

A

On Mon, Jun 30, 2008 at 7:20 PM, Andrew Hammond <[EMAIL PROTECTED]> wrote:
> On Fri, Jun 27, 2008 at 8:14 PM, Tom Lane <[EMAIL PROTECTED]> wrote:
>> "Andrew Hammond" <[EMAIL PROTECTED]> writes:
>>> (I thought this line was interesting)
>>> Jun 27 15:54:31 qadb2 postgres[92519]: [44-1] PANIC: could not open relation 1663/16386/679439393: No such file or directory
>>
>>> I googled to find out what the numbers 1663/16386/679439393 from the PANIC message mean, but no luck.
>>
>> tablespaceOID/databaseOID/relfilenode. Looks like just some random user
>> table. Not clear why this would be a crash, *especially* since WAL
>> recovery is generally willing to create nonexistent files. Is this
>> reproducible?
>
> Yes, both when I just tried to restart the recovery:
>
> Jun 30 16:04:43 qadb2 postgres[20797]: [27-1] DEBUG: invoking IpcMemoryCreate(size=92938240)
> Jun 30 16:04:43 qadb2 postgres[20797]: [28-1] DEBUG: max_safe_fds = 983, usable_fds = 1000, already_open = 7
> Jun 30 16:04:43 qadb2 postgres[20798]: [29-1] LOG: database system was interrupted while in recovery at 2008-06-27 15:54:31 PDT
> Jun 30 16:04:43 qadb2 postgres[20798]: [29-2] HINT: This probably means that some data is corrupted and you will have to use the last backup for recovery.
> Jun 30 16:04:43 qadb2 postgres[20798]: [30-1] LOG: starting archive recovery
> Jun 30 16:04:43 qadb2 postgres[20798]: [31-1] LOG: restore_command = "cp -p /usr/tmp/2008-06-25_wals/%f %p"
> Jun 30 16:04:43 qadb2 postgres[20798]: [32-1] DEBUG: executing restore command "cp -p /usr/tmp/2008-06-25_wals/00000001.history pg_xlog/RECOVERYHISTORY"
> Jun 30 16:04:43 qadb2 postgres[20798]: [33-1] DEBUG: could not restore file "00000001.history" from archive: return code 256
> Jun 30 16:04:43 qadb2 postgres[20798]: [34-1] DEBUG: executing restore command "cp -p /usr/tmp/2008-06-25_wals/00000001000001D600000078.0055F0B8.backup
> Jun 30 16:04:43 qadb2 postgres[20798]: [34-2] pg_xlog/RECOVERYHISTORY"
> Jun 30 16:04:43 qadb2 postgres[20798]: [35-1] LOG: restored log file "00000001000001D600000078.0055F0B8.backup" from archive
> Jun 30 16:04:43 qadb2 postgres[20798]: [36-1] DEBUG: executing restore command "cp -p /usr/tmp/2008-06-25_wals/00000001000001D600000078 pg_xlog/RECOVERYXLOG"
> Jun 30 16:04:44 qadb2 postgres[20797]: [29-1] DEBUG: forked new backend, pid=20805 socket=8
> Jun 30 16:04:44 qadb2 postgres[20805]: [29-1] LOG: connection received: host=[local]
> Jun 30 16:04:44 qadb2 postgres[20805]: [30-1] FATAL: the database system is starting up
> Jun 30 16:04:44 qadb2 postgres[20805]: [31-1] DEBUG: proc_exit(0)
> Jun 30 16:04:44 qadb2 postgres[20805]: [32-1] DEBUG: shmem_exit(0)
> Jun 30 16:04:44 qadb2 postgres[20805]: [33-1] DEBUG: exit(0)
> Jun 30 16:04:44 qadb2 postgres[20797]: [30-1] DEBUG: reaping dead processes
> Jun 30 16:04:44 qadb2 postgres[20797]: [31-1] DEBUG: server process (PID 20805) exited with exit code 0
> Jun 30 16:04:44 qadb2 postgres[20798]: [37-1] LOG: restored log file "00000001000001D600000078" from archive
> Jun 30 16:04:44 qadb2 postgres[20798]: [38-1] LOG: checkpoint record is at 1D6/7855F0B8
> Jun 30 16:04:44 qadb2 postgres[20798]: [39-1] LOG: redo record is at 1D6/7855F0B8; undo record is at 0/0; shutdown FALSE
> Jun 30 16:04:44 qadb2 postgres[20798]: [40-1] LOG: next transaction ID: 397171279; next OID: 679516596
> Jun 30 16:04:44 qadb2 postgres[20798]: [41-1] LOG: next MultiXactId: 857318; next MultiXactOffset: 1718141
> Jun 30 16:04:44 qadb2 postgres[20798]: [42-1] LOG: automatic recovery in progress
> Jun 30 16:04:44 qadb2 postgres[20798]: [43-1] LOG: redo starts at 1D6/7855F108
>
> Jun 30 16:04:45 qadb2 postgres[20798]: [44-1] PANIC: could not open relation 1663/16386/679439393: No such file or directory
>
> Jun 30 16:04:45 qadb2 postgres[20797]: [32-1] DEBUG: reaping dead processes
> Jun 30 16:04:45 qadb2 postgres[20797]: [33-1] LOG: startup process (PID 20798) was terminated by signal 6
> Jun 30 16:04:45 qadb2 postgres[20797]: [34-1] LOG: aborting startup due to startup process failure
> Jun 30 16:04:45 qadb2 postgres[20797]: [35-1] DEBUG: proc_exit(1)
> Jun 30 16:04:45 qadb2 postgres[20797]: [36-1] DEBUG: shmem_exit(1)
> Jun 30 16:04:45 qadb2 postgres[20797]: [37-1] DEBUG: exit(1)
>
> And also when I tried to wipe the slate clean and recover it freshly.
>
> Jun 30 19:11:59 qadb2 postgres[23091]: [1-1] DEBUG: postmaster: PostmasterMain: initial environ dump:
> Jun 30 19:11:59 qadb2 postgres[23091]: [2-1] DEBUG: -----------------------------------------
> Jun 30 19:11:59 qadb2 postgres[23091]: [3-1] DEBUG: USER=pgsql
> Jun 30 19:11:59 qadb2 postgres[23091]: [4-1] DEBUG: MAIL=/var/mail/pgsql
> Jun 30 19:11:59 qadb2 postgres[23091]: [5-1] DEBUG: LD_LIBRARY_PATH=:/usr/local/adecn/lib
> Jun 30 19:11:59 qadb2 postgres[23091]: [6-1] DEBUG: HOME=/usr/local/pgsql
> Jun 30 19:11:59 qadb2 postgres[23091]: [7-1] DEBUG: PGLIB=/usr/local/lib
> Jun 30 19:11:59 qadb2 postgres[23091]: [8-1] DEBUG: PS1=[QA2] [EMAIL PROTECTED]:\w\$
> Jun 30 19:11:59 qadb2 postgres[23091]: [9-1] DEBUG: BLOCKSIZE=K
> Jun 30 19:11:59 qadb2 postgres[23091]: [10-1] DEBUG: TERM=xterm
> Jun 30 19:11:59 qadb2 postgres[23091]: [11-1] DEBUG: PGSYSCONFDIR=/usr/local/etc/postgresql
> Jun 30 19:11:59 qadb2 postgres[23091]: [12-1] DEBUG: PGLOCALEDIR=/usr/local/share/locale
> Jun 30 19:11:59 qadb2 postgres[23091]: [13-1] DEBUG:
> Jun 30 19:11:59 qadb2 postgres[23091]: [13-2] PATH=/usr/local/adecn/bin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/usr/games:/usr/local/sbin:/usr/local/bin:/usr/X11R6/bin
> Jun 30 19:11:59 qadb2 postgres[23091]: [13-3] :/usr/local/pgsql/bin
>
> (snip junk about environment)
>
> Jun 30 19:11:59 qadb2 postgres[23091]: [19-1] DEBUG: PGDATA=/var/db/adecn/adecndb
> Jun 30 19:11:59 qadb2 postgres[23091]: [20-1] DEBUG: LC_COLLATE=C
> Jun 30 19:11:59 qadb2 postgres[23091]: [21-1] DEBUG: LC_CTYPE=C
> Jun 30 19:11:59 qadb2 postgres[23091]: [22-1] DEBUG: LC_MESSAGES=C
> Jun 30 19:11:59 qadb2 postgres[23091]: [23-1] DEBUG: LC_MONETARY=C
> Jun 30 19:11:59 qadb2 postgres[23091]: [24-1] DEBUG: LC_NUMERIC=C
> Jun 30 19:11:59 qadb2 postgres[23091]: [25-1] DEBUG: LC_TIME=C
> Jun 30 19:11:59 qadb2 postgres[23091]: [26-1] DEBUG: -----------------------------------------
> Jun 30 19:11:59 qadb2 postgres[23092]: [27-1] DEBUG: invoking IpcMemoryCreate(size=92938240)
> Jun 30 19:11:59 qadb2 postgres[23092]: [28-1] DEBUG: max_safe_fds = 983, usable_fds = 1000, already_open = 7
> Jun 30 19:11:59 qadb2 postgres[23093]: [29-1] LOG: database system was interrupted at 2008-06-25 03:01:02 PDT
> Jun 30 19:11:59 qadb2 postgres[23093]: [30-1] LOG: starting archive recovery
> Jun 30 19:11:59 qadb2 postgres[23093]: [31-1] LOG: restore_command = "cp -p /usr/tmp/2008-06-25_wals/%f %p"
> Jun 30 19:11:59 qadb2 postgres[23093]: [32-1] DEBUG: executing restore command "cp -p /usr/tmp/2008-06-25_wals/00000001.history pg_xlog/RECOVERYHISTORY"
> Jun 30 19:11:59 qadb2 postgres[23093]: [33-1] DEBUG: could not restore file "00000001.history" from archive: return code 256
> Jun 30 19:11:59 qadb2 postgres[23093]: [34-1] DEBUG: executing restore command "cp -p /usr/tmp/2008-06-25_wals/00000001000001D600000078.0055F0B8.backup
> Jun 30 19:11:59 qadb2 postgres[23093]: [34-2] pg_xlog/RECOVERYHISTORY"
> Jun 30 19:11:59 qadb2 postgres[23093]: [35-1] LOG: restored log file "00000001000001D600000078.0055F0B8.backup" from archive
> Jun 30 19:11:59 qadb2 postgres[23093]: [36-1] DEBUG: executing restore command "cp -p /usr/tmp/2008-06-25_wals/00000001000001D600000078 pg_xlog/RECOVERYXLOG"
> Jun 30 19:12:00 qadb2 postgres[23093]: [37-1] LOG: restored log file "00000001000001D600000078" from archive
> Jun 30 19:12:00 qadb2 postgres[23093]: [38-1] LOG: checkpoint record is at 1D6/7855F0B8
> Jun 30 19:12:00 qadb2 postgres[23093]: [39-1] LOG: redo record is at 1D6/7855F0B8; undo record is at 0/0; shutdown FALSE
> Jun 30 19:12:00 qadb2 postgres[23093]: [40-1] LOG: next transaction ID: 397171279; next OID: 679516596
> Jun 30 19:12:00 qadb2 postgres[23093]: [41-1] LOG: next MultiXactId: 857318; next MultiXactOffset: 1718141
> Jun 30 19:12:00 qadb2 postgres[23093]: [42-1] LOG: automatic recovery in progress
> Jun 30 19:12:00 qadb2 postgres[23093]: [43-1] LOG: redo starts at 1D6/7855F108
>
> Jun 30 19:12:00 qadb2 postgres[23093]: [44-1] PANIC: could not open relation 1663/16386/679439393: No such file or directory
>
> Jun 30 19:12:00 qadb2 postgres[23092]: [29-1] DEBUG: reaping dead processes
> Jun 30 19:12:00 qadb2 postgres[23092]: [30-1] LOG: startup process (PID 23093) was terminated by signal 6
> Jun 30 19:12:00 qadb2 postgres[23092]: [31-1] LOG: aborting startup due to startup process failure
> Jun 30 19:12:00 qadb2 postgres[23092]: [32-1] DEBUG: proc_exit(1)
> Jun 30 19:12:00 qadb2 postgres[23092]: [33-1] DEBUG: shmem_exit(1)
> Jun 30 19:12:00 qadb2 postgres[23092]: [34-1] DEBUG: exit(1)
>
> So yes, I believe this is entirely repeatable.
>
>>> (On Thursday night)
>>> vacuumdb: vacuuming of database "adecndb" failed: ERROR: could not write block 209610 of relation 1663/16386/236356665: No space left on device
>>> CONTEXT: writing block 209610 of relation 1663/16386/236356665
>>
>> That's pretty frickin' odd as well, because as a rule we make sure that
>> backing store exists for each table page before we open it up for
>> normal writing. Do you have a way to find out what relation
>> 1663/16386/236356665 is?
>
> 1663 = pg_default tablespace
> 16386 = adecndb, the only database (aside from postgres, template1 & 0
> on this backend).
>
> SELECT relname FROM pg_class WHERE relfilenode = 236356665; -- got nothing
>
> But this is on the production database where a vacuum has already succeeded.
>
>> What filesystem is this database sitting on?
>
> FreeBSD 6.2 ufs. Here are the options from /etc/fstab.
>
> /dev/da1 /xraid ufs rw,noatime 2 2
>
>>> Now, the first message is very strange since we have monitoring on the
>>> file system used by the database and it's been hovering at about 18%
>>> space used for the last month. So I can't figure out why we'd get "No
>>> space left on device", assuming the device is actually the disk (which
>>> seems reasonable given the context) and not shared memory.
>>
>> Yeah, this is definitely a case of ENOSPC being returned by write().
>> If you're sure the disk wasn't out of space, maybe some per-user quota
>> was getting in the way?
>
> Monitoring is using df to determine space available and runs every 5
> minutes. It will alarm at 70% usage and critical at 80% usage. We
> received no alarms or other notifications. We do not yet have tracking
> in place to store the history of this, so I can't say definitively
> that we didn't spike over that usage level. However, we are only at 18%
> currently, so it seems unlikely. Nothing else uses this disk.
>
> I checked /etc/rc.conf and did not find quotas turned on. Neither are
> they enabled in the fstab. I confirmed using
>     quota -v pgsql
> It says "none".
>
> Andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
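For readers following along, the decoding Tom describes (tablespaceOID/databaseOID/relfilenode) can be checked with catalog queries along these lines. This is only a sketch, run while connected to the affected cluster, reusing the OIDs from the PANIC message above; the last query must be run while connected to the database in question, and (as seen in the thread) an empty result means no current catalog entry points at that file:

```sql
-- tablespace OID -> tablespace name (1663 is normally pg_default)
SELECT spcname FROM pg_tablespace WHERE oid = 1663;

-- database OID -> database name
SELECT datname FROM pg_database WHERE oid = 16386;

-- relfilenode -> relation name; run this connected to that database
SELECT relname FROM pg_class WHERE relfilenode = 679439393;
```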