[ This thread is becoming off-topic and, I suspect, highly irrelevant to
most users on this list. I propose you remove the CC or move it to a
relevant list like reiserfs or postgresql-admin if you wish to reply. ]
On Thu, Jun 20, 2002 at 01:11:09PM -0400, Greg A. Woods wrote:
> [ On Wednesday, June 19, 2002 at 23:53:14 (+0200), Ragnar Kjørstad wrote: ]
> > Subject: Re: Backing up PostgreSQL?
> >
> > By this definition PostgreSQL is consistent at all times
>
> That's simply not possible to be true. PostgreSQL uses multiple files
> in the filesystem namespace to contain its data -- sometimes even
> multiple files per table. It is literally impossible, with the POSIX
> file I/O interfaces, to guarantee that concurrent writes to multiple
> files will all complete at the same time. Remember I've only been
> talking about the backed up files being in a self-consistent state and
> not requiring roll-back or roll-forward of any transaction logs after
> restore.
Phew - we've been through this already! PostgreSQL doesn't need this
guarantee, because it writes to its log to avoid this very problem!
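
To make the ordering argument concrete, here is a minimal sketch of the
write-ahead idea. It is not PostgreSQL's code; logged_update() and
write_all() are hypothetical names invented for this illustration. The
only point is that the log record is forced to disk before the data file
is touched, so the data files never have to be updated "at the same
time".

```c
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write the whole buffer, retrying on short writes. */
static int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0)
            return -1;
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}

/*
 * Hypothetical illustration of write-ahead ordering (an assumption made
 * for this sketch, not taken from any real database): the record is
 * appended to the log and fsync()ed before the data file is modified.
 * Returns 0 on success, -1 on any error.
 */
int logged_update(const char *log_path, const char *data_path,
                  const char *record)
{
    size_t len = strlen(record);
    int rc = -1;

    int log_fd = open(log_path, O_WRONLY | O_CREAT | O_APPEND, 0600);
    if (log_fd < 0)
        return -1;

    /* Step 1: append to the log and force it to stable storage. */
    if (write_all(log_fd, record, len) == 0 && fsync(log_fd) == 0) {
        /* Step 2: only now touch the data file; it is not synced here. */
        int data_fd = open(data_path, O_WRONLY | O_CREAT | O_APPEND, 0600);
        if (data_fd >= 0) {
            rc = write_all(data_fd, record, len);
            if (close(data_fd) != 0)
                rc = -1;
        }
    }

    if (close(log_fd) != 0)
        rc = -1;
    return rc;
}
```

If the machine dies between step 1 and step 2, the record is still in
the log and can be replayed on restart; that replay step is left out of
the sketch.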
> > Not at all. There are multiple levels of consistency and in order to be
> > safe from corruption you need to think of all of them. The WAL protects
> > you from database inconsistencies, the journaling filesystem from
> > filesystem inconsistencies, and if the RAID is doing write-back caching
> > it must have a battery-backed cache.
>
> Yeah, but you are still making invalid claims about what those different
> levels of consistency imply w.r.t. the consistency of backed up copies
> of the database files.
"You're wrong" is simply not a convincing line of argument.
> > Where does it say that close/open will flush metadata?
>
> That's how the unix filesystem works. UTSL.
Here is the close code on Linux:

asmlinkage long sys_close(unsigned int fd)
{
        struct file * filp;
        struct files_struct *files = current->files;

        write_lock(&files->file_lock);
        if (fd >= files->max_fds)
                goto out_unlock;
        filp = files->fd[fd];
        if (!filp)
                goto out_unlock;
        files->fd[fd] = NULL;
        FD_CLR(fd, files->close_on_exec);
        __put_unused_fd(files, fd);
        write_unlock(&files->file_lock);
        return filp_close(filp, files);

out_unlock:
        write_unlock(&files->file_lock);
        return -EBADF;
}

int filp_close(struct file *filp, fl_owner_t id)
{
        int retval;

        if (!file_count(filp)) {
                printk(KERN_ERR "VFS: Close: file count is 0\n");
                return 0;
        }
        retval = 0;
        if (filp->f_op && filp->f_op->flush) {
                lock_kernel();
                retval = filp->f_op->flush(filp);
                unlock_kernel();
        }
        fcntl_dirnotify(0, filp, 0);
        locks_remove_posix(filp, id);
        fput(filp);
        return retval;
}

As you can see, data is only flushed if filp->f_op->flush is set, and
if you look in fs/ext2/file.c you will see that struct file_operations
ext2_file_operations doesn't define this operation.
I'd quote the open code as well, but it's much bigger - you'll find it
in the kernel source if you're really interested.
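
From the application side, the consequence is that close() cannot be
relied on to push anything to disk; a program that needs durability has
to ask for it explicitly. A minimal sketch, where write_durably() is a
hypothetical helper made up for this mail:

```c
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: write a string to a file and request that it be
 * flushed to stable storage. close() alone gives no such guarantee, so
 * fsync() is called explicitly and its return value is checked.
 * Returns 0 on success, -1 on any error.
 */
int write_durably(const char *path, const char *text)
{
    size_t len = strlen(text);
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);

    if (fd < 0)
        return -1;

    /* Loop because write() may accept fewer bytes than requested. */
    while (len > 0) {
        ssize_t n = write(fd, text, len);
        if (n < 0) {
            close(fd);
            return -1;
        }
        text += n;
        len -= (size_t)n;
    }

    /* Flush file data and inode metadata; this says nothing about the
     * directory entry that names the file. */
    if (fsync(fd) != 0) {
        close(fd);
        return -1;
    }

    return close(fd) == 0 ? 0 : -1;
}
```

Note that this only covers the file itself. Whether the directory entry
survives a crash is a separate question.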
This issue has been discussed in depth on the reiserfs and reiserfs-dev
lists; I propose you subscribe or browse the archives for more
information. In particular there is a thread about filesystem features
required for mail servers, and there is a post from Wietse where he
writes:
"ext2fs isn't a great file system for mail. Queue files are short-lived,
so mail delivery involves a lot of directory operations. ext2fs
has ASYNCHRONOUS DIRECTORY updates and can lose a file even when
fsync() on the file succeeds."
Just to make sure there is no (additional) confusion here, what I'm
saying is:
1. Metadata must be updated properly. This is obvious and
   shouldn't require further explanation...
2. Non-journaling filesystems (e.g. ext2 on Linux) do update
   the inode metadata on fsync(), but they do not update the
   directory.
As PostgreSQL and other databases do not create new files very often,
they will _mostly_ be able to recover from a crash on a non-journaling
filesystem - but there is no guarantee. It's an easy mistake to forget
that PostgreSQL actually creates files both when new tables/indexes are
created and when they get too big (I've done it myself in the past -
that doesn't make it any more right).
> > If you read my posts carefully you will find that I've never claimed
> > that filesystem consistency equals database consistency.
>
> You have. You have confused the meanings and implied that one will get
> you the other.
Not at all. What I've stated all along is that the database log provides
database consistency and journaling filesystems provide
filesystem consistency. There is no general way to provide
filesystem consistency on non-journaling filesystems; fsync() will flush
the inode data on most (all?) filesystems, but there is no common way
to flush directory updates. Some can be flushed with fsync() on a
directory file descriptor, some have an attribute on the directory to
request synchronous file operations, and there may be filesystems out
there that flush on close()/open() like you claimed. For all I know
PostgreSQL may implement some of these quirks, but the point is that in
general directory updates may be lost on non-journaling filesystems.
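
For the filesystems where fsync() on a directory file descriptor does
the job, the pattern looks roughly like this. It is only a sketch:
create_file_durably() is a hypothetical name, and whether the directory
fsync() actually flushes the new entry depends on the filesystem, which
is exactly the portability problem described above.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: create an empty file and then fsync() both the
 * file and the directory that contains it. The assumption (not true of
 * every filesystem) is that fsync() on the directory descriptor flushes
 * the new directory entry.
 * Returns 0 on success, -1 on any error.
 */
int create_file_durably(const char *dir_path, const char *name)
{
    int rc = -1;
    int dir_fd = open(dir_path, O_RDONLY | O_DIRECTORY);

    if (dir_fd < 0)
        return -1;

    /* O_EXCL: fail rather than reuse a file that already exists. */
    int fd = openat(dir_fd, name, O_WRONLY | O_CREAT | O_EXCL, 0600);
    if (fd >= 0) {
        /* Flush the inode of the new file ... */
        int file_ok = (fsync(fd) == 0);
        if (close(fd) != 0)
            file_ok = 0;
        /* ... and then the directory entry that refers to it. */
        if (file_ok && fsync(dir_fd) == 0)
            rc = 0;
    }

    if (close(dir_fd) != 0)
        rc = -1;
    return rc;
}
```

A caller would use something like create_file_durably("/some/dir",
"newfile") and treat a non-zero return as "the file may not survive a
crash".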
--
Ragnar Kjørstad
Big Storage