[ This thread is becoming off-topic and, I suspect, highly irrelevant to
most users on this list. I propose you remove the CC, or move the
discussion to a relevant list like reiserfs or postgresql-admin, if you
wish to reply ]

On Thu, Jun 20, 2002 at 01:11:09PM -0400, Greg A. Woods wrote:
> [ On Wednesday, June 19, 2002 at 23:53:14 (+0200), Ragnar Kjærstad wrote: ]
> > Subject: Re: Backing up PostgreSQL?
> >
> > By this definition postgresql is consistent at all times
> 
> That's simply not possible to be true.  PostgreSQL uses multiple files
> in the filesystem namespace to contain its data -- sometimes even
> multiple files per table.  It is literally impossible, with the POSIX
> file I/O interfaces, to guarantee that concurrent writes to multiple
> files will all complete at the same time.  Remember I've only been
> talking about the backed up files being in a self-consistent state and
> not requiring roll-back or roll-forward of any transaction logs after
> restore.

Puh - we've been through this already! PostgreSQL doesn't need this
guarantee, because it writes its log precisely to avoid this problem!

> > Not at all. There are multiple levels of consistancy and in order to be
> > safe from corruption you need to think of all of them. The WAL protects
> > you from database inconsistancies, the journaling filesystem from
> > filesystem inconsistancies and a if the RAID is doing write-back caching
> > it must have battery-backed cached.
> 
> Yeah, but you are still making invalid claims about what those different
> levels of consistency imply w.r.t. the consistency of backed up copies
> of the database files.

"You're wrong" is simply not a convincing line of argument.

> > Where does it say that close/open will flush metadata?
> 
> That's how the unix filesystem works.  UTSL.

Here is the close() code in the Linux kernel:

asmlinkage long sys_close(unsigned int fd)
{
        struct file * filp;
        struct files_struct *files = current->files;

        write_lock(&files->file_lock);
        if (fd >= files->max_fds)
                goto out_unlock;
        filp = files->fd[fd];
        if (!filp)
                goto out_unlock;
        files->fd[fd] = NULL;
        FD_CLR(fd, files->close_on_exec);
        __put_unused_fd(files, fd);
        write_unlock(&files->file_lock);
        return filp_close(filp, files);

out_unlock:
        write_unlock(&files->file_lock);
        return -EBADF;
}

int filp_close(struct file *filp, fl_owner_t id)
{
        int retval;

        if (!file_count(filp)) {
                printk(KERN_ERR "VFS: Close: file count is 0\n");
                return 0;
        }
        retval = 0;
        if (filp->f_op && filp->f_op->flush) {
                lock_kernel();
                retval = filp->f_op->flush(filp);
                unlock_kernel();
        }
        fcntl_dirnotify(0, filp, 0);
        locks_remove_posix(filp, id);
        fput(filp);
        return retval;
}

As you can see, data is only flushed if filp->f_op->flush is set, and
if you look in fs/ext2/file.c you will see that struct file_operations
ext2_file_operations does not define this operation.

I'd quote the open() code as well, but it's much bigger - you'll find
it in the kernel source if you're really interested.


This issue has been discussed in depth on the reiserfs and reiserfs-dev
lists; I suggest you subscribe or browse the archives for more
information. In particular there is a thread about the filesystem
features required for mail servers, including a post from Wietse where
he writes:

"ext2fs isn't a great file system for mail. Queue files are short-lived,
so mail delivery involves a lot of directory operations.  ext2fs
has ASYNCHRONOUS DIRECTORY updates and can lose a file even when
fsync() on the file succeeds."


Just to make sure there is no (additional) confusion here, what I'm
saying is:
1. Metadata must be updated properly. This is obvious and
   shouldn't require further explanation.
2. Non-journaling filesystems (e.g. ext2 on Linux) do update
   the inode metadata on fsync(), but they do not update the
   directory.

As PostgreSQL and other databases do not create new files very often,
they will _mostly_ be able to recover from a crash on a non-journaling
filesystem - but there is no guarantee. It's an easy mistake to forget
that PostgreSQL actually creates files both when new tables/indexes are
created and when existing ones grow too big. (I've made that mistake
myself in the past - that doesn't make it any more right.)



> > If you read my posts carefully you will find that I've never claimed
> > that filesystem consistency equals database consistency.
> 
> You have.  You have confused the meanings and implied that one will get
> you the other.

Not at all. What I've stated all along is that the database log provides
database consistency and journaling filesystems provide
filesystem consistency. There is no general way to ensure
filesystem consistency on non-journaling filesystems: fsync() will flush
the inode data on most (all?) filesystems, but there is no common way
to flush directory updates. On some filesystems they can be flushed with
fsync() on a directory file descriptor, some have an attribute on the
directory that makes file operations synchronous, and there may be
filesystems out there that flush on close()/open() as you claimed. For
all I know PostgreSQL may implement some of these quirks, but the point
is that in general directory updates may be lost on non-journaling
filesystems.



-- 
Ragnar Kjærstad
Big Storage
