Re: [PATCH 3/3] nfs: use ->mmap_prepare() to avoid an AB-BA deadlock

2007-11-14 Thread Trond Myklebust

On Wed, 2007-11-14 at 23:24 +0100, Nick Piggin wrote:

> mmap()s can be different from read in that the syscall may have little
> relation to when the data gets used. But I guess it's still a best
> effort thing. Fair enough.

Agreed that mmap() is special and very problematic on NFS. However I
can't see how we can improve on the existing models short of some
significant protocol modifications, and so far, nobody has presented the
IETF with a good case for why they need this level of cache consistency.

Cheers
   Trond

-
To unsubscribe from this list: send the line "unsubscribe linux-arch" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] nfs: use ->mmap_prepare() to avoid an AB-BA deadlock

2007-11-14 Thread Trond Myklebust

On Wed, 2007-11-14 at 22:50 +0100, Peter Zijlstra wrote:
> Right, but I guess what Nick asked is, if pages could be stale to start
> with, how is that avoided in the future.
> 
> The way I understand it, this re-validate is just a best effort at
> getting a coherent image.

The normal convention for NFS is to use a close-to-open cache
consistency model. In that model, applications must agree never to open
the file for reading or writing if an application on a different NFS
client already holds it open for writing.

However there is no standard locking model for _enforcing_ such an
agreement, so some setups do violate it. One obvious model that we try
to support is that where the applications are using POSIX locking in
order to ensure exclusive access to the data when requires.

Another model is to rely rather on synchronous writes and heavy
attribute revalidation to detect when a competing application has
written to the file (the 'noac' mount option). While such a model is
obviously deficient in that it can never guarantee cache coherency, we
do attempt to ensure that it works on a per-operation basis (IOW: we
check cache coherency before each call to read(), to mmap(), etc) since
it is by far the easiest model to apply if you have applications that
cannot be rewritten and that satisfy the requirement that they rarely
conflict.

Cheers
  Trond

-
To unsubscribe from this list: send the line "unsubscribe linux-arch" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] nfs: use ->mmap_prepare() to avoid an AB-BA deadlock

2007-11-14 Thread Nick Piggin
On Wed, Nov 14, 2007 at 05:18:50PM -0500, Trond Myklebust wrote:
> 
> On Wed, 2007-11-14 at 22:50 +0100, Peter Zijlstra wrote:
> > Right, but I guess what Nick asked is, if pages could be stale to start
> > with, how is that avoided in the future.
> > 
> > The way I understand it, this re-validate is just a best effort at
> > getting a coherent image.
> 
> The normal convention for NFS is to use a close-to-open cache
> consistency model. In that model, applications must agree never to open
> the file for reading or writing if an application on a different NFS
> client already holds it open for writing.
> 
> However there is no standard locking model for _enforcing_ such an
> agreement, so some setups do violate it. One obvious model that we try
> to support is that where the applications are using POSIX locking in
> order to ensure exclusive access to the data when requires.
> 
> Another model is to rely rather on synchronous writes and heavy
> attribute revalidation to detect when a competing application has
> written to the file (the 'noac' mount option). While such a model is
> obviously deficient in that it can never guarantee cache coherency, we
> do attempt to ensure that it works on a per-operation basis (IOW: we
> check cache coherency before each call to read(), to mmap(), etc) since
> it is by far the easiest model to apply if you have applications that
> cannot be rewritten and that satisfy the requirement that they rarely
> conflict.

mmap()s can be different from read in that the syscall may have little
relation to when the data gets used. But I guess it's still a best
effort thing. Fair enough.

Thanks,
Nick
-
To unsubscribe from this list: send the line "unsubscribe linux-arch" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] nfs: use ->mmap_prepare() to avoid an AB-BA deadlock

2007-11-14 Thread Peter Zijlstra

On Wed, 2007-11-14 at 16:41 -0500, Trond Myklebust wrote:
> On Wed, 2007-11-14 at 22:31 +0100, Peter Zijlstra wrote:
> > On Wed, 2007-11-14 at 22:22 +0100, Nick Piggin wrote:
> > > On Wed, Nov 14, 2007 at 09:01:39PM +0100, Peter Zijlstra wrote:
> > > > Normal locking order is:
> > > > 
> > > >   i_mutex
> > > > mmap_sem
> > > > 
> > > > However NFS's ->mmap hook, which is called under mmap_sem, can take 
> > > > i_mutex.
> > > > Avoid this potential deadlock by doing the work that requires i_mutex 
> > > > from
> > > > the new ->mmap_prepare().
> > > > 
> > > > [ Is this sufficient, or does it introduce a race? ]
> > > 
> > > Seems like an OK patchset in my opinion. I don't know much about NFS
> > > unfortunately, but I wonder what prevents the condition fixed by
> > > nfs_revalidate_mapping from happening again while the mmap is active...?
> > 
> > As the changelog might have suggested, I'm not overly sure of the nfs
> > requirements myself. I think it just does a best effort at getting the
> > pages coherent with other clients, and then hopes for the best.
> > 
> > I'll let Trond enlighten us further before I make an utter fool of
> > myself :-)
> 
> The NFS client needs to check the validity of already cached data before
> it allows those pages to be mmapped. If it finds out that the cache is
> stale, then we need to call invalidate_inode_pages2() to clear out the
> cache and refresh it from the server. The inode->i_mutex is necessary in
> order to prevent races between the new writes and the cache invalidation
> code.

Right, but I guess what Nick asked is, if pages could be stale to start
with, how is that avoided in the future.

The way I understand it, this re-validate is just a best effort at
getting a coherent image.

-
To unsubscribe from this list: send the line "unsubscribe linux-arch" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] nfs: use ->mmap_prepare() to avoid an AB-BA deadlock

2007-11-14 Thread Trond Myklebust

On Wed, 2007-11-14 at 22:31 +0100, Peter Zijlstra wrote:
> On Wed, 2007-11-14 at 22:22 +0100, Nick Piggin wrote:
> > On Wed, Nov 14, 2007 at 09:01:39PM +0100, Peter Zijlstra wrote:
> > > Normal locking order is:
> > > 
> > >   i_mutex
> > > mmap_sem
> > > 
> > > However NFS's ->mmap hook, which is called under mmap_sem, can take 
> > > i_mutex.
> > > Avoid this potential deadlock by doing the work that requires i_mutex from
> > > the new ->mmap_prepare().
> > > 
> > > [ Is this sufficient, or does it introduce a race? ]
> > 
> > Seems like an OK patchset in my opinion. I don't know much about NFS
> > unfortunately, but I wonder what prevents the condition fixed by
> > nfs_revalidate_mapping from happening again while the mmap is active...?
> 
> As the changelog might have suggested, I'm not overly sure of the nfs
> requirements myself. I think it just does a best effort at getting the
> pages coherent with other clients, and then hopes for the best.
> 
> I'll let Trond enlighten us further before I make an utter fool of
> myself :-)

The NFS client needs to check the validity of already cached data before
it allows those pages to be mmapped. If it finds out that the cache is
stale, then we need to call invalidate_inode_pages2() to clear out the
cache and refresh it from the server. The inode->i_mutex is necessary in
order to prevent races between the new writes and the cache invalidation
code.

Cheers
  Trond

-
To unsubscribe from this list: send the line "unsubscribe linux-arch" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] nfs: use ->mmap_prepare() to avoid an AB-BA deadlock

2007-11-14 Thread Nick Piggin
On Wed, Nov 14, 2007 at 09:01:39PM +0100, Peter Zijlstra wrote:
> Normal locking order is:
> 
>   i_mutex
> mmap_sem
> 
> However NFS's ->mmap hook, which is called under mmap_sem, can take i_mutex.
> Avoid this potential deadlock by doing the work that requires i_mutex from
> the new ->mmap_prepare().
> 
> [ Is this sufficient, or does it introduce a race? ]

Seems like an OK patchset in my opinion. I don't know much about NFS
unfortunately, but I wonder what prevents the condition fixed by
nfs_revalidate_mapping from happening again while the mmap is active...?

> 
> Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
> ---
>  fs/nfs/file.c |   25 +
>  1 file changed, 17 insertions(+), 8 deletions(-)
> 
> Index: linux-2.6/fs/nfs/file.c
> ===
> --- linux-2.6.orig/fs/nfs/file.c
> +++ linux-2.6/fs/nfs/file.c
> @@ -41,6 +41,9 @@
>  static int nfs_file_open(struct inode *, struct file *);
>  static int nfs_file_release(struct inode *, struct file *);
>  static loff_t nfs_file_llseek(struct file *file, loff_t offset, int origin);
> +static int
> +nfs_file_mmap_prepare(struct file * file, unsigned long len,
> + unsigned long prot, unsigned long flags, unsigned long pgoff);
>  static int  nfs_file_mmap(struct file *, struct vm_area_struct *);
>  static ssize_t nfs_file_splice_read(struct file *filp, loff_t *ppos,
>   struct pipe_inode_info *pipe,
> @@ -64,6 +67,7 @@ const struct file_operations nfs_file_op
>   .write  = do_sync_write,
>   .aio_read   = nfs_file_read,
>   .aio_write  = nfs_file_write,
> + .mmap_prepare   = nfs_file_mmap_prepare,
>   .mmap   = nfs_file_mmap,
>   .open   = nfs_file_open,
>   .flush  = nfs_file_flush,
> @@ -270,7 +274,8 @@ nfs_file_splice_read(struct file *filp, 
>  }
>  
>  static int
> -nfs_file_mmap(struct file * file, struct vm_area_struct * vma)
> +nfs_file_mmap_prepare(struct file * file, unsigned long len,
> + unsigned long prot, unsigned long flags, unsigned long pgoff)
>  {
>   struct dentry *dentry = file->f_path.dentry;
>   struct inode *inode = dentry->d_inode;
> @@ -279,13 +284,17 @@ nfs_file_mmap(struct file * file, struct
>   dfprintk(VFS, "nfs: mmap(%s/%s)\n",
>   dentry->d_parent->d_name.name, dentry->d_name.name);
>  
> - status = nfs_revalidate_mapping(inode, file->f_mapping);
> - if (!status) {
> - vma->vm_ops = &nfs_file_vm_ops;
> - vma->vm_flags |= VM_CAN_NONLINEAR;
> - file_accessed(file);
> - }
> - return status;
> + return nfs_revalidate_mapping(inode, file->f_mapping);
> +}
> +
> +static int
> +nfs_file_mmap(struct file *file, struct vm_area_struct *vma)
> +{
> + vma->vm_ops = &nfs_file_vm_ops;
> + vma->vm_flags |= VM_CAN_NONLINEAR;
> + file_accessed(file);
> +
> + return 0;
>  }
>  
>  /*
> 
> --
-
To unsubscribe from this list: send the line "unsubscribe linux-arch" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] nfs: use ->mmap_prepare() to avoid an AB-BA deadlock

2007-11-14 Thread Peter Zijlstra

On Wed, 2007-11-14 at 22:22 +0100, Nick Piggin wrote:
> On Wed, Nov 14, 2007 at 09:01:39PM +0100, Peter Zijlstra wrote:
> > Normal locking order is:
> > 
> >   i_mutex
> > mmap_sem
> > 
> > However NFS's ->mmap hook, which is called under mmap_sem, can take i_mutex.
> > Avoid this potential deadlock by doing the work that requires i_mutex from
> > the new ->mmap_prepare().
> > 
> > [ Is this sufficient, or does it introduce a race? ]
> 
> Seems like an OK patchset in my opinion. I don't know much about NFS
> unfortunately, but I wonder what prevents the condition fixed by
> nfs_revalidate_mapping from happening again while the mmap is active...?

As the changelog might have suggested, I'm not overly sure of the nfs
requirements myself. I think it just does a best effort at getting the
pages coherent with other clients, and then hopes for the best.

I'll let Trond enlighten us further before I make an utter fool of
myself :-)

-
To unsubscribe from this list: send the line "unsubscribe linux-arch" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html