Re: how to show propagation state for mounts

2008-02-20 Thread Matthew Wilcox
On Wed, Feb 20, 2008 at 04:04:22PM +, Al Viro wrote:
 It's less about the form of representation (after all, we generate poll
 events when contents of that sucker changes, so one *can* get a consistent
 snapshot of the entire thing) and more about having it self-contained
 when we have namespaces in the play.
 
 IOW, the data in there should give answers to questions that make sense.
 Do events get propagated from this vfsmount I have to that vfsmount I have?
 is a meaningful one; ditto for are events here propagated to somewhere I
 don't see? or are events getting propagated here from somewhere I don't
 see?.

Why do those last two questions deserve an answer?  How will a person's
or application's behaviour be affected by whether a change will
propagate to something they don't know about and can't see?

-- 
Intel are signing my paycheques ... these opinions are still mine
Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/3] enhanced ESTALE error handling

2008-01-18 Thread Matthew Wilcox
On Fri, Jan 18, 2008 at 10:36:01AM -0500, Peter Staubach wrote:
 @@ -1025,12 +1027,27 @@ static int fastcall link_path_walk(const
   mntget(save.mnt);
  
   result = __link_path_walk(name, nd);
 - if (result == -ESTALE) {
 + while (result == -ESTALE) {
 + /*
 +  * If no progress was made looking up the pathname,
 +  * then stop and return ENOENT instead of ESTALE.
 +  */
 + if (nd-dentry == save.dentry) {
 + result = -ENOENT;
 + break;
 + }
   *nd = save;
   dget(nd-dentry);
   mntget(nd-mnt);
   nd-flags |= LOOKUP_REVAL;
   result = __link_path_walk(name, nd);
 + /*
 +  * If no progress was made this time, then return
 +  * ENOENT instead of ESTALE because no recovery
 +  * is possible to recover the stale file handle.
 +  */
 + if (result == -ESTALE  nd-dentry == save.dentry)
 + result = -ENOENT;
   }
  
   dput(save.dentry);

Why do you need both of these tests?  The first one should be enough,
surely?

 @@ -1268,8 +1285,8 @@ int path_lookup_open(int dfd, const char
   * @create_mode: create intent flags
   */
  static int path_lookup_create(int dfd, const char *name,
 -   unsigned int lookup_flags, struct nameidata *nd,
 -   int open_flags, int create_mode)
 + unsigned int lookup_flags, struct nameidata *nd,
 + int open_flags, int create_mode)

Gratuitous reformatting?

 @@ -1712,7 +1729,10 @@ int open_namei(int dfd, const char *path
   int acc_mode, error;
   struct path path;
   struct dentry *dir;
 - int count = 0;
 + int count;
 +
 +top:
 + count = 0;
  
   acc_mode = ACC_MODE(flag);
  
 @@ -1739,7 +1759,8 @@ int open_namei(int dfd, const char *path
   /*
* Create - we need to know the parent.
*/
 - error = path_lookup_create(dfd,pathname,LOOKUP_PARENT,nd,flag,mode);
 + error = path_lookup_create(dfd, pathname, LOOKUP_PARENT, nd,
 + flag, mode);
   if (error)
   return error;
  
 @@ -1812,10 +1833,17 @@ ok:
   return 0;
  
  exit_dput:
 + if (error == -ESTALE)
 + d_drop(path.dentry);
   dput_path(path, nd);
  exit:
   if (!IS_ERR(nd-intent.open.file))
   release_open_intent(nd);
 + if (error == -ESTALE) {
 + d_drop(nd-dentry);
 + path_release(nd);
 + goto top;
 + }

I wonder if a tail-call might not work better here.

-- 
Intel are signing my paycheques ... these opinions are still mine
Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: file locks: Use wait_event_interruptible_timeout()

2008-01-15 Thread Matthew Wilcox
On Tue, Jan 15, 2008 at 09:48:51AM -0500, J. Bruce Fields wrote:
 On Mon, Jan 14, 2008 at 09:28:30PM -0700, Matthew Wilcox wrote:
  interruptible_sleep_on_locked() is just an open-coded
  wait_event_interruptible_timeout() with a few assumptions since we know
  we hold the BKL.  locks_block_on_timeout() is only used in one place, so
  it's actually simpler to inline it into its caller.
 
 Makes sense, thanks.  So the assumption we were depending on the BKL for
 was that we could count on the wake-up not coming till after we block,
 so we could skip a check -fl_next that's normally needed to resolve the
 usual sleeping-on-some-condition race?

That's right.

-- 
Intel are signing my paycheques ... these opinions are still mine
Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch] rewrite rd

2008-01-14 Thread Matthew Wilcox
On Tue, Dec 04, 2007 at 05:26:28AM +0100, Nick Piggin wrote:
 +static void copy_to_brd(struct brd_device *brd, const void *src,
 + sector_t sector, size_t n)
 +{
 + struct page *page;
 + void *dst;
 + unsigned int offset = (sector  (PAGE_SECTORS-1))  SECTOR_SHIFT;
 + size_t copy;
 +
 + copy = min((unsigned long)n, PAGE_SIZE - offset);
 + page = brd_lookup_page(brd, sector);
 + BUG_ON(!page);
 +
 + dst = kmap_atomic(page, KM_USER1);
 + memcpy(dst + offset, src, copy);
 + kunmap_atomic(dst, KM_USER1);

You're using kmap_atomic, but I see no reason you can't be preempted.
Don't you need to at least disable preemption while you have stuff
atomically kmapped?

-- 
Intel are signing my paycheques ... these opinions are still mine
Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


file locks: Use wait_event_interruptible_timeout()

2008-01-14 Thread Matthew Wilcox

interruptible_sleep_on_locked() is just an open-coded
wait_event_interruptible_timeout() with a few assumptions since we know
we hold the BKL.  locks_block_on_timeout() is only used in one place, so
it's actually simpler to inline it into its caller.

Signed-off-by: Matthew Wilcox [EMAIL PROTECTED]

 locks.c |   33 -
 1 file changed, 4 insertions(+), 29 deletions(-)

diff --git a/fs/locks.c b/fs/locks.c
index 8b8388e..b681459 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -634,33 +634,6 @@ static int flock_locks_conflict(struct file_lock 
*caller_fl, struct file_lock *s
return (locks_conflict(caller_fl, sys_fl));
 }
 
-static int interruptible_sleep_on_locked(wait_queue_head_t *fl_wait, int 
timeout)
-{
-   int result = 0;
-   DECLARE_WAITQUEUE(wait, current);
-
-   __set_current_state(TASK_INTERRUPTIBLE);
-   add_wait_queue(fl_wait, wait);
-   if (timeout == 0)
-   schedule();
-   else
-   result = schedule_timeout(timeout);
-   if (signal_pending(current))
-   result = -ERESTARTSYS;
-   remove_wait_queue(fl_wait, wait);
-   __set_current_state(TASK_RUNNING);
-   return result;
-}
-
-static int locks_block_on_timeout(struct file_lock *blocker, struct file_lock 
*waiter, int time)
-{
-   int result;
-   locks_insert_block(blocker, waiter);
-   result = interruptible_sleep_on_locked(waiter-fl_wait, time);
-   __locks_delete_block(waiter);
-   return result;
-}
-
 void
 posix_test_lock(struct file *filp, struct file_lock *fl)
 {
@@ -1256,7 +1229,10 @@ restart:
if (break_time == 0)
break_time++;
}
-   error = locks_block_on_timeout(flock, new_fl, break_time);
+   locks_insert_block(flock, new_fl);
+   error = wait_event_interruptible_timeout(new_fl-fl_wait,
+   !new_fl-fl_next, break_time);
+   __locks_delete_block(new_fl);
if (error = 0) {
if (error == 0)
time_out_leases(inode);

-- 
Intel are signing my paycheques ... these opinions are still mine
Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Leak in nlmsvc_testlock for async GETFL case

2008-01-14 Thread Matthew Wilcox
On Mon, Jan 14, 2008 at 03:44:19PM -0500, J. Bruce Fields wrote:
 Thanks!  I've queued it up for 2.6.25.

Hi Bruce,

I haven't had as much time to play with de-BKL-ising fs/locks.c as I
would like, so fixing that for 2.6.25 is probably out of the question,
but here are two janitorial patches that hopefully can be applied and
will make the next steps easier.  They make sense all by themselves,
even if I don't get back to this project for a few months.

-- 
Intel are signing my paycheques ... these opinions are still mine
Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


file locks: Split flock_find_conflict out of flock_lock_file

2008-01-14 Thread Matthew Wilcox

Reduce the spaghetti-like nature of flock_lock_file by making the chunk
of code labelled find_conflict into its own function.  Also allocate
memory before taking the kernel lock in preparation for switching to a
normal spinlock.

Signed-off-by: Matthew Wilcox [EMAIL PROTECTED]

diff --git a/fs/locks.c b/fs/locks.c
index b681459..bc691e5 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -699,6 +699,33 @@ next_task:
return 0;
 }
 
+/*
+ * A helper function for flock_lock_file().  It searches the list of locks
+ * for @inode looking for one which would conflict with @request.
+ * If it does not find one, it returns the address where the requested lock
+ * should be inserted.  If it does find a conflicting lock, it returns NULL.
+ */
+static struct file_lock **
+flock_find_conflict(struct inode *inode, struct file_lock *request)
+{
+   struct file_lock **before;
+
+   for_each_lock(inode, before) {
+   struct file_lock *fl = *before;
+   if (IS_POSIX(fl))
+   break;
+   if (IS_LEASE(fl))
+   continue;
+   if (!flock_locks_conflict(request, fl))
+   continue;
+
+   if (request-fl_flags  FL_SLEEP)
+   locks_insert_block(fl, request);
+   return NULL;
+   }
+   return before;
+}
+
 /* Try to create a FLOCK lock on filp. We always insert new FLOCK locks
  * after any leases, but before any posix locks.
  *
@@ -714,18 +741,21 @@ static int flock_lock_file(struct file *filp, struct 
file_lock *request)
int error = 0;
int found = 0;
 
-   lock_kernel();
-   if (request-fl_flags  FL_ACCESS)
-   goto find_conflict;
+   if (request-fl_flags  FL_ACCESS) {
+   lock_kernel();
+   before = flock_find_conflict(inode, request);
+   unlock_kernel();
+   return before ? 0 : -EAGAIN;
+   }
 
if (request-fl_type != F_UNLCK) {
-   error = -ENOMEM;
new_fl = locks_alloc_lock();
-   if (new_fl == NULL)
-   goto out;
-   error = 0;
+   if (!new_fl)
+   return -ENOMEM;
}
 
+   lock_kernel();
+
for_each_lock(inode, before) {
struct file_lock *fl = *before;
if (IS_POSIX(fl))
@@ -754,22 +784,12 @@ static int flock_lock_file(struct file *filp, struct 
file_lock *request)
if (found)
cond_resched();
 
-find_conflict:
-   for_each_lock(inode, before) {
-   struct file_lock *fl = *before;
-   if (IS_POSIX(fl))
-   break;
-   if (IS_LEASE(fl))
-   continue;
-   if (!flock_locks_conflict(request, fl))
-   continue;
+   before = flock_find_conflict(inode, request);
+   if (!before) {
error = -EAGAIN;
-   if (request-fl_flags  FL_SLEEP)
-   locks_insert_block(fl, request);
goto out;
}
-   if (request-fl_flags  FL_ACCESS)
-   goto out;
+
locks_copy_lock(new_fl, request);
locks_insert_lock(before, new_fl);
new_fl = NULL;
-- 
Intel are signing my paycheques ... these opinions are still mine
Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


On setting a lease across a cluster

2008-01-04 Thread Matthew Wilcox

Hi Bruce,

The current implementation of vfs_setlease/generic_setlease/etc is a
bit quirky.  I've been thinking it over for the past couple of days,
and I think we need to refactor it to work sensibly.

As you may have noticed, I've been mulling over getting rid of the
BKL in fs/locks.c and the lease interface is particularly problematic.
Here's one reason why:

fcntl_setlease
   lock_kernel
   vfs_setlease
  lock_kernel
  generic_setlease
  unlock_kernel
   fasync_helper
   unlock_kernel

This is perfectly legitimate for the BKL of course, but other spinlocks
can't be acquired recursively in this way.  At first I thought I'd just
push the call to generic_setlease into __vfs_setlease and have two ways
of calling it:

fcntl_setlease
   lock_kernel
   __vfs_setlease
   fasync_helper
   unlock_kernel

vfs_setlease
  lock_kernel
  __vfs_setlease
  unlock_kernel

But generic_setlease allocates memory (GFP_KERNEL) under the lock, so
that's bad for converting to spnlocks later.  Then I thought I'd move
the spinlock acquisition into generic_setlease().  But I can't do that
either as fcntl_setlease() needs to hold the spinlock around the call
to generic_setlease() and fasync_helper().

Then I started to wonder about the current split of functionality between
fcntl_setlease, vfs_setlease and generic_setlease.  The check for no
other process having this file open should be common to all filesystems.
And it should be honoured by NFS (it shouldn't hand out a delegation if
a local user has the file open), so it needs to be in vfs_setlease().

Now I'm worrying about the mechanics of calling out to a filesystem to
perform a lock.  Obviously, a network filesystem is going to have to
sleep waiting for a response from the fileserver, so that precludes the
use of a spinlock held over the call to f_op-setlease.  I'm not keen on
the idea of EXPORT_SYMBOL_GPL() a spinlock -- verifying locking is quite
hard enough without worrying what a filesystem might be doing with it.

So I think we need to refactor the interface, and I'd like to hear your
thoughts on my ideas of how to handle this.

First, have clients of vfs_setlease call lease_alloc() and pass it in,
rather than allocate it on the stack and have this horrible interface
where we may pass back an allocated lock.  This allows us to not allocate
memory (hence sleep) during generic_setlease().

Second, move some of the functionality of generic_setlease() to
vfs_setlease(), as mentioned above.

Third, change the purpose of f_op-setlease.  We can rename it if you
like to catch any out of tree users.  I'd propose using it like this:

vfs_setlease()
   if (f_op-setlease())
  res = f_op-setlease()
  if (res)
 return res;
   lock_kernel()
   generic_setlease()
   unlock_kernel()

fcntl_setlease
   if (f_op-setlease())
  res = f_op-setlease()
  if (res)
 return res;
   lock_kernel
   generic_setlease()
   ... fasync ...
   unlock_kernel

So 'f_op-setlease' now means Try to get a lease from the fileserver.
We can optimise this a bit to not even call setlease if, say, we already
have a read lease and we're trying to get a second read lease.  But we
have to record our local leases (that way they'll show up in /proc/locks).

I think the combination of these three ideas gives us a sane interface
to the various setlease functions, and let us convert from lock_kernel()
to spin_lock() later.  Have I missed something?  I don't often think
about the needs of cluster filesystems, so I may misunderstand how they
need this operation to work.

At some point, we need to revisit the logic for 'is this file open
by another process' -- it's clearly buggy since it doesn't hold the
inode_lock while checking i_count, so it could race against an __iget().

-- 
Intel are signing my paycheques ... these opinions are still mine
Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: On setting a lease across a cluster

2008-01-04 Thread Matthew Wilcox
On Fri, Jan 04, 2008 at 02:47:18PM -0500, J. Bruce Fields wrote:
   Then I started to wonder about the current split of functionality between
   fcntl_setlease, vfs_setlease and generic_setlease.  The check for no
   other process having this file open should be common to all filesystems.
   And it should be honoured by NFS (it shouldn't hand out a delegation if
   a local user has the file open), so it needs to be in vfs_setlease().
 
 Of course, in the case of a (still unfortunately theoretical) cluster
 filesystem implementation, those checks will be insufficient on their
 own.

Oh yes, definitely insufficient, but also necessary.  Right now,
filesystems bypass them entirely.  And the last thing we want is
individual filesystems duplicating these tests ...

 Though the write-lease check especially:
 
   if ((arg == F_WRLCK)
  ((atomic_read(dentry-d_count)  1)
   || (atomic_read(inode-i_count)  1)))
   goto out;
 
 seems like a terrible hack, and gives some annoying false positives:
 
   http://www.ussg.iu.edu/hypermail/linux/kernel/0711.1/2255.html

It is a terrible hack.  It was sufficient for the time, but it wasn't
supposed to be merged in that state ... I haven't had the courage to
figure out how to solve it properly yet.

   Now I'm worrying about the mechanics of calling out to a filesystem to
   perform a lock.  Obviously, a network filesystem is going to have to
   sleep waiting for a response from the fileserver, so that precludes the
   use of a spinlock held over the call to f_op-setlease.  I'm not keen on
   the idea of EXPORT_SYMBOL_GPL() a spinlock -- verifying locking is quite
   hard enough without worrying what a filesystem might be doing with it.
 
 I'm confused as to what the locks are actually protecting.
 
 If they're mainly just protecting the inode's lock list, for example,
 then we should let the filesystem call back to helpers in locks.c for
 any manipulations of that list and do the locking in those helpers.

But we need to ensure the lock list is always consistent for, eg,
/proc/locks.  We don't want locks to temporarily appear on the list when
they never get granted (due to an error elsewhere in the chain).

   vfs_setlease()
  if (f_op-setlease())
 res = f_op-setlease()
 if (res)
return res;
  lock_kernel()
  generic_setlease()
  unlock_kernel()
 
 Why can't the filesystem call into generic_setlease() on its own?

Because (assuming we're rid of the BKL), fcntl_setlease() needs to
acquire the spinlock and hold it while generic_setlease() runs, so
generic_setlease() can't acquire the lock.

 Christoph Hellwig has suggested removing the second case and just doing
 
   .setlease = generic_setlease
 
 for all filesystems that should support leases; see the final patch in
 
   git://linux-nfs.org/~bfields/linux.git server-cluster-lease-api

Yeah, but that doesn't work if we repurpose the setlease f_op.

-- 
Intel are signing my paycheques ... these opinions are still mine
Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: On setting a lease across a cluster

2008-01-04 Thread Matthew Wilcox
On Fri, Jan 04, 2008 at 01:55:36PM -0500, david m. richter wrote:
   fwiw, i've done some work on extending the lease subsystem to help 
 support the full range of requirements for NFSv4 file and directory 
 delegations (e.g., breaking a lease when unlinking a file) and we ended up 
 actually doing most of what you've just suggested here, which i take to be 
 a good sign.

As long as it's great minds thinking alike and not fools seldom
differing ;-)

   most of my refactoring came out of trying to simplify locking and 
 avoid holding locks too long (rather than specifically focusing on 
 cluster-oriented stuff, but the goals dovetail) and your work on getting 
 the BKL out of locks.c entirely is something i really like and look 
 forward to.

Excellent.  Shall I make the patch myself, or did you want to post a
patch based on working code?  ;-)

-- 
Intel are signing my paycheques ... these opinions are still mine
Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: On setting a lease across a cluster

2008-01-04 Thread Matthew Wilcox
On Fri, Jan 04, 2008 at 03:53:04PM -0500, J. Bruce Fields wrote:
 So, the problem is that fcntl_setlease() does
 
   vfs_setlease()
   fasync_helper()
 
 which the bkl held over both, and you want to preserve that?
 
 But what that BKL is doing is a mystery to me--the very first thing that
 fasync_helper() does is kmem_cache_alloc(., GFP_KERNEL).  So you won't
 be introducing any new problem if you lock those two operations
 separately.  Unless I'm totally missing something.

A very good point.

So yet another race caused by using the BKL rather than thinking ... but
maybe it's an inconsequential race.  The consequences are that (if the
kmalloc in fasync_helper sleeps) a lease appears that isn't fully set-up
yet (and may be removed if the kmalloc fails).  Actually, it seems bad
if the kmalloc eventually succeeds -- there's a window while kmalloc is
sleeping where another process could open the file, break the lease,
fl_fasync will be NULL, so no signal is sent.  Then 30 seconds later the 
lease is removed without the leaseholder being sent a signal.  Bad.

How can we fix this situation?  I think we need a better interface than
fasync_helper() -- fasync_alloc() and fasync_setup() would seem to do
the trick.

-- 
Intel are signing my paycheques ... these opinions are still mine
Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] Remove BKL from fs/locks.c

2007-12-30 Thread Matthew Wilcox
On Sun, Dec 30, 2007 at 08:36:44PM +1100, Stephen Rothwell wrote:
 We should probably do some performance testing on this because the last
 time we tried the impact was quite noticeable.  You should ping Tridge as
 he has some good lock testing setups.  And he cares if we slow him down :-)

Last time I did this, I switched to a semaphore instead of a spinlock.
That was what slowed us down.  I doubt we can see a performance loss
with this patch since it's a 1-1 substitution of the BKL spinlock with a
private spinlock.

Good idea about asking tridge for an evaluation of the patch though.

-- 
Intel are signing my paycheques ... these opinions are still mine
Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC] Remove BKL from fs/locks.c

2007-12-29 Thread Matthew Wilcox

I've been promising to do this for about seven years now.

It seems to work well enough, but I haven't run any serious stress
tests on it.  This implementation uses one spinlock to protect both lock
lists and all the i_flock chains.  It doesn't seem worth splitting up
the locking any further.

I had to move one memory allocation out from under the file_lock_lock.
I hope I got that logic right.  I'm rather tempted to split out the
find_conflict algorithm from that function into something that can be
called separately for the FL_ACCESS case.

I also have to drop and reacquire the file_lock_lock around the call
to cond_resched().  This was done automatically for us before by the
special BKL semantics.

I had to change vfs_setlease() as it relied on the special BKL ability
to recursively acquire the same lock.  The internal caller now calls
__vfs_setlease and the exported interface acquires and releases the
file_lock_lock around calling __vfs_setlease.

I should probably split out the removal of interruptible_sleep_on_locked()
as it's basically unrelated to all this.

diff --git a/fs/locks.c b/fs/locks.c
index 8b8388e..68de569 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -139,9 +139,23 @@ int lease_break_time = 45;
 #define for_each_lock(inode, lockp) \
for (lockp = inode-i_flock; *lockp != NULL; lockp = 
(*lockp)-fl_next)
 
+/*
+ * Protects the two list heads below, plus the inode-i_flock list
+ */
+static DEFINE_SPINLOCK(file_lock_lock);
 static LIST_HEAD(file_lock_list);
 static LIST_HEAD(blocked_list);
 
+static inline void lock_flocks(void)
+{
+   spin_lock(file_lock_lock);
+}
+
+static inline void unlock_flocks(void)
+{
+   spin_unlock(file_lock_lock);
+}
+
 static struct kmem_cache *filelock_cache __read_mostly;
 
 /* Allocate an empty lock structure. */
@@ -507,9 +521,9 @@ static void __locks_delete_block(struct file_lock *waiter)
  */
 static void locks_delete_block(struct file_lock *waiter)
 {
-   lock_kernel();
+   lock_flocks();
__locks_delete_block(waiter);
-   unlock_kernel();
+   unlock_flocks();
 }
 
 /* Insert waiter into blocker's block list.
@@ -634,29 +648,15 @@ static int flock_locks_conflict(struct file_lock 
*caller_fl, struct file_lock *s
return (locks_conflict(caller_fl, sys_fl));
 }
 
-static int interruptible_sleep_on_locked(wait_queue_head_t *fl_wait, int 
timeout)
-{
-   int result = 0;
-   DECLARE_WAITQUEUE(wait, current);
-
-   __set_current_state(TASK_INTERRUPTIBLE);
-   add_wait_queue(fl_wait, wait);
-   if (timeout == 0)
-   schedule();
-   else
-   result = schedule_timeout(timeout);
-   if (signal_pending(current))
-   result = -ERESTARTSYS;
-   remove_wait_queue(fl_wait, wait);
-   __set_current_state(TASK_RUNNING);
-   return result;
-}
-
-static int locks_block_on_timeout(struct file_lock *blocker, struct file_lock 
*waiter, int time)
+static int locks_block_on_timeout(struct file_lock *blocker,
+   struct file_lock *waiter, int time)
 {
int result;
locks_insert_block(blocker, waiter);
-   result = interruptible_sleep_on_locked(waiter-fl_wait, time);
+   unlock_flocks();
+   result = wait_event_interruptible_timeout(waiter-fl_wait,
+   !waiter-fl_next, time);
+   lock_flocks();
__locks_delete_block(waiter);
return result;
 }
@@ -666,7 +666,7 @@ posix_test_lock(struct file *filp, struct file_lock *fl)
 {
struct file_lock *cfl;
 
-   lock_kernel();
+   lock_flocks();
for (cfl = filp-f_path.dentry-d_inode-i_flock; cfl; cfl = 
cfl-fl_next) {
if (!IS_POSIX(cfl))
continue;
@@ -677,7 +677,7 @@ posix_test_lock(struct file *filp, struct file_lock *fl)
__locks_copy_lock(fl, cfl);
else
fl-fl_type = F_UNLCK;
-   unlock_kernel();
+   unlock_flocks();
return;
 }
 
@@ -741,18 +741,16 @@ static int flock_lock_file(struct file *filp, struct 
file_lock *request)
int error = 0;
int found = 0;
 
-   lock_kernel();
-   if (request-fl_flags  FL_ACCESS)
-   goto find_conflict;
-
-   if (request-fl_type != F_UNLCK) {
-   error = -ENOMEM;
+   if (!(request-fl_flags  FL_ACCESS)  (request-fl_type != F_UNLCK)) {
new_fl = locks_alloc_lock();
-   if (new_fl == NULL)
-   goto out;
-   error = 0;
+   if (!new_fl)
+   return -ENOMEM;
}
 
+   lock_flocks();
+   if (request-fl_flags  FL_ACCESS)
+   goto find_conflict;
+
for_each_lock(inode, before) {
struct file_lock *fl = *before;
if (IS_POSIX(fl))
@@ -778,8 +776,11 @@ static int flock_lock_file(struct file *filp, struct 
file_lock *request)
 * If a higher-priority process was 

Re: NFS Killable tasks request comments on patch

2007-12-06 Thread Matthew Wilcox
On Thu, Dec 06, 2007 at 10:34:26AM -0700, Matthew Wilcox wrote:
 I've done a more thorough review of Liam's work and come up with a few
 more fixes (which I'll publish later today)

I've put up a git tree for this work; see
http://git.kernel.org/?p=linux/kernel/git/willy/misc.git;a=shortlog;h=task_killable

I've split up some of the earlier patches, hopefully in a way which will
make akpm less grumpy.  Andrew, let me know if I've misunderstood how
you want to see it.

Here's the commit for the NFS people to chew on a bit.

commit f05b88f294044cdf56fbe637a246ba6c7a14d6f1
Author: Matthew Wilcox [EMAIL PROTECTED]
Date:   Thu Dec 6 16:24:39 2007 -0500

NFS: Switch from intr mount option to TASK_KILLABLE

By using the TASK_KILLABLE infrastructure, we can get rid of the 'intr'
mount option.  We have to use _killable everywhere instead of _interruptible
as we get rid of rpc_clnt_sigmask/sigunmask.

Signed-off-by: Liam R. Howlett [EMAIL PROTECTED]
Signed-off-by: Matthew Wilcox [EMAIL PROTECTED]

diff --git a/fs/nfs/client.c b/fs/nfs/client.c
index 70587f3..310fa2f 100644
--- a/fs/nfs/client.c
+++ b/fs/nfs/client.c
@@ -302,7 +302,7 @@ found_client:
if (new)
nfs_free_client(new);
 
-   error = wait_event_interruptible(nfs_client_active_wq,
+   error = wait_event_killable(nfs_client_active_wq,
clp-cl_cons_state != NFS_CS_INITING);
if (error  0) {
nfs_put_client(clp);
@@ -494,10 +494,6 @@ static int nfs_init_server_rpcclient(struct nfs_server 
*server, rpc_authflavor_t
if (server-flags  NFS_MOUNT_SOFT)
server-client-cl_softrtry = 1;
 
-   server-client-cl_intr = 0;
-   if (server-flags  NFS4_MOUNT_INTR)
-   server-client-cl_intr = 1;
-
return 0;
 }
 
diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
index 5e8d82f..7b994b2 100644
--- a/fs/nfs/direct.c
+++ b/fs/nfs/direct.c
@@ -193,7 +193,7 @@ static ssize_t nfs_direct_wait(struct nfs_direct_req *dreq)
if (dreq-iocb)
goto out;
 
-   result = wait_for_completion_interruptible(dreq-completion);
+   result = wait_for_completion_killable(dreq-completion);
 
if (!result)
result = dreq-error;
@@ -391,9 +391,7 @@ static ssize_t nfs_direct_read(struct kiocb *iocb, const 
struct iovec *iov,
   unsigned long nr_segs, loff_t pos)
 {
ssize_t result = 0;
-   sigset_t oldset;
struct inode *inode = iocb-ki_filp-f_mapping-host;
-   struct rpc_clnt *clnt = NFS_CLIENT(inode);
struct nfs_direct_req *dreq;
 
dreq = nfs_direct_req_alloc();
@@ -405,11 +403,9 @@ static ssize_t nfs_direct_read(struct kiocb *iocb, const 
struct iovec *iov,
if (!is_sync_kiocb(iocb))
dreq-iocb = iocb;
 
-   rpc_clnt_sigmask(clnt, oldset);
result = nfs_direct_read_schedule_iovec(dreq, iov, nr_segs, pos);
if (!result)
result = nfs_direct_wait(dreq);
-   rpc_clnt_sigunmask(clnt, oldset);
nfs_direct_req_release(dreq);
 
return result;
@@ -767,9 +763,7 @@ static ssize_t nfs_direct_write(struct kiocb *iocb, const 
struct iovec *iov,
size_t count)
 {
ssize_t result = 0;
-   sigset_t oldset;
struct inode *inode = iocb-ki_filp-f_mapping-host;
-   struct rpc_clnt *clnt = NFS_CLIENT(inode);
struct nfs_direct_req *dreq;
size_t wsize = NFS_SERVER(inode)-wsize;
int sync = 0;
@@ -787,11 +781,9 @@ static ssize_t nfs_direct_write(struct kiocb *iocb, const 
struct iovec *iov,
if (!is_sync_kiocb(iocb))
dreq-iocb = iocb;
 
-   rpc_clnt_sigmask(clnt, oldset);
result = nfs_direct_write_schedule_iovec(dreq, iov, nr_segs, pos, sync);
if (!result)
result = nfs_direct_wait(dreq);
-   rpc_clnt_sigunmask(clnt, oldset);
nfs_direct_req_release(dreq);
 
return result;
diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index db5d96d..f68c222 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -433,15 +433,11 @@ static int nfs_wait_schedule(void *word)
  */
 static int nfs_wait_on_inode(struct inode *inode)
 {
-   struct rpc_clnt *clnt = NFS_CLIENT(inode);
struct nfs_inode *nfsi = NFS_I(inode);
-   sigset_t oldmask;
int error;
 
-   rpc_clnt_sigmask(clnt, oldmask);
error = wait_on_bit_lock(nfsi-flags, NFS_INO_REVALIDATING,
-   nfs_wait_schedule, TASK_INTERRUPTIBLE);
-   rpc_clnt_sigunmask(clnt, oldmask);
+   nfs_wait_schedule, TASK_KILLABLE);
 
return error;
 }
diff --git a/fs/nfs/mount_clnt.c b/fs/nfs/mount_clnt.c
index 8afd9f7..49c7cd0 100644
--- a/fs/nfs/mount_clnt.c
+++ b/fs/nfs/mount_clnt.c
@@ -56,7 +56,7 @@ int nfs_mount(struct sockaddr *addr, size_t len, char 
*hostname, char *path

Re: Massive slowdown when re-querying large nfs dir

2007-11-04 Thread Matthew Wilcox
On Mon, Nov 05, 2007 at 07:58:38AM +0300, Al Boldi wrote:
 Any ideas?

How about tcpdumping and seeing what requests are flowing across the
wire?  You might be able to figure out what's being done differently.

-- 
Intel are signing my paycheques ... these opinions are still mine
Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC, PATCH] locks: remove posix deadlock detection

2007-10-28 Thread Matthew Wilcox
On Sun, Oct 28, 2007 at 01:43:21PM -0400, J. Bruce Fields wrote:
 We currently attempt to return -EDEALK to blocking fcntl() file locking
 requests that would create a cycle in the graph of tasks waiting on
 locks.
 
 This is inefficient: in the general case it requires us determining
 whether we're adding a cycle to an arbitrary directed acyclic graph.
 And this calculation has to be performed while holding a lock (currently
 the BKL) that prevents that graph from changing.
 
 It has historically been a source of bugs; most recently it was noticed
 that it could loop indefinitely while holding the BKL.

It can also return -EDEADLK spuriously.  So yeah, just kill it.

-- 
Intel are signing my paycheques ... these opinions are still mine
Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC, PATCH] locks: remove posix deadlock detection

2007-10-28 Thread Matthew Wilcox
On Sun, Oct 28, 2007 at 06:40:52PM +, Alan Cox wrote:
 NAK. This is an ABI change. It was also comprehensively rejected before
 because
 
 - EDEADLK behaviour is ABI

Not in any meaningful way.

 - EDEADLK behaviour is required by SuSv3

What SuSv3 actually says is:

If the system detects that sleeping until a locked region is
unlocked would cause a deadlock, fcntl() shall fail with an
[EDEADLK] error.

It doesn't require the system to detect it, only mandate what to return
if it does detect it.

 - We have no idea what applications may rely on this behaviour.

I've never heard of one that does.

 so we need to fix the bugs - the lock usage and the looping. At that
 point it merely becomes a performance concern to those who use it, which
 is the proper behaviour. If you want a faster non-checking one use
 flock(), or add another flag that is a Linux don't check for deadlock

You can't fix the false EDEADLK detection without solving the halting
problem.  Best of luck with that.

-- 
Intel are signing my paycheques ... these opinions are still mine
Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC, PATCH] locks: remove posix deadlock detection

2007-10-28 Thread Matthew Wilcox
On Sun, Oct 28, 2007 at 05:50:30PM -0400, Trond Myklebust wrote:
  You can't fix the false EDEADLK detection without solving the halting
  problem.  Best of luck with that.
 
 I can see that it would be difficult to do efficiently, but basically,
 this boils down to finding a circular path in a graph. That is hardly an
 unsolvable issue...

Bzzt.  You get a false deadlock with multiple threads like so:

Thread A of task B takes lock 1
Thread C of task D takes lock 2
Thread C of task D blocks on lock 1
Thread E of task B blocks on lock 2

We currently declare deadlock at this point (unless the deadlock detection
code has changed since I last looked at it), despite thread A being about
to release lock 1.  Oh, and by the way, thread E is capable of releasing
lock 1, so you can't just say well, detect by thread instead of by task.

So the only way I can see to accurately detect deadlock is to simulate
the future execution of all threads in task B to see if any of them
will release lock 1 without first gaining lock 2.  Which, I believe,
is halting-equivalent.

-- 
Intel are signing my paycheques ... these opinions are still mine
Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC, PATCH] locks: remove posix deadlock detection

2007-10-28 Thread Matthew Wilcox
On Sun, Oct 28, 2007 at 10:48:33PM +, Alan Cox wrote:
  Bzzt.  You get a false deadlock with multiple threads like so:
  
  Thread A of task B takes lock 1
  Thread C of task D takes lock 2
  Thread C of task D blocks on lock 1
  Thread E of task B blocks on lock 2
 
 The spec and SYSV certainly ignore threading in this situation and you
 know that perfectly well (or did in 2004)

The discussion petered out (or that mailing list archive lost articles
from the thread) without any kind of resolution, or indeed interest.
What is your suggestion for handling this problem?  As it is now, the
kernel 'detects' deadlock where there is none, which doesn't seem
allowed by SuSv3 either.

-- 
Intel are signing my paycheques ... these opinions are still mine
Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC, PATCH] locks: remove posix deadlock detection

2007-10-28 Thread Matthew Wilcox
On Sun, Oct 28, 2007 at 11:55:52PM +0100, Jiri Kosina wrote:
 On Sun, 28 Oct 2007, Matthew Wilcox wrote:
 
  Bzzt.  You get a false deadlock with multiple threads like so:
  Thread A of task B takes lock 1
  Thread C of task D takes lock 2
  Thread C of task D blocks on lock 1
  Thread E of task B blocks on lock 2
 
   A potential for deadlock occurs if a process controlling a locked 
   region is put to sleep by attempting to lock another process' 
   locked region. If the system detects that sleeping until a locked 
   region is unlocked would cause a deadlock, fcntl() shall fail with 
   an [EDEADLK] error.
 
 This is what POSIX says [1], even after being modified with respect to 
 POSIX Threads Extension, right?
 
 So it doesn't deal with threads at all, just processess are taken into 
 account. Probably for a reason :)

Did you have a concrete suggestion, or are you just quoting the spec?

The problem is that it's nonsense -- processes don't sleep, threads do.
I think the key is would deadlock, not might deadlock.  Our current
behaviour is clearly in violation of SuSv3.

-- 
Intel are signing my paycheques ... these opinions are still mine
Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC, PATCH] locks: remove posix deadlock detection

2007-10-28 Thread Matthew Wilcox
On Sun, Oct 28, 2007 at 09:38:55PM +, Alan Cox wrote:
  It doesn't require the system to detect it, only mandate what to return
  if it does detect it.
 
 We should be detecting at least the obvious case.

What is the obvious case?  A task that has never called clone()?

 If SYSV only spots simple AB - BA deadlocks or taking the same lock twice
 yourself then that ought to be sufficient for us too.

You can't deadlock against yourself -- either with POSIX locks or BSD
locks (POSIX simple upgrades/downgrades the lock; each byte in a file
can only be in either read-locked state or write-locked state for a
given process.  BSD locks release the lock and wake all waiters before
attempting to acquire the lock in its new state).  In my other post, I
showed a simple AB-BA deadlock that can't be easily detected.

-- 
Intel are signing my paycheques ... these opinions are still mine
Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: SLUB performance regression vs SLAB

2007-10-05 Thread Matthew Wilcox
On Fri, Oct 05, 2007 at 08:48:53AM +0200, Jens Axboe wrote:
 I'd like to second Davids emails here, this is a serious problem. Having
 a reproducible test case lowers the barrier for getting the problem
 fixed by orders of magnitude. It's the difference between the problem
 getting fixed in a day or two and it potentially lingering for months,
 because email ping-pong takes forever and the test team has moved on to
 other tests, we'll let you know the results of test foo in 3 weeks time
 when we have a new slot on the box just removing any developer
 motivation to work on the issue.

I vaguely remembered something called orasim, so I went looking for it.
I found http://oss.oracle.com/~wcoekaer/orasim/ which is dated from
2004, and I found http://oss.oracle.com/projects/orasimjobfiles/ which
seems to be a stillborn project.  Is there anything else I should know
about orasim?  ;-)

-- 
Intel are signing my paycheques ... these opinions are still mine
Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


SLUB performance regression vs SLAB

2007-10-04 Thread Matthew Wilcox
On Mon, Oct 01, 2007 at 01:50:44PM -0700, Christoph Lameter wrote:
 The problem is with the weird way of Intel testing and communication. 
 Every 3-6 month or so they will tell you the system is X% up or down on 
 arch Y (and they wont give you details because its somehow secret). And 
 then there are conflicting statements by the two or so performance test 
 departments. One of them repeatedly assured me that they do not see any 
 regressions.

Could you cut out the snarky remarks?  It takes a long time to run a
test, and testing every one of the patches you send really isn't high
on anyone's priority list.  The performance team have also been having
problems getting stable results with recent kernels, adding to the delay.
The good news is that we do now have committment to testing upstream
kernels, so you should see results more frequently than you have been.

I'm taking over from Suresh as liason for the performance team, so
if you hear *anything* from *anyone* else at Intel about performance,
I want you to cc me about it.  OK?  And I don't want to hear any more
whining about hearing different things from different people.

So, on a well-known OLTP benchmark which prohibits publishing absolute
numbers and on an x86-64 system (I don't think exactly which model
is important), we're seeing *6.51%* performance loss on slub vs slab.
This is with a 2.6.23-rc3 kernel.  Tuning the boot parameters, as you've
asked for before (slub_min_order=2, slub_max_order=4, slub_min_objects=8)
gets back 0.38% of that.  It's still down 6.13% over slab.

For what it's worth, 2.6.23-rc3 already has a 1.19% regression versus
RHEL 4.5, so the performance guys are really unhappy about going up to
almost 8% regression.

In the detailed profiles, __slab_free is the third most expensive
function, behind only spin locks.  get_partial_node is right behind it
in fourth place, and kmem_cache_alloc is sixth.  __slab_alloc is eight
and kmem_cache_free is tenth.  These positions don't change with the
slub boot parameters.

Now, where do we go next?  I suspect that 2.6.23-rc9 has significant
changes since -rc3, but I'd like to confirm that before kicking off
another (expensive) run.  Please, tell me what useful kernels are to test.

-- 
Intel are signing my paycheques ... these opinions are still mine
Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: SLUB performance regression vs SLAB

2007-10-04 Thread Matthew Wilcox
On Thu, Oct 04, 2007 at 10:38:15AM -0700, Christoph Lameter wrote:
 On Thu, 4 Oct 2007, Matthew Wilcox wrote:
 
  So, on a well-known OLTP benchmark which prohibits publishing absolute
  numbers and on an x86-64 system (I don't think exactly which model
  is important), we're seeing *6.51%* performance loss on slub vs slab.
  This is with a 2.6.23-rc3 kernel.  Tuning the boot parameters, as you've
  asked for before (slub_min_order=2, slub_max_order=4, slub_min_objects=8)
  gets back 0.38% of that.  It's still down 6.13% over slab.
 
 Yeah the fastpath vs. slow path is not the issue as Siddha and I concluded 
 earlier. Seems that we are mainly seeing cacheline bouncing due to two 
 cpus accessing meta data in the same page struct. The patches in 
 MM that are scheduled to be merged for .24 address that issue. I 
 have repeatedly asked that these patches be tested. The patches were 
 posted months ago.

I just checked with the guys who did the test.  When I said -rc3, I
mis-spoke; this is 2.6.23-rc3 *plus* the patches which Suresh agreed to
test for you.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: SLUB performance regression vs SLAB

2007-10-04 Thread Matthew Wilcox
On Thu, Oct 04, 2007 at 01:48:34PM -0700, David Miller wrote:
 There comes a point where it is the reporter's responsibility to help
 the developer come up with a publishable test case the developer can
 use to work on fixing the problem and help ensure it stays fixed.

That's a lot of effort.  Is it more effort than doing some remote
debugging with Christoph?  I don't know.

 Using an unpublishable benchmark, whose results even cannot be
 published, really stretches the limits of reasonable don't you
 think?

Yet here we stand.  Christoph is aggressively trying to get slab removed
from the tree.  There is a testcase which shows slub performing worse
than slab.  It's not my fault I can't publish it.  And just because I
can't publish it doesn't mean it doesn't exist.

Slab needs to not get removed until slub is as good a performer on this
benchmark.

-- 
Intel are signing my paycheques ... these opinions are still mine
Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: SLUB performance regression vs SLAB

2007-10-04 Thread Matthew Wilcox
On Thu, Oct 04, 2007 at 01:55:37PM -0700, David Miller wrote:
 Anything, I do mean anything, can be simulated using small test
 programs.  Pointing at a big fancy machine with lots of storage
 and disk is a passive aggressive way to avoid the real issues,
 in that nobody is putting forth the effort to try and come up
 with an at least publishable test case that Christoph can use to
 help you guys.
 
 If coming up with a reproducable and publishable test case is
 the difference between this getting fixed and it not getting
 fixed, are you going to invest the time to do that?

If that's what it takes, then yes.  But I'm far from convinced that
it's as easy to come up with a TPC benchmark simulator as you think.
There have been efforts in the past (orasim, for example), but
presumably Christoph has already tried these benchmarks.

-- 
Intel are signing my paycheques ... these opinions are still mine
Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] isofs: add +w bit for non-RR discs

2007-10-04 Thread Matthew Wilcox
On Tue, Oct 02, 2007 at 08:00:26PM +0200, Jan Engelhardt wrote:
 Add %S_IWUGO bit for files on ISO-9660 filesystems without RockRidge

Looks to me like you've added S_IWUSR, not S_IWUGO.

 - popt-mode = S_IRUGO | S_IXUGO; /*
 + popt-mode = S_IRUGO | S_IWUSR | S_IXUGO;
 - inode-i_mode = S_IRUGO | S_IXUGO | S_IFDIR;
 + inode-i_mode = S_IRUGO | S_IWUSR | S_IXUGO | S_IFDIR;

-- 
Intel are signing my paycheques ... these opinions are still mine
Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] fs: Correct SuS compliance for open of large file without options

2007-09-27 Thread Matthew Wilcox
On Thu, Sep 27, 2007 at 02:37:42PM -0400, Theodore Tso wrote:
 I'm reminded of Rusty's 2003 OLS Keynote, where he points out that
 what's important is not making an interface easy to use, but _hard_
 _to_ _misuse_.  That fact that sysfs is all laid out in a directory,
 but for which some directories/symlinks are OK to use, and some are
 NOT OK to use --- is why I call the sysfs interface an open pit.
 Sure, if you have the map to the minefield, a minefield is perfectly
 safe when you know what to avoid.  But is that the best way to
 construct a path/interface for an application programmer to get from
 point A to point B?  Maybe, maybe not.

I agree with your criticism of sysfs (indeed, I think I've made many of
them before in somewhat stronger language), but do you have a suggestion
as to an interface that would be harder to misuse?

-- 
Intel are signing my paycheques ... these opinions are still mine
Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] fs: Correct SuS compliance for open of large file without options

2007-09-27 Thread Matthew Wilcox
On Thu, Sep 27, 2007 at 07:19:27PM -0400, Theodore Tso wrote:
 Would you accept a patch which causes the deprecated sysfs
 files/directories to disappear, even if CONFIG_SYS_DEPRECATED is
 defined, via a boot-time parameter?

How about a mount option?  That way people can test without a reboot:

mount -o remount,deprecated={yes,no} /sys

-- 
Intel are signing my paycheques ... these opinions are still mine
Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 1/2] VFS: new fgetattr() file operation

2007-09-24 Thread Matthew Wilcox
On Mon, Sep 24, 2007 at 02:48:08PM +0200, Miklos Szeredi wrote:
  and if that means adding silly rename support so be it.
 
 That's what is done currently.
 
 But it's has various dawbacks, like rmdir doesn't work if there are
 open files within an otherwise empty directory.
 
 I'd happily accept suggestions on how to deal with this differenty.

Only sillyrename files with nlink  1?  I don't see how attributes can
change anything for a deleted file.

-- 
Intel are signing my paycheques ... these opinions are still mine
Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 1/2] VFS: new fgetattr() file operation

2007-09-24 Thread Matthew Wilcox
On Mon, Sep 24, 2007 at 03:06:06PM +0200, Miklos Szeredi wrote:
 A file isn't deleted while there are still links or open files
 refering to it.  So getting the attributes for a file with nlink==0 is
 perfectly valid while the file is still open.

Is it?  Why not just pretend that the attributes are wiped when the file
is deleted.  Effectively, they are, since they can't affect anything.

 If a network filesystem protocol can't handle operations (be it data
 or metadata) on an unlinked file, we must do sillirenaming, so that
 the file is not actually unlinked.

Or you could call getattr right before you unlink and cache the result
in the client.

-- 
Intel are signing my paycheques ... these opinions are still mine
Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 0/2] avoid clobbering registers with J_ASSERT macro

2007-08-20 Thread Matthew Wilcox
On Mon, Aug 20, 2007 at 04:22:04PM +0100, Stephen C. Tweedie wrote:
 Hi,
 
 On Mon, 2007-08-20 at 09:18 -0400, Chris Snook wrote:
 
   How's about we just remove that printk?  Do
   
   #define J_ASSERT(e) BUG_ON(e)?

ITYM #define J_ASSERT(e) BUG_ON(!e)

 It did.  The original J_ASSERT predates BUG() entirely, and was added so
 that we got the file/line-no information.  But with the current BUG()
 macro, I can't see any reason for J_ASSERT still to try to gather that
 information itself.

Do you still want to keep J_ASSERT, or should all uses of it be replaced
with BUG_ON?

(to put it another way; if you were writing JBD now, would you add your
own J_ASSERT, or would you just use BUG_ON directly?)

-- 
Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Advocacy] Re: 3ware 9650 tips

2007-07-16 Thread Matthew Wilcox
On Mon, Jul 16, 2007 at 08:40:00PM +0300, Al Boldi wrote:
 XFS surely rocks, but it's missing one critical component: data=ordered
 And that's one component that's just too critical to overlook for an 
 enterprise environment that is built on data-integrity over performance.
 
 So that's the secret why people still use ext3, and XFS' reliance on external 
 hardware to ensure integrity is really misplaced.
 
 Now, maybe when we get the data=ordered onto the VFS level, then maybe XFS 
 may become viable for the enterprise, and ext3 may cease to be KING.

Wow, thanks for bringing an advocacy thread onto linux-fsdevel.  Just what
we wanted.  Do you have any insight into how to get the data=ordered
onto the VFS level?  Because to me, that sounds like pure nonsense.

-- 
Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] VFS: data=ordered (was: [Advocacy] Re: 3ware 9650 tips)

2007-07-16 Thread Matthew Wilcox
On Mon, Jul 16, 2007 at 09:28:08PM +0300, Al Boldi wrote:
 Well, conceptually it sounds like a piece of cake, technically your guess is 
 as good as mine.  IIRC, akpm once mentioned something like this.

How much have you looked at the VFS?  There's nothing journalling-related
in the VFS right now.  ext3 and XFS share no common journalling code,
nor do I think that would be possible, due to the very different concepts
they have of journalling.

Here's a good hint:

$ find fs -type f |xargs grep -l journal_start
fs/ext3/acl.c
fs/ext3/inode.c
fs/ext3/ioctl.c
fs/ext3/namei.c
fs/ext3/resize.c
fs/ext3/super.c
fs/ext3/xattr.c
fs/ext4/acl.c
fs/ext4/extents.c
fs/ext4/inode.c
fs/ext4/ioctl.c
fs/ext4/namei.c
fs/ext4/resize.c
fs/ext4/super.c
fs/ext4/xattr.c
fs/jbd/journal.c
fs/jbd/transaction.c
fs/jbd2/journal.c
fs/jbd2/transaction.c
fs/ocfs2/journal.c
fs/ocfs2/super.c

JBD and JBD2 provide a journalling implementation that ext3, ext4 and
ocfs2 use.  Note that XFS doesn't, it has its own journalling code.

If you want XFS to support data=ordered, talk to the XFS folks.  Or
start picking through XFS yourself, of course -- you do have the source
code.

-- 
Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ANNOUNCE] util-linux-ng 2.13-rc1

2007-07-05 Thread Matthew Wilcox
On Thu, Jul 05, 2007 at 11:30:20PM +0200, Karel Zak wrote:
 The package build system is now based on autotools. The build system
 supports  separate CFLAGS and LDFLAGS for suid programs (SUID_CFLAGS,
 SUID_LDFLAGS). For more details see the README file
  
   And this is really dumb.  autotools is a completely pain in the ass and
 
  Well, Adrian Bunk added autotools stuff to util-linux during his work
  on v2.13. This stuff has been fixed and stabilized in util-linux-ng
  v2.13.
 
  I'm not fanatical autotools protagonist, but it seems useful in
  util-linux. We will see...
 
  I'm ready to change or fix arbitrary thing in util-linux-ng, but I
  always need a real reason -- bug report, new feature, or so. This
  discussion is about impressions and feelings only.

No, it's based on long, hard and painful experiences attempting to debug
the fuckups that autoconf creates when it goes wrong.

-- 
Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Read/write counts

2007-06-04 Thread Matthew Wilcox
On Mon, Jun 04, 2007 at 09:56:07AM -0700, Bryan Henderson wrote:
 Programs that assume a full transfer are fairly common, but are 
 universally regarded as either broken or just lazy, and when it does cause 
 a problem, it is far more common to fix the application than the kernel.

Linus has explicitly forbidden short reads from being returned.  The
original poster may get away with it for a specialised case, but for
example, signals may not cause a return to userspace with a short read
for exactly this reason.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [NFS] [PATCH] locks: provide a file lease method enabling cluster-coherent leases

2007-06-01 Thread Matthew Wilcox
On Fri, Jun 01, 2007 at 12:44:16PM -0400, J. Bruce Fields wrote:
 The only problem I'm aware of is that leases aren't broken on rename,
 link, and unlink.  This is kind of tricky to fix.  David Richter (cc'd)
 and I sketched out a few different approaches, and I think he has some
 patches implementing at least one of them.
 
 This may require defining some new type of lease, to avoid changing the
 documented behavior of the fcntl lease operations (which only break on
 open).  Although actually I believe Samba needs the same behavior we do,
 and they're probably the most important user of leases

Samba internally prohibits renaming or deleting an open file, to match
Windows semantics.  So it won't notice the difference.  At least, that's
what I remember from a discussion with Tridge when we were implementing
leases back in 2000.

I think it's an acceptable change in Linux semantics to break leases on
rename/delete/link.  Though I'm not quite sure why you need to -- nobody
else is touching the contents of the file, so it's not like you need to
write the data back to it, or discard your cached copy of it in the case
of a read-only lease.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [nameidata 1/2] Don't pass NULL nameidata to vfs_create

2007-04-16 Thread Matthew Wilcox
On Mon, Apr 16, 2007 at 06:11:30PM +0200, Andreas Gruenbacher wrote:
 +static inline int
 +nfsd_do_create(struct inode *dir, struct dentry *child, struct vfsmount *mnt,
 +int mode)
 +{
 + struct nameidata nd = {
 + .dentry = child,
 + .mnt = mnt,
 + };
 +
 + return vfs_create(dir, child, mode, nd);
 +}
 +

Wouldn't it normally result in fewer instructions (on most architectures
... maybe not x86) to keep the same argument order as vfs_create?  ie:

static inline int nfsd_do_create(struct inode *dir, struct dentry *child,
 int mode, struct vfsmount *mnt)

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: REISER4 FOR INCLUSION IN THE LINUX KERNEL.

2007-04-09 Thread Matthew Wilcox
On Sun, Apr 08, 2007 at 09:16:59PM -0700, [EMAIL PROTECTED] wrote:
 REISER4 FOR INCLUSION IN THE LINUX KERNEL.

Fuck off.  Cheerleading like this only hurts your cause.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: forced umount?

2007-03-18 Thread Matthew Wilcox
On Sun, Mar 18, 2007 at 08:16:19PM +0100, Arjan van de Ven wrote:
 the problem with the people who say they want forced umount is.. that
 most of the time they either want
 1) get rid of the namespace entry
 or
 2) want to stop any and all IO to a certain device/partition 

There is a third component - they want to deliver a fatal signal to
all processes which are waiting on IO to that sb.  My scenario here is a
machine with an NFS mount of a server which has gone down.  Cronjobs which
scan the whole filesystem (eg updatedb) soon pile up sleeping on access.
Equally, if one has one's ogg collection stored on said NFS server, the
ogg player will be in uninterruptible sleep while holding the sound device
open, preventing other applications from making sounds.  It's desirable
to be able to kill these apps dead, and the usual suggestion of 'mount
it soft,intr' isn't the greatest idea (and somewhat hard to change after
the fact).

I remember Linus suggesting a sleeping state between UNINTERRUPTIBLE and
INTERRUPTIBLE which would be FATAL_SIGNALS_ONLY.  The usual problem (of
short reads) isn't a problem if the task is only going to die when it
receives them.  Has anyone investigated this in any detail?  Perhaps
I'll take a look at doing it next week.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH] sys_fallocate() system call

2007-03-17 Thread Matthew Wilcox
On Fri, Mar 16, 2007 at 05:17:04PM +0100, Heiko Carstens wrote:
  +asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)
 
 e.g.
 
 asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len, int mode)
 
 would work even on s390 ;)

How about:

asmlinkage long sys_fallocate(int fd, int mode, u32 off_low, u32 off_high,
u32 len_low, u32 len_high);

That way we all suffer equally ...
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Rename file over another with the same inode number fails silently.

2007-02-08 Thread Matthew Wilcox
On Thu, Feb 08, 2007 at 11:48:16AM -0500, John Muir wrote:
 The attached test program creates a file, and then some hard links to
 that file (file0 - fileN).
 The test program then attempts to rename(fileN, file) for every hard
 link created.
 
 My expectation is that the hard links file0 - fileN would simply
 disappear, or that rename would respond with an error result and an
 appropriate errno value indicating the problem.
 
 My observation is that the hard links file0 to fileN do not in fact
 disappear and rename returns 0.
 
 Do I have the wrong expectations? If so, why should I have to stat the
 files to determine if they are the same inode before I rename?

This is a POSIX requirement.  See

http://www.opengroup.org/onlinepubs/009695399/functions/rename.html

where it says,

If the old argument and the new argument resolve to the same existing
file, rename() shall return successfully and perform no other action.

There's not really much point in discussing whether your expectations
are wrong or whether the extra work you have to do is silly ... POSIX
says so, so we have to behave this way.

Personally, I think this is an unexpected oversight on POSIX's part.
The rationale refers to 'rename(x, x);', and doesn't discuss hard
links.  Possibly something to raise with the POSIX ctte, but I doubt
they'd be interested in changing this now ... perhaps if you can find
some other unices which behave differently, you might have a shot.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Symbolic links vs hard links

2007-01-10 Thread Matthew Wilcox
On Wed, Jan 10, 2007 at 09:38:11AM -0800, Bryan Henderson wrote:
 Other people are of the opinion that the invention of the symbolic link
 was a huge mistake.
 
 I guess I haven't heard that one.  What is the argument that we were 
 better off without symbolic links?

I suppose http://www.cs.bell-labs.com/sys/doc/lexnames.html is as good
a presentation of that argument as any, though there's also some good
stuff in the Unix Haters Handbook (pages 164-165 in particular).

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] ext2: conditional removal of NFSD code

2007-01-06 Thread Matthew Wilcox
On Sat, Jan 06, 2007 at 10:58:31PM +0300, Alexey Dobriyan wrote:
 Nor me nor my box is going to act as NFS server, so ifdef all
 exporting code.

 @@ -916,7 +918,9 @@ static int ext2_fill_super(struct super_
* set up enough so that it can read an inode
*/
   sb-s_op = ext2_sops;
 +#if defined(CONFIG_EXPORTFS) || defined(CONFIG_EXPORTFS_MODULE)
   sb-s_export_op = ext2_export_ops;
 +#endif

To avoid putting ifdefs within a function, how about adding:

#if defined(CONFIG_EXPORTFS) || defined(CONFIG_EXPORTFS_MODULE)
#define set_export_ops(sb, ops) sb-s_export_op = ops
#else
#define set_export_ops(sb, ops) 0
#endif

That way you can get rid of the function pointer from the struct
superblock too.

But Dave Woodhouse is going to kill you for adding another
CONFIG_*_MODULE dependency.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Finding hardlinks

2007-01-03 Thread Matthew Wilcox
On Wed, Jan 03, 2007 at 01:33:31PM +0100, Miklos Szeredi wrote:
 High probability is all you have.  Cosmic radiation hitting your
 computer will more likly cause problems, than colliding 64bit inode
 numbers ;)

Some of us have machines designed to cope with cosmic rays, and would be
unimpressed with a decrease in reliability.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: inode i_blksize problem

2006-12-20 Thread Matthew Wilcox
On Wed, Dec 20, 2006 at 01:26:56PM +0100, Sergio Paracuellos wrote:
 I am trying to compile a module for kernel 2.6.18-1 that uses the 'inode
 struct' but the compiler tell me inode struct hasn't a member called
 i_blksize. I don't have that problem in kernel 2.6.16. What happend
 with i_blksize?

git-log reveals:

commit ba52de123d454b57369f291348266d86f4b35070
Author: Theodore Ts'o [EMAIL PROTECTED]
Date:   Wed Sep 27 01:50:49 2006 -0700

[PATCH] inode-diet: Eliminate i_blksize from the inode structure

This eliminates the i_blksize field from struct inode.  Filesystems that wan
t
to provide a per-inode st_blksize can do so by providing their own getattr
routine instead of using the generic_fillattr() function.

Note that some filesystems were providing pretty much random (and incorrect)
values for i_blksize.

[EMAIL PROTECTED]: cleanup]
[EMAIL PROTECTED]: generic_fillattr() fix]
Signed-off-by: Theodore Ts'o [EMAIL PROTECTED]
Signed-off-by: Adrian Bunk [EMAIL PROTECTED]
Signed-off-by: Andrew Morton [EMAIL PROTECTED]
Signed-off-by: Linus Torvalds [EMAIL PROTECTED]

So, it's gone.  I'd just delete that line from your sources if I were you.
By the way, what kind of module is this?  Whatever it's doing looks
pretty dodgy to me.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: NFSv4/pNFS possible POSIX I/O API standards

2006-12-17 Thread Matthew Wilcox
On Sun, Dec 17, 2006 at 11:07:27AM -0800, Ulrich Drepper wrote:
 And how often do the scripts which are in everyday use require such a 
 command?  And the same for the other programs.

I know that the rsync load is a major factor on kernel.org right now.
With all the git trees (particularly the ones that people haven't packed
recently), there's a lot of files in a lot of directories.  If
readdirplus would help this situation, it would definitely have a real
world benefit.  Obviously, I haven't done any measurements or attempted
to quantify what the improvement would be.

For those not familiar with a git repo, it has an 'objects' directory with
256 directories named 00 to ff.  Each of those directories can contain
many files (with names like '8cd5bbfb4763322837cd1f7c621f02ebe22fef') Once
a file is written, it is never modified, so all rsync needs to do is be
able to compare the timestamps and sizes and notice they haven't changed.

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: openg and path_to_handle

2006-12-14 Thread Matthew Wilcox
On Thu, Dec 14, 2006 at 03:00:41PM -0600, Rob Ross wrote:
 I don't think that I understand what you're saying here. The openg() 
 call does not perform file open (not that that is necessarily even a 
 first-class FS operation), it simply does the lookup.
 
 When we were naming these calls, from a POSIX consistency perspective it 
 seemed best to keep the open nomenclature. That seems to be confusing 
 to some. Perhaps we should rename the function lookup or something 
 similar, to help keep from giving the wrong idea?
 
 There is a difference between the openg() and path_to_handle() approach 
 in that we do permission checking at openg(), and that does have 
 implications on how the handle might be stored and such. That's being 
 discussed in a separate thread.

I was just thinking about how one might implement this, when it struck
me ... how much more efficient is a kernel implementation compared to:

int openg(const char *path)
{
char *s;
do {
s = tempnam(FSROOT, .sutoc);
link(path, s);
} while (errno == EEXIST);

mpi_broadcast(s);
sleep(10);
unlink(s);
}

and sutoc() becomes simply open().  Now you have a name that's quick to
open (if a client has the filesystem mounted, it has a handle for the
root already), has a defined lifespan, has minimal permission checking,
and doesn't require standardisation.

I suppose some cluster fs' might not support cross-directory links
(AFS is one, I think), but then, no cluster fs's support openg/sutoc.
If a filesystem's willing to add support for these handles, it shouldn't
be too hard for them to treat files starting .sutoc specially, and as
efficiently as adding the openg/sutoc concept.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: openg and path_to_handle

2006-12-06 Thread Matthew Wilcox
On Thu, Dec 07, 2006 at 07:40:05AM +1100, David Chinner wrote:
 Permission checks are done on the path_to_handle(), so in reality
 only root or CAP_SYS_ADMIN users can currently use the
 open_by_handle interface because of this lack of checking. Given
 that our current users of this interface need root permissions to do
 other things (data migration), this has never been an issue.
 
 This is an implementation detail - it is possible that file handle,
 being opaque, could encode a UID/GID of the user that constructed
 the handle and then allow any process with the same UID/GID to use
 open_by_handle() on that handle. (I think hch has already pointed
 this out.)

While it could do that, I'd be interested to see how you'd construct
the handle such that it's immune to a malicious user tampering with it,
or saving it across a reboot, or constructing one from scratch.

I suspect any real answer to this would have to involve cryptographical
techniques (say, creating a secure hash of the information plus a
boot-time generated nonce).  Now you're starting to use a lot of bits,
and compute time, and you'll need to be sure to keep the nonce secret.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: openg and path_to_handle

2006-12-06 Thread Matthew Wilcox
On Wed, Dec 06, 2006 at 03:09:10PM -0700, Andreas Dilger wrote:
 Considering that filesystems like GFS and OCFS allow clients DIRECT
 ACCESS to the block device itself (which no amount of authentication
 will fix, unless it is in the disks themselves), the risk of passing a
 file handle around is pretty minimal.

That's either disingenuous, or missing the point.  OCFS/GFS allow the
kernel direct access to the block device.  openg()sutoc() are about
passing around file handles to untrusted users.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: NFSv4/pNFS possible POSIX I/O API standards

2006-12-05 Thread Matthew Wilcox
On Tue, Dec 05, 2006 at 10:07:48AM +, Christoph Hellwig wrote:
 The filehandle idiocy on the other hand is way of into crackpipe land.

Right, and it needs to be discarded.  Of course, there was a real
problem that it addressed, so we need to come up with an acceptable
alternative.

The scenario is a cluster-wide application doing simultaneous opens of
the same file.  So thousands of nodes all hitting the same DLM locks
(for read) all at once.  The openg() non-solution implies that all
nodes in the cluster share the same filehandle space, so I think a
reasonable solution can be implemented entirely within the clusterfs,
with an extra flag to open(), say O_CLUSTER_WIDE.  When the clusterfs
sees this flag set (in -lookup), it can treat it as a hint that this
pathname component is likely to be opened again on other nodes and
broadcast that fact to the other nodes within the cluster.  Other nodes
on seeing that hint (which could be structured as The child bin
of filehandle e62438630ca37539c8cc1553710bbfaa3cf960a7 has filehandle
ff51a98799931256b555446b2f5675db08de6229) can keep a record of that fact.
When they see their own open, they can populate the path to that file
without asking the server for extra metadata.

There's obviously security issues there (why I say 'hint' rather than
'command'), but there's also security problems with open-by-filehandle.
Note that this solution requires no syscall changes, no application
changes, and also helps a scenario where each node opens a different
file in the same directory.

I've never worked on a clusterfs, so there may be some gotchas (eg, how
do you invalidate the caches of nodes when you do a rename).  But this
has to be preferable to open-by-fh.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: NFSv4/pNFS possible POSIX I/O API standards

2006-11-29 Thread Matthew Wilcox
On Wed, Nov 29, 2006 at 09:04:50AM +, Christoph Hellwig wrote:
  - openg/sutoc
 
   No way.  We already have a very nice file descriptor abstraction.
   You can pass file descriptors over unix sockets just fine.

Yes, but it behaves like dup().  Gary replied to me off-list (which I
didn't notice and continued replying to him off-list).  I wrote:

Is this for people who don't know about dup(), or do they need
independent file offsets?  If the latter, I think an xdup() would be
preferable (would there be a security issue for OSes with revoke()?)
Either that, or make the key be useful for something else.

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: NFSv4/pNFS possible POSIX I/O API standards

2006-11-29 Thread Matthew Wilcox
On Wed, Nov 29, 2006 at 05:23:13AM -0700, Matthew Wilcox wrote:
 On Wed, Nov 29, 2006 at 09:04:50AM +, Christoph Hellwig wrote:
   - openg/sutoc
  
  No way.  We already have a very nice file descriptor abstraction.
  You can pass file descriptors over unix sockets just fine.
 
 Yes, but it behaves like dup().  Gary replied to me off-list (which I
 didn't notice and continued replying to him off-list).  I wrote:
 
 Is this for people who don't know about dup(), or do they need
 independent file offsets?  If the latter, I think an xdup() would be
 preferable (would there be a security issue for OSes with revoke()?)
 Either that, or make the key be useful for something else.

I further wonder if these people would see appreciable gains from doing
sutoc rather than doing openat(dirfd, basename, flags, mode);

If they did, I could also see openat being extended to allow dirfd to
be a file fd, as long as pathname were NULL or a pointer to NUL.

But with all the readx stuff being proposed, I bet they don't really
need independent file offsets.  That's, like, so *1970*s.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: NFSv4/pNFS possible POSIX I/O API standards

2006-11-28 Thread Matthew Wilcox
On Mon, Nov 27, 2006 at 09:34:05PM -0700, Gary Grider wrote:
 Things like
 openg()  - on process opens a file and gets a key that is passed to 
 lots of processes which
 use the key to get a handle (great for thousands of processes opening a 
 file)

I don't understand how this leads to a more efficient implementation.
It seem to just add complexity.  What does 'sutoc' mean anyway?

 readx/writex - scattergather readwrite - more appropriate and 
 complete than the real time extended read/write

These don't seem to be documented on the website.

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH] ensure i_ino uniqueness in filesystems without permanent inode numbers (via pointer conversion)

2006-11-17 Thread Matthew Wilcox
On Fri, Nov 17, 2006 at 08:43:00AM -0500, Jeff Layton wrote:
 2) this scheme would effectively leak inode addresses into userspace.
 I'm not sure if that would be exploitable, but it's probably best not to
 do it. The patch adds a static unsigned int that is initialized to a
 random value at boot time. We'll xor the inode offset with this value.
 That should allow for a unique i_ino value, but since the xor mask would
 be secret, it shouldn't be possible to turn it back into an address.
 There may be a more secure way to do this. I'm definitely open to
 suggestions here.

I *think* the xor mask is mere obfuscation.  It looks likely that you can
recover it with a little bit of trial and error.  If you can force the
filesystem to hand you back new inodes quickly such that there is a high
probability you get consecutive allocations, you'll get a sequence which
would be spaced 700-odd bytes apart, except that it's been xored.  Since
you know it's incrementing, if you see the sequence decrease, you'll
know that was a 1 in that bit.

It'd be a bit more complex than that, and cryptanalysis was never my
forte, but I suspect we should either use a folded hash like md5, or
give up and just divide the address by sizeof(struct inode).  Sure,
divides are slow, but this is a divide by a constant, so it shouldn't be
that bad.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Make journal_commit_transaction() more understandable

2005-08-29 Thread Matthew Wilcox

journal_commit_transaction() is still 650+ lines long and contains 16
local variables.  By moving phase 3 into its own function, we reduce
its length by 150+ lines and reduce it to 5 local variables.

Signed-off-by: Matthew Wilcox [EMAIL PROTECTED]

 commit.c |  367 +--
 1 files changed, 194 insertions(+), 173 deletions(-)

Index: linux-2.6/fs/jbd/commit.c
===
RCS file: /var/lib/cvs/linux-2.6/fs/jbd/commit.c,v
retrieving revision 1.14
diff -u -p -r1.14 commit.c
--- linux-2.6/fs/jbd/commit.c   4 Apr 2005 20:47:15 -   1.14
+++ linux-2.6/fs/jbd/commit.c   29 Aug 2005 15:59:02 -
@@ -93,6 +93,189 @@ static int inverted_lock(journal_t *jour
return 1;
 }
 
+/*
+ * Way to go: we have now written out all of the data for a transaction!
+ * Now comes the tricky part: we need to write out metadata.  Loop over the
+ * transaction's entire buffer list:
+ */
+static int journal_write_metadata(journal_t *journal,
+ transaction_t *commit_transaction)
+{
+   char *tagp = NULL;
+   struct journal_head *descriptor = NULL;
+   int flags;
+   unsigned long blocknr;
+   journal_header_t *header;
+   journal_block_tag_t *tag = NULL;
+   int space_left = 0;
+   int first_tag = 0;
+   int tag_flag;
+   int i;
+   int err = 0;
+   int bufs = 0;
+
+   struct buffer_head **wbuf = journal-j_wbuf;
+
+   commit_transaction-t_state = T_COMMIT;
+
+   while (commit_transaction-t_buffers) {
+
+   /* Find the next buffer to be journaled... */
+
+   struct journal_head *jh, *new_jh;
+   jh = commit_transaction-t_buffers;
+
+   /* If we're in abort mode, we just un-journal the buffer and
+  release it for background writing. */
+
+   if (is_journal_aborted(journal)) {
+   JBUFFER_TRACE(jh, journal is aborting: refile);
+   journal_refile_buffer(journal, jh);
+   /* If that was the last one, we need to clean up
+* any descriptor buffers which may have been
+* already allocated, even if we are now
+* aborting. */
+   if (!commit_transaction-t_buffers)
+   goto start_journal_io;
+   continue;
+   }
+
+   /* Make sure we have a descriptor block in which to
+  record the metadata buffer. */
+
+   if (!descriptor) {
+   struct buffer_head *bh;
+
+   J_ASSERT (bufs == 0);
+
+   jbd_debug(4, JBD: get descriptor\n);
+
+   descriptor = journal_get_descriptor_buffer(journal);
+   if (!descriptor) {
+   __journal_abort_hard(journal);
+   continue;
+   }
+
+   bh = jh2bh(descriptor);
+   jbd_debug(4, JBD: got buffer %llu (%p)\n,
+   (unsigned long long)bh-b_blocknr, bh-b_data);
+   header = (journal_header_t *)bh-b_data[0];
+   header-h_magic = cpu_to_be32(JFS_MAGIC_NUMBER);
+   header-h_blocktype = cpu_to_be32(JFS_DESCRIPTOR_BLOCK);
+   header-h_sequence  = 
cpu_to_be32(commit_transaction-t_tid);
+
+   tagp = bh-b_data[sizeof(journal_header_t)];
+   space_left = bh-b_size - sizeof(journal_header_t);
+   first_tag = 1;
+   set_buffer_jwrite(bh);
+   set_buffer_dirty(bh);
+   wbuf[bufs++] = bh;
+
+   /* Record it so that we can wait for IO
+   completion later */
+   BUFFER_TRACE(bh, ph3: file as descriptor);
+   journal_file_buffer(descriptor, commit_transaction,
+   BJ_LogCtl);
+   }
+
+   /* Where is the buffer to be written? */
+
+   err = journal_next_log_block(journal, blocknr);
+   /* If the block mapping failed, just abandon the buffer
+  and repeat this loop: we'll fall into the
+  refile-on-abort condition above. */
+   if (err) {
+   __journal_abort_hard(journal);
+   continue;
+   }
+
+   /*
+* start_this_handle() uses t_outstanding_credits to determine
+* the free space in the log, but this counter is changed
+* by journal_next_log_block() also.
+*/
+   commit_transaction-t_outstanding_credits

Re: Kernel bug: Bad page state: related to generic symlink code and mmap

2005-08-19 Thread Matthew Wilcox
On Fri, Aug 19, 2005 at 08:38:34PM +0100, Al Viro wrote:
 FWIW, I'd rather take page_symlink(), page_symlink_inode_operations,
 page_put_link(), page_follow_link_light(), page_readlink(), page_getlink(),
 generic_readlink() and vfs_readlink() to the same place where these guys
 would live.  They all belong together and none of them has any business
 in fs/namei.c.  Options: fs/libfs.c or separate library since fs/libfs.c
 is getting crowded.  Linus, do you have any objections to that or suggestions
 on filename here?

fs/symlink.c?

-- 
Next the statesmen will invent cheap lies, putting the blame upon 
the nation that is attacked, and every man will be glad of those
conscience-soothing falsities, and will diligently study them, and refuse
to examine any refutations of them; and thus he will by and by convince 
himself that the war is just, and will thank God for the better sleep 
he enjoys after this process of grotesque self-deception. -- Mark Twain
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] fs/fcntl.c : don't test unsigned value for less than zero

2005-04-15 Thread Matthew Wilcox
On Fri, Apr 15, 2005 at 10:03:05PM +1000, Herbert Xu wrote:
 I suppose it could be smart and stay quiet about
 
 val  0 || val  BOUND
 
 However, gcc is slow enough as it is without adding unnecessary
 smarts like this.

It only warns with -W on, not with -Wall, so I see no compelling
reason to fix this.  I think the real problem here is that 'arg'
is declared 2 pages earlier in the function prototype (aka the
function-growth-hormone-imbalance syndrome).

There's two good ways of fixing this, adding a f_setsig() function:

static inline int f_setsig(struct file *filp, unsigned long arg)
{
if (arg  _NSIG)
return -EINVAL;

filp-f_owner.signum = arg;
return 0;
}
...
case F_SETSIG:
err = f_setsig(filp, arg);
break;

or add a function that checks a variable to see if it's a valid signal number:

#define valid_signal(arg)   ((unsigned long)arg = _NSIG)
...
case F_SETSIG:
if (!valid_signal(arg))
break;
err = 0;
filp-f_owner.signum = arg;
break;

Looks like futex.c, ptrace.c, signal.c, sys.c and almost every
architecture's ptrace code could easily make use of the latter, but not
the former.  It also looks like we have a few off-by-one errors.  For
example, in h8300's ptrace code:

case PTRACE_SYSCALL:
case PTRACE_CONT: {
ret = -EIO;
if ((unsigned long) data = _NSIG)
break ;

but

case PTRACE_SINGLESTEP: {
ret = -EIO;
if ((unsigned long) data  _NSIG)
break;

so I'd recommend the second solution.  But be careful not to fix up
cases like:

./kernel/exit.c:if (sig  1 || sig  _NSIG)

where we really don't want to allow zero.

-- 
Next the statesmen will invent cheap lies, putting the blame upon 
the nation that is attacked, and every man will be glad of those
conscience-soothing falsities, and will diligently study them, and refuse
to examine any refutations of them; and thus he will by and by convince 
himself that the war is just, and will thank God for the better sleep 
he enjoys after this process of grotesque self-deception. -- Mark Twain
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] fs/fcntl.c : don't test unsigned value for less than zero

2005-04-14 Thread Matthew Wilcox
On Fri, Apr 15, 2005 at 03:07:42AM +0200, Jesper Juhl wrote:
 'arg' is unsigned so it can never be less than zero, so testing for that 
 is pointless and also generates a warning when building with gcc -W. This 
 patch eliminates the pointless check.

Didn't Linus already reject this one 6 months ago?

-- 
Next the statesmen will invent cheap lies, putting the blame upon 
the nation that is attacked, and every man will be glad of those
conscience-soothing falsities, and will diligently study them, and refuse
to examine any refutations of them; and thus he will by and by convince 
himself that the war is just, and will thank God for the better sleep 
he enjoys after this process of grotesque self-deception. -- Mark Twain
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: mmap question

2005-03-21 Thread Matthew Wilcox
On Mon, Mar 21, 2005 at 01:33:55PM -0800, Bryan Henderson wrote:
 It looks to me like you're running into the fundamental limitation that 
 the CPU doesn't notify Linux every time you store into a memory location. 
 It does, though, set the dirty flag in the page table, and Linux 
 eventually inspects that flag and finds out that you have stored in the 
 past.  At that time, it can call set_page_dirty.

well, we *could* know ... never map this page writable.  have a per-vma
flag that says emulate writes, and call the filesystem to update
backing storage before returning to the application.

the price of doing this would be exorbitant.

-- 
Next the statesmen will invent cheap lies, putting the blame upon 
the nation that is attacked, and every man will be glad of those
conscience-soothing falsities, and will diligently study them, and refuse
to examine any refutations of them; and thus he will by and by convince 
himself that the war is just, and will thank God for the better sleep 
he enjoys after this process of grotesque self-deception. -- Mark Twain
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Efficient handling of sparse files

2005-02-28 Thread Matthew Wilcox

This problem came up with the systemimager program which uses rsync to
install files from a master server to many clients.  Red Hat has a system
user with uid 2^32-1 which causes lastlog to grow to 1.2GB in size.
rsync does understand the concept of sparse files (with the -S flag), but
it has to read every block to discover that it is indeed empty.  This sucks.

I was wondering if we could introduce a new system call (or ioctl?) that,
given an fd would find the next block with data in it.  We could use the
-bmap method ... except that has dire warnings about adding new callers
and viro may soon be in testicle-gouging range.

One system interface hack would be to introduce lseek(fd, 0, SEEK_DATA)
... but without permission to reuse -bmap for this purpose, it's
pointless to discuss user interfaces.

Suggestions?

-- 
Next the statesmen will invent cheap lies, putting the blame upon 
the nation that is attacked, and every man will be glad of those
conscience-soothing falsities, and will diligently study them, and refuse
to examine any refutations of them; and thus he will by and by convince 
himself that the war is just, and will thank God for the better sleep 
he enjoys after this process of grotesque self-deception. -- Mark Twain
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Help!

2005-02-04 Thread Matthew Wilcox
On Fri, Feb 04, 2005 at 03:58:04PM +0530, Anirban Mukherjee wrote:
 I am doing a project on Ext2fs and Ext3fs and I require some materials on
 the physical structure of Ext2fs.It would be very helpful if someone sends
 me some information(or links where to find it) along with some general
 stuff on Ext2fs and Ext3fs.

Please do my homework for me

Did you even look in Documentation/filesystems/ext{2,3}.txt?

-- 
Next the statesmen will invent cheap lies, putting the blame upon 
the nation that is attacked, and every man will be glad of those
conscience-soothing falsities, and will diligently study them, and refuse
to examine any refutations of them; and thus he will by and by convince 
himself that the war is just, and will thank God for the better sleep 
he enjoys after this process of grotesque self-deception. -- Mark Twain
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


NFSD needs EXPORTFS

2005-02-03 Thread Matthew Wilcox

Got this report about 2.6.11-rc3.  Is this the correct solution?

- Forwarded message from Joel Soete [EMAIL PROTECTED] -

A short analyse, it seems that's because NFSD was builtin while EXPORTFS
was a module in my previous config file. Imho EXPORTFS would be build as
NFSD?

Is the following hunk would do the trick:
--- fs/Kconfig.Orig 2005-02-03 16:45:13.562275206 +0100
+++ fs/Kconfig  2005-02-03 16:46:36.496469111 +0100
@@ -1400,6 +1400,7 @@
tristate NFS server support
depends on INET
select LOCKD
+   select EXPORTFS
select SUNRPC
help
  If you want your Linux box to act as an NFS *server*, so that other


Thanks in advance,
Joel

- End forwarded message -

-- 
Next the statesmen will invent cheap lies, putting the blame upon 
the nation that is attacked, and every man will be glad of those
conscience-soothing falsities, and will diligently study them, and refuse
to examine any refutations of them; and thus he will by and by convince 
himself that the war is just, and will thank God for the better sleep 
he enjoys after this process of grotesque self-deception. -- Mark Twain
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RFC: [PATCH-2.6] Add helper function to lock multiple page cache pages.

2005-02-02 Thread Matthew Wilcox
On Wed, Feb 02, 2005 at 03:12:50PM +, Anton Altaparmakov wrote:

I think the below loop would be clearer as a for loop ...

err = 0;
for (nr = 0; nr  nr_pages; nr++, start++) {
if (start == lp_idx) {
pages[nr] = locked_page;
if (!nr)
continue;
lock_page(locked_page);
if (!wbc)
continue;
if (wbc-for_reclaim) {
up(inode-i_sem);
up_read(inode-i_sb-s_umount);
}
/* Was the page truncated under us? */
if (page_mapping(locked_page) != mapping) {
err = -ESTALE;
goto err_out_locked;
}
} else {
pages[nr] = find_lock_page(mapping, start);
if (pages[nr])
continue;
if (!cached_page) {
cached_page = alloc_page(gfp_mask);
if (unlikely(!cached_page))
goto err_out;
}
err = add_to_page_cache_lru(cached_page,
mapping, start, gfp_mask);
if (unlikely(err)) {
if (err == -EEXIST)
continue;
goto err_out;
}
pages[nr] = cached_page;
cached_page = NULL;
}
}

The above fixes two bugs in the below:
 - if (!unlikely(cached_page)) should be if (unlikely(!cached_page))
 - The -EEXIST case after add_to_page_cache_lru() would result in
   an infinite loop in the original as nr wasn't being incremented.

 + err = nr = 0;
 + while (nr  nr_pages) {
 + if (start == lp_idx) {
 + pages[nr] = locked_page;
 + if (nr) {
 + lock_page(locked_page);
 + if (wbc) {
 + if (wbc-for_reclaim) {
 + up(inode-i_sem);
 + up_read(inode-i_sb-s_umount);
 + }
 + /* Was the page truncated under us? */
 + if (page_mapping(locked_page) !=
 + mapping) {
 + err = -ESTALE;
 + goto err_out_locked;
 + }
 + }
 + }
 + } else {
 + pages[nr] = find_lock_page(mapping, start);
 + if (!pages[nr]) {
 + if (!cached_page) {
 + cached_page = alloc_page(gfp_mask);
 + if (!unlikely(cached_page))
 + goto err_out;
 + }
 + err = add_to_page_cache_lru(cached_page,
 + mapping, start, gfp_mask);
 + if (unlikely(err)) {
 + if (err == -EEXIST)
 + continue;
 + goto err_out;
 + }
 + pages[nr] = cached_page;
 + cached_page = NULL;
 + }
 + }
 + nr++;
 + start++;
 + }

-- 
Next the statesmen will invent cheap lies, putting the blame upon 
the nation that is attacked, and every man will be glad of those
conscience-soothing falsities, and will diligently study them, and refuse
to examine any refutations of them; and thus he will by and by convince 
himself that the war is just, and will thank God for the better sleep 
he enjoys after this process of grotesque self-deception. -- Mark Twain
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Minor ext2 speedup

2005-01-24 Thread Matthew Wilcox

Port Andreas Dilger's and Jan Kara's patch for ext3 to ext2.
Also some whitespace changes to get ext2/ext3 closer in sync.

Signed-off-by: Matthew Wilcox [EMAIL PROTECTED]

Index: fs/ext2/balloc.c
===
RCS file: /var/cvs/linux-2.6/fs/ext2/balloc.c,v
retrieving revision 1.4
diff -u -p -r1.4 balloc.c
--- fs/ext2/balloc.c13 Sep 2004 15:23:44 -  1.4
+++ fs/ext2/balloc.c24 Jan 2005 22:10:01 -
@@ -6,7 +6,7 @@
  * Laboratoire MASI - Institut Blaise Pascal
  * Universite Pierre et Marie Curie (Paris VI)
  *
- *  Enhanced block allocation by Stephen Tweedie ([EMAIL PROTECTED]), 1993
+ *  Enhanced block allocation by Stephen Tweedie ([EMAIL PROTECTED]), 1993
  *  Big-endian to little-endian byte-swapping/bitmaps by
  *David S. Miller ([EMAIL PROTECTED]), 1995
  */
@@ -52,9 +52,9 @@ struct ext2_group_desc * ext2_get_group_
 
return NULL;
}
-   
-   group_desc = block_group / EXT2_DESC_PER_BLOCK(sb);
-   offset = block_group % EXT2_DESC_PER_BLOCK(sb);
+
+   group_desc = block_group  EXT2_DESC_PER_BLOCK_BITS(sb);
+   offset = block_group  (EXT2_DESC_PER_BLOCK(sb) - 1);
if (!sbi-s_group_desc[group_desc]) {
ext2_error (sb, ext2_get_group_desc,
Group descriptor not loaded - 
@@ -62,7 +62,7 @@ struct ext2_group_desc * ext2_get_group_
 block_group, group_desc, offset);
return NULL;
}
-   
+
desc = (struct ext2_group_desc *) sbi-s_group_desc[group_desc]-b_data;
if (bh)
*bh = sbi-s_group_desc[group_desc];
@@ -236,12 +236,12 @@ do_more:
 
for (i = 0, group_freed = 0; i  count; i++) {
if (!ext2_clear_bit_atomic(sb_bgl_lock(sbi, block_group),
-   bit + i, (void *) bitmap_bh-b_data))
-   ext2_error (sb, ext2_free_blocks,
- bit already cleared for block %lu,
- block + i);
-   else
+   bit + i, bitmap_bh-b_data)) {
+   ext2_error(sb, __FUNCTION__,
+   bit already cleared for block %lu, block + i);
+   } else {
group_freed++;
+   }
}
 
mark_buffer_dirty(bitmap_bh);
@@ -569,25 +569,24 @@ unsigned long ext2_count_free_blocks (st
 static inline int
 block_in_use(unsigned long block, struct super_block *sb, unsigned char *map)
 {
-   return ext2_test_bit ((block - 
le32_to_cpu(EXT2_SB(sb)-s_es-s_first_data_block)) %
+   return ext2_test_bit ((block -
+   le32_to_cpu(EXT2_SB(sb)-s_es-s_first_data_block)) %
 EXT2_BLOCKS_PER_GROUP(sb), map);
 }
 
 static inline int test_root(int a, int b)
 {
-   if (a == 0)
-   return 1;
-   while (1) {
-   if (a == 1)
-   return 1;
-   if (a % b)
-   return 0;
-   a = a / b;
-   }
+   int num = b;
+
+   while (a  num)
+   num *= b;
+   return num == a;
 }
 
 static int ext2_group_sparse(int group)
 {
+   if (group = 1)
+   return 1;
return (test_root(group, 3) || test_root(group, 5) ||
test_root(group, 7));
 }

-- 
Next the statesmen will invent cheap lies, putting the blame upon 
the nation that is attacked, and every man will be glad of those
conscience-soothing falsities, and will diligently study them, and refuse
to examine any refutations of them; and thus he will by and by convince 
himself that the war is just, and will thank God for the better sleep 
he enjoys after this process of grotesque self-deception. -- Mark Twain
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Minor ext3 speedup

2005-01-15 Thread Matthew Wilcox
On Thu, Jan 13, 2005 at 01:15:06PM +0100, Jan Kara wrote:
   Attached patch removes unnecessary division and modulo from ext3 code
 paths. It reduces (according to oprofile) the CPU usage measurably under
 a dbench load (see description of the patch for the numbers).

I thought I'd apply Jan  Andreas's patch to ext2 too.  While doing so,
I noticed several places where patches had been applied to ext2 and not to
ext3.  One of them was even a bugfix.  This patch brings ext2/balloc.c
and ext3/balloc.c closer together and fixes the bug.  It includes the
aforementioned patch for both ext2 and ext3.

For the curious, the bug occurs when ext3_free_blocks_sb() decides to
do_more.  *pdquot_freed_blocks was updated each time around the loop,
so bg_free_blocks_count was getting over-incremented.  I fixed it the
same way it had been fixed in ext2 -- by introducing a group_freed
variable that is reset each time around the loop.

Index: linux-2.6/fs/ext2/balloc.c
===
RCS file: /var/cvs/linux-2.6/fs/ext2/balloc.c,v
retrieving revision 1.4
diff -u -p -r1.4 balloc.c
--- linux-2.6/fs/ext2/balloc.c  13 Sep 2004 15:23:44 -  1.4
+++ linux-2.6/fs/ext2/balloc.c  15 Jan 2005 16:45:10 -
@@ -6,7 +6,7 @@
  * Laboratoire MASI - Institut Blaise Pascal
  * Universite Pierre et Marie Curie (Paris VI)
  *
- *  Enhanced block allocation by Stephen Tweedie ([EMAIL PROTECTED]), 1993
+ *  Enhanced block allocation by Stephen Tweedie ([EMAIL PROTECTED]), 1993
  *  Big-endian to little-endian byte-swapping/bitmaps by
  *David S. Miller ([EMAIL PROTECTED]), 1995
  */
@@ -52,9 +52,9 @@ struct ext2_group_desc * ext2_get_group_
 
return NULL;
}
-   
-   group_desc = block_group / EXT2_DESC_PER_BLOCK(sb);
-   offset = block_group % EXT2_DESC_PER_BLOCK(sb);
+
+   group_desc = block_group  EXT2_DESC_PER_BLOCK_BITS(sb);
+   offset = block_group  (EXT2_DESC_PER_BLOCK(sb) - 1);
if (!sbi-s_group_desc[group_desc]) {
ext2_error (sb, ext2_get_group_desc,
Group descriptor not loaded - 
@@ -62,7 +62,7 @@ struct ext2_group_desc * ext2_get_group_
 block_group, group_desc, offset);
return NULL;
}
-   
+
desc = (struct ext2_group_desc *) sbi-s_group_desc[group_desc]-b_data;
if (bh)
*bh = sbi-s_group_desc[group_desc];
@@ -236,12 +236,12 @@ do_more:
 
for (i = 0, group_freed = 0; i  count; i++) {
if (!ext2_clear_bit_atomic(sb_bgl_lock(sbi, block_group),
-   bit + i, (void *) bitmap_bh-b_data))
-   ext2_error (sb, ext2_free_blocks,
- bit already cleared for block %lu,
- block + i);
-   else
+   bit + i, bitmap_bh-b_data)) {
+   ext2_error(sb, __FUNCTION__,
+   bit already cleared for block %lu, block + i);
+   } else {
group_freed++;
+   }
}
 
mark_buffer_dirty(bitmap_bh);
@@ -569,25 +569,24 @@ unsigned long ext2_count_free_blocks (st
 static inline int
 block_in_use(unsigned long block, struct super_block *sb, unsigned char *map)
 {
-   return ext2_test_bit ((block - 
le32_to_cpu(EXT2_SB(sb)-s_es-s_first_data_block)) %
+   return ext2_test_bit ((block -
+   le32_to_cpu(EXT2_SB(sb)-s_es-s_first_data_block)) %
 EXT2_BLOCKS_PER_GROUP(sb), map);
 }
 
 static inline int test_root(int a, int b)
 {
-   if (a == 0)
-   return 1;
-   while (1) {
-   if (a == 1)
-   return 1;
-   if (a % b)
-   return 0;
-   a = a / b;
-   }
+   int num = b;
+
+   while (a  num)
+   num *= b;
+   return num == a;
 }
 
 static int ext2_group_sparse(int group)
 {
+   if (group = 1)
+   return 1;
return (test_root(group, 3) || test_root(group, 5) ||
test_root(group, 7));
 }
Index: linux-2.6/fs/ext3/balloc.c
===
RCS file: /var/cvs/linux-2.6/fs/ext3/balloc.c,v
retrieving revision 1.9
diff -u -p -r1.9 balloc.c
--- linux-2.6/fs/ext3/balloc.c  12 Jan 2005 20:17:23 -  1.9
+++ linux-2.6/fs/ext3/balloc.c  15 Jan 2005 16:45:10 -
@@ -43,34 +43,34 @@ struct ext3_group_desc * ext3_get_group_
 struct buffer_head ** bh)
 {
unsigned long group_desc;
-   unsigned long desc;
-   struct ext3_group_desc * gdp;
+   unsigned long offset;
+   struct ext3_group_desc * desc;
+   struct ext3_sb_info *sbi = EXT3_SB(sb);
 
-   if (block_group = EXT3_SB(sb)-s_groups_count) {
+   if 

Re: [Final call for testers][PATCH] superblock handling changes (2.4.6-pre3)

2001-06-15 Thread Matthew Wilcox

On Fri, Jun 15, 2001 at 01:10:00AM -0400, Alexander Viro wrote:
 +static inline void write_super(struct super_block *sb)
 +{
 + lock_super(sb);
 + if (sb-s_root  sb-s_dirt)

When I first looked at this, I thought it was a typo.  I don't think we
should have s_dirty and s_dirt as fields of the superblock.  how about
s_dirty_inodes and s_isdirty, respectively?

 +restart:
 + spin_lock(sb_lock);
 + sb = sb_entry(super_blocks.next);
 + while (sb != sb_entry(super_blocks))
 + if (sb-s_dirt) {
 + sb-s_count++;
 + spin_unlock(sb_lock);
 + down_read(sb-s_umount);
 + write_super(sb);
 + drop_super(sb);
 + goto restart;
 + } else
 + sb = sb_entry(sb-s_list.next);
 + spin_unlock(sb_lock);

I think this could be clearer.

struct list_head *tmp;
restart:
spin_lock(sb_lock);
list_for_each(tmp, super_blocks) {
struct super_block *sb = sb_entry(tmp);
if (!sb-s_dirt)
continue;
spin_unlock(sb_lock);
down_read(sb-s_umount);
write_super(sb);
drop_super(sb);
goto restart;
}
spin_unlock(sb_lock);

 @@ -773,16 +810,16 @@
  void *data, int silent)
  {
   struct super_block * s;
 - s = get_empty_super();
 + s = alloc_super();
   if (!s)
   goto out;
   s-s_dev = dev;
   s-s_bdev = bdev;
   s-s_flags = flags;
 - s-s_dirt = 0;
   s-s_type = type;
 - s-s_dquot.flags = 0;
 - s-s_maxbytes = MAX_NON_LFS;
 + spin_lock(sb_lock);
 + list_add (s-s_list, super_blocks.prev);

I'd use list_add_tail(s-s_list, super_blocks);

 - if (mnt-mnt_instances.next != mnt-mnt_instances.prev) {
 + if (atomic_read(sb-s_active)  1) {

I'm happy to see that line disappear.  It was mightily confusing.

-- 
Revolutions do not require corporate support.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]



Re: about BKL in VFS

2001-06-08 Thread Matthew Wilcox

On Fri, Jun 08, 2001 at 02:04:16PM -0400, Alexander Viro wrote:
 the only
 areas in VFS that still rely on BKL are locks.c, dquot.c and super.c. The
 latter is fixed in the patches I'm feeding to Linus (OK, the pieces I'm
 feeding to Linus make shifting the BKL down to method calls trivial). Other
 two...  Well, ask the poor guys who maintain them ;-)

the file locking code always takes the BKL itself, and doesn't rely on
the rest of the VFS to take the lock for it.  it's not too much trouble
to remove it, and this will happen as part of the 2.5 rewrite.  which is
in progress.

annoyingly, there's very little common semantics between POSIX locks and
FLOCK locks, which means that fs/locks.c is basically implementing two
completely independent things through one interface.  and then there's
the other 3-4 different types of lock bleuch.

-- 
Revolutions do not require corporate support.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]



Re: [RFD w/info-PATCH] device arguments from lookup, partion code

2001-05-22 Thread Matthew Wilcox

On Tue, May 22, 2001 at 04:31:37PM +0100, Alan Cox wrote:
  `the class of devices in question' was cryptographic devices, and possibly
  other transactional DSPs.  I don't consider audio to be transactional.
  in any case, you can do transactional things with two threads, as long
  as they each have their own fd on the device.  Think of the fd as your
  transaction handle.
 
 Thats a bit pathetic. So I have to fill my app with expensive pthread locks
 or hack all the drivers and totally change the multi-open sematics in the ABI

huh?

void thread_init(void) {
int fd = open(/dev/crypto);
real_thread_init(fd);
}

where was that lock again?

and notice this idea is only for transactional things -- what
transactional things do sound drivers do?

-- 
Revolutions do not require corporate support.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]



Re: [RFD w/info-PATCH] device arguments from lookup, partion code

2001-05-21 Thread Matthew Wilcox

On Mon, May 21, 2001 at 06:13:18PM -0700, Linus Torvalds wrote:
 Nope. You can (and people do, quite often) share filps. So you can't
 associate state with it.

For _devices_, though?  I don't expect my mouse to work if gpm and xfree
both try to consume device events from the same filp.  Heck, it doesn't
even work when they try to consume events from the same inode :-)  I think
this is a reasonable restriction for the class of devices in question.

-- 
Revolutions do not require corporate support.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]



Re: [RFD w/info-PATCH] device arguments from lookup, partion code

2001-05-21 Thread Matthew Wilcox

On Tue, May 22, 2001 at 02:22:34AM +0200, Ingo Oeser wrote:
 ioctl has actually 4 semantics:
 
 command only
 command + read
 command + write
 command + rw-transaction
 
 Separating these would be a first step. And yes, I consider each
 of them useful.
 
 command only: reset drive

echo 'reset' /dev/sg0ctl

 command + rw-transaction: dear device please mangle this data
(crypto processors come to mind...)

I can't think of a reasonable tool-based approach to this, but I can
definitely see that a program could use this well.  It simply requires
that you use the filp to store your state.

fd = open(/dev/crypto) - creates filp
write(fd, Death to all fanatics!\n); - calls crypto device, stores result in
private data structure
sleep(100);
read(fd, Qrngu gb nyy snangvpf!\n); - frees data structure

[You'll note the advanced design of my crypto processor.]

Clearly, this is open to abuse by persons never calling read() and passing in
far too much to write().  I think this can be alleviated by refusing to accept more 
than (say) 4k at a time, or bean-counter.

A sick way would be to allow the -write() call to have its buffer
modified.  But I don't think we want to go down that path.

-- 
Revolutions do not require corporate support.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]



Re: [RFD w/info-PATCH] device arguments from lookup, partion code

2001-05-20 Thread Matthew Wilcox

On Sun, May 20, 2001 at 03:11:53PM -0400, Alexander Viro wrote:
 Pheeew... Could you spell about megabyte of stuff in ioctl.c?

No.

$ ls -l arch/*/kernel/ioctl32*.c
-rw-r--r--1 willywilly   22479 Jan 24 16:59 arch/mips64/kernel/ioctl32.c
-rw-r--r--1 willywilly  109475 May 18 16:39 arch/parisc/kernel/ioctl32.c
-rw-r--r--1 willywilly  117605 Feb  1 20:35 arch/sparc64/kernel/ioctl32.c

only about 100k.

-- 
Revolutions do not require corporate support.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]



Re: [RFD w/info-PATCH] device arguments from lookup, partion code

2001-05-19 Thread Matthew Wilcox

On Sat, May 19, 2001 at 12:51:23PM -0600, Richard Gooch wrote:
 Al, if you really want to kill ioctl(2), then perhaps you should
 implement a transaction(2) syscall. Something like:
 int transaction (int fd, void *rbuf, size_t rlen,
void *wbuf, size_t wlen);
 
 Of course, there wouldn't be any practical gain, since we already have
 ioctl(2). Any gain would be aesthetic.

I can tell you haven't had to write any 32-bit ioctl emulation code for
a 64-bit kernel recently.

-- 
Revolutions do not require corporate support.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]



Re: [RFD w/info-PATCH] device arguments from lookup, partion code

2001-05-19 Thread Matthew Wilcox

On Sat, May 19, 2001 at 10:22:55PM -0400, Richard Gooch wrote:
 The transaction(2) syscall can be just as easily abused as ioctl(2) in
 this respect.

But read() and write() cannot.

-- 
Revolutions do not require corporate support.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]



Re: Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH] device arguments from lookup)

2001-05-19 Thread Matthew Wilcox

On Sat, May 19, 2001 at 12:51:07PM -0400, Alexander Viro wrote:
 clone(), walk(), clunk(), stat() and open() ;-) Basically, we can add
 unopened descriptors. I.e. no IO until you open it (turning the thing into
 opened one), but we can do lookups (move to child), we can clone and
 kill them and we can stat them.

Those who would like a more detailed explanation can find one at
http://plan9.bell-labs.com/sys/man/5/INDEX.html

-- 
Revolutions do not require corporate support.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]



Re: [RFD w/info-PATCH] device arguments from lookup, partion code

2001-05-19 Thread Matthew Wilcox

On Sat, May 19, 2001 at 05:25:22PM +0100, Alan Cox wrote:
 Only to an English speaker. I suspect Quebec City canadians would prefer a
 different command set.

Should we support `pas387' as well as `no387' as a kernel boot parameter
then?  Face it, a sysadmin has to know the limited subset of english
which is used to configure a kernel.

-- 
Revolutions do not require corporate support.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]



F_SETLK_NP

2001-05-06 Thread Matthew Wilcox


Here's an initial implementation of F_SETLK_NP as previously discussed
on linux-fsdevel.  The _ONLY_ difference between these locks and POSIX
locks is that they are not removed on a close(dup(fd)).

I have a question about the interaction between POSIX and RANGE locks.
If they overlap, should they be merged?  And if so, what type should
the resulting lock have?  My initial thought is that they should _not_
be merged but _should_ conflict, which is going to require a slightly
bigger patch.  This patch merges them and gives the merged lock the type
of the lock which arrived first.

Anyway, here's the test program (I checked what was going on via
/proc/locks, you can too).

/* lock-test.c Copyright (c) Matthew Wilcox, Hewlett-Packard.  GPL License */

#define _GNU_SOURCE

#include fcntl.h
#include stdio.h
#include unistd.h

#define F_SETLK_NP 15

int main(int argc, char **argv) {
struct flock64 lock;
int fd;

lock.l_type = F_RDLCK;
lock.l_whence = 0;
lock.l_start = 50;
lock.l_len = 100;

fd = open(argv[0], O_RDONLY);
fcntl(fd, F_SETLK_NP, lock);
printf(Applied lock to file -- check and press the `Enter' key\n);
getchar();
close(dup(fd));
printf(Closed dup -- check and press the `Enter' key\n);
getchar();
close(fd);
printf(Closed original -- check and press the `Enter' key\n);
getchar();
return 0;
}

And now for the patch itself.  It was generated against 2.4.5-pre1 but
will probably apply against any 2.4.x kernel.  I've taken the liberty
of getting rid of some old cruft which wasn't used any more.  This patch is
i386 only -- just add constants to other arch include files as required.

diff -urNX dontdiff linux-245p1/fs/fcntl.c linux-245p1-vfs/fs/fcntl.c
--- linux-245p1/fs/fcntl.c  Sun May  6 09:16:20 2001
+++ linux-245p1-vfs/fs/fcntl.c  Sun May  6 05:52:35 2001
@@ -260,6 +260,10 @@
case F_SETLKW:
err = fcntl_setlk(fd, cmd, (struct flock *) arg);
break;
+   case F_SETLK_NP:
+   case F_SETLKW_NP:
+   err = fcntl_setlk_np(fd, cmd, (struct flock64 *) arg);
+   break;
case F_GETOWN:
/*
 * XXX If f_owner is a process group, the
diff -urNX dontdiff linux-245p1/fs/locks.c linux-245p1-vfs/fs/locks.c
--- linux-245p1/fs/locks.c  Sun May  6 09:14:08 2001
+++ linux-245p1-vfs/fs/locks.c  Sun May  6 05:38:12 2001
@@ -547,7 +547,7 @@
/* POSIX locks owned by the same process do not conflict with
 * each other.
 */
-   if (!(sys_fl-fl_flags  FL_POSIX) ||
+   if (!(sys_fl-fl_flags  (FL_POSIX | FL_RANGE)) ||
locks_same_owner(caller_fl, sys_fl))
return (0);
 
@@ -620,7 +620,7 @@
 
lock_kernel();
for (cfl = filp-f_dentry-d_inode-i_flock; cfl; cfl = cfl-fl_next) {
-   if (!(cfl-fl_flags  FL_POSIX))
+   if (!(cfl-fl_flags  (FL_POSIX | FL_RANGE)))
continue;
if (posix_locks_conflict(cfl, fl))
break;
@@ -863,7 +863,7 @@
if (caller-fl_type != F_UNLCK) {
   repeat:
for (fl = inode-i_flock; fl != NULL; fl = fl-fl_next) {
-   if (!(fl-fl_flags  FL_POSIX))
+   if (!(fl-fl_flags  (FL_POSIX | FL_RANGE)))
continue;
if (!posix_locks_conflict(caller, fl))
continue;
@@ -892,7 +892,7 @@
 
/* First skip locks owned by other processes.
 */
-   while ((fl = *before)  (!(fl-fl_flags  FL_POSIX) ||
+   while ((fl = *before)  (!(fl-fl_flags  (FL_POSIX | FL_RANGE)) ||
  !locks_same_owner(caller, fl))) {
before = fl-fl_next;
}
@@ -1458,23 +1458,72 @@
break;
case F_UNLCK:
break;
-   case F_SHLCK:
-   case F_EXLCK:
-#ifdef __sparc__
-/* warn a bit for now, but don't overdo it */
-{
-   static int count = 0;
-   if (!count) {
-   count=1;
-   printk(KERN_WARNING
-  fcntl_setlk() called by process %d (%s) with broken flock() 
emulation\n,
-  current-pid, current-comm);
+   default:
+   error = -EINVAL;
+   goto out_putf;
+   }
+
+   if (filp-f_op  filp-f_op-lock != NULL) {
+   error = filp-f_op-lock(filp, cmd, file_lock);
+   if (error  0)
+   goto out_putf;
}
+   error = posix_lock_file(filp, file_lock, cmd == F_SETLKW);
+
+out_putf:
+   fput(filp);
+out:
+   locks_free_lock(file_lock);
+   return error;
 }
-   if (!(filp-f_mode  3))
+
+/* Apply the lock described by l to an open file descriptor

Re: File Locking in Linux 2.5

2001-05-04 Thread Matthew Wilcox

On Fri, May 04, 2001 at 01:33:44PM +0200, Trond Myklebust wrote:
 ??? It'll be a completely useless form of lock if programs can end up
 blocking forever on a set of (from their perspective) valid
 operations.

Umm.. that's your opinion.  I don't see how `making them mandatory' would
make this better or worse than if they check for deadlock.  If they're
mandatory, a process doing a read() which conflicts will block forever
whether they're checked for deadlock or not.

 Ideally the VFS locking manager wants to provide a service to samba,
 NFS and their ilk, whereby they ask for a lock. If waiting would
 deadlock, send back an immediate answer to the caller. If not, then
 the locking manager should be responsible for scheduling a lock
 immediately after the blocking lock is released, and notifying the
 caller that it can proceed.

did you read my proposal?  was that not exactly what i suggested?

 Anything less than that would be an unacceptable step backward
 wrt. the existing code. We don't want to cram more responsabilities
 down on the filesystem-level.

Actually, we _do_ want to give the filesystems the opportunity to handle
their own locks without poking around in the guts of locks.c as lockd
currently does.  Did my proposal seem unacceptable to you?

 How about allowing the process to use a cookie to define ownership at
 the kernel level?
 
 For normal processes this may or may not include some combination
 involving the pid, but kernel level server processes could use this to
 save the metadata required to identify the client. I know the NFSv4
 people are keen to implement this sort of thing...

I like this.  That may be why the current code checks both owner and
pid fields.  Definitely worthy of a comment if so...

-- 
Revolutions do not require corporate support.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]



Re: File Locking in Linux 2.5

2001-05-03 Thread Matthew Wilcox

On Thu, May 03, 2001 at 12:19:31AM -0700, Jeremy Allison wrote:
  I want to add another byte-range lock, which looks and smells like a
  POSIX fcntl lock except that it is not removed by closing any fd which
  happens to be open on this file.
 
 If you do this I'll be eternally in your debt ! I had to
 write that code for Samba 2.2 and it was a *horrible* 
 experience. With non-broken locking we can #ifdef out
 via autoconf a large chunk of code from the Linux Samba
 implementation, and free up fd's immediately on close.
 
 As soon as you have any prototypes of this please let
 us know on the [EMAIL PROTECTED] list - I'd
 like to try it asap !

I'll get to it this weekend then.  Should be a relatively simple patch.
Are there any other semantics you want changing from the POSIX lock?
I'm thinking of removing deadlock detection, for one...

-- 
Revolutions do not require corporate support.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]



Re: File Locking in Linux 2.5

2001-05-03 Thread Matthew Wilcox

On Thu, May 03, 2001 at 01:28:33PM +0200, Andi Kleen wrote:
 Just don't forget to add a per user ulimit for it and probably an admin
 tool like ipcs.

It'll use the same limit as the other locks do.

-- 
Revolutions do not require corporate support.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]



Re: File Locking in Linux 2.5

2001-05-03 Thread Matthew Wilcox

On Thu, May 03, 2001 at 02:10:48PM +0200, Trond Myklebust wrote:
== Matthew Wilcox [EMAIL PROTECTED] writes:
 
   I'll get to it this weekend then.  Should be a relatively
   simple patch.  Are there any other semantics you want changing
   from the POSIX lock?  I'm thinking of removing deadlock
   detection, for one...
 
 What do you mean by `removing' it? Eliminating deadlock detection
 entirely will get you in deep trouble when you add mandatory locking.

Huh?  (1) I don't intend to have mandatory locking on these new locks.
(2) how exactly would it get me into trouble?

Let's just go over the semantics we currently have for posix locks,
and see if we can come up with `better' lock semantics.

POSIX locks are advisory (unless you do the mandatory lock dance),
and apply to a byte-range.  They are checked for deadlock, merged with other
compatible locks (same PID, same `owner').  (this seems redundant and there's
a question mark by it in the source... never mind.)

The POSIX locking rules mean that you can do funky things like:

lock bytes 0-100
lock bytes 50-150
unlock bytes 50-150

and you have bytes 0-49 left locked.  of course, you can then unlock
bytes 0-100 and it will remove the lock from 0-49, but this behaviour
seems unintuitive.

You can also (and this causes a fair bit of ugliness in the kernel code)
do:

lock 0-100
unlock 25-75

or, even worse

lock_read 0-100
lock_write 25-75

which leaves you with 0-24 read-locked, 25-75 write-locked and 76-100
read-locked.

i can't imagine this is behaviour people actually use/desire.  So for
a new lock type, i propose the following semantics:

byte-range, advisory, non-merging, non-deadlock-checking

Another semantic issue I want to tie down.  Is it useful for different
threadsd to be able to share a lock?  Or should locks from different
threads conflict?  I'm currently leaning towards the latter, but you
may be able to convince me I'm wrong.

-- 
Revolutions do not require corporate support.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]



Re: File Locking in 2.5

2001-05-02 Thread Matthew Wilcox

On Wed, May 02, 2001 at 04:49:30PM +0200, Andi Kleen wrote:
 On Mon, Apr 30, 2001 at 12:39:23PM -0600, Matthew Wilcox wrote:
   * All filesystems will fill in their -lock method.
 
 Why when a common stub should work for 90% of them?  Please keep
 global search-and-edit operation low when not absolutely possible.

I think this will be one of the more minor changes that filesystems will
see during 2.5 :-)

   * Local filesystems should all use local_lock() for this method,
 unless they have a good reason to provide their own facilities.
 
 
 So this should be default for -lock

Perhaps we misunderstand one another?  I propose most filesystems should do

struct file_operations {
...
lock: local_lock
}

and I think you're proposing

if (i_fop-lock)
i_fop-lock();
else
local_lock();

please correct me if I'm wrong.

 A clustered filesystem might call out to the network and say `I want
 to put this lock on this file'. Either some other node in the cluster
 says `Denied', `Blocked' or `Granted' (ie handles the request), OR no
 other node accepts responsibility for the lock, in which case we lock
 it locally by calling local_lock().
 
 It would be nice if it was possible to interface it to the distributed lock
 manager IBM has recently GPLed...

It would.  I hadn't heard that got released, I shall investigate.

 It's still linear lists I guess? Any evidence that a hash table or a ternary
 search tree makes sense?

I'm not going to change it from a linear list until I've made these other
changes.  In the meantime, if anyone knows of a data structure which lets me
record ranges and check for overlaps, let me know...

 I would add a printk warning that fires 10 times, then you can get rid
 of it.

They're only supported for the benefit of really old applications (ie
stuff that runs under iBCS).  We could probably drop them or make them a
config option.  I've never met anyone who mounts their filesystems with
-o mand :-)

 I never understood why you can't do it completely inode local protected
 by the inode lock; but then I don't claim that I understand fs/locks.c

inode_lock claims it's for protecting the list manipulations, not for
protecting any element in an individual inode.  Though in practice it
seems to be used for protecting i_state as well.  I think I'd rather use
a different lock to protect i_flock -- inode_lock is probably already
highly contended.

 Or just walk the inode hash ? 

yes, could do that.  but /proc/locks really ought to die.  i don't see why
we need to expose kernel addresses to userspace.  the extreme security
people have also voiced concerns about the covert channel involved and
the way this can be used to spy on other processes.

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]



File Locking in 2.5

2001-04-30 Thread Matthew Wilcox
 that it is not removed by closing any fd which
   happens to be open on this file. Samba keeps a list of open fds which
   are not currently in use on any locked file to work around this
   stupidity in the spec. I'd like the external interface to this to be
   fcntl(F_SETLK_NP) and F_SETLKW_NP. Clearly F_GETLK does not need to be
   altered or replaced.
   
Restructuring

   locks.c still runs almost entirely under the BKL. An earlier attempt
   to move it to a different locking scheme was thwarted when the code
   was integrated into 2.4.0-test9 while I was on holiday, and without me
   submitting it to Linus. Grumble. I plan to move it to _one_ spinlock
   to cover all lock-related structures, and I think that will be
   possible with the plan described above (since this code will no longer
   sleep).
   
   As soon as lockd no longer needs to keep its fingers inside locks.c, I
   want to remove the global list of locks. It's also used by /proc/locks
   -- which probably needs to go away anyway. So what's useful about
   /proc/locks? I'd like to be able to see which locks my process has,
   and which processes have a lock on a given file. The former is easy --
   /proc/$PID/locks can be constructed relatively easily from the fd's
   open by that process. The latter? I don't know. Ideas welcome.
   
Links

   [1]POSIX file locking 
   [2]Olaf Kirch's page on NLM (warning: out of date)
 _
   
   
Matthew Wilcox [EMAIL PROTECTED]

References

   1. http://www.opengroup.org/onlinepubs/007908799/xsh/fcntl.html
   2. http://www.swb.de/personal/okir/lockd.html
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]



Re: [gfs-admin@sistina.com: Your message to gfs awaits moderator approval]

2001-04-30 Thread Matthew Wilcox

On Mon, Apr 30, 2001 at 02:15:50PM -0700, David S. Miller wrote:
 Now since kpreslan is the only one from sistina.com on linux-fsdevel I
 would say that would be the problem address.  However, I've made some
 postings to linux-kernel today (which this address is also on) yet I
 received no such bounces.
 
 In the meantime I've removed kpreslan.  I'm not going to remove the
 entire domain, I only do that if I see evidence of site negligence
 not personal negligence.  We have no proof of the latter at this point.

ok, thanks.  i had assumed that they'd subscribed the gfs list to
-fsdevel.  i guess what actually happened is that kpreslan `bounced'
instead of `forwarded' the message to the gfs list, so the fascist mail
software checked whether i was a subscriber.

thanks for your prompt action.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]



Re: 64-bit block sizes on 32-bit systems

2001-03-26 Thread Matthew Wilcox

On Mon, Mar 26, 2001 at 08:39:21AM -0800, LA Walsh wrote:
 I vaguely remember a discussion about this a few months back.
 If I remember, the reasoning was it would unnecessarily slow
 down smaller systems that would never have block devices in
 the 4-28T range attached.  

4k page size * 2GB = 8TB.

i consider it much more likely on such systems that the page size will
be increased to maybe 16 or 64k which would give us 32TB or 128TB.
you keep on trying to increase the size of types without looking at
what gcc outputs in the way of code that manipulates 64-bit types.
seriously, why don't you just try it?  see what the performance is.
see what the code size is.  then come back with some numbers.  and i mean
numbers, not `it doesn't feel any slower'.

personally, i'm going to see what the situation looks like in 5 years time
and try to solve the problem then.  there're enough real problems with the
VFS today that i don't feel inclined to fix tomorrow's potential problems.

-- 
Revolutions do not require corporate support.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]



Re: 64-bit block sizes on 32-bit systems

2001-03-26 Thread Matthew Wilcox

On Mon, Mar 26, 2001 at 10:47:13AM -0700, Andreas Dilger wrote:
 What do you mean by problems 5 years down the road?  The real issue is that
 this 32-bit block count limit affects composite devices like MD RAID and
 LVM today, not just individual disks.  There have been several postings
 I have seen with people having a problem _today_ with a 2TB limit on
 devices.

people who can afford 2TB of disc can afford to buy a 64-bit processor.

-- 
Revolutions do not require corporate support.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]



Re: 64-bit block sizes on 32-bit systems

2001-03-26 Thread Matthew Wilcox

On Mon, Mar 26, 2001 at 01:13:32PM -0800, Ion Badulescu wrote:
 Yes, there are millions of 32-bit systems in use today. They do their job
 just fine with the 32-bit device support we have right now. Do you really
 want to penalize them *all* for the sake of the few idiotic sysadmins who
 want multi-TB storage on their 32-bit system?

And they can even have that multi-TB storage today.  It's the lazy ones
who want to combine all their 80GB discs into one virtual device.

In case anyone's getting the wrong idea from this, I _do_ think we ought
to make the 2TB limit go away.  I _don't_ think we should do that by going
to 64-bit block numbers on 32-bit machines.  Linus suggested a while back
(in one of the kiobuf flamewa^W discussions) that the block device layer
ought to work fine with non-512-byte-blocks.  Perhaps those who are keen
to see Linux support large drives should investigate that instead.

-- 
Revolutions do not require corporate support.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]



Re: [RFC] sane access to per-fs metadata (was Re: [PATCH] Documentation/ioctl-number.txt)

2001-03-23 Thread Matthew Wilcox

On Fri, Mar 23, 2001 at 09:56:47AM -0700, Bryan Henderson wrote:
 There's a lot of cool simplicity in this, both in implementation and 
 application, but it leaves something to be desired in functionality.  This 
 is partly because the price you pay for being able to use existing, 
 well-worn Unix interfaces is the ancient limitations of those interfaces 
 -- like the inability to return adequate error information.

hmm... open("defrag-error") first, then read from it if it fails?

 effective the defrag was?  And bear in mind that multiple processes may be 
 issuing commands to /mnt/control simultaneously.

you should probably serialise them.  you probably have to do this anyway.

 With ioctl, I can easily match a response of any kind to a request.  I can 
 even return an English text message if I want to be friendly.

yes, one of the nice plan9 changes was the change to returning strings
instead of numerics.

-- 
Revolutions do not require corporate support.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]



Re: (struct dentry *)-vfsmnt;

2001-03-14 Thread Matthew Wilcox

On Wed, Mar 14, 2001 at 10:26:50AM -0700, Andreas Dilger wrote:
  Let me put it that way: I don't understand why (if it is useful at all)
  it is done in the fs. Looks like a wrong level...
 
 For the same reason that the UUID and LABEL are stored in the superblock:
 you want this infomation kept with the filesystem and not anywhere else,
 otherwise it will quickly get out-of-date.  Wherever you mounted the
 filesystem last is where it would be mounted if you import the VG on
 another system.  You can obviously edit /etc/fstab afterwards if it is
 wrong, and then remount the filesystem(s), and this will store the
 correct mountpoint into the filesystem for the next vgimport.

Al is saying `why not do this in mount(8) instead of mount(2)?'  I haven't
seen you answer that yet.

-- 
Revolutions do not require corporate support.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]



remove unused entries from ext2 inode

2001-03-12 Thread Matthew Wilcox


Andi Kleen mentioned that the ext2 part of the inode union was rather large.
So I wondered if I could chop it down a bit.

 * i_osync is only referenced, never set
 * i_faddr, i_frag_no, i_frag_size -- we don't support fragments.
 * not_used_1 can clearly be removed.
 * i_high_size is obsoleted by the VFS support for 64-bit files.
 * i_new_inode is set but never referenced.

I decided not to remove the ACL entries yet.

Index: fs/ext2/ialloc.c
===
RCS file: /var/cvs/linux/fs/ext2/ialloc.c,v
retrieving revision 1.1.1.3
diff -u -p -r1.1.1.3 ialloc.c
--- fs/ext2/ialloc.c2000/12/26 00:20:32 1.1.1.3
+++ fs/ext2/ialloc.c2001/03/12 13:44:25
@@ -432,13 +432,9 @@ repeat:
inode-i_blksize = PAGE_SIZE;   /* This is the optimal IO size (for stat), not 
the fs block size */
inode-i_blocks = 0;
inode-i_mtime = inode-i_atime = inode-i_ctime = CURRENT_TIME;
-   inode-u.ext2_i.i_new_inode = 1;
inode-u.ext2_i.i_flags = dir-u.ext2_i.i_flags;
if (S_ISLNK(mode))
inode-u.ext2_i.i_flags = ~(EXT2_IMMUTABLE_FL | EXT2_APPEND_FL);
-   inode-u.ext2_i.i_faddr = 0;
-   inode-u.ext2_i.i_frag_no = 0;
-   inode-u.ext2_i.i_frag_size = 0;
inode-u.ext2_i.i_file_acl = 0;
inode-u.ext2_i.i_dir_acl = 0;
inode-u.ext2_i.i_dtime = 0;
Index: fs/ext2/inode.c
===
RCS file: /var/cvs/linux/fs/ext2/inode.c,v
retrieving revision 1.1.1.3
diff -u -p -r1.1.1.3 inode.c
--- fs/ext2/inode.c 2001/01/01 10:32:55 1.1.1.3
+++ fs/ext2/inode.c 2001/03/12 13:43:00
@@ -405,7 +405,7 @@ static int ext2_alloc_branch(struct inod
*branch[n].p = branch[n].key;
mark_buffer_uptodate(bh, 1);
mark_buffer_dirty_inode(bh, inode);
-   if (IS_SYNC(inode) || inode-u.ext2_i.i_osync) {
+   if (IS_SYNC(inode)) {
ll_rw_block (WRITE, 1, bh);
wait_on_buffer (bh);
}
@@ -470,13 +470,13 @@ static inline int ext2_splice_branch(str
/* had we spliced it onto indirect block? */
if (where-bh) {
mark_buffer_dirty_inode(where-bh, inode);
-   if (IS_SYNC(inode) || inode-u.ext2_i.i_osync) {
+   if (IS_SYNC(inode)) {
ll_rw_block (WRITE, 1, where-bh);
wait_on_buffer(where-bh);
}
}
 
-   if (IS_SYNC(inode) || inode-u.ext2_i.i_osync)
+   if (IS_SYNC(inode))
ext2_sync_inode (inode);
else
mark_inode_dirty(inode);
@@ -1036,14 +1036,10 @@ void ext2_read_inode (struct inode * ino
inode-i_blocks = le32_to_cpu(raw_inode-i_blocks);
inode-i_version = ++event;
inode-u.ext2_i.i_flags = le32_to_cpu(raw_inode-i_flags);
-   inode-u.ext2_i.i_faddr = le32_to_cpu(raw_inode-i_faddr);
-   inode-u.ext2_i.i_frag_no = raw_inode-i_frag;
-   inode-u.ext2_i.i_frag_size = raw_inode-i_fsize;
inode-u.ext2_i.i_file_acl = le32_to_cpu(raw_inode-i_file_acl);
if (S_ISDIR(inode-i_mode))
inode-u.ext2_i.i_dir_acl = le32_to_cpu(raw_inode-i_dir_acl);
else {
-   inode-u.ext2_i.i_high_size = le32_to_cpu(raw_inode-i_size_high);
inode-i_size |= ((__u64)le32_to_cpu(raw_inode-i_size_high))  32;
}
inode-i_generation = le32_to_cpu(raw_inode-i_generation);
@@ -1112,25 +1108,26 @@ static int ext2_update_inode(struct inod
unsigned long offset;
int err = 0;
struct ext2_group_desc * gdp;
+   struct super_block *sb = inode-i_sb;
 
if ((inode-i_ino != EXT2_ROOT_INO 
-inode-i_ino  EXT2_FIRST_INO(inode-i_sb)) ||
-   inode-i_ino  le32_to_cpu(inode-i_sb-u.ext2_sb.s_es-s_inodes_count)) {
-   ext2_error (inode-i_sb, "ext2_write_inode",
+inode-i_ino  EXT2_FIRST_INO(sb)) ||
+   inode-i_ino  le32_to_cpu(sb-u.ext2_sb.s_es-s_inodes_count)) {
+   ext2_error (sb, "ext2_write_inode",
"bad inode number: %lu", inode-i_ino);
return -EIO;
}
-   block_group = (inode-i_ino - 1) / EXT2_INODES_PER_GROUP(inode-i_sb);
-   if (block_group = inode-i_sb-u.ext2_sb.s_groups_count) {
-   ext2_error (inode-i_sb, "ext2_write_inode",
+   block_group = (inode-i_ino - 1) / EXT2_INODES_PER_GROUP(sb);
+   if (block_group = sb-u.ext2_sb.s_groups_count) {
+   ext2_error (sb, "ext2_write_inode",
"group = groups count");
return -EIO;
}
-   group_desc = block_group  EXT2_DESC_PER_BLOCK_BITS(inode-i_sb);
-   desc = block_group  (EXT2_DESC_PER_BLOCK(inode-i_sb) - 1);
-   bh = inode-i_sb-u.ext2_sb.s_group_desc[group_desc];
+   group_desc = block_group  

NULL f_ops

2001-03-08 Thread Matthew Wilcox


Someone tell me if my chain of reasoning is wrong here...

 (1) The only way to get a `struct file' is to call get_empty_filp()
 (2) All callers of get_empty_filp() set -f_ops to a non-NULL value
 (3) All checks for f_ops being NULL can be removed

-- 
Revolutions do not require corporate support.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]



Re: NULL f_ops

2001-03-08 Thread Matthew Wilcox

On Thu, Mar 08, 2001 at 11:16:10AM -0500, Alexander Viro wrote:
 I'm not sure on your #2. In principle, -i_fop can be NULL. It may be
 a good thing to declare that it should never happens, but right now it's
 not guaranteed.
 
 Besides, revoke-like thing in proc/generic.c _does_ set f_op to NULL.
 It's damn ugly and the whole kill_inodes() should die slow and painful
 death, but right now that's the way it is done. Same thing for devfs,
 BTW - not that I thought that devfs deserved a different fate, but...

I'd rather see it set to null_file_ops then NULL.  If it can be null,
that means we need to go through the fs directory adding them; there
are a few places it isn't checked right now.

-- 
Revolutions do not require corporate support.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]



Re: More better in mount(2)

2001-01-05 Thread Matthew Wilcox

On Fri, Jan 05, 2001 at 01:00:58PM -0800, LA Walsh wrote:
   There is no file-system specific code.

even better, you can do it in the vfs.

 file systems, but that seems as ugly or uglier than adding a 6th parameter.
 
 Note that the 6th parameter would only be looked at if the MS_EXTRA bit is
 set --
 meaning no current mount programs would have to change -- they would just
 continue
 to do mounts the way they always have and specify no extra 'sysdata'.

there have been times before when this kind of extension have been added.
it almost always hurts us later.  don't add new arguments to syscalls
without changing the syscall number.

-- 
Revolutions do not require corporate support.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]



Re: dnotify enhancement

2000-12-13 Thread Matthew Wilcox

On Tue, Dec 12, 2000 at 03:47:54PM -0500, James Antill wrote:
  Say you have...
 
 /a/b/c/d
 
 ...and you are monitoring the 'd' directory. Then someone does...
 
 % mv /a/b /a/b.old
 % mkdir -p /a/b/c/d

that shouldn't trigger an event.  i'm strongly against doing recursion
of this in the kernel.  if you want to monitor that kind of thing, it makes
sense to do it in userspace.

are there any cases where you want to know that the directory itself
has been renamed, but not that one of its more remote ancestors has been
renamed?

-- 
Revolutions do not require corporate support.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]



Re: big kernel locks for inode ops

2000-12-06 Thread Matthew Wilcox

On Wed, Dec 06, 2000 at 03:03:00PM -0800, LA Walsh wrote:
 
 I notice many places in open.c where 'lock_kernel()' has been dropped in
 2.4.  There are still a few places.  Should there be a separate
 file-sub-system
 lock, perhaps later evolving into a /per filesystem lock?
 
 Under what circumstances in the file-system code does one still have
 to (or should) lock down the kernel?

Rusty wrote an `unreliable guide to kernel locking', available as
linux/Documentation/DocBook/kernel-locking.* (there are various formats
available for it, see the makefile for details).

-- 
Revolutions do not require corporate support.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]