Re: how to show propagation state for mounts
On Wed, Feb 20, 2008 at 04:04:22PM +, Al Viro wrote: It's less about the form of representation (after all, we generate poll events when contents of that sucker changes, so one *can* get a consistent snapshot of the entire thing) and more about having it self-contained when we have namespaces in the play. IOW, the data in there should give answers to questions that make sense. Do events get propagated from this vfsmount I have to that vfsmount I have? is a meaningful one; ditto for are events here propagated to somewhere I don't see? or are events getting propagated here from somewhere I don't see?. Why do those last two questions deserve an answer? How will a person's or application's behaviour be affected by whether a change will propagate to something they don't know about and can't see? -- Intel are signing my paycheques ... these opinions are still mine Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/3] enhanced ESTALE error handling
On Fri, Jan 18, 2008 at 10:36:01AM -0500, Peter Staubach wrote: @@ -1025,12 +1027,27 @@ static int fastcall link_path_walk(const mntget(save.mnt); result = __link_path_walk(name, nd); - if (result == -ESTALE) { + while (result == -ESTALE) { + /* + * If no progress was made looking up the pathname, + * then stop and return ENOENT instead of ESTALE. + */ + if (nd-dentry == save.dentry) { + result = -ENOENT; + break; + } *nd = save; dget(nd-dentry); mntget(nd-mnt); nd-flags |= LOOKUP_REVAL; result = __link_path_walk(name, nd); + /* + * If no progress was made this time, then return + * ENOENT instead of ESTALE because no recovery + * is possible to recover the stale file handle. + */ + if (result == -ESTALE nd-dentry == save.dentry) + result = -ENOENT; } dput(save.dentry); Why do you need both of these tests? The first one should be enough, surely? @@ -1268,8 +1285,8 @@ int path_lookup_open(int dfd, const char * @create_mode: create intent flags */ static int path_lookup_create(int dfd, const char *name, - unsigned int lookup_flags, struct nameidata *nd, - int open_flags, int create_mode) + unsigned int lookup_flags, struct nameidata *nd, + int open_flags, int create_mode) Gratuitous reformatting? @@ -1712,7 +1729,10 @@ int open_namei(int dfd, const char *path int acc_mode, error; struct path path; struct dentry *dir; - int count = 0; + int count; + +top: + count = 0; acc_mode = ACC_MODE(flag); @@ -1739,7 +1759,8 @@ int open_namei(int dfd, const char *path /* * Create - we need to know the parent. */ - error = path_lookup_create(dfd,pathname,LOOKUP_PARENT,nd,flag,mode); + error = path_lookup_create(dfd, pathname, LOOKUP_PARENT, nd, + flag, mode); if (error) return error; @@ -1812,10 +1833,17 @@ ok: return 0; exit_dput: + if (error == -ESTALE) + d_drop(path.dentry); dput_path(path, nd); exit: if (!IS_ERR(nd-intent.open.file)) release_open_intent(nd); + if (error == -ESTALE) { + d_drop(nd-dentry); + path_release(nd); + goto top; + } I wonder if a tail-call might not work better here. -- Intel are signing my paycheques ... these opinions are still mine Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: file locks: Use wait_event_interruptible_timeout()
On Tue, Jan 15, 2008 at 09:48:51AM -0500, J. Bruce Fields wrote: On Mon, Jan 14, 2008 at 09:28:30PM -0700, Matthew Wilcox wrote: interruptible_sleep_on_locked() is just an open-coded wait_event_interruptible_timeout() with a few assumptions since we know we hold the BKL. locks_block_on_timeout() is only used in one place, so it's actually simpler to inline it into its caller. Makes sense, thanks. So the assumption we were depending on the BKL for was that we could count on the wake-up not coming till after we block, so we could skip a check -fl_next that's normally needed to resolve the usual sleeping-on-some-condition race? That's right. -- Intel are signing my paycheques ... these opinions are still mine Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch] rewrite rd
On Tue, Dec 04, 2007 at 05:26:28AM +0100, Nick Piggin wrote: +static void copy_to_brd(struct brd_device *brd, const void *src, + sector_t sector, size_t n) +{ + struct page *page; + void *dst; + unsigned int offset = (sector (PAGE_SECTORS-1)) SECTOR_SHIFT; + size_t copy; + + copy = min((unsigned long)n, PAGE_SIZE - offset); + page = brd_lookup_page(brd, sector); + BUG_ON(!page); + + dst = kmap_atomic(page, KM_USER1); + memcpy(dst + offset, src, copy); + kunmap_atomic(dst, KM_USER1); You're using kmap_atomic, but I see no reason you can't be preempted. Don't you need to at least disable preemption while you have stuff atomically kmapped? -- Intel are signing my paycheques ... these opinions are still mine Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
file locks: Use wait_event_interruptible_timeout()
interruptible_sleep_on_locked() is just an open-coded wait_event_interruptible_timeout() with a few assumptions since we know we hold the BKL. locks_block_on_timeout() is only used in one place, so it's actually simpler to inline it into its caller. Signed-off-by: Matthew Wilcox [EMAIL PROTECTED] locks.c | 33 - 1 file changed, 4 insertions(+), 29 deletions(-) diff --git a/fs/locks.c b/fs/locks.c index 8b8388e..b681459 100644 --- a/fs/locks.c +++ b/fs/locks.c @@ -634,33 +634,6 @@ static int flock_locks_conflict(struct file_lock *caller_fl, struct file_lock *s return (locks_conflict(caller_fl, sys_fl)); } -static int interruptible_sleep_on_locked(wait_queue_head_t *fl_wait, int timeout) -{ - int result = 0; - DECLARE_WAITQUEUE(wait, current); - - __set_current_state(TASK_INTERRUPTIBLE); - add_wait_queue(fl_wait, wait); - if (timeout == 0) - schedule(); - else - result = schedule_timeout(timeout); - if (signal_pending(current)) - result = -ERESTARTSYS; - remove_wait_queue(fl_wait, wait); - __set_current_state(TASK_RUNNING); - return result; -} - -static int locks_block_on_timeout(struct file_lock *blocker, struct file_lock *waiter, int time) -{ - int result; - locks_insert_block(blocker, waiter); - result = interruptible_sleep_on_locked(waiter-fl_wait, time); - __locks_delete_block(waiter); - return result; -} - void posix_test_lock(struct file *filp, struct file_lock *fl) { @@ -1256,7 +1229,10 @@ restart: if (break_time == 0) break_time++; } - error = locks_block_on_timeout(flock, new_fl, break_time); + locks_insert_block(flock, new_fl); + error = wait_event_interruptible_timeout(new_fl-fl_wait, + !new_fl-fl_next, break_time); + __locks_delete_block(new_fl); if (error = 0) { if (error == 0) time_out_leases(inode); -- Intel are signing my paycheques ... these opinions are still mine Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Leak in nlmsvc_testlock for async GETFL case
On Mon, Jan 14, 2008 at 03:44:19PM -0500, J. Bruce Fields wrote: Thanks! I've queued it up for 2.6.25. Hi Bruce, I haven't had as much time to play with de-BKL-ising fs/locks.c as I would like, so fixing that for 2.6.25 is probably out of the question, but here are two janitorial patches that hopefully can be applied and will make the next steps easier. They make sense all by themselves, even if I don't get back to this project for a few months. -- Intel are signing my paycheques ... these opinions are still mine Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
file locks: Split flock_find_conflict out of flock_lock_file
Reduce the spaghetti-like nature of flock_lock_file by making the chunk of code labelled find_conflict into its own function. Also allocate memory before taking the kernel lock in preparation for switching to a normal spinlock. Signed-off-by: Matthew Wilcox [EMAIL PROTECTED] diff --git a/fs/locks.c b/fs/locks.c index b681459..bc691e5 100644 --- a/fs/locks.c +++ b/fs/locks.c @@ -699,6 +699,33 @@ next_task: return 0; } +/* + * A helper function for flock_lock_file(). It searches the list of locks + * for @inode looking for one which would conflict with @request. + * If it does not find one, it returns the address where the requested lock + * should be inserted. If it does find a conflicting lock, it returns NULL. + */ +static struct file_lock ** +flock_find_conflict(struct inode *inode, struct file_lock *request) +{ + struct file_lock **before; + + for_each_lock(inode, before) { + struct file_lock *fl = *before; + if (IS_POSIX(fl)) + break; + if (IS_LEASE(fl)) + continue; + if (!flock_locks_conflict(request, fl)) + continue; + + if (request-fl_flags FL_SLEEP) + locks_insert_block(fl, request); + return NULL; + } + return before; +} + /* Try to create a FLOCK lock on filp. We always insert new FLOCK locks * after any leases, but before any posix locks. * @@ -714,18 +741,21 @@ static int flock_lock_file(struct file *filp, struct file_lock *request) int error = 0; int found = 0; - lock_kernel(); - if (request-fl_flags FL_ACCESS) - goto find_conflict; + if (request-fl_flags FL_ACCESS) { + lock_kernel(); + before = flock_find_conflict(inode, request); + unlock_kernel(); + return before ? 0 : -EAGAIN; + } if (request-fl_type != F_UNLCK) { - error = -ENOMEM; new_fl = locks_alloc_lock(); - if (new_fl == NULL) - goto out; - error = 0; + if (!new_fl) + return -ENOMEM; } + lock_kernel(); + for_each_lock(inode, before) { struct file_lock *fl = *before; if (IS_POSIX(fl)) @@ -754,22 +784,12 @@ static int flock_lock_file(struct file *filp, struct file_lock *request) if (found) cond_resched(); -find_conflict: - for_each_lock(inode, before) { - struct file_lock *fl = *before; - if (IS_POSIX(fl)) - break; - if (IS_LEASE(fl)) - continue; - if (!flock_locks_conflict(request, fl)) - continue; + before = flock_find_conflict(inode, request); + if (!before) { error = -EAGAIN; - if (request-fl_flags FL_SLEEP) - locks_insert_block(fl, request); goto out; } - if (request-fl_flags FL_ACCESS) - goto out; + locks_copy_lock(new_fl, request); locks_insert_lock(before, new_fl); new_fl = NULL; -- Intel are signing my paycheques ... these opinions are still mine Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
On setting a lease across a cluster
Hi Bruce, The current implementation of vfs_setlease/generic_setlease/etc is a bit quirky. I've been thinking it over for the past couple of days, and I think we need to refactor it to work sensibly. As you may have noticed, I've been mulling over getting rid of the BKL in fs/locks.c and the lease interface is particularly problematic. Here's one reason why: fcntl_setlease lock_kernel vfs_setlease lock_kernel generic_setlease unlock_kernel fasync_helper unlock_kernel This is perfectly legitimate for the BKL of course, but other spinlocks can't be acquired recursively in this way. At first I thought I'd just push the call to generic_setlease into __vfs_setlease and have two ways of calling it: fcntl_setlease lock_kernel __vfs_setlease fasync_helper unlock_kernel vfs_setlease lock_kernel __vfs_setlease unlock_kernel But generic_setlease allocates memory (GFP_KERNEL) under the lock, so that's bad for converting to spnlocks later. Then I thought I'd move the spinlock acquisition into generic_setlease(). But I can't do that either as fcntl_setlease() needs to hold the spinlock around the call to generic_setlease() and fasync_helper(). Then I started to wonder about the current split of functionality between fcntl_setlease, vfs_setlease and generic_setlease. The check for no other process having this file open should be common to all filesystems. And it should be honoured by NFS (it shouldn't hand out a delegation if a local user has the file open), so it needs to be in vfs_setlease(). Now I'm worrying about the mechanics of calling out to a filesystem to perform a lock. Obviously, a network filesystem is going to have to sleep waiting for a response from the fileserver, so that precludes the use of a spinlock held over the call to f_op-setlease. I'm not keen on the idea of EXPORT_SYMBOL_GPL() a spinlock -- verifying locking is quite hard enough without worrying what a filesystem might be doing with it. So I think we need to refactor the interface, and I'd like to hear your thoughts on my ideas of how to handle this. First, have clients of vfs_setlease call lease_alloc() and pass it in, rather than allocate it on the stack and have this horrible interface where we may pass back an allocated lock. This allows us to not allocate memory (hence sleep) during generic_setlease(). Second, move some of the functionality of generic_setlease() to vfs_setlease(), as mentioned above. Third, change the purpose of f_op-setlease. We can rename it if you like to catch any out of tree users. I'd propose using it like this: vfs_setlease() if (f_op-setlease()) res = f_op-setlease() if (res) return res; lock_kernel() generic_setlease() unlock_kernel() fcntl_setlease if (f_op-setlease()) res = f_op-setlease() if (res) return res; lock_kernel generic_setlease() ... fasync ... unlock_kernel So 'f_op-setlease' now means Try to get a lease from the fileserver. We can optimise this a bit to not even call setlease if, say, we already have a read lease and we're trying to get a second read lease. But we have to record our local leases (that way they'll show up in /proc/locks). I think the combination of these three ideas gives us a sane interface to the various setlease functions, and let us convert from lock_kernel() to spin_lock() later. Have I missed something? I don't often think about the needs of cluster filesystems, so I may misunderstand how they need this operation to work. At some point, we need to revisit the logic for 'is this file open by another process' -- it's clearly buggy since it doesn't hold the inode_lock while checking i_count, so it could race against an __iget(). -- Intel are signing my paycheques ... these opinions are still mine Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: On setting a lease across a cluster
On Fri, Jan 04, 2008 at 02:47:18PM -0500, J. Bruce Fields wrote: Then I started to wonder about the current split of functionality between fcntl_setlease, vfs_setlease and generic_setlease. The check for no other process having this file open should be common to all filesystems. And it should be honoured by NFS (it shouldn't hand out a delegation if a local user has the file open), so it needs to be in vfs_setlease(). Of course, in the case of a (still unfortunately theoretical) cluster filesystem implementation, those checks will be insufficient on their own. Oh yes, definitely insufficient, but also necessary. Right now, filesystems bypass them entirely. And the last thing we want is individual filesystems duplicating these tests ... Though the write-lease check especially: if ((arg == F_WRLCK) ((atomic_read(dentry-d_count) 1) || (atomic_read(inode-i_count) 1))) goto out; seems like a terrible hack, and gives some annoying false positives: http://www.ussg.iu.edu/hypermail/linux/kernel/0711.1/2255.html It is a terrible hack. It was sufficient for the time, but it wasn't supposed to be merged in that state ... I haven't had the courage to figure out how to solve it properly yet. Now I'm worrying about the mechanics of calling out to a filesystem to perform a lock. Obviously, a network filesystem is going to have to sleep waiting for a response from the fileserver, so that precludes the use of a spinlock held over the call to f_op-setlease. I'm not keen on the idea of EXPORT_SYMBOL_GPL() a spinlock -- verifying locking is quite hard enough without worrying what a filesystem might be doing with it. I'm confused as to what the locks are actually protecting. If they're mainly just protecting the inode's lock list, for example, then we should let the filesystem call back to helpers in locks.c for any manipulations of that list and do the locking in those helpers. But we need to ensure the lock list is always consistent for, eg, /proc/locks. We don't want locks to temporarily appear on the list when they never get granted (due to an error elsewhere in the chain). vfs_setlease() if (f_op-setlease()) res = f_op-setlease() if (res) return res; lock_kernel() generic_setlease() unlock_kernel() Why can't the filesystem call into generic_setlease() on its own? Because (assuming we're rid of the BKL), fcntl_setlease() needs to acquire the spinlock and hold it while generic_setlease() runs, so generic_setlease() can't acquire the lock. Christoph Hellwig has suggested removing the second case and just doing .setlease = generic_setlease for all filesystems that should support leases; see the final patch in git://linux-nfs.org/~bfields/linux.git server-cluster-lease-api Yeah, but that doesn't work if we repurpose the setlease f_op. -- Intel are signing my paycheques ... these opinions are still mine Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: On setting a lease across a cluster
On Fri, Jan 04, 2008 at 01:55:36PM -0500, david m. richter wrote: fwiw, i've done some work on extending the lease subsystem to help support the full range of requirements for NFSv4 file and directory delegations (e.g., breaking a lease when unlinking a file) and we ended up actually doing most of what you've just suggested here, which i take to be a good sign. As long as it's great minds thinking alike and not fools seldom differing ;-) most of my refactoring came out of trying to simplify locking and avoid holding locks too long (rather than specifically focusing on cluster-oriented stuff, but the goals dovetail) and your work on getting the BKL out of locks.c entirely is something i really like and look forward to. Excellent. Shall I make the patch myself, or did you want to post a patch based on working code? ;-) -- Intel are signing my paycheques ... these opinions are still mine Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: On setting a lease across a cluster
On Fri, Jan 04, 2008 at 03:53:04PM -0500, J. Bruce Fields wrote: So, the problem is that fcntl_setlease() does vfs_setlease() fasync_helper() which the bkl held over both, and you want to preserve that? But what that BKL is doing is a mystery to me--the very first thing that fasync_helper() does is kmem_cache_alloc(., GFP_KERNEL). So you won't be introducing any new problem if you lock those two operations separately. Unless I'm totally missing something. A very good point. So yet another race caused by using the BKL rather than thinking ... but maybe it's an inconsequential race. The consequences are that (if the kmalloc in fasync_helper sleeps) a lease appears that isn't fully set-up yet (and may be removed if the kmalloc fails). Actually, it seems bad if the kmalloc eventually succeeds -- there's a window while kmalloc is sleeping where another process could open the file, break the lease, fl_fasync will be NULL, so no signal is sent. Then 30 seconds later the lease is removed without the leaseholder being sent a signal. Bad. How can we fix this situation? I think we need a better interface than fasync_helper() -- fasync_alloc() and fasync_setup() would seem to do the trick. -- Intel are signing my paycheques ... these opinions are still mine Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] Remove BKL from fs/locks.c
On Sun, Dec 30, 2007 at 08:36:44PM +1100, Stephen Rothwell wrote: We should probably do some performance testing on this because the last time we tried the impact was quite noticeable. You should ping Tridge as he has some good lock testing setups. And he cares if we slow him down :-) Last time I did this, I switched to a semaphore instead of a spinlock. That was what slowed us down. I doubt we can see a performance loss with this patch since it's a 1-1 substitution of the BKL spinlock with a private spinlock. Good idea about asking tridge for an evaluation of the patch though. -- Intel are signing my paycheques ... these opinions are still mine Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC] Remove BKL from fs/locks.c
I've been promising to do this for about seven years now. It seems to work well enough, but I haven't run any serious stress tests on it. This implementation uses one spinlock to protect both lock lists and all the i_flock chains. It doesn't seem worth splitting up the locking any further. I had to move one memory allocation out from under the file_lock_lock. I hope I got that logic right. I'm rather tempted to split out the find_conflict algorithm from that function into something that can be called separately for the FL_ACCESS case. I also have to drop and reacquire the file_lock_lock around the call to cond_resched(). This was done automatically for us before by the special BKL semantics. I had to change vfs_setlease() as it relied on the special BKL ability to recursively acquire the same lock. The internal caller now calls __vfs_setlease and the exported interface acquires and releases the file_lock_lock around calling __vfs_setlease. I should probably split out the removal of interruptible_sleep_on_locked() as it's basically unrelated to all this. diff --git a/fs/locks.c b/fs/locks.c index 8b8388e..68de569 100644 --- a/fs/locks.c +++ b/fs/locks.c @@ -139,9 +139,23 @@ int lease_break_time = 45; #define for_each_lock(inode, lockp) \ for (lockp = inode-i_flock; *lockp != NULL; lockp = (*lockp)-fl_next) +/* + * Protects the two list heads below, plus the inode-i_flock list + */ +static DEFINE_SPINLOCK(file_lock_lock); static LIST_HEAD(file_lock_list); static LIST_HEAD(blocked_list); +static inline void lock_flocks(void) +{ + spin_lock(file_lock_lock); +} + +static inline void unlock_flocks(void) +{ + spin_unlock(file_lock_lock); +} + static struct kmem_cache *filelock_cache __read_mostly; /* Allocate an empty lock structure. */ @@ -507,9 +521,9 @@ static void __locks_delete_block(struct file_lock *waiter) */ static void locks_delete_block(struct file_lock *waiter) { - lock_kernel(); + lock_flocks(); __locks_delete_block(waiter); - unlock_kernel(); + unlock_flocks(); } /* Insert waiter into blocker's block list. @@ -634,29 +648,15 @@ static int flock_locks_conflict(struct file_lock *caller_fl, struct file_lock *s return (locks_conflict(caller_fl, sys_fl)); } -static int interruptible_sleep_on_locked(wait_queue_head_t *fl_wait, int timeout) -{ - int result = 0; - DECLARE_WAITQUEUE(wait, current); - - __set_current_state(TASK_INTERRUPTIBLE); - add_wait_queue(fl_wait, wait); - if (timeout == 0) - schedule(); - else - result = schedule_timeout(timeout); - if (signal_pending(current)) - result = -ERESTARTSYS; - remove_wait_queue(fl_wait, wait); - __set_current_state(TASK_RUNNING); - return result; -} - -static int locks_block_on_timeout(struct file_lock *blocker, struct file_lock *waiter, int time) +static int locks_block_on_timeout(struct file_lock *blocker, + struct file_lock *waiter, int time) { int result; locks_insert_block(blocker, waiter); - result = interruptible_sleep_on_locked(waiter-fl_wait, time); + unlock_flocks(); + result = wait_event_interruptible_timeout(waiter-fl_wait, + !waiter-fl_next, time); + lock_flocks(); __locks_delete_block(waiter); return result; } @@ -666,7 +666,7 @@ posix_test_lock(struct file *filp, struct file_lock *fl) { struct file_lock *cfl; - lock_kernel(); + lock_flocks(); for (cfl = filp-f_path.dentry-d_inode-i_flock; cfl; cfl = cfl-fl_next) { if (!IS_POSIX(cfl)) continue; @@ -677,7 +677,7 @@ posix_test_lock(struct file *filp, struct file_lock *fl) __locks_copy_lock(fl, cfl); else fl-fl_type = F_UNLCK; - unlock_kernel(); + unlock_flocks(); return; } @@ -741,18 +741,16 @@ static int flock_lock_file(struct file *filp, struct file_lock *request) int error = 0; int found = 0; - lock_kernel(); - if (request-fl_flags FL_ACCESS) - goto find_conflict; - - if (request-fl_type != F_UNLCK) { - error = -ENOMEM; + if (!(request-fl_flags FL_ACCESS) (request-fl_type != F_UNLCK)) { new_fl = locks_alloc_lock(); - if (new_fl == NULL) - goto out; - error = 0; + if (!new_fl) + return -ENOMEM; } + lock_flocks(); + if (request-fl_flags FL_ACCESS) + goto find_conflict; + for_each_lock(inode, before) { struct file_lock *fl = *before; if (IS_POSIX(fl)) @@ -778,8 +776,11 @@ static int flock_lock_file(struct file *filp, struct file_lock *request) * If a higher-priority process was
Re: NFS Killable tasks request comments on patch
On Thu, Dec 06, 2007 at 10:34:26AM -0700, Matthew Wilcox wrote: I've done a more thorough review of Liam's work and come up with a few more fixes (which I'll publish later today) I've put up a git tree for this work; see http://git.kernel.org/?p=linux/kernel/git/willy/misc.git;a=shortlog;h=task_killable I've split up some of the earlier patches, hopefully in a way which will make akpm less grumpy. Andrew, let me know if I've misunderstood how you want to see it. Here's the commit for the NFS people to chew on a bit. commit f05b88f294044cdf56fbe637a246ba6c7a14d6f1 Author: Matthew Wilcox [EMAIL PROTECTED] Date: Thu Dec 6 16:24:39 2007 -0500 NFS: Switch from intr mount option to TASK_KILLABLE By using the TASK_KILLABLE infrastructure, we can get rid of the 'intr' mount option. We have to use _killable everywhere instead of _interruptible as we get rid of rpc_clnt_sigmask/sigunmask. Signed-off-by: Liam R. Howlett [EMAIL PROTECTED] Signed-off-by: Matthew Wilcox [EMAIL PROTECTED] diff --git a/fs/nfs/client.c b/fs/nfs/client.c index 70587f3..310fa2f 100644 --- a/fs/nfs/client.c +++ b/fs/nfs/client.c @@ -302,7 +302,7 @@ found_client: if (new) nfs_free_client(new); - error = wait_event_interruptible(nfs_client_active_wq, + error = wait_event_killable(nfs_client_active_wq, clp-cl_cons_state != NFS_CS_INITING); if (error 0) { nfs_put_client(clp); @@ -494,10 +494,6 @@ static int nfs_init_server_rpcclient(struct nfs_server *server, rpc_authflavor_t if (server-flags NFS_MOUNT_SOFT) server-client-cl_softrtry = 1; - server-client-cl_intr = 0; - if (server-flags NFS4_MOUNT_INTR) - server-client-cl_intr = 1; - return 0; } diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c index 5e8d82f..7b994b2 100644 --- a/fs/nfs/direct.c +++ b/fs/nfs/direct.c @@ -193,7 +193,7 @@ static ssize_t nfs_direct_wait(struct nfs_direct_req *dreq) if (dreq-iocb) goto out; - result = wait_for_completion_interruptible(dreq-completion); + result = wait_for_completion_killable(dreq-completion); if (!result) result = dreq-error; @@ -391,9 +391,7 @@ static ssize_t nfs_direct_read(struct kiocb *iocb, const struct iovec *iov, unsigned long nr_segs, loff_t pos) { ssize_t result = 0; - sigset_t oldset; struct inode *inode = iocb-ki_filp-f_mapping-host; - struct rpc_clnt *clnt = NFS_CLIENT(inode); struct nfs_direct_req *dreq; dreq = nfs_direct_req_alloc(); @@ -405,11 +403,9 @@ static ssize_t nfs_direct_read(struct kiocb *iocb, const struct iovec *iov, if (!is_sync_kiocb(iocb)) dreq-iocb = iocb; - rpc_clnt_sigmask(clnt, oldset); result = nfs_direct_read_schedule_iovec(dreq, iov, nr_segs, pos); if (!result) result = nfs_direct_wait(dreq); - rpc_clnt_sigunmask(clnt, oldset); nfs_direct_req_release(dreq); return result; @@ -767,9 +763,7 @@ static ssize_t nfs_direct_write(struct kiocb *iocb, const struct iovec *iov, size_t count) { ssize_t result = 0; - sigset_t oldset; struct inode *inode = iocb-ki_filp-f_mapping-host; - struct rpc_clnt *clnt = NFS_CLIENT(inode); struct nfs_direct_req *dreq; size_t wsize = NFS_SERVER(inode)-wsize; int sync = 0; @@ -787,11 +781,9 @@ static ssize_t nfs_direct_write(struct kiocb *iocb, const struct iovec *iov, if (!is_sync_kiocb(iocb)) dreq-iocb = iocb; - rpc_clnt_sigmask(clnt, oldset); result = nfs_direct_write_schedule_iovec(dreq, iov, nr_segs, pos, sync); if (!result) result = nfs_direct_wait(dreq); - rpc_clnt_sigunmask(clnt, oldset); nfs_direct_req_release(dreq); return result; diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c index db5d96d..f68c222 100644 --- a/fs/nfs/inode.c +++ b/fs/nfs/inode.c @@ -433,15 +433,11 @@ static int nfs_wait_schedule(void *word) */ static int nfs_wait_on_inode(struct inode *inode) { - struct rpc_clnt *clnt = NFS_CLIENT(inode); struct nfs_inode *nfsi = NFS_I(inode); - sigset_t oldmask; int error; - rpc_clnt_sigmask(clnt, oldmask); error = wait_on_bit_lock(nfsi-flags, NFS_INO_REVALIDATING, - nfs_wait_schedule, TASK_INTERRUPTIBLE); - rpc_clnt_sigunmask(clnt, oldmask); + nfs_wait_schedule, TASK_KILLABLE); return error; } diff --git a/fs/nfs/mount_clnt.c b/fs/nfs/mount_clnt.c index 8afd9f7..49c7cd0 100644 --- a/fs/nfs/mount_clnt.c +++ b/fs/nfs/mount_clnt.c @@ -56,7 +56,7 @@ int nfs_mount(struct sockaddr *addr, size_t len, char *hostname, char *path
Re: Massive slowdown when re-querying large nfs dir
On Mon, Nov 05, 2007 at 07:58:38AM +0300, Al Boldi wrote: Any ideas? How about tcpdumping and seeing what requests are flowing across the wire? You might be able to figure out what's being done differently. -- Intel are signing my paycheques ... these opinions are still mine Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC, PATCH] locks: remove posix deadlock detection
On Sun, Oct 28, 2007 at 01:43:21PM -0400, J. Bruce Fields wrote: We currently attempt to return -EDEALK to blocking fcntl() file locking requests that would create a cycle in the graph of tasks waiting on locks. This is inefficient: in the general case it requires us determining whether we're adding a cycle to an arbitrary directed acyclic graph. And this calculation has to be performed while holding a lock (currently the BKL) that prevents that graph from changing. It has historically been a source of bugs; most recently it was noticed that it could loop indefinitely while holding the BKL. It can also return -EDEADLK spuriously. So yeah, just kill it. -- Intel are signing my paycheques ... these opinions are still mine Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC, PATCH] locks: remove posix deadlock detection
On Sun, Oct 28, 2007 at 06:40:52PM +, Alan Cox wrote: NAK. This is an ABI change. It was also comprehensively rejected before because - EDEADLK behaviour is ABI Not in any meaningful way. - EDEADLK behaviour is required by SuSv3 What SuSv3 actually says is: If the system detects that sleeping until a locked region is unlocked would cause a deadlock, fcntl() shall fail with an [EDEADLK] error. It doesn't require the system to detect it, only mandate what to return if it does detect it. - We have no idea what applications may rely on this behaviour. I've never heard of one that does. so we need to fix the bugs - the lock usage and the looping. At that point it merely becomes a performance concern to those who use it, which is the proper behaviour. If you want a faster non-checking one use flock(), or add another flag that is a Linux don't check for deadlock You can't fix the false EDEADLK detection without solving the halting problem. Best of luck with that. -- Intel are signing my paycheques ... these opinions are still mine Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC, PATCH] locks: remove posix deadlock detection
On Sun, Oct 28, 2007 at 05:50:30PM -0400, Trond Myklebust wrote: You can't fix the false EDEADLK detection without solving the halting problem. Best of luck with that. I can see that it would be difficult to do efficiently, but basically, this boils down to finding a circular path in a graph. That is hardly an unsolvable issue... Bzzt. You get a false deadlock with multiple threads like so: Thread A of task B takes lock 1 Thread C of task D takes lock 2 Thread C of task D blocks on lock 1 Thread E of task B blocks on lock 2 We currently declare deadlock at this point (unless the deadlock detection code has changed since I last looked at it), despite thread A being about to release lock 1. Oh, and by the way, thread E is capable of releasing lock 1, so you can't just say well, detect by thread instead of by task. So the only way I can see to accurately detect deadlock is to simulate the future execution of all threads in task B to see if any of them will release lock 1 without first gaining lock 2. Which, I believe, is halting-equivalent. -- Intel are signing my paycheques ... these opinions are still mine Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC, PATCH] locks: remove posix deadlock detection
On Sun, Oct 28, 2007 at 10:48:33PM +, Alan Cox wrote: Bzzt. You get a false deadlock with multiple threads like so: Thread A of task B takes lock 1 Thread C of task D takes lock 2 Thread C of task D blocks on lock 1 Thread E of task B blocks on lock 2 The spec and SYSV certainly ignore threading in this situation and you know that perfectly well (or did in 2004) The discussion petered out (or that mailing list archive lost articles from the thread) without any kind of resolution, or indeed interest. What is your suggestion for handling this problem? As it is now, the kernel 'detects' deadlock where there is none, which doesn't seem allowed by SuSv3 either. -- Intel are signing my paycheques ... these opinions are still mine Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC, PATCH] locks: remove posix deadlock detection
On Sun, Oct 28, 2007 at 11:55:52PM +0100, Jiri Kosina wrote: On Sun, 28 Oct 2007, Matthew Wilcox wrote: Bzzt. You get a false deadlock with multiple threads like so: Thread A of task B takes lock 1 Thread C of task D takes lock 2 Thread C of task D blocks on lock 1 Thread E of task B blocks on lock 2 A potential for deadlock occurs if a process controlling a locked region is put to sleep by attempting to lock another process' locked region. If the system detects that sleeping until a locked region is unlocked would cause a deadlock, fcntl() shall fail with an [EDEADLK] error. This is what POSIX says [1], even after being modified with respect to POSIX Threads Extension, right? So it doesn't deal with threads at all, just processess are taken into account. Probably for a reason :) Did you have a concrete suggestion, or are you just quoting the spec? The problem is that it's nonsense -- processes don't sleep, threads do. I think the key is would deadlock, not might deadlock. Our current behaviour is clearly in violation of SuSv3. -- Intel are signing my paycheques ... these opinions are still mine Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC, PATCH] locks: remove posix deadlock detection
On Sun, Oct 28, 2007 at 09:38:55PM +, Alan Cox wrote: It doesn't require the system to detect it, only mandate what to return if it does detect it. We should be detecting at least the obvious case. What is the obvious case? A task that has never called clone()? If SYSV only spots simple AB - BA deadlocks or taking the same lock twice yourself then that ought to be sufficient for us too. You can't deadlock against yourself -- either with POSIX locks or BSD locks (POSIX simple upgrades/downgrades the lock; each byte in a file can only be in either read-locked state or write-locked state for a given process. BSD locks release the lock and wake all waiters before attempting to acquire the lock in its new state). In my other post, I showed a simple AB-BA deadlock that can't be easily detected. -- Intel are signing my paycheques ... these opinions are still mine Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: SLUB performance regression vs SLAB
On Fri, Oct 05, 2007 at 08:48:53AM +0200, Jens Axboe wrote: I'd like to second Davids emails here, this is a serious problem. Having a reproducible test case lowers the barrier for getting the problem fixed by orders of magnitude. It's the difference between the problem getting fixed in a day or two and it potentially lingering for months, because email ping-pong takes forever and the test team has moved on to other tests, we'll let you know the results of test foo in 3 weeks time when we have a new slot on the box just removing any developer motivation to work on the issue. I vaguely remembered something called orasim, so I went looking for it. I found http://oss.oracle.com/~wcoekaer/orasim/ which is dated from 2004, and I found http://oss.oracle.com/projects/orasimjobfiles/ which seems to be a stillborn project. Is there anything else I should know about orasim? ;-) -- Intel are signing my paycheques ... these opinions are still mine Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
SLUB performance regression vs SLAB
On Mon, Oct 01, 2007 at 01:50:44PM -0700, Christoph Lameter wrote: The problem is with the weird way of Intel testing and communication. Every 3-6 month or so they will tell you the system is X% up or down on arch Y (and they wont give you details because its somehow secret). And then there are conflicting statements by the two or so performance test departments. One of them repeatedly assured me that they do not see any regressions. Could you cut out the snarky remarks? It takes a long time to run a test, and testing every one of the patches you send really isn't high on anyone's priority list. The performance team have also been having problems getting stable results with recent kernels, adding to the delay. The good news is that we do now have committment to testing upstream kernels, so you should see results more frequently than you have been. I'm taking over from Suresh as liason for the performance team, so if you hear *anything* from *anyone* else at Intel about performance, I want you to cc me about it. OK? And I don't want to hear any more whining about hearing different things from different people. So, on a well-known OLTP benchmark which prohibits publishing absolute numbers and on an x86-64 system (I don't think exactly which model is important), we're seeing *6.51%* performance loss on slub vs slab. This is with a 2.6.23-rc3 kernel. Tuning the boot parameters, as you've asked for before (slub_min_order=2, slub_max_order=4, slub_min_objects=8) gets back 0.38% of that. It's still down 6.13% over slab. For what it's worth, 2.6.23-rc3 already has a 1.19% regression versus RHEL 4.5, so the performance guys are really unhappy about going up to almost 8% regression. In the detailed profiles, __slab_free is the third most expensive function, behind only spin locks. get_partial_node is right behind it in fourth place, and kmem_cache_alloc is sixth. __slab_alloc is eight and kmem_cache_free is tenth. These positions don't change with the slub boot parameters. Now, where do we go next? I suspect that 2.6.23-rc9 has significant changes since -rc3, but I'd like to confirm that before kicking off another (expensive) run. Please, tell me what useful kernels are to test. -- Intel are signing my paycheques ... these opinions are still mine Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: SLUB performance regression vs SLAB
On Thu, Oct 04, 2007 at 10:38:15AM -0700, Christoph Lameter wrote: On Thu, 4 Oct 2007, Matthew Wilcox wrote: So, on a well-known OLTP benchmark which prohibits publishing absolute numbers and on an x86-64 system (I don't think exactly which model is important), we're seeing *6.51%* performance loss on slub vs slab. This is with a 2.6.23-rc3 kernel. Tuning the boot parameters, as you've asked for before (slub_min_order=2, slub_max_order=4, slub_min_objects=8) gets back 0.38% of that. It's still down 6.13% over slab. Yeah the fastpath vs. slow path is not the issue as Siddha and I concluded earlier. Seems that we are mainly seeing cacheline bouncing due to two cpus accessing meta data in the same page struct. The patches in MM that are scheduled to be merged for .24 address that issue. I have repeatedly asked that these patches be tested. The patches were posted months ago. I just checked with the guys who did the test. When I said -rc3, I mis-spoke; this is 2.6.23-rc3 *plus* the patches which Suresh agreed to test for you. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: SLUB performance regression vs SLAB
On Thu, Oct 04, 2007 at 01:48:34PM -0700, David Miller wrote: There comes a point where it is the reporter's responsibility to help the developer come up with a publishable test case the developer can use to work on fixing the problem and help ensure it stays fixed. That's a lot of effort. Is it more effort than doing some remote debugging with Christoph? I don't know. Using an unpublishable benchmark, whose results even cannot be published, really stretches the limits of reasonable don't you think? Yet here we stand. Christoph is aggressively trying to get slab removed from the tree. There is a testcase which shows slub performing worse than slab. It's not my fault I can't publish it. And just because I can't publish it doesn't mean it doesn't exist. Slab needs to not get removed until slub is as good a performer on this benchmark. -- Intel are signing my paycheques ... these opinions are still mine Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: SLUB performance regression vs SLAB
On Thu, Oct 04, 2007 at 01:55:37PM -0700, David Miller wrote: Anything, I do mean anything, can be simulated using small test programs. Pointing at a big fancy machine with lots of storage and disk is a passive aggressive way to avoid the real issues, in that nobody is putting forth the effort to try and come up with an at least publishable test case that Christoph can use to help you guys. If coming up with a reproducable and publishable test case is the difference between this getting fixed and it not getting fixed, are you going to invest the time to do that? If that's what it takes, then yes. But I'm far from convinced that it's as easy to come up with a TPC benchmark simulator as you think. There have been efforts in the past (orasim, for example), but presumably Christoph has already tried these benchmarks. -- Intel are signing my paycheques ... these opinions are still mine Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] isofs: add +w bit for non-RR discs
On Tue, Oct 02, 2007 at 08:00:26PM +0200, Jan Engelhardt wrote: Add %S_IWUGO bit for files on ISO-9660 filesystems without RockRidge Looks to me like you've added S_IWUSR, not S_IWUGO. - popt-mode = S_IRUGO | S_IXUGO; /* + popt-mode = S_IRUGO | S_IWUSR | S_IXUGO; - inode-i_mode = S_IRUGO | S_IXUGO | S_IFDIR; + inode-i_mode = S_IRUGO | S_IWUSR | S_IXUGO | S_IFDIR; -- Intel are signing my paycheques ... these opinions are still mine Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] fs: Correct SuS compliance for open of large file without options
On Thu, Sep 27, 2007 at 02:37:42PM -0400, Theodore Tso wrote: I'm reminded of Rusty's 2003 OLS Keynote, where he points out that what's important is not making an interface easy to use, but _hard_ _to_ _misuse_. That fact that sysfs is all laid out in a directory, but for which some directories/symlinks are OK to use, and some are NOT OK to use --- is why I call the sysfs interface an open pit. Sure, if you have the map to the minefield, a minefield is perfectly safe when you know what to avoid. But is that the best way to construct a path/interface for an application programmer to get from point A to point B? Maybe, maybe not. I agree with your criticism of sysfs (indeed, I think I've made many of them before in somewhat stronger language), but do you have a suggestion as to an interface that would be harder to misuse? -- Intel are signing my paycheques ... these opinions are still mine Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] fs: Correct SuS compliance for open of large file without options
On Thu, Sep 27, 2007 at 07:19:27PM -0400, Theodore Tso wrote: Would you accept a patch which causes the deprecated sysfs files/directories to disappear, even if CONFIG_SYS_DEPRECATED is defined, via a boot-time parameter? How about a mount option? That way people can test without a reboot: mount -o remount,deprecated={yes,no} /sys -- Intel are signing my paycheques ... these opinions are still mine Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 1/2] VFS: new fgetattr() file operation
On Mon, Sep 24, 2007 at 02:48:08PM +0200, Miklos Szeredi wrote: and if that means adding silly rename support so be it. That's what is done currently. But it's has various dawbacks, like rmdir doesn't work if there are open files within an otherwise empty directory. I'd happily accept suggestions on how to deal with this differenty. Only sillyrename files with nlink 1? I don't see how attributes can change anything for a deleted file. -- Intel are signing my paycheques ... these opinions are still mine Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 1/2] VFS: new fgetattr() file operation
On Mon, Sep 24, 2007 at 03:06:06PM +0200, Miklos Szeredi wrote: A file isn't deleted while there are still links or open files refering to it. So getting the attributes for a file with nlink==0 is perfectly valid while the file is still open. Is it? Why not just pretend that the attributes are wiped when the file is deleted. Effectively, they are, since they can't affect anything. If a network filesystem protocol can't handle operations (be it data or metadata) on an unlinked file, we must do sillirenaming, so that the file is not actually unlinked. Or you could call getattr right before you unlink and cache the result in the client. -- Intel are signing my paycheques ... these opinions are still mine Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 0/2] avoid clobbering registers with J_ASSERT macro
On Mon, Aug 20, 2007 at 04:22:04PM +0100, Stephen C. Tweedie wrote: Hi, On Mon, 2007-08-20 at 09:18 -0400, Chris Snook wrote: How's about we just remove that printk? Do #define J_ASSERT(e) BUG_ON(e)? ITYM #define J_ASSERT(e) BUG_ON(!e) It did. The original J_ASSERT predates BUG() entirely, and was added so that we got the file/line-no information. But with the current BUG() macro, I can't see any reason for J_ASSERT still to try to gather that information itself. Do you still want to keep J_ASSERT, or should all uses of it be replaced with BUG_ON? (to put it another way; if you were writing JBD now, would you add your own J_ASSERT, or would you just use BUG_ON directly?) -- Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Advocacy] Re: 3ware 9650 tips
On Mon, Jul 16, 2007 at 08:40:00PM +0300, Al Boldi wrote: XFS surely rocks, but it's missing one critical component: data=ordered And that's one component that's just too critical to overlook for an enterprise environment that is built on data-integrity over performance. So that's the secret why people still use ext3, and XFS' reliance on external hardware to ensure integrity is really misplaced. Now, maybe when we get the data=ordered onto the VFS level, then maybe XFS may become viable for the enterprise, and ext3 may cease to be KING. Wow, thanks for bringing an advocacy thread onto linux-fsdevel. Just what we wanted. Do you have any insight into how to get the data=ordered onto the VFS level? Because to me, that sounds like pure nonsense. -- Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] VFS: data=ordered (was: [Advocacy] Re: 3ware 9650 tips)
On Mon, Jul 16, 2007 at 09:28:08PM +0300, Al Boldi wrote: Well, conceptually it sounds like a piece of cake, technically your guess is as good as mine. IIRC, akpm once mentioned something like this. How much have you looked at the VFS? There's nothing journalling-related in the VFS right now. ext3 and XFS share no common journalling code, nor do I think that would be possible, due to the very different concepts they have of journalling. Here's a good hint: $ find fs -type f |xargs grep -l journal_start fs/ext3/acl.c fs/ext3/inode.c fs/ext3/ioctl.c fs/ext3/namei.c fs/ext3/resize.c fs/ext3/super.c fs/ext3/xattr.c fs/ext4/acl.c fs/ext4/extents.c fs/ext4/inode.c fs/ext4/ioctl.c fs/ext4/namei.c fs/ext4/resize.c fs/ext4/super.c fs/ext4/xattr.c fs/jbd/journal.c fs/jbd/transaction.c fs/jbd2/journal.c fs/jbd2/transaction.c fs/ocfs2/journal.c fs/ocfs2/super.c JBD and JBD2 provide a journalling implementation that ext3, ext4 and ocfs2 use. Note that XFS doesn't, it has its own journalling code. If you want XFS to support data=ordered, talk to the XFS folks. Or start picking through XFS yourself, of course -- you do have the source code. -- Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [ANNOUNCE] util-linux-ng 2.13-rc1
On Thu, Jul 05, 2007 at 11:30:20PM +0200, Karel Zak wrote: The package build system is now based on autotools. The build system supports separate CFLAGS and LDFLAGS for suid programs (SUID_CFLAGS, SUID_LDFLAGS). For more details see the README file And this is really dumb. autotools is a completely pain in the ass and Well, Adrian Bunk added autotools stuff to util-linux during his work on v2.13. This stuff has been fixed and stabilized in util-linux-ng v2.13. I'm not fanatical autotools protagonist, but it seems useful in util-linux. We will see... I'm ready to change or fix arbitrary thing in util-linux-ng, but I always need a real reason -- bug report, new feature, or so. This discussion is about impressions and feelings only. No, it's based on long, hard and painful experiences attempting to debug the fuckups that autoconf creates when it goes wrong. -- Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Read/write counts
On Mon, Jun 04, 2007 at 09:56:07AM -0700, Bryan Henderson wrote: Programs that assume a full transfer are fairly common, but are universally regarded as either broken or just lazy, and when it does cause a problem, it is far more common to fix the application than the kernel. Linus has explicitly forbidden short reads from being returned. The original poster may get away with it for a specialised case, but for example, signals may not cause a return to userspace with a short read for exactly this reason. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [NFS] [PATCH] locks: provide a file lease method enabling cluster-coherent leases
On Fri, Jun 01, 2007 at 12:44:16PM -0400, J. Bruce Fields wrote: The only problem I'm aware of is that leases aren't broken on rename, link, and unlink. This is kind of tricky to fix. David Richter (cc'd) and I sketched out a few different approaches, and I think he has some patches implementing at least one of them. This may require defining some new type of lease, to avoid changing the documented behavior of the fcntl lease operations (which only break on open). Although actually I believe Samba needs the same behavior we do, and they're probably the most important user of leases Samba internally prohibits renaming or deleting an open file, to match Windows semantics. So it won't notice the difference. At least, that's what I remember from a discussion with Tridge when we were implementing leases back in 2000. I think it's an acceptable change in Linux semantics to break leases on rename/delete/link. Though I'm not quite sure why you need to -- nobody else is touching the contents of the file, so it's not like you need to write the data back to it, or discard your cached copy of it in the case of a read-only lease. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [nameidata 1/2] Don't pass NULL nameidata to vfs_create
On Mon, Apr 16, 2007 at 06:11:30PM +0200, Andreas Gruenbacher wrote: +static inline int +nfsd_do_create(struct inode *dir, struct dentry *child, struct vfsmount *mnt, +int mode) +{ + struct nameidata nd = { + .dentry = child, + .mnt = mnt, + }; + + return vfs_create(dir, child, mode, nd); +} + Wouldn't it normally result in fewer instructions (on most architectures ... maybe not x86) to keep the same argument order as vfs_create? ie: static inline int nfsd_do_create(struct inode *dir, struct dentry *child, int mode, struct vfsmount *mnt) - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: REISER4 FOR INCLUSION IN THE LINUX KERNEL.
On Sun, Apr 08, 2007 at 09:16:59PM -0700, [EMAIL PROTECTED] wrote: REISER4 FOR INCLUSION IN THE LINUX KERNEL. Fuck off. Cheerleading like this only hurts your cause. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: forced umount?
On Sun, Mar 18, 2007 at 08:16:19PM +0100, Arjan van de Ven wrote: the problem with the people who say they want forced umount is.. that most of the time they either want 1) get rid of the namespace entry or 2) want to stop any and all IO to a certain device/partition There is a third component - they want to deliver a fatal signal to all processes which are waiting on IO to that sb. My scenario here is a machine with an NFS mount of a server which has gone down. Cronjobs which scan the whole filesystem (eg updatedb) soon pile up sleeping on access. Equally, if one has one's ogg collection stored on said NFS server, the ogg player will be in uninterruptible sleep while holding the sound device open, preventing other applications from making sounds. It's desirable to be able to kill these apps dead, and the usual suggestion of 'mount it soft,intr' isn't the greatest idea (and somewhat hard to change after the fact). I remember Linus suggesting a sleeping state between UNINTERRUPTIBLE and INTERRUPTIBLE which would be FATAL_SIGNALS_ONLY. The usual problem (of short reads) isn't a problem if the task is only going to die when it receives them. Has anyone investigated this in any detail? Perhaps I'll take a look at doing it next week. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH] sys_fallocate() system call
On Fri, Mar 16, 2007 at 05:17:04PM +0100, Heiko Carstens wrote: +asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len) e.g. asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len, int mode) would work even on s390 ;) How about: asmlinkage long sys_fallocate(int fd, int mode, u32 off_low, u32 off_high, u32 len_low, u32 len_high); That way we all suffer equally ... - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Rename file over another with the same inode number fails silently.
On Thu, Feb 08, 2007 at 11:48:16AM -0500, John Muir wrote: The attached test program creates a file, and then some hard links to that file (file0 - fileN). The test program then attempts to rename(fileN, file) for every hard link created. My expectation is that the hard links file0 - fileN would simply disappear, or that rename would respond with an error result and an appropriate errno value indicating the problem. My observation is that the hard links file0 to fileN do not in fact disappear and rename returns 0. Do I have the wrong expectations? If so, why should I have to stat the files to determine if they are the same inode before I rename? This is a POSIX requirement. See http://www.opengroup.org/onlinepubs/009695399/functions/rename.html where it says, If the old argument and the new argument resolve to the same existing file, rename() shall return successfully and perform no other action. There's not really much point in discussing whether your expectations are wrong or whether the extra work you have to do is silly ... POSIX says so, so we have to behave this way. Personally, I think this is an unexpected oversight on POSIX's part. The rationale refers to 'rename(x, x);', and doesn't discuss hard links. Possibly something to raise with the POSIX ctte, but I doubt they'd be interested in changing this now ... perhaps if you can find some other unices which behave differently, you might have a shot. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Symbolic links vs hard links
On Wed, Jan 10, 2007 at 09:38:11AM -0800, Bryan Henderson wrote: Other people are of the opinion that the invention of the symbolic link was a huge mistake. I guess I haven't heard that one. What is the argument that we were better off without symbolic links? I suppose http://www.cs.bell-labs.com/sys/doc/lexnames.html is as good a presentation of that argument as any, though there's also some good stuff in the Unix Haters Handbook (pages 164-165 in particular). - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] ext2: conditional removal of NFSD code
On Sat, Jan 06, 2007 at 10:58:31PM +0300, Alexey Dobriyan wrote: Nor me nor my box is going to act as NFS server, so ifdef all exporting code. @@ -916,7 +918,9 @@ static int ext2_fill_super(struct super_ * set up enough so that it can read an inode */ sb-s_op = ext2_sops; +#if defined(CONFIG_EXPORTFS) || defined(CONFIG_EXPORTFS_MODULE) sb-s_export_op = ext2_export_ops; +#endif To avoid putting ifdefs within a function, how about adding: #if defined(CONFIG_EXPORTFS) || defined(CONFIG_EXPORTFS_MODULE) #define set_export_ops(sb, ops) sb-s_export_op = ops #else #define set_export_ops(sb, ops) 0 #endif That way you can get rid of the function pointer from the struct superblock too. But Dave Woodhouse is going to kill you for adding another CONFIG_*_MODULE dependency. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Finding hardlinks
On Wed, Jan 03, 2007 at 01:33:31PM +0100, Miklos Szeredi wrote: High probability is all you have. Cosmic radiation hitting your computer will more likly cause problems, than colliding 64bit inode numbers ;) Some of us have machines designed to cope with cosmic rays, and would be unimpressed with a decrease in reliability. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: inode i_blksize problem
On Wed, Dec 20, 2006 at 01:26:56PM +0100, Sergio Paracuellos wrote: I am trying to compile a module for kernel 2.6.18-1 that uses the 'inode struct' but the compiler tell me inode struct hasn't a member called i_blksize. I don't have that problem in kernel 2.6.16. What happend with i_blksize? git-log reveals: commit ba52de123d454b57369f291348266d86f4b35070 Author: Theodore Ts'o [EMAIL PROTECTED] Date: Wed Sep 27 01:50:49 2006 -0700 [PATCH] inode-diet: Eliminate i_blksize from the inode structure This eliminates the i_blksize field from struct inode. Filesystems that wan t to provide a per-inode st_blksize can do so by providing their own getattr routine instead of using the generic_fillattr() function. Note that some filesystems were providing pretty much random (and incorrect) values for i_blksize. [EMAIL PROTECTED]: cleanup] [EMAIL PROTECTED]: generic_fillattr() fix] Signed-off-by: Theodore Ts'o [EMAIL PROTECTED] Signed-off-by: Adrian Bunk [EMAIL PROTECTED] Signed-off-by: Andrew Morton [EMAIL PROTECTED] Signed-off-by: Linus Torvalds [EMAIL PROTECTED] So, it's gone. I'd just delete that line from your sources if I were you. By the way, what kind of module is this? Whatever it's doing looks pretty dodgy to me. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: NFSv4/pNFS possible POSIX I/O API standards
On Sun, Dec 17, 2006 at 11:07:27AM -0800, Ulrich Drepper wrote: And how often do the scripts which are in everyday use require such a command? And the same for the other programs. I know that the rsync load is a major factor on kernel.org right now. With all the git trees (particularly the ones that people haven't packed recently), there's a lot of files in a lot of directories. If readdirplus would help this situation, it would definitely have a real world benefit. Obviously, I haven't done any measurements or attempted to quantify what the improvement would be. For those not familiar with a git repo, it has an 'objects' directory with 256 directories named 00 to ff. Each of those directories can contain many files (with names like '8cd5bbfb4763322837cd1f7c621f02ebe22fef') Once a file is written, it is never modified, so all rsync needs to do is be able to compare the timestamps and sizes and notice they haven't changed. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: openg and path_to_handle
On Thu, Dec 14, 2006 at 03:00:41PM -0600, Rob Ross wrote: I don't think that I understand what you're saying here. The openg() call does not perform file open (not that that is necessarily even a first-class FS operation), it simply does the lookup. When we were naming these calls, from a POSIX consistency perspective it seemed best to keep the open nomenclature. That seems to be confusing to some. Perhaps we should rename the function lookup or something similar, to help keep from giving the wrong idea? There is a difference between the openg() and path_to_handle() approach in that we do permission checking at openg(), and that does have implications on how the handle might be stored and such. That's being discussed in a separate thread. I was just thinking about how one might implement this, when it struck me ... how much more efficient is a kernel implementation compared to: int openg(const char *path) { char *s; do { s = tempnam(FSROOT, .sutoc); link(path, s); } while (errno == EEXIST); mpi_broadcast(s); sleep(10); unlink(s); } and sutoc() becomes simply open(). Now you have a name that's quick to open (if a client has the filesystem mounted, it has a handle for the root already), has a defined lifespan, has minimal permission checking, and doesn't require standardisation. I suppose some cluster fs' might not support cross-directory links (AFS is one, I think), but then, no cluster fs's support openg/sutoc. If a filesystem's willing to add support for these handles, it shouldn't be too hard for them to treat files starting .sutoc specially, and as efficiently as adding the openg/sutoc concept. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: openg and path_to_handle
On Thu, Dec 07, 2006 at 07:40:05AM +1100, David Chinner wrote: Permission checks are done on the path_to_handle(), so in reality only root or CAP_SYS_ADMIN users can currently use the open_by_handle interface because of this lack of checking. Given that our current users of this interface need root permissions to do other things (data migration), this has never been an issue. This is an implementation detail - it is possible that file handle, being opaque, could encode a UID/GID of the user that constructed the handle and then allow any process with the same UID/GID to use open_by_handle() on that handle. (I think hch has already pointed this out.) While it could do that, I'd be interested to see how you'd construct the handle such that it's immune to a malicious user tampering with it, or saving it across a reboot, or constructing one from scratch. I suspect any real answer to this would have to involve cryptographical techniques (say, creating a secure hash of the information plus a boot-time generated nonce). Now you're starting to use a lot of bits, and compute time, and you'll need to be sure to keep the nonce secret. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: openg and path_to_handle
On Wed, Dec 06, 2006 at 03:09:10PM -0700, Andreas Dilger wrote: Considering that filesystems like GFS and OCFS allow clients DIRECT ACCESS to the block device itself (which no amount of authentication will fix, unless it is in the disks themselves), the risk of passing a file handle around is pretty minimal. That's either disingenuous, or missing the point. OCFS/GFS allow the kernel direct access to the block device. openg()sutoc() are about passing around file handles to untrusted users. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: NFSv4/pNFS possible POSIX I/O API standards
On Tue, Dec 05, 2006 at 10:07:48AM +, Christoph Hellwig wrote: The filehandle idiocy on the other hand is way of into crackpipe land. Right, and it needs to be discarded. Of course, there was a real problem that it addressed, so we need to come up with an acceptable alternative. The scenario is a cluster-wide application doing simultaneous opens of the same file. So thousands of nodes all hitting the same DLM locks (for read) all at once. The openg() non-solution implies that all nodes in the cluster share the same filehandle space, so I think a reasonable solution can be implemented entirely within the clusterfs, with an extra flag to open(), say O_CLUSTER_WIDE. When the clusterfs sees this flag set (in -lookup), it can treat it as a hint that this pathname component is likely to be opened again on other nodes and broadcast that fact to the other nodes within the cluster. Other nodes on seeing that hint (which could be structured as The child bin of filehandle e62438630ca37539c8cc1553710bbfaa3cf960a7 has filehandle ff51a98799931256b555446b2f5675db08de6229) can keep a record of that fact. When they see their own open, they can populate the path to that file without asking the server for extra metadata. There's obviously security issues there (why I say 'hint' rather than 'command'), but there's also security problems with open-by-filehandle. Note that this solution requires no syscall changes, no application changes, and also helps a scenario where each node opens a different file in the same directory. I've never worked on a clusterfs, so there may be some gotchas (eg, how do you invalidate the caches of nodes when you do a rename). But this has to be preferable to open-by-fh. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: NFSv4/pNFS possible POSIX I/O API standards
On Wed, Nov 29, 2006 at 09:04:50AM +, Christoph Hellwig wrote: - openg/sutoc No way. We already have a very nice file descriptor abstraction. You can pass file descriptors over unix sockets just fine. Yes, but it behaves like dup(). Gary replied to me off-list (which I didn't notice and continued replying to him off-list). I wrote: Is this for people who don't know about dup(), or do they need independent file offsets? If the latter, I think an xdup() would be preferable (would there be a security issue for OSes with revoke()?) Either that, or make the key be useful for something else. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: NFSv4/pNFS possible POSIX I/O API standards
On Wed, Nov 29, 2006 at 05:23:13AM -0700, Matthew Wilcox wrote: On Wed, Nov 29, 2006 at 09:04:50AM +, Christoph Hellwig wrote: - openg/sutoc No way. We already have a very nice file descriptor abstraction. You can pass file descriptors over unix sockets just fine. Yes, but it behaves like dup(). Gary replied to me off-list (which I didn't notice and continued replying to him off-list). I wrote: Is this for people who don't know about dup(), or do they need independent file offsets? If the latter, I think an xdup() would be preferable (would there be a security issue for OSes with revoke()?) Either that, or make the key be useful for something else. I further wonder if these people would see appreciable gains from doing sutoc rather than doing openat(dirfd, basename, flags, mode); If they did, I could also see openat being extended to allow dirfd to be a file fd, as long as pathname were NULL or a pointer to NUL. But with all the readx stuff being proposed, I bet they don't really need independent file offsets. That's, like, so *1970*s. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: NFSv4/pNFS possible POSIX I/O API standards
On Mon, Nov 27, 2006 at 09:34:05PM -0700, Gary Grider wrote: Things like openg() - on process opens a file and gets a key that is passed to lots of processes which use the key to get a handle (great for thousands of processes opening a file) I don't understand how this leads to a more efficient implementation. It seem to just add complexity. What does 'sutoc' mean anyway? readx/writex - scattergather readwrite - more appropriate and complete than the real time extended read/write These don't seem to be documented on the website. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH] ensure i_ino uniqueness in filesystems without permanent inode numbers (via pointer conversion)
On Fri, Nov 17, 2006 at 08:43:00AM -0500, Jeff Layton wrote: 2) this scheme would effectively leak inode addresses into userspace. I'm not sure if that would be exploitable, but it's probably best not to do it. The patch adds a static unsigned int that is initialized to a random value at boot time. We'll xor the inode offset with this value. That should allow for a unique i_ino value, but since the xor mask would be secret, it shouldn't be possible to turn it back into an address. There may be a more secure way to do this. I'm definitely open to suggestions here. I *think* the xor mask is mere obfuscation. It looks likely that you can recover it with a little bit of trial and error. If you can force the filesystem to hand you back new inodes quickly such that there is a high probability you get consecutive allocations, you'll get a sequence which would be spaced 700-odd bytes apart, except that it's been xored. Since you know it's incrementing, if you see the sequence decrease, you'll know that was a 1 in that bit. It'd be a bit more complex than that, and cryptanalysis was never my forte, but I suspect we should either use a folded hash like md5, or give up and just divide the address by sizeof(struct inode). Sure, divides are slow, but this is a divide by a constant, so it shouldn't be that bad. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] Make journal_commit_transaction() more understandable
journal_commit_transaction() is still 650+ lines long and contains 16 local variables. By moving phase 3 into its own function, we reduce its length by 150+ lines and reduce it to 5 local variables. Signed-off-by: Matthew Wilcox [EMAIL PROTECTED] commit.c | 367 +-- 1 files changed, 194 insertions(+), 173 deletions(-) Index: linux-2.6/fs/jbd/commit.c === RCS file: /var/lib/cvs/linux-2.6/fs/jbd/commit.c,v retrieving revision 1.14 diff -u -p -r1.14 commit.c --- linux-2.6/fs/jbd/commit.c 4 Apr 2005 20:47:15 - 1.14 +++ linux-2.6/fs/jbd/commit.c 29 Aug 2005 15:59:02 - @@ -93,6 +93,189 @@ static int inverted_lock(journal_t *jour return 1; } +/* + * Way to go: we have now written out all of the data for a transaction! + * Now comes the tricky part: we need to write out metadata. Loop over the + * transaction's entire buffer list: + */ +static int journal_write_metadata(journal_t *journal, + transaction_t *commit_transaction) +{ + char *tagp = NULL; + struct journal_head *descriptor = NULL; + int flags; + unsigned long blocknr; + journal_header_t *header; + journal_block_tag_t *tag = NULL; + int space_left = 0; + int first_tag = 0; + int tag_flag; + int i; + int err = 0; + int bufs = 0; + + struct buffer_head **wbuf = journal-j_wbuf; + + commit_transaction-t_state = T_COMMIT; + + while (commit_transaction-t_buffers) { + + /* Find the next buffer to be journaled... */ + + struct journal_head *jh, *new_jh; + jh = commit_transaction-t_buffers; + + /* If we're in abort mode, we just un-journal the buffer and + release it for background writing. */ + + if (is_journal_aborted(journal)) { + JBUFFER_TRACE(jh, journal is aborting: refile); + journal_refile_buffer(journal, jh); + /* If that was the last one, we need to clean up +* any descriptor buffers which may have been +* already allocated, even if we are now +* aborting. */ + if (!commit_transaction-t_buffers) + goto start_journal_io; + continue; + } + + /* Make sure we have a descriptor block in which to + record the metadata buffer. */ + + if (!descriptor) { + struct buffer_head *bh; + + J_ASSERT (bufs == 0); + + jbd_debug(4, JBD: get descriptor\n); + + descriptor = journal_get_descriptor_buffer(journal); + if (!descriptor) { + __journal_abort_hard(journal); + continue; + } + + bh = jh2bh(descriptor); + jbd_debug(4, JBD: got buffer %llu (%p)\n, + (unsigned long long)bh-b_blocknr, bh-b_data); + header = (journal_header_t *)bh-b_data[0]; + header-h_magic = cpu_to_be32(JFS_MAGIC_NUMBER); + header-h_blocktype = cpu_to_be32(JFS_DESCRIPTOR_BLOCK); + header-h_sequence = cpu_to_be32(commit_transaction-t_tid); + + tagp = bh-b_data[sizeof(journal_header_t)]; + space_left = bh-b_size - sizeof(journal_header_t); + first_tag = 1; + set_buffer_jwrite(bh); + set_buffer_dirty(bh); + wbuf[bufs++] = bh; + + /* Record it so that we can wait for IO + completion later */ + BUFFER_TRACE(bh, ph3: file as descriptor); + journal_file_buffer(descriptor, commit_transaction, + BJ_LogCtl); + } + + /* Where is the buffer to be written? */ + + err = journal_next_log_block(journal, blocknr); + /* If the block mapping failed, just abandon the buffer + and repeat this loop: we'll fall into the + refile-on-abort condition above. */ + if (err) { + __journal_abort_hard(journal); + continue; + } + + /* +* start_this_handle() uses t_outstanding_credits to determine +* the free space in the log, but this counter is changed +* by journal_next_log_block() also. +*/ + commit_transaction-t_outstanding_credits
Re: Kernel bug: Bad page state: related to generic symlink code and mmap
On Fri, Aug 19, 2005 at 08:38:34PM +0100, Al Viro wrote: FWIW, I'd rather take page_symlink(), page_symlink_inode_operations, page_put_link(), page_follow_link_light(), page_readlink(), page_getlink(), generic_readlink() and vfs_readlink() to the same place where these guys would live. They all belong together and none of them has any business in fs/namei.c. Options: fs/libfs.c or separate library since fs/libfs.c is getting crowded. Linus, do you have any objections to that or suggestions on filename here? fs/symlink.c? -- Next the statesmen will invent cheap lies, putting the blame upon the nation that is attacked, and every man will be glad of those conscience-soothing falsities, and will diligently study them, and refuse to examine any refutations of them; and thus he will by and by convince himself that the war is just, and will thank God for the better sleep he enjoys after this process of grotesque self-deception. -- Mark Twain - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] fs/fcntl.c : don't test unsigned value for less than zero
On Fri, Apr 15, 2005 at 10:03:05PM +1000, Herbert Xu wrote: I suppose it could be smart and stay quiet about val 0 || val BOUND However, gcc is slow enough as it is without adding unnecessary smarts like this. It only warns with -W on, not with -Wall, so I see no compelling reason to fix this. I think the real problem here is that 'arg' is declared 2 pages earlier in the function prototype (aka the function-growth-hormone-imbalance syndrome). There's two good ways of fixing this, adding a f_setsig() function: static inline int f_setsig(struct file *filp, unsigned long arg) { if (arg _NSIG) return -EINVAL; filp-f_owner.signum = arg; return 0; } ... case F_SETSIG: err = f_setsig(filp, arg); break; or add a function that checks a variable to see if it's a valid signal number: #define valid_signal(arg) ((unsigned long)arg = _NSIG) ... case F_SETSIG: if (!valid_signal(arg)) break; err = 0; filp-f_owner.signum = arg; break; Looks like futex.c, ptrace.c, signal.c, sys.c and almost every architecture's ptrace code could easily make use of the latter, but not the former. It also looks like we have a few off-by-one errors. For example, in h8300's ptrace code: case PTRACE_SYSCALL: case PTRACE_CONT: { ret = -EIO; if ((unsigned long) data = _NSIG) break ; but case PTRACE_SINGLESTEP: { ret = -EIO; if ((unsigned long) data _NSIG) break; so I'd recommend the second solution. But be careful not to fix up cases like: ./kernel/exit.c:if (sig 1 || sig _NSIG) where we really don't want to allow zero. -- Next the statesmen will invent cheap lies, putting the blame upon the nation that is attacked, and every man will be glad of those conscience-soothing falsities, and will diligently study them, and refuse to examine any refutations of them; and thus he will by and by convince himself that the war is just, and will thank God for the better sleep he enjoys after this process of grotesque self-deception. -- Mark Twain - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] fs/fcntl.c : don't test unsigned value for less than zero
On Fri, Apr 15, 2005 at 03:07:42AM +0200, Jesper Juhl wrote: 'arg' is unsigned so it can never be less than zero, so testing for that is pointless and also generates a warning when building with gcc -W. This patch eliminates the pointless check. Didn't Linus already reject this one 6 months ago? -- Next the statesmen will invent cheap lies, putting the blame upon the nation that is attacked, and every man will be glad of those conscience-soothing falsities, and will diligently study them, and refuse to examine any refutations of them; and thus he will by and by convince himself that the war is just, and will thank God for the better sleep he enjoys after this process of grotesque self-deception. -- Mark Twain - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: mmap question
On Mon, Mar 21, 2005 at 01:33:55PM -0800, Bryan Henderson wrote: It looks to me like you're running into the fundamental limitation that the CPU doesn't notify Linux every time you store into a memory location. It does, though, set the dirty flag in the page table, and Linux eventually inspects that flag and finds out that you have stored in the past. At that time, it can call set_page_dirty. well, we *could* know ... never map this page writable. have a per-vma flag that says emulate writes, and call the filesystem to update backing storage before returning to the application. the price of doing this would be exorbitant. -- Next the statesmen will invent cheap lies, putting the blame upon the nation that is attacked, and every man will be glad of those conscience-soothing falsities, and will diligently study them, and refuse to examine any refutations of them; and thus he will by and by convince himself that the war is just, and will thank God for the better sleep he enjoys after this process of grotesque self-deception. -- Mark Twain - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Efficient handling of sparse files
This problem came up with the systemimager program which uses rsync to install files from a master server to many clients. Red Hat has a system user with uid 2^32-1 which causes lastlog to grow to 1.2GB in size. rsync does understand the concept of sparse files (with the -S flag), but it has to read every block to discover that it is indeed empty. This sucks. I was wondering if we could introduce a new system call (or ioctl?) that, given an fd would find the next block with data in it. We could use the -bmap method ... except that has dire warnings about adding new callers and viro may soon be in testicle-gouging range. One system interface hack would be to introduce lseek(fd, 0, SEEK_DATA) ... but without permission to reuse -bmap for this purpose, it's pointless to discuss user interfaces. Suggestions? -- Next the statesmen will invent cheap lies, putting the blame upon the nation that is attacked, and every man will be glad of those conscience-soothing falsities, and will diligently study them, and refuse to examine any refutations of them; and thus he will by and by convince himself that the war is just, and will thank God for the better sleep he enjoys after this process of grotesque self-deception. -- Mark Twain - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Help!
On Fri, Feb 04, 2005 at 03:58:04PM +0530, Anirban Mukherjee wrote: I am doing a project on Ext2fs and Ext3fs and I require some materials on the physical structure of Ext2fs.It would be very helpful if someone sends me some information(or links where to find it) along with some general stuff on Ext2fs and Ext3fs. Please do my homework for me Did you even look in Documentation/filesystems/ext{2,3}.txt? -- Next the statesmen will invent cheap lies, putting the blame upon the nation that is attacked, and every man will be glad of those conscience-soothing falsities, and will diligently study them, and refuse to examine any refutations of them; and thus he will by and by convince himself that the war is just, and will thank God for the better sleep he enjoys after this process of grotesque self-deception. -- Mark Twain - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
NFSD needs EXPORTFS
Got this report about 2.6.11-rc3. Is this the correct solution? - Forwarded message from Joel Soete [EMAIL PROTECTED] - A short analyse, it seems that's because NFSD was builtin while EXPORTFS was a module in my previous config file. Imho EXPORTFS would be build as NFSD? Is the following hunk would do the trick: --- fs/Kconfig.Orig 2005-02-03 16:45:13.562275206 +0100 +++ fs/Kconfig 2005-02-03 16:46:36.496469111 +0100 @@ -1400,6 +1400,7 @@ tristate NFS server support depends on INET select LOCKD + select EXPORTFS select SUNRPC help If you want your Linux box to act as an NFS *server*, so that other Thanks in advance, Joel - End forwarded message - -- Next the statesmen will invent cheap lies, putting the blame upon the nation that is attacked, and every man will be glad of those conscience-soothing falsities, and will diligently study them, and refuse to examine any refutations of them; and thus he will by and by convince himself that the war is just, and will thank God for the better sleep he enjoys after this process of grotesque self-deception. -- Mark Twain - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: RFC: [PATCH-2.6] Add helper function to lock multiple page cache pages.
On Wed, Feb 02, 2005 at 03:12:50PM +, Anton Altaparmakov wrote: I think the below loop would be clearer as a for loop ... err = 0; for (nr = 0; nr nr_pages; nr++, start++) { if (start == lp_idx) { pages[nr] = locked_page; if (!nr) continue; lock_page(locked_page); if (!wbc) continue; if (wbc-for_reclaim) { up(inode-i_sem); up_read(inode-i_sb-s_umount); } /* Was the page truncated under us? */ if (page_mapping(locked_page) != mapping) { err = -ESTALE; goto err_out_locked; } } else { pages[nr] = find_lock_page(mapping, start); if (pages[nr]) continue; if (!cached_page) { cached_page = alloc_page(gfp_mask); if (unlikely(!cached_page)) goto err_out; } err = add_to_page_cache_lru(cached_page, mapping, start, gfp_mask); if (unlikely(err)) { if (err == -EEXIST) continue; goto err_out; } pages[nr] = cached_page; cached_page = NULL; } } The above fixes two bugs in the below: - if (!unlikely(cached_page)) should be if (unlikely(!cached_page)) - The -EEXIST case after add_to_page_cache_lru() would result in an infinite loop in the original as nr wasn't being incremented. + err = nr = 0; + while (nr nr_pages) { + if (start == lp_idx) { + pages[nr] = locked_page; + if (nr) { + lock_page(locked_page); + if (wbc) { + if (wbc-for_reclaim) { + up(inode-i_sem); + up_read(inode-i_sb-s_umount); + } + /* Was the page truncated under us? */ + if (page_mapping(locked_page) != + mapping) { + err = -ESTALE; + goto err_out_locked; + } + } + } + } else { + pages[nr] = find_lock_page(mapping, start); + if (!pages[nr]) { + if (!cached_page) { + cached_page = alloc_page(gfp_mask); + if (!unlikely(cached_page)) + goto err_out; + } + err = add_to_page_cache_lru(cached_page, + mapping, start, gfp_mask); + if (unlikely(err)) { + if (err == -EEXIST) + continue; + goto err_out; + } + pages[nr] = cached_page; + cached_page = NULL; + } + } + nr++; + start++; + } -- Next the statesmen will invent cheap lies, putting the blame upon the nation that is attacked, and every man will be glad of those conscience-soothing falsities, and will diligently study them, and refuse to examine any refutations of them; and thus he will by and by convince himself that the war is just, and will thank God for the better sleep he enjoys after this process of grotesque self-deception. -- Mark Twain - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] Minor ext2 speedup
Port Andreas Dilger's and Jan Kara's patch for ext3 to ext2. Also some whitespace changes to get ext2/ext3 closer in sync. Signed-off-by: Matthew Wilcox [EMAIL PROTECTED] Index: fs/ext2/balloc.c === RCS file: /var/cvs/linux-2.6/fs/ext2/balloc.c,v retrieving revision 1.4 diff -u -p -r1.4 balloc.c --- fs/ext2/balloc.c13 Sep 2004 15:23:44 - 1.4 +++ fs/ext2/balloc.c24 Jan 2005 22:10:01 - @@ -6,7 +6,7 @@ * Laboratoire MASI - Institut Blaise Pascal * Universite Pierre et Marie Curie (Paris VI) * - * Enhanced block allocation by Stephen Tweedie ([EMAIL PROTECTED]), 1993 + * Enhanced block allocation by Stephen Tweedie ([EMAIL PROTECTED]), 1993 * Big-endian to little-endian byte-swapping/bitmaps by *David S. Miller ([EMAIL PROTECTED]), 1995 */ @@ -52,9 +52,9 @@ struct ext2_group_desc * ext2_get_group_ return NULL; } - - group_desc = block_group / EXT2_DESC_PER_BLOCK(sb); - offset = block_group % EXT2_DESC_PER_BLOCK(sb); + + group_desc = block_group EXT2_DESC_PER_BLOCK_BITS(sb); + offset = block_group (EXT2_DESC_PER_BLOCK(sb) - 1); if (!sbi-s_group_desc[group_desc]) { ext2_error (sb, ext2_get_group_desc, Group descriptor not loaded - @@ -62,7 +62,7 @@ struct ext2_group_desc * ext2_get_group_ block_group, group_desc, offset); return NULL; } - + desc = (struct ext2_group_desc *) sbi-s_group_desc[group_desc]-b_data; if (bh) *bh = sbi-s_group_desc[group_desc]; @@ -236,12 +236,12 @@ do_more: for (i = 0, group_freed = 0; i count; i++) { if (!ext2_clear_bit_atomic(sb_bgl_lock(sbi, block_group), - bit + i, (void *) bitmap_bh-b_data)) - ext2_error (sb, ext2_free_blocks, - bit already cleared for block %lu, - block + i); - else + bit + i, bitmap_bh-b_data)) { + ext2_error(sb, __FUNCTION__, + bit already cleared for block %lu, block + i); + } else { group_freed++; + } } mark_buffer_dirty(bitmap_bh); @@ -569,25 +569,24 @@ unsigned long ext2_count_free_blocks (st static inline int block_in_use(unsigned long block, struct super_block *sb, unsigned char *map) { - return ext2_test_bit ((block - le32_to_cpu(EXT2_SB(sb)-s_es-s_first_data_block)) % + return ext2_test_bit ((block - + le32_to_cpu(EXT2_SB(sb)-s_es-s_first_data_block)) % EXT2_BLOCKS_PER_GROUP(sb), map); } static inline int test_root(int a, int b) { - if (a == 0) - return 1; - while (1) { - if (a == 1) - return 1; - if (a % b) - return 0; - a = a / b; - } + int num = b; + + while (a num) + num *= b; + return num == a; } static int ext2_group_sparse(int group) { + if (group = 1) + return 1; return (test_root(group, 3) || test_root(group, 5) || test_root(group, 7)); } -- Next the statesmen will invent cheap lies, putting the blame upon the nation that is attacked, and every man will be glad of those conscience-soothing falsities, and will diligently study them, and refuse to examine any refutations of them; and thus he will by and by convince himself that the war is just, and will thank God for the better sleep he enjoys after this process of grotesque self-deception. -- Mark Twain - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Minor ext3 speedup
On Thu, Jan 13, 2005 at 01:15:06PM +0100, Jan Kara wrote: Attached patch removes unnecessary division and modulo from ext3 code paths. It reduces (according to oprofile) the CPU usage measurably under a dbench load (see description of the patch for the numbers). I thought I'd apply Jan Andreas's patch to ext2 too. While doing so, I noticed several places where patches had been applied to ext2 and not to ext3. One of them was even a bugfix. This patch brings ext2/balloc.c and ext3/balloc.c closer together and fixes the bug. It includes the aforementioned patch for both ext2 and ext3. For the curious, the bug occurs when ext3_free_blocks_sb() decides to do_more. *pdquot_freed_blocks was updated each time around the loop, so bg_free_blocks_count was getting over-incremented. I fixed it the same way it had been fixed in ext2 -- by introducing a group_freed variable that is reset each time around the loop. Index: linux-2.6/fs/ext2/balloc.c === RCS file: /var/cvs/linux-2.6/fs/ext2/balloc.c,v retrieving revision 1.4 diff -u -p -r1.4 balloc.c --- linux-2.6/fs/ext2/balloc.c 13 Sep 2004 15:23:44 - 1.4 +++ linux-2.6/fs/ext2/balloc.c 15 Jan 2005 16:45:10 - @@ -6,7 +6,7 @@ * Laboratoire MASI - Institut Blaise Pascal * Universite Pierre et Marie Curie (Paris VI) * - * Enhanced block allocation by Stephen Tweedie ([EMAIL PROTECTED]), 1993 + * Enhanced block allocation by Stephen Tweedie ([EMAIL PROTECTED]), 1993 * Big-endian to little-endian byte-swapping/bitmaps by *David S. Miller ([EMAIL PROTECTED]), 1995 */ @@ -52,9 +52,9 @@ struct ext2_group_desc * ext2_get_group_ return NULL; } - - group_desc = block_group / EXT2_DESC_PER_BLOCK(sb); - offset = block_group % EXT2_DESC_PER_BLOCK(sb); + + group_desc = block_group EXT2_DESC_PER_BLOCK_BITS(sb); + offset = block_group (EXT2_DESC_PER_BLOCK(sb) - 1); if (!sbi-s_group_desc[group_desc]) { ext2_error (sb, ext2_get_group_desc, Group descriptor not loaded - @@ -62,7 +62,7 @@ struct ext2_group_desc * ext2_get_group_ block_group, group_desc, offset); return NULL; } - + desc = (struct ext2_group_desc *) sbi-s_group_desc[group_desc]-b_data; if (bh) *bh = sbi-s_group_desc[group_desc]; @@ -236,12 +236,12 @@ do_more: for (i = 0, group_freed = 0; i count; i++) { if (!ext2_clear_bit_atomic(sb_bgl_lock(sbi, block_group), - bit + i, (void *) bitmap_bh-b_data)) - ext2_error (sb, ext2_free_blocks, - bit already cleared for block %lu, - block + i); - else + bit + i, bitmap_bh-b_data)) { + ext2_error(sb, __FUNCTION__, + bit already cleared for block %lu, block + i); + } else { group_freed++; + } } mark_buffer_dirty(bitmap_bh); @@ -569,25 +569,24 @@ unsigned long ext2_count_free_blocks (st static inline int block_in_use(unsigned long block, struct super_block *sb, unsigned char *map) { - return ext2_test_bit ((block - le32_to_cpu(EXT2_SB(sb)-s_es-s_first_data_block)) % + return ext2_test_bit ((block - + le32_to_cpu(EXT2_SB(sb)-s_es-s_first_data_block)) % EXT2_BLOCKS_PER_GROUP(sb), map); } static inline int test_root(int a, int b) { - if (a == 0) - return 1; - while (1) { - if (a == 1) - return 1; - if (a % b) - return 0; - a = a / b; - } + int num = b; + + while (a num) + num *= b; + return num == a; } static int ext2_group_sparse(int group) { + if (group = 1) + return 1; return (test_root(group, 3) || test_root(group, 5) || test_root(group, 7)); } Index: linux-2.6/fs/ext3/balloc.c === RCS file: /var/cvs/linux-2.6/fs/ext3/balloc.c,v retrieving revision 1.9 diff -u -p -r1.9 balloc.c --- linux-2.6/fs/ext3/balloc.c 12 Jan 2005 20:17:23 - 1.9 +++ linux-2.6/fs/ext3/balloc.c 15 Jan 2005 16:45:10 - @@ -43,34 +43,34 @@ struct ext3_group_desc * ext3_get_group_ struct buffer_head ** bh) { unsigned long group_desc; - unsigned long desc; - struct ext3_group_desc * gdp; + unsigned long offset; + struct ext3_group_desc * desc; + struct ext3_sb_info *sbi = EXT3_SB(sb); - if (block_group = EXT3_SB(sb)-s_groups_count) { + if
Re: [Final call for testers][PATCH] superblock handling changes (2.4.6-pre3)
On Fri, Jun 15, 2001 at 01:10:00AM -0400, Alexander Viro wrote: +static inline void write_super(struct super_block *sb) +{ + lock_super(sb); + if (sb-s_root sb-s_dirt) When I first looked at this, I thought it was a typo. I don't think we should have s_dirty and s_dirt as fields of the superblock. how about s_dirty_inodes and s_isdirty, respectively? +restart: + spin_lock(sb_lock); + sb = sb_entry(super_blocks.next); + while (sb != sb_entry(super_blocks)) + if (sb-s_dirt) { + sb-s_count++; + spin_unlock(sb_lock); + down_read(sb-s_umount); + write_super(sb); + drop_super(sb); + goto restart; + } else + sb = sb_entry(sb-s_list.next); + spin_unlock(sb_lock); I think this could be clearer. struct list_head *tmp; restart: spin_lock(sb_lock); list_for_each(tmp, super_blocks) { struct super_block *sb = sb_entry(tmp); if (!sb-s_dirt) continue; spin_unlock(sb_lock); down_read(sb-s_umount); write_super(sb); drop_super(sb); goto restart; } spin_unlock(sb_lock); @@ -773,16 +810,16 @@ void *data, int silent) { struct super_block * s; - s = get_empty_super(); + s = alloc_super(); if (!s) goto out; s-s_dev = dev; s-s_bdev = bdev; s-s_flags = flags; - s-s_dirt = 0; s-s_type = type; - s-s_dquot.flags = 0; - s-s_maxbytes = MAX_NON_LFS; + spin_lock(sb_lock); + list_add (s-s_list, super_blocks.prev); I'd use list_add_tail(s-s_list, super_blocks); - if (mnt-mnt_instances.next != mnt-mnt_instances.prev) { + if (atomic_read(sb-s_active) 1) { I'm happy to see that line disappear. It was mightily confusing. -- Revolutions do not require corporate support. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED]
Re: about BKL in VFS
On Fri, Jun 08, 2001 at 02:04:16PM -0400, Alexander Viro wrote: the only areas in VFS that still rely on BKL are locks.c, dquot.c and super.c. The latter is fixed in the patches I'm feeding to Linus (OK, the pieces I'm feeding to Linus make shifting the BKL down to method calls trivial). Other two... Well, ask the poor guys who maintain them ;-) the file locking code always takes the BKL itself, and doesn't rely on the rest of the VFS to take the lock for it. it's not too much trouble to remove it, and this will happen as part of the 2.5 rewrite. which is in progress. annoyingly, there's very little common semantics between POSIX locks and FLOCK locks, which means that fs/locks.c is basically implementing two completely independent things through one interface. and then there's the other 3-4 different types of lock bleuch. -- Revolutions do not require corporate support. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED]
Re: [RFD w/info-PATCH] device arguments from lookup, partion code
On Tue, May 22, 2001 at 04:31:37PM +0100, Alan Cox wrote: `the class of devices in question' was cryptographic devices, and possibly other transactional DSPs. I don't consider audio to be transactional. in any case, you can do transactional things with two threads, as long as they each have their own fd on the device. Think of the fd as your transaction handle. Thats a bit pathetic. So I have to fill my app with expensive pthread locks or hack all the drivers and totally change the multi-open sematics in the ABI huh? void thread_init(void) { int fd = open(/dev/crypto); real_thread_init(fd); } where was that lock again? and notice this idea is only for transactional things -- what transactional things do sound drivers do? -- Revolutions do not require corporate support. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED]
Re: [RFD w/info-PATCH] device arguments from lookup, partion code
On Mon, May 21, 2001 at 06:13:18PM -0700, Linus Torvalds wrote: Nope. You can (and people do, quite often) share filps. So you can't associate state with it. For _devices_, though? I don't expect my mouse to work if gpm and xfree both try to consume device events from the same filp. Heck, it doesn't even work when they try to consume events from the same inode :-) I think this is a reasonable restriction for the class of devices in question. -- Revolutions do not require corporate support. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED]
Re: [RFD w/info-PATCH] device arguments from lookup, partion code
On Tue, May 22, 2001 at 02:22:34AM +0200, Ingo Oeser wrote: ioctl has actually 4 semantics: command only command + read command + write command + rw-transaction Separating these would be a first step. And yes, I consider each of them useful. command only: reset drive echo 'reset' /dev/sg0ctl command + rw-transaction: dear device please mangle this data (crypto processors come to mind...) I can't think of a reasonable tool-based approach to this, but I can definitely see that a program could use this well. It simply requires that you use the filp to store your state. fd = open(/dev/crypto) - creates filp write(fd, Death to all fanatics!\n); - calls crypto device, stores result in private data structure sleep(100); read(fd, Qrngu gb nyy snangvpf!\n); - frees data structure [You'll note the advanced design of my crypto processor.] Clearly, this is open to abuse by persons never calling read() and passing in far too much to write(). I think this can be alleviated by refusing to accept more than (say) 4k at a time, or bean-counter. A sick way would be to allow the -write() call to have its buffer modified. But I don't think we want to go down that path. -- Revolutions do not require corporate support. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED]
Re: [RFD w/info-PATCH] device arguments from lookup, partion code
On Sun, May 20, 2001 at 03:11:53PM -0400, Alexander Viro wrote: Pheeew... Could you spell about megabyte of stuff in ioctl.c? No. $ ls -l arch/*/kernel/ioctl32*.c -rw-r--r--1 willywilly 22479 Jan 24 16:59 arch/mips64/kernel/ioctl32.c -rw-r--r--1 willywilly 109475 May 18 16:39 arch/parisc/kernel/ioctl32.c -rw-r--r--1 willywilly 117605 Feb 1 20:35 arch/sparc64/kernel/ioctl32.c only about 100k. -- Revolutions do not require corporate support. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED]
Re: [RFD w/info-PATCH] device arguments from lookup, partion code
On Sat, May 19, 2001 at 12:51:23PM -0600, Richard Gooch wrote: Al, if you really want to kill ioctl(2), then perhaps you should implement a transaction(2) syscall. Something like: int transaction (int fd, void *rbuf, size_t rlen, void *wbuf, size_t wlen); Of course, there wouldn't be any practical gain, since we already have ioctl(2). Any gain would be aesthetic. I can tell you haven't had to write any 32-bit ioctl emulation code for a 64-bit kernel recently. -- Revolutions do not require corporate support. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED]
Re: [RFD w/info-PATCH] device arguments from lookup, partion code
On Sat, May 19, 2001 at 10:22:55PM -0400, Richard Gooch wrote: The transaction(2) syscall can be just as easily abused as ioctl(2) in this respect. But read() and write() cannot. -- Revolutions do not require corporate support. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED]
Re: Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH] device arguments from lookup)
On Sat, May 19, 2001 at 12:51:07PM -0400, Alexander Viro wrote: clone(), walk(), clunk(), stat() and open() ;-) Basically, we can add unopened descriptors. I.e. no IO until you open it (turning the thing into opened one), but we can do lookups (move to child), we can clone and kill them and we can stat them. Those who would like a more detailed explanation can find one at http://plan9.bell-labs.com/sys/man/5/INDEX.html -- Revolutions do not require corporate support. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED]
Re: [RFD w/info-PATCH] device arguments from lookup, partion code
On Sat, May 19, 2001 at 05:25:22PM +0100, Alan Cox wrote: Only to an English speaker. I suspect Quebec City canadians would prefer a different command set. Should we support `pas387' as well as `no387' as a kernel boot parameter then? Face it, a sysadmin has to know the limited subset of english which is used to configure a kernel. -- Revolutions do not require corporate support. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED]
F_SETLK_NP
Here's an initial implementation of F_SETLK_NP as previously discussed on linux-fsdevel. The _ONLY_ difference between these locks and POSIX locks is that they are not removed on a close(dup(fd)). I have a question about the interaction between POSIX and RANGE locks. If they overlap, should they be merged? And if so, what type should the resulting lock have? My initial thought is that they should _not_ be merged but _should_ conflict, which is going to require a slightly bigger patch. This patch merges them and gives the merged lock the type of the lock which arrived first. Anyway, here's the test program (I checked what was going on via /proc/locks, you can too). /* lock-test.c Copyright (c) Matthew Wilcox, Hewlett-Packard. GPL License */ #define _GNU_SOURCE #include fcntl.h #include stdio.h #include unistd.h #define F_SETLK_NP 15 int main(int argc, char **argv) { struct flock64 lock; int fd; lock.l_type = F_RDLCK; lock.l_whence = 0; lock.l_start = 50; lock.l_len = 100; fd = open(argv[0], O_RDONLY); fcntl(fd, F_SETLK_NP, lock); printf(Applied lock to file -- check and press the `Enter' key\n); getchar(); close(dup(fd)); printf(Closed dup -- check and press the `Enter' key\n); getchar(); close(fd); printf(Closed original -- check and press the `Enter' key\n); getchar(); return 0; } And now for the patch itself. It was generated against 2.4.5-pre1 but will probably apply against any 2.4.x kernel. I've taken the liberty of getting rid of some old cruft which wasn't used any more. This patch is i386 only -- just add constants to other arch include files as required. diff -urNX dontdiff linux-245p1/fs/fcntl.c linux-245p1-vfs/fs/fcntl.c --- linux-245p1/fs/fcntl.c Sun May 6 09:16:20 2001 +++ linux-245p1-vfs/fs/fcntl.c Sun May 6 05:52:35 2001 @@ -260,6 +260,10 @@ case F_SETLKW: err = fcntl_setlk(fd, cmd, (struct flock *) arg); break; + case F_SETLK_NP: + case F_SETLKW_NP: + err = fcntl_setlk_np(fd, cmd, (struct flock64 *) arg); + break; case F_GETOWN: /* * XXX If f_owner is a process group, the diff -urNX dontdiff linux-245p1/fs/locks.c linux-245p1-vfs/fs/locks.c --- linux-245p1/fs/locks.c Sun May 6 09:14:08 2001 +++ linux-245p1-vfs/fs/locks.c Sun May 6 05:38:12 2001 @@ -547,7 +547,7 @@ /* POSIX locks owned by the same process do not conflict with * each other. */ - if (!(sys_fl-fl_flags FL_POSIX) || + if (!(sys_fl-fl_flags (FL_POSIX | FL_RANGE)) || locks_same_owner(caller_fl, sys_fl)) return (0); @@ -620,7 +620,7 @@ lock_kernel(); for (cfl = filp-f_dentry-d_inode-i_flock; cfl; cfl = cfl-fl_next) { - if (!(cfl-fl_flags FL_POSIX)) + if (!(cfl-fl_flags (FL_POSIX | FL_RANGE))) continue; if (posix_locks_conflict(cfl, fl)) break; @@ -863,7 +863,7 @@ if (caller-fl_type != F_UNLCK) { repeat: for (fl = inode-i_flock; fl != NULL; fl = fl-fl_next) { - if (!(fl-fl_flags FL_POSIX)) + if (!(fl-fl_flags (FL_POSIX | FL_RANGE))) continue; if (!posix_locks_conflict(caller, fl)) continue; @@ -892,7 +892,7 @@ /* First skip locks owned by other processes. */ - while ((fl = *before) (!(fl-fl_flags FL_POSIX) || + while ((fl = *before) (!(fl-fl_flags (FL_POSIX | FL_RANGE)) || !locks_same_owner(caller, fl))) { before = fl-fl_next; } @@ -1458,23 +1458,72 @@ break; case F_UNLCK: break; - case F_SHLCK: - case F_EXLCK: -#ifdef __sparc__ -/* warn a bit for now, but don't overdo it */ -{ - static int count = 0; - if (!count) { - count=1; - printk(KERN_WARNING - fcntl_setlk() called by process %d (%s) with broken flock() emulation\n, - current-pid, current-comm); + default: + error = -EINVAL; + goto out_putf; + } + + if (filp-f_op filp-f_op-lock != NULL) { + error = filp-f_op-lock(filp, cmd, file_lock); + if (error 0) + goto out_putf; } + error = posix_lock_file(filp, file_lock, cmd == F_SETLKW); + +out_putf: + fput(filp); +out: + locks_free_lock(file_lock); + return error; } - if (!(filp-f_mode 3)) + +/* Apply the lock described by l to an open file descriptor
Re: File Locking in Linux 2.5
On Fri, May 04, 2001 at 01:33:44PM +0200, Trond Myklebust wrote: ??? It'll be a completely useless form of lock if programs can end up blocking forever on a set of (from their perspective) valid operations. Umm.. that's your opinion. I don't see how `making them mandatory' would make this better or worse than if they check for deadlock. If they're mandatory, a process doing a read() which conflicts will block forever whether they're checked for deadlock or not. Ideally the VFS locking manager wants to provide a service to samba, NFS and their ilk, whereby they ask for a lock. If waiting would deadlock, send back an immediate answer to the caller. If not, then the locking manager should be responsible for scheduling a lock immediately after the blocking lock is released, and notifying the caller that it can proceed. did you read my proposal? was that not exactly what i suggested? Anything less than that would be an unacceptable step backward wrt. the existing code. We don't want to cram more responsabilities down on the filesystem-level. Actually, we _do_ want to give the filesystems the opportunity to handle their own locks without poking around in the guts of locks.c as lockd currently does. Did my proposal seem unacceptable to you? How about allowing the process to use a cookie to define ownership at the kernel level? For normal processes this may or may not include some combination involving the pid, but kernel level server processes could use this to save the metadata required to identify the client. I know the NFSv4 people are keen to implement this sort of thing... I like this. That may be why the current code checks both owner and pid fields. Definitely worthy of a comment if so... -- Revolutions do not require corporate support. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED]
Re: File Locking in Linux 2.5
On Thu, May 03, 2001 at 12:19:31AM -0700, Jeremy Allison wrote: I want to add another byte-range lock, which looks and smells like a POSIX fcntl lock except that it is not removed by closing any fd which happens to be open on this file. If you do this I'll be eternally in your debt ! I had to write that code for Samba 2.2 and it was a *horrible* experience. With non-broken locking we can #ifdef out via autoconf a large chunk of code from the Linux Samba implementation, and free up fd's immediately on close. As soon as you have any prototypes of this please let us know on the [EMAIL PROTECTED] list - I'd like to try it asap ! I'll get to it this weekend then. Should be a relatively simple patch. Are there any other semantics you want changing from the POSIX lock? I'm thinking of removing deadlock detection, for one... -- Revolutions do not require corporate support. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED]
Re: File Locking in Linux 2.5
On Thu, May 03, 2001 at 01:28:33PM +0200, Andi Kleen wrote: Just don't forget to add a per user ulimit for it and probably an admin tool like ipcs. It'll use the same limit as the other locks do. -- Revolutions do not require corporate support. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED]
Re: File Locking in Linux 2.5
On Thu, May 03, 2001 at 02:10:48PM +0200, Trond Myklebust wrote: == Matthew Wilcox [EMAIL PROTECTED] writes: I'll get to it this weekend then. Should be a relatively simple patch. Are there any other semantics you want changing from the POSIX lock? I'm thinking of removing deadlock detection, for one... What do you mean by `removing' it? Eliminating deadlock detection entirely will get you in deep trouble when you add mandatory locking. Huh? (1) I don't intend to have mandatory locking on these new locks. (2) how exactly would it get me into trouble? Let's just go over the semantics we currently have for posix locks, and see if we can come up with `better' lock semantics. POSIX locks are advisory (unless you do the mandatory lock dance), and apply to a byte-range. They are checked for deadlock, merged with other compatible locks (same PID, same `owner'). (this seems redundant and there's a question mark by it in the source... never mind.) The POSIX locking rules mean that you can do funky things like: lock bytes 0-100 lock bytes 50-150 unlock bytes 50-150 and you have bytes 0-49 left locked. of course, you can then unlock bytes 0-100 and it will remove the lock from 0-49, but this behaviour seems unintuitive. You can also (and this causes a fair bit of ugliness in the kernel code) do: lock 0-100 unlock 25-75 or, even worse lock_read 0-100 lock_write 25-75 which leaves you with 0-24 read-locked, 25-75 write-locked and 76-100 read-locked. i can't imagine this is behaviour people actually use/desire. So for a new lock type, i propose the following semantics: byte-range, advisory, non-merging, non-deadlock-checking Another semantic issue I want to tie down. Is it useful for different threadsd to be able to share a lock? Or should locks from different threads conflict? I'm currently leaning towards the latter, but you may be able to convince me I'm wrong. -- Revolutions do not require corporate support. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED]
Re: File Locking in 2.5
On Wed, May 02, 2001 at 04:49:30PM +0200, Andi Kleen wrote: On Mon, Apr 30, 2001 at 12:39:23PM -0600, Matthew Wilcox wrote: * All filesystems will fill in their -lock method. Why when a common stub should work for 90% of them? Please keep global search-and-edit operation low when not absolutely possible. I think this will be one of the more minor changes that filesystems will see during 2.5 :-) * Local filesystems should all use local_lock() for this method, unless they have a good reason to provide their own facilities. So this should be default for -lock Perhaps we misunderstand one another? I propose most filesystems should do struct file_operations { ... lock: local_lock } and I think you're proposing if (i_fop-lock) i_fop-lock(); else local_lock(); please correct me if I'm wrong. A clustered filesystem might call out to the network and say `I want to put this lock on this file'. Either some other node in the cluster says `Denied', `Blocked' or `Granted' (ie handles the request), OR no other node accepts responsibility for the lock, in which case we lock it locally by calling local_lock(). It would be nice if it was possible to interface it to the distributed lock manager IBM has recently GPLed... It would. I hadn't heard that got released, I shall investigate. It's still linear lists I guess? Any evidence that a hash table or a ternary search tree makes sense? I'm not going to change it from a linear list until I've made these other changes. In the meantime, if anyone knows of a data structure which lets me record ranges and check for overlaps, let me know... I would add a printk warning that fires 10 times, then you can get rid of it. They're only supported for the benefit of really old applications (ie stuff that runs under iBCS). We could probably drop them or make them a config option. I've never met anyone who mounts their filesystems with -o mand :-) I never understood why you can't do it completely inode local protected by the inode lock; but then I don't claim that I understand fs/locks.c inode_lock claims it's for protecting the list manipulations, not for protecting any element in an individual inode. Though in practice it seems to be used for protecting i_state as well. I think I'd rather use a different lock to protect i_flock -- inode_lock is probably already highly contended. Or just walk the inode hash ? yes, could do that. but /proc/locks really ought to die. i don't see why we need to expose kernel addresses to userspace. the extreme security people have also voiced concerns about the covert channel involved and the way this can be used to spy on other processes. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED]
File Locking in 2.5
that it is not removed by closing any fd which happens to be open on this file. Samba keeps a list of open fds which are not currently in use on any locked file to work around this stupidity in the spec. I'd like the external interface to this to be fcntl(F_SETLK_NP) and F_SETLKW_NP. Clearly F_GETLK does not need to be altered or replaced. Restructuring locks.c still runs almost entirely under the BKL. An earlier attempt to move it to a different locking scheme was thwarted when the code was integrated into 2.4.0-test9 while I was on holiday, and without me submitting it to Linus. Grumble. I plan to move it to _one_ spinlock to cover all lock-related structures, and I think that will be possible with the plan described above (since this code will no longer sleep). As soon as lockd no longer needs to keep its fingers inside locks.c, I want to remove the global list of locks. It's also used by /proc/locks -- which probably needs to go away anyway. So what's useful about /proc/locks? I'd like to be able to see which locks my process has, and which processes have a lock on a given file. The former is easy -- /proc/$PID/locks can be constructed relatively easily from the fd's open by that process. The latter? I don't know. Ideas welcome. Links [1]POSIX file locking [2]Olaf Kirch's page on NLM (warning: out of date) _ Matthew Wilcox [EMAIL PROTECTED] References 1. http://www.opengroup.org/onlinepubs/007908799/xsh/fcntl.html 2. http://www.swb.de/personal/okir/lockd.html - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED]
Re: [gfs-admin@sistina.com: Your message to gfs awaits moderator approval]
On Mon, Apr 30, 2001 at 02:15:50PM -0700, David S. Miller wrote: Now since kpreslan is the only one from sistina.com on linux-fsdevel I would say that would be the problem address. However, I've made some postings to linux-kernel today (which this address is also on) yet I received no such bounces. In the meantime I've removed kpreslan. I'm not going to remove the entire domain, I only do that if I see evidence of site negligence not personal negligence. We have no proof of the latter at this point. ok, thanks. i had assumed that they'd subscribed the gfs list to -fsdevel. i guess what actually happened is that kpreslan `bounced' instead of `forwarded' the message to the gfs list, so the fascist mail software checked whether i was a subscriber. thanks for your prompt action. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED]
Re: 64-bit block sizes on 32-bit systems
On Mon, Mar 26, 2001 at 08:39:21AM -0800, LA Walsh wrote: I vaguely remember a discussion about this a few months back. If I remember, the reasoning was it would unnecessarily slow down smaller systems that would never have block devices in the 4-28T range attached. 4k page size * 2GB = 8TB. i consider it much more likely on such systems that the page size will be increased to maybe 16 or 64k which would give us 32TB or 128TB. you keep on trying to increase the size of types without looking at what gcc outputs in the way of code that manipulates 64-bit types. seriously, why don't you just try it? see what the performance is. see what the code size is. then come back with some numbers. and i mean numbers, not `it doesn't feel any slower'. personally, i'm going to see what the situation looks like in 5 years time and try to solve the problem then. there're enough real problems with the VFS today that i don't feel inclined to fix tomorrow's potential problems. -- Revolutions do not require corporate support. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED]
Re: 64-bit block sizes on 32-bit systems
On Mon, Mar 26, 2001 at 10:47:13AM -0700, Andreas Dilger wrote: What do you mean by problems 5 years down the road? The real issue is that this 32-bit block count limit affects composite devices like MD RAID and LVM today, not just individual disks. There have been several postings I have seen with people having a problem _today_ with a 2TB limit on devices. people who can afford 2TB of disc can afford to buy a 64-bit processor. -- Revolutions do not require corporate support. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED]
Re: 64-bit block sizes on 32-bit systems
On Mon, Mar 26, 2001 at 01:13:32PM -0800, Ion Badulescu wrote: Yes, there are millions of 32-bit systems in use today. They do their job just fine with the 32-bit device support we have right now. Do you really want to penalize them *all* for the sake of the few idiotic sysadmins who want multi-TB storage on their 32-bit system? And they can even have that multi-TB storage today. It's the lazy ones who want to combine all their 80GB discs into one virtual device. In case anyone's getting the wrong idea from this, I _do_ think we ought to make the 2TB limit go away. I _don't_ think we should do that by going to 64-bit block numbers on 32-bit machines. Linus suggested a while back (in one of the kiobuf flamewa^W discussions) that the block device layer ought to work fine with non-512-byte-blocks. Perhaps those who are keen to see Linux support large drives should investigate that instead. -- Revolutions do not require corporate support. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED]
Re: [RFC] sane access to per-fs metadata (was Re: [PATCH] Documentation/ioctl-number.txt)
On Fri, Mar 23, 2001 at 09:56:47AM -0700, Bryan Henderson wrote: There's a lot of cool simplicity in this, both in implementation and application, but it leaves something to be desired in functionality. This is partly because the price you pay for being able to use existing, well-worn Unix interfaces is the ancient limitations of those interfaces -- like the inability to return adequate error information. hmm... open("defrag-error") first, then read from it if it fails? effective the defrag was? And bear in mind that multiple processes may be issuing commands to /mnt/control simultaneously. you should probably serialise them. you probably have to do this anyway. With ioctl, I can easily match a response of any kind to a request. I can even return an English text message if I want to be friendly. yes, one of the nice plan9 changes was the change to returning strings instead of numerics. -- Revolutions do not require corporate support. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED]
Re: (struct dentry *)-vfsmnt;
On Wed, Mar 14, 2001 at 10:26:50AM -0700, Andreas Dilger wrote: Let me put it that way: I don't understand why (if it is useful at all) it is done in the fs. Looks like a wrong level... For the same reason that the UUID and LABEL are stored in the superblock: you want this infomation kept with the filesystem and not anywhere else, otherwise it will quickly get out-of-date. Wherever you mounted the filesystem last is where it would be mounted if you import the VG on another system. You can obviously edit /etc/fstab afterwards if it is wrong, and then remount the filesystem(s), and this will store the correct mountpoint into the filesystem for the next vgimport. Al is saying `why not do this in mount(8) instead of mount(2)?' I haven't seen you answer that yet. -- Revolutions do not require corporate support. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED]
remove unused entries from ext2 inode
Andi Kleen mentioned that the ext2 part of the inode union was rather large. So I wondered if I could chop it down a bit. * i_osync is only referenced, never set * i_faddr, i_frag_no, i_frag_size -- we don't support fragments. * not_used_1 can clearly be removed. * i_high_size is obsoleted by the VFS support for 64-bit files. * i_new_inode is set but never referenced. I decided not to remove the ACL entries yet. Index: fs/ext2/ialloc.c === RCS file: /var/cvs/linux/fs/ext2/ialloc.c,v retrieving revision 1.1.1.3 diff -u -p -r1.1.1.3 ialloc.c --- fs/ext2/ialloc.c2000/12/26 00:20:32 1.1.1.3 +++ fs/ext2/ialloc.c2001/03/12 13:44:25 @@ -432,13 +432,9 @@ repeat: inode-i_blksize = PAGE_SIZE; /* This is the optimal IO size (for stat), not the fs block size */ inode-i_blocks = 0; inode-i_mtime = inode-i_atime = inode-i_ctime = CURRENT_TIME; - inode-u.ext2_i.i_new_inode = 1; inode-u.ext2_i.i_flags = dir-u.ext2_i.i_flags; if (S_ISLNK(mode)) inode-u.ext2_i.i_flags = ~(EXT2_IMMUTABLE_FL | EXT2_APPEND_FL); - inode-u.ext2_i.i_faddr = 0; - inode-u.ext2_i.i_frag_no = 0; - inode-u.ext2_i.i_frag_size = 0; inode-u.ext2_i.i_file_acl = 0; inode-u.ext2_i.i_dir_acl = 0; inode-u.ext2_i.i_dtime = 0; Index: fs/ext2/inode.c === RCS file: /var/cvs/linux/fs/ext2/inode.c,v retrieving revision 1.1.1.3 diff -u -p -r1.1.1.3 inode.c --- fs/ext2/inode.c 2001/01/01 10:32:55 1.1.1.3 +++ fs/ext2/inode.c 2001/03/12 13:43:00 @@ -405,7 +405,7 @@ static int ext2_alloc_branch(struct inod *branch[n].p = branch[n].key; mark_buffer_uptodate(bh, 1); mark_buffer_dirty_inode(bh, inode); - if (IS_SYNC(inode) || inode-u.ext2_i.i_osync) { + if (IS_SYNC(inode)) { ll_rw_block (WRITE, 1, bh); wait_on_buffer (bh); } @@ -470,13 +470,13 @@ static inline int ext2_splice_branch(str /* had we spliced it onto indirect block? */ if (where-bh) { mark_buffer_dirty_inode(where-bh, inode); - if (IS_SYNC(inode) || inode-u.ext2_i.i_osync) { + if (IS_SYNC(inode)) { ll_rw_block (WRITE, 1, where-bh); wait_on_buffer(where-bh); } } - if (IS_SYNC(inode) || inode-u.ext2_i.i_osync) + if (IS_SYNC(inode)) ext2_sync_inode (inode); else mark_inode_dirty(inode); @@ -1036,14 +1036,10 @@ void ext2_read_inode (struct inode * ino inode-i_blocks = le32_to_cpu(raw_inode-i_blocks); inode-i_version = ++event; inode-u.ext2_i.i_flags = le32_to_cpu(raw_inode-i_flags); - inode-u.ext2_i.i_faddr = le32_to_cpu(raw_inode-i_faddr); - inode-u.ext2_i.i_frag_no = raw_inode-i_frag; - inode-u.ext2_i.i_frag_size = raw_inode-i_fsize; inode-u.ext2_i.i_file_acl = le32_to_cpu(raw_inode-i_file_acl); if (S_ISDIR(inode-i_mode)) inode-u.ext2_i.i_dir_acl = le32_to_cpu(raw_inode-i_dir_acl); else { - inode-u.ext2_i.i_high_size = le32_to_cpu(raw_inode-i_size_high); inode-i_size |= ((__u64)le32_to_cpu(raw_inode-i_size_high)) 32; } inode-i_generation = le32_to_cpu(raw_inode-i_generation); @@ -1112,25 +1108,26 @@ static int ext2_update_inode(struct inod unsigned long offset; int err = 0; struct ext2_group_desc * gdp; + struct super_block *sb = inode-i_sb; if ((inode-i_ino != EXT2_ROOT_INO -inode-i_ino EXT2_FIRST_INO(inode-i_sb)) || - inode-i_ino le32_to_cpu(inode-i_sb-u.ext2_sb.s_es-s_inodes_count)) { - ext2_error (inode-i_sb, "ext2_write_inode", +inode-i_ino EXT2_FIRST_INO(sb)) || + inode-i_ino le32_to_cpu(sb-u.ext2_sb.s_es-s_inodes_count)) { + ext2_error (sb, "ext2_write_inode", "bad inode number: %lu", inode-i_ino); return -EIO; } - block_group = (inode-i_ino - 1) / EXT2_INODES_PER_GROUP(inode-i_sb); - if (block_group = inode-i_sb-u.ext2_sb.s_groups_count) { - ext2_error (inode-i_sb, "ext2_write_inode", + block_group = (inode-i_ino - 1) / EXT2_INODES_PER_GROUP(sb); + if (block_group = sb-u.ext2_sb.s_groups_count) { + ext2_error (sb, "ext2_write_inode", "group = groups count"); return -EIO; } - group_desc = block_group EXT2_DESC_PER_BLOCK_BITS(inode-i_sb); - desc = block_group (EXT2_DESC_PER_BLOCK(inode-i_sb) - 1); - bh = inode-i_sb-u.ext2_sb.s_group_desc[group_desc]; + group_desc = block_group
NULL f_ops
Someone tell me if my chain of reasoning is wrong here... (1) The only way to get a `struct file' is to call get_empty_filp() (2) All callers of get_empty_filp() set -f_ops to a non-NULL value (3) All checks for f_ops being NULL can be removed -- Revolutions do not require corporate support. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED]
Re: NULL f_ops
On Thu, Mar 08, 2001 at 11:16:10AM -0500, Alexander Viro wrote: I'm not sure on your #2. In principle, -i_fop can be NULL. It may be a good thing to declare that it should never happens, but right now it's not guaranteed. Besides, revoke-like thing in proc/generic.c _does_ set f_op to NULL. It's damn ugly and the whole kill_inodes() should die slow and painful death, but right now that's the way it is done. Same thing for devfs, BTW - not that I thought that devfs deserved a different fate, but... I'd rather see it set to null_file_ops then NULL. If it can be null, that means we need to go through the fs directory adding them; there are a few places it isn't checked right now. -- Revolutions do not require corporate support. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED]
Re: More better in mount(2)
On Fri, Jan 05, 2001 at 01:00:58PM -0800, LA Walsh wrote: There is no file-system specific code. even better, you can do it in the vfs. file systems, but that seems as ugly or uglier than adding a 6th parameter. Note that the 6th parameter would only be looked at if the MS_EXTRA bit is set -- meaning no current mount programs would have to change -- they would just continue to do mounts the way they always have and specify no extra 'sysdata'. there have been times before when this kind of extension have been added. it almost always hurts us later. don't add new arguments to syscalls without changing the syscall number. -- Revolutions do not require corporate support. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED]
Re: dnotify enhancement
On Tue, Dec 12, 2000 at 03:47:54PM -0500, James Antill wrote: Say you have... /a/b/c/d ...and you are monitoring the 'd' directory. Then someone does... % mv /a/b /a/b.old % mkdir -p /a/b/c/d that shouldn't trigger an event. i'm strongly against doing recursion of this in the kernel. if you want to monitor that kind of thing, it makes sense to do it in userspace. are there any cases where you want to know that the directory itself has been renamed, but not that one of its more remote ancestors has been renamed? -- Revolutions do not require corporate support. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED]
Re: big kernel locks for inode ops
On Wed, Dec 06, 2000 at 03:03:00PM -0800, LA Walsh wrote: I notice many places in open.c where 'lock_kernel()' has been dropped in 2.4. There are still a few places. Should there be a separate file-sub-system lock, perhaps later evolving into a /per filesystem lock? Under what circumstances in the file-system code does one still have to (or should) lock down the kernel? Rusty wrote an `unreliable guide to kernel locking', available as linux/Documentation/DocBook/kernel-locking.* (there are various formats available for it, see the makefile for details). -- Revolutions do not require corporate support. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED]