Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
On Mon, May 28, 2007 at 02:48:45PM +1000, Timothy Shimmin wrote:
> I'm taking it that the FUA write will just guarantee that that
> particular write has made it to disk on i/o completion
> (and no write cache flush is done).

Correct. It only applies to that one write command.

jeremy
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
2007/5/25, Neil Brown [EMAIL PROTECTED]:
> BIO_RW_FAILFAST: means low-level driver shouldn't do much (or no)
> error recovery. Mainly used by multipath targets to avoid long
> SCSI recovery. This should just be propagated when passing
> requests on.
>
> Is it much or no?
> Would it be reasonable to use this for reads from a non-degraded
> raid1? What about writes?

This depends on the device driver's implementation. AFAIK there is no fixed rule on how to handle that flag exactly. The SCSI driver seems to omit internal recovery procedures, but requests can still take as long as the SCSI request time-out. I am not sure of all internals. Maybe some error recovery is done as long as it shouldn't take very long. For the DASD driver on zSeries this flag will only affect situations when the driver decides there is no other way of succeeding. Recovery is still done.

Using this flag was intended to move error handling to an upper layer in the device stack. For multipathing it is good to be able to map a request to another path instead of waiting until the SCSI layer finally gives up on one path. For a RAID1 this might cause requests to fail which would otherwise have been recovered. This might require more error handling in md. The error code as it is at this time doesn't say much in detail. I once saw patches (and there are comments about a path missing from Jens Axboe) to pass sense data (from SCSI) in the bio. I am not sure whether this was dropped for some reason or is just in the pipe. Jens?

Stefan
Re: [PATCH] AFS: Implement file locking
J. Bruce Fields [EMAIL PROTECTED] wrote:
> > At the moment, yes. Don't the POSIX and flock lock-handling routines in the
> > kernel normally do that anyway?
>
> No, they'd upgrade in that case.

I just checked. The OpenAFS server supports neither lock upgrading nor lock downgrading. Attempts to do either incur an abort with code 0x02f6df0a (which I believe to be equivalent to EAGAIN).

This means that I can't practically support lock upgrading. Lock downgrading I can emulate by handing apparent readlocks to local processes whilst holding a writelock on the server.

David
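The downgrade emulation described above can be modelled in a few lines. This is only a userspace sketch under stated assumptions: `struct emulated_lock`, `emulate_downgrade()` and `upgrade_needs_server()` are hypothetical names invented for illustration and are not part of the kAFS patch. The point it shows is that the server-side lock is never regraded; only the mode reported to local processes changes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical lock modes; illustrative names, not kAFS identifiers. */
enum lock_mode { LOCK_NONE, LOCK_READ, LOCK_WRITE };

/* What the server believes we hold versus what local processes were told. */
struct emulated_lock {
	enum lock_mode server_mode;	/* lock actually held on the file server */
	enum lock_mode local_mode;	/* lock as seen by local processes */
};

/*
 * Emulate a write->read downgrade without telling the server:
 * the server writelock is kept, only the local view changes.
 * Returns false if there is no local writelock to downgrade.
 */
static bool emulate_downgrade(struct emulated_lock *l)
{
	assert(l != NULL);
	if (l->server_mode != LOCK_WRITE || l->local_mode != LOCK_WRITE)
		return false;
	l->local_mode = LOCK_READ;	/* server_mode deliberately stays WRITE */
	return true;
}

/*
 * An upgrade would need a server-side regrade (which the server refuses)
 * unless the server lock is already exclusive.
 */
static bool upgrade_needs_server(const struct emulated_lock *l)
{
	assert(l != NULL);
	return l->server_mode != LOCK_WRITE;
}
```

The cost of this design is that other clients see an exclusive lock on the server for as long as the emulated readlock is held.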
[1/4] 2.6.22-rc3: known regressions
Hi all,

Here is a list of some known regressions in 2.6.22-rc3.

Feel free to add new regressions/remove fixed etc.
http://kernelnewbies.org/known_regressions


Unclassified

Subject    : long freezes on thinkpad t60
References : http://lkml.org/lkml/2007/5/24/100
Submitter  : Miklos Szeredi [EMAIL PROTECTED]
Handled-By : Ingo Molnar [EMAIL PROTECTED]
Status     : problem is being debugged


ACPI

Subject    : unable to shutdown on kernel 2.6.22-rc2
References : http://bugzilla.kernel.org/show_bug.cgi?id=8516
Submitter  : Thierry Volpiatto [EMAIL PROTECTED]
Status     : Unknown


ALSA

Subject    : snd-aoa causes badness in lib/kref.c:33
References : http://bugzilla.kernel.org/show_bug.cgi?id=8513
Submitter  : Ben Collins [EMAIL PROTECTED]
Status     : Unknown


File systems

Subject    : Oops in dentry_iput with 2.6.22-rc2 on AMD64
References : http://lkml.org/lkml/2007/5/22/4
Submitter  : Florin Iucha [EMAIL PROTECTED]
Status     : Unknown


Kbuild

Subject    : make M=$PWD modules_install does nothing
References : http://lkml.org/lkml/2007/5/27/190
Submitter  : Andrey Borzenkov [EMAIL PROTECTED]
Status     : Unknown


Regards,
Michal

--
"What I missed most was your silence."
-- Andrzej Sapkowski, "Coś więcej" (Something More)
Re: [1/4] 2.6.22-rc3: known regressions
On Tue, May 29, 2007 at 04:34:59PM +0200, Jan Kara wrote:
> On Tue 29-05-07 14:52:53, Michal Piotrowski wrote:
> > Here is a list of some known regressions in 2.6.22-rc3.
> >
> > Subject    : Oops in dentry_iput with 2.6.22-rc2 on AMD64
> > References : http://lkml.org/lkml/2007/5/22/4
> > Submitter  : Florin Iucha [EMAIL PROTECTED]
> > Status     : Unknown
>
> Actually, the bug seems to be unreproducible and it has probably been a
> 1-bit flip. So I'd be reluctant to call it a regression...

I agree with this statement. I'll ping Michal and Jan if the oops resurfaces.

florin

--
Bruce Schneier expects the Spanish Inquisition.
http://geekz.co.uk/schneierfacts/fact/163
[PATCH] AFS: Implement file locking [try #2]
Implement file locking for AFS.

[try #2]:

 (*) Start the lock manager thread under a mutex to avoid a race.

 (*) Made the locking non-fair: New readlocks will jump pending writelocks if
     there's a readlock currently granted on a file. This makes the behaviour
     similar to Linux's VFS locking.

Regrading of locks is not currently supported as this is not supported by the
server.

Byte-range locking is also not currently supported for the same reason.

Signed-off-by: David Howells [EMAIL PROTECTED]
---

 fs/afs/Makefile    |    1
 fs/afs/afs.h       |    8 +
 fs/afs/afs_fs.h    |    3
 fs/afs/callback.c  |    3
 fs/afs/dir.c       |    1
 fs/afs/file.c      |    2
 fs/afs/flock.c     |  590
 fs/afs/fsclient.c  |  155 ++
 fs/afs/internal.h  |   30 +++
 fs/afs/main.c      |    1
 fs/afs/misc.c      |    1
 fs/afs/super.c     |    3
 fs/afs/vnode.c     |  130 ++-
 include/linux/fs.h |    4
 14 files changed, 917 insertions(+), 15 deletions(-)

diff --git a/fs/afs/Makefile b/fs/afs/Makefile
index 73ce561..a666710 100644
--- a/fs/afs/Makefile
+++ b/fs/afs/Makefile
@@ -8,6 +8,7 @@ kafs-objs := \
 	cmservice.o \
 	dir.o \
 	file.o \
+	flock.o \
 	fsclient.o \
 	inode.o \
 	main.o \
diff --git a/fs/afs/afs.h b/fs/afs/afs.h
index 2452579..c548aa3 100644
--- a/fs/afs/afs.h
+++ b/fs/afs/afs.h
@@ -37,6 +37,13 @@ typedef enum {
 	AFS_FTYPE_SYMLINK	= 3,
 } afs_file_type_t;

+typedef enum {
+	AFS_LOCK_READ		= 0,	/* read lock request */
+	AFS_LOCK_WRITE		= 1,	/* write lock request */
+} afs_lock_type_t;
+
+#define AFS_LOCKWAIT	(5 * 60) /* time until a lock times out (seconds) */
+
 /*
  * AFS file identifier
  */
@@ -120,6 +127,7 @@ struct afs_file_status {
 	struct afs_fid		parent;		/* parent dir ID for non-dirs only */
 	time_t			mtime_client;	/* last time client changed data */
 	time_t			mtime_server;	/* last time server changed data */
+	s32			lock_count;	/* file lock count (0=UNLK -1=WRLCK +ve=#RDLCK */
 };

 /*
diff --git a/fs/afs/afs_fs.h b/fs/afs/afs_fs.h
index a18c374..eb64732 100644
--- a/fs/afs/afs_fs.h
+++ b/fs/afs/afs_fs.h
@@ -31,6 +31,9 @@ enum AFS_FS_Operations {
 	FSGETVOLUMEINFO		= 148,	/* AFS Get information about a volume */
 	FSGETVOLUMESTATUS	= 149,	/* AFS Get volume status information */
 	FSGETROOTVOLUME		= 151,	/* AFS Get root volume name */
+	FSSETLOCK		= 156,	/* AFS Request a file lock */
+	FSEXTENDLOCK		= 157,	/* AFS Extend a file lock */
+	FSRELEASELOCK		= 158,	/* AFS Release a file lock */
 	FSLOOKUP		= 161,	/* AFS lookup file in directory */
 	FSFETCHDATA64		= 65537, /* AFS Fetch file data */
 	FSSTOREDATA64		= 65538, /* AFS Store file data */
diff --git a/fs/afs/callback.c b/fs/afs/callback.c
index bacf518..b824394 100644
--- a/fs/afs/callback.c
+++ b/fs/afs/callback.c
@@ -125,6 +125,9 @@ static void afs_break_callback(struct afs_server *server,
 			spin_unlock(&server->cb_lock);

 			queue_work(afs_callback_update_worker, &vnode->cb_broken_work);
+			if (list_empty(&vnode->granted_locks) &&
+			    !list_empty(&vnode->pending_locks))
+				afs_lock_may_be_available(vnode);
 			spin_unlock(&vnode->lock);
 		}
 	}
diff --git a/fs/afs/dir.c b/fs/afs/dir.c
index 546c595..33fe39a 100644
--- a/fs/afs/dir.c
+++ b/fs/afs/dir.c
@@ -44,6 +44,7 @@ const struct file_operations afs_dir_file_operations = {
 	.open		= afs_dir_open,
 	.release	= afs_release,
 	.readdir	= afs_readdir,
+	.lock		= afs_lock,
 };

 const struct inode_operations afs_dir_inode_operations = {
diff --git a/fs/afs/file.c b/fs/afs/file.c
index 1547500..8aaa233 100644
--- a/fs/afs/file.c
+++ b/fs/afs/file.c
@@ -35,6 +35,8 @@ const struct file_operations afs_file_operations = {
 	.mmap		= afs_mmap,
 	.sendfile	= generic_file_sendfile,
 	.fsync		= afs_fsync,
+	.lock		= afs_lock,
+	.flock		= afs_flock,
 };

 const struct inode_operations afs_file_inode_operations = {
diff --git a/fs/afs/flock.c b/fs/afs/flock.c
new file mode 100644
index 000..bb97105
--- /dev/null
+++ b/fs/afs/flock.c
@@ -0,0 +1,590 @@
+/* AFS file locking support
+ *
+ * Copyright (C) 2007 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells ([EMAIL PROTECTED])
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
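The non-fair policy named in the changelog ("new readlocks will jump pending writelocks if there's a readlock currently granted") can be sketched as a small userspace model. This is not the code from fs/afs/flock.c, which is not shown here; `struct file_locks` and `grant_immediately()` are hypothetical names, and the handling of the case where nothing is granted but writers are queued is an assumption made only to keep the model complete.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum req_type { REQ_READ, REQ_WRITE };

/* Simplified per-file state; field names are illustrative only. */
struct file_locks {
	size_t granted_readers;	/* readlocks currently granted */
	bool   granted_writer;	/* true if a writelock is currently granted */
	size_t pending_writers;	/* writelocks waiting for the file */
};

/*
 * Decide whether a new request can be granted at once.
 * Non-fair rule: a new readlock is granted whenever a readlock is already
 * held, even though writelocks may have been queued before it arrived.
 */
static bool grant_immediately(struct file_locks *f, enum req_type t)
{
	assert(f != NULL);

	if (t == REQ_READ) {
		if (f->granted_writer)
			return false;
		/* Assumption: with no reader holding the file, queued writers go first. */
		if (f->granted_readers == 0 && f->pending_writers > 0)
			return false;
		f->granted_readers++;
		return true;
	}

	/* A writelock needs the file to be completely unlocked. */
	if (f->granted_writer || f->granted_readers > 0) {
		f->pending_writers++;
		return false;
	}
	f->granted_writer = true;
	return true;
}
```

A consequence worth noting is that a steady stream of readers can starve a waiting writer; the changelog accepts this for the sake of matching the VFS behaviour.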
[PATCH] add procfs tunable to enable immediate panic when there are busy inodes after umount
After spending quite a bit of time tracking down a "VFS: busy inodes after unmount" problem, it occurs to me that it would be nice to be able to force a panic when that occurs. While an oops message alone is not generally helpful for tracking down this sort of problem, collecting and analyzing a coredump when this occurs can be.

The following patch adds a procfs tunable that allows you to force a core when a "busy inodes after umount" problem occurs. It also changes the classic error message to be something a bit less cryptic to users.

Signed-off-by: Jeff Layton [EMAIL PROTECTED]

diff --git a/fs/block_dev.c b/fs/block_dev.c
diff --git a/fs/inode.c b/fs/inode.c
index 9a012cc..0e638b0 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -327,7 +327,7 @@ static int invalidate_list(struct list_head *head, struct list_head *dispose)
 			count++;
 			continue;
 		}
-		busy = 1;
+		++busy;
 	}
 	/* only unused inodes may be cached with i_count zero */
 	inodes_stat.nr_unused -= count;
diff --git a/fs/super.c b/fs/super.c
index 5260d62..9c2871b 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -287,6 +287,8 @@ int fsync_super(struct super_block *sb)
 void generic_shutdown_super(struct super_block *sb)
 {
 	const struct super_operations *sop = sb->s_op;
+	extern int umount_debug;
+	int busy;

 	if (sb->s_root) {
 		shrink_dcache_for_umount(sb);
@@ -303,10 +305,15 @@ void generic_shutdown_super(struct super_block *sb)
 			sop->put_super(sb);

 		/* Forget any remaining inodes */
-		if (invalidate_inodes(sb)) {
-			printk("VFS: Busy inodes after unmount of %s. "
-			   "Self-destruct in 5 seconds. Have a nice day...\n",
-			   sb->s_id);
+		if (busy = invalidate_inodes(sb)) {
+			printk("VFS: %d busy inodes after unmount of %s. ",
+				busy, sb->s_id);
+			if (umount_debug != 0) {
+				printk("Crashing host on request.\n");
+				BUG();
+			} else {
+				printk("This machine will likely crash eventually. Consider a reboot.\n");
+			}
 		}

 		unlock_kernel();
diff --git a/include/linux/fs.h b/include/linux/fs.h
diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
index 47f1c53..176b984 100644
--- a/include/linux/sysctl.h
+++ b/include/linux/sysctl.h
@@ -818,6 +818,7 @@ enum
 	FS_AIO_NR=18,	/* current system-wide number of aio requests */
 	FS_AIO_MAX_NR=19,	/* system-wide maximum number of aio requests */
 	FS_INOTIFY=20,	/* inotify submenu */
+	FS_UMOUNT_DEBUG=21,	/* busy inodes on umount debug switch */
 	FS_OCFS2=988,	/* ocfs2 */
 };

diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 30ee462..8e62c34 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -156,6 +156,7 @@ extern ctl_table pty_table[];
 extern ctl_table inotify_table[];
 #endif

+int umount_debug;
 #ifdef HAVE_ARCH_PICK_MMAP_LAYOUT
 int sysctl_legacy_va_layout;
 #endif
@@ -962,6 +963,14 @@ static ctl_table fs_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
 	},
+	{
+		.ctl_name	= FS_UMOUNT_DEBUG,
+		.procname	= "umount_debug",
+		.data		= &umount_debug,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
 #ifdef CONFIG_DNOTIFY
 	{
 		.ctl_name	= FS_DIR_NOTIFY,
Re: [PATCH] add procfs tunable to enable immediate panic when there are busy inodes after umount
On Tue, May 29, 2007 at 11:40:42AM -0400, Jeff Layton wrote:
> After spending quite a bit of time tracking down a "VFS: busy inodes after
> unmount" problem, it occurs to me that it would be nice to be able to force
> a panic when that occurs. While an oops message alone is not generally
> helpful for tracking down this sort of problem, collecting and analyzing a
> coredump when this occurs can be.
>
> The following patch adds a procfs tunable that allows you to force a core
> when a "busy inodes after umount" problem occurs. It also changes the
> classic error message to be something a bit less cryptic to users.
>
> @@ -303,10 +305,15 @@ void generic_shutdown_super(struct super_block *sb)
>  			sop->put_super(sb);
>
>  		/* Forget any remaining inodes */
> -		if (invalidate_inodes(sb)) {
> -			printk("VFS: Busy inodes after unmount of %s. "
> -			   "Self-destruct in 5 seconds. Have a nice day...\n",
> -			   sb->s_id);
> +		if (busy = invalidate_inodes(sb)) {
> +			printk("VFS: %d busy inodes after unmount of %s. ",
> +				busy, sb->s_id);
> +			if (umount_debug != 0) {
> +				printk("Crashing host on request.\n");
> +				BUG();
> +			} else {
> +				printk("This machine will likely crash eventually. Consider a reboot.\n");
> +			}

You can add just BUG_ON here and do

	echo 1 > /proc/sys/kernel/panic_on_oops
Re: [patch 2/2] i_version update - ext4 part
On Fri, 2007-05-25 at 18:25 +0200, Jean noel Cordenner wrote:
> The patch is on top of the ext4 tree:
> http://repo.or.cz/w/ext4-patch-queue.git
>
> In this part, the i_version counter is stored into 2 32bit fields of the
> ext4_inode structure osd1.linux1.l_i_version and i_version_hi.
>
> I included the ext4_expand_inode_extra_isize patch, which does part of the
> job, checking if there is enough room for extra fields in the inode
> (i_version_hi). The other patch increments the counter on inode
> modifications and set it on inode creation.
>
> plain text document attachment (i_version_update_ext4)
> This patch is on top of i_version_update_vfs.
> The i_version field of the inode is set on inode creation and incremented when
> the inode is being modified.

I am a little bit confused about the two patches.

It appears that in the ext4_expand_inode_extra_isize patch by Kalpak, a new 64-bit i_fs_version field is added to the ext4 inode structure for inode versioning support. Read/store of this counter are properly handled, but the inode versioning counter update is missing.

But later in the second patch by Jean Noel, he re-used the VFS inode->i_version for ext4 inode versioning; the counter is updated every time the file is changed.

To me, i_fs_version and inode_version are the same thing, right? Shouldn't we choose one (I assume inode i_version?) and combine these two patches together? How about splitting the inode versioning part from the ext4_expand_inode_extra_isize patch (it does multiple things, and i_versioning doesn't belong there) and putting it together with the rest of the i_version update patches?

BTW, how could NFS/user space access the inode version counter?

Thanks,
Mingming

> Signed-off-by: Jean Noel Cordenner [EMAIL PROTECTED]
>
> Index: linux-2.6.22-rc2-ext4-1/fs/ext4/ialloc.c
> ===
> --- linux-2.6.22-rc2-ext4-1.orig/fs/ext4/ialloc.c	2007-05-25 18:05:28.0 +0200
> +++ linux-2.6.22-rc2-ext4-1/fs/ext4/ialloc.c	2007-05-25 18:05:40.0 +0200
> @@ -565,6 +565,7 @@
>  	inode->i_blocks = 0;
>  	inode->i_mtime = inode->i_atime = inode->i_ctime = ei->i_crtime =
>  						ext4_current_time(inode);
> +	inode->i_version = 1;
>
>  	memset(ei->i_data, 0, sizeof(ei->i_data));
>  	ei->i_dir_start_lookup = 0;
> Index: linux-2.6.22-rc2-ext4-1/fs/ext4/inode.c
> ===
> --- linux-2.6.22-rc2-ext4-1.orig/fs/ext4/inode.c	2007-05-25 18:05:28.0 +0200
> +++ linux-2.6.22-rc2-ext4-1/fs/ext4/inode.c	2007-05-25 18:05:40.0 +0200
> @@ -3082,6 +3082,7 @@
>  {
>  	int err = 0;
>
> +	inode->i_version++;
>  	/* the do_update_inode consumes one bh->b_count */
>  	get_bh(iloc->bh);
>
> Index: linux-2.6.22-rc2-ext4-1/fs/ext4/super.c
> ===
> --- linux-2.6.22-rc2-ext4-1.orig/fs/ext4/super.c	2007-05-25 18:05:28.0 +0200
> +++ linux-2.6.22-rc2-ext4-1/fs/ext4/super.c	2007-05-25 18:05:40.0 +0200
> @@ -2839,8 +2839,8 @@
>  		i_size_write(inode, off+len-towrite);
>  		EXT4_I(inode)->i_disksize = inode->i_size;
>  	}
> -	inode->i_version++;
>  	inode->i_mtime = inode->i_ctime = CURRENT_TIME;
> +	inode->i_version = 1;
>  	ext4_mark_inode_dirty(handle, inode);
>  	mutex_unlock(&inode->i_mutex);
>  	return len - towrite;
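The message above mentions storing the i_version counter in two 32-bit on-disk fields. A minimal sketch of that split is given below, assuming a plain 64-bit counter. `struct disk_version` and the helper names are hypothetical stand-ins and not the real ext4_inode layout, and the little-endian conversion that real on-disk code would need is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative stand-in for the two on-disk 32-bit fields. */
struct disk_version {
	uint32_t lo;	/* plays the role of osd1.linux1.l_i_version */
	uint32_t hi;	/* plays the role of i_version_hi */
};

/* Split a 64-bit in-memory counter into its two 32-bit halves. */
static struct disk_version version_to_disk(uint64_t v)
{
	struct disk_version d = {
		.lo = (uint32_t)(v & 0xffffffffu),
		.hi = (uint32_t)(v >> 32),
	};
	return d;
}

/* Rebuild the 64-bit counter from the two halves. */
static uint64_t version_from_disk(struct disk_version d)
{
	return ((uint64_t)d.hi << 32) | d.lo;
}

/* Sanity helper: storing and reloading must not change the value. */
static void version_roundtrip_check(uint64_t v)
{
	assert(version_from_disk(version_to_disk(v)) == v);
}
```

If the inode has no room for the high half, only the low 32 bits would survive a store, which is presumably why the extra-isize check matters here.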
Re: [PATCH] add procfs tunable to enable immediate panic when there are busy inodes after umount
On Tue, 29 May 2007 23:38:13 +0400 Alexey Dobriyan [EMAIL PROTECTED] wrote:
> On Tue, May 29, 2007 at 11:40:42AM -0400, Jeff Layton wrote:
> > After spending quite a bit of time tracking down a "VFS: busy inodes after
> > unmount" problem, it occurs to me that it would be nice to be able to force
> > a panic when that occurs. While an oops message alone is not generally
> > helpful for tracking down this sort of problem, collecting and analyzing a
> > coredump when this occurs can be.
> >
> > The following patch adds a procfs tunable that allows you to force a core
> > when a "busy inodes after umount" problem occurs. It also changes the
> > classic error message to be something a bit less cryptic to users.
> >
> > @@ -303,10 +305,15 @@ void generic_shutdown_super(struct super_block *sb)
> >  			sop->put_super(sb);
> >
> >  		/* Forget any remaining inodes */
> > -		if (invalidate_inodes(sb)) {
> > -			printk("VFS: Busy inodes after unmount of %s. "
> > -			   "Self-destruct in 5 seconds. Have a nice day...\n",
> > -			   sb->s_id);
> > +		if (busy = invalidate_inodes(sb)) {
> > +			printk("VFS: %d busy inodes after unmount of %s. ",
> > +				busy, sb->s_id);
> > +			if (umount_debug != 0) {
> > +				printk("Crashing host on request.\n");
> > +				BUG();
> > +			} else {
> > +				printk("This machine will likely crash eventually. Consider a reboot.\n");
> > +			}
>
> You can add just BUG_ON here and do
>
> 	echo 1 > /proc/sys/kernel/panic_on_oops

I certainly could, but the problem is that there's little point in panicking immediately here if you can't collect a coredump. Oops messages aren't very helpful for tracking this sort of thing down, so I'd think we want the BUG() conditional on something more granular than panic_on_oops.

--
Jeff Layton [EMAIL PROTECTED]
Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
Neil Brown wrote:
> md/dm modules could keep count of requests as has been suggested
> (though that would be a fairly big change for raid0 as it currently
> doesn't know when a request completes - bi_endio goes directly to the
> filesystem).

Are you sure? I believe that dm handles bi_endio because it waits for all in-progress bios to complete before switching tables.

> 2/ Maybe barriers provide stronger semantics than are required.
>
> All write requests are synchronised around a barrier write. This is
> often more than is required and apparently can cause a measurable
> slowdown.

I'm not quite sure I understand this correctly, but the purpose of a barrier request is to prevent the elevator from reordering requests around a barrier. Previous requests must be completed before the barrier, and later requests must be executed after. That is a sufficiently strong guarantee for careful-write or journal filesystems to ensure that a log block hits the disk before the actual transaction blocks, and then the log block is marked as complete only after the actual transaction. This is a weaker guarantee than a flush, and allows for some reordering to improve performance.

> Also the FUA for the actual commit write might not be needed. It is
> important for consistency that the preceding writes are in safe
> storage before the commit write, but it is not so important that the
> commit write is immediately safe on storage. That isn't needed until
> a 'sync' or 'fsync' or similar.

Right, the barrier doesn't need to be flushed right away, so the elevator could complete writes after the barrier if it wishes, then complete the ones before, and finally the barrier itself. Not setting the FUA bit allows the disk to cache the barrier write so it can be completed sooner, but before the queue sends any more requests to the disk, it must be flushed to ensure that the barrier has hit the media before the new requests.

> One possible alternative is:
>  - writes can overtake barriers, but barrier cannot overtake writes.
>  - flush before the barrier, not after.
>
> This is considerably weaker, and hence cheaper. But I think it is
> enough for all filesystems (providing it is still an option to call
> blkdev_issue_flush on 'fsync').

Again I am not sure I quite understand what you mean here, but only writes issued after the barrier can complete before the barrier. Those issued before the barrier cannot overtake it in the queue.

> Another alternative would be to tag each bio as being in a
> particular barrier-group. Then bio's in different groups could
> overtake each other in either direction, but a BARRIER request must
> be totally ordered w.r.t. other requests in the barrier group.
> This would require an extra bio field, and would give the filesystem
> more appearance of control. I'm not yet sure how much it would
> really help...
> It would allow us to set FUA on all bios with a non-zero
> barrier-group. That would mean we don't have to flush the entire
> cache, just those blocks that are critical but I'm still not sure
> it's a good idea.

This all seems unnecessary work.
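The two orderings discussed in this message, the full barrier and the weaker "writes can overtake barriers, but barrier cannot overtake writes" alternative, can be compared with a small check on completion order. This is a userspace sketch only: `order_ok()` and its parameters are invented for illustration and say nothing about how the block layer actually enforces ordering.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Requests are numbered by issue order, 0..n-1. done[k] is the issue
 * index of the k-th request to complete, and 'barrier' is the issue
 * index of the barrier request.
 *
 * In both modes a request issued before the barrier must complete
 * before it. With strict == true (full barrier) a request issued after
 * the barrier must also complete after it; with strict == false later
 * writes are allowed to overtake the barrier.
 */
static bool order_ok(const size_t *done, size_t n, size_t barrier, bool strict)
{
	bool barrier_done = false;

	assert(done != NULL && barrier < n);
	for (size_t k = 0; k < n; k++) {
		size_t req = done[k];

		if (req == barrier)
			barrier_done = true;
		else if (req < barrier && barrier_done)
			return false;	/* barrier overtook an earlier write */
		else if (req > barrier && !barrier_done && strict)
			return false;	/* later write overtook the barrier */
	}
	return true;
}
```

For example, with requests 0, 1, 2 (barrier), 3, the completion order 0, 3, 1, 2 is rejected under the full-barrier rule but accepted under the weaker one, while 0, 2, 1, 3 is rejected under both.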
Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
David Chinner wrote:
> Sounds good to me, but how do we test to see if the underlying
> device supports barriers? Do we just assume that they do and
> only change behaviour if -o nobarrier is specified in the mount
> options?

The idea is that ALL block devices will support barriers; if the underlying driver doesn't, then the block layer will work around it.

> The use of barriers in XFS assumes the commit write to be on stable
> storage before it returns. One of the ordering guarantees that we
> need is that the transaction (commit write) is on disk before the
> metadata block containing the change in the transaction is written
> to disk and the current barrier behaviour gives us that.

Barrier != synchronous write, so if XFS relies on that block being on the media when the request is completed, then it is broken. It should only care that the ordering of log-data-log is maintained, not exactly when each specific request completes.
Re: [PATCH] AFS: Implement file locking
On Tue, May 29, 2007 at 10:34:41AM +0100, David Howells wrote:
> I'll need to test the upgrade/downgrade case. I don't know whether the AFS
> server supports that. If it doesn't, I can emulate downgrade, but not
> upgrade - not unless I only ever ask it for exclusive locks.
>
> Lock upgrading is really, really easy to contrive deadlock for.

Any such deadlock is the user's fault. But, right, I agree that upgrades are probably hard to use correctly. And that implementing them shouldn't be a priority in the case of AFS. Just as long as the implementation doesn't completely fall over when somebody attempts an upgrade.

--b.
Re: [PATCH 08/13] NFS: Add functions to parse nfs mount options to fs/nfs/super.c
On Mon, May 21, 2007 at 12:09:54PM -0400, Chuck Lever wrote:
> For NFSv2 and NFSv3 mount options.
>
> Signed-off-by: Chuck Lever [EMAIL PROTECTED]
>
> +static int nfs_parse_options(char *raw, struct nfs_mount_args *mnt)
> +{
> +	char *p, *string;
> +
> +	if (!raw) {
> +		dprintk("NFS: mount options string was NULL.\n");
> +		return 1;
> +	}
> +
> +	while ((p = strsep (&raw, ",")) != NULL) {
> +		substring_t args[MAX_OPT_ARGS];
> +		int option, token;
> +
> +		if (!*p)
> +			continue;
> +		token = match_token(p, nfs_tokens, args);

> +		case Opt_context:
> +			match_strcpy(mnt->nmd.context, args);
> +			break;

The userspace version (nfs-utils) of this code supports quoted context strings. For example:

	context="aaa,bbb,ccc",hard

It seems your code blindly parses a raw option string by ",".

Karel

--
Karel Zak [EMAIL PROTECTED]
Re: [PATCH 08/13] NFS: Add functions to parse nfs mount options to fs/nfs/super.c
Karel Zak wrote:
> On Mon, May 21, 2007 at 12:09:54PM -0400, Chuck Lever wrote:
> > For NFSv2 and NFSv3 mount options.
> >
> > Signed-off-by: Chuck Lever [EMAIL PROTECTED]
> >
> > +static int nfs_parse_options(char *raw, struct nfs_mount_args *mnt)
> > +{
> > +	char *p, *string;
> > +
> > +	if (!raw) {
> > +		dprintk("NFS: mount options string was NULL.\n");
> > +		return 1;
> > +	}
> > +
> > +	while ((p = strsep (&raw, ",")) != NULL) {
> > +		substring_t args[MAX_OPT_ARGS];
> > +		int option, token;
> > +
> > +		if (!*p)
> > +			continue;
> > +		token = match_token(p, nfs_tokens, args);
>
> > +		case Opt_context:
> > +			match_strcpy(mnt->nmd.context, args);
> > +			break;
>
> The userspace version (nfs-utils) of this code supports quoted
> context strings. For example:
>
> 	context="aaa,bbb,ccc",hard
>
> It seems your code blindly parses a raw option string by ",".

Karel-

I've never used the context= option, and didn't find any documentation describing how it was used. Is there a clean example of how to use the in-kernel parser to handle quoted strings containing commas?

--
Chuck Lever
Principal Member of Staff
Oracle Corporation, Corporate Architecture: Linux Projects Group
http://oss.oracle.com/~cel/
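As a possible starting point for the question above, here is a userspace sketch of a strsep-like splitter that does not break on commas inside double quotes, so that an option such as `context="aaa,bbb,ccc",hard` yields two tokens. It is not the in-kernel parser API: `next_option()` and `unquote()` are hypothetical helpers written for illustration, and escaped quotes are deliberately not handled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Like strsep(&s, ","), except that commas inside double quotes do not
 * split. Returns the next option, NUL-terminated in place, or NULL once
 * the string is exhausted. Quote characters stay in the returned token.
 */
static char *next_option(char **sp)
{
	char *start;
	int in_quotes = 0;

	assert(sp != NULL);
	start = *sp;
	if (start == NULL)
		return NULL;

	for (char *p = start; *p != '\0'; p++) {
		if (*p == '"') {
			in_quotes = !in_quotes;
		} else if (*p == ',' && !in_quotes) {
			*p = '\0';
			*sp = p + 1;
			return start;
		}
	}
	*sp = NULL;	/* that was the last token */
	return start;
}

/* Strip one pair of surrounding double quotes from a value, in place. */
static char *unquote(char *v)
{
	size_t len = strlen(v);

	if (len >= 2 && v[0] == '"' && v[len - 1] == '"') {
		v[len - 1] = '\0';
		return v + 1;
	}
	return v;
}
```

A caller would loop on `next_option()` exactly as the quoted code loops on strsep(), then pass each token to the existing token matcher and call `unquote()` on the value part.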
Re: [AppArmor 01/41] Pass struct vfsmount to the inode_create LSM hook
On Mon, 28 May 2007 21:54:46 EDT, Kyle Moffett said:
> Average users are not supposed to be writing security policy. To be
> honest, even average-level system administrators should not be
> writing security policy. It's OK for such sysadmins to tweak
> existing policy to give access to additional web-docs or such, but
> only expert sysadmin/developers or security professionals should be
> writing security policy. It's just too damn easy to get completely
> wrong.

The single biggest challenge in computer security at the present time is how to build *and deploy* servers that stay reasonably secure even when run by the average wave-a-dead-chicken sysadmin, and desktop-class boxes that can survive the best attempts of Joe Sixpack's "Ooh shiny" reflex, and Joe's kid's attempts to evade the nannyware that Joe had somebody install.

(If you know how to build such things, don't bother replying. If you have actual field experience on getting significant percents of Joe Sixpacks to switch, I need to buy you a beer or something.. ;)
Re: [PATCH] AFS: Implement file locking [try #2]
One more vague question I had while skimming the previous version--

On Tue, May 29, 2007 at 03:54:27PM +0100, David Howells wrote:
> +static void afs_grant_locks(struct afs_vnode *vnode, struct file_lock *fl)
> +{
> +	struct file_lock *p, *_p;
> +
> +	list_move_tail(&fl->fl_u.afs.link, &vnode->granted_locks);
> +	if (fl->fl_type == F_RDLCK) {
> +		list_for_each_entry_safe(p, _p, &vnode->pending_locks,
> +					 fl_u.afs.link) {
> +			if (p->fl_type == F_RDLCK) {
> +				p->fl_u.afs.state = AFS_LOCK_GRANTED;
> +				list_move_tail(&p->fl_u.afs.link,
> +					       &vnode->granted_locks);
> +				wake_up(&p->fl_wait);
> +			}
> +		}
> +	}
> +}

--without having tried to understand how they're actually used, these data structures (like the pending_locks and granted_locks lists) seem to duplicate stuff that's already kept in fs/locks.c. Is there a reason they're required?

--b.
Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
On Tue, May 29, 2007 at 11:25:42AM +0200, Stefan Bader wrote:
> doing a sort of suspend, issuing the
> barrier request, calling flush to all mapped devices and then wait for
> in-flight I/O to go to zero?

Something like that is needed for some dm targets to support barriers. (We needn't always wait for *all* in-flight I/O.) When faced with -EOPNOTSUPP, do all callers fall back to a sync in the places a barrier would have been used, or are there any more sophisticated strategies attempting to optimise code without barriers?

> I am not a hundred percent sure about
> that but I think that just passing the barrier flag on to mapped
> devices might in some (maybe they are rare) cases cause a layer above
> to think all data is on-disk while this isn't necessarily true (see my
> previous post). What do you think?

An efficient I/O barrier implementation would not normally involve flushing AFAIK: dm surely wouldn't cause a higher layer to assume stronger semantics than are provided.

Alasdair
--
[EMAIL PROTECTED]
Re: [patch 2/2] i_version update - ext4 part
On May 29, 2007 12:44 -0700, Mingming Cao wrote: I am a little bit confused about the two patches. It appears in the ext4_expand_inode_extra_isize patch by Kalpak, there a new 64 bit i_fs_version field is added to ext4 inode structure for inode versioning support. read/store of this counter are properly handled, but missing the inode versioning counter update. For the Lustre use of the inode version we don't care about the VFS changes to i_version. In fact - we want to be able to control the changes to inode version ourselves so that e.g. file defragmenting or atime updates don't change the inode version, and that recovery can restore the version to a known state along with the rest of the metadata. That said, since Lustre isn't in the kernel and we patch our version of ext3 anyways it doesn't really matter what is done for NFS. We will just patch in our own behaviour if the final ext4 code isn't suitable in all of the details. Having 99% of the code the same at least makes this a lot less work. But later in the second patch by Jean Noel, he re-used the VFS inode- i_version for ext4 inode versioning, the counter is being updated every time the file is being changed. I don't know what the NFS requirements for the version are. There may also be some complaints from others if the i_version is 64 bits because this contributes to generic inode growth and isn't used for other filesystems. To me, i_fs_version and inode_version are the same thing, right? Shouldn't we choose one(I assume inode i_version?), and combine these two patch together? How about split the inode versioning part from the ext4_expand_inode_extra_isize patch(it does multiple things, and i_versioning doesn't longs there) and put it together with the rest of i_version update patches? I don't have an objection to that, but I don't think it is required. BTW, how could NFS/user space to access the inode version counter? If the Bull patch uses i_version then knfsd can just access it directly. 
I don't think there is any API to access it from userspace. One option is to add a virtual EA like user.inode_version and have the kernel fill this in from i_version. Lustre will manipulate the ei->i_fs_version directly.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
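
As an illustration of the virtual EA idea in the message above, here is a minimal userspace sketch in C. It is only a sketch under stated assumptions: the attribute name user.inode_version comes from the suggestion itself and is not known to be exported by any kernel, the decimal-text encoding of the value is an assumption made here, and the helper names parse_inode_version and read_inode_version are hypothetical. Only getxattr(2) and strtoull(3) are real interfaces.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/xattr.h>

/* Hypothetical attribute name, taken from the suggestion in the thread. */
#define INODE_VERSION_XATTR "user.inode_version"

/*
 * Parse a decimal string into a 64-bit version number.
 * The decimal-text encoding is an assumption of this sketch.
 * Returns 0 on success, -1 if the text is empty or malformed.
 */
int parse_inode_version(const char *text, uint64_t *version)
{
	char *end = NULL;
	unsigned long long value;

	/* Require a leading digit so signs and whitespace are rejected. */
	if (text == NULL || *text < '0' || *text > '9')
		return -1;

	errno = 0;
	value = strtoull(text, &end, 10);
	if (errno != 0 || *end != '\0')
		return -1;

	*version = (uint64_t)value;
	return 0;
}

/*
 * Read the (hypothetical) version attribute of a file.
 * Returns 0 on success, -1 if the attribute cannot be read or parsed.
 */
int read_inode_version(const char *path, uint64_t *version)
{
	char buf[32];
	ssize_t len = getxattr(path, INODE_VERSION_XATTR, buf, sizeof(buf) - 1);

	if (len < 0)
		return -1;

	buf[len] = '\0';
	return parse_inode_version(buf, version);
}
```

On a kernel without such an attribute, read_inode_version simply fails, which is the expected outcome today; the point is only that an EA would need no new system call.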
Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
On Tue, May 29, 2007 at 04:03:43PM -0400, Phillip Susi wrote:

David Chinner wrote:

The use of barriers in XFS assumes the commit write to be on stable storage before it returns. One of the ordering guarantees that we need is that the transaction (commit write) is on disk before the metadata block containing the change in the transaction is written to disk and the current barrier behaviour gives us that.

Barrier != synchronous write,

Of course. FYI, XFS only issues barriers on *async* writes.

But barrier semantics - as far as they've been described by everyone but you - indicate that the barrier write is guaranteed to be on stable storage when it returns.

so if XFS relies on that block being on the media when the request is completed, then it is broken.

XFS relies on the block being stable before any other write goes to disk. That is the semantic that the barrier I/Os currently have. How that is implemented in the device is irrelevant to me, but if I issue a barrier I/O, I do not expect *any* I/O to be reordered around it.

It should only care that the ordering of log-data-log is maintained, not exactly when each specific request completes.

Yes, and that is provided to XFS by the fact that barrier I/Os are full barriers.

Cheers, Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
On Wed, 30 May 2007, David Chinner wrote:

On Tue, May 29, 2007 at 04:03:43PM -0400, Phillip Susi wrote:

David Chinner wrote:

The use of barriers in XFS assumes the commit write to be on stable storage before it returns. One of the ordering guarantees that we need is that the transaction (commit write) is on disk before the metadata block containing the change in the transaction is written to disk and the current barrier behaviour gives us that.

Barrier != synchronous write,

Of course. FYI, XFS only issues barriers on *async* writes.

But barrier semantics - as far as they've been described by everyone but you indicate that the barrier write is guaranteed to be on stable storage when it returns.

this doesn't match what I have seen with barriers

it's perfectly legal to have the following sequence of events

1. app writes block 10 to OS
2. app writes block 4 to OS
3. app writes barrier to OS
4. app writes block 5 to OS
5. app writes block 20 to OS
6. OS writes block 4 to disk drive
7. OS writes block 10 to disk drive
8. OS writes barrier to disk drive
9. OS writes block 5 to disk drive
10. OS writes block 20 to disk drive
11. disk drive writes block 10 to platter
12. disk drive writes block 4 to platter
13. disk drive writes block 20 to platter
14. disk drive writes block 5 to platter

there is nothing that says that when the app finishes step #3 that the OS has even sent the data to the drive, let alone that the drive has flushed it to a platter

if the disk drive doesn't support barriers then step #8 becomes 'issue flush' and steps 11 and 12 take place before steps #9, 13, 14

David Lang
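
The ordering rule that David Lang's sequence relies on can be checked mechanically. The following C sketch is not kernel code; the function name barrier_order_preserved and the array-based model are assumptions made here for illustration, and block numbers are assumed to be distinct. It covers only the ordering property (no write issued before the barrier reaches the platter after a write issued behind it) and says nothing about when completion is reported, which is the point still in dispute in this thread.

```c
#include <stdbool.h>
#include <stddef.h>

/* Return true if `block` occurs in the first `n` entries of `set`. */
static bool contains(const int *set, size_t n, int block)
{
	for (size_t i = 0; i < n; i++) {
		if (set[i] == block)
			return true;
	}
	return false;
}

/*
 * Check a platter write order against one barrier.
 *
 * before/after: blocks submitted ahead of and behind the barrier.
 * platter:      the order in which the drive actually wrote the blocks.
 *
 * Writes may be reordered freely on either side of the barrier, but a
 * pre-barrier block must never land after any post-barrier block.
 */
bool barrier_order_preserved(const int *before, size_t n_before,
			     const int *after, size_t n_after,
			     const int *platter, size_t n_platter)
{
	bool seen_after = false;

	for (size_t i = 0; i < n_platter; i++) {
		if (contains(after, n_after, platter[i]))
			seen_after = true;
		else if (seen_after && contains(before, n_before, platter[i]))
			return false; /* a pre-barrier write crossed the barrier */
	}
	return true;
}
```

With before = {10, 4} and after = {5, 20}, the platter order 10, 4, 20, 5 from steps 11 to 14 passes the check, while an order such as 10, 5, 4, 20 fails it because block 4 lands behind block 5.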
Re: [PATCH 08/13] NFS: Add functions to parse nfs mount options to fs/nfs/super.c
On Tue, May 29, 2007 at 05:08:01PM -0400, Chuck Lever wrote:

Karel Zak wrote:

On Mon, May 21, 2007 at 12:09:54PM -0400, Chuck Lever wrote:

For NFSv2 and NFSv3 mount options.

Signed-off-by: Chuck Lever [EMAIL PROTECTED]

+static int nfs_parse_options(char *raw, struct nfs_mount_args *mnt)
+{
+	char *p, *string;
+
+	if (!raw) {
+		dprintk("NFS: mount options string was NULL.\n");
+		return 1;
+	}
+
+	while ((p = strsep (&raw, ",")) != NULL) {
+		substring_t args[MAX_OPT_ARGS];
+		int option, token;
+
+		if (!*p)
+			continue;
+		token = match_token(p, nfs_tokens, args);
+
+		case Opt_context:
+			match_strcpy(mnt->nmd.context, args);
+			break;

The userspace version (nfs-utils) of this code supports quoted context strings. For example:

context="aaa,bbb,ccc",hard

It seems your code blindly parses a raw option string by ",".

Karel-

I've never used the context= option, and didn't find any documentation describing how it was used.

That's SELinux stuff. See original discussion:

http://thread.gmane.org/gmane.linux.redhat.security.lspp/1002/focus=1004

There are also fscontext, defcontext and context for normal (non-NFS) mounts. See the mount.8 patch (which has the basic docs):

http://git.kernel.org/?p=utils/util-linux-ng/util-linux-ng.git;a=blobdiff;f=mount/mount.8;h=8ed5a11b77985c8da2dcac4602a67f8785a95070;hp=4692a42b3487b8e0db6dc0b7d17cfd214e8aefc8;hb=3a620ba4bffade41d81c429560c40bb65c9b81a7;hpb=6573c985a4077fa7d50ccb993bae177526fde8ec

Is there a clean example of how to use the in-kernel parser to handle quoted strings containing commas?

Not sure. It was introduced by [PATCH] SELinux: support mls categories for context mounts:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=3528a95322b5c1ce882ab723f175a1845430cd89

The SELinux specific options are extracted from mount options by the sb_copy_data hook (fs/super.c, vfs_kern_mount()) -- that's probably transparent for all filesystems, maybe for your NFS options too. (I didn't study it in detail.)

Karel
--
Karel Zak [EMAIL PROTECTED]
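
To make the quoting problem raised above concrete, here is a small userspace sketch in C of a strsep-like splitter that does not break on commas inside double quotes. It is not the in-kernel parser and not the nfs-utils code; the function name next_mount_option is hypothetical, and the sketch assumes the quoted form looks like context="aaa,bbb,ccc",hard with plain double quotes and no escape sequences.

```c
#include <stddef.h>

/*
 * Return the next comma-separated option from *cursor, or NULL when the
 * string is exhausted. Works like strsep(cursor, ","), except that a comma
 * between double quotes does not end the option. The buffer is modified in
 * place and the quote characters are left in the returned token.
 */
char *next_mount_option(char **cursor)
{
	char *start = *cursor;
	int in_quotes = 0;

	if (start == NULL)
		return NULL;

	for (char *p = start; *p != '\0'; p++) {
		if (*p == '"') {
			in_quotes = !in_quotes;
		} else if (*p == ',' && !in_quotes) {
			*p = '\0';         /* terminate this option */
			*cursor = p + 1;   /* resume after the comma */
			return start;
		}
	}

	*cursor = NULL; /* last option: nothing left to scan */
	return start;
}
```

Called repeatedly on a writable copy of context="aaa,bbb,ccc",hard, this yields the whole quoted context option as one token and then hard, where a plain split on "," would produce four fragments. Like strsep, it returns empty tokens for consecutive commas, so a caller would still skip those.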
Re: [PATCH] add procfs tunable to enable immediate panic when there are busy inodes after umount
On Tue, May 29, 2007 at 11:40:42AM -0400, Jeff Layton wrote:

After spending quite a bit of time tracking down a "VFS: busy inodes after unmount" problem, it occurs to me that it would be nice to be able to force a panic when that occurs. While an oops message alone is not generally helpful for tracking down this sort of problem, collecting and analyzing a coredump when this occurs can be.

Agreed - we've found that we've had roughly 50% success in finding the cause of these problems from crash dumps triggered immediately like this vs ~0% from a crash that occurred some time later.

Given that this problem will always result in a crash of the kernel at some random time in the future, why don't we just make this error an unconditional panic and get the crash over and done with?

Cheers, Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
Re: [AppArmor 01/41] Pass struct vfsmount to the inode_create LSM hook
2007/5/29, Kyle Moffett [EMAIL PROTECTED]:

But writing policy with labels is a somewhat indirect way (I mean, we need ls -Z or ps -Z). An indirect way can cause flaws, so we need a lot of work; that is what I wanted to say.

I don't really use ls -Z or ps -Z when writing SELinux policy; I do that only when I actually think I mislabeled files.

I believe what you wrote, but it may not be as easy for average Linux users.

As I said before, average Linux users should not be writing their own security policy. I have yet to meet an average Linux user who could reliably quote for me what the file permissions on the /tmp directory should be, or what the sticky bit was. A small percentage of average Linux system administrators don't get that right consistently, and if you don't understand the sticky bit then you should *definitely* not be controlling program permissions on a per-syscall basis.

Thank you for your detailed and thoughtful explanation. Things are much clearer for me now.

Although your explanation was quite persuasive, I still have some concerns. Linux is now being used literally everywhere. As devices, technologies and Linux itself are evolving so quickly, I'm afraid the way you showed is right but can never meet every goal perfectly. So in some areas, including embedded and special distros I guess, there must be solutions and help for average-level administrators.

I think there are two ways to make secure systems. One is just as you wrote: the "ask the professionals" way; the other is building up practice. You might ask how? My answer to the question is pathname-based systems such as AppArmor and TOMOYO Linux. They can't be compared to SELinux, but they should be considered supplemental tools. At least they are helpful for analyzing how Linux works. Tweaking SELinux policy is not easy, but writing policies for them is relatively easy (I'm not talking about security here).

Not everybody can be a professional administrator, but he/she can be a professional administrator of his/her own system. I believe there must be solutions for non-professional administrators. That's why we developed TOMOYO Linux (http://tomoyo.sourceforge.jp/), and so was AppArmor I guess. You might laugh, but we are doing this because we want to contribute to Linux and its community. :)

Thanks,
Toshiharu Harada
Re: + fs-introduce-write_begin-write_end-and-perform_write-aops.patch added to -mm tree
On Tue, May 29, 2007 at 02:19:55PM -0700, Andrew Morton wrote:

The patch titled
     fs: introduce write_begin, write_end, and perform_write aops
has been added to the -mm tree. Its filename is
     fs-introduce-write_begin-write_end-and-perform_write-aops.patch

*** Remember to use Documentation/SubmitChecklist when testing your code ***

See http://www.zip.com.au/~akpm/linux/patches/stuff/added-to-mm.txt to find out what to do about this

--
Subject: fs: introduce write_begin, write_end, and perform_write aops
From: Nick Piggin [EMAIL PROTECTED]

These are intended to replace prepare_write and commit_write with more flexible alternatives that are also able to avoid the buffered write deadlock problems efficiently (which prepare_write is unable to do).

OK, well now Andrew's merged a significant chunk of this work, I would like to try getting the clustered filesystem patches back in too (Steven, the last GFS2 patch you sent had rejects against this tree, so I dropped it... hope it isn't too much work to bring it back uptodate?).

The cluster filesystems aren't 100% happy with the backward-compat code, because pagecache_write_end cannot handle AOP_TRUNCATED_PAGE from ->commit_write... so if you were to try using loop over GFS2, it might go BUG. This is a bit bad of me, however the compat code would have been a whole lot uglier to support that, and I figure the cluster filesystems want to convert to the new aops ASAP anyway.

I doubt anybody but the filesystem developers would be using -mm in such a way, but even so I hope we can fix this before long. Meanwhile, I'll look at redoing the rest of the filesystems that got left behind.
Re: + fs-introduce-write_begin-write_end-and-perform_write-aops.patch added to -mm tree
On Wed, 30 May 2007 05:13:54 +0200 Nick Piggin [EMAIL PROTECTED] wrote:

On Tue, May 29, 2007 at 02:19:55PM -0700, Andrew Morton wrote:

The patch titled
     fs: introduce write_begin, write_end, and perform_write aops
has been added to the -mm tree. Its filename is
     fs-introduce-write_begin-write_end-and-perform_write-aops.patch

*** Remember to use Documentation/SubmitChecklist when testing your code ***

See http://www.zip.com.au/~akpm/linux/patches/stuff/added-to-mm.txt to find out what to do about this

--
Subject: fs: introduce write_begin, write_end, and perform_write aops
From: Nick Piggin [EMAIL PROTECTED]

These are intended to replace prepare_write and commit_write with more flexible alternatives that are also able to avoid the buffered write deadlock problems efficiently (which prepare_write is unable to do).

OK, well now Andrew's merged a significant chunk of this work, I would like to try getting the clustered filesystem patches back in too (Steven, the last GFS2 patch you sent had rejects against this tree, so I dropped it... hope it isn't too much work to bring it back uptodate?).

The cluster filesystems aren't 100% happy with the backward-compat code, because pagecache_write_end cannot handle AOP_TRUNCATED_PAGE from ->commit_write... so if you were to try using loop over GFS2, it might go BUG. This is a bit bad of me, however the compat code would have been a whole lot uglier to support that, and I figure the cluster filesystems want to convert to the new aops ASAP anyway.

I doubt anybody but the filesystem developers would be using -mm in such a way, but even so I hope we can fix this before long. Meanwhile, I'll look at redoing the rest of the filesystems that got left behind.

hm, I suppose that means I need to undrop git-ocfs2.patch. It has a mild disagreement with the fault-vs-invalidate patches which I didn't feel like fixing.
Re: [1/4] 2.6.22-rc3: known regressions
On Tue, May 29, 2007 at 02:52:53PM +0200, Michal Piotrowski wrote:

Hi all,

Here is a list of some known regressions in 2.6.22-rc3.

Kbuild

Subject    : make M=$PWD modules_install does nothing
References : http://lkml.org/lkml/2007/5/27/190
Submitter  : Andrey Borzenkov [EMAIL PROTECTED]
Status     : Unknown

Closed - see http://lkml.org/lkml/2007/5/29/497

Sam
Re: [AppArmor 01/41] Pass struct vfsmount to the inode_create LSM hook
[EMAIL PROTECTED] wrote:

On Mon, 28 May 2007 21:54:46 EDT, Kyle Moffett said:

Average users are not supposed to be writing security policy. To be honest, even average-level system administrators should not be writing security policy.

That explains so much! SELinux: you're too dumb to use it, so just keep your hands in your pockets. :-)

AppArmor was designed to allow your average sys admin to write a security policy. It makes different design choices than SELinux to achieve that goal. As a result, AppArmor is an utter failure when compared to SELinux's goals, and SELinux in turn is an utter failure when compared to AppArmor's goals. Which is why we have LSM: so we don't have to have this argument here, again.

It's OK for such sysadmins to tweak existing policy to give access to additional web-docs or such, but only expert sysadmin/developers or security professionals should be writing security policy. It's just too damn easy to get completely wrong.

The single biggest challenge in computer security at the present time is how to build *and deploy* servers that stay reasonably secure even when run by the average wave-a-dead-chicken sysadmin, and desktop-class boxes that can survive the best attempts of Joe Sixpack's "Ooh shiny" reflex, and Joe's kid's attempts to evade the nannyware that Joe had somebody install.

That is a tall order. You can mostly achieve it by not giving the user the root password, but I'm not sure you would like the result :-)

Both SELinux and AppArmor can be configured so tightly that you are not going to get to install malware, by preventing the user from installing software. This isn't what users want, so they invariably bypass security and install shiny things if they own the box. SELinux and AppArmor can't help but fail if you put them in that kind of harm's way.

Crispin
--
Crispin Cowan, Ph.D.               http://crispincowan.com/~crispin/
Director of Software Engineering   http://novell.com
Security: It's not linear