Re: 2.4.2: opening deleted directories
repeating neil's question from below: "why is the open(".") behavior an issue?"

there are a few reasons:

* NFS close-to-open cache consistency: in NFS, close-to-open semantics require that attributes be fetched from the server when a file/directory is opened. this behavior is in part to help an application determine whether the file or directory still exists or has been replaced or removed, as in my 'ls' example. in the open(".") case, path_walk doesn't invoke either d_lookup or d_revalidate, so there is no opportunity in the present logic to retrieve the directory's attributes. this potentially breaks programs that depend on attributes being correct when opening ".". 'make' and 'ls' are just two examples. this is, btw, the original problem Trond and I were discussing. he pointed out this problem.

* rmdir behavior: the POSIX 1003.1 definition of rmdir() states that:

    If the directory is the root directory or the current working directory of any process, the effect of this function is implementation-defined.

  cop-out. it later states that:

    If one or more processes have the directory open when the last link is removed, the dot and dot-dot entries, if present, are removed before rmdir() returns and no new entries may be created in the directory.

  this indicates to me that, while the directory may continue to exist if it's the cwd of some other process, the "." and ".." entries must be removed, or equivalently, that lookups of "." and ".." will always fail after a directory is deleted.

* standard pathname resolution behavior: according to POSIX 1003.1, resolving a relative pathname means the resolution *begins* at the current working directory. in our case, if we follow POSIX resolution strategy, after starting at the cwd, a lookup of "." should be done. at this point I infer that since the directory has been deleted, "." doesn't exist, and open() returns ENOENT. IOW, according to the text in the standard, the current working directory is not an open file descriptor, it is simply a naming convenience used during pathname resolution.

* other broken system calls: i haven't tried this, but i'd guess stat(".") would behave similarly. thus stat(".") on such a removed directory would tell an application that the directory exists when in fact it doesn't. this borders on insecure behavior. fortunately, no other operations are allowed on the directory.

* consistent behavior across operating systems: there are other flavors of UNIX that don't appear to work this way. once the directory is removed, it cannot be opened, both on Solaris and OpenBSD. i don't have access to others at the moment. this complicates porting applications among operating systems, if only slightly.

* open() is a name space operation: open() converts a pathname into a file descriptor; it's a name space operation. i believe that if a file or directory no longer exists, applications expect they will not be able to open the file or directory because it is no longer attached to the file system's name space. i don't believe there are any other cases in Linux where you can open a file or directory that has been removed, are there?

* good design: i believe in reporting an error as soon as it occurs. there are no other operations allowed on a deleted directory, but the open() call is the first opportunity for the operating system to indicate to an application that the directory is gone.

i'm not trying to start a rock fight. but i think this behavior is a little strange when compared to other systems, and especially the NFS part is bothersome. and yes, i know that ext2 doesn't support lookups on ".". but other file systems do... and Linux is operating in a larger universe these days.

and NB: according to the POSIX standard's description of relative pathname lookup and rmdir, i'd say that, if the cwd is a deleted directory, open("..") should also fail. i haven't checked whether this is true or not. but we do know that ".." is handled similarly in path_walk -- no d_lookup or d_revalidate is done.

----- Original Message -----
From: "Neil Brown" [EMAIL PROTECTED]
To: "Chuck Lever" [EMAIL PROTECTED]
Cc: "Linux FS Developers" [EMAIL PROTECTED]; "Trond Myklebust" [EMAIL PROTECTED]
Sent: Sunday, March 18, 2001 5:40 PM
Subject: Re: 2.4.2: opening deleted directories

> On Sunday March 18, [EMAIL PROTECTED] wrote:
> > on Linux, if my cwd is a deleted directory, i can still open it. to wit:
> >
> > notice the open(".") -- it opens the current working directory that is in effect for the "ls" command. but i just deleted that directory from another shell.
> >
> > shouldn't that open(".") return ENOENT?
>
> Note that the error message you expect is "ENOENT" == Error, NO ENTry. The ENTRY th
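The experiment discussed in this thread can be reproduced with a small user-space program. What follows is only a minimal sketch, assuming a POSIX system with a writable /tmp; the helper name probe_deleted_cwd() is made up for illustration and does not come from the thread. It makes a scratch directory the cwd, removes it, and reports what open(".") does afterwards. Per the messages above, Linux 2.4.2 on a local file system is reported to let the open succeed, while Solaris and OpenBSD are reported to refuse it; the sketch does not assume either outcome.

```c
#define _XOPEN_SOURCE 700
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a scratch directory, chdir into it,
 * remove it, then try open(".").
 * Returns 0 if open(".") succeeded, the errno value if it failed,
 * or -1 if the setup itself could not be completed.
 */
static int probe_deleted_cwd(void)
{
	char tmpl[] = "/tmp/deleted-cwd-XXXXXX";
	char saved[4096];
	int fd, result, rc;

	if (getcwd(saved, sizeof(saved)) == NULL)
		return -1;
	if (mkdtemp(tmpl) == NULL)
		return -1;
	if (chdir(tmpl) != 0) {
		rmdir(tmpl);
		return -1;
	}
	/* Remove the directory we are standing in, by absolute path. */
	if (rmdir(tmpl) != 0) {
		rc = chdir(saved);
		(void)rc;
		return -1;
	}

	fd = open(".", O_RDONLY);
	result = (fd >= 0) ? 0 : errno;
	if (fd >= 0)
		close(fd);

	/* Restore the original working directory before returning. */
	if (chdir(saved) != 0)
		return -1;
	return result;
}
```

Calling probe_deleted_cwd() from main() and printing the result (with strerror() for non-zero values) is enough to compare behavior across systems and file system types.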
[PATCH 0/2] RFC: exporting per-superblock statistics to user space
We still have a need to provide iostat-like statistics for NFS clients. Following are a couple of patches, against 2.6.11.3, which prototype an approach for providing this kind of data to user programs. I'd like some comment on the approach.

01-mountstats.patch adds a new file called /proc/self/mountstats and a new file system method called show_stats. This just replicates /proc/mounts and the show_options hook.

02-nfs-iostat.patch teaches the NFS client to use the new show_stats hook as a demonstration.

Note that this approach addresses previously voiced concerns about exporting per-superblock stats to user space.

1. Processes can't see stats for file systems mounted outside their namespace.
2. Reading the stats file is serialized with mount and unmount operations.
3. The approach doesn't use /sys or kobjects.
4. There are no lifetime issues tied to file systems loaded as a module.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/2] VFS: New /proc file /proc/self/mountstats
Create a new file under /proc/self, called mountstats, where mounted file systems can export information (configuration options, performance counters, and so on). Use a mechanism similar to /proc/mounts and s_ops->show_options.

This mechanism does not violate namespace security, and is safe to use while other processes are unmounting file systems.

Version: Mon, 14 Mar 2005 17:06:04 -0500

Signed-off-by: Chuck Lever [EMAIL PROTECTED]
---

 fs/namespace.c     |   66 +
 fs/proc/base.c     |   40 +++
 include/linux/fs.h |    1
 3 files changed, 107 insertions(+)

diff -X /home/cel/src/linux/dont-diff -Naurp 00-stock/fs/namespace.c 01-mountstats/fs/namespace.c
--- 00-stock/fs/namespace.c	2005-03-02 02:38:13.0 -0500
+++ 01-mountstats/fs/namespace.c	2005-03-14 15:24:51.565085000 -0500
@@ -265,6 +265,72 @@ struct seq_operations mounts_op = {
 	.show	= show_vfsmnt
 };
 
+/* iterator */
+static void *ms_start(struct seq_file *m, loff_t *pos)
+{
+	struct namespace *n = m->private;
+	struct list_head *p;
+	loff_t l = *pos;
+
+	down_read(&n->sem);
+	list_for_each(p, &n->list)
+		if (!l--)
+			return list_entry(p, struct vfsmount, mnt_list);
+	return NULL;
+}
+
+static void *ms_next(struct seq_file *m, void *v, loff_t *pos)
+{
+	struct namespace *n = m->private;
+	struct list_head *p = ((struct vfsmount *)v)->mnt_list.next;
+	(*pos)++;
+	return p==&n->list ? NULL : list_entry(p, struct vfsmount, mnt_list);
+}
+
+static void ms_stop(struct seq_file *m, void *v)
+{
+	struct namespace *n = m->private;
+	up_read(&n->sem);
+}
+
+static int show_vfsstat(struct seq_file *m, void *v)
+{
+	struct vfsmount *mnt = v;
+	int err = 0;
+
+	/* device */
+	if (mnt->mnt_devname) {
+		seq_puts(m, "device ");
+		mangle(m, mnt->mnt_devname);
+	} else
+		seq_puts(m, "no device");
+
+	/* mount point */
+	seq_puts(m, " mounted on ");
+	seq_path(m, mnt, mnt->mnt_root, " \t\n\\");
+	seq_putc(m, ' ');
+
+	/* file system type */
+	seq_puts(m, "with fstype ");
+	mangle(m, mnt->mnt_sb->s_type->name);
+
+	/* optional statistics */
+	if (mnt->mnt_sb->s_op->show_stats) {
+		seq_putc(m, ' ');
+		err = mnt->mnt_sb->s_op->show_stats(m, mnt);
+	}
+
+	seq_putc(m, '\n');
+	return err;
+}
+
+struct seq_operations mountstats_op = {
+	.start	= ms_start,
+	.next	= ms_next,
+	.stop	= ms_stop,
+	.show	= show_vfsstat,
+};
+
 /**
  * may_umount_tree - check if a mount tree is busy
  * @mnt: root of mount tree
diff -X /home/cel/src/linux/dont-diff -Naurp 00-stock/fs/proc/base.c 01-mountstats/fs/proc/base.c
--- 00-stock/fs/proc/base.c	2005-03-02 02:38:12.0 -0500
+++ 01-mountstats/fs/proc/base.c	2005-03-14 15:24:51.571085000 -0500
@@ -60,6 +60,7 @@ enum pid_directory_inos {
 	PROC_TGID_STATM,
 	PROC_TGID_MAPS,
 	PROC_TGID_MOUNTS,
+	PROC_TGID_MOUNTSTATS,
 	PROC_TGID_WCHAN,
 #ifdef CONFIG_SCHEDSTATS
 	PROC_TGID_SCHEDSTAT,
@@ -91,6 +92,7 @@ enum pid_directory_inos {
 	PROC_TID_STATM,
 	PROC_TID_MAPS,
 	PROC_TID_MOUNTS,
+	PROC_TID_MOUNTSTATS,
 	PROC_TID_WCHAN,
 #ifdef CONFIG_SCHEDSTATS
 	PROC_TID_SCHEDSTAT,
@@ -134,6 +136,7 @@ static struct pid_entry tgid_base_stuff[
 	E(PROC_TGID_ROOT,      "root",    S_IFLNK|S_IRWXUGO),
 	E(PROC_TGID_EXE,       "exe",     S_IFLNK|S_IRWXUGO),
 	E(PROC_TGID_MOUNTS,    "mounts",  S_IFREG|S_IRUGO),
+	E(PROC_TGID_MOUNTSTATS, "mountstats", S_IFREG|S_IRUGO),
 #ifdef CONFIG_SECURITY
 	E(PROC_TGID_ATTR,      "attr",    S_IFDIR|S_IRUGO|S_IXUGO),
 #endif
@@ -164,6 +167,7 @@ static struct pid_entry tid_base_stuff[]
 	E(PROC_TID_ROOT,       "root",    S_IFLNK|S_IRWXUGO),
 	E(PROC_TID_EXE,        "exe",     S_IFLNK|S_IRWXUGO),
 	E(PROC_TID_MOUNTS,     "mounts",  S_IFREG|S_IRUGO),
+	E(PROC_TID_MOUNTSTATS, "mountstats", S_IFREG|S_IRUGO),
 #ifdef CONFIG_SECURITY
 	E(PROC_TID_ATTR,       "attr",    S_IFDIR|S_IRUGO|S_IXUGO),
 #endif
@@ -528,6 +532,38 @@ static struct file_operations proc_mount
 	.release	= mounts_release,
 };
 
+extern struct seq_operations mountstats_op;
+static int mountstats_open(struct inode *inode, struct file *file)
+{
+	struct task_struct *task = proc_task(inode);
+	int ret = seq_open(file, &mountstats_op);
+
+	if (!ret) {
+		struct seq_file *m = file->private_data;
+		struct namespace *namespace;
+		task_lock(task);
+		namespace = task->namespace;
+		if (namespace)
+			get_namespace(namespace);
+		task_unlock(task);
+
+		if (namespace)
+			m->private = namespace
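For readers wondering what a consumer of /proc/self/mountstats might look like, here is a minimal user-space sketch. It assumes each line has the layout that show_vfsstat() in this patch appears to emit ("device X mounted on Y with fstype Z", or "no device mounted on Y with fstype Z", optionally followed by file-system-specific statistics), and that embedded whitespace in names is escaped as it is in /proc/mounts. The struct and function names are hypothetical, chosen only for this illustration.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical container for the fixed fields of one mountstats line. */
struct mountstat_entry {
	char device[256];	/* empty when the kernel printed "no device" */
	char mountpoint[256];
	char fstype[64];
};

/* Returns 1 if the line matched the assumed layout, 0 otherwise. */
static int parse_mountstats_line(const char *line, struct mountstat_entry *e)
{
	memset(e, 0, sizeof(*e));

	/* Common case: the mount has a device name. */
	if (sscanf(line, "device %255s mounted on %255s with fstype %63s",
		   e->device, e->mountpoint, e->fstype) == 3)
		return 1;

	/* Fallback: mounts without a device name. */
	e->device[0] = '\0';
	if (sscanf(line, "no device mounted on %255s with fstype %63s",
		   e->mountpoint, e->fstype) == 2)
		return 1;

	return 0;
}
```

A caller would read the file line by line with fgets() and hand each line to parse_mountstats_line(); anything after the fstype field is left untouched for a file-system-specific parser.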
[PATCH 2/2] NFS: add I/O performance counters
Add an extensible per-superblock performance counter facility to the NFS client. This facility mimics the counters available for block devices and for networking. Expose these new counters via /proc/self/mountstats.

Version: Mon, 14 Mar 2005 17:06:12 -0500

Signed-off-by: Chuck Lever [EMAIL PROTECTED]
---

 fs/nfs/dir.c               |    8 ++
 fs/nfs/direct.c            |    5 +
 fs/nfs/file.c              |   20 +++--
 fs/nfs/inode.c             |  126 +++--
 fs/nfs/pagelist.c          |   12 ++-
 fs/nfs/read.c              |    7 ++
 fs/nfs/write.c             |   10 ++
 include/linux/nfs_fs_sb.h  |    5 +
 include/linux/nfs_iostat.h |   80 +++
 9 files changed, 256 insertions(+), 17 deletions(-)

diff -X /home/cel/src/linux/dont-diff -Naurp 01-mountstats/fs/nfs/dir.c 02-nfs-iostat/fs/nfs/dir.c
--- 01-mountstats/fs/nfs/dir.c	2005-03-02 02:38:09.0 -0500
+++ 02-nfs-iostat/fs/nfs/dir.c	2005-03-14 15:28:34.011484000 -0500
@@ -27,6 +27,7 @@
 #include <linux/mm.h>
 #include <linux/sunrpc/clnt.h>
 #include <linux/nfs_fs.h>
+#include <linux/nfs_iostat.h>
 #include <linux/nfs_mount.h>
 #include <linux/pagemap.h>
 #include <linux/smp_lock.h>
@@ -428,6 +429,8 @@ static int nfs_readdir(struct file *filp
 
 	lock_kernel();
 
+	nfs_inc_stats(inode, NFS_VFS_GETDENTS);
+
 	res = nfs_revalidate_inode(NFS_SERVER(inode), inode);
 	if (res < 0) {
 		unlock_kernel();
@@ -584,6 +587,7 @@ static int nfs_lookup_revalidate(struct
 	parent = dget_parent(dentry);
 	lock_kernel();
 	dir = parent->d_inode;
+	nfs_inc_stats(dir, NFS_DENTRY_REVALIDATE);
 	inode = dentry->d_inode;
 
 	if (nd && !(nd->flags & LOOKUP_CONTINUE) && (nd->flags & LOOKUP_OPEN))
@@ -712,6 +716,7 @@ static struct dentry *nfs_lookup(struct
 
 	dfprintk(VFS, "NFS: lookup(%s/%s)\n",
 		dentry->d_parent->d_name.name, dentry->d_name.name);
+	nfs_inc_stats(dir, NFS_VFS_LOOKUP);
 
 	res = ERR_PTR(-ENAMETOOLONG);
 	if (dentry->d_name.len > NFS_SERVER(dir)->namelen)
@@ -1116,6 +1121,7 @@ static int nfs_sillyrename(struct inode
 	dfprintk(VFS, "NFS: silly-rename(%s/%s, ct=%d)\n",
 		dentry->d_parent->d_name.name, dentry->d_name.name,
 		atomic_read(&dentry->d_count));
+	nfs_inc_stats(dir, NFS_SILLY_RENAME);
 
 #ifdef NFS_PARANOIA
 if (!dentry->d_inode)
@@ -1500,6 +1506,8 @@ int nfs_permission(struct inode *inode,
 	struct rpc_cred *cred;
 	int res;
 
+	nfs_inc_stats(inode, NFS_VFS_ACCESS);
+
 	if (mask == 0)
 		return 0;
 
diff -X /home/cel/src/linux/dont-diff -Naurp 01-mountstats/fs/nfs/direct.c 02-nfs-iostat/fs/nfs/direct.c
--- 01-mountstats/fs/nfs/direct.c	2005-03-02 02:38:25.0 -0500
+++ 02-nfs-iostat/fs/nfs/direct.c	2005-03-14 15:26:16.401349000 -0500
@@ -47,6 +47,7 @@
 #include <linux/kref.h>
 
 #include <linux/nfs_fs.h>
+#include <linux/nfs_iostat.h>
 #include <linux/nfs_page.h>
 #include <linux/sunrpc/clnt.h>
 
@@ -354,6 +355,8 @@ static ssize_t nfs_direct_read_seg(struc
 	result = nfs_direct_read_wait(dreq, clnt->cl_intr);
 	rpc_clnt_sigunmask(clnt, &oldset);
 
+	nfs_add_stats(inode, NFS_WIRE_READ_BYTES, result);
+	nfs_add_stats(inode, NFS_DIRECT_READ_BYTES, result);
 	return result;
 }
 
@@ -576,6 +579,8 @@ static ssize_t nfs_direct_write(struct i
 		if (result < size)
 			break;
 	}
+	nfs_add_stats(inode, NFS_WIRE_WRITTEN_BYTES, tot_bytes);
+	nfs_add_stats(inode, NFS_DIRECT_WRITTEN_BYTES, tot_bytes);
 	return tot_bytes;
 }
 
diff -X /home/cel/src/linux/dont-diff -Naurp 01-mountstats/fs/nfs/file.c 02-nfs-iostat/fs/nfs/file.c
--- 01-mountstats/fs/nfs/file.c	2005-03-02 02:38:38.0 -0500
+++ 02-nfs-iostat/fs/nfs/file.c	2005-03-14 15:42:52.446804000 -0500
@@ -22,6 +22,7 @@
 #include <linux/fcntl.h>
 #include <linux/stat.h>
 #include <linux/nfs_fs.h>
+#include <linux/nfs_iostat.h>
 #include <linux/nfs_mount.h>
 #include <linux/mm.h>
 #include <linux/slab.h>
@@ -86,18 +87,15 @@ static int nfs_check_flags(int flags)
 static int
 nfs_file_open(struct inode *inode, struct file *filp)
 {
-	struct nfs_server *server = NFS_SERVER(inode);
-	int (*open)(struct inode *, struct file *);
 	int res;
 
 	res = nfs_check_flags(filp->f_flags);
 	if (res)
 		return res;
 
+	nfs_inc_stats(inode, NFS_VFS_OPEN);
 	lock_kernel();
-	/* Do NFSv4 open() call */
-	if ((open = server->rpc_ops->file_open) != NULL)
-		res = open(inode, filp);
+	res = NFS_SERVER(inode)->rpc_ops->file_open(inode, filp);
 	unlock_kernel();
 	return res;
 }
@@ -105,6 +103,7 @@ nfs_file_open(struct inode *inode, struc
 static int
 nfs_file_release(struct inode *inode, struct file *filp)
 {
+	nfs_inc_stats(inode, NFS_VFS_CLOSE);
 	return NFS_PROTO(inode)->file_release(inode, filp);
 }
 
@@ -123,6 +122,7
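The body of include/linux/nfs_iostat.h is not visible in this archived copy, so the following is only a user-space sketch of how an enum-indexed counter array with "inc" and "add" helpers could work. The counter names are borrowed from the call sites in the diff above; the enum layout, the struct, the NFS_STAT_COUNT sentinel, the decision to ignore negative byte counts, and the space-separated output format are all assumptions made for illustration, not the patch's actual implementation.

```c
#include <stdio.h>

/* Names mirror identifiers used in the patch; the layout is assumed. */
enum nfs_stat_index {
	NFS_VFS_OPEN,
	NFS_VFS_CLOSE,
	NFS_VFS_LOOKUP,
	NFS_WIRE_READ_BYTES,
	NFS_WIRE_WRITTEN_BYTES,
	NFS_STAT_COUNT		/* hypothetical sentinel: number of counters */
};

struct nfs_iostats {
	unsigned long long counters[NFS_STAT_COUNT];
};

/* Event counter: bump by one. */
static void iostats_inc(struct nfs_iostats *s, enum nfs_stat_index i)
{
	s->counters[i]++;
}

/* Byte counter: add a transfer size; negative (error) results are skipped. */
static void iostats_add(struct nfs_iostats *s, enum nfs_stat_index i, long n)
{
	if (n > 0)
		s->counters[i] += (unsigned long long)n;
}

/*
 * Write all counters, space separated, into buf.
 * Returns the number of characters written, or -1 if buf is too small.
 */
static int iostats_format(const struct nfs_iostats *s, char *buf, size_t len)
{
	size_t used = 0;
	int i;

	for (i = 0; i < NFS_STAT_COUNT; i++) {
		int n = snprintf(buf + used, len - used, "%s%llu",
				 i ? " " : "", s->counters[i]);
		if (n < 0 || (size_t)n >= len - used)
			return -1;
		used += (size_t)n;
	}
	return (int)used;
}
```

Indexing by enum keeps each call site to a single line, which matches the one-line nfs_inc_stats() and nfs_add_stats() insertions in the diff; a kernel version would additionally have to deal with concurrent updates, which this sketch ignores.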
[PATCH 13/13] NFS: Integrate support for processing nfs4 mount options in fs/nfs/super.c
Finally, hook in the new mount option parsing logic.

Signed-off-by: Chuck Lever [EMAIL PROTECTED]
---

 fs/nfs/super.c |   87
 1 files changed, 19 insertions(+), 68 deletions(-)

diff --git a/fs/nfs/super.c b/fs/nfs/super.c
index e0acd08..222bb49 100644
--- a/fs/nfs/super.c
+++ b/fs/nfs/super.c
@@ -13,6 +13,8 @@
  *
  *  Split from inode.c by David Howells [EMAIL PROTECTED]
  *
+ *  In-kernel mount option parsing by Chuck Lever [EMAIL PROTECTED]
+ *
  * - superblocks are indexed on server only - all inodes, dentries, etc. associated with a
  *   particular server are held in the same superblock
  * - NFS superblocks can have several effective roots to the dentry tree
@@ -1532,7 +1534,6 @@ static int nfs4_parse_options(char *raw, struct nfs4_mount_args *mnt)
 			if (len > 80)
 				goto out_clntaddr_long;
 			match_strcpy(mnt->clientaddr, args);
-			mnt->nmd.client_addr.data = mnt->clientaddr;
 			mnt->nmd.client_addr.len = len;
 			break;
 		}
@@ -1605,10 +1606,8 @@ static struct nfs4_mount_data *nfs4_convert_mount_opts(const char *options)
 	args->nmd.acdirmax = 60;
 
 	args->nmd.auth_flavourlen = 0;
-	args->nmd.auth_flavours = &args->authflavor;
 
 	args->nmd.host_addrlen = sizeof(args->addr);
-	args->nmd.host_addr = (struct sockaddr *) &args->addr;
 
 	args->addr.sin_port = htons(NFS_PORT);
 
@@ -1652,6 +1651,7 @@ static int nfs4_validate_mount_data(struct nfs4_mount_data **options,
 				    char *ip_addr)
 {
 	struct nfs4_mount_data *data = *options;
+	struct nfs4_mount_args *args;
 	char *c;
 	unsigned len;
 
@@ -1707,25 +1707,26 @@ static int nfs4_validate_mount_data(struct nfs4_mount_data **options,
 		if (IS_ERR(data))
 			return PTR_ERR(data);
 		*options = data;
+		args = (struct nfs4_mount_args *) data;
 
-		memcpy(addr, data->host_addr, sizeof(*addr));
-		if (!nfs_verify_server_address((struct sockaddr *) addr,
+		if (!nfs_verify_server_address((struct sockaddr *) &args->addr,
 					       data->host_addrlen))
 			return -EINVAL;
+		memcpy(addr, &args->addr, sizeof(*addr));
 
 		switch (data->auth_flavourlen) {
 		case 0:
 			*authflavour = RPC_AUTH_UNIX;
 			break;
 		case 1:
-			*authflavour = (rpc_authflavor_t) data->auth_flavours[0];
+			*authflavour = (rpc_authflavor_t) args->authflavor;
 			break;
 		default:
 			goto out_inval_auth;
 		}
 
 		memset(ip_addr, '\0', data->client_addr.len + 1);
-		strncpy(ip_addr, data->client_addr.data, data->client_addr.len);
+		strncpy(ip_addr, args->clientaddr, data->client_addr.len);
 
 		/*
 		 * Split dev_name into hostname:mntpath.
@@ -1804,67 +1805,17 @@ static int nfs4_get_sb(struct file_system_type *fs_type,
 	struct nfs_fh mntfh;
 	struct dentry *mntroot;
 	char *mntpath = NULL, *hostname = NULL, ip_addr[16];
-	void *p;
 	int error;
 
-	if (data == NULL) {
-		dprintk("%s: missing data argument\n", __FUNCTION__);
-		return -EINVAL;
-	}
-	if (data->version <= 0 || data->version > NFS4_MOUNT_VERSION) {
-		dprintk("%s: bad mount version\n", __FUNCTION__);
-		return -EINVAL;
-	}
-
-	/* We now require that the mount process passes the remote address */
-	if (data->host_addrlen != sizeof(addr))
-		return -EINVAL;
-
-	if (copy_from_user(&addr, data->host_addr, sizeof(addr)))
-		return -EFAULT;
-
-	if (!nfs_verify_server_address((struct sockaddr *) &addr,
-				       data->host_addrlen))
-		return -EINVAL;
-
-	/* RFC3530: The default port for NFS is 2049 */
-	if (addr.sin_port == 0)
-		addr.sin_port = htons(NFS_PORT);
-
-	/* Grab the authentication type */
-	authflavour = RPC_AUTH_UNIX;
-	if (data->auth_flavourlen != 0) {
-		if (data->auth_flavourlen != 1) {
-			dprintk("%s: Invalid number of RPC auth flavours %d.\n",
-					__FUNCTION__, data->auth_flavourlen);
-			error = -EINVAL;
-			goto out_err_noserver;
-		}
-
-		if (copy_from_user(&authflavour, data->auth_flavours,
-				   sizeof(authflavour))) {
-			error = -EFAULT;
-			goto out_err_noserver
[PATCH 06/13] NFS: Improve debugging output in NFS in-kernel mount client
Signed-off-by: Chuck Lever [EMAIL PROTECTED]
---

 fs/nfs/mount_clnt.c    |   18 +-
 include/linux/nfs_fs.h |    1 +
 2 files changed, 14 insertions(+), 5 deletions(-)

diff --git a/fs/nfs/mount_clnt.c b/fs/nfs/mount_clnt.c
index f8584ad..81ea782 100644
--- a/fs/nfs/mount_clnt.c
+++ b/fs/nfs/mount_clnt.c
@@ -16,7 +16,7 @@
 #include <linux/nfs_fs.h>
 
 #ifdef RPC_DEBUG
-# define NFSDBG_FACILITY	NFSDBG_ROOT
+# define NFSDBG_FACILITY	NFSDBG_MOUNT
 #endif
 
 static struct rpc_program	mnt_program;
@@ -72,8 +72,8 @@ int nfs_mount(struct sockaddr_in *addr, char *path, struct nfs_fh *fh,
 	char			hostname[32];
 	int			status;
 
-	dprintk("NFS: nfs_mount(%08x:%s)\n",
-			(unsigned)ntohl(addr->sin_addr.s_addr), path);
+	dprintk("NFS: %s: mounting " NIPQUAD_FMT ":%s\n",
+			__FUNCTION__, NIPQUAD(addr->sin_addr.s_addr), path);
 
 	sprintf(hostname, NIPQUAD_FMT, NIPQUAD(addr->sin_addr.s_addr));
 	mnt_clnt = mnt_create(hostname, addr, version, protocol);
@@ -86,10 +86,18 @@ int nfs_mount(struct sockaddr_in *addr, char *path, struct nfs_fh *fh,
 	msg.rpc_proc = &mnt_clnt->cl_procinfo[MNTPROC_MNT];
 
 	status = rpc_call_sync(mnt_clnt, &msg, 0);
-	if (status < 0)
+	if (status < 0) {
+		dprintk("NFS: %s: rpc_call_sync returned %d\n",
+			__FUNCTION__, status);
 		return status;
-	if (result.status != 0)
+	}
+	if (result.status != 0) {
+		dprintk("NFS: %s: server returned %d\n",
+			__FUNCTION__, result.status);
 		return -EACCES;
+	}
+	dprintk("NFS: %s: mount request succeeded\n",
+		__FUNCTION__);
 	return 0;
 }
 
diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
index 58f5b77..2f33ef7 100644
--- a/include/linux/nfs_fs.h
+++ b/include/linux/nfs_fs.h
@@ -555,6 +555,7 @@ extern void * nfs_root_data(void);
 #define NFSDBG_ROOT		0x0080
 #define NFSDBG_CALLBACK		0x0100
 #define NFSDBG_CLIENT		0x0200
+#define NFSDBG_MOUNT		0x0400
 #define NFSDBG_ALL		0xFFFF
 
 #ifdef __KERNEL__
[PATCH 09/13] NFS: Implement NFSv2/3 in-kernel mount option parsing
Signed-off-by: Chuck Lever [EMAIL PROTECTED]
---

 fs/nfs/super.c |  130 +++-
 1 files changed, 82 insertions(+), 48 deletions(-)

diff --git a/fs/nfs/super.c b/fs/nfs/super.c
index a9f698b..7b7cacb 100644
--- a/fs/nfs/super.c
+++ b/fs/nfs/super.c
@@ -935,8 +935,6 @@ static struct nfs_mount_data *nfs_convert_mount_opts(const char *options)
 	if (args == NULL)
 		return ERR_PTR(-ENOMEM);
 
-	args->nmd.version = 7;
-
 	args->nmd.flags = (NFS_MOUNT_VER3 | NFS_MOUNT_TCP);
 	args->nmd.rsize = NFS_MAX_FILE_IO_SIZE;
 	args->nmd.wsize = NFS_MAX_FILE_IO_SIZE;
@@ -989,71 +987,74 @@ out_invalid:
  * Validate the NFS2/NFS3 mount data
  * - fills in the mount root filehandle
  */
-static int nfs_validate_mount_data(struct nfs_mount_data *data,
-				   struct nfs_fh *mntfh)
+static int nfs_validate_mount_data(struct nfs_mount_data **options,
+				   struct nfs_fh *mntfh,
+				   const char *dev_name)
 {
-	if (data == NULL) {
-		dprintk("%s: missing data argument\n", __FUNCTION__);
-		return -EINVAL;
-	}
+	struct nfs_mount_data *data = *options;
+	unsigned int len;
+	char *c;
+	int status;
 
-	if (data->version <= 0 || data->version > NFS_MOUNT_VERSION) {
-		dprintk("%s: bad mount version\n", __FUNCTION__);
-		return -EINVAL;
-	}
+	if (data == NULL)
+		goto out_no_data;
 
 	switch (data->version) {
-		case 1:
-			data->namlen = 0;
-		case 2:
-			data->bsize = 0;
-		case 3:
-			if (data->flags & NFS_MOUNT_VER3) {
-				dprintk("%s: mount structure version %d does not support NFSv3\n",
-						__FUNCTION__,
-						data->version);
-				return -EINVAL;
-			}
-			data->root.size = NFS2_FHSIZE;
-			memcpy(data->root.data, data->old_root.data, NFS2_FHSIZE);
-		case 4:
-			if (data->flags & NFS_MOUNT_SECFLAVOUR) {
-				dprintk("%s: mount structure version %d does not support strong security\n",
-						__FUNCTION__,
-						data->version);
-				return -EINVAL;
-			}
-		case 5:
-			memset(data->context, 0, sizeof(data->context));
+	case 1:
+		data->namlen = 0;
+	case 2:
+		data->bsize = 0;
+	case 3:
+		if (data->flags & NFS_MOUNT_VER3)
+			goto out_no_v3;
+		data->root.size = NFS2_FHSIZE;
+		memcpy(data->root.data, data->old_root.data, NFS2_FHSIZE);
+	case 4:
+		if (data->flags & NFS_MOUNT_SECFLAVOUR)
+			goto out_no_sec;
+	case 5:
+		memset(data->context, 0, sizeof(data->context));
+	case 6:
+		break;
+	default:
+		data = nfs_convert_mount_opts((char *) data);
+		if (IS_ERR(data))
+			return PTR_ERR(data);
+		*options = data;
+
+		c = strchr(dev_name, ':');
+		if (c == NULL)
+			return -EINVAL;
+		len = c - dev_name - 1;
+		if (len > 256)
+			return -EINVAL;
+		strncpy(data->hostname, dev_name, len);
+
+		status = nfs_try_mount(data, ++c);
+		if (status)
+			return -EINVAL;
 	}
 
-	/* Set the pseudoflavor */
 	if (!(data->flags & NFS_MOUNT_SECFLAVOUR))
 		data->pseudoflavor = RPC_AUTH_UNIX;
 
 #ifndef CONFIG_NFS_V3
-	/* If NFSv3 is not compiled in, return -EPROTONOSUPPORT */
-	if (data->flags & NFS_MOUNT_VER3) {
-		dprintk("%s: NFSv3 not compiled into kernel\n", __FUNCTION__);
-		return -EPROTONOSUPPORT;
-	}
-#endif /* CONFIG_NFS_V3 */
+	if (data->flags & NFS_MOUNT_VER3)
+		goto out_v3_not_compiled;
+#endif /* !CONFIG_NFS_V3 */
 
 	/* We now require that the mount process passes the remote address */
 	if (!nfs_verify_server_address((struct sockaddr *) &data->addr,
 				       sizeof(data->addr)))
 		return -EINVAL;
 
-	/* Prepare the root filehandle */
 	if (data->flags & NFS_MOUNT_VER3)
 		mntfh->size = data->root.size;
 	else
 		mntfh->size = NFS2_FHSIZE;
 
-	if (mntfh->size > sizeof(mntfh->data)) {
-		dprintk("%s: invalid root filehandle\n", __FUNCTION__);
-		return
[PATCH 01/13] NFS: Refactor IP address sanity checks in NFS client
Provide mechanism for adding IPv6 address support at some later point.

Signed-off-by: Chuck Lever [EMAIL PROTECTED]
Cc: Aurelien Charbon [EMAIL PROTECTED]
---

 fs/nfs/super.c |   39 ---
 1 files changed, 28 insertions(+), 11 deletions(-)

diff --git a/fs/nfs/super.c b/fs/nfs/super.c
index 1fce778..31f7313 100644
--- a/fs/nfs/super.c
+++ b/fs/nfs/super.c
@@ -436,6 +436,28 @@ static void nfs_umount_begin(struct vfsmount *vfsmnt, int flags)
 }
 
 /*
+ * Sanity-check a server address provided by the mount command
+ */
+static int nfs_verify_server_address(struct sockaddr *addr, size_t len)
+{
+	if (len sizeof(struct sockaddr))
+		goto out_invalid;
+
+	switch (addr->sa_family) {
+	case AF_INET: {
+		struct sockaddr_in *sa = (struct sockaddr_in *) addr;
+		if (sa->sin_addr.s_addr != INADDR_ANY)
+			return 1;
+		break;
+	}
+	}
+
+out_invalid:
+	dprintk("NFS: mount program passed an invalid remote address\n");
+	return 0;
+}
+
+/*
  * Validate the NFS2/NFS3 mount data
  * - fills in the mount root filehandle
  */
@@ -490,11 +512,9 @@ static int nfs_validate_mount_data(struct nfs_mount_data *data,
 #endif /* CONFIG_NFS_V3 */
 
 	/* We now require that the mount process passes the remote address */
-	if (data->addr.sin_addr.s_addr == INADDR_ANY) {
-		dprintk("%s: mount program didn't pass remote address!\n",
-			__FUNCTION__);
-		return -EINVAL;
-	}
+	if (!nfs_verify_server_address((struct sockaddr *) &data->addr,
+				       sizeof(data->addr)))
+		return -EINVAL;
 
 	/* Prepare the root filehandle */
 	if (data->flags & NFS_MOUNT_VER3)
@@ -828,13 +848,10 @@ static int nfs4_get_sb(struct file_system_type *fs_type,
 	if (copy_from_user(&addr, data->host_addr, sizeof(addr)))
 		return -EFAULT;
 
-	if (addr.sin_family != AF_INET ||
-	    addr.sin_addr.s_addr == INADDR_ANY
-	    ) {
-		dprintk("%s: mount program didn't pass remote IP address!\n",
-				__FUNCTION__);
+	if (!nfs_verify_server_address((struct sockaddr *) &addr,
+				       data->host_addrlen))
 		return -EINVAL;
-	}
+
 	/* RFC3530: The default port for NFS is 2049 */
 	if (addr.sin_port == 0)
 		addr.sin_port = htons(NFS_PORT);
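The address check introduced by this patch is easy to try out in user space. The sketch below is an illustration only: the function name verify_server_address() is made up, and because the comparison operator in the length test did not survive in this archived copy, the sketch assumes the intent is to reject buffers too short to hold an IPv4 socket address. The per-family switch mirrors the structure of the patch, which leaves room for an AF_INET6 case later.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/* Returns 1 if addr looks like a usable IPv4 server address, else 0. */
static int verify_server_address(const struct sockaddr *addr, size_t len)
{
	/* Assumption: a buffer shorter than sockaddr_in cannot be valid. */
	if (len < sizeof(struct sockaddr_in))
		return 0;

	switch (addr->sa_family) {
	case AF_INET: {
		const struct sockaddr_in *sa = (const struct sockaddr_in *)addr;

		/* The wildcard address cannot name a remote server. */
		if (sa->sin_addr.s_addr != htonl(INADDR_ANY))
			return 1;
		break;
	}
	/* An AF_INET6 case could be added here at some later point. */
	default:
		break;
	}
	return 0;
}
```

Keeping the family dispatch in one helper means callers such as the two mount-data validators only need a single call, which is the refactoring this patch performs.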
[PATCH 08/13] NFS: Add functions to parse nfs mount options to fs/nfs/super.c
For NFSv2 and NFSv3 mount options.

Signed-off-by: Chuck Lever [EMAIL PROTECTED]
---

 fs/nfs/super.c |  449
 1 files changed, 449 insertions(+), 0 deletions(-)

diff --git a/fs/nfs/super.c b/fs/nfs/super.c
index 1974648..a9f698b 100644
--- a/fs/nfs/super.c
+++ b/fs/nfs/super.c
@@ -514,6 +514,455 @@ static void nfs_umount_begin(struct vfsmount *vfsmnt, int flags)
 	shrink_submounts(vfsmnt, &nfs_automount_list);
 }
 
+
+static match_table_t nfs_tokens = {
+	{Opt_userspace, "bg"},
+	{Opt_userspace, "fg"},
+	{Opt_soft, "soft"},
+	{Opt_hard, "hard"},
+	{Opt_intr, "intr"},
+	{Opt_nointr, "nointr"},
+	{Opt_posix, "posix"},
+	{Opt_noposix, "noposix"},
+	{Opt_cto, "cto"},
+	{Opt_nocto, "nocto"},
+	{Opt_ac, "ac"},
+	{Opt_noac, "noac"},
+	{Opt_lock, "lock"},
+	{Opt_nolock, "nolock"},
+	{Opt_v2, "v2"},
+	{Opt_v3, "v3"},
+	{Opt_udp, "udp"},
+	{Opt_tcp, "tcp"},
+	{Opt_acl, "acl"},
+	{Opt_noacl, "noacl"},
+
+	{Opt_port, "port=%u"},
+	{Opt_rsize, "rsize=%u"},
+	{Opt_wsize, "wsize=%u"},
+	{Opt_timeo, "timeo=%u"},
+	{Opt_retrans, "retrans=%u"},
+	{Opt_acregmin, "acregmin=%u"},
+	{Opt_acregmax, "acregmax=%u"},
+	{Opt_acdirmin, "acdirmin=%u"},
+	{Opt_acdirmax, "acdirmax=%u"},
+	{Opt_actimeo, "actimeo=%u"},
+	{Opt_userspace, "retry=%u"},
+	{Opt_namelen, "namlen=%u"},
+	{Opt_mountport, "mountport=%u"},
+	{Opt_mountprog, "mountprog=%u"},
+	{Opt_mountvers, "mountvers=%u"},
+	{Opt_nfsprog, "nfsprog=%u"},
+	{Opt_nfsvers, "nfsvers=%u"},
+	{Opt_nfsvers, "vers=%u"},
+
+	{Opt_sec, "sec=%s"},
+	{Opt_proto, "proto=%s"},
+	{Opt_addr, "addr=%s"},
+	{Opt_mounthost, "mounthost=%s"},
+	{Opt_context, "context=%s"},
+
+	{Opt_err, NULL},
+};
+
+static int nfs_parse_options(char *raw, struct nfs_mount_args *mnt)
+{
+	char *p, *string;
+
+	if (!raw) {
+		dprintk("NFS: mount options string was NULL.\n");
+		return 1;
+	}
+
+	while ((p = strsep (&raw, ",")) != NULL) {
+		substring_t args[MAX_OPT_ARGS];
+		int option, token;
+
+		if (!*p)
+			continue;
+		token = match_token(p, nfs_tokens, args);
+
+		dprintk("NFS: nfs mount option '%s': parsing token %d\n",
+			p, token);
+
+		switch (token) {
+		case Opt_soft:
+			mnt->nmd.flags |= NFS_MOUNT_SOFT;
+			break;
+		case Opt_hard:
+			mnt->nmd.flags &= ~NFS_MOUNT_SOFT;
+			break;
+		case Opt_intr:
+			mnt->nmd.flags |= NFS_MOUNT_INTR;
+			break;
+		case Opt_nointr:
+			mnt->nmd.flags &= ~NFS_MOUNT_INTR;
+			break;
+		case Opt_posix:
+			mnt->nmd.flags |= NFS_MOUNT_POSIX;
+			break;
+		case Opt_noposix:
+			mnt->nmd.flags &= ~NFS_MOUNT_POSIX;
+			break;
+		case Opt_cto:
+			mnt->nmd.flags &= ~NFS_MOUNT_NOCTO;
+			break;
+		case Opt_nocto:
+			mnt->nmd.flags |= NFS_MOUNT_NOCTO;
+			break;
+		case Opt_ac:
+			mnt->nmd.flags &= ~NFS_MOUNT_NOAC;
+			break;
+		case Opt_noac:
+			mnt->nmd.flags |= NFS_MOUNT_NOAC;
+			break;
+		case Opt_lock:
+			mnt->nmd.flags &= ~NFS_MOUNT_NONLM;
+			break;
+		case Opt_nolock:
+			mnt->nmd.flags |= NFS_MOUNT_NONLM;
+			break;
+		case Opt_v2:
+			mnt->nmd.flags &= ~NFS_MOUNT_VER3;
+			break;
+		case Opt_v3:
+			mnt->nmd.flags |= NFS_MOUNT_VER3;
+			break;
+		case Opt_udp:
+			mnt->nmd.flags &= ~NFS_MOUNT_TCP;
+			break;
+		case Opt_tcp:
+			mnt->nmd.flags |= NFS_MOUNT_TCP;
+			break;
+		case Opt_acl:
+			mnt->nmd.flags &= ~NFS_MOUNT_NOACL;
+			break;
+		case Opt_noacl:
+			mnt->nmd.flags |= NFS_MOUNT_NOACL;
+			break;
+
+		case Opt_port:
+			if (match_int(args, &option))
+				return 0;
+			if (option < 0 || option > 65535)
+				return 0;
+			mnt->nmd.addr.sin_port = htonl(option);
+			break;
+		case Opt_rsize
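The parsing loop in this patch (split on commas with strsep(), skip empty fields, then dispatch per option) can be exercised outside the kernel. The sketch below is a simplified user-space analogue, not the patch's code: it replaces match_token() with plain string comparisons, and the DEMO_* flag values, struct demo_mount_args, and both function names are hypothetical. It keeps the patch's convention of returning 1 on success and 0 on a bad option. One deliberate difference: the archived patch text passes the port through htonl(), but since sin_port is a 16-bit field this sketch uses htons(), which is my assumption about the intent.

```c
#define _DEFAULT_SOURCE
#include <arpa/inet.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_MOUNT_SOFT 0x0001	/* hypothetical flag values */
#define DEMO_MOUNT_TCP  0x0002

struct demo_mount_args {	/* hypothetical stand-in for nfs_mount_args */
	unsigned int flags;
	unsigned int rsize;
	uint16_t port_net;	/* port in network byte order */
};

/* Parse an unsigned decimal in [0, max]; returns 1 on success. */
static int parse_uint(const char *s, unsigned long max, unsigned long *out)
{
	char *end;
	unsigned long v;

	if (*s < '0' || *s > '9')
		return 0;
	errno = 0;
	v = strtoul(s, &end, 10);
	if (errno != 0 || *end != '\0' || v > max)
		return 0;
	*out = v;
	return 1;
}

/* Returns 1 on success, 0 on a bad option. Modifies opts in place. */
static int demo_parse_options(char *opts, struct demo_mount_args *mnt)
{
	char *p;
	unsigned long v;

	while ((p = strsep(&opts, ",")) != NULL) {
		if (!*p)
			continue;	/* skip empty fields such as ",," */
		if (strcmp(p, "soft") == 0) {
			mnt->flags |= DEMO_MOUNT_SOFT;
		} else if (strcmp(p, "hard") == 0) {
			mnt->flags &= ~DEMO_MOUNT_SOFT;
		} else if (strcmp(p, "tcp") == 0) {
			mnt->flags |= DEMO_MOUNT_TCP;
		} else if (strcmp(p, "udp") == 0) {
			mnt->flags &= ~DEMO_MOUNT_TCP;
		} else if (strncmp(p, "port=", 5) == 0) {
			if (!parse_uint(p + 5, 65535, &v))
				return 0;
			mnt->port_net = htons((uint16_t)v);
		} else if (strncmp(p, "rsize=", 6) == 0) {
			if (!parse_uint(p + 6, 0xffffffffUL, &v))
				return 0;
			mnt->rsize = (unsigned int)v;
		} else {
			return 0;	/* unknown option */
		}
	}
	return 1;
}
```

Because strsep() writes NUL bytes into its input, callers need to pass a writable copy of the option string, for example a local char array rather than a string literal.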
[PATCH 10/13] NFS: Add functions to parse nfs4 mount options to fs/nfs/super.c
Add helpers required for parsing nfs4 mount options in the NFS client.

Signed-off-by: Chuck Lever [EMAIL PROTECTED]
---

 fs/nfs/super.c |  290
 1 files changed, 290 insertions(+), 0 deletions(-)

diff --git a/fs/nfs/super.c b/fs/nfs/super.c
index 7b7cacb..927c1c2 100644
--- a/fs/nfs/super.c
+++ b/fs/nfs/super.c
@@ -1332,6 +1332,296 @@ error_splat_super:
 
 #ifdef CONFIG_NFS_V4
 
+static match_table_t nfs4_tokens = {
+	{Opt_userspace, "bg"},
+	{Opt_userspace, "fg"},
+	{Opt_soft, "soft"},
+	{Opt_hard, "hard"},
+	{Opt_intr, "intr"},
+	{Opt_nointr, "nointr"},
+	{Opt_cto, "cto"},
+	{Opt_nocto, "nocto"},
+	{Opt_ac, "ac"},
+	{Opt_noac, "noac"},
+
+	{Opt_port, "port=%u"},
+	{Opt_rsize, "rsize=%u"},
+	{Opt_wsize, "wsize=%u"},
+	{Opt_timeo, "timeo=%u"},
+	{Opt_retrans, "retrans=%u"},
+	{Opt_acregmin, "acregmin=%u"},
+	{Opt_acregmax, "acregmax=%u"},
+	{Opt_acdirmin, "acdirmin=%u"},
+	{Opt_acdirmax, "acdirmax=%u"},
+	{Opt_actimeo, "actimeo=%u"},
+	{Opt_userspace, "retry=%u"},
+
+	{Opt_sec, "sec=%s"},
+	{Opt_proto, "proto=%s"},
+	{Opt_addr, "addr=%s"},
+	{Opt_clientaddr, "clientaddr=%s"},
+
+	{Opt_err, NULL},
+};
+
+static int nfs4_parse_options(char *raw, struct nfs4_mount_args *mnt)
+{
+	char *p, *string;
+
+	if (!raw)
+		return 1;
+
+	while ((p = strsep (&raw, ",")) != NULL) {
+		substring_t args[MAX_OPT_ARGS];
+		int option, token;
+
+		if (!*p)
+			continue;
+		token = match_token(p, nfs4_tokens, args);
+
+		dprintk("NFS: nfs4 mount option '%s': parsing token %d\n",
+			p, token);
+
+		switch (token) {
+		case Opt_soft:
+			mnt->nmd.flags |= NFS4_MOUNT_SOFT;
+			break;
+		case Opt_hard:
+			mnt->nmd.flags &= ~NFS4_MOUNT_SOFT;
+			break;
+		case Opt_intr:
+			mnt->nmd.flags |= NFS4_MOUNT_INTR;
+			break;
+		case Opt_nointr:
+			mnt->nmd.flags &= ~NFS4_MOUNT_INTR;
+			break;
+		case Opt_cto:
+			mnt->nmd.flags &= ~NFS4_MOUNT_NOCTO;
+			break;
+		case Opt_nocto:
+			mnt->nmd.flags |= NFS4_MOUNT_NOCTO;
+			break;
+		case Opt_ac:
+			mnt->nmd.flags &= ~NFS4_MOUNT_NOAC;
+			break;
+		case Opt_noac:
+			mnt->nmd.flags |= NFS4_MOUNT_NOAC;
+			break;
+
+		case Opt_port:
+			if (match_int(args, &option))
+				return 0;
+			if (option < 0 || option > 65535)
+				return 0;
+			mnt->addr.sin_port = htonl(option);
+			break;
+		case Opt_rsize:
+			if (match_int(args, &mnt->nmd.rsize))
+				return 0;
+			break;
+		case Opt_wsize:
+			if (match_int(args, &mnt->nmd.wsize))
+				return 0;
+			break;
+		case Opt_timeo:
+			if (match_int(args, &mnt->nmd.timeo))
+				return 0;
+			break;
+		case Opt_retrans:
+			if (match_int(args, &mnt->nmd.retrans))
+				return 0;
+			break;
+		case Opt_acregmin:
+			if (match_int(args, &mnt->nmd.acregmin))
+				return 0;
+			break;
+		case Opt_acregmax:
+			if (match_int(args, &mnt->nmd.acregmax))
+				return 0;
+			break;
+		case Opt_acdirmin:
+			if (match_int(args, &mnt->nmd.acdirmin))
+				return 0;
+			break;
+		case Opt_acdirmax:
+			if (match_int(args, &mnt->nmd.acdirmax))
+				return 0;
+			break;
+		case Opt_actimeo:
+			if (match_int(args, &option))
+				return 0;
+			if (option < 0)
+				return 0;
+			mnt->nmd.acregmin =
+			mnt->nmd.acregmax =
+			mnt->nmd.acdirmin =
+			mnt->nmd.acdirmax = option;
+			break;
+
+		case Opt_proto: {
+			string = match_strdup(args
[PATCH 04/13] NFS: Remake nfsroot_mount as a permanent part of NFS client
In preparation for supporting NFSv2 and NFSv3 mount option handling in the
kernel NFS client, convert mount_clnt.c to be a permanent part of the NFS
client, instead of built only when CONFIG_ROOT_NFS is enabled.

Signed-off-by: Chuck Lever [EMAIL PROTECTED]
---

 fs/nfs/Makefile        |    4 ++--
 fs/nfs/mount_clnt.c    |   18 +-
 fs/nfs/nfsroot.c       |    2 +-
 include/linux/nfs_fs.h |    4 +---
 4 files changed, 17 insertions(+), 11 deletions(-)

diff --git a/fs/nfs/Makefile b/fs/nfs/Makefile
index f4580b4..b55cb23 100644
--- a/fs/nfs/Makefile
+++ b/fs/nfs/Makefile
@@ -6,8 +6,8 @@ obj-$(CONFIG_NFS_FS) += nfs.o
 
 nfs-y 			:= client.o dir.o file.o getroot.o inode.o super.o nfs2xdr.o \
 			   pagelist.o proc.o read.o symlink.o unlink.o \
-			   write.o namespace.o
-nfs-$(CONFIG_ROOT_NFS)	+= nfsroot.o mount_clnt.o
+			   write.o namespace.o mount_clnt.o
+nfs-$(CONFIG_ROOT_NFS)	+= nfsroot.o
 nfs-$(CONFIG_NFS_V3)	+= nfs3proc.o nfs3xdr.o
 nfs-$(CONFIG_NFS_V3_ACL)	+= nfs3acl.o
 nfs-$(CONFIG_NFS_V4)	+= nfs4proc.o nfs4xdr.o nfs4state.o nfs4renewd.o \
diff --git a/fs/nfs/mount_clnt.c b/fs/nfs/mount_clnt.c
index ca5a266..82a8536 100644
--- a/fs/nfs/mount_clnt.c
+++ b/fs/nfs/mount_clnt.c
@@ -37,12 +37,20 @@ struct mnt_fhstatus {
 	struct nfs_fh *		fh;
 };
 
-/*
- * Obtain an NFS file handle for the given host and path
+/**
+ * nfs_mount - Obtain an NFS file handle for the given host and path
+ * @addr: pointer to server's address
+ * @path: pointer to string containing export path to mount
+ * @fh: pointer to location to place returned file handle
+ * @version: mount version to use for this request
+ * @protocol: transport protocol to use for this request
+ *
+ * Uses default timeout parameters specified by underlying transport.
+ *
+ * XXX: Needs to support IPv6
  */
-int
-nfsroot_mount(struct sockaddr_in *addr, char *path, struct nfs_fh *fh,
-		int version, int protocol)
+int nfs_mount(struct sockaddr_in *addr, char *path, struct nfs_fh *fh,
+	      int version, int protocol)
 {
 	struct rpc_clnt		*mnt_clnt;
 	struct mnt_fhstatus	result = {
diff --git a/fs/nfs/nfsroot.c b/fs/nfs/nfsroot.c
index f0db470..a52c891 100644
--- a/fs/nfs/nfsroot.c
+++ b/fs/nfs/nfsroot.c
@@ -496,7 +496,7 @@ static int __init root_nfs_get_handle(void)
 					NFS_MNT3_VERSION : NFS_MNT_VERSION;
 
 	set_sockaddr(&sin, servaddr, htons(mount_port));
-	status = nfsroot_mount(&sin, nfs_path, &fh, version, protocol);
+	status = nfs_mount(&sin, nfs_path, &fh, version, protocol);
 	if (status < 0)
 		printk(KERN_ERR "Root-NFS: Server returned error %d "
 				"while mounting %s\n", status, nfs_path);
diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
index 0543439..58f5b77 100644
--- a/include/linux/nfs_fs.h
+++ b/include/linux/nfs_fs.h
@@ -496,10 +496,8 @@ static inline void nfs3_forget_cached_acls(struct inode *inode)
 
 /*
  * linux/fs/mount_clnt.c
- * (Used only by nfsroot module)
  */
-extern int  nfsroot_mount(struct sockaddr_in *, char *, struct nfs_fh *,
-		int, int);
+extern int nfs_mount(struct sockaddr_in *, char *, struct nfs_fh *, int, int);
 
 /*
  * inline functions
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
[PATCH 05/13] NFS: Clean up in-kernel NFS mount
Clean up white space and coding conventions.

Signed-off-by: Chuck Lever [EMAIL PROTECTED]
---

 fs/nfs/mount_clnt.c |  132
 1 files changed, 63 insertions(+), 69 deletions(-)

diff --git a/fs/nfs/mount_clnt.c b/fs/nfs/mount_clnt.c
index 82a8536..f8584ad 100644
--- a/fs/nfs/mount_clnt.c
+++ b/fs/nfs/mount_clnt.c
@@ -1,7 +1,5 @@
 /*
- * linux/fs/nfs/mount_clnt.c
- *
- *  MOUNT client to support NFSroot.
+ * In-kernel MOUNT protocol client
  *
  * Copyright (C) 1997, Olaf Kirch [EMAIL PROTECTED]
  */
@@ -21,22 +19,33 @@
 # define NFSDBG_FACILITY	NFSDBG_ROOT
 #endif
 
-/*
-#define MOUNT_PROGRAM		100005
-#define MOUNT_VERSION		1
-#define MOUNT_MNT		1
-#define MOUNT_UMNT		3
- */
-
-static struct rpc_clnt *	mnt_create(char *, struct sockaddr_in *,
-								int, int);
 static struct rpc_program	mnt_program;
 
 struct mnt_fhstatus {
-	unsigned int		status;
-	struct nfs_fh *		fh;
+	u32 status;
+	struct nfs_fh *fh;
 };
 
+static struct rpc_clnt *mnt_create(char *hostname,
+				   struct sockaddr_in *srvaddr,
+				   int version,
+				   int protocol)
+{
+	struct rpc_create_args args = {
+		.protocol	= protocol,
+		.address	= (struct sockaddr *)srvaddr,
+		.addrsize	= sizeof(*srvaddr),
+		.servername	= hostname,
+		.program	= &mnt_program,
+		.version	= version,
+		.authflavor	= RPC_AUTH_UNIX,
+		.flags		= (RPC_CLNT_CREATE_ONESHOT |
+				   RPC_CLNT_CREATE_INTR),
+	};
+
+	return rpc_create(&args);
+}
+
 /**
  * nfs_mount - Obtain an NFS file handle for the given host and path
  * @addr: pointer to server's address
@@ -66,7 +75,7 @@ int nfs_mount(struct sockaddr_in *addr, char *path, struct nfs_fh *fh,
 	dprintk("NFS: nfs_mount(%08x:%s)\n",
 			(unsigned)ntohl(addr->sin_addr.s_addr), path);
 
-	sprintf(hostname, "%u.%u.%u.%u", NIPQUAD(addr->sin_addr.s_addr));
+	sprintf(hostname, NIPQUAD_FMT, NIPQUAD(addr->sin_addr.s_addr));
 	mnt_clnt = mnt_create(hostname, addr, version, protocol);
 	if (IS_ERR(mnt_clnt))
 		return PTR_ERR(mnt_clnt);
@@ -77,33 +86,18 @@ int nfs_mount(struct sockaddr_in *addr, char *path, struct nfs_fh *fh,
 	msg.rpc_proc = &mnt_clnt->cl_procinfo[MNTPROC_MNT];
 
 	status = rpc_call_sync(mnt_clnt, &msg, 0);
-	return status < 0? status : (result.status? -EACCES : 0);
-}
-
-static struct rpc_clnt *
-mnt_create(char *hostname, struct sockaddr_in *srvaddr, int version,
-		int protocol)
-{
-	struct rpc_create_args args = {
-		.protocol	= protocol,
-		.address	= (struct sockaddr *)srvaddr,
-		.addrsize	= sizeof(*srvaddr),
-		.servername	= hostname,
-		.program	= &mnt_program,
-		.version	= version,
-		.authflavor	= RPC_AUTH_UNIX,
-		.flags		= (RPC_CLNT_CREATE_ONESHOT |
-				   RPC_CLNT_CREATE_INTR),
-	};
-
-	return rpc_create(&args);
+	if (status < 0)
+		return status;
+	if (result.status != 0)
+		return -EACCES;
+	return 0;
 }
 
 /*
  * XDR encode/decode functions for MOUNT
  */
-static int
-xdr_encode_dirpath(struct rpc_rqst *req, __be32 *p, const char *path)
+static int xdr_encode_dirpath(struct rpc_rqst *req, __be32 *p,
+			      const char *path)
 {
 	p = xdr_encode_string(p, path);
 
@@ -111,8 +105,8 @@ xdr_encode_dirpath(struct rpc_rqst *req, __be32 *p, const char *path)
 	return 0;
 }
 
-static int
-xdr_decode_fhstatus(struct rpc_rqst *req, __be32 *p, struct mnt_fhstatus *res)
+static int xdr_decode_fhstatus(struct rpc_rqst *req, __be32 *p,
+			       struct mnt_fhstatus *res)
 {
 	struct nfs_fh *fh = res->fh;
 
@@ -123,8 +117,8 @@ xdr_decode_fhstatus(struct rpc_rqst *req, __be32 *p, struct mnt_fhstatus *res)
 	return 0;
 }
 
-static int
-xdr_decode_fhstatus3(struct rpc_rqst *req, __be32 *p, struct mnt_fhstatus *res)
+static int xdr_decode_fhstatus3(struct rpc_rqst *req, __be32 *p,
+				struct mnt_fhstatus *res)
 {
 	struct nfs_fh *fh = res->fh;
 
@@ -143,53 +137,53 @@ xdr_decode_fhstatus3(struct rpc_rqst *req, __be32 *p, struct mnt_fhstatus *res)
 #define MNT_fhstatus_sz		(1 + 8)
 #define MNT_fhstatus3_sz	(1 + 16)
 
-static struct rpc_procinfo	mnt_procedures[] = {
-[MNTPROC_MNT] = {
-	  .p_proc		= MNTPROC_MNT,
-	  .p_encode		= (kxdrproc_t
[PATCH 03/13] SUNRPC: Rename rpcb_getport to be consistent with new rpcb_getport_sync name
Clean up, for consistency. Rename rpcb_getport as rpcb_getport_async, to
match the naming scheme of rpcb_getport_sync.

Signed-off-by: Chuck Lever [EMAIL PROTECTED]
---

 include/linux/sunrpc/clnt.h |    2 +-
 net/sunrpc/rpcb_clnt.c      |   37 +++--
 net/sunrpc/xprtsock.c       |    4 ++--
 3 files changed, 22 insertions(+), 21 deletions(-)

diff --git a/include/linux/sunrpc/clnt.h b/include/linux/sunrpc/clnt.h
index c51bc8c..9bea7b5 100644
--- a/include/linux/sunrpc/clnt.h
+++ b/include/linux/sunrpc/clnt.h
@@ -124,8 +124,8 @@ int		rpc_destroy_client(struct rpc_clnt *);
 void		rpc_release_client(struct rpc_clnt *);
 
 int		rpcb_register(u32, u32, int, unsigned short, int *);
-void		rpcb_getport(struct rpc_task *);
 int		rpcb_getport_sync(struct sockaddr_in *, __u32, __u32, int);
+void		rpcb_getport_async(struct rpc_task *);
 
 void		rpc_call_setup(struct rpc_task *, struct rpc_message *, int);
 
diff --git a/net/sunrpc/rpcb_clnt.c b/net/sunrpc/rpcb_clnt.c
index 5a52604..905ba5a 100644
--- a/net/sunrpc/rpcb_clnt.c
+++ b/net/sunrpc/rpcb_clnt.c
@@ -298,13 +298,13 @@ int rpcb_getport_sync(struct sockaddr_in *sin, __u32 prog,
 EXPORT_SYMBOL_GPL(rpcb_getport_sync);
 
 /**
- * rpcb_getport - obtain the port for a given RPC service on a given host
+ * rpcb_getport_async - obtain the port for a given RPC service on a given host
  * @task: task that is waiting for portmapper request
  *
  * This one can be called for an ongoing RPC request, and can be used in
  * an async (rpciod) context.
  */
-void rpcb_getport(struct rpc_task *task)
+void rpcb_getport_async(struct rpc_task *task)
 {
 	struct rpc_clnt *clnt = task->tk_client;
 	int bind_version;
@@ -315,17 +315,17 @@ void rpcb_getport(struct rpc_task *task)
 	struct sockaddr addr;
 	int status;
 
-	dprintk("RPC: %5u rpcb_getport(%s, %u, %u, %d)\n",
-			task->tk_pid, clnt->cl_server,
-			clnt->cl_prog, clnt->cl_vers, xprt->prot);
+	dprintk("RPC: %5u %s(%s, %u, %u, %d)\n",
+		task->tk_pid, __FUNCTION__,
+		clnt->cl_server, clnt->cl_prog, clnt->cl_vers, xprt->prot);
 
 	/* Autobind on cloned rpc clients is discouraged */
 	BUG_ON(clnt->cl_parent != clnt);
 
 	if (xprt_test_and_set_binding(xprt)) {
 		status = -EACCES;		/* tell caller to check again */
-		dprintk("RPC: %5u rpcb_getport waiting for another binder\n",
-				task->tk_pid);
+		dprintk("RPC: %5u %s: waiting for another binder\n",
+			task->tk_pid, __FUNCTION__);
 		goto bailout_nowake;
 	}
 
@@ -336,27 +336,28 @@ void rpcb_getport(struct rpc_task *task)
 	/* Someone else may have bound if we slept */
 	if (xprt_bound(xprt)) {
 		status = 0;
-		dprintk("RPC: %5u rpcb_getport already bound\n", task->tk_pid);
+		dprintk("RPC: %5u %s: already bound\n",
+			task->tk_pid, __FUNCTION__);
 		goto bailout_nofree;
 	}
 
 	if (rpcb_next_version[xprt->bind_index].rpc_proc == NULL) {
 		xprt->bind_index = 0;
 		status = -EACCES;	/* tell caller to try again later */
-		dprintk("RPC: %5u rpcb_getport no more getport versions "
-				"available\n", task->tk_pid);
+		dprintk("RPC: %5u %s: no more getport versions available\n",
+			task->tk_pid, __FUNCTION__);
 		goto bailout_nofree;
 	}
 	bind_version = rpcb_next_version[xprt->bind_index].rpc_vers;
 
-	dprintk("RPC: %5u rpcb_getport trying rpcbind version %u\n",
-			task->tk_pid, bind_version);
+	dprintk("RPC: %5u %s: trying rpcbind version %u\n",
+		task->tk_pid, __FUNCTION__, bind_version);
 
 	map = kzalloc(sizeof(struct rpcbind_args), GFP_ATOMIC);
 	if (!map) {
 		status = -ENOMEM;
-		dprintk("RPC: %5u rpcb_getport no memory available\n",
-				task->tk_pid);
+		dprintk("RPC: %5u %s: no memory available\n",
+			task->tk_pid, __FUNCTION__);
 		goto bailout_nofree;
 	}
 	map->r_prog = clnt->cl_prog;
@@ -374,16 +375,16 @@ void rpcb_getport(struct rpc_task *task)
 	rpcb_clnt = rpcb_create(clnt->cl_server, &addr, xprt->prot, bind_version, 0);
 	if (IS_ERR(rpcb_clnt)) {
 		status = PTR_ERR(rpcb_clnt);
-		dprintk("RPC: %5u rpcb_getport rpcb_create failed, error %ld\n",
-				task->tk_pid, PTR_ERR(rpcb_clnt));
+		dprintk("RPC: %5u %s: rpcb_create failed, error %ld\n",
+			task->tk_pid, __FUNCTION__, PTR_ERR(rpcb_clnt));
 		goto bailout;
 	}
 
 	child = rpc_run_task(rpcb_clnt, RPC_TASK_ASYNC
[PATCH 07/13] NFS: New infrastructure for NFS client in-kernel mount option parsing
Add some data structures and definitions to support parsing NFS mount
options in the kernel NFS client.

Signed-off-by: Chuck Lever [EMAIL PROTECTED]
---

 fs/nfs/super.c |   79
 1 files changed, 79 insertions(+), 0 deletions(-)

diff --git a/fs/nfs/super.c b/fs/nfs/super.c
index 31f7313..1974648 100644
--- a/fs/nfs/super.c
+++ b/fs/nfs/super.c
@@ -45,6 +45,7 @@
 #include <linux/inet.h>
 #include <linux/nfs_xdr.h>
 #include <linux/magic.h>
+#include <linux/parser.h>
 
 #include <asm/system.h>
 #include <asm/uaccess.h>
@@ -57,6 +58,84 @@
 
 #define NFSDBG_FACILITY		NFSDBG_VFS
 
+
+struct nfs_mount_args {
+	struct nfs_mount_data	nmd;
+	unsigned int		nfsprog;
+	unsigned		use_mnthost;
+	struct sockaddr_in	mnthost;
+	unsigned int		mntprog;
+	unsigned int		mntvers;
+	unsigned short		mntport;
+};
+
+struct nfs4_mount_args {
+	struct nfs4_mount_data	nmd;
+	struct sockaddr_in	addr;
+	char			clientaddr[16];
+	int			authflavor;
+};
+
+enum {
+	/* Mount options that take no arguments */
+	Opt_soft, Opt_hard,
+	Opt_intr, Opt_nointr,
+	Opt_posix, Opt_noposix,
+	Opt_cto, Opt_nocto,
+	Opt_ac, Opt_noac,
+	Opt_lock, Opt_nolock,
+	Opt_v2, Opt_v3,
+	Opt_udp, Opt_tcp,
+	Opt_acl, Opt_noacl,
+
+	/* Mount options that take integer arguments */
+	Opt_port,
+	Opt_rsize, Opt_wsize,
+	Opt_timeo, Opt_retrans,
+	Opt_acregmin, Opt_acregmax,
+	Opt_acdirmin, Opt_acdirmax,
+	Opt_actimeo,
+	Opt_namelen,
+	Opt_mountport,
+	Opt_mountprog, Opt_mountvers,
+	Opt_nfsprog, Opt_nfsvers,
+
+	/* Mount options that take string arguments */
+	Opt_sec, Opt_proto, Opt_addr,
+	Opt_mounthost, Opt_clientaddr, Opt_context,
+
+	/* Mount options that are ignored */
+	Opt_userspace, Opt_deprecated,
+
+	Opt_err,
+};
+
+enum {
+	Opt_sec_none, Opt_sec_sys,
+	Opt_sec_krb5, Opt_sec_krb5i, Opt_sec_krb5p,
+	Opt_sec_lkey, Opt_sec_lkeyi, Opt_sec_lkeyp,
+	Opt_sec_spkm, Opt_sec_spkmi, Opt_sec_spkmp,
+
+	Opt_sec_err,
+};
+
+static match_table_t nfs_sec_tokens = {
+	{Opt_sec_none, "none"},
+	{Opt_sec_none, "null"},
+	{Opt_sec_sys, "sys"},
+
+	{Opt_sec_krb5, "krb5"},
+	{Opt_sec_krb5i, "krb5i"},
+	{Opt_sec_krb5p, "krb5p"},
+
+	{Opt_sec_lkey, "lkey"},
+	{Opt_sec_lkeyi, "lkeyi"},
+	{Opt_sec_lkeyp, "lkeyp"},
+
+	{Opt_sec_err, NULL},
+};
+
+
 static void nfs_umount_begin(struct vfsmount *, int);
 static int  nfs_statfs(struct dentry *, struct kstatfs *);
 static int  nfs_show_options(struct seq_file *, struct vfsmount *);
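For readers who haven't used lib/parser.c: the tables in this patch are scanned top to bottom, and the entry with a NULL pattern both ends the table and supplies the "no match" token. A minimal user-space sketch of that lookup for literal patterns only (no %u or %s handling); the names sec_token, sec_match and match_sec are hypothetical and are not the kernel's match_token() API.

```c
#include <stddef.h>
#include <string.h>

enum sec_token {
	SEC_NONE, SEC_SYS,
	SEC_KRB5, SEC_KRB5I, SEC_KRB5P,
	SEC_ERR,
};

struct sec_match {
	enum sec_token token;
	const char *pattern;	/* NULL marks the end of the table */
};

/* Two spellings may map to one token, as "none" and "null" do here. */
static const struct sec_match sec_table[] = {
	{ SEC_NONE,  "none"  },
	{ SEC_NONE,  "null"  },
	{ SEC_SYS,   "sys"   },
	{ SEC_KRB5,  "krb5"  },
	{ SEC_KRB5I, "krb5i" },
	{ SEC_KRB5P, "krb5p" },
	{ SEC_ERR,   NULL    },
};

/* Return the token of the first exact match, or the terminator's token. */
static enum sec_token match_sec(const char *s)
{
	const struct sec_match *m;

	for (m = sec_table; m->pattern != NULL; m++)
		if (strcmp(s, m->pattern) == 0)
			return m->token;
	return m->token;	/* terminating entry: SEC_ERR */
}
```

Because the comparison is exact, "krb5" and "krb5i" never shadow each other regardless of table order.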
[PATCH 00/13] Support NFS mount option parsing in the kernel
This patch series introduces support for parsing NFS mount options in the
kernel, similar to support that exists for many other Linux file systems
such as ext3, autofs, fat, cifs, hfs, and ocfs2.

I'd like to integrate this patch set into -mm to encourage wide review and
perhaps get some penetration testing before moving forward with integration
into the mainline kernel.

Future enhancements might include caching connections to mountd so we don't
use up so many privileged ports during mount storms, removing similar
infrastructure in NFSROOT in favor of this implementation, and support for
NFS over IPv6 and RDMA.

-- 
corporate: chuck dot lever at oracle dot com
[PATCH 12/13] NFS: More nfs4 in-kernel mount option parsing infrastructure
Add function for switching between an nfs4_mount_data structure from user
space (the current nfs4 mount mechanism) and generating an nfs4_mount_data
structure from a text string containing nfs4 mount options.

Signed-off-by: Chuck Lever [EMAIL PROTECTED]
---

 fs/nfs/super.c |  123
 1 files changed, 123 insertions(+), 0 deletions(-)

diff --git a/fs/nfs/super.c b/fs/nfs/super.c
index 8585fa5..e0acd08 100644
--- a/fs/nfs/super.c
+++ b/fs/nfs/super.c
@@ -1643,6 +1643,129 @@ static void *nfs_copy_user_string(char *dst, struct nfs_string *src, int maxlen)
 	return dst;
 }
 
+static int nfs4_validate_mount_data(struct nfs4_mount_data **options,
+				    const char *dev_name,
+				    struct sockaddr_in *addr,
+				    rpc_authflavor_t *authflavour,
+				    char **hostname,
+				    char **mntpath,
+				    char *ip_addr)
+{
+	struct nfs4_mount_data *data = *options;
+	char *c;
+	unsigned len;
+
+	if (data == NULL) {
+		dprintk("%s: missing data argument\n", __FUNCTION__);
+		return -EINVAL;
+	}
+
+	switch (data->version) {
+	case 1:
+		if (data->host_addrlen != sizeof(*addr))
+			return -EINVAL;
+		if (copy_from_user(addr, data->host_addr, sizeof(*addr)))
+			return -EFAULT;
+		if (addr->sin_port == 0)
+			addr->sin_port = htons(NFS_PORT);
+		if (!nfs_verify_server_address((struct sockaddr *) addr,
+						data->host_addrlen))
+			return -EINVAL;
+
+		switch (data->auth_flavourlen) {
+		case 0:
+			*authflavour = RPC_AUTH_UNIX;
+			break;
+		case 1:
+			if (copy_from_user(authflavour, data->auth_flavours,
+					   sizeof(*authflavour)))
+				return -EFAULT;
+		default:
+			goto out_inval_auth;
+		}
+
+		c = nfs_copy_user_string(ip_addr, &data->client_addr, 80);
+		if (IS_ERR(c))
+			return PTR_ERR(c);
+
+		c = nfs_copy_user_string(NULL, &data->hostname, 256);
+		if (IS_ERR(c))
+			return PTR_ERR(c);
+		*hostname = c;
+
+		c = nfs_copy_user_string(NULL, &data->mnt_path, 1024);
+		if (IS_ERR(c)) {
+			kfree(*hostname);
+			return PTR_ERR(c);
+		}
+		*mntpath = c;
+		dprintk("MNTPATH: %s\n", *mntpath);
+
+		return 0;
+	default:
+		data = nfs4_convert_mount_opts((char *) data);
+		if (IS_ERR(data))
+			return PTR_ERR(data);
+		*options = data;
+
+		memcpy(addr, data->host_addr, sizeof(*addr));
+		if (!nfs_verify_server_address((struct sockaddr *) addr,
+						data->host_addrlen))
+			return -EINVAL;
+
+		switch (data->auth_flavourlen) {
+		case 0:
+			*authflavour = RPC_AUTH_UNIX;
+			break;
+		case 1:
+			*authflavour = (rpc_authflavor_t) data->auth_flavours[0];
+			break;
+		default:
+			goto out_inval_auth;
+		}
+
+		memset(ip_addr, '\0', data->client_addr.len + 1);
+		strncpy(ip_addr, data->client_addr.data, data->client_addr.len);
+
+		/*
+		 * Split dev_name into hostname:mntpath.
+		 */
+		c = strchr(dev_name, ':');
+		if (c == NULL)
+			return -EINVAL;
+		/* while calculating len, pretend ':' is '\0' */
+		len = c - dev_name;
+		if (len > 256)
+			return -EINVAL;
+		*hostname = kzalloc(len, GFP_KERNEL);
+		if (*hostname == NULL)
+			return -ENOMEM;
+		strncpy(*hostname, dev_name, len - 1);
+
+		c++;		/* step over the ':' */
+		len = strlen(c);
+		if (len > 1023) {
+			kfree(*hostname);
+			return -EINVAL;
+		}
+		*mntpath = kzalloc(len + 1, GFP_KERNEL);
+		if (*mntpath == NULL) {
+			kfree(*hostname);
+			return -ENOMEM;
+		}
+		strncpy(*mntpath, c, len);
+
+		dprintk("MNTPATH: %s\n", *mntpath
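The "Split dev_name into hostname:mntpath" step is easy to get off by one. As posted, the hostname buffer is len bytes and only len - 1 bytes are copied, which as far as I can tell drops the last character of the server name ("server:/export" would yield "serve"). A minimal user-space sketch of the same split, assuming libc allocation in place of kzalloc(); split_dev_name and the length limits are hypothetical, chosen only to mirror the hunk.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical user-space helper: split "host:/path" into two newly
 * allocated, NUL-terminated strings. Returns 0 on success, -1 on error.
 * The caller frees *hostname and *path.
 */
static int split_dev_name(const char *dev_name, char **hostname, char **path)
{
	const char *colon = strchr(dev_name, ':');
	size_t hlen, plen;

	if (colon == NULL)
		return -1;
	hlen = (size_t)(colon - dev_name);	/* bytes before the ':' */
	plen = strlen(colon + 1);		/* bytes after the ':' */
	if (hlen == 0 || hlen > 255 || plen > 1023)
		return -1;

	/* one extra zeroed byte for the terminator; copy all hlen bytes */
	*hostname = calloc(hlen + 1, 1);
	if (*hostname == NULL)
		return -1;
	memcpy(*hostname, dev_name, hlen);

	*path = calloc(plen + 1, 1);
	if (*path == NULL) {
		free(*hostname);
		*hostname = NULL;
		return -1;
	}
	memcpy(*path, colon + 1, plen);
	return 0;
}
```

The sketch splits at the first ':' just as the hunk does, so it would not cope with a bare IPv6 literal as the server name.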
[PATCH 02/13] SUNRPC: Rename rpcb_getport_external routine
In preparation for handling NFS mount option parsing in the kernel, rename
rpcb_getport_external as rpcb_getport_sync, and make it available always
(instead of only when CONFIG_ROOT_NFS is enabled).

Signed-off-by: Chuck Lever [EMAIL PROTECTED]
---

 fs/nfs/nfsroot.c            |    2 +-
 include/linux/sunrpc/clnt.h |    7 ++-
 net/sunrpc/rpcb_clnt.c      |   21 +++--
 3 files changed, 14 insertions(+), 16 deletions(-)

diff --git a/fs/nfs/nfsroot.c b/fs/nfs/nfsroot.c
index 49d1008..f0db470 100644
--- a/fs/nfs/nfsroot.c
+++ b/fs/nfs/nfsroot.c
@@ -428,7 +428,7 @@ static int __init root_nfs_getport(int program, int version, int proto)
 	printk(KERN_NOTICE "Looking up port of RPC %d/%d on %u.%u.%u.%u\n",
 		program, version, NIPQUAD(servaddr));
 	set_sockaddr(&sin, servaddr, 0);
-	return rpcb_getport_external(&sin, program, version, proto);
+	return rpcb_getport_sync(&sin, program, version, proto);
 }
 
 
diff --git a/include/linux/sunrpc/clnt.h b/include/linux/sunrpc/clnt.h
index 6661142..c51bc8c 100644
--- a/include/linux/sunrpc/clnt.h
+++ b/include/linux/sunrpc/clnt.h
@@ -122,8 +122,10 @@ struct rpc_clnt *rpc_clone_client(struct rpc_clnt *);
 int		rpc_shutdown_client(struct rpc_clnt *);
 int		rpc_destroy_client(struct rpc_clnt *);
 void		rpc_release_client(struct rpc_clnt *);
+
 int		rpcb_register(u32, u32, int, unsigned short, int *);
 void		rpcb_getport(struct rpc_task *);
+int		rpcb_getport_sync(struct sockaddr_in *, __u32, __u32, int);
 
 void		rpc_call_setup(struct rpc_task *, struct rpc_message *, int);
 
@@ -142,10 +144,5 @@ int		rpc_ping(struct rpc_clnt *clnt, int flags);
 size_t		rpc_peeraddr(struct rpc_clnt *, struct sockaddr *, size_t);
 char *		rpc_peeraddr2str(struct rpc_clnt *, enum rpc_display_format_t);
 
-/*
- * Helper function for NFSroot support
- */
-int		rpcb_getport_external(struct sockaddr_in *, __u32, __u32, int);
-
 #endif /* __KERNEL__ */
 #endif /* _LINUX_SUNRPC_CLNT_H */
diff --git a/net/sunrpc/rpcb_clnt.c b/net/sunrpc/rpcb_clnt.c
index 6c7aa8a..5a52604 100644
--- a/net/sunrpc/rpcb_clnt.c
+++ b/net/sunrpc/rpcb_clnt.c
@@ -12,6 +12,8 @@
  * Copyright (C) 1996, Olaf Kirch [EMAIL PROTECTED]
  */
 
+#include <linux/module.h>
+
 #include <linux/types.h>
 #include <linux/socket.h>
 #include <linux/kernel.h>
@@ -246,21 +248,20 @@ int rpcb_register(u32 prog, u32 vers, int prot, unsigned short port, int *okay)
 	return error;
 }
 
-#ifdef CONFIG_ROOT_NFS
 /**
- * rpcb_getport_external - obtain the port for an RPC service on a given host
+ * rpcb_getport_sync - obtain the port for an RPC service on a given host
  * @sin: address of remote peer
  * @prog: RPC program number to bind
  * @vers: RPC version number to bind
  * @prot: transport protocol to use to make this request
  *
  * Called from outside the RPC client in a synchronous task context.
+ * Uses default timeout parameters specified by underlying transport.
  *
- * For now, this supports only version 2 queries, but is used only by
- * mount_clnt for NFS_ROOT.
+ * XXX: Needs to support IPv6, and rpcbind versions 3 and 4
  */
-int rpcb_getport_external(struct sockaddr_in *sin, __u32 prog,
-				__u32 vers, int prot)
+int rpcb_getport_sync(struct sockaddr_in *sin, __u32 prog,
+		      __u32 vers, int prot)
 {
 	struct rpcbind_args map = {
 		.r_prog		= prog,
@@ -277,10 +278,10 @@ int rpcb_getport_external(struct sockaddr_in *sin, __u32 prog,
 	char hostname[40];
 	int status;
 
-	dprintk("RPC: rpcb_getport_external(%u.%u.%u.%u, %u, %u, %d)\n",
-			NIPQUAD(sin->sin_addr.s_addr), prog, vers, prot);
+	dprintk("RPC: %s(" NIPQUAD_FMT ", %u, %u, %d)\n",
+		__FUNCTION__, NIPQUAD(sin->sin_addr.s_addr), prog, vers, prot);
 
-	sprintf(hostname, "%u.%u.%u.%u", NIPQUAD(sin->sin_addr.s_addr));
+	sprintf(hostname, NIPQUAD_FMT, NIPQUAD(sin->sin_addr.s_addr));
 	rpcb_clnt = rpcb_create(hostname, (struct sockaddr *)sin, prot, 2, 0);
 	if (IS_ERR(rpcb_clnt))
 		return PTR_ERR(rpcb_clnt);
@@ -294,7 +295,7 @@ int rpcb_getport_external(struct sockaddr_in *sin, __u32 prog,
 	}
 	return status;
 }
-#endif
+EXPORT_SYMBOL_GPL(rpcb_getport_sync);
 
 /**
  * rpcb_getport - obtain the port for a given RPC service on a given host
[PATCH 11/13] NFS: Move nfs_copy_user_string
Next patch will add a new function that calls nfs_copy_user_string.

Signed-off-by: Chuck Lever [EMAIL PROTECTED]
---

 fs/nfs/super.c |   42 +-
 1 files changed, 21 insertions(+), 21 deletions(-)

diff --git a/fs/nfs/super.c b/fs/nfs/super.c
index 927c1c2..8585fa5 100644
--- a/fs/nfs/super.c
+++ b/fs/nfs/super.c
@@ -1622,6 +1622,27 @@ out_err:
 	return ERR_PTR(-EINVAL);
 }
 
+static void *nfs_copy_user_string(char *dst, struct nfs_string *src, int maxlen)
+{
+	void *p = NULL;
+
+	if (!src->len)
+		return ERR_PTR(-EINVAL);
+	if (src->len < maxlen)
+		maxlen = src->len;
+	if (dst == NULL) {
+		p = dst = kmalloc(maxlen + 1, GFP_KERNEL);
+		if (p == NULL)
+			return ERR_PTR(-ENOMEM);
+	}
+	if (copy_from_user(dst, src->data, maxlen)) {
+		kfree(p);
+		return ERR_PTR(-EFAULT);
+	}
+	dst[maxlen] = '\0';
+	return dst;
+}
+
 /*
  * Finish setting up a cloned NFS4 superblock
  */
@@ -1646,27 +1667,6 @@ static void nfs4_fill_super(struct super_block *sb)
 	nfs_initialise_sb(sb);
 }
 
-static void *nfs_copy_user_string(char *dst, struct nfs_string *src, int maxlen)
-{
-	void *p = NULL;
-
-	if (!src->len)
-		return ERR_PTR(-EINVAL);
-	if (src->len < maxlen)
-		maxlen = src->len;
-	if (dst == NULL) {
-		p = dst = kmalloc(maxlen + 1, GFP_KERNEL);
-		if (p == NULL)
-			return ERR_PTR(-ENOMEM);
-	}
-	if (copy_from_user(dst, src->data, maxlen)) {
-		kfree(p);
-		return ERR_PTR(-EFAULT);
-	}
-	dst[maxlen] = '\0';
-	return dst;
-}
-
 /*
  * Get the superblock for an NFS4 mountpoint
  */
Re: [PATCH 08/13] NFS: Add functions to parse nfs mount options to fs/nfs/super.c
Karel Zak wrote:
> On Mon, May 21, 2007 at 12:09:54PM -0400, Chuck Lever wrote:
>> For NFSv2 and NFSv3 mount options.
>>
>> Signed-off-by: Chuck Lever [EMAIL PROTECTED]
>
>> +static int nfs_parse_options(char *raw, struct nfs_mount_args *mnt)
>> +{
>> +	char *p, *string;
>> +
>> +	if (!raw) {
>> +		dprintk("NFS: mount options string was NULL.\n");
>> +		return 1;
>> +	}
>> +
>> +	while ((p = strsep (&raw, ",")) != NULL) {
>> +		substring_t args[MAX_OPT_ARGS];
>> +		int option, token;
>> +
>> +		if (!*p)
>> +			continue;
>> +		token = match_token(p, nfs_tokens, args);
>
>> +		case Opt_context:
>> +			match_strcpy(mnt->nmd.context, args);
>> +			break;
>
> The userspace version (nfs-utils) of this code supports quoted
> context strings. For example:
>
>     context="aaa,bbb,ccc",hard
>
> It seems your code blindly parses a raw option string by ",".

Karel-

I've never used the context= option, and didn't find any documentation
describing how it is used. Is there a clean example of how to use the
in-kernel parser to handle quoted strings containing commas?

begin:vcard
fn:Chuck Lever
n:Lever;Chuck
org:Oracle Corporation;Corporate Architecture: Linux Projects Group
adr:;;1015 Granger Avenue;Ann Arbor;MI;48104;USA
title:Principal Member of Staff
tel;work:+1 248 614 5091
x-mozilla-html:FALSE
url:http://oss.oracle.com/~cel/
version:2.1
end:vcard
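One way to approach the quoted-context question, offered only as a sketch: replace the strsep() call with a splitter that ignores commas between double quotes. The function below is hypothetical (it is not in lib/parser.c or nfs-utils) and assumes the only quoting rule is paired double quotes with no escape sequences; the quotes are left in the returned token for the caller to strip.

```c
#include <stddef.h>

/*
 * Hypothetical stand-in for strsep(&raw, ","): return the next
 * comma-separated option from *cursor, treating commas inside double
 * quotes as ordinary characters. The string is modified in place and
 * *cursor is advanced; returns NULL when no options remain.
 */
static char *next_option(char **cursor)
{
	char *start = *cursor;
	char *p;
	int in_quotes = 0;

	if (start == NULL)
		return NULL;

	for (p = start; *p != '\0'; p++) {
		if (*p == '"')
			in_quotes = !in_quotes;
		else if (*p == ',' && !in_quotes)
			break;
	}

	if (*p == ',') {
		*p = '\0';		/* terminate this option */
		*cursor = p + 1;	/* next call starts after the comma */
	} else {
		*cursor = NULL;		/* reached the end of the string */
	}
	return start;
}
```

Like strsep(), this hands back empty tokens for doubled commas, so a caller can keep the existing "if (!*p) continue;" check unchanged.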
Re: [ANNOUNCE] Btrfs: a copy on write, snapshotting FS
Hi Chris-

John Stoffel wrote:
> As a user of Netapps, having quotas (if only for reporting purposes)
> and some way to migrate non-used files to slower/cheaper storage would
> be great.
>
> Ie. being able to setup two pools, one being RAID6, the other being
> RAID1, where all currently accessed files are in the RAID1 setup, but
> if un-used get migrated to the RAID6 area.
>
> And of course some way for efficient backups and more importantly
> RESTORES of data which is segregated like this.

I like the way dump and restore was handled in AFS (and now ZFS and
NetApp). There is a simple command to flatten a file system and send it
to another system, which can receive it and re-expand it. The
dump/restore process uses snapshots and can easily send incremental
backups which are significantly smaller than 0-level. This is somewhat
better than rsync, because you don't need checksums to discover what
data has changed -- you already have the new data segregated into
copied-on-write blocks.

NetApp happens to use the standard NDMP protocol for sending the
flattened file system. NetApp uses it for synchronous replication,
volume migration, and back up to nearline storage and tape. AFS used
"vol dump" and "vol restore" for migration, replication, and back-up.
ZFS has the "zfs send" and "zfs receive" commands that do basically the
same (Eric Kustarz recently published a blog entry that described how
these work). And of course, all file system objects are able to be sent
this way: streams, xattrs, ACLs, and so on are all supported.

Note also that NFSv4 supports the idea of migrated or replicated file
objects. All that is needed to support it is a mechanism on the servers
to actually move the data.
Re: [ANNOUNCE] Btrfs: a copy on write, snapshotting FS
Chris Mason wrote:
> On Thu, Jun 14, 2007 at 02:20:26PM -0400, Chuck Lever wrote:
>> NetApp happens to use the standard NDMP protocol for sending the
>> flattened file system. NetApp uses it for synchronous replication,
>> volume migration, and back up to nearline storage and tape. AFS used
>> "vol dump" and "vol restore" for migration, replication, and back-up.
>> ZFS has the "zfs send" and "zfs receive" commands that do basically
>> the same (Eric Kustarz recently published a blog entry that described
>> how these work). And of course, all file system objects are able to
>> be sent this way: streams, xattrs, ACLs, and so on are all supported.
>>
>> Note also that NFSv4 supports the idea of migrated or replicated file
>> objects. All that is needed to support it is a mechanism on the
>> servers to actually move the data.
>
> Stringing the replication together with the underlying FS would be
> neat. Is there a way to deal with a master/slave setup, where the
> slave may be out of date?

Among the implementations I'm aware of, there is a varying degree of
integration into the physical file system.

In general, it depends on how far out of date the slave is, and how
closely the slave is supposed to be synchronized to the master. A hot
backup file system, for example, should be data-consistent within a few
seconds of the master. A snapshot is used to initialize a slave,
followed by a live stream of updates to the master being sent to slaves.
Such a mechanism already exists on NetApp filers because they gather
changes in NVRAM before committing them to the local file system. Simply
put, these changes can also be bundled and sent to a local hot backup
filer that is attached via Infiniband, or over the network to a remote
hot backup filer.

For AFS, replication is done by maintaining a rw and ro copy of a volume
on the designated master server. Changes are made to the rw copy over
time. When admins want to push out a new version to replicas on another
server, the ro copy on the master is replaced with a new snapshot, then
this is pushed to the slaves. The replicas are always ro and are used
mostly for load balancing; clients contact the closest or fastest server
containing a replica of the volume they want to access. They always have
a complete copy of the volume (ie no COW on the slaves).

I think you have designed into btrfs a lot of opportunity to implement
this kind of data virtualization and management... I'm excited to see
what can be done.
Re: Adding subroot information to /proc/mounts, or obtaining that through other means
Al Viro wrote:
> On Wed, Jun 20, 2007 at 01:57:33PM -0700, H. Peter Anvin wrote:
>> ... or, alternatively, add a subfield to the first field (which would
>> entail escaping whatever separator we choose):
>>
>> /dev/md6 /export ext3 rw,data=ordered 0 0
>> /dev/md6:/users/foo /home/foo ext3 rw,data=ordered 0 0
>> /dev/md6:/users/bar /home/bar ext3 rw,data=ordered 0 0
>
> Hell, no. The first field is in principle impossible to parse unless
> you know the fs type.
>
> How about making a new file with sane format? From the very beginning.
> E.g. mountpoint + ID + relative path + type + options, where ID
> uniquely identifies superblock (e.g. numeric st_dev) and backing device
> (if any) is sitting among the options...

To support NFS client performance statistics, I recently added
/proc/self/mountstats. That might be a place to add details about --move
and --bind mounts without changing the format of /proc/mounts.
Re: Adding subroot information to /proc/mounts, or obtaining that through other means
H. Peter Anvin wrote:
> Chuck Lever wrote:
>> To support NFS client performance statistics, I recently added
>> /proc/self/mountstats. That might be a place to add details about
>> --move and --bind mounts without changing the format of /proc/mounts.
>
> I just looked at /proc/self/mountstats; it seems to have no more
> information than /proc/self/mounts, but in an even more annoying
> format. Either I'm missing something, or this file doesn't add
> anything at all.

The advantage is that it doesn't have strong user space dependencies on
its format like /proc/mounts does.

If you have NFS mount points, you will see that it includes a great deal
of additional information about each mount.
Re: Adding subroot information to /proc/mounts, or obtaining that through other means
H. Peter Anvin wrote:
> Chuck Lever wrote:
>> The advantage is that it doesn't have strong user space dependencies on its format like /proc/mounts does. If you have NFS mount points, you will see that it includes a great deal of additional information about each mount.
>
> OK, I see now:
>
> device raidtest:/export mounted on /net/raidtest/export with fstype nfs statvers=1.0
>     opts: rw,vers=3,rsize=131072,wsize=131072,acregmin=3,acregmax=60,acdirmin=30,acdirmax=60,hard,proto=tcp,timeo=600,retrans=2,sec=sys
>     age: 5
>     caps: caps=0x9,wtmult=4096,dtsize=4096,bsize=0,namelen=255
>     sec: flavor=1,pseudoflavor=1
>     events: 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>     bytes: 0 0 0 0 0 0 0 0
>     RPC iostats version: 1.0 p/v: 13/3 (nfs)
>     xprt: tcp 686 0 2 0 5 8 8 0 8 0
>     per-op statistics
>     NULL: 0 0 0 0 0 0 0 0
>     GETATTR: 2 2 0 264 224 1 0 1
>     SETATTR: 0 0 0 0 0 0 0 0
>     LOOKUP: 0 0 0 0 0 0 0 0
>     ACCESS: 1 1 0 116 120 0 0 0
>     READLINK: 0 0 0 0 0 0 0 0
>     READ: 0 0 0 0 0 0 0 0
>     WRITE: 0 0 0 0 0 0 0 0
>     CREATE: 0 0 0 0 0 0 0 0
>     MKDIR: 0 0 0 0 0 0 0 0
>     SYMLINK: 0 0 0 0 0 0 0 0
>     MKNOD: 0 0 0 0 0 0 0 0
>     REMOVE: 0 0 0 0 0 0 0 0
>     RMDIR: 0 0 0 0 0 0 0 0
>     RENAME: 0 0 0 0 0 0 0 0
>     LINK: 0 0 0 0 0 0 0 0
>     READDIR: 0 0 0 0 0 0 0 0
>     READDIRPLUS: 0 0 0 0 0 0 0 0
>     FSSTAT: 1 1 0 132 84 0 1 1
>     FSINFO: 1 1 0 132 80 0 0 0
>     PATHCONF: 0 0 0 0 0 0 0 0
>     COMMIT: 0 0 0 0 0 0 0 0
>
> This format is just awful for parsing. It's pretty clearly totally ad-hoc. It's not even self-consistent (it uses different separators, etc, in the same file!) It's reasonably compact for human consumption, but it doesn't show what the arrays mean.
>
> Heck, XML would have been better than this mess...

Sigh. So where were you when I asked for review time and again?

I have a couple of simple Python scripts that can parse this without any difficulty.

I resent your tone. Quite a bit.
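For readers wondering what "parse this without any difficulty" might look like, here is a minimal sketch. It is not one of the scripts mentioned above; the function name, the dictionary layout, and the abridged SAMPLE text are assumptions made for illustration, and the per-op columns are kept as plain integer lists because the thread does not say what each column means.

```python
import re

# Header line that starts each mount's section in mountstats-style text.
DEVICE_RE = re.compile(
    r"^device (?P<device>\S+) mounted on (?P<mountpoint>\S+) "
    r"with fstype (?P<fstype>\S+)(?: statvers=(?P<statvers>\S+))?$"
)

# Abridged copy of the output quoted in this message (assumption: trimmed).
SAMPLE = """\
device raidtest:/export mounted on /net/raidtest/export with fstype nfs statvers=1.0
    opts: rw,vers=3,rsize=131072,wsize=131072,hard,proto=tcp,timeo=600,retrans=2,sec=sys
    age: 5
    per-op statistics
    NULL: 0 0 0 0 0 0 0 0
    GETATTR: 2 2 0 264 224 1 0 1
    ACCESS: 1 1 0 116 120 0 0 0
"""


def parse_mountstats(text):
    """Return one dict per mount found in mountstats-style text."""
    mounts = []
    current = None
    in_per_op = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        header = DEVICE_RE.match(line)
        if header:
            # A new "device ..." line opens a new mount record.
            current = dict(header.groupdict(), fields={}, per_op={})
            mounts.append(current)
            in_per_op = False
        elif current is None:
            continue  # ignore anything before the first header
        elif line == "per-op statistics":
            in_per_op = True
        else:
            key, _, value = line.partition(":")
            if in_per_op:
                # Per-op rows are whitespace-separated integers.
                current["per_op"][key] = [int(n) for n in value.split()]
            else:
                # Other rows are kept as raw strings.
                current["fields"][key] = value.strip()
    return mounts


mount = parse_mountstats(SAMPLE)[0]
print(mount["device"], mount["per_op"]["GETATTR"])
```

On a live system the same function could be fed the contents of /proc/self/mountstats instead of SAMPLE.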
Re: request for patches: showing mount options
Miklos:

Some mount options are never passed to the kernel, and thus can't appear in /proc/mounts. Examples include user, users, and _netdev for NFS.

Miklos Szeredi wrote:
> [please consider pruning the CC list if discussing some aspect, which doesn't concern all]
>
> I've done an audit of all filesystems with regards to showing mount options in /proc/pid/mounts. Unfortunately most of them show none or only a part of all accepted options (for details see list of filesystems at the end of the mail).
>
> This is currently not a big problem, because mount(8) stores the given options in /etc/mtab. However we want to get rid of mtab, and this requires, that the option showing be fixed up.
>
> It would be easiest if this was done by the VFS instead of having to deal with it in filesystems. However there are differences in how filesystems handle options during mount and remount, and it would be impossible to take this into account in all cases.
>
> If you are CC-ed, and responsible for one of these filesystems, please take a moment to fully implement the ->show_options() method. In most cases it should be an easy task. If for some reason you are unable to do this, please let me know and I'll fix it up.
>
> Here are some guidelines for showing options. I'll also add these to Documentation/filesystems/vfs.txt
>
> + If a filesystem accepts mount options, it must define show_options()
> + to show all the currently active options. The rules are:
> +
> + - options MUST be shown which are not default or their values differ
> +   from the default
> +
> + - options MAY be shown which are enabled by default or have their
> +   default value
> +
> + Options used only internally between a mount helper and the kernel
> + (such as file descriptors), or which only have an effect during the
> + mounting (such as ones controlling the creation of a journal) are exempt
> + from the above rules.
> Thanks,
> Miklos
>
> ---
> legend:
>
>   all   - fs has options, but doesn't define ->show_options()
>   some  - fs defines ->show_options(), but some options are not shown
>   noopt - fs does not have options
>   good  - fs shows all options
>   patch - I have a patch
>
> 9p          some
> adfs        all (maintainer?)
> affs        all
> afs         all
> autofs      all
> autofs4     some
> befs        all
> bfs         noopt
> cifs        some (odd parser)
> coda        noopt
> configfs    noopt
> cramfs      noopt
> debugfs     noopt
> devpts      patch
> ecryptfs    some
> efs         noopt
> ext2        patch
> ext3        patch
> ext4        patch
> fat         some
> freevxfs    noopt
> fuse        patch
> gfs2        good
> hfs         good
> hfsplus     good
> hostfs      patch
> hpfs        all
> hppfs       noopt
> hugetlbfs   all
> isofs       all (maintainer?)
> jffs2       noopt
> jfs         some
> minix       noopt
> msdos       ->fat
> ncpfs       all (FS_BINARY_MOUNTDATA?)
> nfs         some
> nfsd        noopt
> ntfs        good (odd parser)
> ocfs2       all
> openpromfs  noopt
> proc        noopt
> qnx4        noopt
> ramfs       noopt
> reiserfs    all
> romfs       noopt
> smbfs       good (odd parser) (maintainer?)
> sysfs       noopt
> sysv        noopt
> udf         all
> ufs         all
> vfat        ->fat
> xfs         some (odd parser)
>
> mm/shmem.c                              patch
> drivers/oprofile/oprofilefs.c           noopt
> drivers/infiniband/hw/ipath/ipath_fs.c  noopt
> drivers/misc/ibmasm/ibmasmfs.c          noopt
> drivers/usb/core (usbfs)                noopt
> drivers/usb/gadget (gadgetfs)           noopt
> drivers/isdn/capi/capifs.c              noopt
> kernel/cpuset.c                         noopt
> fs/binfmt_misc.c                        noopt
> net/sunrpc/rpc_pipe.c                   noopt
> arch/powerpc/platforms/cell/spufs       all
> arch/s390/hypfs                         all
> ipc/mqueue.c                            noopt
> security (securityfs)                   noopt
> security/selinux/selinuxfs.c            noopt
>
> in -mm:
>
> reiser4             some (odd parser)
> kernel/container.c  good (odd parser)

- To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
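The MUST/MAY rule in the proposed guidelines can be illustrated with a small user-space sketch. This is not kernel code and not any filesystem's actual show_options() implementation; the function name, the dictionary representation of options, and the show_defaults switch are all assumptions chosen to make the rule concrete.

```python
def show_options(active, defaults, show_defaults=False):
    """Build the option string a show_options()-style method might emit.

    active   -- option name -> current value (True stands for a bare flag)
    defaults -- option name -> the filesystem's default value
    """
    parts = []
    for name, value in active.items():
        is_default = name in defaults and defaults[name] == value
        if is_default and not show_defaults:
            # Defaults MAY be shown; this sketch omits them unless asked.
            continue
        # Anything that differs from its default MUST be shown.
        parts.append(name if value is True else f"{name}={value}")
    return ",".join(parts)


# Hypothetical option set, loosely modelled on an NFS mount.
active = {"rw": True, "rsize": 131072, "acregmin": 3, "acregmax": 60, "hard": True}
defaults = {"acregmin": 3, "acregmax": 60}

print(show_options(active, defaults))
print(show_options(active, defaults, show_defaults=True))
```

The first call prints only the non-default options; the second also prints the ones that happen to equal their defaults, which the guidelines permit but do not require.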
Re: request for patches: showing mount options
Miklos Szeredi wrote:
>> Some mount options are never passed to the kernel, and thus can't appear in /proc/mounts. Examples include user, users, and _netdev for NFS.
>
> These options control *who* may mount and *when* to mount. They are not a property of the mount itself and are not added to /etc/mtab.
>
> There's a user=ID option that is added to /etc/mtab in case of user mounts. This identifies the owner of the mount, so that it can be unmounted by that user. There are patches in -mm that enable the kernel to store this info.
>
> Do you have other examples in mind?

[no]quota comes to mind; also auto, [no]owner, [no]group, and quiet/loud, but these may fall into the same category you mention above.

Aside: It's a confusing artifact of the mount CLI that these options control who/when but are passed to the mount command in the same way the other options are.
Re: Correct behavior on O_DIRECT sparse file writes
Florian Weimer wrote:
> * Andrew Morton:
>> I don't think it's a bug. Sure, O_DIRECT is synchronous, but that's because it is, err, direct. Not because it provides extra data-integrity guarantees. If you want those guarantees, use O_SYNC as well.
>
> This needs to be prominently documented. Right now, it's far from clear that you need both O_DIRECT and O_SYNC.

It's certainly not a requirement for NFS. O_DIRECT on NFS forces data to the server, which always updates a file's metadata on each write, including indirect blocks.
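As a rough illustration of the "use O_SYNC as well" advice quoted above, the sketch below opens a file with both flags and writes one aligned block. The helper name and the 4096-byte block size are assumptions (real code should query the device's alignment requirements), and the fallback to plain O_SYNC exists only so the example still runs on filesystems or platforms that reject O_DIRECT.

```python
import errno
import mmap
import os

BLOCK = 4096  # assumed alignment and transfer size, not queried from the device


def write_block_durably(path, payload):
    """Write one zero-padded block, preferring O_DIRECT | O_SYNC.

    Returns the open(2) flags that were actually used.
    """
    if len(payload) > BLOCK:
        raise ValueError("payload larger than one block")
    # An anonymous mmap is page-aligned, which O_DIRECT transfers need.
    buf = mmap.mmap(-1, BLOCK)
    buf.write(payload)  # the remainder of the block stays zero-filled
    base = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_SYNC
    direct = getattr(os, "O_DIRECT", 0)  # not defined on every platform
    try:
        for flags in (base | direct, base):
            try:
                fd = os.open(path, flags, 0o600)
            except OSError as exc:
                if exc.errno == errno.EINVAL and flags != base:
                    continue  # O_DIRECT rejected at open; retry without it
                raise
            try:
                os.write(fd, buf)
                return flags
            except OSError as exc:
                if exc.errno != errno.EINVAL or flags == base:
                    raise  # only an O_DIRECT-related EINVAL is retried
            finally:
                os.close(fd)
    finally:
        buf.close()
```

Whether the O_SYNC half is needed depends on the filesystem, as the reply about NFS points out; the sketch simply shows the conservative combination.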
Re: Beagle and logging inotify events
On Nov 13, 2007, at 7:04 PM, Jon Smirl wrote:
> Is it feasible to do something like this in the linux file system architecture?
>
> Beagle beats on my disk for an hour when I reboot. Of course I don't like that and I shut Beagle off.

Leopard, by the way, does exactly this: it has a daemon that starts at boot time and taps FSEvents then journals file system changes to a well-known file on local disk.

I don't see why this couldn't be done on Linux as well.

> -- Forwarded message --
> From: Jon Smirl [EMAIL PROTECTED]
> Date: Nov 13, 2007 4:44 PM
> Subject: Re: Strange beagle interaction..
> To: Linus Torvalds [EMAIL PROTECTED]
> Cc: J. Bruce Fields [EMAIL PROTECTED], Junio C Hamano [EMAIL PROTECTED], Git Mailing List [EMAIL PROTECTED], Johannes Schindelin [EMAIL PROTECTED]
>
> On 11/13/07, Linus Torvalds [EMAIL PROTECTED] wrote:
> On Tue, 13 Nov 2007, J. Bruce Fields wrote:
>
> Last I ran across this, I believe I found it was adding extended attributes to the file.
>
> Yeah, I just straced it and found the same thing. It's saving fingerprints and mtimes to files in the extended attributes.
>
> Things like Beagle need a guaranteed log of global inotify events. That would let them efficiently find changes made since the last time they updated their index.
>
> Right now every time Beagle starts it hasn't got a clue what has changed in the file system since it was last run. This forces Beagle to rescan the entire filesystem every time it is started. The xattrs are used as cache to reduce this load somewhat.
>
> A better solution would be for the kernel to log inotify events to disk in a manner that survives reboots. When Beagle starts it would locate its last checkpoint and then process the logged inotify events from that time forward. This inotify logging needs to be bullet proof or it will mess up your Beagle index.
>
> Logged files systems already contain the logged inotify data (in their own internal form). There's just no universal API for retrieving it in a file system independent manner.
>
> Yeah, I just turned off beagle. It looked to me like it was doing something wrongheaded.
>
> Gaah. The problem is, setting xattrs does actually change ctime. Which means that if we want to make git play nice with beagle, I guess we have to just remove the comparison of ctime.
>
> Oh, well. Git doesn't *require* it, but I like the notion of checking the inode really really carefully. But it looks like it may not be an option, because of file indexers hiding stuff behind our backs.
>
> Or we could just tell people not to run beagle on their git trees, but I suspect some people will actually *want* to. Even if it flushes their disk caches.
>
> Linus
>
> --
> Jon Smirl [EMAIL PROTECTED]

-- Chuck Lever chuck[dot]lever[at]oracle[dot]com
Re: Beagle and logging inotify events
Jon Smirl wrote:
> On 11/14/07, Chuck Lever [EMAIL PROTECTED] wrote:
>> On Nov 13, 2007, at 7:04 PM, Jon Smirl wrote:
>>> Is it feasible to do something like this in the linux file system architecture?
>>>
>>> Beagle beats on my disk for an hour when I reboot. Of course I don't like that and I shut Beagle off.
>>
>> Leopard, by the way, does exactly this: it has a daemon that starts at boot time and taps FSEvents then journals file system changes to a well-known file on local disk.
>
> Logging file systems have all of the needed info. Plus they know what is going on with rollback/replay after a crash.

True, but not all file systems have a journal. Consider ext2 or FAT32, both of which are still common.

> How about a fs API where Beagle has a token for a checkpoint, and then it can ask for a recreation of inotify events from that point forward. It's always possible for the file system to say I can't do that and trigger a full rebuild from Beagle.
>
> Daemons that aren't coordinated with the file system have a window during crash/reboot where they can get confused.

A reasonably effective solution can be implemented in user space without changes to the file system APIs or implementations. IOW we already have the tools to make something useful.

For example, you don't need to record every file system event to make this useful. Listing only directory-level changes (ie some file in this directory has changed) is enough to prune most of Beagle's work when it starts up.

> Without low level support like this Beagle is forced to do a rescan on every boot. Since I crash my machine all of the time the disk load from rebooting is intolerable and I turn Beagle off. Even just turning the machine on in the morning generates an annoyingly large load on the disk.

Understood. The need is clear. My Dad's WinXP system takes 10 minutes after every start-up before it's usable, simply because the virus scanner has to check every file in the system. Same problem!

>> I don't see why this couldn't be done on Linux as well.
-- Chuck Lever chuck[dot]lever[at]oracle[dot]com
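The directory-level approach described in this message could be sketched roughly as follows. The class and function names are hypothetical, the event source (for example a daemon watching the tree) is left out entirely, and the one-path-per-line log format is an assumption that ignores paths containing newlines.

```python
import os


class DirtyDirLog:
    """Append-only log of directories whose contents have changed."""

    def __init__(self, log_path):
        self.log_path = log_path

    def record(self, changed_path):
        # Log only the parent directory, not the individual event.
        parent = os.path.dirname(os.path.abspath(changed_path))
        with open(self.log_path, "a", encoding="utf-8") as log:
            log.write(parent + "\n")
            log.flush()
            os.fsync(log.fileno())  # make the entry survive a crash

    def dirty_dirs(self):
        """Return the set of logged directories, or None if no log exists."""
        try:
            with open(self.log_path, encoding="utf-8") as log:
                return {line.rstrip("\n") for line in log if line.strip()}
        except FileNotFoundError:
            return None


def dirs_to_rescan(log, all_dirs):
    """Decide what an indexer has to rescan when it starts up."""
    dirty = log.dirty_dirs()
    if dirty is None:
        return set(all_dirs)  # fallback: no log available, full rescan
    return dirty & set(all_dirs)
```

The point of the design is the fallback: a missing log costs a full rescan, which is the behavior the indexer already has today, while a present log lets it skip every directory that was never touched.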
Re: Beagle and logging inotify events
Jon Smirl wrote:
> On 11/14/07, Chuck Lever [EMAIL PROTECTED] wrote:
>> Jon Smirl wrote:
>>> On 11/14/07, Chuck Lever [EMAIL PROTECTED] wrote:
>>>> On Nov 13, 2007, at 7:04 PM, Jon Smirl wrote:
>>>>> Is it feasible to do something like this in the linux file system architecture?
>>>>>
>>>>> Beagle beats on my disk for an hour when I reboot. Of course I don't like that and I shut Beagle off.
>>>>
>>>> Leopard, by the way, does exactly this: it has a daemon that starts at boot time and taps FSEvents then journals file system changes to a well-known file on local disk.
>>>
>>> Logging file systems have all of the needed info. Plus they know what is going on with rollback/replay after a crash.
>>
>> True, but not all file systems have a journal. Consider ext2 or FAT32, both of which are still common.
>
> ext2/FAT32 can use the daemon approach you describe below which also works as a short term solution.
>
> The Beagle people do have a daemon but it can be turned off. Holes where you don't record the inotify events and update the index are really bad because they can make files that you know are on the disk disappear from the index.
>
> I don't believe Beagle distinguishes between someone turning it off for a day and then turning it back on, vs a reboot. In both cases it says there was a window where untracked changes could have happened and it triggers a full rescan.
>
> The root problem here is needing a bullet proof inotify log with no windows.

I disagree: we don't need a bullet-proof log. We can get a significant performance improvement even with a permanent dnotify log implemented in user-space. We already have well-defined fallback behavior if such a log is missing or incomplete.

The problem with a permanent inotify log is that it can become unmanageably enormous, and a performance problem to boot. Recording at that level of detail makes it more likely that the logger won't be able to keep up with file system activity.

A lightweight solution gets us most of the way there, is simple to implement, and doesn't introduce many new issues. As long as it can tell us precisely where the holes are, it shouldn't be a problem.

> The only place that is going to happen is inside the file system logs.

As Andi points out, existing block-based journaling implementations won't easily provide this. And most fs journals are actually pretty limited in size.

Alternately, you could insert a stackable file system layer between the VFS and the on-disk fs to provide more seamless information about updates.

> We just need an API to say recreate the inotify stream from this checkpoint forward. Things like FAT/ext2 will always return a no data available error from this API.

>>> How about a fs API where Beagle has a token for a checkpoint, and then it can ask for a recreation of inotify events from that point forward. It's always possible for the file system to say I can't do that and trigger a full rebuild from Beagle.
>>>
>>> Daemons that aren't coordinated with the file system have a window during crash/reboot where they can get confused.
>>
>> A reasonably effective solution can be implemented in user space without changes to the file system APIs or implementations. IOW we already have the tools to make something useful.
>>
>> For example, you don't need to record every file system event to make this useful. Listing only directory-level changes (ie some file in this directory has changed) is enough to prune most of Beagle's work when it starts up.
>>
>>> Without low level support like this Beagle is forced to do a rescan on every boot. Since I crash my machine all of the time the disk load from rebooting is intolerable and I turn Beagle off. Even just turning the machine on in the morning generates an annoyingly large load on the disk.
>>
>> Understood. The need is clear. My Dad's WinXP system takes 10 minutes after every start-up before it's usable, simply because the virus scanner has to check every file in the system. Same problem!
>>
>>>> I don't see why this couldn't be done on Linux as well.
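The checkpoint idea debated in this exchange (replay events from a token forward, and report a hole when that is not possible) can be modelled in a few lines. This is only an in-memory sketch under assumed names; no such kernel or file system API is described in the thread, and a real log would need to persist across reboots.

```python
from collections import deque


class EventLog:
    """Bounded event log addressed by checkpoint tokens."""

    def __init__(self, capacity):
        # A bounded deque silently drops the oldest entries when full,
        # which models a journal of limited size.
        self._events = deque(maxlen=capacity)
        self._next_seq = 0

    def append(self, event):
        self._events.append((self._next_seq, event))
        self._next_seq += 1

    def checkpoint(self):
        """Token meaning 'everything logged so far has been seen'."""
        return self._next_seq

    def events_since(self, token):
        """Replay events from `token` onward, or None if there is a hole."""
        oldest = self._events[0][0] if self._events else self._next_seq
        if token < oldest or token > self._next_seq:
            return None  # events were dropped: caller must do a full rescan
        return [event for seq, event in self._events if seq >= token]
```

Returning None rather than a partial list is the key choice: the consumer always knows whether it saw everything, so a small log degrades into a full rescan instead of a silently stale index.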
Re: [patch] VFS: extend /proc/mounts
On Jan 17, 2008, at 3:55 AM, Miklos Szeredi wrote:
> Hey, I just found /proc/X/mountstats. How does this fit in to the big picture? It seems to show some counters for NFS mounts, no other filesystem uses it. Format looks rather less nice, than /proc/X/mounts (why do we need long english sentences under /proc?).

I introduced /proc/self/mountstats because we need a way for non-block-device-based file systems to report I/O statistics. Everything else I tried was rejected, and apparently what we ended up with was reviewed by only a handful of people, so no one else likes it or uses it.

It can go away for all I care, as long as we retain some flexible mechanism for non-block-based file systems to report I/O stats. As far as I am aware, there are only two user utilities that understand and parse this data, and I maintain both.

-- Chuck Lever chuck[dot]lever[at]oracle[dot]com
Re: [PATCH 0/3] enhanced ESTALE error handling
Hi Peter-

On Jan 18, 2008, at 10:35 AM, Peter Staubach wrote:
> Hi.
>
> Here is a patch set which modifies the system to enhance the ESTALE error handling for system calls which take pathnames as arguments.

The VFS already handles ESTALE.

If a pathname resolution encounters an ESTALE at any point, the resolution is restarted exactly once, and an additional flag is passed to the file system during each lookup that forces each component in the path to be revalidated on the server. This has no possibility of causing an infinite loop.

Is there some part of this logic that is no longer working?

-- Chuck Lever chuck[dot]lever[at]oracle[dot]com
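The two-try behavior described above can be modelled in user space as follows. This is a sketch of the control flow only, not the kernel's path walk: the resolve() function and the lookup(parent, name, revalidate) callback are hypothetical stand-ins for the VFS and the file system's lookup method.

```python
import errno


def resolve(path, lookup, max_tries=2):
    """Walk `path` component by component, restarting once on ESTALE.

    `lookup(parent, name, revalidate)` returns a handle for `name` inside
    `parent`, or raises OSError.
    """
    components = [c for c in path.split("/") if c]
    for attempt in range(max_tries):
        # Only the retry pass asks the file system to revalidate each
        # component against the server.
        revalidate = attempt > 0
        handle = "/"
        try:
            for name in components:
                handle = lookup(handle, name, revalidate)
            return handle
        except OSError as exc:
            if exc.errno != errno.ESTALE or attempt == max_tries - 1:
                raise  # other errors, or ESTALE on the final pass: give up


def flaky_lookup(parent, name, revalidate):
    """Toy lookup: component 'b' is stale until it is revalidated."""
    if name == "b" and not revalidate:
        raise OSError(errno.ESTALE, "stale file handle")
    return parent.rstrip("/") + "/" + name


print(resolve("/a/b/c", flaky_lookup))
```

Because the number of passes is fixed, the loop always terminates, which is the property the message above is pointing at.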
Re: [PATCH 0/3] enhanced ESTALE error handling
On Jan 18, 2008, at 11:55 AM, Peter Staubach wrote:
> Chuck Lever wrote:
>> Hi Peter-
>>
>> On Jan 18, 2008, at 10:35 AM, Peter Staubach wrote:
>>> Hi.
>>>
>>> Here is a patch set which modifies the system to enhance the ESTALE error handling for system calls which take pathnames as arguments.
>>
>> The VFS already handles ESTALE.
>>
>> If a pathname resolution encounters an ESTALE at any point, the resolution is restarted exactly once, and an additional flag is passed to the file system during each lookup that forces each component in the path to be revalidated on the server. This has no possibility of causing an infinite loop.
>>
>> Is there some part of this logic that is no longer working?
>
> The VFS does not fully handle ESTALE. An ESTALE error can occur during the second pathname resolution attempt.

If an ESTALE occurs during the second resolution attempt, we should give up. When I addressed this issue two years ago, the two-try logic was the only acceptable solution because there's no way to guarantee the pathname resolution will ever finish unless we put a hard limit on it.

> There are lots of reasons, some of which are the 1 second resolution from some file systems on the server

Which is a server bug, AFAICS. It's simply impossible to close all the windows that result from sloppy file time stamps without completely disabling client-side caching. The NFS protocol relies on file time stamps to manage cache coherence. If the server is lying about time stamps, there's no way the client can cache coherently.

> and the window in between the revalidation and the actual use of the file handle associated with each dentry/inode pair.

A use case or two would be useful to explore (on linux-nfs or linux-fsdevel, rather than lkml).

> Also, there was no support for ESTALE errors which occur during subsequent operations to the pathname resolution process. For example, during a mkdir(2) operation, the ESTALE can occur from the over the wire MKDIR operation after the LOOKUP operations have all succeeded.

If the final operation fails after a pathname resolution, then it's a real error. Is there a fixed and valid recovery script for the client in this case that will allow the mkdir to proceed?

Admittedly, the NFS client could recover more cleanly from some of these problems, but given the architecture of the Linux VFS, it will be difficult to address some of the corner cases.

-- Chuck Lever chuck[dot]lever[at]oracle[dot]com
Re: [PATCH 0/3] enhanced ESTALE error handling
On Jan 18, 2008, at 12:30 PM, Peter Staubach wrote:
> Chuck Lever wrote:
>> On Jan 18, 2008, at 11:55 AM, Peter Staubach wrote:
>>> Chuck Lever wrote:
>>>> Hi Peter-
>>>>
>>>> On Jan 18, 2008, at 10:35 AM, Peter Staubach wrote:
>>>>> Hi.
>>>>>
>>>>> Here is a patch set which modifies the system to enhance the ESTALE error handling for system calls which take pathnames as arguments.
>>>>
>>>> The VFS already handles ESTALE.
>>>>
>>>> If a pathname resolution encounters an ESTALE at any point, the resolution is restarted exactly once, and an additional flag is passed to the file system during each lookup that forces each component in the path to be revalidated on the server. This has no possibility of causing an infinite loop.
>>>>
>>>> Is there some part of this logic that is no longer working?
>>>
>>> The VFS does not fully handle ESTALE. An ESTALE error can occur during the second pathname resolution attempt.
>>
>> If an ESTALE occurs during the second resolution attempt, we should give up. When I addressed this issue two years ago, the two-try logic was the only acceptable solution because there's no way to guarantee the pathname resolution will ever finish unless we put a hard limit on it.
>
> I can probably imagine a situation where the pathname resolution would never finish, but I am not sure that it could ever happen in nature.

Unless someone is doing something malicious. Or if the server is repeatedly returning ESTALE for some reason.

>>> There are lots of reasons, some of which are the 1 second resolution from some file systems on the server
>>
>> Which is a server bug, AFAICS. It's simply impossible to close all the windows that result from sloppy file time stamps without completely disabling client-side caching. The NFS protocol relies on file time stamps to manage cache coherence. If the server is lying about time stamps, there's no way the client can cache coherently.
>
> Server bug or not, it is something that the client has to live with. We can't get the server file system fixed, so it is something that we should find a way to live with. This support can help.

We haven't identified a server-side solution yet, but that doesn't mean it doesn't exist.

If we address the time stamp problem in the client, should we also go to lengths to address it in every other corner of the NFS client? Should we also address every other server bug we discover with a client side fix?

>>> Also, there was no support for ESTALE errors which occur during subsequent operations to the pathname resolution process. For example, during a mkdir(2) operation, the ESTALE can occur from the over the wire MKDIR operation after the LOOKUP operations have all succeeded.
>>
>> If the final operation fails after a pathname resolution, then it's a real error. Is there a fixed and valid recovery script for the client in this case that will allow the mkdir to proceed?
>
> Why do you think that it is an error?

Because this is a problem that sometimes requires application-level recovery. Can we guarantee that retrying the mkdir is the right thing to do every time?

> It can easily occur if the directory in which the new directory is to be created disappears after it is looked up and before the MKDIR is issued. The recovery is to perform the lookup again.

Have you tried this client against a file server when you unexport the filesystem under test? The server returns ESTALE no matter what the client does. Should the client continue to retry the request if the file system has been permanently taken offline?

>> Admittedly, the NFS client could recover more cleanly from some of these problems, but given the architecture of the Linux VFS, it will be difficult to address some of the corner cases.
>
> Could you outline some of these corner cases that this proposal would not address, please?

I think we have one right here: should the client retry a mkdir if it gets an ESTALE?
-- Chuck Lever chuck[dot]lever[at]oracle[dot]com
Re: [patch 21/26] mount options: partially fix nfs
Hi Miklos-

Miklos Szeredi wrote:
> From: Miklos Szeredi [EMAIL PROTECTED]
>
> Add posix, bsize=, namelen= options to /proc/mounts for nfs filesystems. Document several other options that are still missing.

NFS lists only some options in /proc/mounts on purpose: only the essential options are mentioned there to keep clutter down. The three you've added here are for all intents and purposes deprecated, which is why they are not supported.

NFS lists a more complete set of mount options for a mount point in /proc/self/mountstats. See nfs_show_stats().

Since your cover letter does not explain why you are changing this code, can you refer me to a description of why you are doing this?

More below.

> Signed-off-by: Miklos Szeredi [EMAIL PROTECTED]
> ---
>
> Index: linux/fs/nfs/super.c
> ===
> --- linux.orig/fs/nfs/super.c 2008-01-19 11:56:34.0 +0100
> +++ linux/fs/nfs/super.c 2008-01-21 20:41:30.0 +0100
> @@ -449,6 +449,7 @@ static void nfs_show_mount_options(struc
>   } nfs_info[] = {
>       { NFS_MOUNT_SOFT, ",soft", ",hard" },
>       { NFS_MOUNT_INTR, ",intr", ",nointr" },
> +     { NFS_MOUNT_POSIX, ",posix", "" },
>       { NFS_MOUNT_NOCTO, ",nocto", "" },
>       { NFS_MOUNT_NOAC, ",noac", "" },
>       { NFS_MOUNT_NONLM, ",nolock", "" },
> @@ -459,10 +460,17 @@ static void nfs_show_mount_options(struc
>   };
>   const struct proc_nfs_info *nfs_infop;
>   struct nfs_client *clp = nfss->nfs_client;
> + unsigned int default_namelen =
> +     clp->rpc_ops->version == 4 ? NFS4_MAXNAMLEN :
> +     clp->rpc_ops->version == 3 ? NFS3_MAXNAMLEN : NFS2_MAXNAMLEN;
>
>   seq_printf(m, ",vers=%d", clp->rpc_ops->version);
>   seq_printf(m, ",rsize=%d", nfss->rsize);
>   seq_printf(m, ",wsize=%d", nfss->wsize);
> + if (nfss->bsize != 0)
> +     seq_printf(m, ",bsize=%d", nfss->bsize);
> + if (nfss->namelen != default_namelen)
> +     seq_printf(m, ",namelen=%d", nfss->namelen);
>   if (nfss->acregmin != 3*HZ || showdefaults)
>       seq_printf(m, ",acregmin=%d", nfss->acregmin/HZ);
>   if (nfss->acregmax != 60*HZ || showdefaults)
> @@ -482,6 +490,18 @@ static void nfs_show_mount_options(struc
>   seq_printf(m, ",timeo=%lu", 10U * nfss->client->cl_timeout->to_initval / HZ);
>   seq_printf(m, ",retrans=%u", nfss->client->cl_timeout->to_retries);
>   seq_printf(m, ",sec=%s", nfs_pseudoflavour_to_name(nfss->client->cl_auth->au_flavor));
> +
> + /*
> +  * Missing options:
> +  * port=

Probably should be supported.

> +  * addr=

This one is already supported; see nfs_show_options().

> +  * clientaddr=

This one isn't, and should be... would be useful for tracking down certain NFSv4 problems.

> +  * mounthost=
> +  * mountaddr=
> +  * mountport=
> +  * mountvers=
> +  * mountproto=

And these mount* options are for the kernel's new mount protocol client. They aren't really useful for understanding steady-state NFS client behavior; they only affect mount-time behavior.
Re: [PATCH 24/27] NFS: Use local caching [try #2]
Some comments below.

This patch really ought to be broken into more manageable atomic changes to make it easier to review, and to provide more fine-grained explanation and rationalization for each specific change via individual patch descriptions.

David Howells wrote:
> The attached patch makes it possible for the NFS filesystem to make use of the network filesystem local caching service (FS-Cache).
>
> To be able to use this, an updated mount program is required. This can be obtained from:
>
>	http://people.redhat.com/steved/fscache/util-linux/

This should no longer be necessary. The latest mount.nfs subcommand from nfs-utils supports text-based mounts when running on kernels 2.6.23 and later.

> To mount an NFS filesystem to use caching, add an fsc option to the mount:
>
>	mount warthog:/ /a -o fsc

I hope you intend to provide updates to nfs(5) that describe the new mount options you introduce in this and later patches.

You don't mention it, but I assume that nofsc is the default behavior.

> Signed-off-by: David Howells [EMAIL PROTECTED]
> ---
>
>  fs/nfs/Makefile           |    1
>  fs/nfs/client.c           |    5
>  fs/nfs/file.c             |   37
>  fs/nfs/fscache-def.c      |  289
>  fs/nfs/fscache.c          |  391
>  fs/nfs/fscache.h          |  148
>  fs/nfs/inode.c            |   47
>  fs/nfs/read.c             |   28
>  fs/nfs/super.c            |    3
>  fs/nfs/sysctl.c           |    1
>  include/linux/nfs_fs.h    |    9
>  include/linux/nfs_fs_sb.h |   18
>  12 files changed, 968 insertions(+), 9 deletions(-)
>  create mode 100644 fs/nfs/fscache-def.c
>  create mode 100644 fs/nfs/fscache.c
>  create mode 100644 fs/nfs/fscache.h
>
> diff --git a/fs/nfs/Makefile b/fs/nfs/Makefile
> index df0f41e..073d04c 100644
> --- a/fs/nfs/Makefile
> +++ b/fs/nfs/Makefile
> @@ -16,3 +16,4 @@ nfs-$(CONFIG_NFS_V4) += nfs4proc.o nfs4xdr.o nfs4state.o nfs4renewd.o \
>  			   nfs4namespace.o
>  nfs-$(CONFIG_NFS_DIRECTIO) += direct.o
>  nfs-$(CONFIG_SYSCTL) += sysctl.o
> +nfs-$(CONFIG_NFS_FSCACHE) += fscache.o fscache-def.o
> diff --git a/fs/nfs/client.c b/fs/nfs/client.c
> index a6f6254..bcdc5d0 100644
> --- a/fs/nfs/client.c
> +++ b/fs/nfs/client.c
> @@ -43,6 +43,7 @@
>  #include "delegation.h"
>  #include "iostat.h"
>  #include "internal.h"
> +#include "fscache.h"
>
>  #define NFSDBG_FACILITY		NFSDBG_CLIENT
>
> @@ -139,6 +140,8 @@ static struct nfs_client *nfs_alloc_client(const char *hostname,
>  	clp->cl_state = 1 << NFS4CLNT_LEASE_EXPIRED;
>  #endif
>
> +	nfs_fscache_get_client_cookie(clp);
> +
>  	return clp;
>
>  error_3:
> @@ -170,6 +173,8 @@ static void nfs_free_client(struct nfs_client *clp)
>
>  	nfs4_shutdown_client(clp);
>
> +	nfs_fscache_release_client_cookie(clp);
> +
>  	/* -EIO all pending I/O */
>  	if (!IS_ERR(clp->cl_rpcclient))
>  		rpc_shutdown_client(clp->cl_rpcclient);
> diff --git a/fs/nfs/file.c b/fs/nfs/file.c
> index b3bb89f..d492cd7 100644
> --- a/fs/nfs/file.c
> +++ b/fs/nfs/file.c
> @@ -35,6 +35,7 @@
>  #include "delegation.h"
>  #include "internal.h"
>  #include "iostat.h"
> +#include "fscache.h"
>
>  #define NFSDBG_FACILITY		NFSDBG_FILE
>
> @@ -352,22 +353,48 @@ static int nfs_write_end(struct file *file, struct address_space *mapping,
>  	return status < 0 ? status : copied;
>  }
>
> +/*
> + * Partially or wholly invalidate a page
> + * - Release the private state associated with a page if undergoing complete
> + *   page invalidation
> + * - Called if either PG_private or PG_fscache set on the page
> + * - Caller holds page lock
> + */

Add comments like this in a separate clean up patch.

>  static void nfs_invalidate_page(struct page *page, unsigned long offset)
>  {
>  	if (offset != 0)
>  		return;
>  	/* Cancel any unstarted writes on this page */
>  	nfs_wb_page_cancel(page->mapping->host, page);
> +
> +	nfs_fscache_invalidate_page(page, page->mapping->host);
>  }
>
> +/*
> + * Release the private state associated with a page
> + * - Called if either PG_private or PG_fscache set on the page
> + * - Caller holds page lock
> + * - Return true (may release) or false (may not)
> + */
>  static int nfs_release_page(struct page *page, gfp_t gfp)
>  {
>  	/* If PagePrivate() is set, then the page is not freeable */
> -	return 0;
> +	if (PagePrivate(page))
> +		return 0;
> +	return nfs_fscache_release_page(page, gfp);
>  }
>
> +/*
> + * Attempt to clear the private state associated with a page when an error
> + * occurs that requires the cached contents of an inode to be written back or
> + * destroyed
> + * - Called if either PG_private or PG_fscache set on the page
> + * - Caller holds page lock
> + * - Return 0 if successful, -error otherwise
> + */
>  static int nfs_launder_page(struct page *page)
>  {
> +	wait_on_page_fscache_write(page);
>  	return nfs_wb_page(page->mapping->host, page);
>  }
>
> @@ -387,6 +414,11 @@ const struct address_space_operations nfs_file_aops = {
>  	.launder_page =
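The nfs_release_page() change quoted above reads as a two-step decision: a page with private (write-back) state is never freeable, and otherwise the cache gets the final say. A minimal user-space sketch of that decision follows. Everything in it is an assumption made for illustration: the demo_* names, the flag values, and in particular the rule that a page still being written to the cache may not be released, since the body of nfs_fscache_release_page() is not shown in this message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the PG_private and PG_fscache page flags. */
enum { DEMO_PG_PRIVATE = 1 << 0, DEMO_PG_FSCACHE = 1 << 1 };

struct demo_page {
	unsigned int flags;
	bool cache_write_in_flight;	/* assumed cache-side state */
};

/* Assumed cache policy: refuse while the page is being written to the cache. */
static bool demo_cache_release_page(struct demo_page *page)
{
	if (!(page->flags & DEMO_PG_FSCACHE))
		return true;		/* cache holds nothing for this page */
	if (page->cache_write_in_flight)
		return false;		/* cache is still using the page */
	page->flags &= ~(unsigned int)DEMO_PG_FSCACHE;
	return true;
}

/* Return true if the page may be released, false if it may not. */
bool demo_release_page(struct demo_page *page)
{
	assert(page != NULL);
	/* Private state set: the page is not freeable. */
	if (page->flags & DEMO_PG_PRIVATE)
		return false;
	return demo_cache_release_page(page);
}
```

Note the change in meaning this implies for the real function: before the patch it always returned 0, after it the result depends on the cache.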
Re: [patch 21/26] mount options: partially fix nfs
On Jan 25, 2008, at 4:39 AM, Miklos Szeredi wrote:
>> Miklos Szeredi wrote:
>>> From: Miklos Szeredi [EMAIL PROTECTED]
>>>
>>> Add posix, bsize=, namelen= options to /proc/mounts for nfs filesystems. Document several other options that are still missing.
>>
>> NFS lists only some options in /proc/mounts on purpose: only the essential options are mentioned there to keep clutter down. The three you've added here are for all intents and purposes deprecated, which is why they are not supported.
>>
>> NFS lists a more complete set of mount options for a mount point in /proc/self/mountstats. See nfs_show_stats().
>>
>> Since your cover letter does not explain why you are changing this code, can you refer me to a description of why you are doing this?
>
> Description is in the 01/26 patch.
>
>> More below.
>>
>>> Signed-off-by: Miklos Szeredi [EMAIL PROTECTED]
>>> ---
>>>
>>> Index: linux/fs/nfs/super.c
>>> ===
>>> --- linux.orig/fs/nfs/super.c 2008-01-19 11:56:34.0 +0100
>>> +++ linux/fs/nfs/super.c 2008-01-21 20:41:30.0 +0100
>>> @@ -449,6 +449,7 @@ static void nfs_show_mount_options(struc
>>>  	} nfs_info[] = {
>>>  		{ NFS_MOUNT_SOFT, ",soft", ",hard" },
>>>  		{ NFS_MOUNT_INTR, ",intr", ",nointr" },
>>> +		{ NFS_MOUNT_POSIX, ",posix", "" },
>>>  		{ NFS_MOUNT_NOCTO, ",nocto", "" },
>>>  		{ NFS_MOUNT_NOAC, ",noac", "" },
>>>  		{ NFS_MOUNT_NONLM, ",nolock", "" },
>>> @@ -459,10 +460,17 @@ static void nfs_show_mount_options(struc
>>>  	};
>>>  	const struct proc_nfs_info *nfs_infop;
>>>  	struct nfs_client *clp = nfss->nfs_client;
>>> +	unsigned int default_namelen =
>>> +		clp->rpc_ops->version == 4 ? NFS4_MAXNAMLEN :
>>> +		clp->rpc_ops->version == 3 ? NFS3_MAXNAMLEN : NFS2_MAXNAMLEN;
>>>
>>>  	seq_printf(m, ",vers=%d", clp->rpc_ops->version);
>>>  	seq_printf(m, ",rsize=%d", nfss->rsize);
>>>  	seq_printf(m, ",wsize=%d", nfss->wsize);
>>> +	if (nfss->bsize != 0)
>>> +		seq_printf(m, ",bsize=%d", nfss->bsize);
>>> +	if (nfss->namelen != default_namelen)
>>> +		seq_printf(m, ",namelen=%d", nfss->namelen);
>>>  	if (nfss->acregmin != 3*HZ || showdefaults)
>>>  		seq_printf(m, ",acregmin=%d", nfss->acregmin/HZ);
>>>  	if (nfss->acregmax != 60*HZ || showdefaults)
>>> @@ -482,6 +490,18 @@ static void nfs_show_mount_options(struc
>>>  	seq_printf(m, ",timeo=%lu", 10U * nfss->client->cl_timeout->to_initval / HZ);
>>>  	seq_printf(m, ",retrans=%u", nfss->client->cl_timeout->to_retries);
>>>  	seq_printf(m, ",sec=%s", nfs_pseudoflavour_to_name(nfss->client->cl_auth->au_flavor));
>>> +
>>> +	/*
>>> +	 * Missing options:
>>> +	 * port=
>>
>> Probably should be supported.
>>
>>> +	 * addr=
>>
>> This one is already supported; see nfs_show_options().
>
> Right, thanks.
>
>>> +	 * clientaddr=
>>
>> This one isn't, and should be... would be useful for tracking down certain NFSv4 problems.
>>
>>> +	 * mounthost=
>>> +	 * mountaddr=
>>> +	 * mountport=
>>> +	 * mountvers=
>>> +	 * mountproto=
>>
>> And these mount* options are for the kernel's new mount protocol client. They aren't really useful for understanding steady-state NFS client behavior; they only affect mount-time behavior.
>
> All mount options that are needed to reconstruct a previous mount should be shown.

Ah, OK. I'm happy to implement logic to display all the missing options. I should have updated nfs_show_mount_options() when I wrote the NFS mount option parser. Let me know your preference.

> For example, if you copy options out from /proc/mounts, umount the filesystem, and then create a new mount with the copied options, you should get the same mount.

For NFS, umount also needs to read some of the options in order to determine how mountd is to connect to the server for the unmount. (That's why we have addr= in the first place).
--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
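Both arguments in this exchange (reconstructing a mount from the displayed options, and umount needing addr=) come down to a tool reading individual options back out of the comma-separated string shown in /proc/mounts. A rough sketch of such a lookup is below; demo_get_option() is a hypothetical helper written for this note, not code from nfs-utils or umount.

```c
#include <assert.h>
#include <string.h>

/*
 * Copy the value of "key=" from a comma-separated option string into val.
 * Only whole option names match, so looking up "addr" does not match
 * "mountaddr=...". Returns 1 if found, 0 if absent or if val is too small.
 */
int demo_get_option(const char *opts, const char *key, char *val, size_t size)
{
	size_t klen;
	const char *p;

	assert(opts != NULL && key != NULL && val != NULL);
	klen = strlen(key);
	p = opts;
	while (*p) {
		const char *end = strchr(p, ',');
		size_t len = end ? (size_t)(end - p) : strlen(p);

		if (len > klen && strncmp(p, key, klen) == 0 && p[klen] == '=') {
			size_t vlen = len - klen - 1;

			if (vlen >= size)
				return 0;
			memcpy(val, p + klen + 1, vlen);
			val[vlen] = '\0';
			return 1;
		}
		p += len;		/* skip this option ... */
		if (*p == ',')
			p++;		/* ... and its separator */
	}
	return 0;
}
```

If an option such as port= is never displayed, a lookup like this simply fails, which is the practical consequence of the "Missing options" list in the patch.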
Re: [patch 21/26] mount options: partially fix nfs
On Jan 28, 2008, at 6:34 AM, Miklos Szeredi wrote:
>>> All mount options that are needed to reconstruct a previous mount should be shown.
>>
>> Ah, OK. I'm happy to implement logic to display all the missing options. I should have updated nfs_show_mount_options() when I wrote the NFS mount option parser. Let me know your preference.
>
> You are more familiar with NFS, so I think it would be better if you updated nfs_show_mount_options(). Could you also queue my patch (updated) or incorporate it into a combined fix?

Yes. I'll have time in a day or two to get this finished.

> Thanks,
> Miklos
>
> Subject: mount options: partially fix nfs
>
> From: Miklos Szeredi [EMAIL PROTECTED]
>
> Add posix, bsize=, namelen= options to /proc/mounts for nfs filesystems. Document several other options that are still missing.
>
> Changes:
>  - display namelen= unconditionally
>  - addr= isn't missing after all
>
> Signed-off-by: Miklos Szeredi [EMAIL PROTECTED]
> CC: Trond Myklebust [EMAIL PROTECTED]
> ---
>
> Index: linux/fs/nfs/super.c
> ===
> --- linux.orig/fs/nfs/super.c 2008-01-25 15:44:56.0 +0100
> +++ linux/fs/nfs/super.c 2008-01-25 15:57:32.0 +0100
> @@ -449,6 +449,7 @@ static void nfs_show_mount_options(struc
>  	} nfs_info[] = {
>  		{ NFS_MOUNT_SOFT, ",soft", ",hard" },
>  		{ NFS_MOUNT_INTR, ",intr", ",nointr" },
> +		{ NFS_MOUNT_POSIX, ",posix", "" },
>  		{ NFS_MOUNT_NOCTO, ",nocto", "" },
>  		{ NFS_MOUNT_NOAC, ",noac", "" },
>  		{ NFS_MOUNT_NONLM, ",nolock", "" },
> @@ -463,6 +464,9 @@ static void nfs_show_mount_options(struc
>  	seq_printf(m, ",vers=%d", clp->rpc_ops->version);
>  	seq_printf(m, ",rsize=%d", nfss->rsize);
>  	seq_printf(m, ",wsize=%d", nfss->wsize);
> +	seq_printf(m, ",namelen=%d", nfss->namelen);
> +	if (nfss->bsize != 0)
> +		seq_printf(m, ",bsize=%d", nfss->bsize);
>  	if (nfss->acregmin != 3*HZ || showdefaults)
>  		seq_printf(m, ",acregmin=%d", nfss->acregmin/HZ);
>  	if (nfss->acregmax != 60*HZ || showdefaults)
> @@ -482,6 +486,17 @@ static void nfs_show_mount_options(struc
>  	seq_printf(m, ",timeo=%lu", 10U * nfss->client->cl_timeout->to_initval / HZ);
>  	seq_printf(m, ",retrans=%u", nfss->client->cl_timeout->to_retries);
>  	seq_printf(m, ",sec=%s", nfs_pseudoflavour_to_name(nfss->client->cl_auth->au_flavor));
> +
> +	/*
> +	 * Missing options:
> +	 * port=
> +	 * mountport=
> +	 * mountvers=
> +	 * mountproto=
> +	 * clientaddr=
> +	 * mounthost=
> +	 * mountaddr=
> +	 */
>  }
>
>  /*

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com
Re: [PATCH 24/27] NFS: Use local caching [try #2]
Hi David-

On Jan 29, 2008, at 10:25 PM, David Howells wrote:
> Chuck Lever [EMAIL PROTECTED] wrote:
>> This patch really ought to be broken into more manageable atomic changes to make it easier to review, and to provide more fine-grained explanation and rationalization for each specific change via individual patch descriptions.
>
> Hmmm... I broke the patch up as Trond stipulated - at least, I thought I had.
>
> In many ways this request doesn't make sense. You can't do NFS caching without all the appropriate bits, so logically they should be one patch. Breaking it up won't help git-bisect since the option to enable all this is the last (or nearly last) patch.

In addition to adding a new feature, you are changing existing code. If any one of the changes you made breaks existing behavior, having them all in small atomic patches makes it practical to bisect and find the problem.

It also makes the patch far easier to review for people who are not so familiar with your fscache implementation. And smaller patches mean the ratio of patch descriptions to code changes can be much higher.

It does make sense to introduce the files under fs/fsc in a single patch. But when you are changing code that is already being used, more care needs to be taken.

>> This should no longer be necessary. The latest mount.nfs subcommand from nfs-utils supports text-based mounts when running on kernels 2.6.23 and later.
>
> Okay. I'll update my patches to reflect this. Note, however, I've got someone reporting a bug that seems to show otherwise. I'll have to investigate this more next week.

The very latest version (post 1.1.1) is required today for text-based NFS mounts. (That is, the bleeding edge version you get by cloning the nfs-utils git repo). And it only works on kernels later than 2.6.22 -- if that particular user is testing fscache on 2.6.22 or older, then only the legacy binary NFS mount system call API is supported.

>> Add comments like this in a separate clean up patch.
>>
>>> +/*
>>> + * Notification that a PTE pointing to an NFS page is about to be made
>>> + * writable, implying that someone is about to modify the page through a
>>> + * shared-writable mapping
>>> + */
>>
>> What does that have to do with local disk caching?
>>
>>> +struct nfs_fh_auxdata {
>>> +	struct timespec	i_mtime;
>>> +	struct timespec	i_ctime;
>>> +	loff_t		i_size;
>>> +};
>>
>> It might be useful to explain here why you need to supplement the mtime, ctime, and size fields that already exist in an NFS inode.
>
> Supplement? I don't understand.

Why is it necessary to add additional mtime, ctime and size fields for NFS inodes? Similar metadata is already stored in nfsi. All I'm asking for is some documentation of what these fields do that the existing time stamps and size fields in nfsi don't. Explain why the NFS fsc implementation needs this data structure.

>>> +	key->port = clp->cl_addr.sin_port;
>>
>> Not sure why you are using the server's port here. In almost every case the server side port number will be 2049, so it really doesn't add any uniquification.
>
> The reason lies in "almost every case". It's possible to configure it such that a server is running two separate NFS servers on different ports.

We should explore whether it is typical or even possible that such a configuration exports the same file handles on different ports, and whether that really matters to the client.

>> I strongly recommend you use the existing IPv6 address conversion macros for this instead of open-coding yet another way of mapping an IPv4 address to an IPv6 address.
>>
>> However, since AF_INET6 support is being introduced in the NFS client in 2.6.24, I recommend you take a look at these source files after Trond has pushed his NFS_ALL for 2.6.24.
>
> I'll look at them.

I always do this: I meant 2.6.25, not 2.6.24. By the time you return, basic IPv6 support for NFSv4 should be in 2.6.25-rc1's NFS client (not server). Not that it is bug-free, but an implementation is now there.
--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com
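To make the port question concrete, here is a user-space sketch of a server key that stores IPv4 peers as IPv4-mapped IPv6 addresses (::ffff:a.b.c.d) and also includes the port. It is a guess at the general shape only: struct demo_server_key and the demo_* functions are invented for this note and are not the layout used in the patch. The mapping is written out by hand purely to show what it does; inside the kernel the existing conversion helpers should be used instead, as recommended above.

```c
#include <assert.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical cache index key; not the structure from the patch. */
struct demo_server_key {
	struct in6_addr addr;	/* IPv4 peers stored as ::ffff:a.b.c.d */
	uint16_t port;		/* network byte order */
};

/* Build an IPv4-mapped IPv6 address from an IPv4 address. */
static void demo_v4_to_mapped(const struct in_addr *v4, struct in6_addr *v6)
{
	memset(v6, 0, sizeof(*v6));
	v6->s6_addr[10] = 0xff;
	v6->s6_addr[11] = 0xff;
	memcpy(&v6->s6_addr[12], &v4->s_addr, sizeof(v4->s_addr));
}

/* Fill in a key from an IPv4 socket address. */
void demo_make_key(struct demo_server_key *key, const struct sockaddr_in *sin)
{
	assert(key != NULL && sin != NULL && sin->sin_family == AF_INET);
	memset(key, 0, sizeof(*key));	/* zero padding so memcmp() is safe */
	demo_v4_to_mapped(&sin->sin_addr, &key->addr);
	key->port = sin->sin_port;
}

/* Two keys name the same server only if address and port both match. */
int demo_key_equal(const struct demo_server_key *a,
		   const struct demo_server_key *b)
{
	return memcmp(a, b, sizeof(*a)) == 0;
}
```

With the port in the key, two NFS servers on one host (say ports 2049 and 2050) get separate cache entries; without it they would share one, which only matters if they can hand out colliding file handles. That is the open question in the exchange above.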