Re: [patch] VFS: extend /proc/mounts
> > The alternative (and completely safe) solution is to add another
> > file to proc. Me no likey.
>
> Since we need saner layout, I would strongly suggest exactly that.

I don't think there's all that much wrong with the current layout,
except the two dummy zeroes at the end. Or does something else need
fixing in there?

> > major:minor -- is the major:minor number of the device hosting the
> > filesystem
>
> Bad description. Value of st_dev for files on that filesystem,
> please - there might be no such thing as "the device hosting the
> filesystem" _and_ the value here may bloody well be unrelated to the
> device actually holding all the data (for things like ext2meta, etc.).

Right.

> > 1) The mount is a shared mount.
> > 2) It is a peer of the mount with id 20.
> > 3) It is also a slave of the master mount with id 19.
> > 4) The filesystem on the device with major/minor number 98:0, with
> >    subdirectory mnt/1/abc, makes up the root directory of this mount.
> > 5) And finally, the mount with id 16 is its parent.
>
> I'd suggest doing a new file that would *not* try to imitate
> /etc/mtab. Another thing is, how much of the propagation information
> do we want to expose, and what do we intend to do with it?

I think the scheme devised by Ram is basically right. It shows the
relationships (slave, peer) and the ID of a master/peer mount. What I
changed is to always show a canonical peer, because I think that is
more useful in establishing relationships between mounts.

Is this info sensitive? I can't see why it would be.

> Note that the entire propagation tree is out of the question - it
> spans many namespaces and contains potentially sensitive information.
> So we won't see all nodes.

With multiple namespaces, of course you are only allowed to see a part
of the tree, but you could have xterms for all of them, and can put
together the big picture from the pieces.

> What do we want to *do* with the information about propagation?

Just feedback about the state of the thing. It's very annoying that
after setting up propagation, it's impossible to check the result.
Miklos
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel"
in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch] VFS: extend /proc/mounts
On Thu, Jan 17, 2008 at 09:36:11AM +0100, Miklos Szeredi wrote:
> I'd suggest doing a new file that would *not* try to imitate
> /etc/mtab. Another thing is, how much of propagation information do
> we want to be exposed and what do we intend to do with it?
>
> I think the scheme devised by Ram is basically right. It shows the
> relationships (slave, peer) and the ID of a master/peer mount.

Yes. It also shows the full relationship between source and
destination for bind mounts. Right now /proc/mounts is useless here:

 # mount --bind /mnt/test /mnt/test2
 # cat /proc/mounts | grep test
 /dev/root /mnt/test2 ext3 rw,noatime,data=ordered 0 0

> What do we want to *do* with the information about propagation? Just
> feedback about the state of the thing. It's very annoying that after
> setting up propagation, it's impossible to check the result.

Exactly.

    Karel

--
 Karel Zak  [EMAIL PROTECTED]
Re: [patch] VFS: extend /proc/mounts
On Jan 17, 2008, at 3:55 AM, Miklos Szeredi wrote:
> Hey, I just found /proc/X/mountstats. How does this fit into the big
> picture? It seems to show some counters for NFS mounts; no other
> filesystem uses it. The format looks rather less nice than
> /proc/X/mounts (why do we need long English sentences under /proc?).

I introduced /proc/self/mountstats because we need a way for
non-block-device-based file systems to report I/O statistics.
Everything else I tried was rejected, and apparently what we ended up
with was reviewed by only a handful of people, so no one else likes it
or uses it.

It can go away for all I care, as long as we retain some flexible
mechanism for non-block-based file systems to report I/O stats. As far
as I am aware, there are only two user utilities that understand and
parse this data, and I maintain both.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com
Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)
On Tue, 15 Jan 2008, Daniel Phillips wrote:
> Along with this effort, could you let me know if the world actually
> cares about online fsck? Now we know how to do it, I think, but is it
> worth the effort? Most users seem to care deeply about things just
> working.

Here is why ntfs-3g also took the online fsck path some time ago.

NTFS support had a very bad reputation on Linux, so the new code was
written with rigid sanity checks and extensive automatic regression
testing. One of the consequences is that we're detecting far too many
inconsistencies left behind by the Windows and other NTFS drivers,
hardware faults, and device drivers.

To better utilize the non-existent developer resources, the obvious
move was to suggest the already existing Windows fsck (chkdsk) in such
cases. Simple and safe, or so those of us who never used Windows would
think. However, years of experience show that, depending on several
factors, chkdsk may start or not, and may report the real problems or
not; on the other hand, it may report bogus issues, it may run long or
just forever, and it may even remove completely valid files. So one
could perhaps consider a suggestion to run chkdsk a call to play
Russian roulette.

Thankfully NTFS has some level of metadata redundancy, with signatures
and weak checksums, which makes it possible to correct some common and
obvious corruptions on the fly.

Similarly to ZFS, Windows Server 2008 also has self-healing NTFS:
http://technet2.microsoft.com/windowsserver2008/en/library/6f883d0d-3668-4e15-b7ad-4df0f6e6805d1033.mspx?mfr=true

	Szaka

--
NTFS-3G: http://ntfs-3g.org
Re: [Btrfs-devel] [ANNOUNCE] Btrfs v0.10 (online growing/shrinking, ext3 conversion, and more)
On Tuesday 15 January 2008, Chris Mason wrote:
> Hello everyone,
>
> Btrfs v0.10 is now available for download from:
> http://oss.oracle.com/projects/btrfs/

Well, it turns out this release had a few small problems:

* data=ordered deadlock on older kernels (including 2.6.23)
* Compile problems when ACLs were not enabled in the kernel

So, I've put v0.11 out there. It fixes those two problems and will
also compile on older (2.6.18) enterprise kernels. v0.11 does not have
any disk format changes.

-chris
Re: [Btrfs-devel] [ANNOUNCE] Btrfs v0.10 (online growing/shrinking, ext3 conversion, and more)
On Jan 17, 2008 1:25 PM, Chris Mason [EMAIL PROTECTED] wrote:
> So, I've put v0.11 out there. It fixes those two problems and will
> also compile on older (2.6.18) enterprise kernels. v0.11 does not
> have any disk format changes.

Hi Chris,

First, massive congratulations for bringing this to fruition in such a
short time. Now back to the regular carping: why even support older
kernels?

Regards,

Daniel
Re: [Btrfs-devel] [ANNOUNCE] Btrfs v0.10 (online growing/shrinking, ext3 conversion, and more)
On Thursday 17 January 2008, Daniel Phillips wrote:
> On Jan 17, 2008 1:25 PM, Chris Mason [EMAIL PROTECTED] wrote:
> > So, I've put v0.11 out there. It fixes those two problems and will
> > also compile on older (2.6.18) enterprise kernels. v0.11 does not
> > have any disk format changes.
>
> Hi Chris,
>
> First, massive congratulations for bringing this to fruition in such
> a short time. Now back to the regular carping: why even support older
> kernels?

The general answer is that the backports are small and easy. I don't
test them heavily, and I don't go out of my way to make things work.
But they do make it easier for people to try things out, and to figure
out how to use all these new features to solve problems. Small changes
that enable more testers are always welcome.

In general, the core parts of the kernel that btrfs uses haven't had
many interface changes since 2.6.18, so this isn't a huge deal.

-chris
[PATCH 1/4] [CIFS] Provides DFS shrinkable submounts functionality
Christoph, thanks for your review. Here is the dfs patch 1/4, which I
rewrote taking your comments into account. The patch still depends on
patch 1/3, which is yet to be fixed.

Signed-off-by: Igor Mammedov [EMAIL PROTECTED]
---
 fs/cifs/Makefile       |    2 +-
 fs/cifs/cifs_dfs_ref.c |  376 ++++++++++++++++++++++++++++++++++++++++
 fs/cifs/cifs_dfs_ref.h |   28 +++
 fs/cifs/cifsfs.c       |    3 +
 4 files changed, 408 insertions(+), 1 deletions(-)
 create mode 100644 fs/cifs/cifs_dfs_ref.c
 create mode 100644 fs/cifs/cifs_dfs_ref.h

diff --git a/fs/cifs/Makefile b/fs/cifs/Makefile
index 09898b8..6ba43fb 100644
--- a/fs/cifs/Makefile
+++ b/fs/cifs/Makefile
@@ -10,4 +10,4 @@ cifs-y := cifsfs.o cifssmb.o cifs_debug.o connect.o dir.o file.o inode.o \
 
 cifs-$(CONFIG_CIFS_UPCALL) += cifs_spnego.o
 
-cifs-$(CONFIG_CIFS_DFS_UPCALL) += dns_resolve.o
+cifs-$(CONFIG_CIFS_DFS_UPCALL) += dns_resolve.o cifs_dfs_ref.o
diff --git a/fs/cifs/cifs_dfs_ref.c b/fs/cifs/cifs_dfs_ref.c
new file mode 100644
index 000..740d99d
--- /dev/null
+++ b/fs/cifs/cifs_dfs_ref.c
@@ -0,0 +1,376 @@
+/*
+ * Contains the CIFS DFS referral mounting routines used for handling
+ * traversal via DFS junction point
+ *
+ * Copyright (c) 2007 Igor Mammedov
+ * Author(s): Igor Mammedov ([EMAIL PROTECTED])
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+#include <linux/dcache.h>
+#include <linux/mount.h>
+#include <linux/namei.h>
+#include <linux/vfs.h>
+#include <linux/fs.h>
+#include "cifsglob.h"
+#include "cifsproto.h"
+#include "cifsfs.h"
+#include "dns_resolve.h"
+#include "cifs_dfs_ref.h"
+#include "cifs_debug.h"
+
+LIST_HEAD(cifs_dfs_automount_list);
+
+/*
+ * DFS functions
+*/
+
+void dfs_shrink_umount_helper(struct vfsmount *vfsmnt)
+{
+	mark_mounts_for_expiry(&cifs_dfs_automount_list);
+	mark_mounts_for_expiry(&cifs_dfs_automount_list);
+	shrink_submounts(vfsmnt, &cifs_dfs_automount_list);
+}
+
+/**
+ * cifs_get_share_name - extracts share name from UNC
+ * @node_name: pointer to UNC string
+ *
+ * Extracts the share name from a full UNC, i.e. strips from the UNC
+ * the trailing path that is not part of the share name, and fixes up
+ * a missing '\' at the beginning of the DFS node referral if
+ * necessary.
+ * Returns pointer to share name on success or NULL on error.
+ * Caller is responsible for freeing the returned string.
+ */
+static char *cifs_get_share_name(const char *node_name)
+{
+	int len;
+	char *UNC;
+	char *pSep;
+
+	len = strlen(node_name);
+	UNC = kmalloc(len+2 /* for term null and additional \ if it's missing */,
+			GFP_KERNEL);
+	if (!UNC)
+		return NULL;
+
+	/* get share name and server name */
+	if (node_name[1] != '\\') {
+		UNC[0] = '\\';
+		strncpy(UNC+1, node_name, len);
+		len++;
+		UNC[len] = 0;
+	} else {
+		strncpy(UNC, node_name, len);
+		UNC[len] = 0;
+	}
+
+	/* find server name end */
+	pSep = memchr(UNC+2, '\\', len-2);
+	if (!pSep) {
+		cERROR(1, ("%s: no server name end in node name: %s",
+			__FUNCTION__, node_name));
+		kfree(UNC);
+		return NULL;
+	}
+
+	/* find sharename end */
+	pSep++;
+	pSep = memchr(UNC+(pSep-UNC), '\\', len-(pSep-UNC));
+	if (!pSep) {
+		cERROR(1, ("%s:2 cant find share name in node name: %s",
+			__FUNCTION__, node_name));
+		kfree(UNC);
+		return NULL;
+	}
+	/* trim path up to sharename end
+	 * now we have share name in UNC */
+	*pSep = 0;
+
+	return UNC;
+}
+
+
+/**
+ * compose_mount_options - creates mount options for referral
+ * @sb_mountdata: parent/root DFS mount options (template)
+ * @ref_unc: referral server UNC
+ * @devname: pointer for saving device name
+ *
+ * Creates mount options for a submount based on the template options
+ * sb_mountdata, replacing the unc, ip and prefixpath options with the
+ * ones we've got from ref_unc.
+ *
+ * Returns: pointer to new mount options or ERR_PTR.
+ * Caller is responsible for freeing the returned value if it is not
+ * an error.
+ */
+char *compose_mount_options(const char *sb_mountdata, const char *ref_unc,
+				char **devname)
+{
+	int rc;
+	char *mountdata;
+	int md_len;
+	char *tkn_e;
+	char *srvIP = NULL;
+	char sep = ',';
+	int off, noff;
+
+	if (sb_mountdata == NULL)
+		return ERR_PTR(-EINVAL);
+
+	*devname = cifs_get_share_name(ref_unc);
+	rc = dns_resolve_server_name_to_ip(*devname, &srvIP);
+	if (rc != 0) {
Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)
On Tue 2008-01-15 20:36:16, Chris Mason wrote:
> On Tue, 15 Jan 2008 20:24:27 -0500 Daniel Phillips [EMAIL PROTECTED] wrote:
> > On Jan 15, 2008 7:15 PM, Alan Cox [EMAIL PROTECTED] wrote:
> > > > Writeback cache on disk in itself is not bad, it only gets bad
> > > > if the disk is not engineered to save all its dirty cache on
> > > > power loss, using the disk motor as a generator or
> > > > alternatively a small battery. It would be awfully nice to know
> > > > which brands fail here, if any, because writeback cache is a
> > > > big performance booster.
> > >
> > > AFAIK no drive saves the cache. The worst case cache flush for
> > > drives is several seconds with no retries and a couple of minutes
> > > if something really bad happens. This is why the kernel has some
> > > knowledge of barriers and uses them to issue flushes when needed.
> >
> > Indeed, you are right, which is supported by actual measurements:
> > http://sr5tech.com/write_back_cache_experiments.htm
> >
> > Sorry for implying that anybody has engineered a drive that can do
> > such a nice thing with writeback cache. The "disk motor as a
> > generator" tale may not be purely folklore. When an IDE drive is
> > not in writeback mode, something special needs to be done to ensure
> > the last write to media is not a scribble. A small UPS can make
> > writeback mode actually reliable, provided the system is smart
> > enough to take the drives out of writeback mode when the line power
> > is off.
>
> We've had mount -o barrier=1 for ext3 for a while now; it makes
> writeback caching safe. XFS has this on by default, as does reiserfs.

Maybe ext3 should do barriers by default? Having ext3 in "let's
corrupt data by default" mode seems like a bad idea.

								Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)
On Jan 17, 2008 7:29 AM, Szabolcs Szakacsits [EMAIL PROTECTED] wrote:
> Similarly to ZFS, Windows Server 2008 also has self-healing NTFS:

I guess that is enough votes to justify going ahead and trying an
implementation of the reverse mapping ideas I posted. But of course
more votes for this are better. If online incremental fsck is
something people want, then please speak up here; that will very
definitely help make it happen.

On the walk-before-run principle, it would initially just be
filesystem checking, not repair. But even this would help, by setting
per-group "checked" flags that offline fsck could use to do a much
quicker repair pass. And it will let you know when a volume needs to
be taken offline, without having to build in planned downtime just in
case, which already eats a bunch of nines.

Regards,

Daniel
Re: [RFC] Parallelize IO for e2fsck
On Wed, Jan 16, 2008 at 01:30:43PM -0800, Valerie Henson wrote:
> Hi y'all,
>
> This is a request for comments on the rewrite of the e2fsck IO
> parallelization patches I sent out a few months ago. The mechanism is
> totally different. Previously IO was parallelized by issuing IOs from
> multiple threads; now a single thread issues fadvise(WILLNEED) and
> then uses read() to complete the IO.

Interesting.

We ultimately rejected a similar patch to xfs_repair (pre-populating
the kernel block device cache), mainly because of low-memory
performance issues, and because it doesn't really enable you to do
anything particularly smart about optimising I/O patterns for larger,
high-performance RAID arrays.

The low-memory problems were particularly bad; the readahead thrashing
caused a slowdown of 2-3x compared to the baseline, and often it was
due to the repair process requiring all of memory to cache stuff it
would need later. IIRC, multi-terabyte ext3 filesystems have similar
memory usage problems to XFS, so there's a good chance that this patch
will see the same sorts of issues.

> Single disk performance doesn't change, but elapsed time drops by
> about 50% on a big RAID-5 box. Passes 1 and 2 are parallelized. Pass
> 5 is left as an exercise for the reader.

Promising results, though.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
Re: [RFC] Parallelize IO for e2fsck
On Jan 17, 2008 5:15 PM, David Chinner [EMAIL PROTECTED] wrote:
> On Wed, Jan 16, 2008 at 01:30:43PM -0800, Valerie Henson wrote:
> > This is a request for comments on the rewrite of the e2fsck IO
> > parallelization patches I sent out a few months ago. The mechanism
> > is totally different. Previously IO was parallelized by issuing IOs
> > from multiple threads; now a single thread issues fadvise(WILLNEED)
> > and then uses read() to complete the IO.
>
> Interesting.
>
> We ultimately rejected a similar patch to xfs_repair (pre-populating
> the kernel block device cache), mainly because of low-memory
> performance issues, and because it doesn't really enable you to do
> anything particularly smart about optimising I/O patterns for larger,
> high-performance RAID arrays.
>
> The low-memory problems were particularly bad; the readahead
> thrashing caused a slowdown of 2-3x compared to the baseline, and
> often it was due to the repair process requiring all of memory to
> cache stuff it would need later. IIRC, multi-terabyte ext3
> filesystems have similar memory usage problems to XFS, so there's a
> good chance that this patch will see the same sorts of issues.

That was one of my first concerns - how to avoid overflowing memory?
Whenever I screw it up in e2fsck, it does go, oh, 2 times slower, due
to the minor detail of every single block being read from disk twice. :)

I have a partial solution that sort of blindly manages the buffer
cache. First, the user passes e2fsck a parameter saying how much
memory is available as buffer cache. The readahead thread reads things
in and immediately throws them away, so they are only in the buffer
cache (no double-caching). Then readahead and e2fsck work together so
that readahead only reads in new blocks when the main thread is done
with earlier blocks. The already-used blocks get kicked out of the
buffer cache to make room for the new ones.

What would be nice is to take into account the current total memory
usage of the whole fsck process and factor that in. I don't think it
would be hard to add to the existing cache management framework.
Thoughts?

> Promising results, though.

Thanks! It's solving a rather simpler problem than XFS check/repair. :)

-VAL
[patch 4/6] xip: support non-struct page backed memory
Convert XIP to support non-struct page backed memory, using VM_MIXEDMAP
for the user mappings. This requires the get_xip_page API to be changed
to an address-based one. Improve the API layering a little bit too,
while we're here.

(The kaddr->pfn conversion may not be quite right for all architectures
or XIP memory mappings, and cacheflushing may need to be added for some
archs.)

This scheme has been tested and works for Jared's work-in-progress
filesystem, with s390's xip, and with the new brd driver. It is
required to have XIP filesystems on memory that isn't backed with
struct page.

Signed-off-by: Nick Piggin [EMAIL PROTECTED]
Cc: Jared Hulbert [EMAIL PROTECTED]
Cc: Carsten Otte [EMAIL PROTECTED]
Cc: Martin Schwidefsky [EMAIL PROTECTED]
Cc: Heiko Carstens [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Cc: linux-fsdevel@vger.kernel.org
---
 fs/ext2/inode.c    |    2 
 fs/ext2/xip.c      |   36 -
 fs/ext2/xip.h      |    8 +-
 fs/open.c          |    2 
 include/linux/fs.h |    3 
 mm/fadvise.c       |    2 
 mm/filemap_xip.c   |  191 ++---
 mm/madvise.c       |    2 
 8 files changed, 122 insertions(+), 124 deletions(-)

Index: linux-2.6/fs/ext2/inode.c
===================================================================
--- linux-2.6.orig/fs/ext2/inode.c
+++ linux-2.6/fs/ext2/inode.c
@@ -800,7 +800,7 @@ const struct address_space_operations ex
 const struct address_space_operations ext2_aops_xip = {
 	.bmap			= ext2_bmap,
-	.get_xip_page		= ext2_get_xip_page,
+	.get_xip_address	= ext2_get_xip_address,
 };
 
 const struct address_space_operations ext2_nobh_aops = {
Index: linux-2.6/fs/ext2/xip.c
===================================================================
--- linux-2.6.orig/fs/ext2/xip.c
+++ linux-2.6/fs/ext2/xip.c
@@ -15,24 +15,25 @@
 #include "xip.h"
 
 static inline int
-__inode_direct_access(struct inode *inode, sector_t sector,
-		      unsigned long *data)
+__inode_direct_access(struct inode *inode, sector_t block, unsigned long *data)
 {
+	sector_t sector;
 	BUG_ON(!inode->i_sb->s_bdev->bd_disk->fops->direct_access);
+
+	sector = block * (PAGE_SIZE / 512); /* ext2 block to bdev sector */
 	return inode->i_sb->s_bdev->bd_disk->fops
-		->direct_access(inode->i_sb->s_bdev,sector,data);
+		->direct_access(inode->i_sb->s_bdev, sector, data);
 }
 
 static inline int
-__ext2_get_sector(struct inode *inode, sector_t offset, int create,
+__ext2_get_block(struct inode *inode, pgoff_t pgoff, int create,
 		   sector_t *result)
 {
 	struct buffer_head tmp;
 	int rc;
 
 	memset(&tmp, 0, sizeof(struct buffer_head));
-	rc = ext2_get_block(inode, offset/ (PAGE_SIZE/512), &tmp,
-			    create);
+	rc = ext2_get_block(inode, pgoff, &tmp, create);
 	*result = tmp.b_blocknr;
 
 	/* did we get a sparse block (hole in the file)? */
@@ -45,13 +46,12 @@ __ext2_get_sector(struct inode *inode, s
 }
 
 int
-ext2_clear_xip_target(struct inode *inode, int block)
+ext2_clear_xip_target(struct inode *inode, sector_t block)
 {
-	sector_t sector = block * (PAGE_SIZE/512);
 	unsigned long data;
 	int rc;
 
-	rc = __inode_direct_access(inode, sector, &data);
+	rc = __inode_direct_access(inode, block, &data);
 	if (!rc)
 		clear_page((void*)data);
 	return rc;
@@ -69,24 +69,24 @@ void ext2_xip_verify_sb(struct super_blo
 	}
 }
 
-struct page *
-ext2_get_xip_page(struct address_space *mapping, sector_t offset,
-		  int create)
+void *
+ext2_get_xip_address(struct address_space *mapping, pgoff_t pgoff, int create)
 {
 	int rc;
 	unsigned long data;
-	sector_t sector;
+	sector_t block;
 
 	/* first, retrieve the sector number */
-	rc = __ext2_get_sector(mapping->host, offset, create, &sector);
+	rc = __ext2_get_block(mapping->host, pgoff, create, &block);
 	if (rc)
 		goto error;
 
 	/* retrieve address of the target data */
-	rc = __inode_direct_access
-		(mapping->host, sector * (PAGE_SIZE/512), &data);
-	if (!rc)
-		return virt_to_page(data);
+	rc = __inode_direct_access(mapping->host, block, &data);
+	if (rc)
+		goto error;
+
+	return (void *)data;
 
 error:
 	return ERR_PTR(rc);
Index: linux-2.6/fs/ext2/xip.h
===================================================================
--- linux-2.6.orig/fs/ext2/xip.h
+++ linux-2.6/fs/ext2/xip.h
@@ -7,19 +7,19 @@
 
 #ifdef CONFIG_EXT2_FS_XIP
 extern void ext2_xip_verify_sb (struct super_block *);
-extern int ext2_clear_xip_target (struct inode *, int);
+extern int ext2_clear_xip_target (struct inode *, sector_t);
 
 static inline int ext2_use_xip (struct super_block *sb)
 {
 	struct ext2_sb_info *sbi = EXT2_SB(sb);
 	return (sbi->s_mount_opt