Re: very poor ext3 write performance on big filesystems?
On Mon, Feb 18, 2008 at 04:57:25PM +0100, Andi Kleen wrote:
>> Use cp or a tar pipeline to move the files.
>
> Are you sure cp handles hardlinks correctly?  I know tar does, but I
> have my doubts about cp.

I *think* GNU cp does the right thing with --preserve=links.  I'm not
100% sure, though --- like you, probably, I always use tar for moving
or copying directory hierarchies.

					- Ted
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
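For what it's worth, the hardlink question is easy to check by hand.
The sketch below (scratch directories `src` and `dst` are invented for
illustration) copies a tree through a tar pipeline and verifies that
the link count survived:

```shell
# Sketch: check that a tar pipeline preserves hardlinks.
# "src" and "dst" are illustrative scratch directories.
tmp=$(mktemp -d) && cd "$tmp"
mkdir src dst
echo data > src/a
ln src/a src/b                       # a and b now share one inode
(cd src && tar cf - .) | (cd dst && tar xf -)
stat -c %h dst/a                     # link count is 2 if hardlinks survived
```

The same check with `cp -a --preserve=links` in place of the tar pipe
would settle the question for a given cp version.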
Re: very poor ext3 write performance on big filesystems?
On Mon, Feb 18, 2008 at 05:16:55PM +0100, Tomasz Chmielewski wrote:
> Theodore Tso schrieb:
>> I'd really need to know exactly what kind of operations you were
>> trying to do that were causing problems before I could say for
>> sure.  Yes, you said you were removing unneeded files, but how were
>> you doing it?  With rm -r of old hard-linked directories?
>
> Yes, with rm -r.

You should definitely try the spd_readdir hack; that will help reduce
the seek times.  This will probably help on any block group oriented
filesystem, including XFS, etc.

>> How big are the average files involved?  Etc.
>
> It's hard to estimate the average size of a file.  I'd say there are
> not many files bigger than 50 MB.

Well, Ext4 will help for files bigger than 48k.

The other thing that might help for you is using an external journal
on a separate hard drive (either for ext3 or ext4).  That will help
alleviate some of the seek storms going on, since the journal is
written to only sequentially, and putting it on a separate hard drive
will help remove some of the contention on the hard drive.

I assume that your 1.2 TB filesystem is located on a RAID array; did
you use the mke2fs -E stride option to make sure all of the bitmaps
don't get concentrated on one hard drive spindle?  One of the failure
modes which can happen is if you use a 4+1 raid 5 setup, that all of
the block and inode bitmaps can end up getting laid out on a single
hard drive, so it becomes a bottleneck for bitmap intensive workloads
--- including rm -rf.  So that's another thing that might be going
on.

If you do a dumpe2fs, and look at the block numbers for the block and
inode allocation bitmaps, and you find that they are all landing on
the same physical hard drive, then that's very clearly the biggest
problem given an rm -rf workload.  You should be able to see this as
well visually; if one hard drive has its hard drive light almost
constantly on, and the other ones don't have much activity, that's
probably what is happening.
					- Ted
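To make the stride suggestion concrete, here is a sketch of the
arithmetic (the 64k RAID chunk size and the /dev/md0 device name are
illustrative assumptions, not from the original report); dumpe2fs can
then confirm where the bitmaps actually landed:

```shell
# stride = RAID chunk size / filesystem block size, so that ext3
# staggers its per-group metadata across spindles instead of
# stacking it all on one.
chunk_kb=64                        # illustrative RAID-5 chunk size
block_kb=4                         # typical ext3 block size
stride=$((chunk_kb / block_kb))
echo "mkfs.ext3 -E stride=$stride /dev/md0"
# afterwards, check where the bitmaps ended up:
echo "dumpe2fs /dev/md0 | grep -i bitmap"
```

For this geometry the printed command uses stride=16; with a different
chunk or block size, recompute accordingly.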
Re: [RFC] Parallelize IO for e2fsck
On Mon, Jan 28, 2008 at 07:30:05PM +, Pavel Machek wrote:
> As user pages are always in highmem, this should be easy to decide:
> only send SIGDANGER when highmem is full.  (Yes, there are
> inodes/dentries/file descriptors in lowmem, but I doubt apps will
> respond to SIGDANGER by closing files).

Good point; for a system with at least (say) 2GB of memory, that
definitely makes sense.  For a system with less than 768 megs of
memory (how quaint, but it wasn't that long ago this was a lot of
memory :-), there wouldn't *be* any memory in highmem at all.

					- Ted
Re: [RFC] Parallelize IO for e2fsck
On Fri, Jan 25, 2008 at 05:55:51PM -0800, Bryan Henderson wrote:
> I was surprised to see AIX do late allocation by default, because
> IBM's traditional style is bulletproof systems.  A system where a
> process can be killed at unpredictable times because of resource
> demands of unrelated processes doesn't really fit that style.  It's
> really a fairly unusual application that benefits from late
> allocation: one that creates a lot more virtual memory than it ever
> touches.  For example, a sparse array.  Or am I missing something?

I guess it depends on how far you try to do bulletproof.  OSF/1 used
to use bulletproof as its default --- and I had to turn it off on
tsx-11.mit.edu (the first North American ftp server for Linux :-),
because the difference was something like 50 ftp daemons versus over
500 on the same server.  It reserved VM space for the text segment of
every single process, since at least in theory, it's possible for
every single text page to get modified using ptrace if (for example)
a debugger were to set a break point on every single page of every
single text segment of every single ftp daemon.

You can also see potential problems for Java programs.  Suppose you
had some gigantic Java application (say, Lotus Notes, or Websphere
Application Server) which is taking up many, many, MANY gigabytes of
VM space.  Now suppose the Java application needs to fork and exec
some trivial helper program.  For that tiny instant, between the fork
and exec, the VM requirements in bulletproof mode would double, since
while 99.9% of the time programs will immediately discard the VM upon
the exec, there is always the possibility that the child process will
touch every single data page, forcing a copy on write, and never do
the exec.

There are of course different levels of bulletproof between the
extremes of totally bulletproof and late binding from an algorithmic
standpoint.  For example, you could ignore the needed pages caused by
ptrace(); more challenging would be how to handle the fork/exec
semantics, although there could be kludges such as strongly
encouraging applications to use an old-fashioned BSD-style vfork() to
guarantee that the child couldn't double VM requirements between the
vfork() and exec().

I certainly can't say for sure what the AIX designers had in mind,
and why they didn't choose one of the more intermediate design
choices.  However, it is fair to say that 100% bulletproof can
require reserving far more VM resources than you might first expect.
Even a company which is highly incented to sell large amounts of
hardware, such as Digital, might not have wanted their OS to be only
able to support an embarrassingly small number of simultaneous ftpd
connections.  I know this for sure because the OSF/1 documentation,
when discussing their VM tuning knobs, specifically talked about the
scenario that I ran into with tsx-11.mit.edu.

Regards,

					- Ted
Re: [RFC] ext3 freeze feature
On Fri, Jan 25, 2008 at 10:34:25AM -0600, Eric Sandeen wrote:
>> But it was this concern which is why ext3 never exported freeze
>> functionality to userspace, even though other commercial
>> filesystems do support this.  It wasn't that it wasn't considered,
>> but the concern about whether or not it was sufficiently safe to
>> make available.
>
> What's the safety concern; that the admin will forget to unfreeze?

That the admin would manage to deadlock him/herself and wedge up the
whole system...

> I'm also not sure I see the point of the timeout in the original
> patch; either you are done snapshotting and ready to unfreeze, or
> you're not; 1, or 2, or 3 seconds doesn't really matter.  When
> you're done, you're done, and you can only unfreeze then.  Shouldn't
> this be done programmatically, and not with some pre-determined
> timeout?

This is only a guess, but I suspect it was a fail-safe in case the
admin did manage to deadlock him/herself.  I would think a better
approach would be to make the filesystem unfreeze if the file
descriptor that was used to freeze the filesystem is closed, and then
have explicit deadlock detection that kills the process doing the
freeze, at which point the filesystem unlocks and the system can
recover.

					- Ted
Re: [RFC] ext3 freeze feature
On Fri, Jan 25, 2008 at 03:18:51PM +0300, Dmitri Monakhov wrote:
> First of all Linux already have at least one open-source (dm-snap),
> and several commercial snapshot solutions.

Yes, but it requires that the filesystem be stored under LVM.  Unlike
what EVMS v1 allowed us to do, we can't currently take a snapshot of
a bare block device.  This patch could potentially be useful for
systems which aren't using LVM, however.

> You have to realize what delay between 1-3 stages have to be
> minimal.  for example dm-snap perform it only for explicit journal
> flushing.  From my experience if delay is more than 4-5 seconds
> whole system becomes unstable.

That's the problem.  You can't afford to freeze for very long.  What
you *could* do is to start putting processes to sleep if they attempt
to write to the frozen filesystem, and then detect the deadlock case
where the process holding the file descriptor used to freeze the
filesystem gets frozen because it attempted to write to the
filesystem --- at which point it gets some kind of signal (which
defaults to killing the process), the filesystem is unfrozen, and as
part of the unfreeze you wake up all of the processes that were put
to sleep for touching the frozen filesystem.

The other approach would be to say, oh well, the freeze ioctl is
inherently dangerous, and root is allowed to shoot himself in the
foot, so who cares.  :-)

But it was this concern which is why ext3 never exported freeze
functionality to userspace, even though other commercial filesystems
do support this.  It wasn't that it wasn't considered, but the
concern about whether or not it was sufficiently safe to make
available.

And I do agree that we probably should just implement this in a
filesystem-independent way, in which case all of the filesystems that
support this already have the super_operations functions
write_super_lockfs() and unlockfs().  So if this is done using a new
system call, there should be no filesystem-specific changes needed,
and all filesystems which support those super_operations method
functions would be able to provide this functionality to the new
system call.

					- Ted

P.S.  Oh yeah, it should be noted that freezing at the filesystem
layer does *not* guarantee that changes to the block device aren't
happening via mmap()'ed files.  The LVM needs to freeze writes at the
block device level if it wants to guarantee a completely stable
snapshot image.  So the proposed patch doesn't quite give you those
guarantees, if that was the intended goal.
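For reference, the sequence under discussion looks roughly like the
following sketch.  The fsfreeze(8) tool (which util-linux later grew
as the userspace face of exactly this kind of freeze/thaw interface)
and the LVM names are illustrative assumptions; the commands are only
echoed here, since freezing for real requires root and a mounted
filesystem:

```shell
# Dry-run sketch of freeze -> snapshot -> thaw.  RUN=echo only prints
# the commands; drop it to execute them for real (as root).
RUN=echo
mnt=/mnt/data                                  # illustrative mountpoint
$RUN fsfreeze -f "$mnt"                        # block new writers, flush journal
$RUN lvcreate -s -n snap -L 1G /dev/vg0/data   # snapshot while quiesced
$RUN fsfreeze -u "$mnt"                        # thaw; blocked writers resume
```

Note that, as the postscript says, a filesystem-level freeze alone
does not quiesce writes that bypass the filesystem.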
Re: [RFC] Parallelize IO for e2fsck
On Fri, Jan 25, 2008 at 01:08:09AM +0200, Adrian Bunk wrote:
> In practice, there is a small number of programs that are both the
> common memory hogs and should be able to reduce their memory
> consumption by 10% or 20% without big problems when requested (e.g.
> Java VMs, Firefox and databases come into my mind).

I agree, it's only a few processes where this makes sense.  But for
those that do, it would be useful if they could register with the
kernel that they would like to know when memory is getting tight
(just before the system starts ejecting cached data, just before
swapping, etc.), and at what frequency.  And presumably, if the
kernel notices that a process is responding to such requests with
memory actually getting released back to the system, that process
could get rewarded by having the OOM killer less likely to target
that particular thread.

AIX basically did this with SIGDANGER (the signal is ignored by
default), except there wasn't the ability for the process to tell the
kernel at what level of memory pressure it should start getting
notified, and there was no way for the kernel to tell how bad the
memory pressure actually was.  On the other hand, it was a relatively
simple design.

> In practice very few processes would indeed pay attention to
> SIGDANGER, so I think you're quite right there.  And from a
> performance point of view letting applications voluntarily free some
> memory is better even than starting to swap.

Absolutely.

					- Ted
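As a toy illustration of the cooperative mechanism being described
(SIGUSR1 standing in for AIX's SIGDANGER, and the "cache" being just a
shell variable), the application side looks like this:

```shell
# Toy sketch: a process registers interest in memory pressure and
# sheds rebuildable state when notified.  SIGUSR1 stands in for
# SIGDANGER; a real kernel-side notification would arrive instead
# of the self-directed kill below.
cache="lots of rebuildable cached data"
trap 'cache=""; echo "pressure signal: cache dropped"' USR1
kill -s USR1 $$
echo "cache is now: [$cache]"
```

The kernel-side half (choosing when to deliver the signal, and how to
reward cooperative processes) is exactly the part the thread says AIX
left crude.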
Re: [RFD] Incremental fsck
On Wed, Jan 09, 2008 at 02:52:14PM +0300, Al Boldi wrote:
> Ok, but let's look at this a bit more opportunistic / optimistic.
> Even after a black-out shutdown, the corruption is pretty minimal,
> using ext3fs at least.

After an unclean shutdown, assuming you have decent hardware that
doesn't lie about when blocks hit iron oxide, you shouldn't have any
corruption at all.  If you have crappy hardware, then all bets are
off...

> So let's take advantage of this fact and do an optimistic fsck, to
> assure integrity per-dir, and assume no external corruption.  Then
> we release this checked dir to the wild (optionally ro), and check
> the next.  Once we find external inconsistencies we either fix it
> unconditionally, based on some preconfigured actions, or present the
> user with options.

So what can you check?  The *only* thing you can check is whether or
not the directory syntax looks sane, whether the inode structure
looks sane, and whether or not the blocks reported as belonging to an
inode look sane.  What is very hard to check is whether or not the
link count on the inode is correct.  Suppose the link count is 1, but
there are actually two directory entries pointing at it.  Now when
someone unlinks the file through one of the directory entries, the
link count will go to zero, and the blocks will start to get reused,
even though the inode is still accessible via another pathname.
Oops.  Data loss.

This is why doing incremental, on-line fsck'ing is *hard*.  You're
not going to find this while doing each directory one at a time, and
if the filesystem is changing out from under you, it gets worse.

And it's not just the hard link count.  There is a similar issue with
the block allocation bitmap.  Detecting the case where two files
claim the same block simultaneously can't be done if you are doing it
incrementally, and if the filesystem is changing out from under you,
it's impossible, unless you also have the filesystem telling you
every single change while it is happening, and you keep an insane
amount of bookkeeping.

One thing that you *might* be able to do is to mount a filesystem
read-only, and check it in the background while you allow users to
access it read-only.  There are a few caveats, however:

(1) Some filesystem errors may cause the data to be corrupt, or in
the worst case, could cause the system to panic (that would arguably
be a filesystem/kernel bug, but we've not necessarily done as much
testing here as we should).

(2) If there were any filesystem errors found, you would need to
completely unmount the filesystem to flush the inode cache and
remount it before it would be safe to remount the filesystem
read/write.  You can't just do a mount -o remount if the filesystem
was modified under the OS's nose.

> All this could be per-dir or using some form of on-the-fly
> file-block-zoning.  And there probably is a lot more to it, but it
> should conceptually be possible, with more thoughts though...

Many things are possible, in the NASA sense of "with enough thrust,
anything will fly".  Whether or not it is *useful* and *worthwhile*
are of course different questions!  :-)

					- Ted
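The link-count hazard above is visible even from userspace.  In this
sketch (scratch filenames invented for illustration), it is the
inode's link count, not any one directory entry, that keeps the
blocks alive, which is why verifying it requires seeing *every*
directory entry on the filesystem at once:

```shell
# Sketch: one inode, two directory entries, one link count.
tmp=$(mktemp -d) && cd "$tmp"
echo data > f1
ln f1 f2             # two directory entries now point at one inode
stat -c %h f1        # link count: 2
rm f1
stat -c %h f2        # link count: 1; blocks are freed only at 0
```

An incremental checker that had only visited the directory containing
f1 would have no way of knowing f2 exists, which is the failure mode
described above.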
Re: [RFC 0/2] readdir() as an inode operation
On Tue, Oct 30, 2007 at 04:26:04PM +0100, Jan Kara wrote:
>> This is a first try to move readdir() to become an inode operation.
>> This is necessary for a VFS implementation of something like
>> union-mounts where a readdir() needs to read the directory contents
>> of multiple directories.  Besides that the new interface is no
>> longer giving the struct file to the filesystem implementations
>> anymore.  Comments, please?
>
> Hmm, are you sure there are no users which keep some per-struct-file
> information for directories?  File offset is one such obvious thing
> which you've handled, but actually a filesystem with a more
> complicated directory structure may remember some hints about where
> we really are, keep some readahead information or so...

For example, the ext3 filesystem, when hash tree support is enabled,
does exactly this.  See ext3_htree_store_dirent() in fs/ext3/dir.c
and ext3_htree_fill_tree() in fs/ext3/namei.c.

So your patch would break ext3 htree support.

					- Ted
Re: Does 32.1% non-contiguous mean severely fragmented?
On Tue, Oct 23, 2007 at 07:38:20PM +0900, Tetsuo Handa wrote:
>> Are you sure the file isn't getting written by some background
>> tasks that you weren't aware of?  This seems very strange; what
>> virtualization software are you using?  VMware, Xen, KVM?
>
> I'm using VMware Workstation 6.0.0 build 45731 for x86_64.  It seems
> that there were some background tasks that delay writing.  I tried
> the following sequence; sync didn't affect it.

Or it may be that it takes a while to do a controlled shutdown.

One potential reason for the vmem file being very badly fragmented is
that it might not be getting written in sequential order.  If the
writer is writing the file in random order, then unless you have a
filesystem which can do delayed allocation, the blocks will get
allocated in the order in which they are first written, and if the
writer is seeking to random locations to do the write, that's one way
that you can end up with a very badly fragmented file.

Regards,

					- Ted
Re: Does 32.1% non-contiguous mean severely fragmented?
On Mon, Oct 22, 2007 at 08:58:11PM +0900, Tetsuo Handa wrote:
> --- Start VM ---
> --- Suspend VM ---
> [EMAIL PROTECTED] Ubuntu7.10]# filefrag Ubuntu7.10.vmem
> Ubuntu7.10.vmem: 751 extents found, perfection would be 5 extents
> [EMAIL PROTECTED] Ubuntu7.10]# sync
> [EMAIL PROTECTED] Ubuntu7.10]# filefrag Ubuntu7.10.vmem
> Ubuntu7.10.vmem: 3281 extents found, perfection would be 5 extents
> [EMAIL PROTECTED] Ubuntu7.10]# filefrag Ubuntu7.10.vmem
> Ubuntu7.10.vmem: 3281 extents found, perfection would be 5 extents
> --- Resume and poweroff VM ---
>
> What?  sync yields more discontiguous?

What filesystem are you using?  ext3?  ext4?  xfs?  And are you using
any non-standard patches, such as some of the delayed allocation
patches that have been floating around?  If you're using ext3, that
shouldn't be happening.

If you use the -v option to filefrag, both before and after the sync,
that might show us what is going on.  The other thing is to use
debugfs and its stat command to get a detailed breakdown of the block
assignments of the file.

Are you sure the file isn't getting written by some background tasks
that you weren't aware of?  This seems very strange; what
virtualization software are you using?  VMware, Xen, KVM?

					- Ted
Re: Does 32.1% non-contiguous mean severely fragmented?
On Sat, Oct 20, 2007 at 12:39:33PM +0900, Tetsuo Handa wrote:
> Theodore Tso wrote:
>> beginning of every single block group.  You have a small number of
>> files on your system (349) occupying an average of 348 megabytes.
>> So it's not at all surprising that the non-contiguous percentage is
>> 32%.
>
> I see, thank you.  Yes, there are many files split into 2GB pieces.
> But what is surprising for me is that I have to wait for more than
> five minutes to save/restore the virtual machine's 512MB-RAM image
> (usually it takes less than five seconds).  Hdparm reports DMA is on
> and e2fsck reports no errors, so I thought it is severely
> fragmented.  Maybe I should backup all virtual machine's data and
> format the partition and restore them.

Well, that's a little drastic if you're not sure that what is going
on is fragmentation.  Five minutes to save/restore a 512MB ram image,
assuming that you are saving somewhere around 576 megs of data, means
you are writing less than 2 megs/second.  That seems to point to
something fundamentally wrong, far worse than can be explained by
fragmentation.

First of all, what does the filefrag program (shipped as part of
e2fsprogs, though not included in some distributions) say if you run
it as root on your VM data file?

Secondly, what results do you get when you run the command "hdparm
-tT /dev/sda" (or /dev/hda if you are using an IDE disk)?

This kind of performance regression is the sort of thing I see on my
laptop when I compile the kernel with the wrong options, and/or
disable AHCI mode in favor of compatibility mode, such that my laptop
SATA performance (as measured using hdparm) drops from 50 megs/second
to 2 megs/second.

Regards,

					- Ted
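The arithmetic behind that "less than 2 megs/second" estimate, as a
quick sanity check using the figures from the paragraph above:

```shell
# ~576 MB written in about five minutes works out to under 2 MB/s,
# an order of magnitude below what even a fragmented disk should do.
mb=576
secs=$((5 * 60))
awk -v mb="$mb" -v s="$secs" 'BEGIN { printf "%.2f MB/s\n", mb / s }'
```

A healthy commodity SATA disk of that era sustains tens of MB/s, which
is why the numbers point at something other than fragmentation.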
Re: Does "32.1% non-contiguous" mean severely fragmented?
On Fri, Oct 19, 2007 at 10:49:03AM +0900, Tetsuo Handa wrote:
> /data/VMware: 349/19546112 files (32.1% non-contiguous),
> 31019203/39072080 blocks
>
> Does "non-contiguous" mean fragmented?  If so, where is ext3defrag?

Not necessarily; it just means that 32% of your files have at least
one discontinuity.  Given the ext3 layout, by definition every 128
megs there will be a discontinuity because of the metadata at the
beginning of every single block group.  You have a small number of
files on your system (349) occupying an average of 348 megabytes.  So
it's not at all surprising that the non-contiguous percentage is 32%.

The Flex BG feature that was recently pulled into 2.6.23-git14 for
ext4 is designed to avoid this issue, but a seek every 128 megs is
for most workloads not a big deal and will hopefully not cause you
any problems.

					- Ted
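The "every 128 megs" figure, and filefrag's "perfection would be 5
extents" line from earlier in the thread, both fall out of the same
arithmetic.  This sketch assumes the common 4k block size and the
~576 MB file size mentioned in the thread:

```shell
# An ext3 block group is 32768 blocks; at 4k per block that is
# 128 MB, with bitmaps and inode tables at the start of each group
# forcing a discontinuity.
blocks_per_group=32768
block_kb=4
group_mb=$(( blocks_per_group * block_kb / 1024 ))
echo "block group size: ${group_mb} MB"

# Minimum extents for a ~576 MB file: one per group it touches.
file_mb=576
echo "minimum extents: $(( (file_mb + group_mb - 1) / group_mb ))"
```

So even a perfectly laid-out file of that size cannot be a single
extent on plain ext3, and filefrag's "perfection" figure reflects
exactly this.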
Re: [PATCH 13/32] IGET: Stop EXT2 from using iget() and read_inode() [try #2]
On Thu, Oct 04, 2007 at 04:57:08PM +0100, David Howells wrote:
> Stop the EXT2 filesystem from using iget() and read_inode().
> Replace ext2_read_inode() with ext2_iget(), and call that instead of
> iget().  ext2_iget() then uses iget_locked() directly and returns a
> proper error code instead of an inode in the event of an error.
>
> ext2_fill_super() returns any error incurred when getting the root
> inode instead of EINVAL.
>
> Signed-off-by: David Howells <[EMAIL PROTECTED]>

Acked-by: Theodore Ts'o <[EMAIL PROTECTED]>

					- Ted
Re: Upgrading datastructures between different filesystem versions
On Fri, Sep 28, 2007 at 02:31:46PM +0100, Christoph Hellwig wrote:
> On Fri, Sep 28, 2007 at 03:11:00PM +0200, Erik Mouw wrote:
>> There are however ways to confuse it: if you reformat an ext3
>> filesystem to reiserfs (version 3), mounting that filesystem
>> without -t reiserfs will trick mount(8) into mounting it as an ext3
>> filesystem (which will usually fail).  This is because the ext3
>> superblock lives at offset 0x400, and the reiserfs superblock at
>> 0x8000.  When you format a partition as reiserfs, it will not erase
>> old ext3 superblocks.  Before looking for a reiserfs superblock,
>> mount(8) first looks for an ext3 superblock.  The old ext3
>> superblock wasn't erased, but usually most of the other ext3
>> structures are, and so mount(8) will fail to mount the filesystem.
>>
>> Don't know if this particular bug is still there, but it has bitten
>> me in the past.
>
> This is easy to fix, though.  Quoting mkfs.xfs:
>
> 	/*
> 	 * Zero out the beginning of the device, to obliterate any old
> 	 * filesystem signatures out there.  This should take care of
> 	 * swap (somewhere around the page size), jfs (32k),
> 	 * ext[2,3] and reiserfs (64k) - and hopefully all else.
> 	 */
> 	buf = libxfs_getbuf(xi.ddev, 0, BTOBB(WHACK_SIZE));
> 	bzero(XFS_BUF_PTR(buf), WHACK_SIZE);
> 	libxfs_writebuf(buf, LIBXFS_EXIT_ON_FAILURE);
> 	libxfs_purgebuf(buf);

Ext3 does something similar, zapping space at the beginning AND the
end of the partition (because the MD superblocks are at the end).
It's just a misfeature of reiserfs's mkfs that it doesn't do this.

					- Ted
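A file-backed sketch of the same zap-both-ends trick (the temp image
file, the planted magic string, and the 64k whack size are all
illustrative; on a real device, a tool like wipefs(8) from util-linux
does this job):

```shell
# Sketch: obliterate stale signatures at both ends of a (file-backed)
# toy "device", the way mkfs implementations are quoted doing above.
img=$(mktemp)
dd if=/dev/zero of="$img" bs=1M count=8 status=none       # 8 MB image
# plant a fake stale signature at 32k (where e.g. jfs keeps its magic):
printf 'FAKEMAGC' | dd of="$img" bs=1 seek=32768 conv=notrunc status=none
# zero the first 64k and the last 64k, covering old superblocks at the
# front and MD-style superblocks at the end:
dd if=/dev/zero of="$img" bs=1024 count=64 conv=notrunc status=none
dd if=/dev/zero of="$img" bs=1024 count=64 seek=$((8*1024 - 64)) \
   conv=notrunc status=none
```

After the two zeroing passes, the planted signature is gone and a
probing mount(8) would no longer be fooled by it.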
Re: [PATCH] fs: Correct SuS compliance for open of large file without options
On Thu, Sep 27, 2007 at 04:19:12PM +0100, Alan Cox wrote:
>> Well it's not my call, just seems like a really bad idea to change
>> the error value.  You can't claim full coverage for such testing
>> anyway, it's one of those things that people will complain about
>> two releases later saying it broke app foo.
>
> Strange, since we've spent years changing error values and getting
> them right in the past.

I doubt there are any apps which are going to specifically check for
EFBIG and do something different if they get EOVERFLOW instead.  If
it was something like EAGAIN or EPERM, I'd be more concerned, but
EFBIG vs. EOVERFLOW?  C'mon!

> There are real things to worry about - sysfs, sysfs, sysfs, ... and
> all the other crap which is continually breaking stuff, not spec
> compliance corrections that don't break things but move us into
> compliance with the standard.

I've got to agree with Alan: the sysfs/udev breakages that we've done
are far more significant, and the fact that we continue to expose
internal data structures via sysfs is a gaping open pit that is far
more likely to cause any kind of problems than changing an error
return.

					- Ted
Re: [PATCH] fs: Correct SuS compliance for open of large file without options
On Thu, Sep 27, 2007 at 10:59:17AM -0700, Greg KH wrote:
> Come on now, I'm _very_ tired of this kind of discussion.  Please go
> read the documentation on how to _use_ sysfs from userspace in such
> a way that you can properly access these data structures so that no
> breakage occurs.

I've read it; the question is whether every single application
programmer or system shell script programmer who writes code my
system depends upon has read this document buried in the kernel
sources, or whether things will break spectacularly --- one of those
things that leaves me in suspense each time I update the kernel.

I'm reminded of Rusty's 2003 OLS Keynote, where he points out that
what's important is not making an interface easy to use, but _hard_
_to_ _misuse_.  The fact that sysfs is all laid out in a directory,
but for which some directories/symlinks are OK to use, and some are
NOT OK to use --- is why I call the sysfs interface an open pit.
Sure, if you have the map to the minefield, a minefield is perfectly
safe when you know what to avoid.  But is that the best way to
construct a path/interface for an application programmer to get from
point A to point B?

Maybe, maybe not.

					- Ted
Re: [PATCH] fs: Correct SuS compliance for open of large file without options
On Thu, Sep 27, 2007 at 05:28:57PM -0600, Matthew Wilcox wrote:
> On Thu, Sep 27, 2007 at 07:19:27PM -0400, Theodore Tso wrote:
>> Would you accept a patch which causes the deprecated sysfs
>> files/directories to disappear, even if CONFIG_SYSFS_DEPRECATED is
>> defined, via a boot-time parameter?
>
> How about a mount option?  That way people can test without a
> reboot:
>
> 	mount -o remount,deprecated={yes,no} /sys

It would be nice if that were easy to make work, but the problem is
that remounting sysfs doesn't change the entries in the sysfs tree
that have already been made.  We could do something such as creating
a sysfs_create_link_deprecated() call which created a kobject with a
new flag indicating it's deprecated, so it could be filtered out
dynamically when /sys is remounted, or when some file such as
/sys/kernel/deprecated_sysfs_files has 0 or 1 written to it.

The question is whether it's worth it, since we'd have to bloat the
kobject structure by 4 bytes (it currently doesn't have a flags field
from which we could borrow a bit), or whether it's OK just to make
the user reboot.  (I do agree it would be nicer if the user didn't
have to reboot, but most of the time they will need to test the
initrd and init scripts anyway.)

					- Ted
Re: Upgrading datastructures between different filesystem versions
On Wed, Sep 26, 2007 at 06:29:19PM -0500, Sachin Gaikwad wrote:
> Is it not the case that VFS takes care of all filesystems available?
> VFS will see if a particular file belongs to ext3 or ext4 and call
> that FS's drivers to access information??

No, it doesn't quite work that way.  You have to mount a particular
partition using a specific filesystem (i.e., ntfs, vfat, ext2, ext3,
ext4, etc.).  A partition formatted using ext2 can be mounted using
the ext2, ext3, or ext4 filesystem driver.  You can explicitly
specify what filesystem should be used to mount a particular
partition using the -t option to the mount program, or by specifying
a particular filesystem type in the /etc/fstab file.

					- Ted
Re: [RFC 12/26] ext2 white-out support
On Mon, Jul 30, 2007 at 06:13:35PM +0200, Jan Blunck wrote: Introduce white-out support to ext2. Known Bugs: - Needs a reserved inode number for white-outs You picked different reserved inodes for the ext2 and ext3 filesystems. That's good for a NACK right there. The codepoints (i.e., reserved inode numbers, feature bit masks, etc.) for ext2, ext3, and ext4 MUST not overlap. After all, someone might use tune2fs -j to convert an ext2 filesystem to ext3, and is it's REALLY BAD that you're using a reserved inode of 7 for ext2, and 9 for ext3. Also, I note that you have created a new INCOMPAT feature flag support for whiteouts. That's really unfortunate; we try to avoid introducing incompatible feature flags unless absolutely necessary; note that even adding a COMPAT feature flag means that you need a new version of e2fsprogs if you want e2fsck to be willing to touch that filesystem. So --- if you're looking for a way to add whiteout support to ext2/ext3 without needing a feature bit, here's how. We allocate a new inode flag in struct ext3_inode.i_flags: #define EXT2_WHTOUT_FL 0x0004 We also allocate a new field in the ext2 superblock to store the whiteout inode. (Please coordinate with me so it's a superblock field not in use by ext3/ext4, and so it's reserved so that no one else uses it.) The superblock field, call it s_whtout_ino, stores the inode number for the white out inode. When you create a new whiteout file, the code checks sb-s_whtout_ino, and if it is zero, it allocates a new inode, and creates it as a zero-length regular file (i_mode |= S_IFREG) with the EXT2_WHTOUT_FL flag set in the inode, and then store the inode number in sb-s_whtout_ino. If sb-s_whtout_ino is non-zero, you must read in the inode and make sure that the EXT2_WHTOUT_FL is set. If it is not, then allocate a new whiteout inode as described previously. Then link the inode into the directory as before. 
When reading an inode, if the EXT2_WHTOUT_FL flag is set, then set the in-memory mode of the inode to be S_IFWHT. That's pretty much about it. For cleanliness' sake, it would be good if ext2_delete_inode clears sb->s_whtout_ino if the last whiteout link has been deleted, but it's strictly speaking not necessary. If you do it this way, the filesystem is completely backwards compatible; the whiteout files will just appear to be links to a normal zero-length file. I wouldn't bother with setting the directory type field to be DT_WHT, given that they will never be returned to userspace anyway. Regards, - Ted
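The create-or-reuse logic described above can be sketched in miniature. This is a toy in-memory model, not kernel code; the Inode/FS classes are hypothetical scaffolding, and only the names s_whtout_ino and EXT2_WHTOUT_FL come from the proposal:

```python
EXT2_WHTOUT_FL = 0x0004  # proposed inode flag marking the shared whiteout inode

class Inode:
    def __init__(self, ino, flags=0):
        self.ino, self.flags, self.nlink = ino, flags, 0

class FS:
    def __init__(self):
        self.inodes = {}
        self.next_ino = 11      # first non-reserved inode number in ext2/3
        self.s_whtout_ino = 0   # superblock field proposed in the mail

    def _new_whiteout(self):
        # Allocate a fresh zero-length "file" carrying the whiteout flag
        # and remember it in the superblock.
        ino = self.next_ino
        self.next_ino += 1
        self.inodes[ino] = Inode(ino, EXT2_WHTOUT_FL)
        self.s_whtout_ino = ino
        return self.inodes[ino]

    def create_whiteout(self, directory, name):
        # Reuse the shared whiteout inode if the superblock points at a
        # valid one; otherwise (zero, or flag missing) allocate it anew.
        inode = self.inodes.get(self.s_whtout_ino)
        if inode is None or not (inode.flags & EXT2_WHTOUT_FL):
            inode = self._new_whiteout()
        inode.nlink += 1
        directory[name] = inode.ino  # link it like a normal hard link
        return inode
```

The point of the scheme shows up in the link counts: every whiteout in every directory is just another hard link to the one shared inode, so no feature bit is needed.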
Re: [RFH] Partition table recovery
On Mon, Jul 23, 2007 at 10:15:21AM +0200, Rene Herman wrote: On an integrated system like this, do you consider it acceptable to only do the MS-DOS partitions and not the other types that may be present _inside_ those partitions? (MINIX subpartitions, BSD slices, ...). I believe those should really also be done, but this would require keeping more information again. Well, I'm considering this to be an MBR backup scheme, so Minix and BSD slices are legacy systems which are out of scope. If they are busted in the same way as MBR, in terms of not having redundant backups of critical data, then they have a lot fewer excuses than MBR, and they can address that issue in their own way. The number of Linux users that also have Minix and BSD partitions is vanishingly small in any case. I (very) briefly looked at blkid but unless I'm mistaken blkid needs device names? The documentation seems to be missing. When scanning the device for the partition table, we've built a list of partitions with offsets into the device and it would be nice if we could hand the fd and the offset off to something directly. If the program has to construct device names itself there's another truckload of pitfalls right there. Yeah, good point, I'd have to add that support into blkid. It's been on my todo list, but I just haven't gotten around to it yet. It might in fact make sense to just ask the kernel for the partitions on a device and not bother with scanning anything ourselves. I.e., just walk sysfs. Would you agree? This significantly reduces the risk of things getting out of sync, both scanning order and implementation. 
My concern with sysfs is that #1, it won't work on older kernels since you would need to add new fields to back up what we want, and #2, I'm still fundamentally distrustful of sysfs because there isn't a bright line between what is an exported interface that will never change, and something which is considered an internal implementation detail that can change whenever some kernel hacker feels like it. (Or when some kernel hacker is careless...) So as far as I'm concerned sysfs is a terrible, TERRIBLE way to export a published interface where we promise stability to userspace. So I'd just as soon do this in userspace; after all, the partition managers (and there are multiple ones: fdisk, sfdisk, gpart, etc.) are all in userspace, and they need to be in sync with the kernel partition reading code anyway. So one more userspace implementation is in my mind much cleaner than trying to push the needed functionality into sysfs, and then hoping against hope that it doesn't accidentally change in the future. - Ted
Re: [RFH] Partition table recovery
On Sun, Jul 22, 2007 at 07:10:31AM +0300, Al Boldi wrote: Sounds great, but it may be advisable to hook this into the partition modification routines instead of mkfs/fsck. Which would mean that the partition manager could ask the kernel to instruct its fs subsystem to update the backup partition table for each known fs-type that supports such a feature. Well, let's think about this a bit. What are the requirements? 1) The partition manager should be able to explicitly request that a new backup of the partition tables be stashed in each filesystem that has room for such a backup. That way, when the user affirmatively makes a partition table change, it can get backed up in all of the right places automatically. 2) The fsck program should *only* stash a backup of the partition table if there currently isn't one in the filesystem. It may be that the partition table has been corrupted, and so merely doing an fsck should not transfer a current copy of the partition table to the filesystem-specific backup area. It could be that the partition table was only partially recovered, and we don't want to overwrite the previously existing backups except on an explicit request from the system administrator. 3) The mkfs program should automatically create a backup of the current partition table layout. That way we get a backup in the newly created filesystem as soon as it is created. 4) The exact location of the backup may vary from filesystem to filesystem. For ext2/3/4, bytes 512-1023 are always unused, and don't interfere with the boot sector at bytes 0-511, so that's the obvious location. Other filesystems may have that location in use, and some other location might be a better place to store it. Ideally it will be a well-known location that isn't dependent on finding an inode table, or some such, but that may not be possible for all filesystems. OK, so how about this as a solution that meets the above requirements? 
/sbin/partbackup <device> [<fspart>] will scan <device> (i.e., /dev/hda, /dev/sdb, etc.) and create a 512-byte partition backup, using the format I've previously described. If <fspart> is specified on the command line, it will use the blkid library to determine the filesystem type of <fspart>, and then attempt to execute /sbin/partbackupfs.<fstype> to write the partition backup to <fspart>. If <fspart> is '-', then it will write the 512-byte partition table to stdout. If <fspart> is not specified on the command line, /sbin/partbackup will iterate over all partitions in <device>, use the blkid library to attempt to determine the correct filesystem type, and then execute /sbin/partbackupfs.<fstype> if such a backup program exists. /sbin/partbackupfs.<fstype> <fspart> ... is a filesystem-specific program for filesystem type <fstype>. It will assure that <fspart> (i.e., /dev/hda1, /dev/sdb3) is of an appropriate filesystem type, and then read 512 bytes from stdin and write them out to <fspart> in an appropriate place for that filesystem. Partition managers will be encouraged to check to see if /sbin/partbackup exists, and if so, after the partition table is written, to call it with just one argument (i.e., /sbin/partbackup /dev/hdb). They SHOULD provide an option for the user to suppress the backup from happening, but the backup should be the default behavior. An /etc/mkfs.<fstype> program is encouraged to run /sbin/partbackup with two arguments (i.e., /sbin/partbackup /dev/hdb /dev/hdb3) when creating a filesystem. An /etc/fsck.<fstype> program is encouraged to check to see if a partition backup exists (assuming the filesystem supports it), and if not, call /sbin/partbackup with two arguments. A filesystem utility package for a particular filesystem type is encouraged to make the above changes to its mkfs and fsck programs, as well as provide an /sbin/partbackupfs.<fstype> program. I would do this all in userspace, though. 
Is there any reason to get the kernel involved? I don't think so. - Ted
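The dispatch loop that /sbin/partbackup would perform can be sketched as follows. Everything here is hypothetical scaffolding: the probe and helper callables stand in for the blkid library and the per-filesystem /sbin/partbackupfs.<fstype> helpers, so the flow can be exercised without real block devices:

```python
def backup_partitions(partitions, probe, helpers, table):
    """Sketch of the /sbin/partbackup iteration described in the mail.

    partitions: list of partition device names
    probe:      callable(dev) -> filesystem type string, or None
    helpers:    dict mapping fstype -> callable(dev, table) that stashes
                the backup in that filesystem's well-known location
    table:      the 512-byte partition table backup
    Returns the list of partitions that received a backup.
    """
    assert len(table) == 512
    done = []
    for part in partitions:
        fstype = probe(part)
        helper = helpers.get(fstype)
        if helper:                  # silently skip fs types with no helper
            helper(part, table)
            done.append(part)
    return done
```

The design choice worth noting is the open-ended dispatch: adding backup support for a new filesystem means shipping one more helper, with no change to partbackup itself.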
Re: [EXT4 set 4][PATCH 5/5] i_version: noversion mount option to disable inode version updates
On Tue, Jul 10, 2007 at 04:31:44PM -0700, Andrew Morton wrote: On Sun, 01 Jul 2007 03:37:53 -0400 Mingming Cao [EMAIL PROTECTED] wrote: Add a noversion mount option to disable inode version updates. Why is this option being offered to our users? To reduce disk traffic, like noatime? If so, what are the implications of this? What would the user lose? This has been removed in the latest patch set; it's needed only for Lustre, because they set the version field themselves. Lustre needs the inode version to be globally monotonically increasing, so it can order updates between two different files, so it does this itself. NFSv4 only uses i_version to detect changes, and so there's no need to use a global atomic counter for i_version. So the thinking was that there was no point doing the global atomic counter if it was not necessary. Since noversion is Lustre-specific, we've dropped that from the list of patches that we'll push, and so the inode version will only have local per-inode significance, and not have any global ordering properties. We have not actually benchmarked whether or not doing the global ordering actually *matters* in terms of being actually noticeable. If it isn't noticeable, I wouldn't mind changing things so that we always make i_version globally significant (without a mount option), and make life a bit easier for the Lustre folks. Or if some other distributed filesystem requests a globally significant i_version. But we can cross that bridge when we get to it. - Ted
Re: Versioning file system
On Wed, Jul 04, 2007 at 07:32:34PM +0200, Erik Mouw wrote: (sorry for the late reply, just got back from holiday) On Mon, Jun 18, 2007 at 01:29:56PM -0400, Theodore Tso wrote: As I mentioned in my Linux.conf.au presentation a year and a half ago, the main use of Streams in Windows to date has been for system crackers to hide trojan horse code and rootkits so that system administrators couldn't find them. :-) The only valid use of Streams in Windows I've seen was a virus checker that stored a hash of the file in a separate stream. Checking a file was a matter of rehashing it and comparing against the hash stored in the special hash data stream for that particular file. And even that's not a valid use. All the virus would have to do is to infect the file, and then update the special hash data stream. Why is it that when programmers are told about streams as a potential technology choice, it makes their thinking become fuzzy? :-) - Ted
Re: [PATCH 0/6][TAKE5] fallocate system call
On Thu, Jun 28, 2007 at 11:33:42AM -0700, Andrew Morton wrote: Please let us know what you think of Mingming's suggestion of posting all the fallocate patches including the ext4 ones as incremental ones against the -mm. I think Mingming was asking that Ted move the current quilt tree into git, presumably because she's working off git. No, mingming and I both work off of the patch queue (which is also stored in git). So what mingming was asking for exactly was just posting the incremental patches and tagging them appropriately to avoid confusion. I tried building the patch queue earlier in the week and there were multiple oops/panics as I ran things through various regression tests, but that may have been fixed since (the tree was broken over the weekend and I may have grabbed a broken patch series), or it may have been a screw-up on my part feeding them into our testing grid. I haven't had time to try again this week, but I'll try to put together a new tested ext4 patchset over the weekend. I'm not sure what to do, really. The core kernel patches need to be in Ted's tree for testing but that'll create a mess for me. I don't think we have a problem here. What we have now is fine, and it was just people kvetching that Amit reposted patches that were already in -mm and ext4. In any case, the plan is to push all of the core bits into Linus's tree for 2.6.22 once it opens up, which should be Real Soon Now, it looks like. - Ted
Re: [PATCH 0/6][TAKE5] fallocate system call
On Fri, Jun 29, 2007 at 10:29:21AM -0400, Jeff Garzik wrote: In any case, the plan is to push all of the core bits into Linus's tree for 2.6.22 once it opens up, which should be Real Soon Now, it looks like. Presumably you mean 2.6.23. Yes, sorry. I meant once Linus releases 2.6.22, and we would be aiming to merge before the 2.6.23-rc1 window. - Ted
Re: Versioning file system
On Tue, Jun 19, 2007 at 12:26:57AM +0200, Jörn Engel wrote: The main difference appears to be the potential size. Both extended attributes and forks allow for extra data that I neither want nor need. But once the extra space is large enough to hide a rootkit in, it becomes a security problem instead of just something pointless. The other difference is that you can't execute an extended attribute. You can store kvm/qemu, a complete virtualization environment, shared libraries, and other executables all inside forks inside a file, and then execute programs/rootkits out of said file fork(s). As I mentioned in my LCA presentation, one system administrator refused to upgrade beyond Solaris 8 because he thought forks were good for nothing but letting system crackers hide rootkits that wouldn't be detected by programs like tripwire. The question then is why in the world would we want to replicate Sun's mistakes? - Ted
Re: Versioning file system
On Mon, Jun 18, 2007 at 03:48:15PM -0700, Jeremy Allison wrote: Did you ever code up forkdepot? Just wondering? There is a partial implementation lying around somewhere, but there were a number of problems we ran into that were discussed in the slide deck. Basically, if the only program accessing the files containing forks was the Samba program calling the forkdepot library, it worked fine. But if there were other programs (or NFS servers) that were potentially deleting files or moving files around, then things fell apart fairly quickly. Just because I now agree with you that streams are a bad idea doesn't mean the pressure to support them in some way in Samba has gone away :-). What, even with WinFS delaying Microsoft Longwait by years before finally being flushed? :-) - Ted
Re: Versioning file system
On Mon, Jun 18, 2007 at 03:45:24AM -0600, Andreas Dilger wrote: Too bad everyone is spending time on 10 similar-but-slightly-different filesystems. This will likely end up with a bunch of filesystems that implement some easy subset of features, but will not get polished for users or have a full set of features implemented (e.g. ACL, quota, fsck, etc). While I don't think there is a single answer to every question, it does seem that the number of filesystem projects has climbed lately. I view some of the attempts at from-scratch filesystems as ways of testing out various designs as proofs-of-concept. It's a great way of demo'ing one's ideas, to see how well they work. There is a huge chasm between a proof-of-concept and a full production filesystem that has great repair/recovery tools, etc. That's why it's so important to do the POC implementation first, so folks can see how well it works before investing a huge amount of effort to make it production-ready. So I actually think the number of these new filesystem proposals is a *good* thing. It means people are interested in creating new filesystems, and that's all good. Eventually, we'll need to decide which design ideas should be combined, and that may be a little tough on the egos involved, but that's all part of the Darwinian kernel programming model. Not all implementations make it into the kernel mainline. That doesn't mean that the work that was done on the various scheduler proposals was useless; they just helped demonstrate concepts and advanced the debate. Regards, - Ted
Re: Versioning file system
On Mon, Jun 18, 2007 at 09:16:30AM -0700, alan wrote: I just wish that people would learn from the mistakes of others. The MacOS is a prime example of why you do not want to use a forked filesystem, yet some people still seem to think it is a good idea. (Forked filesystems tend to be fragile and do not play well with non-forked filesystems.) Jeremy Allison used to be the one who was always pestering me to add Streams support into ext4, but recently he's admitted that I was right that it was a Very Bad Idea. As I mentioned in my Linux.conf.au presentation a year and a half ago, the main use of Streams in Windows to date has been for system crackers to hide trojan horse code and rootkits so that system administrators couldn't find them. :-) - Ted
Re: Versioning file system
On Mon, Jun 18, 2007 at 10:33:42AM -0700, Jeremy Allison wrote: Yeah, ok - but do you have to rub my nose in it every chance you get ? :-) :-). Well, I just want to make sure people know that Samba isn't asking for it any more, and I don't know of any current requests outstanding from any of the userspace projects. So there's no one we need to ship off to the re-education camps about why filesystem fork/streams are a bad idea. :-) - Ted
Re: Versioning file system
On Mon, Jun 18, 2007 at 02:31:14PM -0700, H. Peter Anvin wrote: And that makes them different from extended attributes, how? Both of these really are nothing but ad hocky syntactic sugar for directories, sometimes combined with in-filesystem support for small data items. There's a good discussion of the issues involved in my LCA 2006 presentation which doesn't seem to be on the LCA 2006 site. Hrm. I'll have to ask that this be fixed. In any case, here it is: http://thunk.org/tytso/forkdepot.odp - Ted
Re: Read/write counts
On Mon, Jun 04, 2007 at 11:02:23AM -0600, Matthew Wilcox wrote: On Mon, Jun 04, 2007 at 09:56:07AM -0700, Bryan Henderson wrote: Programs that assume a full transfer are fairly common, but are universally regarded as either broken or just lazy, and when it does cause a problem, it is far more common to fix the application than the kernel. Linus has explicitly forbidden short reads from being returned. The original poster may get away with it for a specialised case, but for example, signals may not cause a return to userspace with a short read for exactly this reason. Hmm, I'm not sure I would go that far. Per the POSIX specification, we support the optional BSD-style restartable system calls for signals which will avoid short reads; but this is only true if SA_RESTART is passed to sigaction(). Without SA_RESTART, we will indeed return short reads, as required by POSIX. I don't think Linus has said that short reads are always evil; I certainly can't remember him ever making that statement. Do you have a pointer to a LKML message where he's said that? - Ted
Re: Read/write counts
On Mon, Jun 04, 2007 at 08:57:16PM +0200, Roman Zippel wrote: That's the last discussion about signals and I/O I can remember: http://www.ussg.iu.edu/hypermail/linux/kernel/0208.0/0188.html Well, I think Linus was saying that we have to do both (where the signal interrupts and where it doesn't), and I agree with that: There are enough reasons to discourage people from using uninterruptible sleep (this f*cking application won't die when the network goes down) that I don't think this is an issue. We need to handle both cases, and while we can expand on the two cases we have now, we can't remove them. Fortunately, although the -ERESTARTSYS framework is a little awkward (and people can shoot arrows at me for creating it 15 years ago :-), we do have a way of supporting both styles without _too_ much pain. - Ted
Re: [PATCH 4/5] ext4: fallocate support in ext4
On Mon, May 07, 2007 at 03:38:56PM -0700, Andrew Morton wrote: Actually, this is a non-issue. The reason that it is handled for extent-only is that this is the only way to allocate space in the filesystem without doing the explicit zeroing. For other filesystems (including ext3 and ext4 with block-mapped files) the filesystem should return an error (e.g. -EOPNOTSUPP) and glibc will do manual zero-filling of the file in userspace. It can be a bit suboptimal from the layout POV. The reservations code will largely save us here, but kernel support might make it a bit better. Actually, the reservations code won't matter, since glibc will fall back to its current behavior, which is it will do the preallocation by explicitly writing zeros to the file. This will result in the same layout as if we had done the persistent preallocation, but of course it will mean that posix_fallocate() could potentially take a long time if you're a PVR and you're reserving a gig or two for a two-hour movie at high quality. That seems suboptimal, granted, and ideally the application should be warned about this before it calls posix_fallocate(). On the other hand, it's what happens today, all the time, so applications won't be too badly surprised. If we think application programmers badly need to know in advance if posix_fallocate() will be fast or slow, probably the right thing is to define a new fpathconf() configuration option so they can query to see whether a particular file will support a fast posix_fallocate(). I'm not 100% convinced such complexity is really needed, but I'm willing to be convinced; what do folks think? - Ted
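The userspace side of this is easy to try out: glibc's posix_fallocate() reserves the space up front (writing zeros itself when the filesystem has no native preallocation), and Python exposes the same libc call, which makes for a quick experiment. The temp file and the 1 MiB size below are arbitrary choices for illustration:

```python
import os
import tempfile

# Reserve space for a file up front with posix_fallocate(); on a
# filesystem without native preallocation, glibc falls back to the
# slow explicit-zeroing path discussed in the mail.
fd, path = tempfile.mkstemp()
try:
    os.posix_fallocate(fd, 0, 1 << 20)  # ensure bytes 0..1MiB are allocated
    size = os.fstat(fd).st_size         # the file now spans the full range
finally:
    os.close(fd)
    os.unlink(path)
```

Whether the call returns in microseconds (extent-based preallocation) or after writing a gigabyte of zeros is exactly the fast/slow distinction the fpathconf() idea above would let applications query.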
Re: [PATCH 4/5] ext4: fallocate support in ext4
On Mon, May 07, 2007 at 07:02:32PM -0400, Jeff Garzik wrote: Andreas Dilger wrote: On May 07, 2007 13:58 -0700, Andrew Morton wrote: Final point: it's fairly disappointing that the present implementation is ext4-only, and extent-only. I do think we should be aiming at an ext4 bitmap-based implementation and an ext3 implementation. Actually, this is a non-issue. The reason that it is handled for extent-only is that this is the only way to allocate space in the filesystem without doing the explicit zeroing. For other filesystems (including ext3 and Precisely /how/ do you avoid the zeroing issue, for extents? If I posix_fallocate() 20GB on ext4, it damn well better be zeroed, otherwise the implementation is broken. There is a bit in the extent structure which indicates that the extent has not been initialized. When reading from a block where the extent is marked as uninitialized, ext4 returns zeros, to avoid returning the uninitialized contents of the disk, which might contain someone else's love letters, p0rn, or other information which we shouldn't leak out. When writing to an extent which is uninitialized, we may potentially have to split the extent into three extents in the worst case. My understanding is that XFS uses a similar implementation; it's a pretty obvious and standard way to implement allocated-but-not-initialized extents. We thought about supporting persistent preallocation for inodes using indirect blocks, but it would require stealing a bit from each entry in the indirect block, reducing the maximum size of the filesystem by a factor of two (i.e., to 2**31 blocks). It was decided it wasn't worth the complexity, given the tradeoffs. - Ted
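The worst-case three-way split described above is easy to model. This is a toy sketch of the bookkeeping only, not the actual ext4 extent code; extents are modeled as (start, end, state) tuples:

```python
def write_to_uninit(extent, wstart, wend):
    """Model writing [wstart, wend) into an uninitialized extent.

    extent is (start, end, "uninit") covering blocks [start, end).
    Returns the replacement extents: the written middle becomes
    initialized, and any untouched head/tail stays uninitialized,
    giving up to three extents in the worst case.
    """
    start, end, state = extent
    assert state == "uninit" and start <= wstart < wend <= end
    out = []
    if start < wstart:
        out.append((start, wstart, "uninit"))  # untouched head
    out.append((wstart, wend, "init"))         # written middle is real data now
    if wend < end:
        out.append((wend, end, "uninit"))      # untouched tail
    return out
```

A write covering the whole extent produces one piece, a write touching one edge produces two, and a write strictly in the interior produces the three-extent worst case the mail mentions.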
Re: [PATCH 4/5] ext4: fallocate support in ext4
On Mon, May 07, 2007 at 05:41:39PM -0700, Mingming Cao wrote: We could check the total number of fs free blocks before the preallocation happens; if there isn't enough space left, there is no need to bother preallocating. Checking against the fs free blocks is a good idea, since it will prevent the obvious error case where someone tries to preallocate 10GB when there is only 2GB left. But it won't help if there are multiple processes trying to allocate blocks at the same time. On the other hand, that case is probably relatively rare, and in that case, the filesystem was probably going to be left completely full in any case. On Mon, May 07, 2007 at 05:15:41PM -0700, Andrew Morton wrote: Userspace could presumably repair the mess in most situations by truncating the file back again. The kernel cannot do that because there might be live data in amongst there. Actually, the kernel could do it, in that it could simply release all uninitialized extents back to the system. The problem is distinguishing between the uninitialized extents that had just been newly added, versus the ones that had been there from before. (On the other hand, if the filesystem was completely full, releasing uninitialized blocks wouldn't be the worst thing in the world to do, although releasing previously fallocated blocks probably does violate the principle of least surprise, even if it's what the user would have wanted.) On Mon, May 07, 2007 at 05:41:39PM -0700, Mingming Cao wrote: If there is enough free space, we could make a reservation window that has at least N free blocks and mark it not stealable by other files. So later we will not run into the ENOSPC error. Could you really use a single reservation window? When the filesystem is almost full, the free extents are likely going to be scattered all over the disk. 
The general principle of grabbing all of the extents and keeping them in an in-memory data structure, and only adding them to the extent tree at the end, would work, though; I'm just not sure we could do it using the existing reservation window code, since it only supports a single reservation window per file, yes? - Ted
Re: Ext2/3 block remapping tool
On Tue, May 01, 2007 at 12:01:42AM -0600, Andreas Dilger wrote: Except one other issue with online shrinking is that we need to move inodes on occasion and this poses a bunch of other problems over just remapping the data blocks. Well, I did say necessary, and not sufficient. But yes, moving inodes, especially if the inode is currently open gets interesting. I don't think there are that many user space applications that would notice or care if the st_ino of an open file changed out from under them, but there are obviously userspace applications, such as tar, that would most definitely care. - Ted
Re: Ext2/3 block remapping tool
On Tue, May 01, 2007 at 12:52:49PM -0600, Andreas Dilger wrote: I think rm -r does a LOT of this kind of operation, like: stat(.); stat(foo); chdir(foo); stat(.); unlink(*); chdir(..); stat(.) I think find does the same to avoid security problems with malicious path manipulation. Yep, so if you're doing an rm -rf (or any other recursive descent) while we're doing an on-line shrink, it's going to fail. I suppose we could have an in-core inode mapping table that would continue to remap inode numbers until the next reboot. I'm not sure we would want to keep the inode remapping indefinitely, although if we don't it could also end up screwing up NFS as well. Not sure I care, though. :-) - Ted
Re: Ext2/3 block remapping tool
On Fri, Apr 27, 2007 at 12:09:42PM -0600, Andreas Dilger wrote: I'd prefer that such functionality be integrated with Takashi's online defrag tool, since it needs virtually the same functionality. For that matter, this is also very similar to the block-mapped -> extents conversion tool from Aneesh. It doesn't make sense to have so many separate tools for users, especially if they start interfering with each other (i.e. defrag undoes the remapping done by your tool). Yep, in fact, I'm really glad that Jan is working on the remapping tool because if the on-line defrag kernel interfaces don't have the right support for it, then that means we need to fix the on-line defrag patches. :-) While we're at it, someone want to start thinking about on-line shrinking of ext4 filesystems? Again, the same block remapping interfaces for defrag and file access optimizations should also be useful for shrinking filesystems (even if some of the files that need to be relocated are being actively used). If not, that probably means we got the interface wrong. - Ted
Re: [RFC] TileFS - a proposal for scalable integrity checking
On Sun, Apr 29, 2007 at 08:40:42PM -0500, Matt Mackall wrote: chunkfs. The other is reverse maps (aka back pointers) for blocks -> inodes and inodes -> directories that obviate the need to have large amounts of memory to check for collisions. Yes, I missed the fact that you had back pointers for blocks as well as inodes. So the block table in the tile header gets used for determining if a block is free, much like is done with FAT, right? That's a clever system; I like it. It does mean that there are a lot more metadata updates, but since you're not journaling, that should be offset to some extent. IMHO, it's definitely worth a try to see how well it works! - Ted
Re: [RFC] TileFS - a proposal for scalable integrity checking
On Sat, Apr 28, 2007 at 05:05:22PM -0500, Matt Mackall wrote: This is a relatively simple scheme for making a filesystem with incremental online consistency checks of both data and metadata. Overhead can be well under 1% disk space and CPU overhead may also be very small, while greatly improving filesystem integrity. What's your goal here? Is it to speed up fsck's after an unclean shutdown to the point where you don't need to use a journal or some kind of soft updates scheme? Is it to speed up fsck's after the kernel has detected some kind of internal consistency error? Is it to speed up fsck's after you no longer have confidence that random blocks in the filesystem may have gotten corrupted, due to a suspend/resume bug, hard drive failure, reported CRC/parity errors when writing to the device, or reports of massive ECC failures in your memory that could have caused random blocks to have been written with multiple bit flips? The first is relatively easy, but as you move down the list, things get progressively harder, since it's no longer possible to use a per-tile clean bit to assume that you get to skip checking that particular tile or chunk. Divide disk into a bunch of tiles. For each tile, allocate a one block tile header that contains (inode, checksum) pairs for each block in the tile. Unused blocks get marked inode -1, filesystem metadata blocks -2. The first element contains a last-clean timestamp, a clean flag and a checksum for the block itself. For 4K blocks with 32-bit inode and CRC, that's 512 blocks per tile (2MB), with ~.2% overhead. So what happens for files that are bigger than 2MB? Presumably they consist of blocks that must come from more than one tile, right? So is an inode allowed to reference blocks outside of its tile? Or does an inode that needs to span multiple tiles have a local sub-inode in each tile, with back pointers to the parent inode? Note that both design paths have some serious tradeoffs. 
If you allow an inode to span multiple tiles, now you can't check the block allocation data structures without scanning all of the tiles involved. If you have sub-inodes, now you have to have bidirectional pointers and you have to validate those pointers after validating all of the individual tiles. This is one of the toughest aspects of either the chunkfs or tilefs design, and when we discussed chunkfs at past filesystem workshops, we didn't come to any firm conclusions about the best way to solve it, except to acknowledge that it is a hard problem. My personal inclination is to use substantially bigger chunks than the 2MB that you've proposed, and make each of the chunks more like 2 or 4 GB each, and to enforce a rule which says an inode in a chunk can only reference blocks in that local chunk, and to try like mad to keep directories referencing inodes in the same chunk, and inodes referencing blocks within that chunk. When a file is bigger than a chunk, then you will be forced to use indirection pointers that basically say, for offsets 2GB-4GB, reference the inode in chunk , and for 4GB-8GB, check out the inode in chunk , etc. I won't say that this is definitely the best way to do things, but I note that you haven't really addressed this design point, and there are no obvious best ways of handling this. [Note that CRCs are optional so we can cut the overhead in half. I choose CRCs here because they're capable of catching the vast majority of accidental corruptions at a small cost and mostly serve to protect against errors not caught by on-disk ECC (eg cable noise, kernel bugs, cosmic rays). Replacing CRCs with a stronger hash like SHA-n is perfectly doable.] If the goal is just accidental corruptions, CRCs are just fine. If you want better protection against accidental corruption, then the answer is to use a bigger CRC.
Using a cryptographic hash like SHA-n is pure overkill unless you're trying to design protection against a malicious attacker, in which case you've got a much bigger set of problems that you have to address first --- you don't get a cryptographically secure filesystem by replacing a CRC with a SHA-n hash function. Every time we write to a tile, we must mark the tile dirty. To cut down the time to find dirty tiles, the clean bits can be collected into a smaller set of blocks, one clean bitmap block per 64GB of data. Hopefully the clean bitmap blocks are protected by a checksum. After all, this smaller set of clean bitmap blocks is going to be constantly updated as tiles get dirtied, and then cleaned. What if they get corrupted? How does the checker notice? And presumably if there is a CRC that doesn't verify, it would have to check all of the tiles, right? Checking a tile: Read the tile. If clean and current, we're done. Check the tile header checksum. Check the checksum on each block in the tile. Check that metadata blocks are metadata. Check that inodes in tile agree with inode
Re: ChunkFS - measuring cross-chunk references
On Mon, Apr 23, 2007 at 06:02:29PM -0700, Arjan van de Ven wrote: The other thing which we should consider is that chunkfs really requires a 64-bit inode number space, which means either we only allow does it? I'd think it needs a chunk space number and a 32 bit local inode number ;) (same for blocks) But that means that the number which gets exported to userspace via the stat system call will need more than 32 bits worth of ino_t - Ted - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: ChunkFS - measuring cross-chunk references
On Mon, Apr 23, 2007 at 02:53:33PM -0600, Andreas Dilger wrote: With a blocksize of 4KB, a block group would be 128 MB. In the original Chunkfs paper, Val had mentioned 1GB chunks and I believe it will be possible to use 2GB, 4GB or 8GB chunks in the future. As the chunk size increases, the number of cross-chunk references will be reduced, and hence it might be a good idea to present these statistics considering different chunk sizes starting from 512MB up to 2GB. Also, given that cross-chunk references will be more expensive to fix, I can imagine the allocation policy for chunkfs will try to avoid this if possible, further reducing the number of cross-chunk inodes. I guess it should be made more clear whether the cross-chunk references are due to inode block references, or because of e.g. directories referencing inodes in another chunk. It would also be good to distinguish between directories referencing files in another chunk, and directories referencing subdirectories in another chunk (which would be simpler to handle, given the topological restrictions on directories, as compared to files and hard links). There may also be special things we will need to do to handle scenarios such as BackupPC, where if it looks like a directory contains a huge number of hard links to a particular chunk, we'll need to make sure that directory is either created in the right chunk (possibly with hints from the application) or migrated to the right chunk (but this might cause the inode number of the directory to change --- maybe we allow this as long as the directory has never been stat'ed, so that the inode number has never been observed). The other thing which we should consider is that chunkfs really requires a 64-bit inode number space, which means either we only allow it on 64-bit systems, or we need to consider a migration so that even on 32-bit platforms, stat() functions like stat64(), insofar that it uses a stat structure which returns a 64-bit ino_t.
- Ted
Re: Reiser4. BEST FILESYSTEM EVER.
The reason why I ignore the tar+gzip tests is that in the past Hans has rigged the test by using a tar ball which was generated by unpacking a set of kernel sources on a reiser4 filesystem, and then repacking them using tar+gzip. The result was a tar file whose files were optimally laid out so that reiser4 could insert them into the filesystem b-tree without doing any extra work. I can't say for sure whether or not this set of benchmarks has done this (there's not enough information describing the benchmark setup), but the sad fact of the matter is that people trying to pitch Reiser4 have generated for themselves a reputation for using rigged benchmarks. Hans's use of a carefully stacked and ordered tar file (which is the same as stacking a deck of cards), and your repeated use of the bonnie++ benchmarks despite being told that they give a meaningless result (zeros compress very well, and few people are interested in storing files full of zeros), have caused me to look at any benchmarks cited by Reiser4 partisans with a very jaundiced and skeptical eye. Fortunately for you, it's not up to me whether or not Reiser4 makes it into the kernel. And if it works for you, hey, go wild. You can always patch it into your own kernel and encourage others to do the same with respect to getting it tested and adopted. My personal take on it is that Reiser3, Reiser4 and JFS suffer from the same problem, which is to say they have a very small and limited development community, and this was referenced in Novell's decision to drop Reiser3: http://linux.wordpress.com/2006/09/27/suse-102-ditching-reiserfs-as-it-default-fs/ SuSE has deprecated Reiser3 *and* JFS, and I believe quite strongly that it is the failure of those organizations to attract a diverse development community that ultimately doomed them in the long term, both in terms of support as the kernel migrated and in terms of new feature support.
It is for that reason that Hans's personality traits, which tend to drive away the developers who would help him (beyond those that he hires), have been so self-destructive to Reiser4. Read the announcement from Jeff Mahoney of SUSE Labs again; what he pointed out was that reiser3 was getting dropped even though it performs better than ext3 in some scenarios. There are many other considerations, such as a filesystem's robustness in the case of on-disk corruption, long term maintenance as the kernel evolves, availability of developers to provide bug fixes, how well the filesystem performs on systems with multiple cores/CPU's, etc. - Ted
Re: Reiser4. BEST FILESYSTEM EVER.
On Sat, Apr 07, 2007 at 05:44:57PM -0700, [EMAIL PROTECTED] wrote: To get a feel for the performance increases that can be achieved by using compression, we look at the total time (in seconds) to run the test: You mean the performance increase of writing a file which is mostly all zeros? Yawn. - Ted
Re: [RFC] Heads up on sys_fallocate()
On Mon, Mar 05, 2007 at 07:15:33AM -0800, Ulrich Drepper wrote: Well, I'm sure the kernel can do better than the code we have in libc now. The kernel has access to the bitmasks which say which blocks have already been allocated. The libc code does not, and we have to be very simple-minded and simply touch every block. And this means reading it and then writing it back. The kernel would know when the reading part is not necessary. Add to that the block granularity (we use f_bsize as returned from fstatfs, but that's not the best value in some cases) and you have compelling data to have generic code in the kernel. The libc implementation can then go away completely, which is a good thing. You have a very good point; indeed, since we don't export an interface which allows userspace to determine whether or not a block is in use, that does mean a huge amount of churn in the page cache. So maybe it would be worth doing in the kernel as a result, although the libc implementation still wouldn't be able to go away for a long time due to the need to be backwards compatible with older kernels that didn't have this support. Regards, - Ted
Re: end to end error recovery musings
On Mon, Feb 26, 2007 at 04:33:37PM +1100, Neil Brown wrote: Do we want a path in the other direction to handle write errors? The file system could say Don't worry too much if this block cannot be written, just return an error and I will write it somewhere else? This might allow md not to fail a whole drive if there is a single write error. Can someone with knowledge of current disk drive behavior confirm that, for all drives that support bad block sparing, if an attempt to write to a particular spot on disk results in an error due to bad media at that spot, the disk drive will automatically rewrite the sector to a sector in its spare pool, and automatically redirect that sector to the new location? I believe this should always be true, so presumably with all modern disk drives a write error should mean something very serious has happened. (Or that someone was in the middle of reconfiguring a FC network and they're running a kernel that doesn't understand why short-duration FC timeouts should be retried. :-) Or is that completely un-necessary as all modern devices do bad-block relocation for us? Is there any need for a bad-block-relocating layer in md or dm? That's the question. It wouldn't be that hard for filesystems to be able to remap a data block, but (a) it would be much more difficult for fundamental metadata (for example, the inode table), and (b) it's unnecessary complexity if the lower levels in the storage stack should always be doing this for us in the case of media errors anyway. What about corrected-error counts? Drives provide them with SMART. The SCSI layer could provide some as well. Md can do a similar thing to some extent. Whether these are actually useful predictors of pending failure is unclear, but there could be some value. e.g. after a certain number of recovered errors raid5 could trigger a background consistency check, or a filesystem could trigger a background fsck should it support that.
Somewhat off-topic, but my one big regret with how the dm vs. evms competition settled out was that evms had the ability to perform block device snapshots using a non-LVM volume as the base --- and that EVMS allowed a single drive to be partially managed by the LVM layer, and partially managed by evms. What this allowed is the ability to do device snapshots and therefore background fsck's without needing to convert the entire laptop disk to using an LVM solution (since to this day I still don't trust initrd's to always do the right thing when I am constantly replacing the kernel for kernel development). I know, I'm weird; distro users have initrd's that seem to mostly work, and it's only weird developers who try to use bleeding edge kernels with a RHEL4 userspace that suffer, but it's one of the reasons why I've avoided initrd's like the plague --- I've wasted entire days trying to debug problems with the userspace-provided initrd being too old to support newer 2.6 development kernels. In any case, the reason why I bring this up is that it would be really nice if there was a way with a single laptop drive to be able to do snapshots and background fsck's without having to use initrd's with device mapper. - Ted
Re: end to end error recovery musings
On Fri, Feb 23, 2007 at 05:37:23PM -0700, Andreas Dilger wrote: Probably the only sane thing to do is to remember the bad sectors and avoid attempting to read them; that would mean marking automatic versus explicitly requested requests to determine whether or not to filter them against a list of discovered bad blocks. And clearing this list when the sector is overwritten, as it will almost certainly be relocated at the disk level. For that matter, a huge win would be to have the MD RAID layer rewrite only the bad sector (in hopes of the disk relocating it) instead of failing the whole disk. Otherwise, a few read errors on different disks in a RAID set can take the whole system offline. Apologies if this is already done in recent kernels... And having a way of making this list available to both the filesystem and to a userspace utility, so they can more easily deal with doing a forced rewrite of the bad sector, after determining which file is involved and perhaps doing something intelligent (up to and including automatically requesting a backup system to fetch a backup version of the file, and if it can be determined that the file shouldn't have been changed since the last backup, automatically fixing up the corrupted data block :-). - Ted
Re: Fix(es) for ext2 fsync bug
On Wed, Feb 14, 2007 at 11:54:54AM -0800, Valerie Henson wrote: Background: The eXplode file system checker found a bug in ext2 fsync behavior. Do the following: truncate file A, create file B which reallocates one of A's old indirect blocks, fsync file B. If you then crash before file A's metadata is all written out, fsck will complete the truncate for file A... thereby deleting file B's data. So fsync file B doesn't guarantee data is on disk after a crash. Details: It's actually not the case that fsck will complete the truncate for file A. The problem is that while e2fsck is processing indirect blocks in pass 1, the block which is marked as file A's indirect block (but which actually contains file B's data) gets fixed when e2fsck sees block numbers which look like illegal block numbers. So this ends up corrupting file B's data. This is actually a legal end result, BTW, since POSIX states that the result of fsync() is undefined if the system crashes. Technically fsync() did actually guarantee that file B's data was on disk; the problem is that e2fsck would corrupt the data afterwards. Ironically, fsync()'ing file B actually makes it more likely that it might get corrupted afterwards, since normally filesystem metadata gets sync'ed out on 5 second intervals, while data gets sync'ed out at 30 second intervals. * Rearrange order of duplicate block checking and fixing file size in fsck. Not sure how hard this is. (Ted?) It's not a matter of changing when we deal with fixing the file size, as described above. At fsck time, we would need to keep backup copies of any indirect blocks that get modified for whatever reason, and then in pass 1D, when we clone a block that has been claimed by multiple inodes, the inodes which claim the block as a data block should get a copy of the block as it was before it was modified by e2fsck. * Keep a set of still-allocated-on-disk block bitmaps that gets flushed whenever a sync happens. Don't allocate these blocks.
Journaling file systems already have to do this. A list would be more efficient, as others have pointed out. That would work, although knowing when entries can be removed from the list is the hard part. The machinery for knowing when metadata has been updated isn't present in ext2, and that's a fair amount of complexity. You could clear the list/bitmap after the 5 second metadata flush command has been kicked off, or, if you associate a data block with the previous inode's owner, you could clear the entry when the inode's dirty bit has been cleared; but that doesn't completely get rid of the race unless you tie it to when the write has completed (and this assumes write barriers to make sure the block was actually flushed to the media). Another very heavyweight approach would be to simply force a full sync of the filesystem whenever fsync() is called. Not pretty, and without the proper write ordering, the race is still potentially there. I'd say that the best way to handle this is in fsck, but quite frankly it's a relatively low priority bug to handle, since a much simpler workaround is to tell people to use ext3 instead. Regards, - Ted
Re: Fix(es) for ext2 fsync bug
On Thu, Feb 15, 2007 at 10:39:02AM -0600, Dave Kleikamp wrote: It was my understanding from the presentation by Dawson that ext3 and jfs have the same problem. Hmm. If jfs has the problem, it is a bug. jfs is designed to handle this correctly. I'm pretty sure I've fixed at least one bug that eXplode has uncovered in the past. I'm not sure what was mentioned in the presentation though. I'd like any information about current problems in jfs. That was not my understanding of the charts that were presented earlier this week. Ext3 journaling code will deal with this case explicitly, just as jfs does. - Ted
Re: Fix(es) for ext2 fsync bug
On Thu, Feb 15, 2007 at 11:28:46AM -0800, Junfeng Yang wrote: Actually, we found a crash-during-recovery bug in ext3 too. It's a race between resetting the journal superblock and replay of the journal. This bug was fixed by Ted a long time ago (3 years?). That was found in your original work (using UML), not the more recent work using EXPLODE, correct? - Ted
Re: [RFC][PATCH 2/3] Move the file data to the new blocks
On Thu, Feb 08, 2007 at 11:47:39AM +0100, Jan Kara wrote: Well. Do we really? Are we looking for a 100% solution here, or a 90% one? Umm, I think that for ext3, having data on one end of the disk and indirect blocks on the other end of the disk does not quite help (not to mention that it can create bad free space fragmentation over time). I have not measured it but I'd guess that it would erase the effect of moving data closer together. At least for sequential reads.. I don't think anyone is saying we can ignore the metadata; but the fact is, the cleanest solution for 90% of the problem is to use the page cache, and as far as the other 10%, Linus has been pushing us to move at least the directories into the page cache, and it's not insane to consider moving the rest of the metadata into the page cache. At least it's something we should consider carefully. - Ted
Re: Ext3 question: How to compose an inode given a list of data block numbers?
On Thu, Feb 08, 2007 at 02:46:19PM -0800, hlily wrote: Suppose I have a list of data blocks; does Ext3 provide some functions that can help me to build a block list into an inode? If there are no such functions, could someone direct me to the right place in the Ext3 code that adds block numbers to an inode? What are you trying to do? Are you trying to do this from a kernel module, or from user space? - Ted
Re: [PATCH[RFC] kill sysrq-u (emergency remount r/o)
On Mon, Feb 05, 2007 at 09:40:08PM +0100, Jan Engelhardt wrote: On Feb 5 2007 18:32, Christoph Hellwig wrote: in two recent discussions (file_list_lock scalability and remount r/o on suspend) I stumbled over this emergency remount feature. It's not actually useful because it tries a potentially dangerous remount despite writers still being in progress, which we can't get rid of. The current way is to remount things, and return -EROFS to any process that attempts to write(). Unless we want to kill processes to get rid of them [most likely we possibly won't], I am fine with how things are atm. So, what's the dangerous part, actually? The dangerous part is that we change f->f_mode for all open files without regard for whether there might be any writes underway at the time. This isn't *serious*, although the results might be a little strange and it might result in a confused return from write(2). More seriously, mark_files_ro() in super.c *only* changes f->f_mode and doesn't deal with the possibility that the file might be mapped read-write. For filesystems that do delayed allocation, I'm not at all convinced that an emergency read-only will result in the filesystem doing anything at all sane, depending on what else the filesystem might do when it is forced into the read-only state. sysrq+u is helpful. It is like sysrq+s (make sure no further writes go to disk). I agree it is useful, but if we're going to do it we really should do it right. We should have real revoke() functionality on file descriptors, which revokes all of the mmap()'s (any attempt to write into a previously read/write mmap will cause a SEGV) as well as changing f_mode, and then use that to implement emergency read-only remount. - Ted
Re: [PATCH 21/35] Unionfs: Inode operations
On Tue, Dec 05, 2006 at 01:50:17PM -0800, Andrew Morton wrote: This /* * Lorem ipsum dolor sit amet, consectetur * adipisicing elit, sed do eiusmod tempor * incididunt ut labore et dolore magna aliqua. */ is probably the most common, and is what I use when forced to descrog comments. This is what I normally do by default, unless it's a one-line comment, in which case my preference is usually for this: /* Lorem ipsum dolor sit amet, consectetur */ I'm not convinced we really do _need_ to standardize on comment styles (I can foresee thousands and thousands of trivial patches being submitted, and we'd probably be better off encouraging people to spend time actually improving the documentation instead of reformatting it :-), but if we're going to standardize, that would be my vote. - Ted
Re: [Ext2-devel] Re: ext3 for 2.4
On Thu, May 17, 2001 at 03:00:28PM -0400, Jeff Garzik wrote: AFAIK the original stated intention of ext3 was cd linux/fs cp -a ext2 ext3 # hack on ext3 That leaves ext2 in ultra-stability, no-patches-unless-absolutely-necessary mode. IMHO, prove a new feature, like directories in page cache, journaling, etc. in ext3 first. Then maybe after a year of testing, if people actually care, backport those features to ext2. Alternatively, once we get ext3 with just journaling stable (and with an option to not do journaling at all), simply do something like this: cd linux/fs rm -rf ext2 mv ext3 ext2 cp -r ext2 ext3 # hack hack hack on ext3 and add even more features So ext3 is always the development version, and ext2 is the stable version. - Ted