Re: very poor ext3 write performance on big filesystems?

2008-02-18 Thread Theodore Tso
On Mon, Feb 18, 2008 at 04:57:25PM +0100, Andi Kleen wrote: Use cp or a tar pipeline to move the files. Are you sure cp handles hardlinks correctly? I know tar does, but I have my doubts about cp. I *think* GNU cp does the right thing with --preserve=links. I'm not 100% sure, though ---

Re: very poor ext3 write performance on big filesystems?

2008-02-18 Thread Theodore Tso
On Mon, Feb 18, 2008 at 05:16:55PM +0100, Tomasz Chmielewski wrote: Theodore Tso schrieb: I'd really need to know exactly what kind of operations you were trying to do that were causing problems before I could say for sure. Yes, you said you were removing unneeded files, but how were you

Re: [RFC] Parallelize IO for e2fsck

2008-01-28 Thread Theodore Tso
On Mon, Jan 28, 2008 at 07:30:05PM +, Pavel Machek wrote: As user pages are always in highmem, this should be easy to decide: only send SIGDANGER when highmem is full. (Yes, there are inodes/dentries/file descriptors in lowmem, but I doubt apps will respond to SIGDANGER by closing

Re: [RFC] Parallelize IO for e2fsck

2008-01-26 Thread Theodore Tso
On Fri, Jan 25, 2008 at 05:55:51PM -0800, Bryan Henderson wrote: I was surprised to see AIX do late allocation by default, because IBM's traditional style is bulletproof systems. A system where a process can be killed at unpredictable times because of resource demands of unrelated

Re: [RFC] ext3 freeze feature

2008-01-25 Thread Theodore Tso
On Fri, Jan 25, 2008 at 10:34:25AM -0600, Eric Sandeen wrote: But it was this concern which is why ext3 never exported freeze functionality to userspace, even though other commercial filesystems do support this. It wasn't that it wasn't considered, but the concern about whether or not it

Re: [RFC] ext3 freeze feature

2008-01-25 Thread Theodore Tso
On Fri, Jan 25, 2008 at 03:18:51PM +0300, Dmitri Monakhov wrote: First of all, Linux already has at least one open-source (dm-snap) and several commercial snapshot solutions. Yes, but it requires that the filesystem be stored under LVM. Unlike what EVMS v1 allowed us to do, we can't currently

Re: [RFC] Parallelize IO for e2fsck

2008-01-24 Thread Theodore Tso
On Fri, Jan 25, 2008 at 01:08:09AM +0200, Adrian Bunk wrote: In practice, there is a small number of programs that are both the common memory hogs and should be able to reduce their memory consumption by 10% or 20% without big problems when requested (e.g. Java VMs, Firefox and databases come

Re: [RFD] Incremental fsck

2008-01-12 Thread Theodore Tso
On Wed, Jan 09, 2008 at 02:52:14PM +0300, Al Boldi wrote: Ok, but let's look at this a bit more opportunistically / optimistically. Even after a black-out shutdown, the corruption is pretty minimal, using ext3fs at least. After an unclean shutdown, assuming you have decent hardware that doesn't

Re: [RFC 0/2] readdir() as an inode operation

2007-10-31 Thread Theodore Tso
On Tue, Oct 30, 2007 at 04:26:04PM +0100, Jan Kara wrote: This is a first try to move readdir() to become an inode operation. This is necessary for a VFS implementation of something like union-mounts where a readdir() needs to read the directory contents of multiple directories. Besides

Re: Does 32.1% non-contiguous mean severely fragmented?

2007-10-23 Thread Theodore Tso
On Tue, Oct 23, 2007 at 07:38:20PM +0900, Tetsuo Handa wrote: Are you sure the file isn't getting written by some background tasks that you weren't aware of? This seems very strange; what virtualization software are you using? VMware, Xen, KVM? I'm using VMware Workstation 6.0.0 build

Re: Does 32.1% non-contiguous mean severely fragmented?

2007-10-22 Thread Theodore Tso
On Mon, Oct 22, 2007 at 08:58:11PM +0900, Tetsuo Handa wrote: --- Start VM --- --- Suspend VM --- [EMAIL PROTECTED] Ubuntu7.10]# filefrag Ubuntu7.10.vmem Ubuntu7.10.vmem: 751 extents found, perfection would be 5 extents [EMAIL PROTECTED] Ubuntu7.10]# sync [EMAIL PROTECTED] Ubuntu7.10]#

Re: Does 32.1% non-contiguous mean severely fragmented?

2007-10-20 Thread Theodore Tso
On Sat, Oct 20, 2007 at 12:39:33PM +0900, Tetsuo Handa wrote: Theodore Tso wrote: beginning of every single block group. You have a small number of files on your system (349) occupying an average of 348 megabytes. So it's not at all surprising that the contiguous percentage is 32%. I see

Re: Does "32.1% non-contiguous" mean severely fragmented?

2007-10-19 Thread Theodore Tso
On Fri, Oct 19, 2007 at 10:49:03AM +0900, Tetsuo Handa wrote: /data/VMware: 349/19546112 files (32.1% non-contiguous), 31019203/39072080 blocks Does non-contiguous mean fragmented? If so, where is ext3defrag? Not necessarily; it just means that 32% of your files have at least one
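The summary line quoted above can be read mechanically. The following is only a rough sketch in Python, assuming the line always has the shape shown in the quoted e2fsck output; the function name parse_fsck_summary and the dictionary keys are made up for illustration and are not part of e2fsprogs. It shows that "32.1% non-contiguous" describes a fraction of the files in use, not of the disk.

```python
import re

def parse_fsck_summary(line):
    """Parse an e2fsck-style summary line (format assumed from the quoted output)."""
    pattern = r"(\d+)/(\d+) files \(([\d.]+)% non-contiguous\), (\d+)/(\d+) blocks"
    m = re.search(pattern, line)
    if m is None:
        raise ValueError("unrecognized summary line")
    files_in_use = int(m.group(1))
    pct = float(m.group(3))
    return {
        "files_in_use": files_in_use,
        "total_inodes": int(m.group(2)),
        "noncontig_pct": pct,
        # The percentage applies to files in use, so this is a file count.
        "noncontig_files": round(files_in_use * pct / 100),
        "blocks_in_use": int(m.group(4)),
        "total_blocks": int(m.group(5)),
    }

summary = parse_fsck_summary(
    "/data/VMware: 349/19546112 files (32.1% non-contiguous), "
    "31019203/39072080 blocks"
)
print(summary["noncontig_files"], "of", summary["files_in_use"], "files")
```

With only 349 files in use, the percentage corresponds to roughly a hundred files, which is why a figure like 32% need not indicate a badly fragmented filesystem.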

Re: [PATCH 13/32] IGET: Stop EXT2 from using iget() and read_inode() [try #2]

2007-10-05 Thread Theodore Tso
On Thu, Oct 04, 2007 at 04:57:08PM +0100, David Howells wrote: Stop the EXT2 filesystem from using iget() and read_inode(). Replace ext2_read_inode() with ext2_iget(), and call that instead of iget(). ext2_iget() then uses iget_locked() directly and returns a proper error code instead of an

Re: Upgrading datastructures between different filesystem versions

2007-09-28 Thread Theodore Tso
On Fri, Sep 28, 2007 at 02:31:46PM +0100, Christoph Hellwig wrote: On Fri, Sep 28, 2007 at 03:11:00PM +0200, Erik Mouw wrote: There are however ways to confuse it: if you reformat an ext3 filesystem to reiserfs (version 3), mounting that filesystem without -t reiserfs will trick mount(8)

Re: [PATCH] fs: Correct SuS compliance for open of large file without options

2007-09-27 Thread Theodore Tso
On Thu, Sep 27, 2007 at 04:19:12PM +0100, Alan Cox wrote: Well it's not my call, just seems like a really bad idea to change the error value. You can't claim full coverage for such testing anyway, it's one of those things that people will complain about two releases later saying it broke

Re: [PATCH] fs: Correct SuS compliance for open of large file without options

2007-09-27 Thread Theodore Tso
On Thu, Sep 27, 2007 at 10:59:17AM -0700, Greg KH wrote: Come on now, I'm _very_ tired of this kind of discussion. Please go read the documentation on how to _use_ sysfs from userspace in such a way that you can properly access these data structures so that no breakage occurs. I've read it;

Re: [PATCH] fs: Correct SuS compliance for open of large file without options

2007-09-27 Thread Theodore Tso
On Thu, Sep 27, 2007 at 05:28:57PM -0600, Matthew Wilcox wrote: On Thu, Sep 27, 2007 at 07:19:27PM -0400, Theodore Tso wrote: Would you accept a patch which causes the deprecated sysfs files/directories to disappear, even if CONFIG_SYS_DEPRECATED is defined, via a boot-time parameter

Re: Upgrading datastructures between different filesystem versions

2007-09-26 Thread Theodore Tso
On Wed, Sep 26, 2007 at 06:29:19PM -0500, Sachin Gaikwad wrote: Is it not the case that VFS takes care of all filesystems available ? VFS will see if a particular file belongs to ext3 or ext4 and call that FS's drivers to access information ?? No, it doesn't quite work that way. You have to

Re: [RFC 12/26] ext2 white-out support

2007-07-30 Thread Theodore Tso
On Mon, Jul 30, 2007 at 06:13:35PM +0200, Jan Blunck wrote: Introduce white-out support to ext2. Known Bugs: - Needs a reserved inode number for white-outs You picked different reserved inodes for the ext2 and ext3 filesystems. That's good for a NACK right there. The codepoints (i.e.,

Re: [RFH] Partition table recovery

2007-07-23 Thread Theodore Tso
On Mon, Jul 23, 2007 at 10:15:21AM +0200, Rene Herman wrote: On an integrated system like this, do you consider it acceptable to only do the MS-DOS partitions and not the other types that may be present _inside_ those partitions? (MINIX subpartitions, BSD slices, ...). I believe those

Re: [RFH] Partition table recovery

2007-07-22 Thread Theodore Tso
On Sun, Jul 22, 2007 at 07:10:31AM +0300, Al Boldi wrote: Sounds great, but it may be advisable to hook this into the partition modification routines instead of mkfs/fsck. Which would mean that the partition manager could ask the kernel to instruct its fs subsystem to update the backup

Re: [EXT4 set 4][PATCH 5/5] i_version: noversion mount option to disable inode version updates

2007-07-11 Thread Theodore Tso
On Tue, Jul 10, 2007 at 04:31:44PM -0700, Andrew Morton wrote: On Sun, 01 Jul 2007 03:37:53 -0400 Mingming Cao wrote: Add a noversion mount option to disable inode version updates. Why is this option being offered to our users? To reduce disk traffic, like noatime?

Re: Versioning file system

2007-07-04 Thread Theodore Tso
On Wed, Jul 04, 2007 at 07:32:34PM +0200, Erik Mouw wrote: (sorry for the late reply, just got back from holiday) On Mon, Jun 18, 2007 at 01:29:56PM -0400, Theodore Tso wrote: As I mentioned in my Linux.conf.au presentation a year and a half ago, the main use of Streams in Windows to date

Re: [PATCH 0/6][TAKE5] fallocate system call

2007-06-29 Thread Theodore Tso
On Thu, Jun 28, 2007 at 11:33:42AM -0700, Andrew Morton wrote: Please let us know what you think of Mingming's suggestion of posting all the fallocate patches including the ext4 ones as incremental ones against the -mm. I think Mingming was asking that Ted move the current quilt tree into

Re: [PATCH 0/6][TAKE5] fallocate system call

2007-06-29 Thread Theodore Tso
On Fri, Jun 29, 2007 at 10:29:21AM -0400, Jeff Garzik wrote: In any case, the plan is to push all of the core bits into Linus tree for 2.6.22 once it opens up, which should be Real Soon Now, it looks like. Presumably you mean 2.6.23. Yes, sorry. I meant once Linus releases 2.6.22, and we

Re: Versioning file system

2007-06-19 Thread Theodore Tso
On Tue, Jun 19, 2007 at 12:26:57AM +0200, Jörn Engel wrote: The main difference appears to be the potential size. Both extended attributes and forks allow for extra data that I neither want or need. But once the extra space is large enough to hide a rootkit in, it becomes a security problem

Re: Versioning file system

2007-06-19 Thread Theodore Tso
On Mon, Jun 18, 2007 at 03:48:15PM -0700, Jeremy Allison wrote: Did you ever code up forkdepot ? Just wondering ? There is a partial implementation lying around somewhere, but there were a number of problems we ran into that were discussed in the slide deck. Basically, if the only program

Re: Versioning file system

2007-06-18 Thread Theodore Tso
On Mon, Jun 18, 2007 at 03:45:24AM -0600, Andreas Dilger wrote: Too bad everyone is spending time on 10 similar-but-slightly-different filesystems. This will likely end up with a bunch of filesystems that implement some easy subset of features, but will not get polished for users or have a

Re: Versioning file system

2007-06-18 Thread Theodore Tso
On Mon, Jun 18, 2007 at 09:16:30AM -0700, alan wrote: I just wish that people would learn from the mistakes of others. The MacOS is a prime example of why you do not want to use a forked filesystem, yet some people still seem to think it is a good idea. (Forked filesystems tend to be

Re: Versioning file system

2007-06-18 Thread Theodore Tso
On Mon, Jun 18, 2007 at 10:33:42AM -0700, Jeremy Allison wrote: Yeah, ok - but do you have to rub my nose in it every chance you get ? :-) :-). Well, I just want to make sure people know that Samba isn't asking for it any more, and I don't know of any current requests outstanding from any

Re: Versioning file system

2007-06-18 Thread Theodore Tso
On Mon, Jun 18, 2007 at 02:31:14PM -0700, H. Peter Anvin wrote: And that makes them different from extended attributes, how? Both of these really are nothing but ad hocky syntactic sugar for directories, sometimes combined with in-filesystem support for small data items. There's a good

Re: Read/write counts

2007-06-04 Thread Theodore Tso
On Mon, Jun 04, 2007 at 11:02:23AM -0600, Matthew Wilcox wrote: On Mon, Jun 04, 2007 at 09:56:07AM -0700, Bryan Henderson wrote: Programs that assume a full transfer are fairly common, but are universally regarded as either broken or just lazy, and when it does cause a problem, it is far

Re: Read/write counts

2007-06-04 Thread Theodore Tso
On Mon, Jun 04, 2007 at 08:57:16PM +0200, Roman Zippel wrote: That's the last discussion about signals and I/O I can remember: http://www.ussg.iu.edu/hypermail/linux/kernel/0208.0/0188.html Well, I think Linus was saying that we have to do both (where the signal interrupts and where it

Re: [PATCH 4/5] ext4: fallocate support in ext4

2007-05-07 Thread Theodore Tso
On Mon, May 07, 2007 at 03:38:56PM -0700, Andrew Morton wrote: Actually, this is a non-issue. The reason that it is handled for extent-only is that this is the only way to allocate space in the filesystem without doing the explicit zeroing. For other filesystems (including ext3 and

Re: [PATCH 4/5] ext4: fallocate support in ext4

2007-05-07 Thread Theodore Tso
On Mon, May 07, 2007 at 07:02:32PM -0400, Jeff Garzik wrote: Andreas Dilger wrote: On May 07, 2007 13:58 -0700, Andrew Morton wrote: Final point: it's fairly disappointing that the present implementation is ext4-only, and extent-only. I do think we should be aiming at an ext4 bitmap-based

Re: [PATCH 4/5] ext4: fallocate support in ext4

2007-05-07 Thread Theodore Tso
On Mon, May 07, 2007 at 05:41:39PM -0700, Mingming Cao wrote: We could check the total number of fs free blocks account before preallocation happens, if there isn't enough space left, there is no need to bother preallocating. Checking against the fs free blocks is a good idea, since it will

Re: Ext2/3 block remapping tool

2007-05-01 Thread Theodore Tso
On Tue, May 01, 2007 at 12:01:42AM -0600, Andreas Dilger wrote: Except one other issue with online shrinking is that we need to move inodes on occasion and this poses a bunch of other problems over just remapping the data blocks. Well, I did say necessary, and not sufficient. But yes, moving

Re: Ext2/3 block remapping tool

2007-05-01 Thread Theodore Tso
On Tue, May 01, 2007 at 12:52:49PM -0600, Andreas Dilger wrote: I think rm -r does a LOT of this kind of operation, like: stat(.); stat(foo); chdir(foo); stat(.); unlink(*); chdir(..); stat(.) I think find does the same to avoid security problems with malicious path manipulation. Yep, so

Re: Ext2/3 block remapping tool

2007-04-30 Thread Theodore Tso
On Fri, Apr 27, 2007 at 12:09:42PM -0600, Andreas Dilger wrote: I'd prefer that such functionality be integrated with Takashi's online defrag tool, since it needs virtually the same functionality. For that matter, this is also very similar to the block-mapped - extents tool from Aneesh. It

Re: [RFC] TileFS - a proposal for scalable integrity checking

2007-04-30 Thread Theodore Tso
On Sun, Apr 29, 2007 at 08:40:42PM -0500, Matt Mackall wrote: chunkfs. The other is reverse maps (aka back pointers) for blocks - inodes and inodes - directories that obviate the need to have large amounts of memory to check for collisions. Yes, I missed the fact that you had back pointers for

Re: [RFC] TileFS - a proposal for scalable integrity checking

2007-04-29 Thread Theodore Tso
On Sat, Apr 28, 2007 at 05:05:22PM -0500, Matt Mackall wrote: This is a relatively simple scheme for making a filesystem with incremental online consistency checks of both data and metadata. Overhead can be well under 1% disk space and CPU overhead may also be very small, while greatly

Re: ChunkFS - measuring cross-chunk references

2007-04-24 Thread Theodore Tso
On Mon, Apr 23, 2007 at 06:02:29PM -0700, Arjan van de Ven wrote: The other thing which we should consider is that chunkfs really requires a 64-bit inode number space, which means either we only allow does it? I'd think it needs a chunk space number and a 32 bit local inode number ;)

Re: ChunkFS - measuring cross-chunk references

2007-04-23 Thread Theodore Tso
On Mon, Apr 23, 2007 at 02:53:33PM -0600, Andreas Dilger wrote: With a blocksize of 4KB, a block group would be 128 MB. In the original Chunkfs paper, Valh had mentioned 1GB chunks and I believe it will be possible to use 2GB, 4GB or 8GB chunks in the future. As the chunk size increases
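The 128 MB figure follows from the ext2/ext3 layout, where each block group is tracked by a single block-sized bitmap with one bit per block. A small sketch of that arithmetic in Python (the function name block_group_bytes is made up for illustration):

```python
def block_group_bytes(block_size):
    """Size in bytes of a block group whose bitmap fits in one block."""
    # One bitmap block holds block_size * 8 bits, one bit per block.
    blocks_per_group = block_size * 8
    return blocks_per_group * block_size

MIB = 1024 * 1024
size_mib = block_group_bytes(4096) // MIB
print(size_mib, "MiB per block group at a 4 KiB block size")
```

The same formula gives 8 MiB for 1 KiB blocks, so the group size grows with the square of the block size.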

Re: Reiser4. BEST FILESYSTEM EVER.

2007-04-08 Thread Theodore Tso
The reason why I ignore the tar+gzip tests is that in the past Hans has rigged the test by using a tar ball which was generated by unpacking a set of kernel sources on a reiser4 filesystem, and then repacking them using tar+gzip. The result was a tar file whose files were optimally laid out so

Re: Reiser4. BEST FILESYSTEM EVER.

2007-04-07 Thread Theodore Tso
On Sat, Apr 07, 2007 at 05:44:57PM -0700, [EMAIL PROTECTED] wrote: To get a feel for the performance increases that can be achieved by using compression, we look at the total time (in seconds) to run the test: You mean the performance increases of writing a file which is mostly all zeros?

Re: [RFC] Heads up on sys_fallocate()

2007-03-05 Thread Theodore Tso
On Mon, Mar 05, 2007 at 07:15:33AM -0800, Ulrich Drepper wrote: Well, I'm sure the kernel can do better than the code we have in libc now. The kernel has access to the bitmasks which say which blocks have already been allocated. The libc code does not and we have to be very simple-minded and

Re: end to end error recovery musings

2007-02-26 Thread Theodore Tso
On Mon, Feb 26, 2007 at 04:33:37PM +1100, Neil Brown wrote: Do we want a path in the other direction to handle write errors? The file system could say "Don't worry too much if this block cannot be written, just return an error and I will write it somewhere else"? This might allow md not to fail

Re: end to end error recovery musings

2007-02-23 Thread Theodore Tso
On Fri, Feb 23, 2007 at 05:37:23PM -0700, Andreas Dilger wrote: Probably the only sane thing to do is to remember the bad sectors and avoid attempting reading them; that would mean marking automatic versus explicitly requested requests to determine whether or not to filter them against

Re: Fix(es) for ext2 fsync bug

2007-02-15 Thread Theodore Tso
On Wed, Feb 14, 2007 at 11:54:54AM -0800, Valerie Henson wrote: Background: The eXplode file system checker found a bug in ext2 fsync behavior. Do the following: truncate file A, create file B which reallocates one of A's old indirect blocks, fsync file B. If you then crash before file A's

Re: Fix(es) for ext2 fsync bug

2007-02-15 Thread Theodore Tso
On Thu, Feb 15, 2007 at 10:39:02AM -0600, Dave Kleikamp wrote: It was my understanding from the presentation of Dawson that ext3 and jfs have the same problem. Hmm. If jfs has the problem, it is a bug. jfs is designed to handle this correctly. I'm pretty sure I've fixed at least one bug

Re: Fix(es) for ext2 fsync bug

2007-02-15 Thread Theodore Tso
On Thu, Feb 15, 2007 at 11:28:46AM -0800, Junfeng Yang wrote: Actually, we found a crash-during-recovery bug in ext3 too. It's a race between resetting the journal super block and replay of the journal. This bug was fixed by Ted long time ago (3 years?). That was found in your original

Re: [RFC][PATCH 2/3] Move the file data to the new blocks

2007-02-11 Thread Theodore Tso
On Thu, Feb 08, 2007 at 11:47:39AM +0100, Jan Kara wrote: Well. Do we really? Are we looking for a 100% solution here, or a 90% one? Umm, I think that for ext3 having data on one end of the disk and indirect blocks on the other end of the disk does not quite help (not mentioning that

Re: Ext3 question: How to compose an inode given a list of data block numbers?

2007-02-08 Thread Theodore Tso
On Thu, Feb 08, 2007 at 02:46:19PM -0800, hlily wrote: Suppose I have a list of data blocks, does Ext3 provide some functions that can help me to build a block list into an inode? If no such functions, could someone direct me to the right place in Ext3 code that add block numbers to an

Re: [PATCH[RFC] kill sysrq-u (emergency remount r/o)

2007-02-05 Thread Theodore Tso
On Mon, Feb 05, 2007 at 09:40:08PM +0100, Jan Engelhardt wrote: On Feb 5 2007 18:32, Christoph Hellwig wrote: in two recent discussions (file_list_lock scalability and remount r/o on suspend) I stumbled over this emergency remount feature. It's not actually useful because it tries a

Re: [PATCH 21/35] Unionfs: Inode operations

2006-12-07 Thread Theodore Tso
On Tue, Dec 05, 2006 at 01:50:17PM -0800, Andrew Morton wrote: This /* * Lorem ipsum dolor sit amet, consectetur * adipisicing elit, sed do eiusmod tempor * incididunt ut labore et dolore magna aliqua. */ is probably the most common, and is what I use

Re: [Ext2-devel] Re: ext3 for 2.4

2001-05-17 Thread Theodore Tso
On Thu, May 17, 2001 at 03:00:28PM -0400, Jeff Garzik wrote: AFAIK the original stated intention of ext3 was cd linux/fs cp -a ext2 ext3 # hack on ext3 That leaves ext2 in ultra-stability, no-patches-unless-absolutely-necessary mode. IMHO prove a new feature, like