Bug#654206: [PATCH] ext4: Report max_batch_time option correctly
On Mon, Jan 02, 2012 at 02:13:02PM +, Ben Hutchings wrote:
> Currently the value reported for max_batch_time is really the value of
> min_batch_time.
>
> Reported-by: Russell Coker russ...@coker.com.au
> Signed-off-by: Ben Hutchings b...@decadent.org.uk

Applied, thanks.

- Ted

--
To UNSUBSCRIBE, email to debian-kernel-requ...@lists.debian.org with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org
Archive: http://lists.debian.org/20120105022310.ga24...@thunk.org
Bug#615998: linux-image-2.6.32-5-xen-amd64: Repeatable kernel BUG at fs/jbd2/commit.c:534 from Postfix on ext4
> My basic impression is that the use of data=journal can help reduce the
> risk (slightly) of serious corruption to some kinds of databases when
> the application does not provide appropriate syncs or journalling on
> its own (i.e., such as text-based Wiki database files).

Yes, although if the application has index files that have to be updated at the same time, there is no guarantee that the changes which survive after a system failure (either a crash or a power fail) will be consistent, unless the application is doing proper application-level journalling or using some other structured approach.

To sum up, the only additional guarantee data=journal offers over data=ordered is a total ordering of all I/O operations. That is, if you do a sequence of data and metadata operations, then you are guaranteed that after a crash you will see the file system in a state corresponding exactly to your sequence terminated at some (arbitrary) point. Data writes are broken up into a page-sized, page-aligned sequence of writes for the purposes of this model.

data=journal can also make the fsync() operation faster, since it will involve fewer seeks (although it will require greater write bandwidth). Depending on the write bandwidth, you really need to benchmark things to be sure, though.

- Ted

--
Archive: http://lists.debian.org/20110628141650.gh2...@thunk.org
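[Editor's note] The application-level discipline Ted describes above can be sketched concretely. The following is a minimal illustration (mine, not from the thread) of the write/fsync/rename pattern an application needs so that, after a crash, readers see either the old or the new contents in full, regardless of the journalling mode; `atomic_replace` is a hypothetical helper name:

```c
/* Sketch: atomically replace a file's contents.  Write a temporary
 * file, fsync it so data and metadata are on disk, then rename over
 * the old name (rename is atomic on POSIX file systems). */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int atomic_replace(const char *path, const char *data, size_t len)
{
    char tmp[4096];
    int fd;

    if (snprintf(tmp, sizeof(tmp), "%s.new", path) >= (int)sizeof(tmp))
        return -1;
    fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0666);
    if (fd < 0)
        return -1;
    if (write(fd, data, len) != (ssize_t)len || fsync(fd) < 0) {
        close(fd);
        unlink(tmp);
        return -1;
    }
    if (close(fd) < 0)
        return -1;
    return rename(tmp, path);   /* atomic swap of names */
}
```

Note that this protects a single file; as Ted says, keeping a data file and its index files mutually consistent needs application-level journalling on top of this.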
Bug#615998: linux-image-2.6.32-5-xen-amd64: Repeatable kernel BUG at fs/jbd2/commit.c:534 from Postfix on ext4
On Mon, Jun 27, 2011 at 05:30:11PM +0200, Lukas Czerner wrote:
> I've found some. So although data=journal users are a minority, there
> are some. That being said, I agree with you that we should do something
> about it - either state that we want to fully support data=journal -
> and then we should really do better with testing it - or deprecate it
> and remove it (which would save us some complications in the code). I
> would be slightly in favor of removing it (code simplicity, fewer
> options for the admin to configure, fewer options for us to test; some
> users I've come across actually were not quite sure why they were using
> it - they just thought it looked safer).

Hmm... FYI, I hope to be able to bring automated testing for ext4 on line later this summer (there's a testing person at Google who has signed up to work on setting this up as his 20% project). The test matrix I have given him includes data=journal, so we will be getting better testing in the near future.

At least historically, data=journal was the *simpler* case, and was the first thing supported by ext4. (data=ordered required revoke handling, which didn't land for six months or so.) So I'm not really convinced that removing it buys us that much code simplification.

That being said, it is true that data=journal isn't necessarily faster. For heavy disk-bound workloads, it can be slower. So I can imagine adding some documentation that warns people not to use data=journal unless they really know what they are doing, but at least personally, I'm a bit reluctant to dispense with a bug report like this by saying, oh, that feature should be deprecated.

Regards,

- Ted

--
Archive: http://lists.debian.org/20110627160140.gc2...@thunk.org
Bug#615998: linux-image-2.6.32-5-xen-amd64: Repeatable kernel BUG at fs/jbd2/commit.c:534 from Postfix on ext4
On Thu, Jun 23, 2011 at 01:32:48PM -0500, Moffett, Kyle D wrote:
> Ted, since this new iteration has no customer data, passwords, keys, or
> any other private data, I'm going to try to get approval to release an
> exact EC2 image of this system for you to test with, including the fake
> data volume that I triggered the problem on.

That would be great! Approximately how big are the images involved?

- Ted

--
Archive: http://lists.debian.org/20110623222330.gc3...@thunk.org
Bug#615998: linux-image-2.6.32-5-xen-amd64: Repeatable kernel BUG at fs/jbd2/commit.c:534 from Postfix on ext4
On Tue, Apr 05, 2011 at 10:30:11AM -0500, Moffett, Kyle D wrote:
> > Couple of questions which might give me some clues: (a) was this a
> > natively formatted ext4 file system, or an ext3 file system which was
> > later converted to ext4?
>
> All the filesystems were formatted like this using Debian e2fstools as
> of 9 months ago:

Rats. OK, so the indirect block journal credit bug fix won't help this bug.

>   mke2fs -t ext4 -E lazy_itable_init=1 -L db:mail /dev/mapper/db-mail
>   tune2fs -i 0 -c 1 -e remount-ro -o acl,user_xattr,journal_data /dev/mapper/db-mail
>
> Ooooh could the lazy_itable_init have anything to do with it?

Shouldn't be, since 2.6.32 doesn't have the lazy inode init support. That support didn't show up until 2.6.37.

> I've switched the relevant filesystems back to data=journal mode, so if
> you want to send me a patch for 2.6.32 that I can apply to a Debian
> kernel, I will keep that kernel around, and if I see it happen again
> I'll check whether the patch fixes it.

Given that this was a freshly created file system with mke2fs -t ext4, I doubt the patch would help.

> Well, the base image is essentially a somewhat basic Debian squeeze for
> EC2 with our SSH public keys and a couple of generic customizations
> applied. It does not have Postfix installed or configured, so there
> would be some work involved.

Well, if you can share that image in AWS with the ssh keys stripped out, it would save me a bunch of time. I assume it's not set up to automatically set ssh keys and pass them back to AWS like the generic images can?

> I also didn't see any problems with the system at all until the queue
> got backed up with ~100-120 stuck emails. After Postfix tried and
> failed to deliver a bunch of emails I would get the OOPS.

Yeah, what I'd probably try to do is install postfix and then send a few hundred messages to foo...@example.com and see if I can repro the OOPS.

Thanks for investigating!

- Ted

--
Archive: http://lists.debian.org/20110405190738.gf2...@thunk.org
Bug#615998: linux-image-2.6.32-5-xen-amd64: Repeatable kernel BUG at fs/jbd2/commit.c:534 from Postfix on ext4
On Mon, Apr 04, 2011 at 09:24:28AM -0500, Moffett, Kyle D wrote:
> Unfortunately it was not a trivial process to install Debian squeeze
> onto an EC2 instance; it took a couple of ugly Perl scripts, a patched
> Debian-Installer, and several manual post-install-but-before-reboot
> steps (like fixing up GRUB 0.99). One of these days I may get time to
> update all that to the official wheezy release and submit bug reports.

Sigh, I was hoping someone was maintaining semi-official EC2 images for Debian, much like alestic has been maintaining for Ubuntu. (Hmm, actually, he has EC2 images for Lenny and Etch, but unfortunately not for squeeze. Sigh.)

> It's probably easier for me to halt email delivery and clone the
> working instance and try to reproduce from there. If I recall, the
> (easily undone) workaround was to remount from data=journal to
> data=ordered on a couple of filesystems. It may take a day or two to
> get this done, though.

Couple of questions which might give me some clues:

(a) Was this a natively formatted ext4 file system, or an ext3 file system which was later converted to ext4?

(b) How big are the files/directories involved? In particular, how big is the Postfix mail queue directory, and is it an extent-based directory? (What does lsattr on the mail queue directory report?) As far as file sizes go, does it matter how big the e-mail messages are, and are there any other database files that postgres might be touching at the time that you get the OOPS?

I have found a bug in ext4 where we were underestimating how many journal credits were needed when modifying direct/indirect-mapped files (which would be seen on ext4 if you had an ext3 file system that was converted to start using extents; old, pre-existing directories wouldn't be converted), which is why I'm asking whether this was an ext2/ext3 file system which was converted to use ext4. I have a patch to fix it, but backporting it into a kernel which will work with EC2 is not something I've done before. Can anyone point me at a web page that gives me the quick cheat sheet?

> If it comes down to it I also have a base image (from squeeze as of 9
> months ago) that could be made public after updating with new SSH keys.

If we can reproduce the problem on that base image it would be really great! I have an Amazon AWS account; contact me when you have an image you want to share, if you want to share it just with my AWS account id instead of sharing it publicly...

- Ted

--
Archive: http://lists.debian.org/20110405001542.ge2...@thunk.org
Bug#615998: linux-image-2.6.32-5-xen-amd64: Repeatable kernel BUG at fs/jbd2/commit.c:534 from Postfix on ext4
Hi Kyle,

Sorry for not following up sooner. Are you still able to reproduce this failure? If I set up an identical Debian stable instance on EC2, am I likely to reproduce it myself? Do you have a package list or an EC2 base image I can use as a starting point?

Thanks,

- Ted

--
Archive: http://lists.debian.org/20110403020227.ga19...@thunk.org
Re: [RFC/PATCH 0/4] Re: Bug#605009: serious performance regression with ext4
On Mon, Nov 29, 2010 at 02:46:11PM +0100, Raphael Hertzog wrote:
> On Mon, 29 Nov 2010, Theodore Tso wrote:
> > BTW, if you had opened the file handle in subsequent passes using
> > O_RDONLY|O_NOATIME, the use of fdatasync() instead of fsync() might
> > not have been necessary. And as far as the comments in patch #4 ...
>
> Hum, fsync()/fdatasync() require a fd opened for writing, so this is
> not really possible? (Or at least the man page says so and indicates
> EBADF as the return value in that case.)

Hmm, that's not the language used in SUSv3:

    [EBADF] The fildes argument is not a valid descriptor.
        - http://www.opengroup.org/onlinepubs/009695399/functions/fsync.html

But yes, I see where the Linux system call man pages have stated this:

    EBADF  fd is not a valid file descriptor open for writing.

My test program which I sent out works, and it does:

    fd = open(file, O_RDONLY|O_NOATIME);
    fsync(fd);
    close(fd);

with all of the appropriate error checking, so I can tell you it's not required for recent 2.6 kernels (I tested this using 2.6.37-rc2). But whether this was required on older kernels, I'm not 100% sure. I've cc'ed Michael Kerrisk to see if he might be able to shed any light on where the EBADF wording in the fsync() man page might have come from.

I've since done more looking at the source code, and from what I can tell, O_WRONLY should be OK; merely opening a file using O_WRONLY shouldn't affect the mod time. Any opening of a file using O_RDONLY touches the atime of the file (and all directories and symlinks needed to open it), though, so the use of O_NOATIME and fdatasync() to minimize unneeded I/O does seem to be a good idea.

- Ted

--
Archive: http://lists.debian.org/20101129141901.gr2...@thunk.org
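[Editor's note] Ted's three-line test can be fleshed out into a compilable sketch. This is an illustration under the assumption of a recent Linux kernel; `fsync_readonly` is a made-up name, and O_NOATIME will fail with EPERM unless the caller owns the file (or has CAP_FOWNER):

```c
/* Demonstrate that fsync() succeeds on a descriptor opened O_RDONLY
 * on recent Linux kernels, despite the fsync(2) man page's EBADF
 * wording.  O_NOATIME avoids dirtying the atime on the open. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

int fsync_readonly(const char *file)
{
    int fd = open(file, O_RDONLY | O_NOATIME);
    int ret;

    if (fd < 0)
        return -1;
    ret = fsync(fd);    /* works even though fd is not open for writing */
    close(fd);
    return ret;
}
```

Whether this worked on pre-2.6 kernels is exactly the open question in the message above, so treat success as kernel-dependent rather than guaranteed by POSIX.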
Re: Bug#605009: serious performance regression with ext4
On Mon, Nov 29, 2010 at 02:16:02PM +0100, Raphael Hertzog wrote:
> It means we don't need to keep it in RAM since we're not going to
> read/modify it again in the near future. Thus the writeback can be
> started right now, since delaying it will not save us anything. At
> least that's the way I understand the situation.

Yes, that's correct. The fadvise() will do two things: it will start the writeback, and also make these memory pages the most likely to be discarded. This last might or might not be a good thing. If you are installing a large number of packages, discarding these pages will keep more useful things from being discarded, and might help the interactive feel of the machine while the install is going on in the background. OTOH, if you are only installing one package, it might cause some file that will be needed by the postinstall script to be pushed out of the page cache prematurely.

So the fadvise() does the same thing as SYNC_FILE_RANGE_WRITE, which is to say, start an asynchronous writeback of the pages in the file. It will not do a SYNC_FILE_RANGE_WAIT_BEFORE, which assures that the writebacks are complete before attempting to start the fdatasync().

> > Put another way: if this works now, is it likely to continue to work?
>
> Well, it will always work (the code is unlikely to introduce failures),
> but the resulting behaviour is entirely up to the kernel to decide. So
> there's no guarantee that the optimization will last.

Exactly. I think the real question is whether you want to also give the hint that the pages for that particular file should be first in line to be discarded from the page cache.

> On the other hand, the whole point of posix_fadvise() is to give hints
> to the kernel so that it can decide on the best course of action. So I
> hope the interpretation above is the main motivation behind that hint.

The main motivation is to make the pages easily discardable; the fact that it happens to start writeback is really a side effect.

So for backup programs, including rsync when it is being used for backups, using POSIX_FADV_DONTNEED is definitely a good idea. Whether or not it is a good idea for dpkg really depends on whether you think the files are going to be used soon after they are written --- either because the user has just installed a new toy and wants to play with it (i.e., apt-get install tuxracer; tuxracer) or because of a post-install script. On the other hand, if the user was just updating a random set of programs that aren't actually going to be used right away (i.e., apt-get update; apt-get upgrade), then POSIX_FADV_DONTNEED would probably be a good thing.

The reason why I suggested using sync_file_range() is because it is very specifically directed at forcing the writeback to happen, which is not quite the same thrust as posix_fadvise().

Regards,

- Ted

--
Archive: http://lists.debian.org/20101129143526.gs2...@thunk.org
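[Editor's note] The backup-program pattern Ted endorses might look like this in outline (a sketch of mine, not rsync's or dpkg's actual code). Note that POSIX_FADV_DONTNEED can only drop pages that are already clean, hence the fdatasync() first:

```c
/* Write a file we will not read back soon, then hint the kernel that
 * its page-cache pages are first in line for eviction. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

int write_and_drop(const char *path, const char *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0666);
    int ret = -1;

    if (fd < 0)
        return -1;
    if (write(fd, buf, len) == (ssize_t)len &&
        fdatasync(fd) == 0 &&
        /* pages are clean now, so DONTNEED can actually drop them */
        posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED) == 0)
        ret = 0;
    close(fd);
    return ret;
}
```

As discussed above, whether the kernel starts writeback or drops the pages in response to the hint is entirely up to the kernel; the call only advises.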
Re: Bug#605009: serious performance regression with ext4
On Mon, Nov 29, 2010 at 02:58:16PM +, Ian Jackson wrote:
> This is the standard way that ordinary files for which reliability was
> important have been updated on Unix for decades. fsync is for files
> which need synchronisation with things external to the computer (or at
> least, external to the volume) - eg, email at final dot.

This is simply not true. And I'm speaking as someone who has been doing Unix/Linux kernel development since the BSD 4.3 days. (Well, BSD 4.3+Tahoe days, to be precise.) fsync() has always been the only guarantee that files would be on disk. In fact, the way BSD worked, there was no guarantee that rename() would provide any kind of file synchronization primitive; that's actually something new. No, in the old days, if you really cared about a file, you would fsync() it. Period. End of paragraph. It was just that in those days, the main things people cared about were either source/text files (so the editors of the day would do the right thing) or e-mail (and not just for final delivery; for all MTAs).

The problem that caused people to get this wrong idea was that (a) back then Unix machines tended to be more reliable, because they were run by professionals in machine rooms, very often with UPSes. Also, (b) people weren't loading craptastic video drivers with buggy proprietary kernel modules; they may have used proprietary drivers, but kernels weren't changing all the time, and there was a lot more careful testing of drivers before they were unloosed onto the world. Finally, (c) ext3, as an accident of how it provided protection against old file blocks showing up in newly allocated files (something which BSD 4.3 did __not__ protect against, by the way), had the behaviour that renaming over a file __usually__ (but not always) provided atomic guarantees.

(c) was especially unfortunate, because it never applied to all Linux file systems, just to ext3, and because the same behaviour was also responsible for disastrous desktop performance when you had a large number of streaming writes (i.e., bittorrent, video ripping/copying, etc.) going on in the background combined with foreground GUI applications that were fsync-happy --- i.e., firefox. Lots of users have complained about the desktop performance problem, but the reality is we can't really solve it without also taking away the magic that made (c) happen. Whether you solve it by using data=writeback and sticking with ext3, or switching to ext4, or to XFS, or to btrfs --- all of these will solve the desktop performance problem, but they also leave you vulnerable to file loss in the case of system crashes and applications that don't use fsync()/fdatasync(). Hence the joke that all file system developers --- btrfs, XFS, and ext4 developers alike --- made at the file system developers' summit two years ago: that what application programmers really wanted was O_PONY, with magic pixie dust. Unfortunately: http://www.linuxformat.com/files/nopony.jpg

- Ted

--
Archive: http://lists.debian.org/20101129151812.gu2...@thunk.org
Re: Bug#605009: serious performance regression with ext4
On Mon, Nov 29, 2010 at 09:21:44AM -0600, Jonathan Nieder wrote:
> That explanation helps a lot. Thanks, both. (Guillem, I like your patch
> very much then. Most files being unpacked in a dpkg run aren't going to
> be read back again soon. Perhaps some other kernels will also interpret
> it as a hint to start writeback.)

Most files won't, but consider a postinstall script which needs to scan/index a documentation file, or simply run one or more binaries that were just installed. I can definitely imagine situations where using POSIX_FADV_DONTNEED could actually hurt performance. Is it enough to worry about? Hard to say; for a very long dpkg run, the files might end up getting pushed out of memory anyway. But if you are only installing one package, and you are doing this on a particularly slow disk, using POSIX_FADV_DONTNEED could actually hurt in a measurable way.

If you are only installing one or a few packages, and/or you can somehow divine the user's intention that they will very shortly use the file --- for example, if dpkg is being launched via packagekit to install some font or codec --- then using POSIX_FADV_DONTNEED is probably the wrong answer. So at the very least I'd recommend having command line options to enable/disable the use of posix_fadvise().

Regards,

- Ted

--
Archive: http://lists.debian.org/20101129153244.ga7...@thunk.org
Re: Bug#605009: serious performance regression with ext4
I did some experimenting, and I figured out what was going on. You're right, (c) doesn't quite work, because delayed allocation meant that the writeout didn't take place until the fsync() for each file happened. I didn't see this at first; my apologies. However, this *does* work:

    extract(a);
    sync_file_range(fd.a, 0, 0, SYNC_FILE_RANGE_WRITE);
    extract(b.dpkg-new);
    sync_file_range(fd.b, 0, 0, SYNC_FILE_RANGE_WRITE);
    extract(c.dpkg-new);
    sync_file_range(fd.c, 0, 0, SYNC_FILE_RANGE_WRITE);

    sync_file_range(fd.a, 0, 0, SYNC_FILE_RANGE_WAIT_BEFORE);
    sync_file_range(fd.b, 0, 0, SYNC_FILE_RANGE_WAIT_BEFORE);
    sync_file_range(fd.c, 0, 0, SYNC_FILE_RANGE_WAIT_BEFORE);

    fdatasync(a);
    fdatasync(b.dpkg-new);
    fdatasync(c.dpkg-new);

    rename(b.dpkg-new, b);
    rename(c.dpkg-new, c);

This assumes that files b and c existed beforehand, but a is a new file.

What's going on here? sync_file_range() is a Linux-specific system call that has been around for a while. It allows a program to control when writeback happens in a very low-level fashion. The first set of sync_file_range() system calls causes the system to start writing back each file once it has finished being extracted. It doesn't actually wait for the write to finish; it just starts the writeback. The second series of sync_file_range() calls, with the operation SYNC_FILE_RANGE_WAIT_BEFORE, will block until the previously initiated writeback has completed. This basically ensures that the delayed allocation has been resolved; that is, the data blocks have been allocated and written, and the inode updated (in memory), but not necessarily pushed out to disk. The fdatasync() call will actually force the inode to disk. In the case of the ext4 file system, the first fdatasync() will actually push all of the inodes to disk, and all of the subsequent fdatasync() calls are in fact no-ops (assuming that files 'a', 'b', and 'c' are all on the same file system).
What this means is that it reduces the number of (heavyweight) jbd2 commits to a minimum. It uses a Linux-specific system call --- sync_file_range --- but the result should be faster performance across the board for all file systems. So I don't consider this an ext4-specific hack, although it probably does make things faster for ext4 more than for any other file system.

I've attached the program I used to test and prove this mechanism, as well as the kernel tracepoint script I used to debug why (c) wasn't working, which might be of interest to folks on debian-kernel. Basically it's a demonstration of how cool ftrace is. :-)

Using this program on a file system composed of a 5400rpm laptop drive running LVM and LUKS, I get:

    mass-sync-tester -d:    dpkg current:    time: 0.83/ 0.01/ 0.00

versus

    mass-sync-tester -n:    dpkg fixed:      time: 0.07/ 0.00/ 0.01

- Ted

/*
 * Mass sync tester
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/time.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sys/resource.h>
#include <getopt.h>
#include <errno.h>
#include <string.h>

void write_file(const char *name, int sync, int sync_range)
{
    int fd, i, ret;
    char buf[1024];

    fd = open(name, O_WRONLY|O_TRUNC|O_CREAT, 0666);
    if (fd < 0) {
        fprintf(stderr, "open(%s) in write_file: %s\n",
            name, strerror(errno));
        exit(1);
    }
    memset(buf, 0, sizeof(buf));
    for (i = 0; i < 16; i++) {
        ret = write(fd, buf, sizeof(buf));
        if (ret < 0) {
            fprintf(stderr, "writing %s: %s\n",
                name, strerror(errno));
            exit(1);
        }
    }
    if (sync) {
        ret = fsync(fd);
        if (ret < 0) {
            fprintf(stderr, "fsyncing %s in write_file: %s\n",
                name, strerror(errno));
            exit(1);
        }
    }
    if (sync_range) {
        ret = sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);
        if (ret < 0) {
            fprintf(stderr, "sync_file_range %s in write_file: %s\n",
                name, strerror(errno));
            exit(1);
        }
    }
    ret = close(fd);
    if (ret < 0) {
        fprintf(stderr, "closing %s in write_file: %s\n",
            name, strerror(errno));
        exit(1);
    }
}

void rename_file(const char *src, const char *dest)
{
    int ret;

    ret = rename(src, dest);
    if (ret) {
        fprintf(stderr, "renaming %s to %s: %s\n",
            src, dest, strerror(errno));
        exit(1);
    }
}

void sync_file(const char *name)
{
    int fd, ret;

    fd = open(name, O_RDONLY|O_NOATIME, 0666);
    if (fd < 0) {
        fprintf(stderr, "open(%s) in sync_file: %s\n",
            name, strerror(errno));
        exit(1);
    }
    ret = fsync(fd);
    if (ret < 0) {
        fprintf(stderr, "fsyncing %s in sync_file: %s\n",
            name, strerror(errno));
        exit(1);
    }
    ret = close(fd);
    if (ret < 0) {
        fprintf(stderr, "closing %s in sync_file: %s\n",
            name, strerror(errno));
        exit(1);
    }
}

void datasync_file(const char *name)
{
    int fd, ret;

    fd = open(name, O_RDONLY|O_NOATIME, 0666);
    if (fd < 0) {
        fprintf(stderr, "open(%s) in datasync_file: %s\n",
            name, strerror(errno));
        exit(1);
    }
    ret = fdatasync(fd);
    if (ret < 0) {
        fprintf(stderr, "fdatasync %s in datasync_file: %s\n",
            name, strerror(errno));
        exit(1);
    }
    ret = close(fd);
    if (ret < 0) {
        fprintf(stderr, "closing %s in datasync_file: %s\n",
            name, strerror(errno));
        exit(1);
    }
}
Re: Bug#605009: serious performance regression with ext4
On Fri, Nov 26, 2010 at 03:53:27PM +0100, Raphael Hertzog wrote:
> Just to sum up what dpkg --unpack does in 1.15.8.6:
>
> 1/ set the package status as half-installed/reinst-required
> 2/ extract all the new files as *.dpkg-new
> 3/ for all the unpacked files: fsync(foo.dpkg-new) followed by
>    rename(foo.dpkg-new, foo)
> ... and then set the package status as unpacked.

Suppose the package contains files a, b, and c. Which are you doing?

a)  extract a.dpkg-new; fsync(a.dpkg-new); rename(a.dpkg-new, a);
    extract b.dpkg-new; fsync(b.dpkg-new); rename(b.dpkg-new, b);
    extract c.dpkg-new; fsync(c.dpkg-new); rename(c.dpkg-new, c);

or

b)  extract a.dpkg-new; fsync(a.dpkg-new);
    extract b.dpkg-new; fsync(b.dpkg-new);
    extract c.dpkg-new; fsync(c.dpkg-new);
    rename(a.dpkg-new, a);
    rename(b.dpkg-new, b);
    rename(c.dpkg-new, c);

or

c)  extract(a.dpkg-new);
    extract(b.dpkg-new);
    extract(c.dpkg-new);
    fsync(a.dpkg-new);
    fsync(b.dpkg-new);
    fsync(c.dpkg-new);
    rename(a.dpkg-new, a);
    rename(b.dpkg-new, b);
    rename(c.dpkg-new, c);

(c) will perform the best for most file systems, including ext4. As a further optimization, if b and c do not exist beforehand, it would of course be better to extract into b and c directly and skip the rename, i.e.:

d)  extract(a.dpkg-new);
    extract(b);     # assuming the file b does not yet exist
    extract(c);     # assuming the file c does not yet exist
    fsync(a.dpkg-new);
    fsync(b);
    fsync(c);
    rename(a.dpkg-new, a);

I am guessing you are doing (a) today --- am I right? (c) or (d) would be best.

- Ted

--
Archive: http://lists.debian.org/20101126215254.gj2...@thunk.org
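[Editor's note] Strategy (c) above can be sketched as a small C helper. This is an illustration only; `extract_to` is a hypothetical stand-in for dpkg's real extraction code, and error handling is minimal:

```c
/* Sketch of strategy (c): extract all files first, then fsync them
 * all, then do all the renames, so journal commits get batched rather
 * than interleaved with renames. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical extraction step: create tmp and return an open fd. */
static int extract_to(const char *tmp)
{
    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0666);
    if (fd >= 0 && write(fd, "payload\n", 8) != 8) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Phase-ordered unpack: extract all, fsync all, rename all. */
int unpack_all(const char *tmp[], const char *final[], int n)
{
    int fds[16];
    int i;

    if (n > 16)
        return -1;
    for (i = 0; i < n; i++)                 /* phase 1: extract */
        if ((fds[i] = extract_to(tmp[i])) < 0)
            return -1;
    for (i = 0; i < n; i++)                 /* phase 2: fsync */
        if (fsync(fds[i]) < 0 || close(fds[i]) < 0)
            return -1;
    for (i = 0; i < n; i++)                 /* phase 3: rename */
        if (rename(tmp[i], final[i]) < 0)
            return -1;
    return 0;
}
```

The sync_file_range() variant in the later message in this thread refines phase 2 further by starting writeback during extraction; this portable version keeps the batching idea without the Linux-specific call.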