Bug#654206: [PATCH] ext4: Report max_batch_time option correctly

2012-01-04 Thread Ted Ts'o
On Mon, Jan 02, 2012 at 02:13:02PM +0000, Ben Hutchings wrote:
 Currently the value reported for max_batch_time is really the
 value of min_batch_time.
 
 Reported-by: Russell Coker russ...@coker.com.au
 Signed-off-by: Ben Hutchings b...@decadent.org.uk

Applied, thanks.

- Ted






Bug#615998: linux-image-2.6.32-5-xen-amd64: Repeatable kernel BUG at fs/jbd2/commit.c:534 from Postfix on ext4

2011-06-28 Thread Ted Ts'o
  My basic impression is that the use of data=journalled can help
  reduce the risk (slightly) of serious corruption to some kinds of
  databases when the application does not provide appropriate syncs
  or journalling on its own (e.g., text-based wiki database files).

Yes, although if the application has index files that have to be
updated at the same time, there is no guarantee that the changes that
survive after a system failure (either a crash or a power failure)
will be consistent, unless the application is doing proper
application-level journalling or uses some other structured update
scheme.

 To sum up, the only additional guarantee data=journal offers against
 data=ordered is a total ordering of all IO operations. That is, if you do a
 sequence of data and metadata operations, then you are guaranteed that
 after a crash you will see the filesystem in a state corresponding exactly
 to your sequence terminated at some (arbitrary) point. Data writes are
 disassembled into page-sized & page-aligned sequences of writes for the
 purpose of this model...

data=journal can also make the fsync() operation faster, since it will
involve fewer seeks (although it will require greater write
bandwidth).  Whether that trade-off wins depends on your available
write bandwidth, so you really need to benchmark things to be sure.
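
(For anyone who wants to experiment: a minimal sketch of enabling
data=journal, with the device and mount point as placeholder names.  It
can be given as a mount option, or stored as a superblock default with
tune2fs -o journal_data, the same knob that appears in the tune2fs
invocation quoted elsewhere in this thread:

    mount -o data=journal /dev/mapper/db-mail /srv/mail

    # or persistently, as a default mount option in the superblock:
    tune2fs -o journal_data /dev/mapper/db-mail
)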

- Ted






Bug#615998: linux-image-2.6.32-5-xen-amd64: Repeatable kernel BUG at fs/jbd2/commit.c:534 from Postfix on ext4

2011-06-27 Thread Ted Ts'o
On Mon, Jun 27, 2011 at 05:30:11PM +0200, Lukas Czerner wrote:
  I've found some. So although data=journal users are minority, there are
  some. That being said I agree with you we should do something about it
  - either state that we want to fully support data=journal - and then we
  should really do better with testing it or deprecate it and remove it
  (which would save us some complications in the code).
  
  I would be slightly in favor of removing it (code simplicity, fewer options
  to configure for the admin, fewer options to test for us; some users I've come
  across actually were not quite sure why they are using it - they just
  thought it looks safer).

Hmm...  FYI, I hope to be able to bring online automated testing for
ext4 later this summer (there's a testing person at Google who has
signed up to work on setting this up as his 20% project).  The test
matrix that I gave him includes data=journal, so we will be getting
better testing in the near future.

At least historically, data=journal was the *simpler* case, and
was the first thing supported by ext4.  (data=ordered required revoke
handling, which didn't land for six months or so.)  So I'm not really
convinced that removing it buys us that much code
simplification.

That being said, it is true that data=journal isn't necessarily
faster.  For heavy disk-bound workloads, it can be slower.  So I can
imagine adding some documentation that warns people not to use
data=journal unless they really know what they are doing, but at least
personally, I'm a bit reluctant to dispense with a bug report like
this by saying, oh, that feature should be deprecated.

Regards,

- Ted






Bug#615998: linux-image-2.6.32-5-xen-amd64: Repeatable kernel BUG at fs/jbd2/commit.c:534 from Postfix on ext4

2011-06-23 Thread Ted Ts'o
On Thu, Jun 23, 2011 at 01:32:48PM -0500, Moffett, Kyle D wrote:
 
 Ted, since this new iteration has no customer data, passwords, keys, or
 any other private data, I'm going to try to get approval to release an
 exact EC2 image of this system for you to test with, including the fake
 data volume that I triggered the problem on.

That would be great!  Approximately how big are the images involved?

- Ted






Bug#615998: linux-image-2.6.32-5-xen-amd64: Repeatable kernel BUG at fs/jbd2/commit.c:534 from Postfix on ext4

2011-04-05 Thread Ted Ts'o
On Tue, Apr 05, 2011 at 10:30:11AM -0500, Moffett, Kyle D wrote:
  Couple of questions which might give me some clues:
    (a) was this a natively formatted ext4 file system, or an ext3 file
system which was later converted to ext4?
 
 All the filesystems were formatted like this using the Debian e2fsprogs
 as of 9 months ago:

Rats.  OK, so the indirect block journal credit bug fix won't help
this bug.

   mke2fs -t ext4 -E lazy_itable_init=1 -L db:mail /dev/mapper/db-mail
   tune2fs -i 0 -c 1 -e remount-ro -o acl,user_xattr,journal_data 
 /dev/mapper/db-mail
 
 Ooooh could the lazy_itable_init have anything to do with it?

Shouldn't be, since 2.6.32 doesn't have the lazy inode init support.
That support didn't show up until 2.6.37.

 I've switched the relevant filesystems back to data=journal mode,
 so if you want to send me a patch for 2.6.32 that I can apply to a
 Debian kernel I will keep that kernel around and if I see it happen
 again I'll check if the patch fixes it.

Given that this was a freshly created file system with mke2fs -t ext4,
I doubt the patch would help.

 Well, the base image is essentially a somewhat basic Debian squeeze
 for EC2 with our SSH public keys and a couple generic customizations
 applied.  It does not have Postfix installed or configured, so there
 would be some work involved.

Well, if you can share that image in AWS with the ssh keys stripped
out it would save me a bunch of time.  I assume it's not set up to
automatically set ssh keys and pass them back to AWS like the generic
images can?

 I also didn't see any problems with the system at all until the
 queue got backed up with ~100-120 stuck emails.  After Postfix tried
 and failed to deliver a bunch of emails I would get the OOPS.

Yeah, what I'd probably try to do is install postfix and then send a
few hundred messages to foo...@example.com and see if I can repro the OOPS.

Thanks for investigating!

- Ted






Bug#615998: linux-image-2.6.32-5-xen-amd64: Repeatable kernel BUG at fs/jbd2/commit.c:534 from Postfix on ext4

2011-04-04 Thread Ted Ts'o
On Mon, Apr 04, 2011 at 09:24:28AM -0500, Moffett, Kyle D wrote:
 
 Unfortunately it was not a trivial process to install Debian
 squeeze onto an EC2 instance; it took a couple ugly Perl scripts,
 a patched Debian-Installer, and several manual
 post-install-but-before-reboot steps (like fixing up GRUB 0.99).
 One of these days I may get time to update all that to the official
 wheezy release and submit bug reports.

Sigh, I was hoping someone was maintaining semi-official EC2 images
for Debian, much like alestic has been maintaining for Ubuntu.  (Hmm,
actually, he has EC2 images for Lenny and Etch, but unfortunately not
for squeeze.  Sigh)

 It's probably easier for me to halt email delivery and clone the
 working instance and try to reproduce from there.  If I recall, the
 (easily undone) workaround was to remount from data=journal to
 data=ordered on a couple filesystems.  It may take a day or two to
 get this done, though.

Couple of questions which might give me some clues: (a) was this a
natively formatted ext4 file system, or an ext3 file system which was
later converted to ext4?  (b) How big are the files/directories
involved?  In particular, how big is the Postfix mail queue directory,
and is it an extent-based directory?  (What does lsattr on the mail
queue directory report?)  As far as file sizes, does it matter how big
the e-mail messages are, and are there any other database files that
Postfix might be touching at the time that you get the OOPS?
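
(A minimal sketch of that lsattr check, assuming /var/spool/postfix is
the queue directory; lsattr -d prints the attribute flags of the
directory itself, and an 'e' among the flags means it is extent-mapped:

    lsattr -d /var/spool/postfix
)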

I have found a bug in ext4 where we were underestimating how many
journal credits were needed when modifying direct/indirect-mapped
files (which would be seen on ext4 if you had an ext3 file system that
was converted to start using extents; but old, pre-existing
directories wouldn't be converted), which is why I'm asking the
question about whether this was an ext2/ext3 file system which was
converted to use ext4.

I have a patch to fix it, but backporting it into a kernel which will
work with EC2 is not something I've done before.  Can anyone point me
at a web page that gives me the quick cheat sheet?

 If it comes down to it I also have a base image (from squeeze as of 9 
 months ago) that could be made public after updating with new SSH keys. 

If we can reproduce the problem on that base image it would be really
great!  I have an Amazon AWS account; contact me when you have an
image you want to share, if you want to share it just with my AWS
account id, instead of sharing it publicly...

 - Ted






Bug#615998: linux-image-2.6.32-5-xen-amd64: Repeatable kernel BUG at fs/jbd2/commit.c:534 from Postfix on ext4

2011-04-02 Thread Ted Ts'o
Hi Kyle,

Sorry for not following up sooner.  Are you still able to reproduce
this failure?  If I set up an identical Debian stable instance on
EC2, am I likely to reproduce it myself?  Do you have a package list
or EC2 base image I can use as a starting point?

Thanks,

- Ted






Re: [RFC/PATCH 0/4] Re: Bug#605009: serious performance regression with ext4

2010-11-29 Thread Ted Ts'o
On Mon, Nov 29, 2010 at 02:46:11PM +0100, Raphael Hertzog wrote:
 On Mon, 29 Nov 2010, Theodore Tso wrote:
  BTW, if you had opened the file handle in subsequent passes using
  O_RDONLY|O_NOATIME, the use of fdatasync() instead of fsync() might not
  have been necessary.   And as far as the comments in patch #4 was
 
 Hum, fsync()/fdatasync() require a fd opened for writing, so this is not
 really possible? (Or at least the man page says so and indicates EBADF as
 return value in that case)

Hmm... that's not the language used in SUSv3:

[EBADF]
 The fildes argument is not a valid descriptor.
- http://www.opengroup.org/onlinepubs/009695399/functions/fsync.html

But yes, I see where the Linux system call man pages have stated this.

 EBADF  fd is not a valid file descriptor open for writing.

My test program which I sent out works, and it does:

   fd = open(file, O_RDONLY|O_NOATIME);
   fsync(fd);
   close(fd);

with all of the appropriate error checking, so I can tell you that a
writable descriptor is not required for recent 2.6 kernels (I tested
this using 2.6.37-rc2).  But
whether this was required on older kernels, I'm not 100% sure.  I've
cc'ed Michael Kerrisk to see if he might be able to shed any light on
where the EBADF wording from the fsync() man page might have come from.

I've since done more looking at the source code, and from what I can
tell, O_WRONLY should be OK; merely opening a file using O_WRONLY
shouldn't affect the mod time.  Any opening of a file using O_RDONLY
touches the atime of the file (and all directories and symlinks needed
to open it), though, so the use of O_NOATIME and fdatasync() to
minimize unneeded I/O does seem to be a good idea.
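
A minimal self-contained sketch of that pattern (flush_file is an
illustrative name, not part of my test program):

#define _GNU_SOURCE	/* for O_NOATIME */
#include <fcntl.h>
#include <unistd.h>

/* Flush a file's data with minimal side effects: O_RDONLY so no write
 * permission is needed, O_NOATIME so the open itself doesn't dirty the
 * atime, and fdatasync() so only the data (plus whatever metadata is
 * needed to retrieve it) is forced out, not timestamp updates. */
int flush_file(const char *name)
{
	int fd = open(name, O_RDONLY | O_NOATIME);

	if (fd < 0)
		return -1;
	if (fdatasync(fd) < 0) {
		close(fd);
		return -1;
	}
	return close(fd);
}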

- Ted





Re: Bug#605009: serious performance regression with ext4

2010-11-29 Thread Ted Ts'o
On Mon, Nov 29, 2010 at 02:16:02PM +0100, Raphael Hertzog wrote:
 
 It means we don't need to keep it in RAM since we're not going to
 read/modifiy it again in the near future. Thus the writeback can be
 started right now since delaying it will not save us anything.
 
 At least that's the way I understand the situation.

Yes, that's correct.  The fadvise() will do two things: it will start
the writeback, and also make these memory pages the most likely to
be discarded.  This last might or might not be a good thing.  If you
are installing a large number of packages, discarding these pages will
prevent more useful things from being evicted, and might help the
interactive feel of the machine while the install is going on in the
background.

OTOH, if you are only installing one package, it might cause some file
that will be needed by the postinstall script to be pushed out of the
page cache prematurely.

So the fadvise() does the same thing as SYNC_FILE_RANGE_WRITE, which
is to say, start an asynchronous writeback of the pages in the file.
It will not do a SYNC_FILE_RANGE_WAIT_BEFORE, which assures that the
writebacks are complete before attempting to start doing the
fdatasync().
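
As a concrete sketch of that contrast (illustrative only; fd is assumed
to be a file descriptor for a file that was just written):

#define _GNU_SOURCE	/* for sync_file_range() */
#include <fcntl.h>

void writeback_hints(int fd)
{
	/* fadvise: starts writeback as a side effect, and marks the
	 * pages as first in line to be dropped from the page cache. */
	posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);

	/* sync_file_range: starts writeback only; the pages stay
	 * cached.  The WAIT_BEFORE form blocks until that writeback
	 * completes, so a following fdatasync() has less to do. */
	sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);
	sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WAIT_BEFORE);
}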

  Put another way: if this works now, is it likely to continue to work?
 
 Well, it will always work (the code is unlikely to introduce failures),
 but the resulting behaviour is entirely up to the kernel to decide. So
 there's no guarantee that the optimization will last.

Exactly.  I think the real question is whether you want to also give
the hint that the pages for that particular file should be first in
line to be discarded from the page cache.  

 On the other hand, the whole point of posix_fadvise() is to give hints to
 the kernel so that it can decide on the best course of action. So I hope
 the interpretation above is one of the main motivations behind that hint.

The main motivation is to make the pages easily discardable; the fact
that it happens to start writeback is really a side effect.  So for
backup programs, including rsync when it is being used for backups,
using POSIX_FADV_DONTNEED is definitely a good idea.  Whether or not
it is a good idea for dpkg really depends on whether you think the
files are going to be used soon after they are written --- either
because the user has just installed the new toy and wants to play with
it (i.e., apt-get install tuxracer; tuxracer) or because of a
post-install script.

On the other hand, if the user was just updating a random set of
programs that aren't actually going to be used right away (i.e.,
apt-get update; apt-get upgrade), then
POSIX_FADV_DONTNEED would probably be a good thing.

The reason why I suggested using sync_file_range() is because it is
very specifically directed at forcing the writeback to happen, which
is not quite the same thrust as posix_fadvise().

Regards,

- Ted





Re: Bug#605009: serious performance regression with ext4

2010-11-29 Thread Ted Ts'o
On Mon, Nov 29, 2010 at 02:58:16PM +0000, Ian Jackson wrote:
 
 This is the standard way that ordinary files for which reliability was
 important have been updated on Unix for decades.  fsync is for files
 which need synchronisation with things external to the computer (or at
 least, external to the volume) - eg, email at final dot.

This is simply not true.  And I'm speaking as someone who has been
doing Unix/Linux kernel development since the BSD 4.3 days.  (Well,
BSD 4.3+Tahoe days, to be precise.)

fsync() has always been the only guarantee that files would be on
disk.  In fact the way BSD worked, there was no guarantee that
rename() would provide any kind of file synchronization primitive;
that's actually something new.  No, in the old days, if you really
cared about a file, you would fsync() it.  Period.  End of paragraph.

It was just that in those days, the main things people cared about
were either source/text files (so the editors of the day would do the
right thing) or e-mail (and not just for the final delivery; for all
MTAs).

The reason people got this wrong idea was that (a) back then Unix
machines tended to be more reliable, because they were run by
professionals in machine rooms, very often with UPSes.  Also, (b)
people weren't loading craptastic video drivers with buggy
proprietary kernel modules; they may have used proprietary drivers,
but kernels weren't changing all the time, and there was a lot more
careful testing of drivers before they were loosed onto the world.

Finally (c), as an accident of how ext3 provided protection against
old file blocks showing up in newly allocated files (something which
BSD 4.3 did __not__ protect against, by the way), it had the
behaviour that renaming over a file __usually__ (but not always)
provided atomic guarantees.

(c) was especially unfortunate, because it never applied to all Linux
file systems, just to ext3, and because the same behaviour was also
responsible for disastrous desktop performance when you had a large
number of streaming writes (i.e., bittorrent, video ripping/copying,
etc.) going on in the background combined with foreground GUI
applications that were fsync()-happy --- i.e., firefox.

Lots of users have complained about the desktop performance problem,
but the reality is we can't really solve that without also taking away
the magic that made (c) happen.  Whether you solve it by using
data=writeback and stick with ext3, or switch to ext4, or switch to
XFS, or switch to btrfs --- all of these will solve the desktop
performance problem, but they also leave you vulnerable to file loss
in the case of system crashes and applications that don't use
fsync()/fdatasync().
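
To make the application-side fix concrete, here is a minimal sketch of
the fsync()-before-rename() pattern in question (replace_file and all
the names are illustrative; a fully robust version would also fsync the
containing directory):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int replace_file(const char *path, const char *tmp,
		 const char *buf, size_t len)
{
	int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0666);

	if (fd < 0)
		return -1;
	/* Force the new contents to disk *before* the rename, so a
	 * crash can never leave path pointing at an empty file. */
	if (write(fd, buf, len) != (ssize_t) len || fsync(fd) < 0) {
		close(fd);
		return -1;
	}
	if (close(fd) < 0)
		return -1;
	return rename(tmp, path);
}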

Hence the joke that all the file system developers, whether they were
btrfs developers or XFS developers or ext4 developers, made at the
file system developers' summit two years ago: what application
programmers really wanted was O_PONY, with the magic pixie dust.
Unfortunately:

http://www.linuxformat.com/files/nopony.jpg

- Ted





Re: Bug#605009: serious performance regression with ext4

2010-11-29 Thread Ted Ts'o
On Mon, Nov 29, 2010 at 09:21:44AM -0600, Jonathan Nieder wrote:
 
 That explanation helps a lot.  Thanks, both.  (Guillem, I like your
 patch very much then.  Most files being unpacked in a dpkg run aren't
 going to be read back again soon.  Perhaps some other kernels will
 also interpret it as a hint to start writeback.)

Most files won't, but consider a postinstall script which needs to
scan/index a documentation file, or simply run one or more binaries
that were just installed.  I can definitely imagine situations where
using POSIX_FADV_DONTNEED could actually hurt performance.  Is it
enough to worry about?  Hard to say; for a very long dpkg run, the
files might end up getting pushed out of memory anyway.  But if you
are only installing one package, and you are doing this on a
particularly slow disk, using POSIX_FADV_DONTNEED could actually hurt
in a measurable way.

If you are only installing one or a few packages, and/or you can
somehow divine the user's intention that they will very shortly use
the file --- for example, if dpkg is being launched via packagekit to
install some font or codec, then using POSIX_FADV_DONTNEED is probably
the wrong answer.  So at the very least I'd recommend having command
line options to enable/disable use of posix_fadvise().

Regards,

- Ted





Re: Bug#605009: serious performance regression with ext4

2010-11-28 Thread Ted Ts'o
I did some experimenting, and I figured out what was going on.  You're
right, (c) doesn't quite work, because delayed allocation meant that
the writeout didn't take place until the fsync() for each file
happened.  I didn't see this at first; my apologies.

However, this *does* work:

extract(a);
sync_file_range(fd.a, 0, 0, SYNC_FILE_RANGE_WRITE); 
extract(b.dpkg-new);
sync_file_range(fd.b, 0, 0, SYNC_FILE_RANGE_WRITE); 
extract(c.dpkg-new);
sync_file_range(fd.c, 0, 0, SYNC_FILE_RANGE_WRITE); 

sync_file_range(fd.a, 0, 0, SYNC_FILE_RANGE_WAIT_BEFORE); 
sync_file_range(fd.b, 0, 0, SYNC_FILE_RANGE_WAIT_BEFORE); 
sync_file_range(fd.c, 0, 0, SYNC_FILE_RANGE_WAIT_BEFORE); 

fdatasync(a);
fdatasync(b.dpkg-new);
fdatasync(c.dpkg-new);

rename(b.dpkg-new, b);
rename(c.dpkg-new, c);

This assumes that files b and c existed beforehand, but a is a new file.

What's going on here?  sync_file_range() is a Linux specific system
call that has been around for a while.  It allows a program to control
when writeback happens in a very low-level fashion.  The first set of
sync_file_range() system calls causes the system to start writing back
each file once it has finished being extracted.  It doesn't actually
wait for the write to finish; it just starts the writeback.

The second series of sync_file_range() calls, with the operation
SYNC_FILE_RANGE_WAIT_BEFORE, will block until the previously initiated
writeback has completed.  This basically ensures that the delayed
allocation has been resolved; that is, the data blocks have been
allocated and written, and the inode updated (in memory), but not
necessarily pushed out to disk.

The fdatasync() call will actually force the inode to disk.  In the
case of the ext4 file system, the first fdatasync() will actually push
all of the inodes to disk, and all of the subsequent fdatasync() calls
are in fact no-ops (assuming that files 'a', 'b', and 'c' are all on
the same file system).  But what it means is that it keeps the number
of (heavyweight) jbd2 commits to a minimum.

It uses a linux-specific system call --- sync_file_range --- but the
result should be faster performance across the board for all file
systems.  So I don't consider this an ext4-specific hack, although it
probably does make things faster for ext4 more than for any other file
system.

I've attached the program I used to test and prove this mechanism, as
well as the kernel tracepoint script I used to debug why (c) wasn't
working, which might be of interest to folks on debian-kernel.
Basically it's a demonstration of how cool ftrace is.  :-)

But using this program on a file system composed of a 5400rpm laptop
drive running LVM and LUKS, I get:

mass-sync-tester -d:dpkg current: time:  0.83/ 0.01/ 0.00

versus

mass-sync-tester -n:dpkg fixed: time:  0.07/ 0.00/ 0.01

   - Ted

/*
 * Mass sync tester
 */

#define _GNU_SOURCE

#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/time.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sys/resource.h>
#include <getopt.h>
#include <errno.h>
#include <string.h>

void write_file(const char *name, int sync, int sync_range)
{
	int	fd, i, ret;
	char	buf[1024];

	fd = open(name, O_WRONLY|O_TRUNC|O_CREAT, 0666);
	if (fd < 0) {
		fprintf(stderr, "open(%s) in write_file: %s\n",
			name, strerror(errno));
		exit(1);
	}
	memset(buf, 0, sizeof(buf));
	for (i = 0; i < 16; i++) {
		ret = write(fd, buf, sizeof(buf));
		if (ret < 0) {
			fprintf(stderr, "writing %s: %s\n",
				name, strerror(errno));
			exit(1);
		}
	}
	if (sync) {
		ret = fsync(fd);
		if (ret < 0) {
			fprintf(stderr, "fsyncing %s in write_file: %s\n",
				name, strerror(errno));
			exit(1);
		}
	}
	if (sync_range) {
		ret = sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);
		if (ret < 0) {
			fprintf(stderr, "sync_file_range %s in write_file: %s\n",
				name, strerror(errno));
			exit(1);
		}
	}
	ret = close(fd);
	if (ret < 0) {
		fprintf(stderr, "closing %s in write_file: %s\n",
			name, strerror(errno));
		exit(1);
	}
}

void rename_file(const char *src, const char *dest)
{
	int ret;

	ret = rename(src, dest);
	if (ret) {
		fprintf(stderr, "renaming %s to %s: %s\n", src, dest,
			strerror(errno));
		exit(1);
	}
}

void sync_file(const char *name)
{
	int	fd, ret;

	fd = open(name, O_RDONLY|O_NOATIME, 0666);
	if (fd < 0) {
		fprintf(stderr, "open(%s) in sync_file: %s\n",
			name, strerror(errno));
		exit(1);
	}
	ret = fsync(fd);
	if (ret < 0) {
		fprintf(stderr, "fsyncing %s in sync_file: %s\n",
			name, strerror(errno));
		exit(1);
	}
	ret = close(fd);
	if (ret < 0) {
		fprintf(stderr, "closing %s in sync_file: %s\n",
			name, strerror(errno));
		exit(1);
	}
}

void datasync_file(const char *name)
{
	int	fd, ret;

	fd = open(name, O_RDONLY|O_NOATIME, 0666);
	if (fd < 0) {
		fprintf(stderr, "open(%s) in datasync_file: %s\n",
			name, strerror(errno));
		exit(1);
	}
	ret = fdatasync(fd);
	if (ret < 0) {
		fprintf(stderr, "fdatasyncing %s in datasync_file: %s\n",
			name, strerror(errno));
		exit(1);
	}
	ret = close(fd);
	if (ret < 0) {
		fprintf(stderr, "closing %s in datasync_file: %s\n",
			name, strerror(errno));
		exit(1);
	}
}

Re: Bug#605009: serious performance regression with ext4

2010-11-26 Thread Ted Ts'o
On Fri, Nov 26, 2010 at 03:53:27PM +0100, Raphael Hertzog wrote:
 Just to sum up what dpkg --unpack does in 1.15.8.6:
 1/ set the package status as half-installed/reinst-required
 2/ extract all the new files as *.dpkg-new
 3/ for all the unpacked files: fsync(foo.dpkg-new) followed by
rename(foo.dpkg-new, foo)

What are you doing?

1) Suppose the package contains files a, b, and c.  Which are you
doing?

a)  extract a.dpkg-new ; fsync(a.dpkg-new); rename(a.dpkg-new, a);
extract b.dpkg-new ; fsync(b.dpkg-new); rename(b.dpkg-new, b);
extract c.dpkg-new ; fsync(c.dpkg-new); rename(c.dpkg-new, c);

or

b)  extract a.dpkg-new ; fsync(a.dpkg-new);
extract b.dpkg-new ; fsync(b.dpkg-new);
extract c.dpkg-new ; fsync(c.dpkg-new);
rename(a.dpkg-new, a);
rename(b.dpkg-new, b);
rename(c.dpkg-new, c);

or

c)  extract(a.dpkg-new);
extract(b.dpkg-new);
extract(c.dpkg-new);
fsync(a.dpkg-new);
fsync(b.dpkg-new);
fsync(c.dpkg-new);
rename(a.dpkg-new, a);
rename(b.dpkg-new, b);
rename(c.dpkg-new, c);


(c) will perform the best for most file systems, including ext4.  As a
further optimization, if b and c do not exist, of course it
would be better to extract into b and c directly and skip the
renames, i.e.:

d)  extract(a.dpkg-new);
extract(b); # assuming the file b does not yet exist
extract(c); # assuming the file c does not yet exist
fsync(a.dpkg-new);
fsync(b);
fsync(c);
rename(a.dpkg-new, a);

... and then set the package status as unpacked.

I am guessing you are doing (a) today --- am I right?  (c) or (d)
would be best.

- Ted

