from:"Theodore Tso"

Re: [PATCH] x86: add the debugfs interface for the sysprof tool

2008-02-24 Thread Theodore Tso

On Sun, Feb 24, 2008 at 04:32:40PM +, John Levon wrote:
> > There are plenty of things that can be done, including using search
> > paths to try to find vmlinuz; or maybe even proposing a new standard
> > such as say for example /lib/modules/`uname -r`/vmlinux being a
> 
> At the time when I was trying to fix this, I wasn't aware of any way to
> propose a new standard and get distributions to follow it - is there
> some way now? Informally I discussed this problem several times with
> many people without any resolution. As regards searching informal paths,
> this is extremely risky - get the wrong vmlinux and we end up with
> inaccurate results, which is worse than no results.

The way that /lib/modules/`uname -r`/build was standardized was via
mail to LKML, years ago.  It was declared so, "make install" for base
kernel Makefiles did that, and the distros picked it up pretty
quickly thereafter.

In terms of picking the right vmlinux, how about a kernel patch which
stashes the MD5 checksum of vmlinux in a convenient location in the
compressed kernel, which can be pulled out via querying
/sys/kernel/vmlinux_cksum?  If the problem is making sure you have the
right vmlinux, there are some fairly simple ways of assuring this ---
it's just a matter of thinking creatively.
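
As a rough illustration of the check described above, here is a minimal
sketch in C.  It assumes the proposed /sys/kernel/vmlinux_cksum file
exists and holds a 32-digit hex MD5 digest; that file is only a
suggestion in this thread, not an existing kernel interface, and the
helper names are made up for the example.  Computing the digest of the
candidate vmlinux is left to an external tool such as md5sum.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>

/* Hypothetical path: proposed in this thread, not an existing kernel file. */
#define VMLINUX_CKSUM_PATH "/sys/kernel/vmlinux_cksum"
#define MD5_HEX_LEN 32

/* Return 1 if the rest of the string is only whitespace (e.g. a newline). */
int only_space(const char *s)
{
	for (; *s; s++)
		if (!isspace((unsigned char) *s))
			return 0;
	return 1;
}

/*
 * Compare two hex digests, ignoring case and trailing whitespace.
 * Returns 1 only if both are well-formed 32-digit digests and equal.
 */
int cksum_matches(const char *a, const char *b)
{
	size_t i;

	assert(a != NULL && b != NULL);
	for (i = 0; i < MD5_HEX_LEN; i++) {
		/* A short string hits its NUL here and is rejected. */
		if (!isxdigit((unsigned char) a[i]) ||
		    !isxdigit((unsigned char) b[i]))
			return 0;
		if (tolower((unsigned char) a[i]) !=
		    tolower((unsigned char) b[i]))
			return 0;
	}
	return only_space(a + i) && only_space(b + i);
}

/* Read one line (the digest) from path; 0 on success, -1 on failure. */
int read_kernel_cksum(const char *path, char *buf, size_t len)
{
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	if (!fgets(buf, (int) len, f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	return 0;
}
```

A profiler could call read_kernel_cksum(VMLINUX_CKSUM_PATH, ...) once and
refuse any candidate vmlinux whose digest fails cksum_matches(), which
addresses the "wrong vmlinux is worse than no results" concern.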

> > The abdication of responsibility and the lack of trying to solve the
> > usability issues is one of the things that really worries me about
> > *all* of Linux's RAS tools.  We can and should do better!  And it's
> > really embarrassing that the RAS maintainers (I assume you are one
> > of the oprofile maintainers) seem to be blaming this on the victims,
> > the people who are complaining about using *your* tool.  Yes, it's a

Let me make it clear that the problems go far beyond oprofile.  I have
similar issues of disquietude about the ease of use of SystemTap,
kdump, and all of the other RAS system tools.  It may be the problem
is that the companies who fund the development of the RAS tools are
stopping before they can be made turn-key and easy to use by kernel
developers, and are instead assuming that the distros will do all of
the hard work productizing them and actually making them *usable*.

The problem is that not enough mainline kernel developers use these
tools, mostly because they aren't easy enough for them to use.  I
remember complaining about kdump, and I got the same answer, "Oh, it's
the distro's job to make it easy to use."  Which is fine, except that
means very few people actually use it (how many kernel developers use
RHEL and SLES as their day-to-day development OS, as opposed to Fedora
or Debian, et al.?), and since there aren't lots of kernel developers
using it, once the people who are funded to support the RAS tools get
reassigned to other projects, what's left is in terrible shape to be
used by mainline kernel developers, and then the tools effectively
become unused and then unmaintained.

- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] x86: add the debugfs interface for the sysprof tool

2008-02-24 Thread Theodore Tso

On Sat, Feb 23, 2008 at 01:53:35PM +, John Levon wrote:
> On Sat, Feb 23, 2008 at 12:37:24PM +0100, Ingo Molnar wrote:
> 
> > It's 200 lines of pretty well isolated code for something that is 
> > already much more usable to me than 10 years of oprofile. Really, i'd 
> > much rather take 200 lines of poor kernel code written by a userspace 
> > developer for stuff that _already works better_, than to have ~2000 
> > lines of oprofile code and an unusable (to me) user-space tool written 
> > by kernel developers.

I think it's fair to say that both oprofile and sysprof can use some
improvements.  There are a couple of questions that immediately come
to mind, including the most obvious one: *if*, as you, John, claim, the
oprofile kernel code had all of the functionality for the GUI, why wasn't
it used --- could it *perhaps* be because the kernel interface for
oprofile wasn't documented well?  Heck, even if sysprof is 200 lines
of code versus 2000 lines of kernel code, most people don't write
extra code unless the 2000 lines of pre-existing code aren't
documented well enough.

> Firstly, the distributions should have set this up automatically. That
> they don't is a distributor bug. The sheer madness of Linux not leaving
> a vmlinux file in a stable known location is hardly something oprofile
> can be blamed for.

Wrong Answer.  People who write userspace helpers *have* to do the
work of the distros.  It's a bad, bad, bad, Bad, BAD idea to leave it
up to the distributions.  It means that some distributions won't get
it right; other distributions will do it in different ways, making it
harder for users to switch between distros and making it harder for
people to write distribution-neutral HOWTOs.

There are plenty of things that can be done, including using search
paths to try to find vmlinuz; or maybe even proposing a new standard
such as say for example /lib/modules/`uname -r`/vmlinux being a
symlink to the location of vmlinux.  We already have
/lib/modules/`uname -r`/build and /lib/modules/`uname -r`/source, for
example.
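
To make the proposal concrete, here is a minimal sketch in C of how a
userspace tool might look for the kernel image.  It assumes the
/lib/modules/`uname -r`/vmlinux symlink suggested above, which is only a
proposal and not something distributions are known to provide; the
function names are invented for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/utsname.h>
#include <unistd.h>

/*
 * Build the proposed per-release path /lib/modules/<release>/vmlinux.
 * Returns 0 on success, -1 if the buffer is too small.
 */
int vmlinux_path_for(const char *release, char *buf, size_t len)
{
	int n;

	assert(release != NULL && buf != NULL);
	n = snprintf(buf, len, "/lib/modules/%s/vmlinux", release);
	return (n < 0 || (size_t) n >= len) ? -1 : 0;
}

/*
 * Look for the running kernel's vmlinux at the proposed location.
 * Returns 0 and fills buf if a readable file is there, -1 otherwise.
 */
int find_vmlinux(char *buf, size_t len)
{
	struct utsname u;

	if (uname(&u) != 0)
		return -1;
	if (vmlinux_path_for(u.release, buf, len) != 0)
		return -1;
	return access(buf, R_OK) == 0 ? 0 : -1;
}
```

A real tool would likely try several candidate locations in order and
fall back to asking the user only when none of them exists.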

The abdication of responsibility and the lack of trying to solve the
usability issues is one of the things that really worries me about
*all* of Linux's RAS tools.  We can and should do better!  And it's
really embarrassing that the RAS maintainers (I assume you are one
of the oprofile maintainers) seem to be blaming this on the victims,
the people who are complaining about using *your* tool.  Yes, it's a
shame that Ingo didn't try to fix your tool; open source, and scratch
your own itch and all of that.  To be sure.  But at the *same* *time*
don't you have enough pride to have taken a look at a tool which so
obviously has massive lacks in the usability department, and tried to
fix it years ago?  There's more than enough blame to go around twenty
times over, I would think.

- Ted

Re: Announce: Linux-next (Or Andrew's dream :-))

2008-02-21 Thread Theodore Tso

On Wed, Feb 20, 2008 at 07:13:16PM +0200, Adrian Bunk wrote:
> > A third option would be if people add new functions (with no users) in
> > -rc2 or -rc3 timeframes as long as it is part of a fully reviewed
> > patch with users that will use those new features in various kernel
> > development trees.
> >...
> 
> I don't like suggestions based on unrealistic assumptions like
> "a fully reviewed patch".
> 
> E.g. userspace ABI's are much more stable and everyone is aware that 
> they must be gotten right with the first try since they are then cast in 
> stone - but we all remember the recent timerfd fiasco.

I'm talking about kernel interfaces, not userspace APIs.  And we can
change them if they are wrong, since they *are* kernel interfaces; but
if they are correct, they ease the cross-tree merge pain.

 - Ted

Re: Announce: Linux-next (Or Andrew's dream :-))

2008-02-20 Thread Theodore Tso

On Wed, Feb 20, 2008 at 04:38:52PM +0100, Stefan Richter wrote:
> Two things may largely eliminate the need for parallel branches.
> 
> 1. Do infrastructure changes and whole tree wide refactoring etc. in a
> compatible manner with a brief but nonzero transition period.
> 
> 2. Insert a second merge window right after the usual merge window for
> changes which cannot be well done with a transition period.

A third option would be if people add new functions (with no users) in
-rc2 or -rc3 timeframes as long as it is part of a fully reviewed
patch with users that will use those new features in various kernel
development trees.

Since there wouldn't be any users in Linus's tree, there's no risk in
making those functions available in mainline ahead of time.

- Ted

Re: very poor ext3 write performance on big filesystems?

2008-02-18 Thread Theodore Tso

On Mon, Feb 18, 2008 at 05:16:55PM +0100, Tomasz Chmielewski wrote:
> Theodore Tso schrieb:
>
>> I'd really need to know exactly what kind of operations you were
>> trying to do that were causing problems before I could say for sure.
>> Yes, you said you were removing unneeded files, but how were you doing
>> it?  With rm -r of old hard-linked directories?
>
> Yes, with rm -r.

You should definitely try the spd_readdir hack; that will help reduce
the seek times.  This will probably help on any block-group-oriented
filesystem, including XFS, etc.

>> How big are the
>> average files involved?  Etc.
>
> It's hard to estimate the average size of a file. I'd say there are not 
> many files bigger than 50 MB.

Well, Ext4 will help for files bigger than 48k.

The other thing that might help for you is using an external journal
on a separate hard drive (either for ext3 or ext4).  That will help
alleviate some of the seek storms going on, since the journal is
written to only sequentially, and putting it on a separate hard drive
will help remove some of the contention on the hard drive.  

I assume that your 1.2 TB filesystem is located on a RAID array; did
you use the mke2fs -E stride option to make sure all of the bitmaps
don't get concentrated on one hard drive spindle?  One of the failure
modes which can happen is if you use a 4+1 raid 5 setup, that all of
the block and inode bitmaps can end up getting laid out on a single
hard drive, so it becomes a bottleneck for bitmap intensive workloads
--- including "rm -rf".  So that's another thing that might be going
on.  If you do a "dumpe2fs", and look at the block numbers for the
block and inode allocation bitmaps, and you find that they are all
landing on the same physical hard drive, then that's very clearly the
biggest problem given an "rm -rf" workload.  You should be able to see
this as well visually; if one hard drive has its hard drive light
almost constantly on, and the other ones don't have much activity,
that's probably what is happening.
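
The failure mode above can be sketched with a little arithmetic.  This
is a deliberately simplified model in C, not what mke2fs or the RAID
driver actually compute: it treats the array as four data disks striped
round-robin, ignores RAID-5 parity rotation, and assumes the bitmaps
sit at the start of each block group.  The numbers (32768 blocks per
group, a chunk of 16 blocks) are example values, and the function names
are made up.

```c
#include <assert.h>

/*
 * Simplified model: filesystem blocks are striped round-robin over
 * ndisks data disks in chunks of `stride` blocks.  Parity rotation
 * is ignored.  Returns the index of the disk holding block blk.
 */
unsigned int disk_for_block(unsigned long long blk, unsigned int stride,
			    unsigned int ndisks)
{
	assert(stride > 0 && ndisks > 0);
	return (unsigned int) ((blk / stride) % ndisks);
}

/*
 * Count how many of the first ngroups block groups begin on the given
 * disk, i.e. how many groups would keep their bitmaps there if the
 * bitmaps sit at the start of each group (an assumption of this model).
 */
unsigned int groups_on_disk(unsigned int disk, unsigned int ngroups,
			    unsigned long long blocks_per_group,
			    unsigned int stride, unsigned int ndisks)
{
	unsigned int g, count = 0;

	for (g = 0; g < ngroups; g++)
		if (disk_for_block(g * blocks_per_group, stride,
				   ndisks) == disk)
			count++;
	return count;
}
```

With 32768 blocks per group and a 16-block chunk, each group spans 2048
chunks, which is a multiple of four, so in this model every group starts
on the same disk.  Comparing that with the block numbers that dumpe2fs
reports is one way to check whether a real filesystem is affected.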

- Ted

Re: very poor ext3 write performance on big filesystems?

2008-02-18 Thread Theodore Tso

On Mon, Feb 18, 2008 at 04:57:25PM +0100, Andi Kleen wrote:
> > Use cp
> > or a tar pipeline to move the files.
> 
> Are you sure cp handles hardlinks correctly? I know tar does,
> but I have my doubts about cp.

I *think* GNU cp does the right thing with --preserve=links.  I'm not
100% sure, though --- like you, probably, I always use tar for moving
or copying directory hierarchies.

   - Ted

Re: very poor ext3 write performance on big filesystems?

2008-02-18 Thread Theodore Tso

On Mon, Feb 18, 2008 at 04:02:36PM +0100, Tomasz Chmielewski wrote:
> I tried to copy that filesystem once (when it was much smaller) with "rsync 
> -a -H", but after 3 days, rsync was still building an index and didn't copy 
> any file.

If you're going to copy the whole filesystem don't use rsync!  Use cp
or a tar pipeline to move the files.

> Also, as files/hardlinks come and go, it would degrade again.

Yes...

> Are there better choices than ext3 for a filesystem with lots of hardlinks? 
> ext4, once it's ready? xfs?

All filesystems are going to have problems keeping inodes close to
directories when you have huge numbers of hard links.

I'd really need to know exactly what kind of operations you were
trying to do that were causing problems before I could say for sure.
Yes, you said you were removing unneeded files, but how were you doing
it?  With rm -r of old hard-linked directories?  How big are the
average files involved?  Etc.

- Ted

Re: [2.6 patch] fs/jbd/journal.c: cleanups

2008-02-18 Thread Theodore Tso

On Mon, Feb 18, 2008 at 02:31:40PM +0100, Ingo Molnar wrote:
> i guess this explains what static code metrics already suggest:

Am I right in assuming that code-quality is just a program which runs
checkpatch.pl and measures the number of warnings and calls them
errors?

- Ted

Re: very poor ext3 write performance on big filesystems?

2008-02-18 Thread Theodore Tso

On Mon, Feb 18, 2008 at 04:18:23PM +0100, Andi Kleen wrote:
> On Mon, Feb 18, 2008 at 09:16:41AM -0500, Theodore Tso wrote:
> > ext3 tries to keep inodes in the same block group as their containing
> > directory.  If you have lots of hard links, obviously it can't really
> > do that, especially since we don't have a good way at mkdir time to
> > tell the filesystem, "Psst!  This is going to be a hard link clone of
> > that directory over there, put it in the same block group".
> 
> Hmm, you think such a hint interface would be worth it?

It would definitely help ext2/3/4.  An interesting question is whether
it would help enough other filesystems that it's worth adding.

> > necessarily removing the dir_index feature.  Dir_index speeds up
> > individual lookups, but it slows down workloads that do a readdir
> 
> But only for large directories right? For kernel source like
> directory sizes it seems to be a general loss.

On my todo list is a hack which does the sorting of directory inodes
by inode number inside the kernel for smallish directories (say, less
than 2-3 blocks) where using the kernel memory space to store the
directory entries is acceptable, and which would speed up dir_index
performance for kernel source-like directory sizes --- without needing
to use the spd_readdir LD_PRELOAD hack.

But yes, right now, if you know that your directories are almost
always going to be kernel-source-like in size, then omitting dir_index
is probably going to be a good idea.

- Ted

Re: very poor ext3 write performance on big filesystems?

2008-02-18 Thread Theodore Tso

On Mon, Feb 18, 2008 at 03:03:44PM +0100, Andi Kleen wrote:
> Tomasz Chmielewski <[EMAIL PROTECTED]> writes:
> >
> > Is it normal to expect the write speed go down to only few dozens of
> > kilobytes/s? Is it because of that many seeks? Can it be somehow
> > optimized? 
> 
> I have similar problems on my linux source partition which also
> has a lot of hard linked files (although probably not quite
> as many as you do). It seems like hard linking prevents
> some of the heuristics ext* uses to generate non fragmented
> disk layouts and the resulting seeking makes things slow.

ext3 tries to keep inodes in the same block group as their containing
directory.  If you have lots of hard links, obviously it can't really
do that, especially since we don't have a good way at mkdir time to
tell the filesystem, "Psst!  This is going to be a hard link clone of
that directory over there, put it in the same block group".
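
For readers who want to check this on their own filesystem: in ext2/3
the block group of an inode follows directly from its number, since
inode numbers start at 1 and each group holds a fixed count of them.
A minimal sketch in C, with invented function names and with the
inodes-per-group value passed in (dumpe2fs reports the real one as
"Inodes per group"):

```c
#include <assert.h>

/*
 * ext2/3 inode numbers start at 1 and are handed out in fixed-size
 * runs per block group, so the group index is (ino - 1) / per_group.
 */
unsigned long inode_group(unsigned long ino, unsigned long inodes_per_group)
{
	assert(ino >= 1 && inodes_per_group > 0);
	return (ino - 1) / inodes_per_group;
}

/* Return 1 if a directory and a file inode live in the same group. */
int same_group(unsigned long dir_ino, unsigned long file_ino,
	       unsigned long inodes_per_group)
{
	return inode_group(dir_ino, inodes_per_group) ==
	       inode_group(file_ino, inodes_per_group);
}
```

Running same_group() over the output of "ls -i" for a hard-linked tree
gives a rough picture of how scattered the inodes are relative to the
directory that names them.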

> What has helped a bit was to recreate the file system with -O^dir_index
> dir_index seems to cause more seeks.

Part of it may have simply been recreating the filesystem, not
necessarily removing the dir_index feature.  Dir_index speeds up
individual lookups, but it slows down workloads that do a readdir
followed by a stat of all of the files in the directory.  You can work
around this by calling readdir(), sorting all of the entries by inode
number, and then calling open or stat or whatever.  So this can help
out for workloads that are doing find or rm -r on a dir_index
filesystem.  Basically, it helps for some things, hurts for others.
Once things are in the cache it doesn't matter of course.
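
The workaround described above, done directly in an application rather
than through a preload library, might look roughly like this.  It is a
minimal sketch in C with invented names (struct entry, read_sorted,
stat_in_inode_order); it reads the whole directory into memory first,
so it assumes the directory fits comfortably in RAM.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>

struct entry {
	ino_t ino;
	char name[256];
};

/* Order entries by inode number, comparing rather than subtracting. */
static int entry_cmp(const void *a, const void *b)
{
	const struct entry *ea = a, *eb = b;

	return (ea->ino > eb->ino) - (ea->ino < eb->ino);
}

/*
 * Read every entry of path and sort the result by inode number.
 * Returns the entry count (caller frees *out) or -1 on error.
 */
long read_sorted(const char *path, struct entry **out)
{
	struct entry *list = NULL, *tmp;
	size_t num = 0, max = 0;
	struct dirent *d;
	DIR *dir;

	assert(path != NULL && out != NULL);
	dir = opendir(path);
	if (!dir)
		return -1;
	while ((d = readdir(dir)) != NULL) {
		if (num == max) {
			max = max ? max * 2 : 64;
			tmp = realloc(list, max * sizeof(*list));
			if (!tmp) {
				free(list);
				closedir(dir);
				return -1;
			}
			list = tmp;
		}
		list[num].ino = d->d_ino;
		snprintf(list[num].name, sizeof(list[num].name), "%s",
			 d->d_name);
		num++;
	}
	closedir(dir);
	if (num)
		qsort(list, num, sizeof(*list), entry_cmp);
	*out = list;
	return (long) num;
}

/* lstat() each entry in inode order; returns how many succeeded. */
long stat_in_inode_order(const char *path)
{
	struct entry *list;
	struct stat st;
	char full[4096];
	long i, ok = 0, n = read_sorted(path, &list);

	if (n < 0)
		return -1;
	for (i = 0; i < n; i++) {
		snprintf(full, sizeof(full), "%s/%s", path, list[i].name);
		if (lstat(full, &st) == 0)
			ok++;
	}
	free(list);
	return ok;
}
```

The point of the sort is that the inode table is then walked roughly
sequentially instead of in hash order, which is where the seeks on a
cold cache come from.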

The following LD_PRELOAD hack can help in some cases.  Mutt has this hack
encoded in for maildir directories, which helps.

> Also keeping enough free space is also a good idea because that
> allows the file system code better choices on where to place data.

Yep, that too.

- Ted

/*
 * readdir accelerator
 *
 * (C) Copyright 2003, 2004 by Theodore Ts'o.
 *
 * Compile using the command:
 *
 * gcc -o spd_readdir.so -fPIC -shared spd_readdir.c -ldl
 *
 * Use it by setting the LD_PRELOAD environment variable:
 * 
 * export LD_PRELOAD=/usr/local/sbin/spd_readdir.so
 *
 * %Begin-Header%
 * This file may be redistributed under the terms of the GNU Public
 * License.
 * %End-Header%
 * 
 */

#define ALLOC_STEPSIZE	100
#define MAX_DIRSIZE	0

#define DEBUG

#ifdef DEBUG
#define DEBUG_DIR(x)	{if (do_debug) { x; }}
#else
#define DEBUG_DIR(x)
#endif

#define _GNU_SOURCE
#define __USE_LARGEFILE64

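/* Header names reconstructed from the functions and types used below. */
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <dirent.h>
#include <errno.h>
#include <dlfcn.h>
#include <sys/types.h>
#include <sys/stat.h>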

struct dirent_s {
	unsigned long long d_ino;
	long long d_off;
	unsigned short int d_reclen;
	unsigned char d_type;
	char *d_name;
};

struct dir_s {
	DIR	*dir;
	int	num;
	int	max;
	struct dirent_s *dp;
	int	pos;
	int	fd;
	struct dirent ret_dir;
	struct dirent64 ret_dir64;
};

static int (*real_closedir)(DIR *dir) = 0;
static DIR *(*real_opendir)(const char *name) = 0;
static struct dirent *(*real_readdir)(DIR *dir) = 0;
static struct dirent64 *(*real_readdir64)(DIR *dir) = 0;
static off_t (*real_telldir)(DIR *dir) = 0;
static void (*real_seekdir)(DIR *dir, off_t offset) = 0;
static int (*real_dirfd)(DIR *dir) = 0;
static unsigned long max_dirsize = MAX_DIRSIZE;
static int num_open = 0;
#ifdef DEBUG
static int do_debug = 0;
#endif

static void setup_ptr()
{
	char *cp;

	real_opendir = dlsym(RTLD_NEXT, "opendir");
	real_closedir = dlsym(RTLD_NEXT, "closedir");
	real_readdir = dlsym(RTLD_NEXT, "readdir");
	real_readdir64 = dlsym(RTLD_NEXT, "readdir64");
	real_telldir = dlsym(RTLD_NEXT, "telldir");
	real_seekdir = dlsym(RTLD_NEXT, "seekdir");
	real_dirfd = dlsym(RTLD_NEXT, "dirfd");
	if ((cp = getenv("SPD_READDIR_MAX_SIZE")) != NULL) {
		max_dirsize = atol(cp);
	}
#ifdef DEBUG
	if (getenv("SPD_READDIR_DEBUG"))
		do_debug++;
#endif
}

static void free_cached_dir(struct dir_s *dirstruct)
{
	int i;

	if (!dirstruct->dp)
		return;

	for (i=0; i < dirstruct->num; i++) {
		free(dirstruct->dp[i].d_name);
	}
	free(dirstruct->dp);
	dirstruct->dp = 0;
}	

static int ino_cmp(const void *a, const void *b)
{
	const struct dirent_s *ds_a = (const struct dirent_s *) a;
	const struct dirent_s *ds_b = (const struct dirent_s *) b;
	ino_t i_a, i_b;
	
	i_a = ds_a->d_ino;
	i_b = ds_b->d_ino;

	if (ds_a->d_name[0] == '.') {
		if (ds_a->d_name[1] == 0)
			i_a = 0;
		else if ((ds_a->d_name[1] == '.') && (ds_a->d_name[2] == 0))
			i_a = 1;
	}
	if (ds_b->d_name[0] == '.') {
		if (ds_b->d_name[1] == 0)
			i_b = 0;
		else if ((ds_b->d_name[1] == '.') && (ds_b->d_name[2] == 0))
			i_b = 1;
	}

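	/*
	 * Compare rather than subtract: the difference of two inode
	 * numbers can wrap or be truncated when converted to int.
	 */
	return (i_a > i_b) - (i_a < i_b);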
}


DIR *opendir(const char *name)
{
	DIR *dir;
	struct dir_s	*dirstruct;
	struct dirent_s *ds, *dnew;
	struct dirent64 *d;
	struct stat st;

	if (!real_opendir)
		setup_ptr();

	DEBUG_DIR(printf("Opendir(%s) (%d open)\n", name, num_open++));
	dir = (*real_opendir)(name);
	if (!dir)
		return NULL;

	dirstruct = malloc(sizeof(struct

Re: [2.6 patch] fs/jbd/journal.c: cleanups

2008-02-18 Thread Theodore Tso

> So please deal with it like most other subsystem maintainers do and stop 
> complaining about "code churn" - nobody but you changes the ext3 
> codebase, it's one of the codebases least affected by general kernel 
> flux, it's an ultimate "leaf" subsystem.

Right, sorry.  I misread the filename; I thought this was against
fs/jbd2, instead of fs/jbd.

- Ted

Re: [2.6 patch] fs/jbd/journal.c: cleanups

2008-02-18 Thread Theodore Tso

On Mon, Feb 18, 2008 at 03:12:09PM +0200, Adrian Bunk wrote:
> If me resending this old patch collides with something finally getting a 
> user this part of my patch shouldn't be applied now (but you might get 
> it again in 6 months if it's still unused...).
> 
> But generally such conflicts would become visible if "known development 
> trees that are intended for mainline" were in -mm.

It *has* been in -mm, except for periods when akpm has dropped it
because of a conflict with the r/o bind patch, which itself traces back
to the doctrinaire "must have an in-tree user" attitude.

Did you actually try to do a compile test, or only make sure the patch
would apply?  The patch won't collide at application time, but it
would when you compile it.

- Ted

Re: [2.6 patch] fs/jbd/journal.c: cleanups

2008-02-18 Thread Theodore Tso

On Mon, Feb 18, 2008 at 08:12:29AM +0100, Ingo Molnar wrote:
> > Nack.  I don't object to un-exporting journal_update_superblock(), 
> > because that is pretty internal, but the other functions are intended 
> > specifically for use by code outside of JBD.  For example, the journal 
> > checksum patch for ext3/4 uses journal_set_features() to turn on 
> > features in the JBD superblock.
> > 
> > Similarly, for 64-bit support in ext4 uses journal_set_features() to 
> > set a 64-bit feature flag in the journal superblock.
> 
> that's an invalid excuse for the benefit of out-of-tree forks: reality 
> is that you can export those functions in the "journal checksum patch" 
> just fine. So you cannot 'nack' a sensible patch on that ground and no 
> maintainer does it on that ground. Once you get your stuff upstream, you 
> can re-add the export.

I'm going to NACK it as well.  This kind of code churn where we make
symbols static only to make them non-static again in an existing ext4
tree is exactly the sort of needless code churn that makes patches
start to conflict and where we need different patches depending on
whether it is intended for -mm or linux-next or mainline.

I think we really have gotten WAY too doctrinaire about the rule that
if there are no in-tree users, it MUST be static.  This is exactly the
sort of mindless rule that causes the patch conflicts that have been
causing us so much pain and grief.  In this case, it is an existing
symbol which is already non-static, and for which we have code in a
development tree that will be using it.  In the r/o bind case, it is
the insistence that you can't first push a patch exposing a new
interface that is only used later in the r/o bind patchset, a patchset
which sweeps across all trees changing stuff, that causes pain and grief.

In both cases, if we expand "in-tree" development users to include
known development trees that are intended for mainline, it makes all
of our lives MUCH easier.

- Ted

Re: very poor ext3 write performance on big filesystems?

2008-02-18 Thread Theodore Tso

On Mon, Feb 18, 2008 at 03:03:44PM +0100, Andi Kleen wrote:
> Tomasz Chmielewski <[EMAIL PROTECTED]> writes:
> 
> > Is it normal to expect the write speed go down to only few dozens of
> > kilobytes/s? Is it because of that many seeks? Can it be somehow
> > optimized? 
> 
> I have similar problems on my linux source partition which also
> has a lot of hard linked files (although probably not quite
> as many as you do). It seems like hard linking prevents
> some of the heuristics ext* uses to generate non fragmented
> disk layouts and the resulting seeking makes things slow.

ext3 tries to keep inodes in the same block group as their containing
directory.  If you have lots of hard links, obviously it can't really
do that, especially since we don't have a good way at mkdir time to
tell the filesystem, "Psst!  This is going to be a hard link clone of
that directory over there, put it in the same block group."

> What has helped a bit was to recreate the file system with -O^dir_index
> dir_index seems to cause more seeks.

Part of it may have simply been recreating the filesystem, not
necessarily removing the dir_index feature.  Dir_index speeds up
individual lookups, but it slows down workloads that do a readdir
followed by a stat of all of the files in the directory.  You can work
around this by calling readdir(), sorting all of the entries by inode
number, and then calling open or stat or whatever.  So this can help
out for workloads that are doing find or rm -r on a dir_index
filesystem.  Basically, dir_index helps for some things, hurts for
others.  Once things are in the cache it doesn't matter, of course.
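
A minimal sketch of the readdir-then-sort workaround described above,
written as an ordinary function rather than an LD_PRELOAD shim.  The
names struct ent, cmp_ino() and list_sorted_by_inode() are made up for
this illustration and are not part of any library:

```c
#include <assert.h>
#include <dirent.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

/* One cached directory entry: just the inode number and the name. */
struct ent {
	unsigned long long ino;
	char name[256];
};

/* Order by inode number.  Comparing instead of subtracting keeps the
 * result correct even when inode numbers do not fit in an int. */
static int cmp_ino(const void *a, const void *b)
{
	const struct ent *ea = a, *eb = b;

	return (ea->ino > eb->ino) - (ea->ino < eb->ino);
}

/* Read every entry of `path`, then sort the entries by inode number.
 * Returns 0 on success and -1 on error; the caller frees *out. */
static int list_sorted_by_inode(const char *path, struct ent **out, size_t *n)
{
	DIR *dir;
	struct dirent *d;
	struct ent *v = NULL, *tmp;
	size_t used = 0, cap = 0, len;

	assert(out != NULL && n != NULL);
	dir = opendir(path);
	if (!dir)
		return -1;
	while ((d = readdir(dir)) != NULL) {
		if (used == cap) {
			cap = cap ? cap * 2 : 64;
			tmp = realloc(v, cap * sizeof(*v));
			if (!tmp) {
				free(v);
				closedir(dir);
				return -1;
			}
			v = tmp;
		}
		v[used].ino = d->d_ino;
		len = strlen(d->d_name);
		if (len >= sizeof(v[used].name))
			len = sizeof(v[used].name) - 1;
		memcpy(v[used].name, d->d_name, len);
		v[used].name[len] = '\0';
		used++;
	}
	closedir(dir);
	if (used)
		qsort(v, used, sizeof(*v), cmp_ino);
	*out = v;
	*n = used;
	return 0;
}
```

A caller would then walk the returned array in order and stat() or
unlink() each name, so that the inode table is visited roughly
sequentially instead of in hash order.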

The following LD_PRELOAD library can help in some cases.  Mutt has this
hack encoded in for maildir directories, which helps.

> Also keeping enough free space is also a good idea because that
> allows the file system code better choices on where to place data.

Yep, that too.

- Ted

/*
 * readdir accelerator
 *
 * (C) Copyright 2003, 2004 by Theodore Ts'o.
 *
 * Compile using the command:
 *
 * gcc -o spd_readdir.so -shared -fPIC spd_readdir.c -ldl
 *
 * Use it by setting the LD_PRELOAD environment variable:
 * 
 * export LD_PRELOAD=/usr/local/sbin/spd_readdir.so
 *
 * %Begin-Header%
 * This file may be redistributed under the terms of the GNU Public
 * License.
 * %End-Header%
 * 
 */

#define ALLOC_STEPSIZE	100
#define MAX_DIRSIZE	0

#define DEBUG

#ifdef DEBUG
#define DEBUG_DIR(x)	{if (do_debug) { x; }}
#else
#define DEBUG_DIR(x)
#endif

#define _GNU_SOURCE
#define __USE_LARGEFILE64

#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <stdlib.h>
#include <string.h>
#include <dirent.h>
#include <errno.h>
#include <dlfcn.h>

struct dirent_s {
	unsigned long long d_ino;
	long long d_off;
	unsigned short int d_reclen;
	unsigned char d_type;
	char *d_name;
};

struct dir_s {
	DIR	*dir;
	int	num;
	int	max;
	struct dirent_s *dp;
	int	pos;
	int	fd;
	struct dirent ret_dir;
	struct dirent64 ret_dir64;
};

static int (*real_closedir)(DIR *dir) = 0;
static DIR *(*real_opendir)(const char *name) = 0;
static struct dirent *(*real_readdir)(DIR *dir) = 0;
static struct dirent64 *(*real_readdir64)(DIR *dir) = 0;
static off_t (*real_telldir)(DIR *dir) = 0;
static void (*real_seekdir)(DIR *dir, off_t offset) = 0;
static int (*real_dirfd)(DIR *dir) = 0;
static unsigned long max_dirsize = MAX_DIRSIZE;
static int num_open = 0;
#ifdef DEBUG
static int do_debug = 0;
#endif

/* Look up the real libc entry points that this library overrides. */
static void setup_ptr(void)
{
	char *cp;

	real_opendir = dlsym(RTLD_NEXT, "opendir");
	real_closedir = dlsym(RTLD_NEXT, "closedir");
	real_readdir = dlsym(RTLD_NEXT, "readdir");
	real_readdir64 = dlsym(RTLD_NEXT, "readdir64");
	real_telldir = dlsym(RTLD_NEXT, "telldir");
	real_seekdir = dlsym(RTLD_NEXT, "seekdir");
	real_dirfd = dlsym(RTLD_NEXT, "dirfd");
	if ((cp = getenv("SPD_READDIR_MAX_SIZE")) != NULL) {
		max_dirsize = atol(cp);
	}
#ifdef DEBUG
	if (getenv("SPD_READDIR_DEBUG"))
		do_debug++;
#endif
}

/* Release the cached entry names and the entry array itself. */
static void free_cached_dir(struct dir_s *dirstruct)
{
	int i;

	if (!dirstruct->dp)
		return;

	for (i=0; i < dirstruct->num; i++) {
		free(dirstruct->dp[i].d_name);
	}
	free(dirstruct->dp);
	dirstruct->dp = 0;
}

/* qsort comparator: order by inode number, with "." and ".." first. */
static int ino_cmp(const void *a, const void *b)
{
	const struct dirent_s *ds_a = (const struct dirent_s *) a;
	const struct dirent_s *ds_b = (const struct dirent_s *) b;
	ino_t i_a, i_b;
	
	i_a = ds_a->d_ino;
	i_b = ds_b->d_ino;

	if (ds_a->d_name[0] == '.') {
		if (ds_a->d_name[1] == 0)
			i_a = 0;
		else if ((ds_a->d_name[1] == '.') && (ds_a->d_name[2] == 0))
			i_a = 1;
	}
	if (ds_b->d_name[0] == '.') {
		if (ds_b->d_name[1] == 0)
			i_b = 0;
		else if ((ds_b->d_name[1] == '.') && (ds_b->d_name[2] == 0))
			i_b = 1;
	}

	/* Compare rather than subtract: ino_t may be wider than int. */
	return (i_a > i_b) - (i_a < i_b);
}


DIR *opendir(const char *name)
{
	DIR *dir;
	struct dir_s	*dirstruct;
	struct dirent_s *ds, *dnew;
	struct dirent64 *d;
	struct stat st;

	if (!real_opendir)
		setup_ptr();

	DEBUG_DIR(printf("Opendir(%s) (%d open)\n", name, num_open++));
	dir = (*real_opendir)(name);
	if (!dir)
		return NULL;

	dirstruct =

Re: [2.6 patch] fs/jbd/journal.c: cleanups

2008-02-18 Thread Theodore Tso

> So please deal with it like most other subsystem maintainers do and stop 
> complaining about code churn - nobody but you changes the ext3 
> codebase, it's one of the codebases least affected by general kernel 
> flux, it's an ultimate leaf subsystem.

Right, sorry.  I misread the filename; I thought this was against
fs/jbd2, instead of fs/jbd.

- Ted

Re: very poor ext3 write performance on big filesystems?

2008-02-18 Thread Theodore Tso

On Mon, Feb 18, 2008 at 04:18:23PM +0100, Andi Kleen wrote:
> On Mon, Feb 18, 2008 at 09:16:41AM -0500, Theodore Tso wrote:
> > ext3 tries to keep inodes in the same block group as their containing
> > directory.  If you have lots of hard links, obviously it can't really
> > do that, especially since we don't have a good way at mkdir time to
> > tell the filesystem, "Psst!  This is going to be a hard link clone of
> > that directory over there, put it in the same block group."
> 
> Hmm, you think such a hint interface would be worth it?

It would definitely help ext2/3/4.  An interesting question is whether
it would help enough other filesystems that it's worth adding.

> > necessarily removing the dir_index feature.  Dir_index speeds up
> > individual lookups, but it slows down workloads that do a readdir
> 
> But only for large directories right? For kernel source like
> directory sizes it seems to be a general loss.

On my todo list is a hack which does the sorting of directory entries
by inode number inside the kernel for smallish directories (say, less
than 2-3 blocks) where using the kernel memory space to store the
directory entries is acceptable, and which would speed up dir_index
performance for kernel source-like directory sizes --- without needing
to use the spd_readdir LD_PRELOAD hack.

But yes, right now, if you know that your directories are almost
always going to be kernel source-like in size, then omitting dir_index
is probably going to be a good idea.

- Ted

Re: very poor ext3 write performance on big filesystems?

2008-02-18 Thread Theodore Tso

On Mon, Feb 18, 2008 at 04:02:36PM +0100, Tomasz Chmielewski wrote:
> I tried to copy that filesystem once (when it was much smaller) with rsync 
> -a -H, but after 3 days, rsync was still building an index and didn't copy 
> any file.

If you're going to copy the whole filesystem don't use rsync!  Use cp
or a tar pipeline to move the files.

> Also, as files/hardlinks come and go, it would degrade again.

Yes...

> Are there better choices than ext3 for a filesystem with lots of hardlinks? 
> ext4, once it's ready? xfs?

All filesystems are going to have problems keeping inodes close to
directories when you have huge numbers of hard links.

I'd really need to know exactly what kind of operations you were
trying to do that were causing problems before I could say for sure.
Yes, you said you were removing unneeded files, but how were you doing
it?  With rm -r of old hard-linked directories?  How big are the
average files involved?  Etc.

- Ted

Re: [2.6 patch] fs/jbd/journal.c: cleanups

2008-02-18 Thread Theodore Tso

On Mon, Feb 18, 2008 at 02:31:40PM +0100, Ingo Molnar wrote:
> i guess this explains what static code metrics already suggest:

Am I right in assuming that code-quality is just a program which runs
checkpatch.pl and measures the number of warnings and calls them
errors?

- Ted

Re: very poor ext3 write performance on big filesystems?

2008-02-18 Thread Theodore Tso

On Mon, Feb 18, 2008 at 04:57:25PM +0100, Andi Kleen wrote:
> > Use cp
> > or a tar pipeline to move the files.
> 
> Are you sure cp handles hardlinks correctly? I know tar does,
> but I have my doubts about cp.

I *think* GNU cp does the right thing with --preserve=links.  I'm not
100% sure, though --- like you, probably, I always use tar for moving
or copying directory hierarchies.

   - Ted

Re: very poor ext3 write performance on big filesystems?

2008-02-18 Thread Theodore Tso

On Mon, Feb 18, 2008 at 05:16:55PM +0100, Tomasz Chmielewski wrote:
> Theodore Tso wrote:
>
>> I'd really need to know exactly what kind of operations you were
>> trying to do that were causing problems before I could say for sure.
>> Yes, you said you were removing unneeded files, but how were you doing
>> it?  With rm -r of old hard-linked directories?
>
> Yes, with rm -r.

You should definitely try the spd_readdir hack; that will help reduce
the seek times.  This will probably help on any block group oriented
filesystems, including XFS, etc.

>> How big are the
>> average files involved?  Etc.
>
> It's hard to estimate the average size of a file. I'd say there are not 
> many files bigger than 50 MB.

Well, Ext4 will help for files bigger than 48k.

The other thing that might help for you is using an external journal
on a separate hard drive (either for ext3 or ext4).  That will help
alleviate some of the seek storms going on, since the journal is
written to only sequentially, and putting it on a separate hard drive
will help remove some of the contention on the hard drive.  

I assume that your 1.2 TB filesystem is located on a RAID array; did
you use the mke2fs -E stride option to make sure all of the bitmaps
don't get concentrated on one hard drive spindle?  One of the failure
modes which can happen is that if you use a 4+1 RAID 5 setup, all of
the block and inode bitmaps can end up getting laid out on a single
hard drive, so it becomes a bottleneck for bitmap intensive workloads
--- including rm -rf.  So that's another thing that might be going
on.  If you do a dumpe2fs, and look at the block numbers for the
block and inode allocation bitmaps, and you find that they are all
landing on the same physical hard drive, then that's very clearly the
biggest problem given an rm -rf workload.  You should be able to see
this visually as well; if one hard drive has its activity light
almost constantly on, and the other ones don't have much activity,
that's probably what is happening.
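
The bitmap-concentration effect can be illustrated with a little
arithmetic.  The sketch below is a deliberately simplified model, not
the actual mke2fs placement code: it treats the array as a plain stripe
over the data disks (ignoring RAID 5 parity rotation), and it assumes
that a stride simply shifts each group's bitmap by stride * group
blocks.  struct layout, disk_of_block() and bitmap_disk() are names
invented for this example:

```c
#include <assert.h>

/* Hypothetical, simplified array description. */
struct layout {
	unsigned long blocks_per_group;	/* filesystem blocks per block group */
	unsigned long chunk_blocks;	/* RAID chunk size in filesystem blocks */
	unsigned long data_disks;	/* data disks per stripe (4 for 4+1) */
};

/* Which data disk holds filesystem block `blk` under this model. */
static unsigned long disk_of_block(const struct layout *l,
				   unsigned long long blk)
{
	assert(l->chunk_blocks > 0 && l->data_disks > 0);
	return (unsigned long)((blk / l->chunk_blocks) % l->data_disks);
}

/* Disk holding the bitmap of block group `group`, assuming the bitmap
 * sits (stride * group) % blocks_per_group blocks into the group.
 * stride == 0 models a filesystem created without a stride hint. */
static unsigned long bitmap_disk(const struct layout *l,
				 unsigned long group, unsigned long stride)
{
	unsigned long long start =
		(unsigned long long)group * l->blocks_per_group;
	unsigned long long off =
		((unsigned long long)stride * group) % l->blocks_per_group;

	return disk_of_block(l, start + off);
}
```

With 32768 blocks per group, 16-block chunks and 4 data disks, every
group starts on a chunk boundary that maps to the same disk, so without
a stride all bitmaps share one spindle in this model; with a stride of
16 they rotate across the four disks.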

- Ted

Re: [PATCH 2/7] fs/ext{2,3,4}: Use BUG_ON

2008-02-17 Thread Theodore Tso

On Sun, Feb 17, 2008 at 06:55:06PM +0100, Julia Lawall wrote:
> From: Julia Lawall <[EMAIL PROTECTED]>
> 
> if (...) BUG(); should be replaced with BUG_ON(...) when the test has no
> side-effects to allow a definition of BUG_ON that drops the code completely.

Hi, in the future, please separate ext4 changes from ext2/3.  Thanks!!
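
For readers unfamiliar with the transformation in the quoted patch
description, here is a small user-space sketch of why the "no
side-effects" condition matters.  BUG() and BUG_ON() below are
stand-ins written for this example, and BUG_ON_DROPPED() is a
hypothetical definition that discards its argument; none of them are
the kernel's real macros:

```c
#include <assert.h>

static int bug_hits;	/* how often the stand-in BUG() fired */
static int calls;	/* how often the side-effecting check ran */

/* User-space stand-ins for illustration only. */
#define BUG()			(bug_hits++)
#define BUG_ON(cond)		do { if (cond) BUG(); } while (0)
/* Hypothetical build that compiles the check away: cond is not evaluated. */
#define BUG_ON_DROPPED(cond)	do { } while (0)

/* A check with a side effect: it counts its own invocations. */
static int check_with_side_effect(void)
{
	calls++;
	return 0;
}

/* The two spellings the patch converts between. */
static void old_style(int len) { if (len < 0) BUG(); }
static void new_style(int len) { BUG_ON(len < 0); }

static void demo(void)
{
	int before = calls;

	BUG_ON(check_with_side_effect());		/* call is evaluated */
	BUG_ON_DROPPED(check_with_side_effect());	/* call disappears */
	assert(calls == before + 1);
}
```

old_style() and new_style() behave identically here because the test
len < 0 has no side effects; demo() shows that a test with side
effects would silently change behaviour under a dropping definition.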

   - Ted

Re: [PATCH 07/30] r/o bind mounts: stub functions

2008-02-15 Thread Theodore Tso

On Fri, Feb 15, 2008 at 04:49:39PM -0800, Dave Hansen wrote:
> On Fri, 2008-02-15 at 19:32 -0500, Theodore Tso wrote:
> > On Fri, Feb 15, 2008 at 02:37:30PM -0800, Dave Hansen wrote:
> > > 
> > > This patch adds two function mnt_want_write() and mnt_drop_write().
> > > These are used like a lock pair around and fs operations that might
> > > cause a write to the filesystem.
> > 
> > Argh, is there some reason why this couldn't have gotten merged in
> > -rc1, ahead of the rest of the patch series?  This one is going to
> > cause more cross-tree merge pain with any filesystem tree that have
> > changes to fs/*/ioctl.c.
> 
> I wasn't meaning for this to hit the 2.6.25-rc series.  We had some
> review comments just when the merge window opened, and I was expecting
> them to get stuck back in -mm for another round.

Yeah, but it means that I need one set of patches for -mm, and another
set of patches for Linus's mainline.  I notice that your patchset is
currently missing changes for fs/ext4/ioctl.c --- I think because you
dropped them when Mingming picked them up, and then I dropped them
when I was trying to prepare the set of patches to push to Linus.

No problem, I'm sure I can resurrect them, but it's still the same
basic problem that when there are patchsets such as yours which touch
multiple trees in -mm, there are almost inevitably patch conflicts.

It would be nice if an initial patch which introduces the new
functionality you need for r/o bind mounts could get introduced into
mainline *first*, and then people could add patches that call
mnt_want_write(), et al. into their trees gradually.
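
To make the "lock pair" usage from the quoted patch description
concrete, here is a user-space mock of the calling pattern.  The
function names mnt_want_write() and mnt_drop_write() come from the
thread, but the struct vfsmount fields, the function bodies and
set_flags() are stand-ins invented for this sketch and do not reflect
the kernel implementation:

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for the kernel's struct vfsmount. */
struct vfsmount {
	int readonly;	/* nonzero if this mount is read-only */
	int writers;	/* writers currently holding the mount */
};

/* Mock: ask for write access; fails with -EROFS on a read-only mount. */
static int mnt_want_write(struct vfsmount *mnt)
{
	if (mnt->readonly)
		return -EROFS;
	mnt->writers++;
	return 0;
}

/* Mock: release the access taken by a successful mnt_want_write(). */
static void mnt_drop_write(struct vfsmount *mnt)
{
	assert(mnt->writers > 0);
	mnt->writers--;
}

/* Hypothetical ioctl-style operation bracketed by the pair. */
static int set_flags(struct vfsmount *mnt, unsigned int *flags,
		     unsigned int new_flags)
{
	int err = mnt_want_write(mnt);

	if (err)
		return err;
	*flags = new_flags;	/* the modification being protected */
	mnt_drop_write(mnt);
	return 0;
}
```

Every filesystem ioctl that can modify state needs this bracket, which
is why the patch series touches fs/*/ioctl.c in so many trees at once.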

As it is, I can't see a way around this other than maintaining two
separate patch sets, one that works with r/o bind mounts, and one for
mainline, since otherwise akpm gets grumpy and starts dropping either
your patchset or the ext4 patchset because *he* has to manually fix up
the patch conflicts.  (So instead I have to deal with it by hand, and
then *I* get grumpy.  :-/)

- Ted

Re: [PATCH 07/30] r/o bind mounts: stub functions

2008-02-15 Thread Theodore Tso

On Fri, Feb 15, 2008 at 02:37:30PM -0800, Dave Hansen wrote:
> 
> This patch adds two function mnt_want_write() and mnt_drop_write().
> These are used like a lock pair around and fs operations that might
> cause a write to the filesystem.

Argh, is there some reason why this couldn't have gotten merged in
-rc1, ahead of the rest of the patch series?  This one is going to
cause more cross-tree merge pain with any filesystem tree that have
changes to fs/*/ioctl.c.

- Ted

Re: Announce: Linux-next (Or Andrew's dream :-))

2008-02-13 Thread Theodore Tso

On Tue, Feb 12, 2008 at 10:16:53PM -0800, Greg KH wrote:
> I was amazed at how slow stgit was when I tried it out.  I use
> git-quiltimport a lot and I don't think it's any slower than just using
> quilt on its own.  So I think that the speed issue should be the same.

I like using "guilt" because I can easily reapply the patchset using
"guilt push -a", which is just slightly fewer characters to type than
"git-quiltimport".  This also means that I don't need to switch back
and forth between "git mode" and "quilt mode" when I'm editing the
patches (either directly by editing the patch files, in which case
afterwards I do a "guilt pop -a; guilt push -a", or by using "guilt
pop", "guilt push", and "guilt refresh").

"guilt push -a" is a little bit slower than "quilt push -a", but not
enough to be seriously annoying.  And besides, "guilt pop -a" is
slightly faster than "quilt pop -a".

Using guilt is also nice because there is a bit of additional backup
for previous work via the git reflogs, although to be honest I've
rarely needed to use it.

- Ted

Re: Announce: Linux-next (Or Andrew's dream :-))

2008-02-12 Thread Theodore Tso

On Tue, Feb 12, 2008 at 04:49:46PM -0800, Linus Torvalds wrote:
> On Tue, 12 Feb 2008, Greg KH wrote:
> > 
> > Perhaps you need to switch to using quilt.  This is the main reason why
> > I use it.
> 
> Btw, on that note: if some quilt user can send an "annotated history file" 
> of their quilt usage, it's something that git really can do, and I'll see 
> if I can merge (or rather, coax Junio to merge) the relevant part of stgit 
> to make it possible to just basically get "quilt behaviour" for the parts 
> of a git tree that you haven't pushed out yet.

So this is what I do for ext4 development.  We maintain a quilt series
in git, which is located at: http://repo.or.cz/w/ext4-patch-queue.git

A number of ext4 developers have write access to commit into that
tree, and we coordinate amongst ourselves and on
[EMAIL PROTECTED]  I tend to suck it into git using the
"guilt" package, and do periodic merge testing with a number of git
queues to detect potential merge conflicts.  Not as many as James
does, but I may start doing more of that once I steal his scripts.  :-)

The patch queue also gets automatic testing on a number of different
platforms; for that reason the series file has comments noting which
version of the kernel it was last based on, so the ABAT system can
know what version of the kernel to use as the base of the quilt series.

I do a fair amount of QA, including copy editing and in some cases
rewriting the patch descriptions (which are often pretty vile, due to
a number of the ext4 developers not being native English speakers; not
their fault, but more than once I've had no idea what the patch
description is trying to say until I read through the patch very
closely, which is also good for me to do from a code QA point of view  :-).

Periodically, the patch queue gets pushed into the ext4.git tree and
as a patch series on ftp.kernel.org.

I've never been very happy with stgit because of past experiences
which have scarred me, when it got confused and lost my entire patch
series (this was before git reflogs, so recovery was interesting).
There's always been something deeply comforting about having the ASCII
patch series since it's easy to back it up and know you're in no
danger of losing everything in case of a bug.  Also, having the patch
series stored in ASCII as a quilt stack means that we can store the
quilt stack itself in git, and with repo.or.cz it allows us to have
multiple write access to the shared quilt stack, while still giving us
the off-line access advantages of git.  (Yes, I've spent plane rides
rewriting patch descriptions.  :-)

The other advantage of storing the patch stack as an ASCII quilt
series is we have a history of changes of the patches, which we don't
necessarily have if we just use stgit to rewrite the patch.  So we
have the best of both worlds; what gets checked into Linus's tree is a
clean patch series, but we keep some history of different versions of
a patch over time in the ext4-patch-queue git repository.  (I wish we
had better changelog comments there too, but I'll take what I can
get.)

- Ted

Re: BTRFS partition usage...

2008-02-12 Thread Theodore Tso

On Tue, Feb 12, 2008 at 03:28:26PM -0800, David Miller wrote:
> From: Jan Engelhardt <[EMAIL PROTECTED]>
> Date: Tue, 12 Feb 2008 15:00:20 +0100 (CET)
> 
> > Something looks wrong here. Why would btrfs need to zero at all?
> 
> So that existing superblocks on the partition won't
> be interpreted as correct by other filesystems.  It's
> a safety measure many mkfs programs use.
> 
> > Superblock at 0, and done. Just like xfs.
> 
> No, we won't do stupid things like that and make an entire
> cylinder of our disks unusable.  See my other reply.

The reason why we don't put the superblock at 0 is not because it
screws over the sparc, but because on many systems (including x86) the
boot sector is stored at 0.  It's not hard for mke2fs to zap the boot
sector, which we do on all architectures *except* sparc, to avoid
nuking the disk label.  (Chris just missed the "#ifndef __sparc //
#define ZAP_BOOTBLOCK // #endif" at the beginning of mke2fs.c)

This is the best of all worlds; it makes sparc happy; it allows boot
loaders to put the x86 standard initial stage 0 boot loader in the
first 446 bytes of the disk; and by zapping sector 0 on all
architectures except the sparc, it solves the previous filesystem
"ghost traces" detection problem for filesystems like xfs that put the
superblock at 0.
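
A minimal sketch of the conditional zapping described above, operating
on an in-memory image rather than a real block device.  zap_sector0()
and SECTOR_SIZE are names made up for this example; the guard uses
__sparc__, the macro GCC predefines on SPARC targets, which is an
assumption about the exact spelling rather than a quote from mke2fs.c:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SECTOR_SIZE 512

/* Same shape as the guard mentioned in the text. */
#ifndef __sparc__
#define ZAP_BOOTBLOCK
#endif

/* Clear sector 0 of a device image, except on sparc, where that sector
 * holds the disk label.  Returns the number of bytes cleared. */
static size_t zap_sector0(unsigned char *image, size_t len)
{
	assert(image != NULL);
#ifdef ZAP_BOOTBLOCK
	size_t n = len < SECTOR_SIZE ? len : SECTOR_SIZE;

	memset(image, 0, n);
	return n;
#else
	(void)len;
	return 0;
#endif
}
```

Zeroing the whole sector also wipes any stale superblock that a
previous filesystem kept at offset 0, which is the "ghost traces" case
mentioned above.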

- Ted

Re: Announce: Linux-next (Or Andrew's dream :-))

2008-02-12 Thread Theodore Tso

On Tue, Feb 12, 2008 at 12:48:13PM -0800, Greg KH wrote:
> On Tue, Feb 12, 2008 at 11:55:45AM -0800, Linus Torvalds wrote:
> > > 
> > > No it isn't.  To quote you a number of years ago:
> > >   "Linux is evolution, not intelligent design"

I think this statement has unfortunately been used as a hard and fast
rule (which we all know how much Linus loves :-) to mean, in its most
extreme form, that we should *never* try to do some careful reflection
about API design, and that the extremes of "no interface
without an in-tree user" apply to (a) parameters in a function call
(heck, we can always sweep through all the in-tree users to add that
extra parameter later, and that's a *good* thing because it breaks
those evil out-of-tree drivers) and (b) not even thinking about whether
some particular interface (that is not needed now but which reasonably
will be needed later) is even *possible* without doing a sweep of all of
the in-tree users of the interface.

Related to this syndrome is the assumption that measuring the rate of
change in lines of code changed per second implies that any
development process which increases the number of lines of code changed
per second, including frequent sweeps through the tree changing all
interfaces, is a *good* thing.

Yes, this is an extreme position, and I'm not accusing anyone of
holding the above in its entirety --- but I've seen aspects of all of
these from one developer or another.

We come to it from attacking another strawman, which assumes that
*all* interfaces which don't have an in-tree user are evil, and that
keeping old __deprecated interfaces for a long time is an evil which
causes intolerable pain, and that it's never worthwhile to try to
anticipate future expandability in an interface because you will
inevitably get it wrong.

Clearly, we are right to mock Solaris for making claims that they will
never, ever, *ever* change an interface, not even one that goes back
sixteen years to Solaris 2.3.  But it doesn't follow that the opposite
point of view, that constant mutability of kernel interfaces to make
sure that things are always perfect and pure and clean, is the right
one either.

> > The examples are legion. The mammalian eye has the retina "backwards", 
> > with the blind spot appearing because the fundamental infrastructure (the 
> > optical nerves) actually being in *front* of the light sensor and needing 
> > a hole in the retina to get the information (and blood flow) to go to the 
> > brain!

Also, evolution means that things like vestigial organs (like our
appendix) are tolerated.  So are clearly very badly
designed things, like human backs.  To the extent that we don't like
vestigial old __deprecated interfaces, and want things to be perfect,
we are actually straying into the realms where we want the sort of
things that you would get if you *did* have an intelligent designer
designing the human body from scratch.

So the "Linux is evolution, not intelligent design" quote is
unfortunately getting used to imply that no amount of intelligent
foresight is worthwhile, and I think that's unfortunate.  It implies
an extreme position which is not warranted.

> > > But they do happen about once or twice a kernel release, just by virtue
> > > of the way things need to happen.
> > 
> > And I violently disagree.
> > 
> > It should not be "once or twice a kernel release".
> > 
> > It should be "once or twice a year" that you hit a flag-day issue. The 
> > rest of the time you should be able to do it without breakage. It's 
> > doable. You just HAVEN'T EVEN TRIED, and seem to be actively against even 
> > doing so.
> 
> No, not at all.
> 
> I have tried, and successfully done this many times in the past.  The
> kobject change was one example: add a new function, migrate all users of
> a direct pointer over to that function, after that work is all done and
> in, change the structure and do the needed work afterward.  All is
> bisectable completely, with no big "flag day" needed.

Collectively, we need to try harder.

We can debate exactly where the right line is, in terms of whether
it's only "once or twice a kernel release", or "once or twice a year",
but clearly the current amount of interface changes and cross-tree
dependencies has been causing Andrew pain.  And to me, that means we
need to turn the knob back a quarter turn towards tolerating
__deprecated old interfaces a little bit more, and trying to get
interfaces right just a little bit more and try building in just a
little bit more future expandability, and to try just a *little* bit
harder to preserve a *little* bit more of a stable API.

In other words, maybe we need to write a counterpoint to the
stable_api_nonsense.txt and call it unstable_api_nonsense.txt --- and
in it, we note that if we start burning out Andrew and he starts
getting really, REALLY grumpy --- and if especially we start making
Stephen (normally a very mild-mannered and not terribly excitable guy)
grumpy, that it's time that we try just a little bit harder to make
our APIs a little bit more stable.

Re: Announce: Linux-next (Or Andrew's dream :-))

2008-02-12 Thread Theodore Tso

On Mon, Feb 11, 2008 at 11:06:17PM -0800, Arjan van de Ven wrote:
> There is maybe a middle ground in this -next idea; as very first
> part of the series, the new api gets added, current users converted
> and api marked __deprecated.
> 
> Then there's a second part to the patch, which is a separate tree,
> which gets added at the very end, which removed the old api.
> 
> Both will go in at the same merge window, and the next-meister needs
> to track that no new users show up... but the final tree allows this
> to be done somewhat more gentle.
> 
> Doesn't work for API changes that just change the API rather than
> extending it, and doesn't solve the dependency issues. So I still
> think a cleansweep works best in general, but I suspect Andrew just
> disagrees with that.

Yes, that's exactly what I was suggesting.  The __deprecated marking only
lasts for the merge window, and we remove the old API at the end of the
merge window.  So it's only there for a very short time, and it's only
there to make the clean sweep a little less painful --- not one where
"shit hangs around in the tree forever".

- Ted



Re: Announce: Linux-next (Or Andrew's dream :-))

2008-02-12 Thread Theodore Tso

On Tue, Feb 12, 2008 at 04:49:46PM -0800, Linus Torvalds wrote:
> On Tue, 12 Feb 2008, Greg KH wrote:
> > 
> > Perhaps you need to switch to using quilt.  This is the main reason why
> > I use it.
> 
> Btw, on that note: if some quilt user can send an annotated history file 
> of their quilt usage, it's something that git really can do, and I'll see 
> if I can merge (or rather, coax Junio to merge) the relevant part of stgit 
> to make it possible to just basically get "quilt behaviour" for the parts 
> of a git tree that you haven't pushed out yet.

So this is what I do for ext4 development.  We maintain a quilt series
in git, which is located here at: http://repo.or.cz/w/ext4-patch-queue.git

A number of ext4 developers have write access to commit into that
tree, and we coordinate amongst ourselves and on
[EMAIL PROTECTED]  I tend to suck it into git using the
guilt package, and do periodic merge testing with a number of git
queues to detect potential merge conflicts.  Not as many as James
does, but I may start doing more of that once I steal his scripts.  :-)

The patch queue also gets automatic testing on a number of different
platforms; for that reason the series file's comments note which version
of the kernel it was last based on, so the ABAT system can know what
version of the kernel to use as the base of the quilt series.

I do a fair amount of QA, including copy editing and in some cases
rewriting the patch descriptions (which are often pretty vile, due to
a number of the ext4 developers not being native English speakers; not
their fault, but more than once I've had no idea what the patch
description is trying to say until I read through the patch very
closely, which is also good for me to do from a code QA point of view  :-).

Periodically, the patch queue gets pushed into the ext4.git tree and
as a patch series on ftp.kernel.org.

I've never been very happy with stgit because of past experiences
which have scarred me, when it got confused and lost my entire patch
series (this was before git reflogs, so recovery was interesting).
There's always been something deeply comforting about having the ASCII
patch series since it's easy to back it up and know you're in no
danger of losing everything in case of a bug.  Also, having the patch
series stored in ASCII as a quilt stack means that we can store the
quilt stack itself in git, and with repo.or.cz it allows us to have
multiple write access to the shared quilt stack, while still giving us
the off-line access advantages of git.  (Yes, I've spent plane rides
rewriting patch descriptions.  :-)

The other advantage of storing the patch stack as an ASCII quilt
series is we have a history of changes of the patches, which we don't
necessarily have if you just use stgit to rewrite the patch.  So we
have the best of both worlds; what gets checked into Linus's tree is a
clean patch series, but we keep some history of different versions of
a patch over time in the ext4-patch-queue git repository.  (I wish we
had better changelog comments there too, but I'll take what I can
get.)

- Ted

Re: Announce: Linux-next (Or Andrew's dream :-))

2008-02-11 Thread Theodore Tso

On Mon, Feb 11, 2008 at 11:45:55PM -0500, Trond Myklebust wrote:
> It would be very nice to have a separate tree with _only_ API changes
> that could be frozen well before Linus' merge window opens. It should be
> a requirement that maintainers use this tree as a basis for testing API
> changes and even test that their own changesets were properly integrated
> with the changed APIs.

The other way that might work in some circumstances would be if we
tried a little harder to avoid API changes that don't involve an
interface naming change.  That is, instead of adding a new parameter
to a function, and then having to sweep through all of the trees to
catch all of the users of said function, we could instead add a new
interface, mark the old one __deprecated, and then give enough time for
trees to adapt; that way you can avoid needing to do flag day transitions.  If
the old interface is __deprecated at the beginning of the merge
window, and then disappears at the very end of the merge window,
that's plenty of time for the subsystem maintainers to move to the new
interface.

This doesn't always work, of course (for example, if we make a
fundamental change in how some critical low-level data structure is
locked).  But every little bit that we can do to avoid the tree
integration pain would be a win.
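
As a rough illustration of the "new name, __deprecated old name" approach, here is a minimal sketch in plain C.  The function names frob_widget and frob_widget_flags are hypothetical, and the kernel's __deprecated annotation is spelled out here as the GCC attribute it is assumed to stand for.

```c
/* New interface: the extra parameter arrives under a new name, so
 * existing callers of the old name keep compiling unchanged. */
int frob_widget_flags(int widget, unsigned int flags)
{
    /* Placeholder body for illustration only. */
    return widget + (int)flags;
}

/* Old interface, kept for the length of the merge window.  Any caller
 * still using it gets a compile-time warning instead of a build break. */
int frob_widget(int widget) __attribute__((deprecated));

int frob_widget(int widget)
{
    /* Forward to the new interface with the old default behaviour. */
    return frob_widget_flags(widget, 0);
}
```

Subsystem trees can then convert their callers one at a time, and the wrapper is deleted once the last warning is gone.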

Regards,

- Ted

Re: ndiswrapper and GPL-only symbols redux

2008-02-06 Thread Theodore Tso

On Wed, Feb 06, 2008 at 03:38:52AM -0800, David Schwartz wrote:
> > 
> > Ndiswrapper loads and executes code with not GPLv2 compatible licences 
> > in a way in the kernel that might be considered similar to a GPLv2'ed 
> > userspace program dlopen() a dynamic library file with a not GPLv2 
> > compatible licence.
> > 
> > IANAL, but I do think there might be real copyright issues with 
> > ndiswrapper.
> 
> Neither the kernel+ndiswrapper nor the non-free driver were
> developed with knowledge of the other, so there is simply no way one
> could be a derivative work of the other. Since no creative effort is
> required to link them together, and the linked result is not fixed
> in a permanent medium, a derivative work cannot be created by the
> linking process itself.

Indeed, there is a similar issue with libss, which was originally
written for use with Kerberos v5, and licensed under an MIT (BSD-style
plus you must not use MIT's name in advertising) license.  Kerberos V5
was adapted by Sun to create a proprietary product called SEAM (Sun
Enterprise Authentication Mechanism), and contains a program called
kadmin, which uses libss as part of its user interface.

In the meantime, libss was enhanced to use a search path to dlopen the
first readline library it can find (some are GPL, some are
BSD-licensed), so that people could use debugfs while being able to
have command-line editing, and this is shipping in e2fsprogs.  I used
dlopen so that use of libreadline is optional; so if it doesn't fit on
a rescue floppy, it's no big deal; you can still use debugfs to edit
an ext2/3/4 filesystem.  So there was very much a valid technical
reason for doing what I did; I wasn't trying to circumvent any license
requirements, but trying to solve a perfectly valid problem when you
only have 1440k on a 3.5" floppy (and libreadline is 296k, or 21% of
the total amount of space available).
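
The "search path plus dlopen" idea can be sketched roughly as follows.  This is an illustration under stated assumptions, not the libss source: the helper name open_first_library is made up, and the caller supplies whatever candidate library names it wants tried.

```c
#include <dlfcn.h>
#include <stddef.h>

/*
 * Hypothetical helper: try each candidate library name in order and
 * return the first handle dlopen() can load, or NULL if none of them
 * is installed.  The list must end with a NULL entry.  If "chosen" is
 * not NULL it receives the name that was loaded (or NULL on failure).
 */
void *open_first_library(const char *const names[], const char **chosen)
{
    for (size_t i = 0; names[i] != NULL; i++) {
        void *handle = dlopen(names[i], RTLD_NOW);
        if (handle != NULL) {
            if (chosen != NULL)
                *chosen = names[i];
            return handle;
        }
    }
    if (chosen != NULL)
        *chosen = NULL;
    return NULL;
}
```

A caller would pass a NULL-terminated list of names and, on a NULL return, simply carry on without command-line editing, which is what makes the library optional on a space-constrained rescue disk.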

But if you compile and install e2fsprogs on Solaris, and then run
kadmin, you can have in one address space the proprietary kadmin
binary from SEAM, the BSD-licensed libss shared library from
e2fsprogs, and the GPL-licensed libreadline shared library.

Answer quickly!  Is there a license violation, and if so, who was
responsible for committing the license violation?  This is my favorite
real-life case study that I roll out when I want to torture people who
claim that dynamic linking with a GPL shared library automatically
results in a GPL violation.  :-)

The bottom line is that you should ask a lawyer, and not believe
anyone who has claimed to give you legal advice, whether or not they
have talked to "dozens of lawyers".  What's most important is the
lawyer whom you have paid money so he can take the facts specific
to your case, and apply them to the relevant legal statutes in those
legal jurisdictions applicable for the software/product in question.

Regards,

- Ted

Re: T61P sound issue

2008-02-06 Thread Theodore Tso

On Wed, Feb 06, 2008 at 09:11:27AM +0100, Jiri Kosina wrote:
> On Tue, 5 Feb 2008, Theodore Tso wrote:
> 
> I have also seen sound working flawlessly on another X61s. Maybe they 
> changed some chipset revisions on the fly, or whatever.
> 
> What does lspci -v show for your soundcard please?

Attached please find my lspci -v and dmidecode information.

- Ted

00:00.0 Host bridge: Intel Corporation Mobile PM965/GM965/GL960 Memory 
Controller Hub (rev 0c)
Subsystem: Lenovo Unknown device 20b3
Flags: bus master, fast devsel, latency 0
Capabilities: <access denied>

00:02.0 VGA compatible controller: Intel Corporation Mobile GM965/GL960 
Integrated Graphics Controller (rev 0c) (prog-if 00 [VGA])
Subsystem: Lenovo Unknown device 20b5
Flags: bus master, fast devsel, latency 0, IRQ 16
Memory at f810 (64-bit, non-prefetchable) [size=1M]
Memory at e000 (64-bit, prefetchable) [size=256M]
I/O ports at 1800 [size=8]
Capabilities: <access denied>

00:02.1 Display controller: Intel Corporation Mobile GM965/GL960 Integrated 
Graphics Controller (rev 0c)
Subsystem: Lenovo Unknown device 20b5
Flags: bus master, fast devsel, latency 0
Memory at f820 (64-bit, non-prefetchable) [size=1M]
Capabilities: <access denied>

00:19.0 Ethernet controller: Intel Corporation 82566MM Gigabit Network 
Connection (rev 03)
Subsystem: Lenovo Unknown device 20de
Flags: bus master, fast devsel, latency 0, IRQ 508
Memory at fe00 (32-bit, non-prefetchable) [size=128K]
Memory at fe225000 (32-bit, non-prefetchable) [size=4K]
I/O ports at 1840 [size=32]
Capabilities: <access denied>

00:1a.0 USB Controller: Intel Corporation 82801H (ICH8 Family) USB UHCI 
Contoller #4 (rev 03) (prog-if 00 [UHCI])
Subsystem: Lenovo Lenovo Thinkpad T61
Flags: bus master, medium devsel, latency 0, IRQ 20
I/O ports at 1860 [size=32]

00:1a.1 USB Controller: Intel Corporation 82801H (ICH8 Family) USB UHCI 
Controller #5 (rev 03) (prog-if 00 [UHCI])
Subsystem: Lenovo Lenovo Thinkpad T60
Flags: bus master, medium devsel, latency 0, IRQ 21
I/O ports at 1880 [size=32]

00:1a.7 USB Controller: Intel Corporation 82801H (ICH8 Family) USB2 EHCI 
Controller #2 (rev 03) (prog-if 20 [EHCI])
Subsystem: Lenovo Lenovo Thinkpad T61
Flags: bus master, medium devsel, latency 0, IRQ 22
Memory at fe226c00 (32-bit, non-prefetchable) [size=1K]
Capabilities: <access denied>

00:1b.0 Audio device: Intel Corporation 82801H (ICH8 Family) HD Audio 
Controller (rev 03)
Subsystem: Lenovo Lenovo Thinkpad T61
Flags: bus master, fast devsel, latency 0, IRQ 17
Memory at fe22 (64-bit, non-prefetchable) [size=16K]
Capabilities: <access denied>

00:1c.0 PCI bridge: Intel Corporation 82801H (ICH8 Family) PCI Express Port 1 
(rev 03) (prog-if 00 [Normal decode])
Flags: bus master, fast devsel, latency 0
Bus: primary=00, secondary=02, subordinate=02, sec-latency=0
I/O behind bridge: 2000-2fff
Memory behind bridge: dc10-dfcf
Prefetchable memory behind bridge: dfe0-dfef
Capabilities: <access denied>

00:1c.1 PCI bridge: Intel Corporation 82801H (ICH8 Family) PCI Express Port 2 
(rev 03) (prog-if 00 [Normal decode])
Flags: bus master, fast devsel, latency 0
Bus: primary=00, secondary=03, subordinate=03, sec-latency=0
I/O behind bridge: 3000-3fff
Memory behind bridge: fc00-fdff
Prefetchable memory behind bridge: f800-f80f
Capabilities: <access denied>

00:1d.0 USB Controller: Intel Corporation 82801H (ICH8 Family) USB UHCI 
Controller #1 (rev 03) (prog-if 00 [UHCI])
Subsystem: Lenovo Unknown device 20aa
Flags: bus master, medium devsel, latency 0, IRQ 16
I/O ports at 18a0 [size=32]

00:1d.1 USB Controller: Intel Corporation 82801H (ICH8 Family) USB UHCI 
Controller #2 (rev 03) (prog-if 00 [UHCI])
Subsystem: Lenovo Unknown device 20aa
Flags: bus master, medium devsel, latency 0, IRQ 17
I/O ports at 18c0 [size=32]

00:1d.7 USB Controller: Intel Corporation 82801H (ICH8 Family) USB2 EHCI 
Controller #1 (rev 03) (prog-if 20 [EHCI])
Subsystem: Lenovo Unknown device 20ab
Flags: bus master, medium devsel, latency 0, IRQ 19
Memory at fe227000 (32-bit, non-prefetchable) [size=1K]
Capabilities: <access denied>

00:1e.0 PCI bridge: Intel Corporation 82801 Mobile PCI Bridge (rev f3) (prog-if 
01 [Subtractive decode])
Flags: bus master, fast devsel, latency 0
Bus: primary=00, secondary=05, subordinate=08, sec-latency=32
I/O behind bridge: 4000-7fff
Memory behind bridge: f830-fbff
Prefetchable memory behind bridge: f400-f7ff
Capabilities: <access denied>

00:1f.0 ISA bridge: Intel Corporati

Re: ndiswrapper and GPL-only symbols redux

2008-02-06 Thread Theodore Tso

On Wed, Feb 06, 2008 at 03:38:52AM -0800, David Schwartz wrote:
> >
> > Ndiswrapper loads and executes code with not GPLv2 compatible licences
> > in a way in the kernel that might be considered similar to a GPLv2'ed
> > userspace program dlopen() a dynamic library file with a not GPLv2
> > compatible licence.
> >
> > IANAL, but I do think there might be real copyright issues with
> > ndiswrapper.
>
> Neither the kernel+ndiswrapper nor the non-free driver were
> developed with knowledge of the other, so there is simply no way one
> could be a derivative work of the other. Since no creative effort is
> required to link them together, and the linked result is not fixed
> in a permanent medium, a derivative work cannot be created by the
> linking process itself.

Indeed, there is a similar issue with libss, which was originally
written for use with Kerberos v5, and licensed under an MIT (BSD-style
plus you must not use MIT's name in advertising) license.  Kerberos V5
was adapted by Sun to create a proprietary product called SEAM (Sun
Enterprise Authentication Mechanism), and contains a program called
kadmin, which uses libss as part of its user interface.

In the meantime, libss was enhanced to use a search path to dlopen the
first readline library it can find (some are GPL, some are
BSD-licensed), so that people could use debugfs while being able to
have command-line editing, and this is shipping in e2fsprogs.  I used
dlopen so that use of libreadline is optional; so if it doesn't fit on
a rescue floppy, it's no big deal; you can still use debugfs to edit
an ext2/3/4 filesystem.  So there was very much a valid technical
reason for doing what I did; I wasn't trying to circumvent any license
requirements, but trying to solve a perfectly valid problem when you
only have 1440k on a 3.5" floppy (and libreadline is 296k, or 21% of
total amount of space available).
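
The optional-library approach described above can be sketched roughly as follows. This is a minimal illustration, not the libss source: the function names and the candidate library names (`libreadline.so.8`, `libedit.so.2`) are assumptions chosen for the example, and real names vary by system.

```c
#include <assert.h>
#include <dlfcn.h>
#include <stddef.h>

/*
 * Try each candidate shared-library name in order and return the handle of
 * the first one that loads.  A NULL return means "none available"; the
 * caller treats that as "run without line editing", not as a fatal error.
 */
void *open_first_available(const char *const *names, size_t count,
			   const char **chosen)
{
	assert(names != NULL || count == 0);

	for (size_t i = 0; i < count; i++) {
		void *handle = dlopen(names[i], RTLD_NOW);

		if (handle) {
			if (chosen)
				*chosen = names[i];
			return handle;
		}
	}
	if (chosen)
		*chosen = NULL;
	return NULL;
}

/* Hypothetical search path, tried in order. */
const char *const editing_libs[] = {
	"libreadline.so.8",
	"libedit.so.2",
};

/* Returns 1 if some line-editing library could be loaded, 0 otherwise. */
int line_editing_available(void)
{
	void *handle = open_first_available(editing_libs,
			sizeof(editing_libs) / sizeof(editing_libs[0]), NULL);

	if (!handle)
		return 0;
	dlclose(handle);
	return 1;
}
```

Because the library is looked up at run time, the program itself carries no link-time dependency on any of the candidates, which is what makes it possible to leave them off a space-constrained rescue image.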

But if you compile and install e2fsprogs on Solaris, and then run
kadmin, you can have in one address space the proprietary kadmin
binary from SEAM, the BSD-licensed libss shared library from
e2fsprogs, and the GPL-licensed libreadline shared library.

Answer quickly!  Is there a license violation, and if so, who was
responsible for committing the license violation?  This is my favorite
real-life case study that I roll out when I want to torture people who
claim that dynamic linking with a GPL shared library automatically
results in a GPL violation.  :-)

The bottom line is that you should ask a lawyer, and not believe
anyone who has claimed to give you legal advice, whether or not they
have talked to dozens of lawyers.  What's most important is the
lawyer you have paid to take the facts specific to your case and
apply them to the relevant legal statutes in the legal jurisdictions
applicable to the software/product in question.

Regards,

- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: T61P sound issue

2008-02-05 Thread Theodore Tso

On Tue, Feb 05, 2008 at 10:16:08PM +0100, Jiri Kosina wrote:
> [ added Takashi ]
> 
> On Tue, 5 Feb 2008, Felipe Balbi wrote:
> 
> > > > > > Could anyone make T61P's ICH8 sound controller to work properly?
> > Good that there's a lot of people using T61p, it's a good machine.
> > I'll upgrade my BIOS and try again the crappy sound.
> 
> I have just bought X61s, and it seems to have the very same soundcard as 
> your T61p does:
>
> The sound also doesn't work with 2.6.24 (tried modprobing the 
> snd-hda-intel with 'model=thinkpad', didn't make any difference). The 
> mixer settings seem to be correct, but there is no sound.
> 

Hmm.. sound works just fine for me on my X61s (model #7668-CTO)
running 2.6.24.  

I do have this private patch applied --- maybe it makes a difference
for you?  I don't think it should make a difference, but

- Ted


commit c9001b03378048cad0f5c4f87dbb97fff1f80c51
Author: Theodore Ts'o <[EMAIL PROTECTED]>
Date:   Wed Jan 9 05:14:14 2008 -0500

hda_intel suspend latency: shorten codec read

not sleeping for every codec read/write but doing a short udelay and
a conditional reschedule has cut suspend+resume latency by about 1
second on my T60.

The patch also fixes the unexpected codec-connection errors that
happen more often in the new power-save mode:
http://lkml.org/lkml/2007/11/8/255
http://bugzilla.kernel.org/show_bug.cgi?id=9332

This had been applied, and then reverted due to problems.  See commit
d238998fbfa49f30b02f0a5de5294ca53c58348c

Signed-off-by: Ingo Molnar <[EMAIL PROTECTED]>
Acked-by: Takashi Iwai <[EMAIL PROTECTED]>
Signed-off-by: Linus Torvalds <[EMAIL PROTECTED]>

diff --git a/sound/pci/hda/hda_intel.c b/sound/pci/hda/hda_intel.c
index 3fa0f97..62b9fb3 100644
--- a/sound/pci/hda/hda_intel.c
+++ b/sound/pci/hda/hda_intel.c
@@ -555,7 +555,8 @@ static unsigned int azx_rirb_get_response(struct hda_codec *codec)
 		}
 		if (!chip->rirb.cmds)
 			return chip->rirb.res; /* the last value */
-		schedule_timeout_uninterruptible(1);
+		udelay(10);
+		cond_resched();
 	} while (time_after_eq(timeout, jiffies));
 
 	if (chip->msi) {
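
The trade-off in the patch above (a short pause plus a voluntary yield per poll, instead of sleeping for a whole timer tick) can be modelled in user space. This is only a sketch: `struct fake_codec`, `response_ready()` and `wait_for_response()` are hypothetical names, the spin loop is an uncalibrated stand-in for `udelay(10)`, and `sched_yield()` is only a rough analogue of `cond_resched()`.

```c
#include <assert.h>
#include <sched.h>
#include <stdbool.h>

/* Hypothetical stand-in for "has the codec answered yet?". */
struct fake_codec {
	int polls_until_ready;	/* the response arrives after this many polls */
};

bool response_ready(struct fake_codec *c)
{
	if (c->polls_until_ready > 0) {
		c->polls_until_ready--;
		return false;
	}
	return true;
}

/* Uncalibrated busy-wait; stands in for a microsecond-scale delay. */
static void short_delay(void)
{
	for (volatile int i = 0; i < 1000; i++)
		;
}

/*
 * Poll with a short pause and a voluntary yield between attempts, giving up
 * after max_polls.  Returns the number of polls used, or -1 on timeout.
 */
int wait_for_response(struct fake_codec *c, int max_polls)
{
	assert(c != NULL);

	for (int polls = 1; polls <= max_polls; polls++) {
		if (response_ready(c))
			return polls;
		short_delay();	/* keep the wait far shorter than a timer tick */
		sched_yield();	/* still let other runnable work make progress */
	}
	return -1;
}
```

When each reply normally arrives within microseconds, sleeping a full tick per poll makes the total latency scale with the number of codec accesses, which is the effect the commit message describes for suspend and resume.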

Re: [PATCH] ext4: Replace use of iget() with iget_locked()

2008-02-02 Thread Theodore Tso

On Sat, Feb 02, 2008 at 10:55:24AM -0500, Theodore Ts'o wrote:
> In the mm tree is a patch queued up to nuke iget().  So replace use of
> iget() with iget_locked().  I will be pushing this to Linus shortly.

Oops, wrong version of the patch; this is the correct one.

   - Ted

ext4: Replace use of iget() with iget_locked()

Signed-off-by: "Theodore Ts'o" <[EMAIL PROTECTED]>

---
diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index 575b521..d45fcaa 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -805,9 +805,17 @@ struct inode *ext4_orphan_get(struct super_block *sb, unsigned long ino)
 	 * is a valid orphan (no e2fsck run on fs).  Orphans also include
 	 * inodes that were being truncated, so we can't check i_nlink==0.
 	 */
-	if (!ext4_test_bit(bit, bitmap_bh->b_data) ||
-			!(inode = iget(sb, ino)) || is_bad_inode(inode) ||
-			NEXT_ORPHAN(inode) > max_ino) {
+	if (!ext4_test_bit(bit, bitmap_bh->b_data))
+		goto bad_orphan_inode;
+	inode = iget_locked(sb, ino);
+	if (!inode)
+		goto bad_orphan_inode;
+	if (inode->i_state & I_NEW) {
+		sb->s_op->read_inode(inode);
+		unlock_new_inode(inode);
+	}
+	if (is_bad_inode(inode) || NEXT_ORPHAN(inode) > max_ino) {
+	bad_orphan_inode:
 		ext4_warning(sb, __FUNCTION__,
 			     "bad orphan inode %lu!  e2fsck was run?", ino);
 		printk(KERN_NOTICE "ext4_test_bit(bit=%d, block=%llu) = %d\n",
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 67b6d8a..57dd8fb 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -1041,11 +1041,16 @@ static struct dentry *ext4_lookup(struct inode * dir, struct dentry *dentry, str
 				   "bad inode number: %lu", ino);
 			inode = NULL;
 		} else
-			inode = iget(dir->i_sb, ino);
+			inode = iget_locked(dir->i_sb, ino);
 
 		if (!inode)
 			return ERR_PTR(-EACCES);
 
+		if (inode->i_state & I_NEW) {
+			inode->i_sb->s_op->read_inode(inode);
+			unlock_new_inode(inode);
+		}
+
 		if (is_bad_inode(inode)) {
 			iput(inode);
 			return ERR_PTR(-ENOENT);
@@ -1080,11 +1085,16 @@ struct dentry *ext4_get_parent(struct dentry *child)
 			   "bad inode number: %lu", ino);
 		inode = NULL;
 	} else
-		inode = iget(child->d_inode->i_sb, ino);
+		inode = iget_locked(child->d_inode->i_sb, ino);
 
 	if (!inode)
 		return ERR_PTR(-EACCES);
 
+	if (inode->i_state & I_NEW) {
+		inode->i_sb->s_op->read_inode(inode);
+		unlock_new_inode(inode);
+	}
+
 	if (is_bad_inode(inode)) {
 		iput(inode);
 		return ERR_PTR(-ENOENT);
diff --git a/fs/ext4/resize.c b/fs/ext4/resize.c
index 4fbba60..ebdca31 100644
--- a/fs/ext4/resize.c
+++ b/fs/ext4/resize.c
@@ -779,7 +779,11 @@ int ext4_group_add(struct super_block *sb, struct ext4_new_group_data *input)
 				     "No reserved GDT blocks, can't resize");
 			return -EPERM;
 		}
-		inode = iget(sb, EXT4_RESIZE_INO);
+		inode = iget_locked(sb, EXT4_RESIZE_INO);
+		if (inode && (inode->i_state & I_NEW)) {
+			sb->s_op->read_inode(inode);
+			unlock_new_inode(inode);
+		}
 		if (!inode || is_bad_inode(inode)) {
 			ext4_warning(sb, __FUNCTION__,
 				     "Error opening resize inode");
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 055a0cd..1ef0359 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -777,9 +777,13 @@ static struct inode *ext4_nfs_get_inode(struct super_block *sb,
 	 * Currently we don't know the generation for parent directory, so
 	 * a generation of 0 means "accept any"
 	 */
-	inode = iget(sb, ino);
+	inode = iget_locked(sb, ino);
 	if (inode == NULL)
 		return ERR_PTR(-ENOMEM);
+	if (inode->i_state & I_NEW) {
+		sb->s_op->read_inode(inode);
+		unlock_new_inode(inode);
+	}
 	if (is_bad_inode(inode) ||
 	    (generation && inode->i_generation != generation)) {
 		iput(inode);
@@ -2243,7 +2247,15 @@ static int ext4_fill_super (struct super_block *sb, void *data, int silent)
 	 * so we can safely mount the rest of the filesystem now.
 	 */
 
-	root = iget(sb, EXT4_ROOT_INO);
+	root = iget_locked(sb, EXT4_ROOT_INO);
+	if (!root) {
+		printk(KERN_ERR "EXT4-fs: iget_locked for root inode failed\n");
+		goto failed_mount4;
+	}
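
Every call site in the patch repeats the same three steps: look the inode up, fill it in only if the lookup created it, then clear the "new" state. A user-space model of that pattern might look like the sketch below. Everything here is hypothetical (the `model_` names, the fixed-size cache, the `reads` counter); it illustrates the calling convention and is not kernel code.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_SLOTS 8
#define MODEL_I_NEW 0x1	/* stands in for the kernel's I_NEW state bit */

struct model_inode {
	unsigned long ino;
	unsigned int state;
	int in_use;
	int reads;	/* how many times the "disk" was read for this inode */
};

static struct model_inode cache[CACHE_SLOTS];

/* Return the cached inode, or a fresh one flagged MODEL_I_NEW; NULL if full. */
struct model_inode *model_iget_locked(unsigned long ino)
{
	struct model_inode *free_slot = NULL;

	for (size_t i = 0; i < CACHE_SLOTS; i++) {
		if (cache[i].in_use && cache[i].ino == ino)
			return &cache[i];
		if (!cache[i].in_use && !free_slot)
			free_slot = &cache[i];
	}
	if (!free_slot)
		return NULL;
	free_slot->in_use = 1;
	free_slot->ino = ino;
	free_slot->state = MODEL_I_NEW;
	free_slot->reads = 0;
	return free_slot;
}

/* Caller-side pattern from the patch: fill in only a newly created inode. */
struct model_inode *model_get_inode(unsigned long ino)
{
	struct model_inode *inode;

	assert(ino != 0);	/* inode number 0 is never valid in this model */

	inode = model_iget_locked(ino);
	if (!inode)
		return NULL;
	if (inode->state & MODEL_I_NEW) {
		inode->reads++;			/* stands in for ->read_inode() */
		inode->state &= ~MODEL_I_NEW;	/* stands in for unlock_new_inode() */
	}
	return inode;
}
```

The point of the flag is that a second lookup of the same inode number returns the cached object without reading it again, while the NULL check now has to be done by the caller before touching the inode.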

Re: [GIT PULL] ext4 update

2008-01-29 Thread Theodore Tso

On Tue, Jan 29, 2008 at 10:54:03PM +0100, Jan Engelhardt wrote:
> 
> On Jan 29 2008 07:53, Theodore Tso wrote:
> >
> >>fwiw, diffstat is confused by git's diff output; you need to use
> >>'diffstat -p1'
> 
> I am seeing normal behavior:
>
> 22:52 sovereign:~/linux > git diff HEAD | diffstat

That's because you are doing a diff stat of changes that haven't been
checked in yet.  I was doing a "git log -p origin.. | diffstat -p1",
and in that incantation you definitely do need the -p1 to diffstat.
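
For context, "git log -p" prefixes file names in its diff headers with "a/" and "b/", and "-p1" asks for one leading path component to be removed, in the style of patch(1). A small sketch of that stripping, assuming nothing about how diffstat itself implements it (`strip_components` is a hypothetical helper):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Return a pointer past the first `count` slash-separated components of
 * `path`.  If the path has fewer components than that, whatever follows
 * the last slash that was found is returned.
 */
const char *strip_components(const char *path, int count)
{
	const char *p = path;

	assert(path != NULL);

	while (count-- > 0) {
		const char *slash = strchr(p, '/');

		if (!slash)
			break;
		p = slash + 1;
	}
	return p;
}
```

With one component stripped, "a/fs/ext4/super.c" and "b/fs/ext4/super.c" both reduce to "fs/ext4/super.c", so the old and new sides of each diff name the same file.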

  - Ted

Re: [GIT PULL] ext4 update

2008-01-29 Thread Theodore Tso

>fwiw, diffstat is confused by git's diff output; you need to use
>'diffstat -p1'

Argh, I have *got* to create a script that does this automatically.

Revised diffstat -p1 output follows...

- Ted

 Documentation/filesystems/ext4.txt           |   20
 Documentation/filesystems/proc.txt           |   39
 fs/Kconfig                                   |    1
 fs/afs/dir.c                                 |    9
 fs/afs/inode.c                               |    3
 fs/buffer.c                                  |   44
 fs/ext2/super.c                              |   32
 fs/ext3/super.c                              |   32
 fs/ext4/Makefile                             |    4
 fs/ext4/balloc.c                             |  251 +
 fs/ext4/dir.c                                |   14
 fs/ext4/extents.c                            |  525 +--
 fs/ext4/file.c                               |   23
 fs/ext4/group.h                              |    8
 fs/ext4/ialloc.c                             |  161
 fs/ext4/inode.c                              |  396 +-
 fs/ext4/ioctl.c                              |    7
 fs/ext4/mballoc.c                            | 4552 +++
 fs/ext4/migrate.c                            |  570 +++
 fs/ext4/namei.c                              |  135
 fs/ext4/resize.c                             |   28
 fs/ext4/super.c                              |  389 +-
 fs/ext4/xattr.c                              |    4
 fs/inode.c                                   |   39
 fs/jbd2/checkpoint.c                         |   22
 fs/jbd2/commit.c                             |  255 +
 fs/jbd2/journal.c                            |  368 ++
 fs/jbd2/recovery.c                           |  151
 fs/jbd2/revoke.c                             |    6
 fs/jbd2/transaction.c                        |   34
 fs/read_write.c                              |    1
 include/asm-arm/bitops.h                     |    2
 include/asm-generic/bitops/ext2-non-atomic.h |    2
 include/asm-generic/bitops/le.h              |    4
 include/asm-m68k/bitops.h                    |    2
 include/asm-m68knommu/bitops.h               |    2
 include/asm-powerpc/bitops.h                 |    4
 include/asm-s390/bitops.h                    |    2
 include/linux/buffer_head.h                  |    2
 include/linux/ext4_fs.h                      |  224 +
 include/linux/ext4_fs_extents.h              |   25
 include/linux/ext4_fs_i.h                    |   25
 include/linux/ext4_fs_sb.h                   |   55
 include/linux/fs.h                           |   21
 include/linux/jbd2.h                         |  135
 lib/find_next_bit.c                          |   43
 46 files changed, 7773 insertions(+), 898 deletions(-)

Re: [RFC] Parallelize IO for e2fsck

2008-01-28 Thread Theodore Tso

On Mon, Jan 28, 2008 at 07:30:05PM +, Pavel Machek wrote:
> 
> As user pages are always in highmem, this should be easy to decide:
> only send SIGDANGER when highmem is full. (Yes, there are
> inodes/dentries/file descriptors in lowmem, but I doubt apps will
> respond to SIGDANGER by closing files).

Good point; for a system with at least (say) 2GB of memory, that
definitely makes sense.  For a system with less than 768 megs of
memory (how quaint, but it wasn't that long ago this was a lot of
memory :-), there wouldn't *be* any memory in highmem at all

- Ted

Re: [PATCH 24/49] ext4: add block bitmap validation

2008-01-26 Thread Theodore Tso

On Wed, Jan 23, 2008 at 02:06:54PM -0800, Andrew Morton wrote:
> brelse() should only be used when the bh might be NULL - put_bh()
> can be used here.
> 
> Please review all ext4/jbd2 code for this trivial speedup.

I've reviewed all of the pending patches in the stable queue for this
speedup, and applied them where necessary; it was useful, since I
detected a buffer head leak in one of the patches while I was at it.

The ext4/jbd2 code as a whole still needs to be reviewed for this
speedup, but I don't want to fix this in the initial stable push, lest
I break something by accident.  I'll put it in the "TO DO" queue.
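
The distinction Andrew is drawing can be shown with a simplified user-space model: one release helper tolerates NULL and so tests its argument on every call, the other assumes the caller already knows the pointer is valid. The `model_` names are hypothetical and the real kernel helpers do more than this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* User-space model of a reference-counted buffer; not the kernel struct. */
struct model_bh {
	int b_count;
};

/* Model of put_bh(): unconditional decrement, caller guarantees non-NULL. */
void model_put_bh(struct model_bh *bh)
{
	assert(bh != NULL);
	bh->b_count--;
}

/* Model of brelse(): accepts NULL, at the cost of a test on every call. */
void model_brelse(struct model_bh *bh)
{
	if (bh)
		model_put_bh(bh);
}
```

Where the code has just checked or dereferenced the pointer, the NULL test in the tolerant helper is redundant, which is the "trivial speedup" being asked for.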

Regards,

 - Ted

Re: [RFC] Parallelize IO for e2fsck

2008-01-26 Thread Theodore Tso

On Fri, Jan 25, 2008 at 05:55:51PM -0800, Bryan Henderson wrote:
> I was surprised to see AIX do late allocation by default, because IBM's 
> traditional style is bulletproof systems.  A system where a process can be 
> killed at unpredictable times because of resource demands of unrelated 
> processes doesn't really fit that style.
> 
> It's really a fairly unusual application that benefits from late 
> allocation: one that creates a lot more virtual memory than it ever 
> touches.  For example, a sparse array.  Or am I missing something?

I guess it depends on how far you try to do "bulletproof".  OSF/1 used
to use "bulletproof" as its default --- and I had to turn it off on
tsx-11.mit.edu (the first North American ftp server for Linux :-),
because the difference was something like 50 ftp daemons versus over
500 on the same server.  It reserved VM space for the text segment of
every single process, since at least in theory, it's possible for
every single text page to get modified using ptrace if (for example) a
debugger were to set a break point on every single page of every
single text segment of every single ftp daemon.

You can also see potential problems for Java programs.  Suppose you
had some gigantic Java Application (say, Lotus Notes, or Websphere
Application Server) which is taking up many, many, MANY gigabytes of
VM space.  Now suppose the Java application needs to fork and exec
some trivial helper program.  For that tiny instant, between the fork
and exec, the VM requirements in "bulletproof" mode would double,
since while 99.% of the time programs will immediately discard the
VM upon the exec, there is always the possibility that the child
process will touch every single data page, forcing a copy on write,
and never do the exec.
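
The doubling described above is simple arithmetic, sketched here as a toy accounting model. The function names, the strict/lazy switch and the sizes used in the usage note are illustrative assumptions, not a description of how any particular kernel accounts for memory.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Worst-case memory charged immediately after fork(), in bytes.
 * Under strict ("bulletproof") accounting the child's copy-on-write pages
 * are reserved up front; under late allocation nothing extra is charged
 * until a page is actually written.
 */
uint64_t commit_after_fork(uint64_t parent_writable, int strict)
{
	return strict ? 2 * parent_writable : parent_writable;
}

/* Would the fork be allowed under the given commit limit? */
int fork_would_fit(uint64_t parent_writable, uint64_t commit_limit, int strict)
{
	assert(commit_limit > 0);	/* a zero limit makes the model meaningless */
	return commit_after_fork(parent_writable, strict) <= commit_limit;
}
```

For example, a hypothetical 8 GiB process on a machine with a 12 GiB commit limit could not fork at all under the strict policy, even if the child calls exec immediately, while the lazy policy lets it through.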

There are of course different levels of "bulletproof" between the
extremes of "totally bulletproof" and "late binding" from an
algorithmic standpoint.  For example, you could ignore the needed
pages caused by ptrace(); more challenging would be how to handle
the fork/exec semantics, although there could be kludges such as
strongly encouraging applications to use an old-fashioned BSD-style
vfork() to guarantee that the child couldn't double VM requirements
between the vfork() and exec().  I certainly can't say for sure what
the AIX designers had in mind, and why they didn't choose one of the
more intermediate design choices.  

However, it is fair to say that "100% bulletproof" can require
reserving far more VM resources than you might first expect.  Even a
company which is highly incented to sell large amounts of hardware,
such as Digital, might not have wanted their OS to be only able to
support an embarrassingly small number of simultaneous ftpd
connections.  I know this for sure because the OSF/1 documentation,
when discussing their VM tuning knobs, specifically talked about the
scenario that I ran into with tsx-11.mit.edu.

Regards,

- Ted

Re: [PATCH 36/49] ext4: Add EXT4_IOC_MIGRATE ioctl

2008-01-25 Thread Theodore Tso

On Thu, Jan 24, 2008 at 11:25:32AM +0530, Aneesh Kumar K.V wrote:
> +static int free_ext_idx(handle_t *handle, struct inode *inode,
> +					struct ext4_extent_idx *ix)
> +{
> +	int i, retval = 0;
> +	ext4_fsblk_t block;
> +	struct buffer_head *bh;
> +	struct ext4_extent_header *eh;
> +
> +	block = idx_pblock(ix);
> +	bh = sb_bread(inode->i_sb, block);
> +	if (!bh)
> +		return -EIO;
> +
> +	eh = (struct ext4_extent_header *)bh->b_data;
> +	if (eh->eh_depth == 0) {
> +		brelse(bh);
> +		ext4_free_blocks(handle, inode, block, 1);
> +	} else {
> +		ix = EXT_FIRST_INDEX(eh);
> +		for (i = 0; i < le16_to_cpu(eh->eh_entries); i++, ix++) {
> +			retval = free_ext_idx(handle, inode, ix);
> +			if (retval)
> +				return retval;
> +		}
> +	}
> +	return retval;
> +}

Aneesh, looks like if eh->eh_depth is != 0, bh gets leaked.  This is
how I plan to fix it up:

+static int free_ext_idx(handle_t *handle, struct inode *inode,
+                        struct ext4_extent_idx *ix)
+{
+        int i, retval = 0;
+        ext4_fsblk_t block;
+        struct buffer_head *bh;
+        struct ext4_extent_header *eh;
+
+        block = idx_pblock(ix);
+        bh = sb_bread(inode->i_sb, block);
+        if (!bh)
+                return -EIO;
+
+        eh = (struct ext4_extent_header *)bh->b_data;
+        if (eh->eh_depth == 0)
+                ext4_free_blocks(handle, inode, block, 1);
+        else {
+                ix = EXT_FIRST_INDEX(eh);
+                for (i = 0; i < le16_to_cpu(eh->eh_entries); i++, ix++) {
+                        retval = free_ext_idx(handle, inode, ix);
+                        if (retval)
+                                break;
+                }
+        }
+        put_bh(bh);
+        return retval;
+}

- Ted

Re: [RFC] ext3 freeze feature

2008-01-25 Thread Theodore Tso

On Fri, Jan 25, 2008 at 10:34:25AM -0600, Eric Sandeen wrote:
> > But it was this concern which is why ext3 never exported freeze
> > functionality to userspace, even though other commercial filesystems
> > do support this.  It wasn't that it wasn't considered, but the concern
> > about whether or not it was sufficiently safe to make available.
> 
> What's the safety concern; that the admin will forget to unfreeze?

That the admin would manage to deadlock him/herself and wedge up the
whole system...

> I'm also not sure I see the point of the timeout in the original patch;
> either you are done snapshotting and ready to unfreeze, or you're not;
> 1, or 2, or 3 seconds doesn't really matter.  When you're done, you're
> done, and you can only unfreeze then.  Shouldn't this be done
> programmatically, and not with some pre-determined timeout?

This is only a guess, but I suspect it was a fail-safe in case the
admin did manage to deadlock him/herself.  

I would think a better approach would be to make the filesystem
unfreeze if the file descriptor that was used to freeze the filesystem
is closed, and then have explicit deadlock detection that kills the
process doing the freeze, at which point the filesystem unlocks and
the system can recover.

- Ted

Re: [RFC] ext3 freeze feature

2008-01-25 Thread Theodore Tso

On Fri, Jan 25, 2008 at 03:18:51PM +0300, Dmitri Monakhov wrote:
> First of all Linux already have at least one open-source(dm-snap),
> and several commercial snapshot solutions. 

Yes, but it requires that the filesystem be stored under LVM.  Unlike
what EVMS v1 allowed us to do, we can't currently take a snapshot of a
bare block device.  This patch could potentially be useful for systems
which aren't using LVM, however.

> You have to realize what delay between 1-3 stages have to be minimal.
> for example dm-snap perform it only for explicit journal flushing.
> From my experience if delay is more than 4-5 seconds whole system becomes
> unstable.

That's the problem.  You can't afford to freeze for very long.

What you *could* do is to start putting processes to sleep if they
attempt to write to the frozen filesystem, and then detect the
deadlock case where the process holding the file descriptor used to
freeze the filesystem gets frozen because it attempted to write to the
filesystem --- at which point it gets some kind of signal (which
defaults to killing the process), and the filesystem is unfrozen and
as part of the unfreeze you wake up all of the processes that were put
to sleep for touching the frozen filesystem.
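
To make the idea concrete, here is a small userspace model of that
policy.  It is only a sketch: the struct, the function names and the
fixed-size sleeper list are all invented for illustration, and it
covers only the direct case described above (the freezer itself
writing to the frozen filesystem), not longer dependency chains.

```c
#include <stddef.h>

#define MAX_SLEEPERS 16

/* Hypothetical model of one frozen filesystem. */
struct frozen_fs {
        int frozen;
        int freezer_pid;                /* process holding the freeze fd */
        int sleepers[MAX_SLEEPERS];     /* writers that were put to sleep */
        size_t nsleepers;
};

enum write_result { WRITE_OK, WRITE_SLEEPS, WRITE_DEADLOCK_BROKEN };

void fs_freeze(struct frozen_fs *fs, int pid)
{
        fs->frozen = 1;
        fs->freezer_pid = pid;
        fs->nsleepers = 0;
}

/* Unfreeze and wake every sleeper; returns how many were woken. */
size_t fs_unfreeze(struct frozen_fs *fs)
{
        size_t woken = fs->nsleepers;

        fs->frozen = 0;
        fs->freezer_pid = -1;
        fs->nsleepers = 0;
        return woken;
}

enum write_result fs_write(struct frozen_fs *fs, int pid)
{
        if (!fs->frozen)
                return WRITE_OK;
        if (pid == fs->freezer_pid) {
                /*
                 * The freezer would block on its own freeze.  In the
                 * proposal it gets a signal (fatal by default) and the
                 * filesystem is thawed, waking the sleepers.
                 */
                fs_unfreeze(fs);
                return WRITE_DEADLOCK_BROKEN;
        }
        if (fs->nsleepers < MAX_SLEEPERS)
                fs->sleepers[fs->nsleepers++] = pid;
        return WRITE_SLEEPS;
}
```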

The other approach would be to say, "oh well, the freeze ioctl is
inherently dangerous, and root is allowed to shoot himself in the foot, so
who cares".  :-)

But it was this concern which is why ext3 never exported freeze
functionality to userspace, even though other commercial filesystems
do support this.  It wasn't that it wasn't considered, but the concern
about whether or not it was sufficiently safe to make available.

And I do agree that we probably should just implement this in a
filesystem-independent way, in which case all of the filesystems that
support this already have super_operations functions
write_super_lockfs() and unlockfs().

So if this is done using a new system call, there should be no
filesystem-specific changes needed, and all filesystems which support
those super_operations method functions would be able to provide this
functionality to the new system call.

 - Ted

P.S.  Oh yeah, it should be noted that freezing at the filesystem
layer does *not* guarantee that changes to the block device aren't
happening via mmap()'ed files.  The LVM needs to freeze writes at the
block device level if it wants to guarantee a completely stable
snapshot image.  So the proposed patch doesn't quite give you those
guarantees, if that was the intended goal.

Re: [RFC] Parallelize IO for e2fsck

2008-01-24 Thread Theodore Tso

On Fri, Jan 25, 2008 at 01:08:09AM +0200, Adrian Bunk wrote:
> In practice, there is a small number of programs that are both the
> common memory hogs and should be able to reduce their memory consumption
> by 10% or 20% without big problems when requested (e.g. Java VMs,
> Firefox and databases come into my mind).

I agree, it's only a few processes where this makes sense.  But for
those that do, it would be useful if they could register with the
kernel when they would like to be notified (just before the system starts
ejecting cached data, just before swapping, etc.) and at what
frequency.  And presumably, if the kernel notices that a process is
responding to such requests with memory actually getting released back
to the system, that process could get "rewarded" by having the OOM
killer less likely to target that particular thread.

AIX basically did this with SIGDANGER (the signal is ignored by
default), except there wasn't the ability for the process to tell the
kernel at what level of memory pressure it should start getting
notified, and there was no way for the kernel to tell how bad the
memory pressure actually was.  On the other hand, it was a relatively
simple design.

In practice very few processes would indeed pay attention to
SIGDANGER, so I think you're quite right there.

> And from a performance point of view letting applications voluntarily 
> free some memory is better even than starting to swap.

Absolutely.

- Ted

Re: [RFC] Parallelize IO for e2fsck

2008-01-22 Thread Theodore Tso

On Tue, Jan 22, 2008 at 12:00:50AM -0700, Andreas Dilger wrote:
> > AIX had SIGDANGER some 15 years ago.  Admittedly, that was sent when
> > the system was about to hit OOM, not when it was about to start swapping.
> 
> I'd tried to advocate SIGDANGER some years ago as well, but none of
> the kernel maintainers were interested.  It definitely makes sense
> to have some sort of mechanism like this.  At the time I first brought
> it up it was in conjunction with Netscape using too much cache on some
> system, but it would be just as useful for all kinds of other memory-
> hungry applications.

It's been discussed before, but I suspect the main reason why it was
never done is no one submitted a patch.  Also, the problem is actually
a pretty complex one.  There are a couple of different stages where
you might want to send an alert to processes:

* Data is starting to get ejected from page/buffer cache
* System is starting to swap
* System is starting to really struggle to find memory
* System is starting an out-of-memory killer

AIX's SIGDANGER really did the last two, where the OOM killer would
tend to avoid processes that had a SIGDANGER handler in favor of
processes that were SIGDANGER unaware.
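
A sketch of that kind of victim selection might look like the
following.  Note the assumptions: picking the largest process by
resident size is my own stand-in for whatever heuristic the real OOM
killer uses, and struct proc_info and pick_victim() are hypothetical
names; the only part taken from the description above is the
preference for processes without a handler.

```c
#include <stddef.h>

/* Hypothetical per-process summary used only by this sketch. */
struct proc_info {
        int pid;
        size_t rss_pages;       /* resident memory, in pages */
        int has_danger_handler; /* nonzero if it installed a handler */
};

/*
 * Pick the process to kill: the largest one without a handler if any
 * exists, otherwise the largest one overall.  Returns NULL if n == 0.
 */
const struct proc_info *pick_victim(const struct proc_info *procs, size_t n)
{
        const struct proc_info *best_unaware = NULL, *best_any = NULL;
        size_t i;

        for (i = 0; i < n; i++) {
                const struct proc_info *p = &procs[i];

                if (!best_any || p->rss_pages > best_any->rss_pages)
                        best_any = p;
                if (!p->has_danger_handler &&
                    (!best_unaware || p->rss_pages > best_unaware->rss_pages))
                        best_unaware = p;
        }
        return best_unaware ? best_unaware : best_any;
}
```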

Then there is the additional complexity in Linux that you have
multiple zones of memory, which at least on the historically more
popular x86 was highly, highly important.  You could say that whenever
there is sufficient memory pressure in any zone that you start
ejecting data from caches or start to swap that you start sending the
signals --- but on x86 systems with lowmem, that could happen quite
frequently, and since a user process has no idea whether its resources
are in lowmem or highmem, there's not much you can do about this.

Hopefully this is less of an issue today, since the 2.6 VM is much
better behaved, and people are gradually moving over to x86_64
anyway.  (Sorry SGI and Intel, unfortunately they're not moving over
to the Itanic :-).   So maybe this would be better received now.

Bringing us back to the main topic at hand, one of the tradeoffs in
Val's current approach is that by relying on the kernel's buffer
cache, we don't have to worry about locking and coherency at the
userspace level.  OTOH, we give up low-level control about when memory
gets thrown out, and it also means that simply getting notified when
the system starts to swap isn't good enough.  We need to know much
earlier, when the system starts ejecting data from the buffer and page
caches.

Does this matter?  Well, there are a couple of use cases:

 * The restricted boot environment
 * The background "once a month" take a snapshot and check
 * The oh-my-gosh we-lost-a-filesystem -- repair it while the 
   IMAP server is still on-line serving data from the other 
   mounted filesystems.

It's the last case where things get really tricky

- Ted

Re: [CALL FOR TESTING] Make Ext3 fsck way faster [2.6.24-rc6 -mm patch]

2008-01-20 Thread Theodore Tso

On Sat, Jan 19, 2008 at 08:10:20PM -0800, Daniel Phillips wrote:
> 
> I can see value in preemptively loading indirect blocks into the buffer 
> cache, but is building a second-order extent tree really worth the 
> effort?  Probing the buffer cache is very fast.

It's not that much effort, and for a big database (say, like a 50GB
database file), the indirect blocks would take up 50 megabytes of
memory.  Collapsing it into an extent tree would shrink that to
a few kilobytes.  I suppose a database server would probably have
5-10GB of memory, so in the grand scheme of things it's not a vast
amount of memory, but the trick is keeping the indirect blocks pinned
so they don't get pushed out by some vast, gigundo Java application
running in the same server as the database.  If you have the indirect
blocks encoded into the extent tree, then you don't have to worry
about that.
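
The 50 megabyte figure can be checked with a bit of arithmetic.  The
sketch below assumes 4 KiB blocks and 4-byte block numbers, and it
ignores the handful of double- and triple-indirect blocks as well as
the direct blocks in the inode; the function name is made up.

```c
#include <stdint.h>

/*
 * Approximate bytes of single-indirect blocks needed to map a file.
 * Assumes 4-byte block pointers, so each indirect block maps
 * block_size / 4 data blocks.
 */
uint64_t indirect_bytes(uint64_t file_bytes, uint32_t block_size)
{
        uint64_t data_blocks = (file_bytes + block_size - 1) / block_size;
        uint64_t ptrs_per_block = block_size / 4;
        uint64_t indirect_blocks =
                (data_blocks + ptrs_per_block - 1) / ptrs_per_block;

        return indirect_blocks * block_size;
}
```

With a 50 GiB file and 4 KiB blocks this comes to 13,107,200 data
blocks, 12,800 indirect blocks, and 52,428,800 bytes (50 MiB) of
mapping metadata, i.e. about 1/1024 of the file size.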

   - Ted

Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)

2008-01-18 Thread Theodore Tso

On Thu, Jan 17, 2008 at 04:31:48PM -0800, Bryan Henderson wrote:
> But I heard some years ago from a disk drive engineer that that is a myth 
> just like the rotational energy thing.  I added that to the discussion, 
> but admitted that I haven't actually seen a disk drive write a partial 
> sector.

Well, it would be impossible or at least very hard to see that in
practice, right?  My understanding is that drives do sector-level
checksums, so if there was a partially written sector, the checksum
would be bogus and the drive would return an error when you tried to
read from it.
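
Here is a toy model of why a partially written sector would show up as
a read error rather than as silently wrong data.  It is only a sketch:
real drives use ECC codes far stronger than the plain byte sum below,
and the struct and function names are invented for illustration.

```c
#include <stdint.h>
#include <string.h>

#define SECTOR_SIZE 512

/* Hypothetical on-media layout: data plus a stored checksum. */
struct sector {
        unsigned char data[SECTOR_SIZE];
        uint32_t csum;
};

/* Deliberately simple checksum: the sum of all data bytes. */
static uint32_t sector_sum(const unsigned char *data)
{
        uint32_t sum = 0;
        size_t i;

        for (i = 0; i < SECTOR_SIZE; i++)
                sum += data[i];
        return sum;
}

/* Complete write: data and checksum are both updated. */
void write_sector(struct sector *s, const unsigned char *data)
{
        memcpy(s->data, data, SECTOR_SIZE);
        s->csum = sector_sum(s->data);
}

/* Interrupted write: only the first n bytes land; checksum is stale. */
void write_sector_torn(struct sector *s, const unsigned char *data, size_t n)
{
        memcpy(s->data, data, n);
}

/* Returns 0 and copies the data out, or -1 on a checksum mismatch. */
int read_sector(const struct sector *s, unsigned char *out)
{
        if (sector_sum(s->data) != s->csum)
                return -1;
        memcpy(out, s->data, SECTOR_SIZE);
        return 0;
}
```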

> Ted brought up the separate issue of the host sending garbage to the disk 
> device because its own power is failing at the same time, which makes the 
> integrity at the disk level moot (or even undesirable, as you'd rather 
> write a bad sector than a good one with the wrong data).

Yep, exactly.  It would be interesting to see if this happens on
modern hardware; all of the evidence I've had for this is years old at
this point.  

- Ted

Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)

2008-01-17 Thread Theodore Tso

On Wed, Jan 16, 2008 at 09:02:50PM -0500, Daniel Phillips wrote:
> 
> Have you observed that in the wild?  A former engineer of a disk drive
> company suggests to me that the capacitors on the board provide enough
> power to complete the last sector, even to park the head.
> 

The problem isn't with the disk drive; it's from the DRAM, which tends
to be much more voltage sensitive than the hard drives --- so it's
quite likely that you could end up DMA'ing garbage from the memory.
In fact, the fact that the disk drive lasts longer due to capacitors
on the board, rotational inertia of the platters, etc., is part of the
problem.

It was observed in the wild by SGI, many years ago on their hardware.
They later added extra capacitors on the motherboard and a powerfail
interrupt which caused Irix to run around frantically shutting
down DMA's for a controlled shutdown.  Of course, PC-class hardware
has none of this.  My source for this was Jim Mostek, one of the
original Linux XFS porters.  He had given me source code to a test
program that would show this; it basically zeroed out a region of disk,
then started writing a series of patterns on that part of the disk; you
then kicked out the power cord, and checked whether there was any garbage
on the disk.  If you saw something that wasn't one of the patterns
being written to the disk, then you knew you had a problem.  I can't
find the program any more, but it wouldn't be hard to write.  
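
The checking half of such a program could be as simple as the sketch
below.  This is not the original program, just a guess at its shape:
the block size, the use of single-byte fill patterns, and the function
names are all assumptions.  After the power cut you would read the
region back and count the blocks that are neither still zeroed nor
filled with one of the known patterns.

```c
#include <stddef.h>
#include <string.h>

#define BLOCK_SIZE 512

/* Fill one block with a repeating byte pattern, as the writer would. */
void fill_block(unsigned char *blk, unsigned char pattern)
{
        memset(blk, pattern, BLOCK_SIZE);
}

/* Return 1 if every byte of the block equals 'value'. */
static int block_is_uniform(const unsigned char *blk, unsigned char value)
{
        size_t i;

        for (i = 0; i < BLOCK_SIZE; i++)
                if (blk[i] != value)
                        return 0;
        return 1;
}

/*
 * Count blocks that are neither still zeroed nor completely filled
 * with one of the known patterns.  A nonzero result means garbage
 * (or a torn block) reached the disk.
 */
size_t count_garbage_blocks(const unsigned char *region, size_t nblocks,
                            const unsigned char *patterns, size_t npatterns)
{
        size_t b, p, garbage = 0;

        for (b = 0; b < nblocks; b++) {
                const unsigned char *blk = region + b * BLOCK_SIZE;
                int ok = block_is_uniform(blk, 0);      /* still zeroed */

                for (p = 0; !ok && p < npatterns; p++)
                        ok = block_is_uniform(blk, patterns[p]);
                if (!ok)
                        garbage++;
        }
        return garbage;
}
```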

I do know that I have seen reports from many ext2 users in the field
that could only be explained by the hard drive scribbling garbage onto
the inode table.  Ext3 solves this problem because of its physical
block journaling.

- Ted

Re: [CALL FOR TESTING] Make Ext3 fsck way faster [2.6.24-rc6 -mm patch]

2008-01-15 Thread Theodore Tso

On Tue, Jan 15, 2008 at 01:15:33PM +, Christoph Hellwig wrote:
> They won't fsck in planned downtimes.  They will have to use fsck when
> the shit hits the fan and they need to.   Not sure about ext3, but big
> XFS user with a close tie to the US goverment were concerned about this
> case for really big filesystems and have sponsored speedup including
> multithreading xfs_repair.  I'm pretty sure the same arguments apply
> to ext3, even if the filesystems are a few magnitudes smaller.

Agreed, 100%.  Even if you fsck snapshots during slow periods, it
still doesn't help you if the filesystem gets corrupted due to a
hardware or software error.  That's where this will matter the most.

Val Henson has done a proof of concept patch that multi-threads e2fsck
(and she's working on one that would be long-term supportable) that
might reduce the value of this patch, but metaclustering should still
help.

> > In any decent environment, people will fsck their ext3 filesystems during
> > planned downtime, and the benefit of reducing that downtime from 6
> > hours/machine to 2 hours/machine is probably fairly small, given that there
> > is no service interruption.  (The same applies to desktops and laptops).
> > 
> > Sure, the benefit is not *zero*, but it's small.  Much less than it would
> > be with ext2.  I mean, the "avoid unplanned fscks" feature is the whole
> > reason why ext3 has journalling (and boy is that feature expensive during
> > normal operation).

Also, it's not just reducing fsck times, although that's the main one.
The last time this was suggested, the rationale was to speed up the
"rm dvd.iso" case.  Also, something which *could* be done, if Abhishek
wants to pursue it, would be to pull in all of the indirect blocks
when the file is opened, and create an in-memory extent tree that
would speed up access to the file.  It's rarely worth doing this
without metaclustering, since it doesn't help for sequential I/O, only
random I/O, but with metaclustering it would also be a win for
sequential I/O.  (This would also remove the minor performance
degradation for sequential I/O imposed by metaclustering, and in fact
improve it slightly for really big files.)

- Ted

Re: The ext3 way of journalling

2008-01-13 Thread Theodore Tso

On Mon, Jan 14, 2008 at 12:23:10AM +0200, Tuomo Valkonen wrote:
> On 2008-01-14 00:13 +0200, Tuomo Valkonen wrote:
> > Also, I must say that e2fsck is brain-damaged, if it can be confused 
> > by/do the stupid then when the system clock has warped by just a few
> > hours, not the _days_ that a file system check interval typically is,
> > and users need to specifically kludge around such misbehaviour in 
> > e2fsck.
> 
> Just to clarify, I had about 60 days of uptime, and hence at least
> 60 days since the last FS check/mount/etc., when Linux crashed those
> few days ago, and wanted to start checking disks with "9192 days since
> last file system check".

Well, let's see.  9192 days is a little over 25 years, so that means
the filesystem was marked as having done an fsck in 2008-25 or roughly
1983.  If you're not seeing any other corruption when e2fsck runs,
it's highly unlikely that the superblock is getting corrupted.  It's
much more likely that this early in your boot cycle, your clock is
sometimes incorrect.
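
The arithmetic is easy to check.  A small sketch, assuming the date of
this message (2008-01-13) as the moment e2fsck printed the figure; the
actual boot date is not known, and the function name is made up for
illustration.

```python
from datetime import date, timedelta


def implied_last_check(today: date, days_since_check: int) -> date:
    """Date the superblock must hold for e2fsck to report this many days."""
    return today - timedelta(days=days_since_check)


if __name__ == "__main__":
    reference = date(2008, 1, 13)  # assumed: the date of this message
    print(implied_last_check(reference, 9192))
```

With that reference date the result falls in late 1982, which is
consistent with the "roughly 1983" estimate.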

My suggestion to you is to rig your init scripts to print out the
current time/date using "/bin/date" and to print out the superblock
information using "dumpe2fs -h /dev/hdXX" and record the information
someplace useful.  A simple way to do this would be via the following
command inserted into /etc/init.d/checkroot.sh:

(date; /sbin/dumpe2fs -h /dev/XXX) | logsave -a /var/log/boot-debug -

where you've replaced /dev/XXX with the block device of the filesystem
which keeps on getting checked erroneously.

All I can say is that most people aren't seeing what you're seeing, so
there is something unique about your system which is causing this
problem to show up.  9192 days means it's not the time going backwards
scenario; somehow the last checked value is getting set to some very
bogus value.  Normally the only way this could happen is for the time
to be set to a bogus value (i.e., 1982) when the filesystem check
takes place.  Is the "9192" number roughly constant, or is it always
changing?

I wonder if the battery-backed hardware clock in your system is
busted, and so you're always starting the system with some completely
bogus time.  If your machine is on the network, then the "ntpdate"
program could be setting your time so that it looks correct, but
that's after e2fsck is run.  If you really, really, can't guarantee
that the time on your system is correct in early boot, about the only
thing you really *can* do is to use the command "tune2fs -i 0
/dev/XXX" and disable time-based checks altogether.

Regards and best of luck,

- Ted

Re: The ext3 way of journalling

2008-01-12 Thread Theodore Tso

On Thu, Jan 10, 2008 at 03:41:11PM +0200, Tuomo Valkonen wrote:
> On 2008-01-10 08:16 -0500, Theodore Tso wrote:
> > > It displays just the right time. On boot anyway. (Linux has had some
> > > serious problems keeping the time after the switch from 2.6.7 to 2.6.14,
> > > advancing even 15 minutes a day -- that ntpd doesn't seem to be able 
> > > to keep up with -- requiring running adjtimexconfig every now and
> > > then for new settings. But the cmos clock displays the right time.)
> > 
> > What do you mean by "on boot"?  Which boot message, precisely?  Is the
> > time printed before or after e2fsck is run, and by which program?
> 
> The time is right as displayed by `date` after boot, i.e. after it has
> been loaded from the CMOS clock that does keep the (local, IIRC) time
> just all right. But then it often starts advancing very fast.

So that's from running the "date" command after the boot sequence is
completely finished.  That doesn't mean that the system clock was
correct at the time when fsck was run.

See, here's the problem.  You have the CMOS hardware clock, which
for people who are dual-booting with Windows, is unfortunately ticking
local time, instead of GMT time (or if you want to be pedantic, UTC
time; whatever).  When the kernel is first loaded and starts
executing, it will set the Linux system clock from the CMOS hardware
clock.  However, it has *no* idea whether the CMOS hardware clock is
ticking localtime or UTC time.  The Linux system clock (i.e., what is
returned via the gettimeofday() or time() functions) is always UTC
time.

What happens later is that distribution init scripts will adjust the
system clock either forward or backwards if the system is set up so
that hardware is in Windows bug-compatibility mode where the CMOS
hwclock is ticking localtime.  If it is 1400 GMT, then in the
US/Eastern timezone, the CMOS clock will read 9:00am, so the system clock
will be pushed five hours later.  If you are in the Central European Timezone,
then the local time will be 3pm, and the clock will be pushed
*backwards* by one hour.
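
The size and direction of that adjustment can be worked out
mechanically.  A rough sketch, assuming fixed standard-time offsets
(UTC-5 for US/Eastern, UTC+1 for Central European Time) and an arbitrary
example date; boot_clock_correction is a hypothetical name, not
something the init scripts provide.

```python
from datetime import datetime, timedelta, timezone


def boot_clock_correction(utc_offset_hours: float) -> timedelta:
    """How far the system clock must move once the timezone is known."""
    true_utc = datetime(2008, 1, 12, 14, 0, tzinfo=timezone.utc)  # 1400 GMT
    # What a CMOS clock ticking localtime shows at that instant.
    local = true_utc.astimezone(timezone(timedelta(hours=utc_offset_hours)))
    # The kernel takes those wall-clock digits and treats them as UTC.
    misread = local.replace(tzinfo=timezone.utc)
    return true_utc - misread


if __name__ == "__main__":
    print(boot_clock_correction(-5))  # US/Eastern: positive, clock moves forward
    print(boot_clock_correction(+1))  # CET: negative, clock moves backwards
```

A negative result is the troublesome case: until the correction is
applied, the system clock is running ahead of the real time.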

The question is when does this happen.  In some buggy distributions,
this happens *after* e2fsck is run.  And it is in those distributions
that e2fsck can sometimes get confused about when the filesystem was
last checked --- especially if the system is getting
rebooted a lot (which tends to be the case with people who are
dual-booting).  So the cases where this happens a lot are (a) people
who are using Windows and so the CMOS hwclock is ticking localtime, and
(b) distributions that don't adjust the Linux system clock before
e2fsck runs.  Unfortunately Ubuntu users in Europe fit this
demographic hugely, and Ubuntu refuses to fix this problem[1], so it's
been personally very vexing, because the users complain to *me*, and I
can't fix the problem, because it's a distribution init script issue.

So what I tell people is to upgrade to the latest e2fsprogs, and then
set in /etc/e2fsck.conf:

[options]
buggy_init_scripts = 1

Maybe someday Ubuntu will get this right --- but I'm not counting on it.

[1] Something about installer CD's, and not wanting to ask the users
any questions, not even what time zone they are in, or some other
craziness.  I never completely understood the argument and their
design constraints.

 - Ted

P.S.  If there are other scripts which are started, they can also get
confused because the time is getting warped backwards early-on.  I
haven't done an analysis to find out which sorts of programs might be
vulnerable to this, but this is not necessarily an e2fsck-specific
problem.  After all, it *is* reasonable to expect that the time
returned by time(0) or gettimeofday() is correct, and many programs do
make that assumption.

Re: [RFD] Incremental fsck

2008-01-12 Thread Theodore Tso

On Wed, Jan 09, 2008 at 02:52:14PM +0300, Al Boldi wrote:
> 
> Ok, but let's look at this a bit more opportunistic / optimistic.
> 
> Even after a black-out shutdown, the corruption is pretty minimal, using 
> ext3fs at least.
>

After an unclean shutdown, assuming you have decent hardware that
doesn't lie about when blocks hit iron oxide, you shouldn't have any
corruption at all.  If you have crappy hardware, then all bets are off.

> So let's take advantage of this fact and do an optimistic fsck, to
> assure integrity per-dir, and assume no external corruption.  Then
> we release this checked dir to the wild (optionally ro), and check
> the next.  Once we find external inconsistencies we either fix it
> unconditionally, based on some preconfigured actions, or present the
> user with options.

So what can you check?  The *only* thing you can check is whether or
not the directory syntax looks sane, whether the inode structure looks
sane, and whether or not the blocks reported as belonging to an inode
look sane.

What is very hard to check is whether or not the link count on the
inode is correct.  Suppose the link count is 1, but there are actually
two directory entries pointing at it.  Now when someone unlinks the
file through one of the directory entries, the link count will go
to zero, and the blocks will start to get reused, even though the
inode is still accessible via another pathname.  Oops.  Data Loss.
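
To see why this needs a global pass, here is a toy model, not e2fsck's
actual code.  It assumes directories are given as a dict mapping a
directory name to (entry name, inode number) pairs and that the stored
link counts are a dict keyed by inode number; it also ignores "." and
".." entries.  All names are made up for illustration.

```python
from collections import Counter


def link_count_errors(directories, stored_counts):
    """Map inode -> (stored count, references seen) wherever they differ."""
    # Every directory has to be walked before any single count can be trusted.
    seen = Counter(
        ino for entries in directories.values() for _name, ino in entries
    )
    problems = {}
    for ino in sorted(set(stored_counts) | set(seen)):
        stored, actual = stored_counts.get(ino, 0), seen.get(ino, 0)
        if stored != actual:
            problems[ino] = (stored, actual)
    return problems


if __name__ == "__main__":
    dirs = {
        "/a": [("report.txt", 12)],
        "/b": [("copy.txt", 12), ("notes.txt", 13)],
    }
    print(link_count_errors(dirs, {12: 1, 13: 1}))
```

Each of /a and /b looks perfectly sane when examined alone; the
mismatch on inode 12 only shows up once the references from both
directories have been added together.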

This is why doing incremental, on-line fsck'ing is *hard*.  You're not
going to find this while doing each directory one at a time, and if
the filesystem is changing out from under you, it gets worse.  And
it's not just the hard link count.  There is a similar issue with the
block allocation bitmap.  Detecting the case where two files are
simultaneously claiming the same block can't be done if you are doing
it incrementally, and if
the filesystem is changing out from under you, it's impossible, unless
you also have the filesystem telling you every single change while it
is happening, and you keep an insane amount of bookkeeping.
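
The block case has the same shape.  A toy sketch, assuming each inode's
block list has already been read into a dict; the function name and the
data layout are hypothetical and far simpler than what a real checker
has to deal with.

```python
from collections import defaultdict


def multiply_claimed_blocks(inode_blocks):
    """Map block number -> sorted inodes, for blocks with more than one owner."""
    owners = defaultdict(set)
    # Ownership is only known once every inode's block list has been seen.
    for ino, blocks in inode_blocks.items():
        for blk in blocks:
            owners[blk].add(ino)
    return {blk: sorted(inos) for blk, inos in owners.items() if len(inos) > 1}


if __name__ == "__main__":
    print(multiply_claimed_blocks({11: [100, 101], 12: [101, 102]}))
```

The owners table is the bookkeeping in question: it has to cover the
whole filesystem, and it goes stale the moment an inode is changed
behind the checker's back.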

One thing that you *might* be able to do is to mount a filesystem readonly,
and check it in the background while you allow users to access it
read-only.  There are a few caveats, however: (1) some filesystem
errors may cause the data to be corrupt, or in the worst case, could
cause the system to panic (that would arguably be a
filesystem/kernel bug, but we've not necessarily done as much testing
here as we should); (2) if there were any filesystem errors found,
you would need to completely unmount the filesystem to flush the inode
cache and remount it before it would be safe to remount the filesystem
read/write.  You can't just do a "mount -o remount" if the filesystem
was modified under the OS's nose.

> All this could be per-dir or using some form of on-the-fly file-block-zoning.
> 
> And there probably is a lot more to it, but it should conceptually be 
> possible, with more thoughts though...

Many things are possible, in the NASA sense of "with enough thrust,
anything will fly".  Whether or not it is *useful* and *worthwhile*
are of course different questions!  :-)

- Ted

Re: Make the 32 bit Frame Pointer backtracer fall back to traditional

2008-01-11 Thread Theodore Tso

On Fri, Jan 11, 2008 at 11:41:40AM -0800, Linus Torvalds wrote:
> 
> (I also wonder if we should limit the number of entries we print out. 
> Sometimes the stack frame ends up being so deep that we lose the 
> *important* stuff. I think it might be good idea to have some rule like 
> "the first 5 entries go to the screen, the rest will be KERN_DEBUG and 
> only go to the logs by default" - so a "dmesg" would show it all, but if 
> the machine is hung, the screen won't have been scrolled away from all 
> the other things by a long backtrace!)

What might be useful is the first 5 and last 5.  Sometimes if you have
a very deep call chain, knowing what the original system call or
interrupt was that got the kernel deep into la-la land can often be
useful.  Just a thought.
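
The selection rule itself is tiny.  An illustration in Python rather
than kernel C, with a hypothetical function name and marker text; it
only shows which entries would be kept, not how the kernel would print
them.

```python
def trim_backtrace(frames, keep=5):
    """Keep the first and last `keep` entries, marking what was dropped."""
    if len(frames) <= 2 * keep:
        return list(frames)  # short enough to show in full
    omitted = len(frames) - 2 * keep
    marker = f"... {omitted} entries omitted ..."
    return list(frames[:keep]) + [marker] + list(frames[-keep:])


if __name__ == "__main__":
    deep = [f"frame_{i}" for i in range(20)]
    for line in trim_backtrace(deep):
        print(line)
```

The innermost frames show where things went wrong, and the outermost
ones show which system call or interrupt started the chain.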

- Ted

Re: The ext3 way of journalling

2008-01-10 Thread Theodore Tso

On Wed, Jan 09, 2008 at 02:16:52PM +, Tuomo Valkonen wrote:
> On 2008-01-09, Mathieu SEGAUD <[EMAIL PROTECTED]> wrote:
> > fix your hardware clock then
> 
> It displays just the right time. On boot anyway. (Linux has had some
> serious problems keeping the time after the switch from 2.6.7 to 2.6.14,
> advancing even 15 minutes a day -- that ntpd doesn't seem to be able 
> to keep up with -- requiring running adjtimexconfig every now and
> then for new settings. But the cmos clock displays the right time.)

What do you mean by "on boot"?  Which boot message, precisely?  Is the
time printed before or after e2fsck is run, and by which program?

- Ted

Re: The ext3 way of journalling

2008-01-09 Thread Theodore Tso

On Wed, Jan 09, 2008 at 02:55:53AM -0500, [EMAIL PROTECTED] wrote:
> 
> Does this create a snapshot of the *disk* at that moment, or does it
> capture "disk plus still-to-be-written blocks in the cache"?
> (Phrased differently, does it Do The Right Thing regarding "blocks
> queued before lvcreate" and "blocks queued for write after
> lvcreate")?
> 
> If the snapshot doesn't capture the blocks queued but still
> unwritten by kjournald and similar, then you're still hitting the
> same old problems that you always get when you fsck an "active
> disk".

Actually, it does better than that.  For ext3 and xfs, it will take a
snapshot of the filesystem in a quiescent state; that is, it will force
the journal transaction to close, suspend all filesystem activity,
take a snapshot of the disk as if it had been unmounted, and then
allow filesystem activity to continue.

So if you look at an ext3 filesystem taken in this way, you will see
that the NEEDS_RECOVERY flag is not set, since the ext3 journal is
empty on the snapshot.  So snapshots are also a great way of doing
stable backups.  For the purposes of stable backups, you'll also want
to quiesce your application files, particularly databases.  

For example, in the case of mysql, send the server the sql commands
"flush tables with read lock; flush logs", take the snapshot, and
then after the snapshot send the server the sql command "unlock tables".
For more information, see: 

http://forums.mysql.com/read.php?26,185026,185302#msg-185302

If you do this, you will get a snapshot of your disk where *both* the
database and the filesystem are in a stable state, perfect for doing a
backup.
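
The ordering is the part that matters, so here it is as a dry-run plan
that executes nothing.  This is a sketch under assumptions: the function
name, the volume path, the snapshot name and the 1G snapshot size are
all made up, and only the lvcreate --snapshot, --size and --name options
are used.

```python
def snapshot_backup_plan(origin_lv, snapshot_name, snapshot_size="1G"):
    """Return the ordered steps for a quiesced snapshot, without running them."""
    return [
        ("sql", "FLUSH TABLES WITH READ LOCK"),
        ("sql", "FLUSH LOGS"),
        # The snapshot is taken while the read lock is still held.
        ("shell", ["lvcreate", "--snapshot", "--size", snapshot_size,
                   "--name", snapshot_name, origin_lv]),
        ("sql", "UNLOCK TABLES"),
    ]


if __name__ == "__main__":
    # /dev/vg0/mysql and mysql-snap are placeholder names.
    for kind, step in snapshot_backup_plan("/dev/vg0/mysql", "mysql-snap"):
        print(kind, step)
```

One thing to watch for when turning this into a real script: the read
lock is dropped when the client connection closes, so all of the SQL
statements have to go through one session that stays open while
lvcreate runs.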

- Ted

Re: The ext3 way of journalling

2008-01-09 Thread Theodore Tso

On Wed, Jan 09, 2008 at 11:28:21AM +0100, Matthias Schniedermeyer wrote:
> On 09.01.2008 11:21, Matthias Schniedermeyer wrote:
> > On 09.01.2008 09:56, Tuomo Valkonen wrote:
> > > On 2008-01-09 00:06 +0100, Matthias Schniedermeyer wrote:
> > > > That's what LABEL and UUID support in mount is for.
> > > 
> > > That's udev shit. I don't want it.
> > 
> > No.
> 
> To be more verbose.
> 
> The 'LABEL=' is native mount turf and is much older than udev.

Native fsck supports it too; "LABEL=" and "UUID=" support has been in
e2fsprogs since July 3rd, 1999.  (Mount had it a little before then,
but you needed both mount and fsck support before the feature could be
used.)

And it has *nothing* to do with udev.

- Ted

Re: The ext3 way of journalling

2008-01-09 Thread Theodore Tso

On Wed, Jan 09, 2008 at 10:54:11AM +0100, Martin Schwidefsky wrote:
> On Jan 8, 2008 7:15 PM, Theodore Tso <[EMAIL PROTECTED]> wrote:
> > That will fix this issue.  The problem you are facing is that you
> > have your hardware clock set to ticking localtime, instead of GMT.
> > Windows ticks localtime, which is a mistake carried over from the
> > 1970's and MS-DOS.  Ticking localtime has all sorts of problems, among
> > which is if you reboot around the transition between Summer Time (or
> > Daylight Savings Time, depending on your country) and normal time, the
> > OS has no idea whether the DST adjustment has been applied or not.
> 
> Actually you can force Windows to accept a hardware clock in UTC:
> HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\TimeZoneInformation\RealTimeIsUniversal

Oh, so cool!!!  Do you know off hand what version of Windows started
honoring that registry setting? 

And what do you set that registry value to?  Just a boolean "true"?

Now, how to convince Ubuntu to put this in their FAQ, so that their,
ahhh, less than clueful dual-booting Windows users who happen to
live in Europe stop submitting bugs on this issue...

- Ted

Re: The ext3 way of journalling

2008-01-09 Thread Theodore Tso

On Wed, Jan 09, 2008 at 10:54:11AM +0100, Martin Schwidefsky wrote:
 On Jan 8, 2008 7:15 PM, Theodore Tso [EMAIL PROTECTED] wrote:
  That will fix the this issue.  The problem you are facing is that you
  have your hardware clock set to ticking localtime, instead of GMT.
  Windows ticks localtime, which is a mistake carried over from the
  1970's and MS-DOS.  Ticking localtime has all sorts of problems, among
  which is if you reboot around the transition between Summer Time (or
  Daylight Savings Time, depending on your contry) and normal time, the
  OS has no idea whether the DST adjustment has been applied or not.
 
 Actually you can force Windows to accept a hardware clock in UTC:
 HKEY_LOCAL_MACHINE/SYSTEMCurrentControlSetControl/TimeZoneInformation/RealTimeIsUniversal

Oh, so cool!!!  Do you know off hand what version of Windows started
honoring that registry setting? 

And what do you set that registry value to?  Just a boolean true?

Now, how to convince Ubuntu to put this in their FAQ so I stop having
their ahhh, less than clueful dual-booting Windows users who happen to
live in Europe stop submitting bugs on this issue

- Ted
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: The ext3 way of journalling

2008-01-09 Thread Theodore Tso

On Wed, Jan 09, 2008 at 11:28:21AM +0100, Matthias Schniedermeyer wrote:
 On 09.01.2008 11:21, Matthias Schniedermeyer wrote:
  On 09.01.2008 09:56, Tuomo Valkonen wrote:
   On 2008-01-09 00:06 +0100, Matthias Schniedermeyer wrote:
That what LABEL und UUID-Support in mount is for.
   
   That's udev shit. I don't want it.
  
  No.
 
 To be more verbose.
 
 The 'LABEL=' is native mount turf and is much older than udev.

Native fsck supports it to; LABEL= and UUID= support has been in
e2fsprogs since July 3rd, 1999.  (Mount had it a little before then,
but you needed both mount and fsck support before the feature could be
used.)

And it has *nothing* to do with udev

- Ted

Re: The ext3 way of journalling

2008-01-09 Thread Theodore Tso

On Wed, Jan 09, 2008 at 02:55:53AM -0500, [EMAIL PROTECTED] wrote:
>
> Does this create a snapshot of the *disk* at that moment, or does it
> capture disk plus still-to-be-written blocks in the cache?
> (Phrased differently, does it Do The Right Thing regarding blocks
> queued before lvcreate and blocks queued for write after
> lvcreate)?
>
> If the snapshot doesn't capture the blocks queued but still
> unwritten by kjournald and similar, then you're still hitting the
> same old problems that you always get when you fsck an active
> disk.

Actually, it does better than that.  For ext3 and xfs, it will take a
snapshot of the filesystem in a quiescent state; that is, it will force
the journal transaction to close, suspend all filesystem activity,
take a snapshot of the disk as if it had been unmounted, and then
allow filesystem activity to continue.

So if you look at an ext3 filesystem taken in this way, you will see
that the NEEDS_RECOVERY flag is not set, since the ext3 journal is
empty on the snapshot.  So snapshots are also a great way of doing
stable backups.  For the purposes of stable backups, you'll also want
to quiesce your application files, particularly databases.  

For example, in the case of mysql, send the server the SQL commands
"flush tables with read lock; flush logs", take the snapshot, and
then after the snapshot send the server the SQL command "unlock tables".
For more information, see:

http://forums.mysql.com/read.php?26,185026,185302#msg-185302

If you do this, you will get a snapshot of your disk where *both* the
database and the filesystem are in a stable state, perfect for doing a
backup.
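
As a rough sketch of the ordering described above (not code from this
thread): the Python below takes the read lock, flushes the logs, runs the
snapshot, and always unlocks afterwards.  The names mysql_quiesced,
snapshot_command, stable_snapshot and the run_sql hook are hypothetical,
as are the default snapshot name and size.  run_sql is assumed to send
each statement over one already-open session, since the read lock goes
away when the session that took it closes.

```python
import subprocess
from contextlib import contextmanager


@contextmanager
def mysql_quiesced(run_sql):
    """Hold a global read lock for the duration of the with-block.

    run_sql is a hypothetical callable that sends one statement over a
    single, already-open MySQL session.
    """
    run_sql("FLUSH TABLES WITH READ LOCK")
    run_sql("FLUSH LOGS")
    try:
        yield
    finally:
        # Always release the lock, even if the snapshot step failed.
        run_sql("UNLOCK TABLES")


def snapshot_command(origin, name, size):
    """Build the lvcreate call that snapshots the logical volume `origin`."""
    return ["lvcreate", "--snapshot", "--size", size, "--name", name, origin]


def stable_snapshot(run_sql, origin, name="backup-snap", size="1G",
                    run=subprocess.check_call):
    """Quiesce MySQL, take the LVM snapshot, then unlock the tables."""
    with mysql_quiesced(run_sql):
        run(snapshot_command(origin, name, size))
```

The `run` parameter exists only so the ordering can be exercised without
root or a real volume group; in a real script the default would call
lvcreate directly.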

- Ted

Re: The ext3 way of journalling

2008-01-08 Thread Theodore Tso

On Tue, Jan 08, 2008 at 09:51:53PM +0100, Andi Kleen wrote:
> Theodore Tso <[EMAIL PROTECTED]> writes:
> >
> > Now, there are good reasons for doing periodic checks every N mounts
> > and after M months.  And it has to do with PC class hardware.  (Ted's
> > aphorism: "PC class hardware is cr*p"). 
> 
> If these reasons are good ones (some skepticism here) then the correct
> way to really handle this would be to do regular background scrubbing
> during runtime; ideally with metadata checksums so that you can actually
> detect all corruption.

That's why we're adding various checksums to ext4...

And yes, I agree that background scrubbing is a good idea.  Larry
McVoy a while back told me the results of using a fast CRC to get
checksums on all of his archived data files, and then periodically
recalculating the CRC's and checking them against the stored checksum
values.  The surprising thing was that once every so often (and the
fact that it happens at all is disturbing), he would find that a file
had a broken checksum even though it had apparently never been
intentionally modified (it was in an archived file set, the modtime of
the file hadn't changed, etc.)
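
A minimal sketch of that kind of scrubbing, assuming CRC-32 from Python's
zlib module as the "fast CRC" (the message does not say which checksum
was used).  The function names and the manifest layout are made up for
illustration.  A file is reported only when its modification time is
unchanged but its checksum no longer matches, which is the disturbing
case described above.

```python
import os
import zlib


def file_crc32(path, chunk_size=1 << 20):
    """Return the CRC-32 of a file, reading it in chunks."""
    crc = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            crc = zlib.crc32(chunk, crc)
    return crc & 0xFFFFFFFF


def record(paths):
    """Build a manifest mapping each path to its checksum and mtime."""
    return {p: {"crc": file_crc32(p), "mtime": os.stat(p).st_mtime_ns}
            for p in paths}


def scrub(manifest):
    """Return paths whose contents changed although the mtime did not."""
    suspicious = []
    for path, old in manifest.items():
        if os.stat(path).st_mtime_ns != old["mtime"]:
            continue  # visibly modified, so not silent corruption
        if file_crc32(path) != old["crc"]:
            suspicious.append(path)
    return suspicious
```

A periodic job would call record() once over the archive, store the
result somewhere safe, and then run scrub() against it at intervals.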

And the fact that disk manufacturers on their high end enterprise
disks design their block guard system to detect cases where a block
gets written to a different part of the disk than where the OS
requested it to be written, and that I've been told of at least one
commercial large-scale enterprise database which puts a logical block
number in the on-disk format of their tablespace files to detect this
problem --- should give you some pause about how much faith at least
some people who are paid a lot of money to worry about absolute data
integrity have in modern-day hard drives.

> But since fsck is so slow and disks are so big this whole thing
> is a ticking time bomb now. e.g. it is not uncommon to require tens
> of minutes or even hours of fsck time and some server that reboots
> only every few months will eat that when it happens to reboot.
> This means you get a quite long downtime. 

What I actually recommend (and what I do myself) is to use
devicemapper to create a snapshot, and then run "e2fsck -p" on the
snapshot.  If e2fsck finishes on the snapshot without *any* errors
(i.e., exit code of 0), then you can run "tune2fs -C 0 -T now
/dev/XXX", discard the snapshot, and exit.  If e2fsck returns any
non-zero exit code, indicating that it found problems, the output of
e2fsck should be e-mailed to the system administrator so they can
schedule downtime and fix the filesystem corruption.

This avoids the long downtime at reboot time.  You can do the above in
a cron script that runs at some convenient time during low usage
(e.g., 3am localtime on a Saturday morning, or whatever).
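
The procedure could be sketched roughly as follows.  This is an
illustration under assumptions, not the script I actually run: the
snapshot is taken with lvcreate, the snapshot name and size are
placeholders, and run/notify are hypothetical hooks (notify stands in
for mailing the administrator).  The e2fsck and tune2fs flags are the
ones quoted above.

```python
import os
import subprocess


def run_command(cmd):
    """Run cmd and return (exit status, combined stdout/stderr text)."""
    proc = subprocess.run(cmd, stdout=subprocess.PIPE,
                          stderr=subprocess.STDOUT, text=True)
    return proc.returncode, proc.stdout


def check_via_snapshot(origin, snap_name="fsck-snap", size="1G",
                       run=run_command, notify=print):
    """Check a snapshot of `origin`; return True if e2fsck exited with 0."""
    snap_dev = os.path.join(os.path.dirname(origin), snap_name)
    status, output = run(["lvcreate", "--snapshot", "--size", size,
                          "--name", snap_name, origin])
    if status != 0:
        notify("snapshot of %s failed:\n%s" % (origin, output))
        return False
    try:
        # Flags as given in the message.  Assumption: "-f" may be needed
        # as well if e2fsck would otherwise skip a filesystem marked clean.
        status, output = run(["e2fsck", "-p", snap_dev])
        if status == 0:
            # Clean: reset the mount count and last-check time on the origin.
            run(["tune2fs", "-C", "0", "-T", "now", origin])
            return True
        notify("e2fsck on %s exited with %d:\n%s" % (snap_dev, status, output))
        return False
    finally:
        # Discard the snapshot whether or not the check passed.
        run(["lvremove", "-f", snap_dev])
```

From cron this would be called once per filesystem, for example
check_via_snapshot("/dev/vg0/root"), where the volume path is again only
a placeholder.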

- Ted

Re: [PATCH] Deprecate checkpatch.pl --file

2008-01-08 Thread Theodore Tso

On Tue, Jan 08, 2008 at 12:19:44PM -0800, Daniel Walker wrote:
> > But it discourages the creation of pure clean-up patches because it
> > may have a disturbing effect on several other people's work.
>
> pure clean ups are _good_ patches, are they not?
> 

Not necessarily.  Whether or not it is requires common sense, and very
often we get enthusiastic new-comers (some of them with very weak C
programming skills :-) who might try to use checkpatch.pl.  So we
can't assume that they will know when a pure clean-up patch is a good
thing, and when it's a waste of everyone's time, including theirs.

That's why I think the warning is a good thing.  It makes it more
likely that this gets communicated to the enthusiastic, well-meaning,
newcomer.  Someone who is more experienced and who knows how to
determine whether some driver is ancient and not being worked on, and
hence a pure clean-up patch won't be screwing up other developers,
will know how to suppress the warning.  (OTOH, how important is it in
the grand scheme of things to create or apply a pure clean-up patch on
a driver that few people if any are looking at?)

- Ted

Re: [PATCH] Deprecate checkpatch.pl --file

2008-01-08 Thread Theodore Tso

On Tue, Jan 08, 2008 at 10:01:19AM -0800, Daniel Walker wrote:
> > It is a simple pain/benefit issue.
> > Fixing the 25 errors and 13 warnings in kernel/profile.c may look
> > like an easy task but then we put additional burden on the 10 people
> > that have patches pending for this file.
> 
> This goes for all patches on kernel/profile.c tho .. If I make a big mod
> to kernel/profile.c, that will screw up anyone else who has patches for
> that file..

Obviously, but why make it worse?  And what's more important?  A
"clean tree" (especially when some of the things that checkpatch.pl
flags are arbitrary and Not All That Important), or not wasting
developers' time by invalidating a potentially huge number of patches
with cleanup patches?

- Ted

Re: The ext3 way of journalling

2008-01-08 Thread Theodore Tso

On Tue, Jan 08, 2008 at 05:01:30PM +, Tuomo Valkonen wrote:
> On 2008-01-08, Andi Kleen <[EMAIL PROTECTED]> wrote:
> > tune2fs -i0 -c0 device  for each file system
> >
> > Yes that should be default, unfortunately it is not. It's one 
> > of the first things I do on new machines.
> 
> I have ages ago increased those counts, but I don't want to
> completely disable them. The problem is that the superblock
> is corrupted to indicate absurd "31352 days since last check".
> Who knows, maybe it would even corrupt those settings.

Newer e2fsprogs display a better message, and you can set an
/etc/e2fsck.conf setting:

[options]
buggy_init_scripts = 1

That will fix this issue.  The problem you are facing is that you
have your hardware clock set to ticking localtime, instead of GMT.
Windows ticks localtime, which is a mistake carried over from the
1970's and MS-DOS.  Ticking localtime has all sorts of problems, one
of which is that if you reboot around the transition between Summer
Time (or Daylight Savings Time, depending on your country) and normal
time, the OS has no idea whether the DST adjustment has been applied
or not.

It gets even worse if you have multiple operating systems, because
then one OS may have made the adjustment, and the other one may not
have made the adjustment.  It's for that reason that if you reboot around
the right time of year, Windows throws up a big dialog box asking you
what the correct time should be.  Genius!

The problem on the Linux side is that some distributions, and Ubuntu
is the worst offender, but probably not the only one, do not correctly
set the system clock before they run fsck.  And if you live east of
GMT, such that your localtime offset is positive instead of negative,
then time can appear to go backwards and e2fsck can't trust the last
superblock check time.  Old versions of e2fsprogs display a funny
large time interval due to an integer overflow bug; that's since been
fixed.  (This bug doesn't affect people in the US, because of our
time zone offset, but it tends to affect people in Europe who are
dual-booting with Windows and hence have their hardware clock tick
localtime.)
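
To illustrate how a clock that appears to run backwards can turn into an
absurdly large interval, here is a small sketch.  It assumes the
difference is computed in 32-bit unsigned arithmetic; that is only an
assumption for the example, it is not the actual e2fsprogs code, and it
is not meant to reproduce the exact "31352 days" figure quoted above.
The function names and timestamps are made up.

```python
SECONDS_PER_DAY = 86400


def days_since_check_u32(now, last_check):
    """Interval in days, computed with 32-bit unsigned wraparound."""
    return ((now - last_check) & 0xFFFFFFFF) // SECONDS_PER_DAY


def days_since_check_safe(now, last_check):
    """Return None when the clock appears to have gone backwards."""
    if now < last_check:
        return None
    return (now - last_check) // SECONDS_PER_DAY


last_check = 1199800000           # some moment on 2008-01-08 (UTC)
now = last_check - 2 * 3600       # clock appears two hours earlier

print(days_since_check_u32(now, last_check))   # 49710
print(days_since_check_safe(now, last_check))  # None
```

A two-hour step backwards wraps around to roughly 2^32 seconds, which is
why the reported interval is measured in tens of thousands of days.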

Now, there are good reasons for doing periodic checks every N mounts
and after M months.  And it has to do with PC class hardware.  (Ted's
aphorism: "PC class hardware is cr*p").  Windows users don't notice it
much because they generally blame the occasional blue screen of death
or corrupted file on an OS bug.  But very often, it is a hardware
issue, particularly on the cheaper PC class machines with no ECC
memory, and the cheapest unshielded hard drive cables from Taiwan that
the manufacturers can find.  Hence, the default is to do periodic
checks, since if you don't, a random corruption can cause massive
filesystem corruption leading to massive data loss.

But, if you're confident in your hardware, you can turn that off.
tune2fs -c 0 will disable the number of mounts check, and tune2fs -i 0
will turn off the periodic time-based check.  And given that you have a
Linux distribution with buggy init scripts, that is one way of working
around the problem.

You could also simply change your CMOS/hardware clock to use GMT time,
and not localtime.  But that doesn't work well when you need to
dual-boot with Windows, since Windows doesn't support GMT time for the
hardware clock.

Another approach would involve using the /etc/e2fsck.conf settings
described above, but that may require upgrading the version
of e2fsprogs that you have.  This will be the preferred mechanism
going forward, but perhaps not for the version of e2fsprogs you have
installed on your system.

Finally, I'm sorry this has obviously caused you so much stress.  If
you're happier using some other OS, please use whatever OS you find
makes you happiest.  I find that other deficiencies in Windows caused
my blood pressure to boil when I was forced (for a previous job) to
work on making programs run on Windows.  I consider the fact that I
can spend full-time working on Linux to be a blessing.  But if you
don't feel that way, my condolences, and please do what you need to do
so you can stay in your happy place.

Best regards,

- Ted
