Re: [PATCH] [8/18] BKL-removal: Remove BKL from remote_llseek

2008-01-28 Thread Andi Kleen

> I completely agree.  If one thread writes A and another writes B then the
> kernel should record either A or B, not ((A & 0xffffffff00000000) | (B &
> 0xffffffff))

The problem is pretty nasty, unfortunately. To solve it properly I think
the file_operations->read/write prototypes would need to be changed,
because otherwise it is not possible to do atomic relative updates
of f_pos. Right now the actual update is buried deep in the low-level
read/write implementations. But changing that would have a huge impact
all over the tree :/

Or maybe define a new read/write64 and keep the default as 32-bit only;
I suppose most users don't really need 64-bit. That would still be a nasty
API change.

-Andi
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[2.6.24 REGRESSION] BUG: Soft lockup - with VFS

2008-01-28 Thread Oliver Pinter (Pintér Olivér)
hi all!

With 2.6.24 I get soft lockups with a USB phone: when I plug in the
mobile, the VFS layer crashes. I can send the .config this afternoon, and
I will bisect the kernel when I have time.

pictures from crash:
http://students.zipernowsky.hu/~oliverp/kernel/regression_2624/
-- 
Thanks,
Oliver


Re: [PATCH] [8/18] BKL-removal: Remove BKL from remote_llseek

2008-01-28 Thread Bodo Eggert
Trond Myklebust [EMAIL PROTECTED] wrote:
 On Mon, 2008-01-28 at 05:38 +0100, Andi Kleen wrote:
 On Monday 28 January 2008 05:13:09 Trond Myklebust wrote:
  On Mon, 2008-01-28 at 03:58 +0100, Andi Kleen wrote:

   The problem is that it's not a race in who gets to do its thing first,
   but a parallel reader can actually see a corrupted value from the two
   independent words on 32bit (e.g. during a 4GB). And this could actually
   completely corrupt f_pos when it happens with two racing relative seeks
   or read/write()s
   
   I would consider that a bug.
  
  I disagree. The corruption occurs because this isn't a situation that is
  allowed by either POSIX or SUSv2/v3. Exactly what spec are you referring
  to here?
 
 No specific spec, just general quality of implementation. We normally don't
 have non thread safe system calls even if it was in theory allowed by some
 specification.
 
 We've had the existing implementation for quite some time. The arguments
 against changing it have been the same all along: if your application
 wants to share files between threads, the portability argument implies
 that you should either use pread/pwrite or use a mutex or some other
 form of synchronisation primitive in order to ensure that
 lseek()/read()/write() do not overlap.

Does anything in the kernel depend on f_pos being valid?
E.g. is it possible to read beyond the EOF using this race, or to have files
larger than the ulimit?

If not, update the manpage and be done. ¢¢
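
The pread/pwrite route mentioned in the quoted text can be sketched in
user space as follows. This is only an illustration, not code from the
thread: read_at() and demo_read_at() are hypothetical helper names, and the
temporary file path is an assumption. The point it shows is that pread()
takes the offset as an argument, so threads sharing a descriptor never
touch the shared f_pos.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Read up to len bytes at an explicit offset; the shared file position is not used. */
static ssize_t read_at(int fd, void *buf, size_t len, off_t off)
{
	size_t done = 0;

	while (done < len) {
		ssize_t n = pread(fd, (char *)buf + done, len - done,
				  off + (off_t)done);

		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted: retry */
			return -1;
		}
		if (n == 0)
			break;			/* end of file */
		done += (size_t)n;
	}
	return (ssize_t)done;
}

/* Self-check on a temporary file (hypothetical path template). */
static void demo_read_at(void)
{
	char path[] = "/tmp/read_at_demo_XXXXXX";
	char buf[4];
	int fd = mkstemp(path);

	assert(fd >= 0);
	assert(write(fd, "abcdefgh", 8) == 8);

	assert(read_at(fd, buf, sizeof(buf), 2) == 4);
	assert(memcmp(buf, "cdef", 4) == 0);
	/* pread() left the shared position where write() put it. */
	assert(lseek(fd, 0, SEEK_CUR) == 8);

	close(fd);
	unlink(path);
}
```

Each caller keeps its own offset variable, so no lock around the file
position is needed for this access pattern.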



Re: [patch 21/26] mount options: partially fix nfs

2008-01-28 Thread Miklos Szeredi
> > All mount options should be shown, which are needed to reconstruct a
> > previous mount.
>
> Ah, OK.
>
> I'm happy to implement logic to display the all missing options.  I
> should have updated nfs_show_mount_options() when I wrote the NFS
> mount option parser.
>
> Let me know your preference.

You are more familiar with NFS, so I think it would be better if you
updated nfs_show_mount_options().

Could you also queue my patch (updated) or incorporate it into a
combined fix?

Thanks,
Miklos


Subject: mount options: partially fix nfs

From: Miklos Szeredi [EMAIL PROTECTED]

Add posix, bsize=, namelen= options to /proc/mounts for nfs
filesystems.

Document several other options that are still missing.

Changes:

 - display namelen= unconditionally
 - addr= isn't missing after all

Signed-off-by: Miklos Szeredi [EMAIL PROTECTED]
CC: Trond Myklebust [EMAIL PROTECTED]
---

Index: linux/fs/nfs/super.c
===
--- linux.orig/fs/nfs/super.c   2008-01-25 15:44:56.0 +0100
+++ linux/fs/nfs/super.c	2008-01-25 15:57:32.0 +0100
@@ -449,6 +449,7 @@ static void nfs_show_mount_options(struc
 	} nfs_info[] = {
 		{ NFS_MOUNT_SOFT, ",soft", ",hard" },
 		{ NFS_MOUNT_INTR, ",intr", ",nointr" },
+		{ NFS_MOUNT_POSIX, ",posix", "" },
 		{ NFS_MOUNT_NOCTO, ",nocto", "" },
 		{ NFS_MOUNT_NOAC, ",noac", "" },
 		{ NFS_MOUNT_NONLM, ",nolock", "" },
@@ -463,6 +464,9 @@ static void nfs_show_mount_options(struc
 	seq_printf(m, ",vers=%d", clp->rpc_ops->version);
 	seq_printf(m, ",rsize=%d", nfss->rsize);
 	seq_printf(m, ",wsize=%d", nfss->wsize);
+	seq_printf(m, ",namelen=%d", nfss->namelen);
+	if (nfss->bsize != 0)
+		seq_printf(m, ",bsize=%d", nfss->bsize);
 	if (nfss->acregmin != 3*HZ || showdefaults)
 		seq_printf(m, ",acregmin=%d", nfss->acregmin/HZ);
 	if (nfss->acregmax != 60*HZ || showdefaults)
@@ -482,6 +486,17 @@ static void nfs_show_mount_options(struc
 	seq_printf(m, ",timeo=%lu", 10U * nfss->client->cl_timeout->to_initval / HZ);
 	seq_printf(m, ",retrans=%u", nfss->client->cl_timeout->to_retries);
 	seq_printf(m, ",sec=%s", nfs_pseudoflavour_to_name(nfss->client->cl_auth->au_flavor));
+
+	/*
+	 * Missing options:
+	 *	port=
+	 *	mountport=
+	 *	mountvers=
+	 *	mountproto=
+	 *	clientaddr=
+	 *	mounthost=
+	 *	mountaddr=
+	 */
 }
 
 /*


Re: [patch 00/26] mount options: fix filesystem's ->show_options

2008-01-28 Thread Miklos Szeredi
> On Thu, 24 Jan 2008 20:33:41 +0100 Miklos Szeredi [EMAIL PROTECTED] wrote:
> > Andrew,
> >
> > Would you please consider these patches for -mm?
>
> Sure, but I'm too lazy to pick through them and work out which ones need
> updating, which ones got acked and which ones someone else merged, all on a
> very bumpy plane flight ;)
>
> Please resend when the dust has settled?

Yes, I should have thought, it won't quite work in a single iteration :)

I'll resend them in a moment.

Thanks,
Miklos


Re: [patch 24/26] mount options: fix tmpfs

2008-01-28 Thread Miklos Szeredi
 
> Thanks Miklos, that's a welcome enhancement, nicely done.  I've only
> noticed one thing wrong (MPOL_PREFERRED shown as default); but thought
> shmem_config didn't add much value - I'd rather avoid those syntactic
> changes to unchanged code; and several tmpfs defaults being relative
> (e.g. to totalram_pages, or to mounter's fsuid), I ended up preferring
> to do real tests in shmem_show_options.

I completely agree, this is much better than my version.

> Thus, for example, if memory is hotplugged in or out later, what started
> out as an unspecified size option will then get shown as explicit size.
> (I did think for a while that I wanted to show explicit size in all
> cases; but it looked pretty silly on udev.)  I think that's the correct
> behaviour, that otherwise would be misleading; but I may be looking at
> this the wrong way round, what's your view?

I agree, this is the correct way.

I'll add functions for calculating the default max values, so the
calculations won't accidentally become different for the
initialization and the option showing.

> If you agree with the version below, please take it into your collection
> and insert your Signed-off-by.  I should admit, I've not yet tested how
> the NUMA policies look: you'll hear from me again tomorrow morning if
> those turn out to wrong.

OK, I'll send this to Andrew.  Maybe I'll wait until tomorrow to hear
if it's working on NUMA.

Thanks,
Miklos


Re: [PATCH] [8/18] BKL-removal: Remove BKL from remote_llseek

2008-01-28 Thread Andi Kleen
On Monday 28 January 2008 13:56:05 Alan Cox wrote:
> > > No specific spec, just general quality of implementation.
> >
> > I completely agree.  If one thread writes A and another writes B then the
> > kernel should record either A or B, not ((A & 0xffffffff00000000) | (B &
> > 0xffffffff))
>
> Agree entirely: the spec doesn't allow for random scribbling in the wrong
> place. It doesn't cover which goes first or who wins the race but
> provides pwrite/pread for that situation. Writing somewhere unrelated is
> definitely not to spec

Actually it probably would be: I guess the behaviour is undefined, and in
undefined country such things can happen.

Also, to be fair, I think it's only a problem in the 4GB wrapping case,
which is presumably rare (otherwise we would have heard about it).

Worse, really fixing it would be a major change to the VFS
because of the way ->read/write are defined :/

-Andi


Re: [RFC] ext3 freeze feature

2008-01-28 Thread Takashi Sato

Hi,


What you *could* do is to start putting processes to sleep if they
attempt to write to the frozen filesystem, and then detect the
deadlock case where the process holding the file descriptor used to
freeze the filesystem gets frozen because it attempted to write to the
filesystem --- at which point it gets some kind of signal (which
defaults to killing the process), and the filesystem is unfrozen and
as part of the unfreeze you wake up all of the processes that were put
to sleep for touching the frozen filesystem.


I don't think close() usually writes to the journal, so I don't see how it
would cause the deadlock. Is there a special case in which close() writes
to the journal when the process gets a signal?

Cheers, Takashi 




Re: [PATCH] [8/18] BKL-removal: Remove BKL from remote_llseek

2008-01-28 Thread Alan Cox
> > No specific spec, just general quality of implementation.
>
> I completely agree.  If one thread writes A and another writes B then the
> kernel should record either A or B, not ((A & 0xffffffff00000000) | (B &
> 0xffffffff))

Agree entirely: the spec doesn't allow for random scribbling in the wrong
place. It doesn't cover which goes first or who wins the race but
provides pwrite/pread for that situation. Writing somewhere unrelated is
definitely not to spec and not good.
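
The mixed value described above can be reproduced deterministically in user
space. The sketch below is an illustration only, not kernel code: it models
a 64-bit position as two 32-bit words (as a 32-bit CPU would store it) and
replays one unlucky interleaving of two unsynchronised stores by hand.
struct split_pos, torn_store() and demo_torn_store() are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* A 64-bit offset kept as two independent 32-bit words. */
struct split_pos {
	uint32_t lo;
	uint32_t hi;
};

static void store_lo(struct split_pos *p, uint64_t v) { p->lo = (uint32_t)v; }
static void store_hi(struct split_pos *p, uint64_t v) { p->hi = (uint32_t)(v >> 32); }

static uint64_t load_pos(const struct split_pos *p)
{
	return ((uint64_t)p->hi << 32) | p->lo;
}

/* Replay one unlucky interleaving of two unsynchronised 64-bit stores. */
static uint64_t torn_store(uint64_t a, uint64_t b)
{
	struct split_pos pos = { 0, 0 };

	store_lo(&pos, a);	/* writer A: low word */
	store_lo(&pos, b);	/* writer B: low word, overwrites A's */
	store_hi(&pos, b);	/* writer B: high word */
	store_hi(&pos, a);	/* writer A: high word, overwrites B's */

	return load_pos(&pos);
}

/* Self-check: the result is neither A nor B but a mix of the two. */
static void demo_torn_store(void)
{
	uint64_t a = 0x100000000ULL;	/* exactly 4GB */
	uint64_t b = 0xfffff000ULL;	/* just below 4GB */
	uint64_t got = torn_store(a, b);

	assert(got == ((a & 0xffffffff00000000ULL) | (b & 0xffffffffULL)));
	assert(got != a && got != b);
}
```

With A at exactly 4GB and B just below it, the stored value ends up close
to 8GB, which neither writer asked for.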


Re: [PATCH] [8/18] BKL-removal: Remove BKL from remote_llseek

2008-01-28 Thread Alan Cox
On Mon, 28 Jan 2008 15:10:34 +0100
Andi Kleen [EMAIL PROTECTED] wrote:

> On Monday 28 January 2008 14:38:57 Alan Cox wrote:
> > > Also worse really fixing it would be a major change to the VFS
> > > because of the way ->read/write are defined :/
> >
> > I don't see a problem there. ->read and ->write update the passed pointer
> > which is not the real f_pos anyway. Just the copies need fixing.
>
> They are effectually doing a decoupled read/modify/write cycle. e.g.:
>
> A                       B
>
> read fpos
>
>                         read fpos
>
> fpos += A               fpos += B
> write fpos
>
>
>                         write fpos
>
> So you get overlapping reads. Probably not good.

No unix system I'm aware of cares about the read/write positioning during
parallel simultaneous reads or writes, with the exception of O_APPEND
which is strictly defined. The problem case is getting fpos != either
valid value.



Re: [PATCH] [8/18] BKL-removal: Remove BKL from remote_llseek

2008-01-28 Thread Diego Calleja
On Mon, 28 Jan 2008 15:10:34 +0100, Andi Kleen [EMAIL PROTECTED] wrote:

> So you get overlapping reads. Probably not good.

This was discussed in the past, I think:

http://lkml.org/lkml/2006/4/13/124
http://lkml.org/lkml/2006/4/13/130


Re: [RFC] ext3: per-process soft-syncing data=ordered mode

2008-01-28 Thread Jan Kara
On Sat 26-01-08 08:27:59, Al Boldi wrote:
> Jan Kara wrote:
> > > Greetings!
> > >
> > > data=ordered mode has proven reliable over the years, and it does this
> > > by ordering filedata flushes before metadata flushes.  But this
> > > sometimes causes contention in the order of a 10x slowdown for certain
> > > apps, either due to the misuse of fsync or due to inherent behaviour
> > > like db's, as well as inherent starvation issues exposed by the
> > > data=ordered mode.
> > >
> > > data=writeback mode alleviates data=order mode slowdowns, but only works
> > > per-mount and is too dangerous to run as a default mode.
> > >
> > > This RFC proposes to introduce a tunable which allows to disable fsync
> > > and changes ordered into writeback writeout on a per-process basis like
> > > this:
> > >
> > > echo 1 > /proc/`pidof process`/softsync
> >
> >   I guess disabling fsync() was already commented on enough. Regarding
> > switching to writeback mode on per-process basis - not easily possible
> > because sometimes data is not written out by the process which stored
> > them (think of mmaped file).
>
> Do you mean there is a locking problem?
  No, but if you write to an mmaped file, then we can find out only later
we have dirty data in pages and we call writepage() on behalf of e.g.
pdflush().

> >   And in case of DB, they use direct-io
> > anyway most of the time so they don't care about journaling mode anyway.
>
> Testing with sqlite3 and mysql4 shows that performance drastically improves
> with writeback writeout.
  And do you have the databases configured to use direct IO or not?

Honza
-- 
Jan Kara [EMAIL PROTECTED]
SUSE Labs, CR


Re: [RFC] Parallelize IO for e2fsck

2008-01-28 Thread Theodore Tso
On Mon, Jan 28, 2008 at 07:30:05PM +, Pavel Machek wrote:
 
> As user pages are always in highmem, this should be easy to decide:
> only send SIGDANGER when highmem is full. (Yes, there are
> inodes/dentries/file descriptors in lowmem, but I doubt apps will
> respond to SIGDANGER by closing files).

Good point; for a system with at least (say) 2GB of memory, that
definitely makes sense.  For a system with less than 768 megs of
memory (how quaint, but it wasn't that long ago this was a lot of
memory :-), there wouldn't *be* any memory in highmem at all

- Ted


Re: [PATCH] [8/18] BKL-removal: Remove BKL from remote_llseek

2008-01-28 Thread Steve French
On Jan 28, 2008 2:17 AM, Andi Kleen [EMAIL PROTECTED] wrote:
> > I completely agree.  If one thread writes A and another writes B then the
> > kernel should record either A or B, not ((A & 0xffffffff00000000) | (B &
> > 0xffffffff))
>
> The problem is pretty nasty unfortunately. To solve it properly I think
> the file_operations->read/write prototypes would need to be changed
> because otherwise it is not possible to do atomic relative updates
> of f_pos. Right now the actual update is burrowed deeply in the low level
> read/write implementation. But that would be a huge impact all over
> the tree :/

If there were a wrapper around reads and writes of f_pos, as there is
for i_size, it would touch a lot of code, but not as much as I had
originally thought.  The most important uses are in the VFS itself, where
there are only 59 uses of the field (not all need to be changed).  ext3
has fewer (25), and cifs only 12.
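
A wrapper of the kind suggested above could look roughly like the following
user-space sketch. It is an assumption-laden illustration, not the kernel's
code: it borrows the sequence-counter idea that the i_size accessors use on
32-bit SMP, expressed with C11 atomics, and the names struct pos64,
pos_init(), pos_write(), pos_read() and demo_pos64() are hypothetical.
Writers are assumed to be serialised against each other by some outer
lock. The sketch only prevents readers from seeing a torn value; it does
not make a relative update (read, add, write back) atomic.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

struct pos64 {
	atomic_uint seq;	/* even: stable, odd: update in progress */
	_Atomic uint32_t lo;
	_Atomic uint32_t hi;
};

static void pos_init(struct pos64 *p, uint64_t v)
{
	atomic_init(&p->seq, 0u);
	atomic_init(&p->lo, (uint32_t)v);
	atomic_init(&p->hi, (uint32_t)(v >> 32));
}

/* Writers must already be serialised against each other (e.g. by a mutex). */
static void pos_write(struct pos64 *p, uint64_t v)
{
	atomic_fetch_add(&p->seq, 1u);		/* now odd: readers will retry */
	atomic_store(&p->lo, (uint32_t)v);
	atomic_store(&p->hi, (uint32_t)(v >> 32));
	atomic_fetch_add(&p->seq, 1u);		/* even again: value is stable */
}

/* Readers retry until both halves were read under the same even sequence. */
static uint64_t pos_read(struct pos64 *p)
{
	unsigned int before, after;
	uint32_t lo, hi;

	do {
		before = atomic_load(&p->seq);
		lo = atomic_load(&p->lo);
		hi = atomic_load(&p->hi);
		after = atomic_load(&p->seq);
	} while ((before & 1u) || before != after);

	return ((uint64_t)hi << 32) | lo;
}

/* Single-threaded self-check of the round trip across the 4GB boundary. */
static void demo_pos64(void)
{
	struct pos64 p;

	pos_init(&p, 0);
	pos_write(&p, 0x100000000ULL);
	assert(pos_read(&p) == 0x100000000ULL);
}
```

All atomics here use the default sequentially consistent ordering to keep
the sketch easy to reason about; a real implementation would likely use
weaker ordering with explicit barriers.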


-- 
Thanks,

Steve


Re: [PATCH] [8/18] BKL-removal: Remove BKL from remote_llseek

2008-01-28 Thread Dave Kleikamp

On Mon, 2008-01-28 at 12:33 -0600, Steve French wrote:
> On Jan 28, 2008 2:17 AM, Andi Kleen [EMAIL PROTECTED] wrote:
> > > I completely agree.  If one thread writes A and another writes B then the
> > > kernel should record either A or B, not ((A & 0xffffffff00000000) | (B &
> > > 0xffffffff))
> >
> > The problem is pretty nasty unfortunately. To solve it properly I think
> > the file_operations->read/write prototypes would need to be changed
> > because otherwise it is not possible to do atomic relative updates
> > of f_pos. Right now the actual update is burrowed deeply in the low level
> > read/write implementation. But that would be a huge impact all over
> > the tree :/
>
> If there were a wrapper around reads and writes of f_pos as there is
> for i_size e.g. it would hit a lot of code, but not as many as I had
> originally thought.  the most important ones are in the vfs itself, where
> there are only 59 uses of the field (not all need to be changed).   ext3
> has fewer (25), and cifs only 12 uses.

Most of the uses in ext3 and cifs deal with a directory's f_pos in
readdir, which is protected by i_mutex, so I don't think we need to
worry about them at all.
-- 
David Kleikamp
IBM Linux Technology Center



Re: [RFC] Parallelize IO for e2fsck

2008-01-28 Thread Pavel Machek
Hi!

> It's been discussed before, but I suspect the main reason why it was
> never done is no one submitted a patch.  Also, the problem is actually
> a pretty complex one.  There are a couple of different stages where
> you might want to send an alert to processes:
>
> * Data is starting to get ejected from page/buffer cache
> * System is starting to swap
> * System is starting to really struggle to find memory
> * System is starting an out-of-memory killer
>
> AIX's SIGDANGER really did the last two, where the OOM killer would
> tend to avoid processes that had a SIGDANGER handler in favor of
> processes that were SIGDANGER unaware.
>
> Then there is the additional complexity in Linux that you have
> multiple zones of memory, which at least on the historically more
> popular x86 was highly, highly important.  You could say that whenever
> there is sufficient memory pressure in any zone that you start
> ejecting data from caches or start to swap that you start sending the
> signals --- but on x86 systems with lowmem, that could happen quite
> frequently, and since a user process has no idea whether its resources
> are in lowmem or highmem, there's not much you can do about this.

As user pages are always in highmem, this should be easy to decide:
only send SIGDANGER when highmem is full. (Yes, there are
inodes/dentries/file descriptors in lowmem, but I doubt apps will
respond to SIGDANGER by closing files).
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


Re: [2.6.24 REGRESSION] BUG: Soft lockup - with VFS

2008-01-28 Thread Oliver Pinter (Pintér Olivér)
And here is the dmesg:

-- 
Thanks,
Oliver
Initializing cgroup subsys cpuset
Linux version 2.6.24-szami2 ([EMAIL PROTECTED]) (gcc version 4.1.2 20061115 
(prerelease) (Debian 4.1.1-21)) #2 SMP Sun Jan 27 01:47:58 CET 2008
BIOS-provided physical RAM map:
 BIOS-e820:  - 0009fc00 (usable)
 BIOS-e820: 0009fc00 - 000a (reserved)
 BIOS-e820: 000e8000 - 0010 (reserved)
 BIOS-e820: 0010 - 1ff3 (usable)
 BIOS-e820: 1ff3 - 1ff4 (ACPI data)
 BIOS-e820: 1ff4 - 1fff (ACPI NVS)
 BIOS-e820: 1fff - 2000 (reserved)
 BIOS-e820: ffb8 - 0001 (reserved)
0MB HIGHMEM available.
511MB LOWMEM available.
found SMP MP-table at 000ff780
Entering add_active_range(0, 0, 130864) 0 entries of 256 used
Zone PFN ranges:
  DMA 0 - 4096
  Normal   4096 -   130864
  HighMem130864 -   130864
Movable zone start PFN for each node
early_node_map[1] active PFN ranges
0:0 -   130864
On node 0 totalpages: 130864
  DMA zone: 56 pages used for memmap
  DMA zone: 0 pages reserved
  DMA zone: 4040 pages, LIFO batch:0
  Normal zone: 1733 pages used for memmap
  Normal zone: 125035 pages, LIFO batch:31
  HighMem zone: 0 pages used for memmap
  Movable zone: 0 pages used for memmap
DMI 2.3 present.
ACPI: RSDP 000F9E30, 0021 (r2 ACPIAM)
ACPI: XSDT 1FF30100, 003C (r1 A M I  OEMXSDT  1414 MSFT   97)
ACPI: FACP 1FF30290, 00F4 (r3 A M I  OEMFACP  1414 MSFT   97)
ACPI: DSDT 1FF303F0, 3779 (r1  P4C8B P4C8B106  106 INTL  2002026)
ACPI: FACS 1FF4, 0040
ACPI: APIC 1FF30390, 005C (r1 A M I  OEMAPIC  1414 MSFT   97)
ACPI: OEMB 1FF40040, 003F (r1 A M I  OEMBIOS  1414 MSFT   97)
ACPI: PM-Timer IO Port: 0x808
ACPI: Local APIC address 0xfee0
ACPI: LAPIC (acpi_id[0x01] lapic_id[0x00] enabled)
Processor #0 15:2 APIC version 20
ACPI: LAPIC (acpi_id[0x02] lapic_id[0x01] enabled)
Processor #1 15:2 APIC version 20
ACPI: IOAPIC (id[0x02] address[0xfec0] gsi_base[0])
IOAPIC[0]: apic_id 2, version 32, address 0xfec0, GSI 0-23
ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
ACPI: IRQ0 used by override.
ACPI: IRQ2 used by override.
ACPI: IRQ9 used by override.
Enabling APIC mode:  Flat.  Using 1 I/O APICs
Using ACPI (MADT) for SMP configuration information
Allocating PCI resources starting at 3000 (gap: 2000:dfb8)
Built 1 zonelists in Zone order, mobility grouping on.  Total pages: 129075
Kernel command line: BOOT_IMAGE=deb_s2.6.24 ro root=803 1
mapped APIC to b000 (fee0)
mapped IOAPIC to a000 (fec0)
Enabling fast FPU save and restore... done.
Enabling unmasked SIMD FPU exception support... done.
Initializing CPU#0
PID hash table entries: 2048 (order: 11, 8192 bytes)
Detected 3150.239 MHz processor.
Console: colour VGA+ 132x44
console [tty0] enabled
Lock dependency validator: Copyright (c) 2006 Red Hat, Inc., Ingo Molnar
... MAX_LOCKDEP_SUBCLASSES:8
... MAX_LOCK_DEPTH:  30
... MAX_LOCKDEP_KEYS:2048
... CLASSHASH_SIZE:   1024
... MAX_LOCKDEP_ENTRIES: 8192
... MAX_LOCKDEP_CHAINS:  16384
... CHAINHASH_SIZE:  8192
 memory used by lock dependency info: 1024 kB
 per task-struct memory footprint: 1680 bytes

| Locking API testsuite:

 | spin |wlock |rlock |mutex | wsem | rsem |
  --
 A-A deadlock:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
 A-B-B-A deadlock:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
 A-B-B-C-C-A deadlock:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
 A-B-C-A-B-C deadlock:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
 A-B-B-C-C-D-D-A deadlock:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
 A-B-C-D-B-D-D-A deadlock:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
 A-B-C-D-B-C-D-A deadlock:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
double unlock:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
  initialize held:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
 bad unlock order:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
  --
  recursive read-lock: |  ok  | |  ok  |
   recursive read-lock #2: |  ok  | |  ok  |
mixed read-write-lock: |  ok  | |  ok  |
mixed write-read-lock: |  ok  | |  ok  |
  --
 hard-irqs-on + irq-safe-A/12:  ok  |  ok  |  ok  |
 soft-irqs-on + irq-safe-A/12:  ok  

Re: [PATCH] [8/18] BKL-removal: Remove BKL from remote_llseek

2008-01-28 Thread Andi Kleen
On Monday 28 January 2008 14:38:57 Alan Cox wrote:
> > Also worse really fixing it would be a major change to the VFS
> > because of the way ->read/write are defined :/
>
> I don't see a problem there. ->read and ->write update the passed pointer
> which is not the real f_pos anyway. Just the copies need fixing.

They are effectively doing a decoupled read/modify/write cycle, e.g.:

A                       B

read fpos

                        read fpos

fpos += A               fpos += B
write fpos


                        write fpos

So you get overlapping reads. Probably not good.
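
The interleaving in the diagram can be replayed step by step in plain C.
This is a single-threaded illustration under the assumption that both
callers sample the position before either writes it back;
replay_lost_update() and demo_lost_update() are hypothetical names, not
kernel functions.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Replay the A/B interleaving: both callers sample fpos, each adds its
 * own length to its private copy, then both write their copy back.
 */
static int64_t replay_lost_update(int64_t fpos, int64_t len_a, int64_t len_b)
{
	int64_t seen_by_a = fpos;	/* A: read fpos */
	int64_t seen_by_b = fpos;	/* B: read fpos */

	seen_by_a += len_a;		/* A: fpos += A */
	seen_by_b += len_b;		/* B: fpos += B */

	fpos = seen_by_a;		/* A: write fpos */
	fpos = seen_by_b;		/* B: write fpos, A's update is lost */

	return fpos;
}

/* Self-check: the position advances by B only, not by A + B. */
static void demo_lost_update(void)
{
	assert(replay_lost_update(0, 100, 50) == 50);
	assert(replay_lost_update(0, 100, 50) != 150);
}
```

Both callers started from the same position, so they also transferred
overlapping byte ranges, which is the "overlapping reads" outcome.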

-Andi


Re: [RFC] ext3 freeze feature

2008-01-28 Thread Takashi Sato

Hi,

Thank you for your comments.


That's inherently unsafe - you can have multiple unfreezes
running in parallel which seriously screws with the bdev semaphore
count that is used to lock the device due to doing multiple up()s
for every down.

Your timeout thingy guarantee that at some point you will get
multiple up()s occuring due to the timer firing racing with
a thaw ioctl. 


If this interface is to be more widely exported, then it needs
a complete revamp of the bdev is locked while it is frozen so
that there is no chance of a double up() ever occuring on the
bd_mount_sem due to racing thaws.


My patch has the race condition as you said.
I will fix it.

Cheers, Takashi 




Re: [RFC] Parallelize IO for e2fsck

2008-01-28 Thread Pavel Machek
On Mon 2008-01-28 14:56:33, Theodore Tso wrote:
> On Mon, Jan 28, 2008 at 07:30:05PM +, Pavel Machek wrote:
> >
> > As user pages are always in highmem, this should be easy to decide:
> > only send SIGDANGER when highmem is full. (Yes, there are
> > inodes/dentries/file descriptors in lowmem, but I doubt apps will
> > respond to SIGDANGER by closing files).
>
> Good point; for a system with at least (say) 2GB of memory, that
> definitely makes sense.  For a system with less than 768 megs of
> memory (how quaint, but it wasn't that long ago this was a lot of
> memory :-), there wouldn't *be* any memory in highmem at all

OK, so it is 'send SIGDANGER when all zones are low', because user
allocations can come from all zones (unless you have something really
exotic; I'm not sure whether that is true on huge NUMA machines & similar).

Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


Re: [PATCH] [8/18] BKL-removal: Remove BKL from remote_llseek

2008-01-28 Thread Alan Cox
> Also worse really fixing it would be a major change to the VFS
> because of the way ->read/write are defined :/

I don't see a problem there. ->read and ->write update the passed pointer,
which is not the real f_pos anyway. Just the copies need fixing.

Alan


Re: [RFC] ext3: per-process soft-syncing data=ordered mode

2008-01-28 Thread Al Boldi
Jan Kara wrote:
> On Sat 26-01-08 08:27:59, Al Boldi wrote:
> > Do you mean there is a locking problem?
>
>   No, but if you write to an mmaped file, then we can find out only later
> we have dirty data in pages and we call writepage() on behalf of e.g.
> pdflush().

Ok, that's a special case, which we could code for, but it doesn't seem
worthwhile.  In any case, forked children should inherit their parent's mode.

> > >   And in case of DB, they use direct-io
> > > anyway most of the time so they don't care about journaling mode
> > > anyway.
> >
> > Testing with sqlite3 and mysql4 shows that performance drastically
> > improves with writeback writeout.
>
>   And do you have the databases configured to use direct IO or not?

I don't think so, but these tests are only meant to expose the underlying
problem which needs to be fixed, while this RFC proposes a useful
workaround.

In another post Jan Kara wrote:
>   Hmm, if you're willing to test patches, then you could try a debug
> patch: http://bugzilla.kernel.org/attachment.cgi?id=14574
>   and send me the output. What kind of load do you observe problems with
> and which problems exactly?

8M-record insert into indexed db-table:
            ordered    writeback
sqlite3:    75m22s     8m45s
mysql4 :    23m35s     5m29s

Also, see the 'konqueror deadlocks in 2.6.22' thread.


Thanks!

--
Al



Re: [patch 21/26] mount options: partially fix nfs

2008-01-28 Thread Chuck Lever

On Jan 28, 2008, at 6:34 AM, Miklos Szeredi wrote:

>>> All mount options should be shown, which are needed to reconstruct a
>>> previous mount.
>>
>> Ah, OK.
>>
>> I'm happy to implement logic to display the all missing options.  I
>> should have updated nfs_show_mount_options() when I wrote the NFS
>> mount option parser.
>>
>> Let me know your preference.
>
> You are more familiar with NFS, so I think it would be better if you
> updated nfs_show_mount_options().
>
> Could you also queue my patch (updated) or incorporate it into a
> combined fix?


Yes.  I'll have time in a day or two to get this finished.


--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com





Re: [RFC] ext3: per-process soft-syncing data=ordered mode

2008-01-28 Thread Jan Kara
On Sat 26-01-08 08:27:43, Al Boldi wrote:
 Diego Calleja wrote:
  On Thu, 24 Jan 2008 23:36:00 +0300, Al Boldi [EMAIL PROTECTED] wrote:
   Greetings!
  
   data=ordered mode has proven reliable over the years, and it achieves this
   by ordering file data flushes before metadata flushes.  But this
   sometimes causes contention on the order of a 10x slowdown for certain
   apps, either due to misuse of fsync or due to inherent behaviour, as
   with databases, as well as inherent starvation issues exposed by
   data=ordered mode.
 
  There's a related bug in bugzilla:
  http://bugzilla.kernel.org/show_bug.cgi?id=9546
 
  The diagnosis from Jan Kara is different, but I think it may be
  the same problem...
 
  One process does data-intensive load. Thus in the ordered mode the
  transaction is tiny but has tons of data buffers attached. If commit
  happens, it takes a long time to sync all the data before the commit
  can proceed... In the writeback mode, we don't wait for data buffers; in
  the journal mode, the amount of data to be written is limited by the
  maximum size of a transaction, so we write in much smaller chunks
  and better latency is thus ensured.
 
 
  I'm hitting this bug too... it's surprising that not many people are
  reporting bugs about this, because it's really annoying.
 
 
  There's a patch by Jan Kara (which I'm including here because bugzilla
  didn't include it and it took me a while to find). I don't know whether
  it's supposed to fix the problem, but it'd be interesting to try:
 
 Thanks a lot, but it doesn't fix it.
  Hmm, if you're willing to test patches, then you could try a debug patch:
http://bugzilla.kernel.org/attachment.cgi?id=14574
  and send me the output. What kind of load do you observe problems with
and which problems exactly?
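
For anyone who wants to put a number on the stall described in the quoted diagnosis, here is a minimal sketch, not the debug patch linked above. It assumes a POSIX system; demo_fsync_latency_ms() and the file name used with it are hypothetical. The function appends a small amount of data to a file, then times only the fsync() call. Run on an ext3 data=ordered filesystem while another process writes large amounts of data, it should make the commit wait visible.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/*
 * Append len bytes to path, fsync the file, and return how long the
 * fsync() call itself took in milliseconds, or -1.0 on error.
 */
static double demo_fsync_latency_ms(const char *path, size_t len)
{
	struct timespec t0, t1;
	char block[4096];
	double ms = -1.0;
	int fd;

	memset(block, 'x', sizeof(block));
	fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0600);
	if (fd < 0)
		return -1.0;

	/* Write the payload in block-sized chunks. */
	while (len > 0) {
		size_t chunk = len < sizeof(block) ? len : sizeof(block);
		ssize_t n = write(fd, block, chunk);

		if (n <= 0)
			goto out;
		len -= (size_t)n;
	}

	/* Time only the fsync(), which is where the ordered-mode wait shows up. */
	clock_gettime(CLOCK_MONOTONIC, &t0);
	if (fsync(fd) != 0)
		goto out;
	clock_gettime(CLOCK_MONOTONIC, &t1);

	ms = (t1.tv_sec - t0.tv_sec) * 1000.0 +
	     (t1.tv_nsec - t0.tv_nsec) / 1e6;
out:
	close(fd);
	return ms;
}
```

Calling it in a loop and logging the result alongside a description of the background load would give the kind of "which load, which problem" detail asked for here.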

Honza
-- 
Jan Kara [EMAIL PROTECTED]
SUSE Labs, CR


Re: Kernel Event Notifications (was: [RFC] Parallelize IO for e2fsck)

2008-01-28 Thread Jon Masters
On Sat, 2008-01-26 at 16:55 +0300, Al Boldi wrote:
 KOSAKI Motohiro wrote:
And from a performance point of view letting applications voluntarily
free some memory is better even than starting to swap.
  
   Absolutely.
 
  the mem_notify patch can deliver a notification just before swapping
  starts :)

I looked at this a year or two back, then ran out of time. But the thing
I wanted to do was have libc's memory allocation routines extended to
handle these through reservations - the kernel should send a userspace
notification and then there should be some kind of concept of returning
memory that's been used for opportunistic userspace caching, e.g. in
firefox to cache the last 10 web pages. Let us know how you get on :)
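
The userspace half of that idea might look roughly like the sketch below. Everything in it is an assumption for illustration: struct demo_cache, demo_cache_put() and demo_cache_on_pressure() are hypothetical names, and how the kernel notification actually reaches the process (a signal, a pollable descriptor, something in libc) is deliberately left out. It only shows an opportunistic cache that can hand all of its memory back when asked.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_CACHE_SLOTS 10

/* Opportunistic cache: entries may be dropped at any time to free memory. */
struct demo_cache {
	void *slot[DEMO_CACHE_SLOTS];
	size_t size[DEMO_CACHE_SLOTS];
};

/* Store a copy of data in slot idx, replacing any old entry. 0 on success. */
static int demo_cache_put(struct demo_cache *c, size_t idx,
			  const void *data, size_t len)
{
	void *copy;

	assert(idx < DEMO_CACHE_SLOTS);
	copy = malloc(len);
	if (!copy)
		return -1;
	memcpy(copy, data, len);
	free(c->slot[idx]);
	c->slot[idx] = copy;
	c->size[idx] = len;
	return 0;
}

/*
 * Hypothetical handler, to be called when a memory-pressure notification
 * arrives. Frees every cached entry and returns the bytes released.
 */
static size_t demo_cache_on_pressure(struct demo_cache *c)
{
	size_t freed = 0;
	size_t i;

	for (i = 0; i < DEMO_CACHE_SLOTS; i++) {
		free(c->slot[i]);
		c->slot[i] = NULL;
		freed += c->size[i];
		c->size[i] = 0;
	}
	return freed;
}
```

A real implementation would more likely evict the least recently used entries first rather than everything; dropping the whole cache keeps the sketch short.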

Jon.




Re: [patch 05/26] mount options: fix afs

2008-01-28 Thread David Howells
Miklos Szeredi [EMAIL PROTECTED] wrote:

 Add a .show_options super operation to afs.
 
 Use generic_show_options() and save the complete option string in
 afs_get_sb().

Sounds reasonable, but I can't test it till I get back from LCA.

David


Re: [RFC] Parallelize IO for e2fsck

2008-01-28 Thread david

On Mon, 28 Jan 2008, Theodore Tso wrote:


On Mon, Jan 28, 2008 at 07:30:05PM +, Pavel Machek wrote:


As user pages are always in highmem, this should be easy to decide:
only send SIGDANGER when highmem is full. (Yes, there are
inodes/dentries/file descriptors in lowmem, but I doubt apps will
respond to SIGDANGER by closing files).


Good point; for a system with at least (say) 2GB of memory, that
definitely makes sense.  For a system with less than 768 megs of
memory (how quaint, but it wasn't that long ago this was a lot of
memory :-), there wouldn't *be* any memory in highmem at all


not to mention machines with 1G of ram (900M lowmem, 128M highmem)
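
The arithmetic behind those figures can be sketched as follows, assuming the common 32-bit x86 3G/1G split in which roughly 896 MiB of RAM is directly mapped as lowmem (the "900M" above, rounded). The DEMO_LOWMEM_LIMIT_MIB constant and both helper names are hypothetical; other splits and architectures give different numbers.

```c
#include <assert.h>

/* Assumed lowmem ceiling in MiB for a default 32-bit x86 3G/1G split. */
#define DEMO_LOWMEM_LIMIT_MIB 896u

/* RAM that ends up in lowmem: everything up to the ceiling. */
static unsigned int demo_lowmem_mib(unsigned int total_mib)
{
	return total_mib < DEMO_LOWMEM_LIMIT_MIB ? total_mib
						 : DEMO_LOWMEM_LIMIT_MIB;
}

/* RAM that ends up in highmem: whatever is left above the ceiling. */
static unsigned int demo_highmem_mib(unsigned int total_mib)
{
	return total_mib - demo_lowmem_mib(total_mib);
}
```

Under that assumption a 1 GiB machine has only 128 MiB of highmem, so "highmem is full" would trigger long before the machine as a whole is short of memory, while a 512 MiB machine has no highmem at all.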

David Lang