Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)

2007-10-02 Thread Randy Dunlap
On Tue, 02 Oct 2007 15:36:01 +0200 Peter Zijlstra wrote:

> On Fri, 2007-09-28 at 12:16 -0700, Andrew Morton wrote:
> 
> > (Searches for the lockstat documentation)
> > 
> > Did we forget to do that?
> 
> yeah,...
> 
> /me quickly whips up something

Thanks.  Just some typos noted below.


> Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
> ---
>  Documentation/lockstat.txt |  119 +
>  1 file changed, 119 insertions(+)
> 
> Index: linux-2.6/Documentation/lockstat.txt
> ===
> --- /dev/null
> +++ linux-2.6/Documentation/lockstat.txt
> @@ -0,0 +1,119 @@
> +
> +LOCK STATISTICS
> +
> +- WHAT
> +
> +As the name suggests, it provides statistics on locks.
> +
> +- WHY
> +
> +Because things like lock contention can severely impact performance.
> +
> +- HOW
> +
> +Lockdep already has hooks in the lock functions and maps lock instances to
> +lock classes. We build on that. The graph below shows the relation between
> +the lock functions and the various hooks therein.
> +
> +        __acquire
> +            |
> +           lock _____
> +            |        \
> +            |    __contended
> +            |         |
> +            |       <wait>
> +            | _______/
> +            |/
> +            |
> +       __acquired
> +            |
> +            .
> +          <hold>
> +            .
> +            |
> +       __release
> +            |
> +         unlock
> +
> +lock, unlock - the regular lock functions
> +__*  - the hooks
> +<>   - states
> +
> +With these hooks we provide the following statistics:
> +
> + con-bounces - number of lock contention that involved x-cpu data
> + contentions- number of lock acquisitions that had to wait
> + wait time min  - shortest (non 0) time we ever had to wait for a lock

  (non-0)

> +   max  - longest time we ever had to wait for a lock
> +   total- total time we spend waiting on this lock
> + acq-bounes - number of lock acquisitions that involved x-cpu data

   -bounces

> + acquisitions- number of times we took the lock
> + hold time min   - shortest (non 0) time we ever held the lock

   (non-0)

> +   max   - longest time we ever held the lock
> +   total - total time this lock was held
> +
> +From these number various other statistics can be derived, such as:
> +
> + hold time average = hold time total / acquisitions
> +
> +These numbers are gathered per lock class, per read/write state (when
> +applicable).
> +
> +It also tracks (4) contention points per class. A contention point is a call
> +site that had to wait on lock acquisition.
> +
> + - USAGE
> +
> +Look at the current lock statistics:
> +
> +(line numbers not part of actual output, done for clarity in the explanation below)
> +
> +# less /proc/lock_stat
> +
> +01 lock_stat version 0.2
> +02 -----------------------------------------------------------------------------------------------------------------------------------------------------------------
> +03                               class name    con-bounces    contentions   waittime-min   waittime-max waittime-total    acq-bounces   acquisitions   holdtime-min   holdtime-max holdtime-total
> +04 -----------------------------------------------------------------------------------------------------------------------------------------------------------------
...
> +15                          dcache_lock            180          [<802c0d7e>] sys_getcwd+0x11e/0x230
> +16                          dcache_lock            165          [<802c002a>] d_alloc+0x15a/0x210
> +17                          dcache_lock             33          [<8035818d>] _atomic_dec_and_lock+0x4d/0x70
> +18                          dcache_lock              1          [<802beef8>] shrink_dcache_parent+0x18/0x130
> +
> +This except shows the first two lock class statistics. Line 01 shows the output

excerpt

> +version - each time the format changes this will be updated. Line 02-04 show
> +the header with column descriptions. Lines 05-10 and 13-18 show the actual
> +statistics. These statistics come in two parts; the actual stats separated by a
> +short separator (line 08, 14) from the contention points.
> +
> +The first lock (05-10) is a read/write lock, and shows two lines above the
> +short separator. The contention points don't match the column descriptors,
> +they have two: contentions and [<IP>] symbol.
...

---
~Randy


Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)

2007-10-02 Thread Peter Zijlstra
On Fri, 2007-09-28 at 12:16 -0700, Andrew Morton wrote:

> (Searches for the lockstat documentation)
> 
> Did we forget to do that?

yeah,...

/me quickly whips up something

Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
---
 Documentation/lockstat.txt |  119 +
 1 file changed, 119 insertions(+)

Index: linux-2.6/Documentation/lockstat.txt
===
--- /dev/null
+++ linux-2.6/Documentation/lockstat.txt
@@ -0,0 +1,119 @@
+
+LOCK STATISTICS
+
+- WHAT
+
+As the name suggests, it provides statistics on locks.
+
+- WHY
+
+Because things like lock contention can severely impact performance.
+
+- HOW
+
+Lockdep already has hooks in the lock functions and maps lock instances to
+lock classes. We build on that. The graph below shows the relation between
+the lock functions and the various hooks therein.
+
+        __acquire
+            |
+           lock _____
+            |        \
+            |    __contended
+            |         |
+            |       <wait>
+            | _______/
+            |/
+            |
+       __acquired
+            |
+            .
+          <hold>
+            .
+            |
+       __release
+            |
+         unlock
+
+lock, unlock  - the regular lock functions
+__*           - the hooks
+<>            - states
+
+With these hooks we provide the following statistics:
+
+ con-bounces   - number of lock contention that involved x-cpu data
+ contentions- number of lock acquisitions that had to wait
+ wait time min  - shortest (non 0) time we ever had to wait for a lock
+   max  - longest time we ever had to wait for a lock
+   total- total time we spend waiting on this lock
+ acq-bounes - number of lock acquisitions that involved x-cpu data
+ acquisitions  - number of times we took the lock
+ hold time min - shortest (non 0) time we ever held the lock
+   max - longest time we ever held the lock
+   total   - total time this lock was held
+
+From these number various other statistics can be derived, such as:
+
+ hold time average = hold time total / acquisitions
+
+These numbers are gathered per lock class, per read/write state (when
+applicable).
+
+It also tracks (4) contention points per class. A contention point is a call
+site that had to wait on lock acquisition.
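As a quick illustration of the derived numbers, the fragment below computes the hold-time average exactly as defined above, plus a wait-time average by analogy (waittime-total / contentions). The sample values are lifted from the dcache_lock line in the example further down; the variable names are invented for the example and are not part of the lockstat interface.

#include <stdio.h>

int main(void)
{
        /* values from the dcache_lock class line (13) in the sample output */
        double holdtime_total = 77387.24;
        long   acquisitions   = 243371;
        double waittime_total = 774.51;
        long   contentions    = 1161;

        /* hold time average = hold time total / acquisitions (from the text) */
        printf("hold time average: %.4f\n", holdtime_total / acquisitions);
        /* wait time average computed the same way (my extrapolation) */
        printf("wait time average: %.4f\n", waittime_total / contentions);
        return 0;
}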
+
+ - USAGE
+
+Look at the current lock statistics:
+
+(line numbers not part of actual output, done for clarity in the explanation below)
+
+# less /proc/lock_stat
+
+01 lock_stat version 0.2
+02 -----------------------------------------------------------------------------------------------------------------------------------------------------------------
+03                               class name    con-bounces    contentions   waittime-min   waittime-max waittime-total    acq-bounces   acquisitions   holdtime-min   holdtime-max holdtime-total
+04 -----------------------------------------------------------------------------------------------------------------------------------------------------------------
+05
+06           &inode->i_data.tree_lock-W:             15          21657           0.18     1093295.30 11547131054.85             58          10415           0.16          87.51        6387.60
+07           &inode->i_data.tree_lock-R:              0              0           0.00           0.00           0.00          23302         231198           0.25           8.45       98023.38
+08           ---------------------------
+09             &inode->i_data.tree_lock               0          [<8027c08f>] add_to_page_cache+0x5f/0x190
+10
+11 ...............................................................................................................................................................
+12
+13                          dcache_lock:           1037           1161           0.38          45.32         774.51           6611         243371           0.15         306.48       77387.24
+14                          -----------
+15                          dcache_lock            180          [<802c0d7e>] sys_getcwd+0x11e/0x230
+16                          dcache_lock            165          [<802c002a>] d_alloc+0x15a/0x210
+17                          dcache_lock             33          [<8035818d>] _atomic_dec_and_lock+0x4d/0x70
+18                          dcache_lock              1          [<802beef8>] shrink_dcache_parent+0x18/0x130
+
+This except shows the first two lock class statistics. Line 01 shows the output
+version - each time the format changes this will be updated. Line 02-04 show
+the header with column descriptions. Lines 05-10 and 13-18 show the actual
+statistics. These statistics come in two parts; the actual stats separated by a
+short separator (line 08, 14) from the contention points.
+
+The first lock (05-10) is a read/write lock, and shows two lines above the
+short separator. The contention points don't match the column descriptors,
+they have two: contentions and [<IP>] symbol.
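For post-processing, a rough parser sketch follows. It assumes the version 0.2 layout shown above (a class name terminated by ':' followed by ten numeric columns) and skips everything else; this is an illustration only, and a real tool would need to handle format changes and unusual class names more defensively.

#include <stdio.h>
#include <string.h>

int main(void)
{
        FILE *f = fopen("/proc/lock_stat", "r");
        char line[1024];

        if (!f) {
                perror("/proc/lock_stat");
                return 1;
        }
        while (fgets(line, sizeof(line), f)) {
                char name[256];
                long con_bounces, contentions, acq_bounces, acquisitions;
                double wt_min, wt_max, wt_total, ht_min, ht_max, ht_total;
                char *colon = strchr(line, ':');

                if (!colon)
                        continue;       /* headers, separators, contention points */
                *colon = '\0';
                if (sscanf(colon + 1, "%ld %ld %lf %lf %lf %ld %ld %lf %lf %lf",
                           &con_bounces, &contentions, &wt_min, &wt_max,
                           &wt_total, &acq_bounces, &acquisitions,
                           &ht_min, &ht_max, &ht_total) != 10)
                        continue;       /* not a class statistics line */
                if (sscanf(line, " %255s", name) != 1)
                        continue;
                printf("%-40s contentions=%-8ld hold-avg=%.2f wait-avg=%.2f\n",
                       name, contentions,
                       acquisitions ? ht_total / acquisitions : 0.0,
                       contentions  ? wt_total / contentions  : 0.0);
        }
        fclose(f);
        return 0;
}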


Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)

2007-10-01 Thread Chuck Ebbert
On 09/29/2007 07:04 AM, Fengguang Wu wrote:
> On Thu, Sep 27, 2007 at 11:32:36PM -0700, Chakri n wrote:
>> Hi,
>>
>> In my testing, a unresponsive file system can hang all I/O in the system.
>> This is not seen in 2.4.
>>
>> I started 20 threads doing I/O on a NFS share. They are just doing 4K
>> writes in a loop.
>>
>> Now I stop NFS server hosting the NFS share and start a
>> "dd" process to write a file on local EXT3 file system.
>>
>> # dd if=/dev/zero of=/tmp/x count=1000
>>
>> This process never progresses.
> 
> Peter, do you think this patch will help?
> 
> ===
> writeback: avoid possible balance_dirty_pages() lockup on light-load bdi
> 
> On a busy-writing system, a writer could be hold up infinitely on a
> light-load device. It will be trying to sync more than enough dirty data.
> 
> The problem case:
> 
> 0. sda/nr_dirty >= dirty_limit;
>sdb/nr_dirty == 0
> 1. dd writes 32 pages on sdb
> 2. balance_dirty_pages() blocks dd, and tries to write 6MB.
> 3. it never gets there: there's only 128KB dirty data.
> 4. dd may be blocked for a lng time as long as sda is overloaded
> 
> Fix it by returning on 'zero dirty inodes' in the current bdi.
> (In fact there are slight differences between 'dirty inodes' and 'dirty 
> pages'.
> But there is no available counters for 'dirty pages'.)
> 
> Cc: Peter Zijlstra <[EMAIL PROTECTED]>
> Signed-off-by: Fengguang Wu <[EMAIL PROTECTED]>
> ---
>  mm/page-writeback.c |3 +++
>  1 file changed, 3 insertions(+)
> 
> --- linux-2.6.22.orig/mm/page-writeback.c
> +++ linux-2.6.22/mm/page-writeback.c
> @@ -227,6 +227,9 @@ static void balance_dirty_pages(struct a
>          if (nr_reclaimable + global_page_state(NR_WRITEBACK) <=
>                          dirty_thresh)
>                  break;
> +        if (list_empty(&mapping->host->i_sb->s_dirty) &&
> +            list_empty(&mapping->host->i_sb->s_io))
> +                break;
>  
>          if (!dirty_exceeded)
>                  dirty_exceeded = 1;
> 

This looks better than the other candidate to fix the problem. Are we going
to fix 2.6.23 before release? Multiple people have reported this problem now...




Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)

2007-09-29 Thread Peter Zijlstra

On Sat, 2007-09-29 at 20:28 +0800, Fengguang Wu wrote:
> On Sat, Sep 29, 2007 at 01:48:01PM +0200, Peter Zijlstra wrote:

> > On the patch itself, not sure if it would have been enough. As soon as
> > there is a single dirty inode on the list one would get caught in the
> > same problem as before.
> 
> That should not be a problem.  Normally the few new dirty inodes will
> be all cleaned in one go and there are no more dirty inodes left(at
> least for a moment). Hmm, I guess the new 'break' should be moved
> immediately after writeback_inodes()...
> 
> > That is, if NFS_dirty+NFS_unstable+NFS_writeback > dirty_limit this
> > break won't fix it.
> 
> In fact this patch exactly targets at this condition.
> When NFS* < dirty_limit, Chakri won't see the lockup at all.
> The problem was, there are only two 'break's in the loop, and neither
> one evaluates to true for his dd command.

Yeah indeed, when put in the loop, after writeback_inodes() it makes
sense.

No idea what I was thinking, must be one of those days... :-/





Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)

2007-09-29 Thread Fengguang Wu
On Sat, Sep 29, 2007 at 01:48:01PM +0200, Peter Zijlstra wrote:
> 
> On Sat, 2007-09-29 at 19:04 +0800, Fengguang Wu wrote:
> > On Thu, Sep 27, 2007 at 11:32:36PM -0700, Chakri n wrote:
> > > Hi,
> > > 
> > > In my testing, a unresponsive file system can hang all I/O in the system.
> > > This is not seen in 2.4.
> > > 
> > > I started 20 threads doing I/O on a NFS share. They are just doing 4K
> > > writes in a loop.
> > > 
> > > Now I stop NFS server hosting the NFS share and start a
> > > "dd" process to write a file on local EXT3 file system.
> > > 
> > > # dd if=/dev/zero of=/tmp/x count=1000
> > > 
> > > This process never progresses.
> > 
> > Peter, do you think this patch will help?
> 
> In another sub-thread:
> 
> > It's works on .23-rc8-mm2 with out any problems.
> > 
> > "dd" process does not hang any more.
> > 
> > Thanks for all the help.
> > 
> > Cheers
> > --Chakri
> 
> So the per-bdi dirty patches that are in -mm already fix the problem.

That's good.
But still it could be a good candidate for 2.6.22.x or even 2.6.23.

> > ===
> > writeback: avoid possible balance_dirty_pages() lockup on light-load bdi
> > 
> > On a busy-writing system, a writer could be hold up infinitely on a
> > light-load device. It will be trying to sync more than enough dirty data.
> > 
> > The problem case:
> > 
> > 0. sda/nr_dirty >= dirty_limit;
> >sdb/nr_dirty == 0
> > 1. dd writes 32 pages on sdb
> > 2. balance_dirty_pages() blocks dd, and tries to write 6MB.
> > 3. it never gets there: there's only 128KB dirty data.
> > 4. dd may be blocked for a lng time as long as sda is overloaded
> > 
> > Fix it by returning on 'zero dirty inodes' in the current bdi.
> > (In fact there are slight differences between 'dirty inodes' and 'dirty 
> > pages'.
> > But there is no available counters for 'dirty pages'.)
> > 
> > Cc: Peter Zijlstra <[EMAIL PROTECTED]>
> > Signed-off-by: Fengguang Wu <[EMAIL PROTECTED]>
> > ---
> >  mm/page-writeback.c |3 +++
> >  1 file changed, 3 insertions(+)
> > 
> > --- linux-2.6.22.orig/mm/page-writeback.c
> > +++ linux-2.6.22/mm/page-writeback.c
> > @@ -227,6 +227,9 @@ static void balance_dirty_pages(struct a
> >          if (nr_reclaimable + global_page_state(NR_WRITEBACK) <=
> >                          dirty_thresh)
> >                  break;
> > +        if (list_empty(&mapping->host->i_sb->s_dirty) &&
> > +            list_empty(&mapping->host->i_sb->s_io))
> > +                break;
> >  
> >          if (!dirty_exceeded)
> >                  dirty_exceeded = 1;
> > 
> 
> On the patch itself, not sure if it would have been enough. As soon as
> there is a single dirty inode on the list one would get caught in the
> same problem as before.

That should not be a problem.  Normally the few new dirty inodes will
be all cleaned in one go and there are no more dirty inodes left(at
least for a moment). Hmm, I guess the new 'break' should be moved
immediately after writeback_inodes()...
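A toy userspace model of that placement, just to make the control flow concrete. The stub writeback_inodes() and all of the numbers are invented for the example; this is not the kernel loop, only its shape:

#include <stdio.h>

static long global_dirty = 2000;  /* pages dirty against the overloaded disk (sda) */
static long bdi_dirty    = 32;    /* pages dirty against the idle disk (sdb)       */
static const long dirty_thresh = 1000;
static const long write_chunk  = 1536;   /* ~6MB in 4KB pages */

static long writeback_inodes(long want)  /* stub: clean what this bdi still has */
{
        long done = bdi_dirty < want ? bdi_dirty : want;
        bdi_dirty -= done;
        return done;
}

int main(void)
{
        long written = 0;
        int iterations = 0;

        for (;;) {
                iterations++;
                if (global_dirty + bdi_dirty <= dirty_thresh)
                        break;          /* existing exit: global dirty low enough */

                written += writeback_inodes(write_chunk - written);

                /* proposed exit, placed right after writeback_inodes():
                 * nothing dirty is left on *this* backing device, so stop
                 * penalising this writer for sda's backlog */
                if (bdi_dirty == 0)
                        break;

                if (written >= write_chunk)
                        break;          /* wrote a full chunk, done */
                if (iterations > 100)
                        break;          /* safety valve for this demo only */
        }
        printf("left the loop after %d iteration(s), %ld page(s) written\n",
               iterations, written);
        return 0;
}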

> That is, if NFS_dirty+NFS_unstable+NFS_writeback > dirty_limit this
> break won't fix it.

In fact this patch exactly targets at this condition.
When NFS* < dirty_limit, Chakri won't see the lockup at all.
The problem was, there are only two 'break's in the loop, and neither
one evaluates to true for his dd command.



Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)

2007-09-29 Thread Peter Zijlstra

On Sat, 2007-09-29 at 19:04 +0800, Fengguang Wu wrote:
> On Thu, Sep 27, 2007 at 11:32:36PM -0700, Chakri n wrote:
> > Hi,
> > 
> > In my testing, a unresponsive file system can hang all I/O in the system.
> > This is not seen in 2.4.
> > 
> > I started 20 threads doing I/O on a NFS share. They are just doing 4K
> > writes in a loop.
> > 
> > Now I stop NFS server hosting the NFS share and start a
> > "dd" process to write a file on local EXT3 file system.
> > 
> > # dd if=/dev/zero of=/tmp/x count=1000
> > 
> > This process never progresses.
> 
> Peter, do you think this patch will help?

In another sub-thread:

> It's works on .23-rc8-mm2 with out any problems.
> 
> "dd" process does not hang any more.
> 
> Thanks for all the help.
> 
> Cheers
> --Chakri

So the per-bdi dirty patches that are in -mm already fix the problem.

> ===
> writeback: avoid possible balance_dirty_pages() lockup on light-load bdi
> 
> On a busy-writing system, a writer could be hold up infinitely on a
> light-load device. It will be trying to sync more than enough dirty data.
> 
> The problem case:
> 
> 0. sda/nr_dirty >= dirty_limit;
>sdb/nr_dirty == 0
> 1. dd writes 32 pages on sdb
> 2. balance_dirty_pages() blocks dd, and tries to write 6MB.
> 3. it never gets there: there's only 128KB dirty data.
> 4. dd may be blocked for a lng time as long as sda is overloaded
> 
> Fix it by returning on 'zero dirty inodes' in the current bdi.
> (In fact there are slight differences between 'dirty inodes' and 'dirty 
> pages'.
> But there is no available counters for 'dirty pages'.)
> 
> Cc: Peter Zijlstra <[EMAIL PROTECTED]>
> Signed-off-by: Fengguang Wu <[EMAIL PROTECTED]>
> ---
>  mm/page-writeback.c |3 +++
>  1 file changed, 3 insertions(+)
> 
> --- linux-2.6.22.orig/mm/page-writeback.c
> +++ linux-2.6.22/mm/page-writeback.c
> @@ -227,6 +227,9 @@ static void balance_dirty_pages(struct a
>          if (nr_reclaimable + global_page_state(NR_WRITEBACK) <=
>                          dirty_thresh)
>                  break;
> +        if (list_empty(&mapping->host->i_sb->s_dirty) &&
> +            list_empty(&mapping->host->i_sb->s_io))
> +                break;
>  
>          if (!dirty_exceeded)
>                  dirty_exceeded = 1;
> 

On the patch itself, not sure if it would have been enough. As soon as
there is a single dirty inode on the list one would get caught in the
same problem as before.

That is, if NFS_dirty+NFS_unstable+NFS_writeback > dirty_limit this
break won't fix it.



Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)

2007-09-29 Thread Fengguang Wu
On Thu, Sep 27, 2007 at 11:32:36PM -0700, Chakri n wrote:
> Hi,
> 
> In my testing, a unresponsive file system can hang all I/O in the system.
> This is not seen in 2.4.
> 
> I started 20 threads doing I/O on a NFS share. They are just doing 4K
> writes in a loop.
> 
> Now I stop NFS server hosting the NFS share and start a
> "dd" process to write a file on local EXT3 file system.
> 
> # dd if=/dev/zero of=/tmp/x count=1000
> 
> This process never progresses.

Peter, do you think this patch will help?

===
writeback: avoid possible balance_dirty_pages() lockup on light-load bdi

On a busy-writing system, a writer could be hold up infinitely on a
light-load device. It will be trying to sync more than enough dirty data.

The problem case:

0. sda/nr_dirty >= dirty_limit;
   sdb/nr_dirty == 0
1. dd writes 32 pages on sdb
2. balance_dirty_pages() blocks dd, and tries to write 6MB.
3. it never gets there: there's only 128KB dirty data.
4. dd may be blocked for a lng time as long as sda is overloaded

Fix it by returning on 'zero dirty inodes' in the current bdi.
(In fact there are slight differences between 'dirty inodes' and 'dirty pages'.
But there is no available counters for 'dirty pages'.)
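To put numbers on the 6MB and 128KB mentioned above: balance_dirty_pages() attempts roughly 1.5 * ratelimit_pages of writeback per call, while dd has dirtied only 32 pages on the idle disk. The constants below (4KB pages, ratelimit_pages of 1024) are typical values assumed for illustration, not taken from this mail:

#include <stdio.h>

int main(void)
{
        const long page_size       = 4096;   /* bytes per page (assumption)      */
        const long ratelimit_pages = 1024;   /* common capped value (assumption) */

        /* the chunk balance_dirty_pages() tries to write in one go */
        long write_chunk  = ratelimit_pages + ratelimit_pages / 2;
        /* what dd actually dirtied on the otherwise idle sdb */
        long dirty_on_sdb = 32;

        printf("write_chunk  = %ld pages (%ld KB)\n",
               write_chunk, write_chunk * page_size / 1024);    /* ~6 MB  */
        printf("dirty on sdb = %ld pages (%ld KB)\n",
               dirty_on_sdb, dirty_on_sdb * page_size / 1024);  /* 128 KB */

        /* The loop can never write write_chunk pages worth of sdb data, so
         * without an extra break it keeps waiting for the global dirty count
         * (dominated by the overloaded sda) to drop below dirty_thresh. */
        return 0;
}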

Cc: Peter Zijlstra <[EMAIL PROTECTED]>
Signed-off-by: Fengguang Wu <[EMAIL PROTECTED]>
---
 mm/page-writeback.c |3 +++
 1 file changed, 3 insertions(+)

--- linux-2.6.22.orig/mm/page-writeback.c
+++ linux-2.6.22/mm/page-writeback.c
@@ -227,6 +227,9 @@ static void balance_dirty_pages(struct a
         if (nr_reclaimable + global_page_state(NR_WRITEBACK) <=
                         dirty_thresh)
                 break;
+        if (list_empty(&mapping->host->i_sb->s_dirty) &&
+            list_empty(&mapping->host->i_sb->s_io))
+                break;
 
         if (!dirty_exceeded)
                 dirty_exceeded = 1;




Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)

2007-09-28 Thread Daniel Phillips
On Friday 28 September 2007 06:35, Peter Zijlstra wrote:
> ,,,it would be grand (and dangerous) if we could provide for a
> button that would just kill off all outstanding pages against a dead
> device.

Substitute "resources" for "pages" and you begin to get an idea of how 
tricky that actually is.  That said, this is exactly what we have done 
with ddsnap, for the simple reason that our users, now emboldened by 
being able to stop or terminate the user space part, felt justified in 
expecting that the system continue as if nothing had happened, and 
furthermore, be able to restart ddsnap without a hiccup.  (Otherwise 
known as a sysop's deity-given right to kill.)

So this is what we do in the specific case of ddsnap:

   * When we detect some nasty state change such as our userspace
  control daemon disappearing on us, we go poking around and
  explicitly release every semaphore that the device driver could
  possibly wait on forever (interestingly they are all in our own
  driver except for BKL, which is just an artifact of device mapper
  not having gone over to unlock_ioctl for no good reason that I
  know of).

   * Then at the points where the driver falls through some lock thus
  released, we check our "ready" flag, and if it indicates "busted",
  proceed with wherever cleanup is needed at that point.

Does not sound like an approach one would expect to work reliably, does 
it?  But there just may be some general principle to be ferretted out 
here.  (Anyone who has ideas on how bits of this procedure could be 
abstracted, please do not hesitate to step boldly forth into the 
limelight.)
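A minimal userspace sketch of that two-step pattern, with invented names (this models the idea only, it is not the ddsnap code): mark the device busted first, then post every semaphore a worker could be parked on, and have the worker re-check the flag whenever it falls through such a wait. Compile with -pthread.

#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>
#include <unistd.h>

static sem_t work_ready;
static volatile int busted;     /* set when the userspace control daemon goes away */

static void *worker(void *arg)
{
        (void)arg;
        for (;;) {
                sem_wait(&work_ready);          /* could block forever...          */
                if (busted) {                   /* ...unless shutdown released us  */
                        puts("worker: device is busted, running cleanup");
                        break;
                }
                puts("worker: handling one request");
        }
        return NULL;
}

static void emergency_shutdown(void)
{
        busted = 1;             /* flag the dead state first                       */
        sem_post(&work_ready);  /* then release everything a waiter could hang on  */
}

int main(void)
{
        pthread_t t;

        sem_init(&work_ready, 0, 0);
        pthread_create(&t, NULL, worker, NULL);
        sleep(1);               /* pretend the control daemon died about here      */
        emergency_shutdown();
        pthread_join(&t, NULL);
        return 0;
}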

Incidentally, only a small subset of locks needed special handling as 
above.  Most can be shown to have no way to block forever, short of an 
outright bug.

I shudder to think how much work it would be to bring every driver in 
the kernel up to such a standard, particularly if user space components 
are involved, as with USB.  On the other hand, every driver fixed is 
one less driver that sucks.  The next one to emerge from the pipeline 
will most likely be NBD, which we have been working on in fits and 
starts for a while.  Look for it to morph into "ddbd", with cross-node 
distributed data awareness, in addition to performing its current job
without deadlocking.

Regards,

Daniel


Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)

2007-09-28 Thread Daniel Phillips
On Thursday 27 September 2007 23:50, Andrew Morton wrote:
> Actually we perhaps could address this at the VFS level in another
> way. Processes which are writing to the dead NFS server will
> eventually block in balance_dirty_pages() once they've exceeded the
> memory limits and will remain blocked until the server wakes up -
> that's the behaviour we want.

It is not necessary to restrict total dirty pages at all.  Instead it is 
necessary to restrict total writeout in flight.  This is evident from 
the fact that making progress is the one and only reason our kernel 
exists, and writeout is how we make progress clearing memory.  In other 
words, if we guarantee the progress of writeout, we will live happily 
ever after and not have to sell the farm.

The current situation has an eerily similar feeling to the VM 
instability in early 2.4, which was never solved until we convinced 
ourselves that the only way to deal with Moore's law as applied to 
number of memory pages was to implement positive control of swapout in 
the form of reverse mapping[1].  This time round, we need to add 
positive control of writeout in the form of rate limiting.
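A minimal sketch of what positive control of writeout could look like at the submission boundary, with invented names and a made-up limit: submitters block on the number of requests in flight rather than on a global dirty-page count, and completions wake them up. Compile with -pthread.

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  done = PTHREAD_COND_INITIALIZER;
static int in_flight;
static const int max_in_flight = 128;   /* made-up limit */

static void submit_writeout(void)
{
        pthread_mutex_lock(&lock);
        while (in_flight >= max_in_flight)      /* throttle on requests in flight...   */
                pthread_cond_wait(&done, &lock);
        in_flight++;                            /* ...not on the amount of dirty memory */
        pthread_mutex_unlock(&lock);
        /* hand the request to the device here */
}

static void writeout_completed(void)
{
        pthread_mutex_lock(&lock);
        in_flight--;
        pthread_cond_signal(&done);             /* progress is guaranteed by completions */
        pthread_mutex_unlock(&lock);
}

int main(void)
{
        submit_writeout();
        writeout_completed();
        printf("requests in flight now: %d\n", in_flight);
        return 0;
}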

I _think_ Peter is with me on this, and not only that, but between the 
two of us we already have patches for most of the subsystems that need
it, and we have both been busy testing (different subsets of) these 
patches to destruction for the better part of a year.

Anyway, to fix the immediate bug before the one true dirty_limit removal 
patch lands (promise) I think you are on the right track by noticing 
that balance_dirty_pages has to become aware of how congested the 
involved block device is, since blocking a writeout process on an 
underused block device is clearly a bad idea.  Note how much this idea 
looks like rate limiting.

[1] We lost the scent for a number of reasons, not least because the 
experimental implementation of reverse mapping at the time was buggy 
for reasons entirely unrelated to the reverse mapping itself.

Regards,

Daniel


Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)

2007-09-28 Thread Chakri n
No change in behavior even in the case of low-memory systems. I confirmed
it running on a 1Gig machine.

Thanks
--Chakri

On 9/28/07, Chakri n <[EMAIL PROTECTED]> wrote:
> Here is a snapshot of vmstats when the problem happened. I believe
> this could help a little.
>
> crash> kmem -V
>NR_FREE_PAGES: 680853
>  NR_INACTIVE: 95380
>NR_ACTIVE: 26891
>NR_ANON_PAGES: 2507
>   NR_FILE_MAPPED: 1832
>NR_FILE_PAGES: 119779
>NR_FILE_DIRTY: 0
> NR_WRITEBACK: 18272
>  NR_SLAB_RECLAIMABLE: 1305
> NR_SLAB_UNRECLAIMABLE: 2085
> NR_PAGETABLE: 123
>  NR_UNSTABLE_NFS: 0
>NR_BOUNCE: 0
>  NR_VMSCAN_WRITE: 0
>
> In my testing, I always saw the processes are waiting in
> balance_dirty_pages_ratelimited(), never in throttle_vm_writeout()
> path.
>
> But this could be because I have about 4Gig of memory in the system
> and plenty of mem is still available around.
>
> I will rerun the test limiting memory to 1024MB and lets see if it
> takes in any different path.
>
> Thanks
> --Chakri
>
>
> On 9/28/07, Andrew Morton <[EMAIL PROTECTED]> wrote:
> > On Fri, 28 Sep 2007 16:32:18 -0400
> > Trond Myklebust <[EMAIL PROTECTED]> wrote:
> >
> > > On Fri, 2007-09-28 at 13:10 -0700, Andrew Morton wrote:
> > > > On Fri, 28 Sep 2007 15:52:28 -0400
> > > > Trond Myklebust <[EMAIL PROTECTED]> wrote:
> > > >
> > > > > On Fri, 2007-09-28 at 12:26 -0700, Andrew Morton wrote:
> > > > > > On Fri, 28 Sep 2007 15:16:11 -0400 Trond Myklebust <[EMAIL 
> > > > > > PROTECTED]> wrote:
> > > > > > > Looking back, they were getting caught up in
> > > > > > > balance_dirty_pages_ratelimited() and friends. See the attached
> > > > > > > example...
> > > > > >
> > > > > > that one is nfs-on-loopback, which is a special case, isn't it?
> > > > >
> > > > > I'm not sure that the hang that is illustrated here is so special. It 
> > > > > is
> > > > > an example of a bog-standard ext3 write, that ends up calling the NFS
> > > > > client, which is hanging. The fact that it happens to be hanging on 
> > > > > the
> > > > > nfsd process is more or less irrelevant here: the same thing could
> > > > > happen to any other process in the case where we have an NFS server 
> > > > > that
> > > > > is down.
> > > >
> > > > hm, so ext3 got stuck in nfs via __alloc_pages direct reclaim?
> > > >
> > > > We should be able to fix that by marking the backing device as
> > > > write-congested.  That'll have small race windows, but it should be a 
> > > > 99.9%
> > > > fix?
> > >
> > > No. The problem would rather appear to be that we're doing
> > > per-backing_dev writeback (if I read sync_sb_inodes() correctly), but
> > > we're measuring variables which are global to the VM. The backing device
> > > that we are selecting may not be writing out any dirty pages, in which
> > > case, we're just spinning in balance_dirty_pages_ratelimited().
> >
> > OK, so it's unrelated to page reclaim.
> >
> > > Should we therefore perhaps be looking at adding per-backing_dev stats
> > > too?
> >
> > That's what mm-per-device-dirty-threshold.patch and friends are doing.
> > Whether it works adequately is not really known at this time.
> > Unfortunately kernel developers don't test -mm much.
> >
>


Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)

2007-09-28 Thread Chakri n
Here is a snapshot of vmstats when the problem happened. I believe
this could help a little.

crash> kmem -V
   NR_FREE_PAGES: 680853
 NR_INACTIVE: 95380
   NR_ACTIVE: 26891
   NR_ANON_PAGES: 2507
  NR_FILE_MAPPED: 1832
   NR_FILE_PAGES: 119779
   NR_FILE_DIRTY: 0
NR_WRITEBACK: 18272
 NR_SLAB_RECLAIMABLE: 1305
NR_SLAB_UNRECLAIMABLE: 2085
NR_PAGETABLE: 123
 NR_UNSTABLE_NFS: 0
   NR_BOUNCE: 0
 NR_VMSCAN_WRITE: 0

In my testing, I always saw the processes are waiting in
balance_dirty_pages_ratelimited(), never in throttle_vm_writeout()
path.

But this could be because I have about 4Gig of memory in the system
and plenty of mem is still available around.

I will rerun the test limiting memory to 1024MB and lets see if it
takes in any different path.

Thanks
--Chakri


On 9/28/07, Andrew Morton <[EMAIL PROTECTED]> wrote:
> On Fri, 28 Sep 2007 16:32:18 -0400
> Trond Myklebust <[EMAIL PROTECTED]> wrote:
>
> > On Fri, 2007-09-28 at 13:10 -0700, Andrew Morton wrote:
> > > On Fri, 28 Sep 2007 15:52:28 -0400
> > > Trond Myklebust <[EMAIL PROTECTED]> wrote:
> > >
> > > > On Fri, 2007-09-28 at 12:26 -0700, Andrew Morton wrote:
> > > > > On Fri, 28 Sep 2007 15:16:11 -0400 Trond Myklebust <[EMAIL 
> > > > > PROTECTED]> wrote:
> > > > > > Looking back, they were getting caught up in
> > > > > > balance_dirty_pages_ratelimited() and friends. See the attached
> > > > > > example...
> > > > >
> > > > > that one is nfs-on-loopback, which is a special case, isn't it?
> > > >
> > > > I'm not sure that the hang that is illustrated here is so special. It is
> > > > an example of a bog-standard ext3 write, that ends up calling the NFS
> > > > client, which is hanging. The fact that it happens to be hanging on the
> > > > nfsd process is more or less irrelevant here: the same thing could
> > > > happen to any other process in the case where we have an NFS server that
> > > > is down.
> > >
> > > hm, so ext3 got stuck in nfs via __alloc_pages direct reclaim?
> > >
> > > We should be able to fix that by marking the backing device as
> > > write-congested.  That'll have small race windows, but it should be a 
> > > 99.9%
> > > fix?
> >
> > No. The problem would rather appear to be that we're doing
> > per-backing_dev writeback (if I read sync_sb_inodes() correctly), but
> > we're measuring variables which are global to the VM. The backing device
> > that we are selecting may not be writing out any dirty pages, in which
> > case, we're just spinning in balance_dirty_pages_ratelimited().
>
> OK, so it's unrelated to page reclaim.
>
> > Should we therefore perhaps be looking at adding per-backing_dev stats
> > too?
>
> That's what mm-per-device-dirty-threshold.patch and friends are doing.
> Whether it works adequately is not really known at this time.
> Unfortunately kernel developers don't test -mm much.
>


Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)

2007-09-28 Thread Andrew Morton
On Fri, 28 Sep 2007 16:32:18 -0400
Trond Myklebust <[EMAIL PROTECTED]> wrote:

> On Fri, 2007-09-28 at 13:10 -0700, Andrew Morton wrote:
> > On Fri, 28 Sep 2007 15:52:28 -0400
> > Trond Myklebust <[EMAIL PROTECTED]> wrote:
> > 
> > > On Fri, 2007-09-28 at 12:26 -0700, Andrew Morton wrote:
> > > > On Fri, 28 Sep 2007 15:16:11 -0400 Trond Myklebust <[EMAIL PROTECTED]> 
> > > > wrote:
> > > > > Looking back, they were getting caught up in
> > > > > balance_dirty_pages_ratelimited() and friends. See the attached
> > > > > example...
> > > > 
> > > > that one is nfs-on-loopback, which is a special case, isn't it?
> > > 
> > > I'm not sure that the hang that is illustrated here is so special. It is
> > > an example of a bog-standard ext3 write, that ends up calling the NFS
> > > client, which is hanging. The fact that it happens to be hanging on the
> > > nfsd process is more or less irrelevant here: the same thing could
> > > happen to any other process in the case where we have an NFS server that
> > > is down.
> > 
> > hm, so ext3 got stuck in nfs via __alloc_pages direct reclaim?
> > 
> > We should be able to fix that by marking the backing device as
> > write-congested.  That'll have small race windows, but it should be a 99.9%
> > fix?
> 
> No. The problem would rather appear to be that we're doing
> per-backing_dev writeback (if I read sync_sb_inodes() correctly), but
> we're measuring variables which are global to the VM. The backing device
> that we are selecting may not be writing out any dirty pages, in which
> case, we're just spinning in balance_dirty_pages_ratelimited().

OK, so it's unrelated to page reclaim.

> Should we therefore perhaps be looking at adding per-backing_dev stats
> too?

That's what mm-per-device-dirty-threshold.patch and friends are doing. 
Whether it works adequately is not really known at this time. 
Unfortunately kernel developers don't test -mm much.  


Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)

2007-09-28 Thread Trond Myklebust
On Fri, 2007-09-28 at 13:10 -0700, Andrew Morton wrote:
> On Fri, 28 Sep 2007 15:52:28 -0400
> Trond Myklebust <[EMAIL PROTECTED]> wrote:
> 
> > On Fri, 2007-09-28 at 12:26 -0700, Andrew Morton wrote:
> > > On Fri, 28 Sep 2007 15:16:11 -0400 Trond Myklebust <[EMAIL PROTECTED]> 
> > > wrote:
> > > > Looking back, they were getting caught up in
> > > > balance_dirty_pages_ratelimited() and friends. See the attached
> > > > example...
> > > 
> > > that one is nfs-on-loopback, which is a special case, isn't it?
> > 
> > I'm not sure that the hang that is illustrated here is so special. It is
> > an example of a bog-standard ext3 write, that ends up calling the NFS
> > client, which is hanging. The fact that it happens to be hanging on the
> > nfsd process is more or less irrelevant here: the same thing could
> > happen to any other process in the case where we have an NFS server that
> > is down.
> 
> hm, so ext3 got stuck in nfs via __alloc_pages direct reclaim?
> 
> We should be able to fix that by marking the backing device as
> write-congested.  That'll have small race windows, but it should be a 99.9%
> fix?

No. The problem would rather appear to be that we're doing
per-backing_dev writeback (if I read sync_sb_inodes() correctly), but
we're measuring variables which are global to the VM. The backing device
that we are selecting may not be writing out any dirty pages, in which
case, we're just spinning in balance_dirty_pages_ratelimited().
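A toy model of the difference being described: one global dirty threshold throttles every writer, while a per-backing_dev share (roughly the direction of the -mm per-bdi patches) only throttles writers to the device that is actually behind. All names and numbers are invented for illustration.

#include <stdio.h>

struct bdi_model {
        const char *name;
        long dirty;           /* pages dirty against this device              */
        long writeout_share;  /* recent share of completed writeout, percent  */
};

int main(void)
{
        const long dirty_thresh = 1000;          /* global limit, in pages */
        struct bdi_model bdis[] = {
                { "nfs-server-down", 980,   0 }, /* makes no progress      */
                { "local-ext3",       40, 100 }, /* does all the writeout  */
        };
        long global_dirty = bdis[0].dirty + bdis[1].dirty;

        for (int i = 0; i < 2; i++) {
                /* global check: both writers throttle, even the ext3 one */
                int global_throttle = global_dirty > dirty_thresh;

                /* per-bdi check: each device gets a share of the limit
                 * proportional to the writeout it is completing */
                long bdi_limit = dirty_thresh * bdis[i].writeout_share / 100;
                int bdi_throttle = bdis[i].dirty > bdi_limit;

                printf("%-16s  global: %-8s  per-bdi: %s\n", bdis[i].name,
                       global_throttle ? "throttle" : "ok",
                       bdi_throttle ? "throttle" : "ok");
        }
        return 0;
}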

Should we therefore perhaps be looking at adding per-backing_dev stats
too?

Trond



Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)

2007-09-28 Thread Daniel Phillips
On Friday 28 September 2007 12:52, Trond Myklebust wrote:
> I'm not sure that the hang that is illustrated here is so special. It
> is an example of a bog-standard ext3 write, that ends up calling the
> NFS client, which is hanging. The fact that it happens to be hanging
> on the nfsd process is more or less irrelevant here: the same thing
> could happen to any other process in the case where we have an NFS
> server that is down.

Hi Trond,

Could you clarify what you meant by "calling the NFS client"?  I don't 
see any direct call in the posted backtrace.

Regards,

Daniel


Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)

2007-09-28 Thread Andrew Morton
On Fri, 28 Sep 2007 15:52:28 -0400
Trond Myklebust <[EMAIL PROTECTED]> wrote:

> On Fri, 2007-09-28 at 12:26 -0700, Andrew Morton wrote:
> > On Fri, 28 Sep 2007 15:16:11 -0400 Trond Myklebust <[EMAIL PROTECTED]> 
> > wrote:
> > > Looking back, they were getting caught up in
> > > balance_dirty_pages_ratelimited() and friends. See the attached
> > > example...
> > 
> > that one is nfs-on-loopback, which is a special case, isn't it?
> 
> I'm not sure that the hang that is illustrated here is so special. It is
> an example of a bog-standard ext3 write, that ends up calling the NFS
> client, which is hanging. The fact that it happens to be hanging on the
> nfsd process is more or less irrelevant here: the same thing could
> happen to any other process in the case where we have an NFS server that
> is down.

hm, so ext3 got stuck in nfs via __alloc_pages direct reclaim?

We should be able to fix that by marking the backing device as
write-congested.  That'll have small race windows, but it should be a 99.9%
fix?



Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)

2007-09-28 Thread Trond Myklebust
On Fri, 2007-09-28 at 12:26 -0700, Andrew Morton wrote:
> On Fri, 28 Sep 2007 15:16:11 -0400 Trond Myklebust <[EMAIL PROTECTED]> wrote:
> > Looking back, they were getting caught up in
> > balance_dirty_pages_ratelimited() and friends. See the attached
> > example...
> 
> that one is nfs-on-loopback, which is a special case, isn't it?

I'm not sure that the hang that is illustrated here is so special. It is
an example of a bog-standard ext3 write, that ends up calling the NFS
client, which is hanging. The fact that it happens to be hanging on the
nfsd process is more or less irrelevant here: the same thing could
happen to any other process in the case where we have an NFS server that
is down.

> NFS on loopback used to hang, but then we fixed it.  It looks like we
> broke it again sometime in the intervening four years or so.

It has been quirky all through the 2.6.x series because of this issue.

Cheers
  Trond

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)

2007-09-28 Thread Andrew Morton
On Fri, 28 Sep 2007 15:16:11 -0400 Trond Myklebust <[EMAIL PROTECTED]> wrote:

> On Fri, 2007-09-28 at 11:49 -0700, Andrew Morton wrote:
> > On Fri, 28 Sep 2007 13:00:53 -0400 Trond Myklebust <[EMAIL PROTECTED]> 
> > wrote:
> > > Do these patches also cause the memory reclaimers to steer clear of
> > > devices that are congested (and stop waiting on a congested device if
> > > they see that it remains congested for a long period of time)? Most of
> > > the collateral blocking I see tends to happen in memory allocation...
> > > 
> > 
> > No, they don't attempt to do that, but I suspect they put in place
> > infrastructure which could be used to improve direct-reclaimer latency.  In
> > the throttle_vm_writeout() path, at least.
> > 
> > Do you know where the stalls are occurring?  throttle_vm_writeout(), or via
> > direct calls to congestion_wait() from page_alloc.c and vmscan.c?  (running
> > sysrq-w five or ten times will probably be enough to determine this)
> 
> Looking back, they were getting caught up in
> balance_dirty_pages_ratelimited() and friends. See the attached
> example...

that one is nfs-on-loopback, which is a special case, isn't it?

NFS on loopback used to hang, but then we fixed it.  It looks like we
broke it again sometime in the intervening four years or so.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)

2007-09-28 Thread Andrew Morton
On Fri, 28 Sep 2007 20:48:59 +0200 Peter Zijlstra <[EMAIL PROTECTED]> wrote:

> 
> On Fri, 2007-09-28 at 11:49 -0700, Andrew Morton wrote:
> 
> > Do you know where the stalls are occurring?  throttle_vm_writeout(), or via
> > direct calls to congestion_wait() from page_alloc.c and vmscan.c?  (running
> > sysrq-w five or ten times will probably be enough to determine this)
> 
> would it make sense to instrument congestion_wait() callsites with
> vmstats?

Better than nothing, but it isn't a great fit: we'd need one vmstat counter
per congestion_wait() callsite, and it's all rather specific to the
kernel-of-the-day.

taskstats delay accounting isn't useful either - it will aggregate all the
schedule() callsites.

profile=sleep is just about ideal for this, isn't it?  I suspect that most
people don't know it's there, or forgot about it.

It could be that profile=sleep just tells us "you're spending a lot of time
in io_schedule()" or congestion_wait(), so perhaps we need to teach it to
go for a walk up the stack somehow.

But lockdep knows how to do that already so perhaps we (ie: you ;)) can
bolt sleep instrumentation onto lockdep as we (ie you ;)) did with the
lockstat stuff?

(Searches for the lockstat documentation)

Did we forget to do that?
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)

2007-09-28 Thread Trond Myklebust
On Fri, 2007-09-28 at 11:49 -0700, Andrew Morton wrote:
> On Fri, 28 Sep 2007 13:00:53 -0400 Trond Myklebust <[EMAIL PROTECTED]> wrote:
> > Do these patches also cause the memory reclaimers to steer clear of
> > devices that are congested (and stop waiting on a congested device if
> > they see that it remains congested for a long period of time)? Most of
> > the collateral blocking I see tends to happen in memory allocation...
> > 
> 
> No, they don't attempt to do that, but I suspect they put in place
> infrastructure which could be used to improve direct-reclaimer latency.  In
> the throttle_vm_writeout() path, at least.
> 
> Do you know where the stalls are occurring?  throttle_vm_writeout(), or via
> direct calls to congestion_wait() from page_alloc.c and vmscan.c?  (running
> sysrq-w five or ten times will probably be enough to determine this)

Looking back, they were getting caught up in
balance_dirty_pages_ratelimited() and friends. See the attached
example...

Cheers
  Trond
--- Begin Message ---
Hi,

I am testing NFS on loopback, which locks up the entire system with the 2.6.23-rc6 kernel.

I have mounted a local ext3 partition using loopback NFS (version 3)
and started my test program. The test program forks 20 threads,
allocates 10MB for each thread, and writes & reads a file on the loopback
NFS mount. After running for about 5 min, I cannot even log in to the
machine. Commands like ps etc. hang in a live session.

The machine is a DELL 1950 with 4Gig of RAM, so there is plenty of RAM
& CPU to play around and no other io/heavy processes are running on
the system.

vmstat output shows no buffers are actually getting transferred in or
out and iowait is 100%.

[EMAIL PROTECTED] ~]# vmstat 1
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff   cache   si   so    bi    bo   in   cs us sy id wa st
 0 24    116 110080  11132 3045664    0    0     0     0   28  345  0  1  0 99  0
 0 24    116 110080  11132 3045664    0    0     0     0    5  329  0  0  0 100 0
 0 24    116 110080  11132 3045664    0    0     0     0   26  336  0  0  0 100 0
 0 24    116 110080  11132 3045664    0    0     0     0    8  335  0  0  0 100 0
 0 24    116 110080  11132 3045664    0    0     0     0   26  352  0  0  0 100 0
 0 24    116 110080  11132 3045664    0    0     0     0    8  351  0  0  0 100 0
 0 24    116 110080  11132 3045664    0    0     0     0   23  358  0  1  0 99  0
 0 24    116 110080  11132 3045664    0    0     0     0   10  350  0  0  0 100 0
 0 24    116 110080  11132 3045664    0    0     0     0   26  363  0  0  0 100 0
 0 24    116 110080  11132 3045664    0    0     0     0    8  346  0  1  0 99  0
 0 24    116 110080  11132 3045664    0    0     0     0   26  360  0  0  0 100 0
 0 24    116 110080  11140 3045656    0    0     8     0   11  345  0  0  0 100 0
 0 24    116 110080  11140 3045664    0    0     0     0   27  355  0  0  2 97  0
 0 24    116 110080  11140 3045664    0    0     0     0    9  330  0  0  0 100 0
 0 24    116 110080  11140 3045664    0    0     0     0   26  358  0  0  0 100 0


The following is the backtrace of
1. one of the threads of my test program
2. nfsd daemon and
3. a generic command like pstree, after the machine hangs:
-
crash> bt 3252
PID: 3252   TASK: f6f3c610  CPU: 0   COMMAND: "test"
 #0 [f6bdcc10] schedule at c0624a34
 #1 [f6bdcc84] schedule_timeout at c06250ee
 #2 [f6bdccc8] io_schedule_timeout at c0624c15
 #3 [f6bdccdc] congestion_wait at c045eb7d
 #4 [f6bdcd00] balance_dirty_pages_ratelimited_nr at c045ab91
 #5 [f6bdcd54] generic_file_buffered_write at c0457148
 #6 [f6bdcde8] __generic_file_aio_write_nolock at c04576e5
 #7 [f6bdce40] try_to_wake_up at c042342b
 #8 [f6bdce5c] generic_file_aio_write at c0457799
 #9 [f6bdce8c] nfs_file_write at f8c25cee
#10 [f6bdced0] do_sync_write at c0472e27
#11 [f6bdcf7c] vfs_write at c0473689
#12 [f6bdcf98] sys_write at c0473c95
#13 [f6bdcfb4] sysenter_entry at c0404ddf
EAX: 0004  EBX: 0013  ECX: a4966008  EDX: 0098
DS:  007b  ESI: 0098  ES:  007b  EDI: a4966008
SS:  007b  ESP: a5ae6ec0  EBP: a5ae6ef0
CS:  0073  EIP: b7eed410  ERR: 0004  EFLAGS: 0246
crash> bt 3188
PID: 3188   TASK: f74c4000  CPU: 1   COMMAND: "nfsd"
 #0 [f6836c7c] schedule at c0624a34
 #1 [f6836cf0] __mutex_lock_slowpath at c062543d
 #2 [f6836d0c] mutex_lock at c0625326
 #3 [f6836d18] generic_file_aio_write at c0457784
 #4 [f6836d48] ext3_file_write at ffd7
 #5 [f6836d64] do_sync_readv_writev at c0472d1f
 #6 [f6836e08] do_readv_writev at c0473486
 #7 [f6836e6c] vfs_writev at c047358e
 #8 [f6836e7c] nfsd_vfs_write at f8e7f8d7
 #9 [f6836ee0] nfsd_write at f8e80139
#10 [f6836f10] nfsd3_proc_write at f8e86afd
#11 [f6836f44] nfsd_dispatch at f8e7c20c
#12 [f6836f6c] svc_process at f89c18e0
#13 [f6836fbc] nfsd at f8e7c794
#14 [f6836fe4] kernel_thread_helper at c0405a35
crash> ps|grep ps
234  

Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)

2007-09-28 Thread Peter Zijlstra

On Fri, 2007-09-28 at 11:49 -0700, Andrew Morton wrote:

> Do you know where the stalls are occurring?  throttle_vm_writeout(), or via
> direct calls to congestion_wait() from page_alloc.c and vmscan.c?  (running
> sysrq-w five or ten times will probably be enough to determine this)

would it make sense to instrument congestion_wait() callsites with
vmstats?

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)

2007-09-28 Thread Andrew Morton
On Fri, 28 Sep 2007 13:00:53 -0400 Trond Myklebust <[EMAIL PROTECTED]> wrote:

> On Thu, 2007-09-27 at 23:50 -0700, Andrew Morton wrote:
> 
> > Actually we perhaps could address this at the VFS level in another way. 
> > Processes which are writing to the dead NFS server will eventually block in
> > balance_dirty_pages() once they've exceeded the memory limits and will
> > remain blocked until the server wakes up - that's the behaviour we want.
> > 
> > What we _don't_ want to happen is for other processes which are writing to
> > other, non-dead devices to get collaterally blocked.  We have patches which
> > might fix that queued for 2.6.24.  Peter?
> 
> Do these patches also cause the memory reclaimers to steer clear of
> devices that are congested (and stop waiting on a congested device if
> they see that it remains congested for a long period of time)? Most of
> the collateral blocking I see tends to happen in memory allocation...
> 

No, they don't attempt to do that, but I suspect they put in place
infrastructure which could be used to improve direct-reclaimer latency.  In
the throttle_vm_writeout() path, at least.

Do you know where the stalls are occurring?  throttle_vm_writeout(), or via
direct calls to congestion_wait() from page_alloc.c and vmscan.c?  (running
sysrq-w five or ten times will probably be enough to determine this)

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)

2007-09-28 Thread Andrew Morton
On Fri, 28 Sep 2007 07:28:52 -0600 [EMAIL PROTECTED] (Jonathan Corbet) wrote:

> Andrew wrote:
> > It's unrelated to the actual value of dirty_thresh: if the machine fills up
> > with dirty (or unstable) NFS pages then eventually new writers will block
> > until that condition clears.
> > 
> > 2.4 doesn't have this problem at low levels of dirty data because 2.4
> > VFS/MM doesn't account for NFS pages at all.
> 
> Is it really NFS-related?  I was trying to back up my 2.6.23-rc8 system
> to an external USB drive the other day when something flaked and the
> drive fell off the bus.  That, too, was sufficient to wedge the entire
> system, even though the only thing which needed the dead drive was one
> rsync process.  It's kind of a bummer to have to hit the reset button
> after the failure of (what should be) a non-critical piece of hardware.
> 
> Not that I have a fix to propose...:)
> 

That's a USB bug, surely.  What should happen is that the kernel attempts
writeback, gets an IO error and then your data gets lost.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)

2007-09-28 Thread Trond Myklebust
On Thu, 2007-09-27 at 23:50 -0700, Andrew Morton wrote:

> Actually we perhaps could address this at the VFS level in another way. 
> Processes which are writing to the dead NFS server will eventually block in
> balance_dirty_pages() once they've exceeded the memory limits and will
> remain blocked until the server wakes up - that's the behaviour we want.
> 
> What we _don't_ want to happen is for other processes which are writing to
> other, non-dead devices to get collaterally blocked.  We have patches which
> might fix that queued for 2.6.24.  Peter?

Do these patches also cause the memory reclaimers to steer clear of
devices that are congested (and stop waiting on a congested device if
they see that it remains congested for a long period of time)? Most of
the collateral blocking I see tends to happen in memory allocation...

Cheers
  Trond

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [linux-pm] Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)

2007-09-28 Thread Alan Stern
On Fri, 28 Sep 2007, Peter Zijlstra wrote:

> On Fri, 2007-09-28 at 07:28 -0600, Jonathan Corbet wrote:

> > Is it really NFS-related?  I was trying to back up my 2.6.23-rc8 system
> > to an external USB drive the other day when something flaked and the
> > drive fell off the bus.  That, too, was sufficient to wedge the entire
> > system, even though the only thing which needed the dead drive was one
> > rsync process.  It's kind of a bummer to have to hit the reset button
> > after the failure of (what should be) a non-critical piece of hardware.
> > 
> > Not that I have a fix to propose...:)
> 
> the per bdi work in -mm should make the system not drop dead.
> 
> Still, would a remove,re-insert of the usb media end up with the same
> bdi? That is, would it recognise as the same and resume the transfer.

Removal and replacement of the media might work.  I have never tried 
it.

But Jon described removal of the device, not the media.  Replacing the 
device definitely will not work.

Alan Stern

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)

2007-09-28 Thread Peter Zijlstra
On Fri, 2007-09-28 at 07:28 -0600, Jonathan Corbet wrote:
> Andrew wrote:
> > It's unrelated to the actual value of dirty_thresh: if the machine fills up
> > with dirty (or unstable) NFS pages then eventually new writers will block
> > until that condition clears.
> > 
> > 2.4 doesn't have this problem at low levels of dirty data because 2.4
> > VFS/MM doesn't account for NFS pages at all.
> 
> Is it really NFS-related?  I was trying to back up my 2.6.23-rc8 system
> to an external USB drive the other day when something flaked and the
> drive fell off the bus.  That, too, was sufficient to wedge the entire
> system, even though the only thing which needed the dead drive was one
> rsync process.  It's kind of a bummer to have to hit the reset button
> after the failure of (what should be) a non-critical piece of hardware.
> 
> Not that I have a fix to propose...:)

the per bdi work in -mm should make the system not drop dead.

Still, would a remove/re-insert of the usb media end up with the same
bdi? That is, would it be recognised as the same and resume the transfer?

Anyway, it would be grand (and dangerous) if we could provide for a
button that would just kill off all outstanding pages against a dead
device.




Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)

2007-09-28 Thread Jonathan Corbet
Andrew wrote:
> It's unrelated to the actual value of dirty_thresh: if the machine fills up
> with dirty (or unstable) NFS pages then eventually new writers will block
> until that condition clears.
> 
> 2.4 doesn't have this problem at low levels of dirty data because 2.4
> VFS/MM doesn't account for NFS pages at all.

Is it really NFS-related?  I was trying to back up my 2.6.23-rc8 system
to an external USB drive the other day when something flaked and the
drive fell off the bus.  That, too, was sufficient to wedge the entire
system, even though the only thing which needed the dead drive was one
rsync process.  It's kind of a bummer to have to hit the reset button
after the failure of (what should be) a non-critical piece of hardware.

Not that I have a fix to propose...:)

jon
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)

2007-09-28 Thread Chakri n
It works on .23-rc8-mm2 without any problems.

The "dd" process does not hang any more.

Thanks for all the help.

Cheers
--Chakri


On 9/28/07, Peter Zijlstra <[EMAIL PROTECTED]> wrote:
>  [ and one copy for the list too ]
>
> On Fri, 2007-09-28 at 02:20 -0700, Chakri n wrote:
> > It's 2.6.23-rc6.
>
> Could you try .23-rc8-mm2. It includes the per bdi stuff.
>
>
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)

2007-09-28 Thread Peter Zijlstra
 [ and one copy for the list too ]

On Fri, 2007-09-28 at 02:20 -0700, Chakri n wrote:
> It's 2.6.23-rc6.

Could you try .23-rc8-mm2. It includes the per bdi stuff.




Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)

2007-09-28 Thread Chakri n
It's 2.6.23-rc6.

Thanks
--Chakri

On 9/28/07, Peter Zijlstra <[EMAIL PROTECTED]> wrote:
> On Fri, 2007-09-28 at 02:01 -0700, Chakri n wrote:
> > Thanks for explaining the adaptive logic.
> >
> > > However other devices will at that moment try to maintain a limit of 0,
> > > which ends up being similar to a sync mount.
> > >
> > > So they'll not get stuck, but they will be slow.
> > >
> > >
> >
> > Sync should be ok, when the situation is bad like this and some one
> > hijacked all the buffers.
> >
> > But, I see my simple dd to write 10blocks on local disk never
> > completes even after 10 minutes.
> >
> > [EMAIL PROTECTED] ~]# dd if=/dev/zero of=/tmp/x count=10
> >
> > I think the process is completely stuck and is not progressing at all.
> >
> > Is something going wrong in the calculations where it does not fall
> > back to sync mode.
>
> What kernel is that?
>
>
>
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)

2007-09-28 Thread Peter Zijlstra
On Fri, 2007-09-28 at 02:01 -0700, Chakri n wrote:
> Thanks for explaining the adaptive logic.
> 
> > However other devices will at that moment try to maintain a limit of 0,
> > which ends up being similar to a sync mount.
> >
> > So they'll not get stuck, but they will be slow.
> >
> >
> 
> Sync should be ok, when the situation is bad like this and some one
> hijacked all the buffers.
> 
> But, I see my simple dd to write 10blocks on local disk never
> completes even after 10 minutes.
> 
> [EMAIL PROTECTED] ~]# dd if=/dev/zero of=/tmp/x count=10
> 
> I think the process is completely stuck and is not progressing at all.
> 
> Is something going wrong in the calculations where it does not fall
> back to sync mode.

What kernel is that?





Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)

2007-09-28 Thread Chakri n
Thanks for explaining the adaptive logic.

> However other devices will at that moment try to maintain a limit of 0,
> which ends up being similar to a sync mount.
>
> So they'll not get stuck, but they will be slow.
>
>

Sync should be ok, when the situation is bad like this and some one
hijacked all the buffers.

But I see my simple dd, writing 10 blocks on a local disk, never
completes even after 10 minutes.

[EMAIL PROTECTED] ~]# dd if=/dev/zero of=/tmp/x count=10

I think the process is completely stuck and is not progressing at all.

Is something going wrong in the calculations, such that it does not fall
back to sync mode?

Thanks
--Chakri

On 9/28/07, Peter Zijlstra <[EMAIL PROTECTED]> wrote:
> [ please don't top-post! ]
>
> On Fri, 2007-09-28 at 01:27 -0700, Chakri n wrote:
>
> > On 9/27/07, Peter Zijlstra <[EMAIL PROTECTED]> wrote:
> > > On Thu, 2007-09-27 at 23:50 -0700, Andrew Morton wrote:
> > >
> > > > What we _don't_ want to happen is for other processes which are writing 
> > > > to
> > > > other, non-dead devices to get collaterally blocked.  We have patches 
> > > > which
> > > > might fix that queued for 2.6.24.  Peter?
> > >
> > > Nasty problem, don't do that :-)
> > >
> > > But yeah, with per BDI dirty limits we get stuck at whatever ratio that
> > > NFS server/mount (?) has - which could be 100%. Other processes will
> > > then work almost synchronously against their BDIs but it should work.
> > >
> > > [ They will lower the NFS-BDI's ratio, but some fancy clipping code will
> > >   limit the other BDIs their dirty limit to not exceed the total limit.
> > >   And with all these NFS pages stuck, that will still be nothing. ]
> > >
> > Thanks.
> >
> > The BDI dirty limits sounds like a good idea.
> >
> > Is there already a patch for this, which I could try?
>
> v2.6.23-rc8-mm2
>
> > I believe it works like this,
> >
> > Each BDI, will have a limit. If the dirty_thresh exceeds the limit,
> > all the I/O on the block device will be synchronous.
> >
> > so, if I have sda & a NFS mount, the dirty limit can be different for
> > each of them.
> >
> > I can set dirty limit for
> >  -  sda to be 90% and
> >  -  NFS mount to be 50%.
> >
> > So, if the dirty limit is greater than 50%, NFS does synchronously,
> > but sda can work asynchronously, till dirty limit reaches 90%.
>
> Not quite, the system determines the limit itself in an adaptive
> fashion.
>
>   bdi_limit = total_limit * p_bdi
>
> Where p is a faction [0,1], and is determined by the relative writeout
> speed of the current BDI vs all other BDIs.
>
> So if you were to have 3 BDIs (sda, sdb and 1 nfs mount), and sda is
> idle, and the nfs mount gets twice as much traffic as sdb, the ratios
> will look like:
>
>  p_sda: 0
>  p_sdb: 1/3
>  p_nfs: 2/3
>
> Once the traffic exceeds the write speed of the device we build up a
> backlog and stuff gets throttled, so these proportions converge to the
> relative write speed of the BDIs when saturated with data.
>
> So what can happen in your case is that the NFS mount is the only one
> with traffic is will get a fraction of 1. If it then disconnects like in
> your case, it will still have all of the dirty limit pinned for NFS.
>
> However other devices will at that moment try to maintain a limit of 0,
> which ends up being similar to a sync mount.
>
> So they'll not get stuck, but they will be slow.
>
>
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)

2007-09-28 Thread Peter Zijlstra
[ please don't top-post! ]

On Fri, 2007-09-28 at 01:27 -0700, Chakri n wrote:

> On 9/27/07, Peter Zijlstra <[EMAIL PROTECTED]> wrote:
> > On Thu, 2007-09-27 at 23:50 -0700, Andrew Morton wrote:
> >
> > > What we _don't_ want to happen is for other processes which are writing to
> > > other, non-dead devices to get collaterally blocked.  We have patches 
> > > which
> > > might fix that queued for 2.6.24.  Peter?
> >
> > Nasty problem, don't do that :-)
> >
> > But yeah, with per BDI dirty limits we get stuck at whatever ratio that
> > NFS server/mount (?) has - which could be 100%. Other processes will
> > then work almost synchronously against their BDIs but it should work.
> >
> > [ They will lower the NFS-BDI's ratio, but some fancy clipping code will
> >   limit the other BDIs their dirty limit to not exceed the total limit.
> >   And with all these NFS pages stuck, that will still be nothing. ]
> >
> Thanks.
> 
> The BDI dirty limits sounds like a good idea.
> 
> Is there already a patch for this, which I could try?

v2.6.23-rc8-mm2

> I believe it works like this,
> 
> Each BDI, will have a limit. If the dirty_thresh exceeds the limit,
> all the I/O on the block device will be synchronous.
> 
> so, if I have sda & a NFS mount, the dirty limit can be different for
> each of them.
> 
> I can set dirty limit for
>  -  sda to be 90% and
>  -  NFS mount to be 50%.
> 
> So, if the dirty limit is greater than 50%, NFS does synchronously,
> but sda can work asynchronously, till dirty limit reaches 90%.

Not quite, the system determines the limit itself in an adaptive
fashion.

  bdi_limit = total_limit * p_bdi

Where p is a fraction in [0,1], determined by the relative writeout
speed of the current BDI vs all other BDIs.

So if you were to have 3 BDIs (sda, sdb and 1 nfs mount), and sda is
idle, and the nfs mount gets twice as much traffic as sdb, the ratios
will look like:

 p_sda: 0
 p_sdb: 1/3
 p_nfs: 2/3

Once the traffic exceeds the write speed of the device we build up a
backlog and stuff gets throttled, so these proportions converge to the
relative write speed of the BDIs when saturated with data.

So what can happen in your case is that the NFS mount, being the only one
with traffic, will get a fraction of 1. If it then disconnects, as in
your case, it will still have all of the dirty limit pinned for NFS.

However other devices will at that moment try to maintain a limit of 0,
which ends up being similar to a sync mount.

So they'll not get stuck, but they will be slow.
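
To make the arithmetic concrete, here is a trivial standalone sketch of that
calculation (illustrative only; the in-kernel code tracks a time-decaying
proportion rather than this static one, and the page counts are invented):

#include <stdio.h>

/* bdi_limit = total_limit * p_bdi, with p_bdi the BDI's share of writeout. */
struct bdi {
        const char *name;
        long pages_written;     /* recent writeout attributed to this BDI */
};

int main(void)
{
        struct bdi bdis[] = {
                { "sda", 0 },           /* idle                  */
                { "sdb", 1000 },        /* some traffic          */
                { "nfs", 2000 },        /* twice as much traffic */
        };
        const long total_limit = 120000;        /* global dirty limit, in pages */
        long total_written = 0;

        for (int i = 0; i < 3; i++)
                total_written += bdis[i].pages_written;

        for (int i = 0; i < 3; i++) {
                double p = total_written ?
                           (double)bdis[i].pages_written / total_written : 0;
                printf("%s: p=%.2f bdi_limit=%ld pages\n",
                       bdis[i].name, p, (long)(p * total_limit));
        }
        return 0;
}

This prints p_sda=0.00, p_sdb=0.33, p_nfs=0.67, matching the 0, 1/3, 2/3
split above.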




Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)

2007-09-28 Thread Chakri n
Thanks.

The BDI dirty limits sounds like a good idea.

Is there already a patch for this, which I could try?

I believe it works like this,

Each BDI, will have a limit. If the dirty_thresh exceeds the limit,
all the I/O on the block device will be synchronous.

so, if I have sda & a NFS mount, the dirty limit can be different for
each of them.

I can set dirty limit for
 -  sda to be 90% and
 -  NFS mount to be 50%.

So, if the dirty limit is greater than 50%, NFS does synchronously,
but sda can work asynchronously, till dirty limit reaches 90%.

Thanks
--Chakri

On 9/27/07, Peter Zijlstra <[EMAIL PROTECTED]> wrote:
> On Thu, 2007-09-27 at 23:50 -0700, Andrew Morton wrote:
>
> > What we _don't_ want to happen is for other processes which are writing to
> > other, non-dead devices to get collaterally blocked.  We have patches which
> > might fix that queued for 2.6.24.  Peter?
>
> Nasty problem, don't do that :-)
>
> But yeah, with per BDI dirty limits we get stuck at whatever ratio that
> NFS server/mount (?) has - which could be 100%. Other processes will
> then work almost synchronously against their BDIs but it should work.
>
> [ They will lower the NFS-BDI's ratio, but some fancy clipping code will
>   limit the other BDIs their dirty limit to not exceed the total limit.
>   And with all these NFS pages stuck, that will still be nothing. ]
>
>
>
>
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)

2007-09-28 Thread Peter Zijlstra
On Thu, 2007-09-27 at 23:50 -0700, Andrew Morton wrote:

> What we _don't_ want to happen is for other processes which are writing to
> other, non-dead devices to get collaterally blocked.  We have patches which
> might fix that queued for 2.6.24.  Peter?

Nasty problem, don't do that :-)

But yeah, with per BDI dirty limits we get stuck at whatever ratio that
NFS server/mount (?) has - which could be 100%. Other processes will
then work almost synchronously against their BDIs but it should work.

[ They will lower the NFS-BDI's ratio, but some fancy clipping code will
  limit the other BDIs their dirty limit to not exceed the total limit.
  And with all these NFS pages stuck, that will still be nothing. ]






Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)

2007-09-28 Thread Andrew Morton
On Thu, 27 Sep 2007 23:32:36 -0700 "Chakri n" <[EMAIL PROTECTED]> wrote:

> Hi,
> 
> In my testing, a unresponsive file system can hang all I/O in the system.
> This is not seen in 2.4.
> 
> I started 20 threads doing I/O on a NFS share. They are just doing 4K
> writes in a loop.
> 
> Now I stop NFS server hosting the NFS share and start a
> "dd" process to write a file on local EXT3 file system.
> 
> # dd if=/dev/zero of=/tmp/x count=1000
> 
> This process never progresses.

yup.

> There is plenty of HIGH MEMORY available in the system, but this
> process never progresses.
> 
> ...
> 
> The problem seems to be in balance_dirty_pages, which calculates
> dirty_thresh based on only ZONE_NORMAL. The same scenario works fine
> in 2.4. The dd processes finishes in no time.
> NFS file systems can go offline, due to multiple reasons, a failed
> switch, filer etc, but that should not effect other file systems in
> the machine.
> Can this behavior be fenced?, can the buffer cache be tuned so that
> other processes do not see the effect?

It's unrelated to the actual value of dirty_thresh: if the machine fills up
with dirty (or unstable) NFS pages then eventually new writers will block
until that condition clears.

2.4 doesn't have this problem at low levels of dirty data because 2.4
VFS/MM doesn't account for NFS pages at all.

I'm not sure what we can do about this from a design perspective, really. 
We have data floating about in memory which we're not allowed to discard
and if we allow it to increase without bound it will eventually either
wedge userspace _anyway_ or it will take the machine down, resulting in
data loss.

What it would be nice to do would be to write that data to local disk if
poss, then reclaim it.  Perhaps David Howells' fscache code can do that (or
could be tweaked to do so).

If you really want to fill all memory with pages which are dirty against a
dead NFS server then you can manually increase
/proc/sys/vm/dirty_background_ratio and dirty_ratio - that should give you
the 2.4 behaviour.
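
For completeness, a minimal sketch of bumping those two knobs from a program
(the values here are arbitrary examples, not recommendations; echoing into
the files as root does the same thing):

#include <stdio.h>

static int write_knob(const char *path, int val)
{
        FILE *f = fopen(path, "w");     /* needs root */

        if (!f) {
                perror(path);
                return -1;
        }
        fprintf(f, "%d\n", val);
        return fclose(f);
}

int main(void)
{
        write_knob("/proc/sys/vm/dirty_background_ratio", 60);
        write_knob("/proc/sys/vm/dirty_ratio", 80);
        return 0;
}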




Actually we perhaps could address this at the VFS level in another way. 
Processes which are writing to the dead NFS server will eventually block in
balance_dirty_pages() once they've exceeded the memory limits and will
remain blocked until the server wakes up - that's the behaviour we want.

What we _don't_ want to happen is for other processes which are writing to
other, non-dead devices to get collaterally blocked.  We have patches which
might fix that queued for 2.6.24.  Peter?
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)

2007-09-28 Thread Chakri n
Hi,

In my testing, an unresponsive file system can hang all I/O in the system.
This is not seen in 2.4.

I started 20 threads doing I/O on an NFS share. They are just doing 4K
writes in a loop.
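
A rough sketch of such a load (the mount point, file names, per-thread size
cap and thread count are illustrative, not the actual test program):

#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define NTHREADS 20
#define CHUNK    4096
#define MAXSIZE  (10 * 1024 * 1024)     /* keep each file around 10MB */

static void *writer(void *arg)
{
        char path[64], buf[CHUNK];
        int fd;

        memset(buf, 'x', sizeof(buf));
        snprintf(path, sizeof(path), "/mnt/nfs/testfile.%ld", (long)arg);
        fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
        if (fd < 0) {
                perror(path);
                return NULL;
        }
        for (;;) {                      /* 4K writes in a loop */
                if (write(fd, buf, sizeof(buf)) < 0) {
                        perror("write");
                        break;
                }
                if (lseek(fd, 0, SEEK_CUR) >= MAXSIZE)
                        lseek(fd, 0, SEEK_SET);
        }
        close(fd);
        return NULL;
}

int main(void)
{
        pthread_t tid[NTHREADS];

        for (long i = 0; i < NTHREADS; i++)
                pthread_create(&tid[i], NULL, writer, (void *)i);
        for (int i = 0; i < NTHREADS; i++)
                pthread_join(tid[i], NULL);
        return 0;
}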

Now I stop NFS server hosting the NFS share and start a
"dd" process to write a file on local EXT3 file system.

# dd if=/dev/zero of=/tmp/x count=1000

This process never progresses.
There is plenty of HIGH MEMORY available in the system, but this
process never progresses.

# free
             total       used       free     shared    buffers     cached
Mem:       3238004     609340    2628664          0      15136     551024
-/+ buffers/cache:      43180    3194824
Swap:      4096532          0    4096532

vmstat on the machine:
# vmstat
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd    free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0 21      0 2628416  15152 551024    0    0     0     0   28  344  0  0  0 100 0
 0 21      0 2628416  15152 551024    0    0     0     0    8  340  0  0  0 100 0
 0 21      0 2628416  15152 551024    0    0     0     0   26  343  0  0  0 100 0
 0 21      0 2628416  15152 551024    0    0     0     0    8  341  0  0  0 100 0
 0 21      0 2628416  15152 551024    0    0     0     0   26  357  0  0  0 100 0
 0 21      0 2628416  15152 551024    0    0     0     0    8  325  0  0  0 100 0
 0 21      0 2628416  15152 551024    0    0     0     0   26  343  0  0  0 100 0
 0 21      0 2628416  15152 551024    0    0     0     0    8  325  0  0  0 100 0

The problem seems to be in balance_dirty_pages, which calculates
dirty_thresh based on only ZONE_NORMAL. The same scenario works fine
in 2.4. The dd process finishes in no time.
NFS file systems can go offline due to multiple reasons (a failed
switch, filer, etc.), but that should not affect other file systems in
the machine.
Can this behavior be fenced? Can the buffer cache be tuned so that
other processes do not see the effect?

The following is the back trace of the processes:
--
PID: 3552   TASK: cb1fc610  CPU: 0   COMMAND: "dd"
 #0 [f5c04c38] schedule at c0624a34
 #1 [f5c04cac] schedule_timeout at c06250ee
 #2 [f5c04cf0] io_schedule_timeout at c0624c15
 #3 [f5c04d04] congestion_wait at c045eb7d
 #4 [f5c04d28] balance_dirty_pages_ratelimited_nr at c045ab91
 #5 [f5c04d7c] generic_file_buffered_write at c0457148
 #6 [f5c04e10] __generic_file_aio_write_nolock at c04576e5
 #7 [f5c04e84] generic_file_aio_write at c0457799
 #8 [f5c04eb4] ext3_file_write at ffd7
 #9 [f5c04ed0] do_sync_write at c0472e27
#10 [f5c04f7c] vfs_write at c0473689
#11 [f5c04f98] sys_write at c0473c95
#12 [f5c04fb4] sysenter_entry at c0404ddf
--
PID: 3091  TASK: cb1f0100 CPU: 1   COMMAND: "test"
 #0 [f6050c10] schedule at c0624a34
 #1 [f6050c84] schedule_timeout at c06250ee
 #2 [f6050cc8] io_schedule_timeout at c0624c15
 #3 [f6050cdc] congestion_wait at c045eb7d
 #4 [f6050d00] balance_dirty_pages_ratelimited_nr at c045ab91
 #5 [f6050d54] generic_file_buffered_write at c0457148
 #6 [f6050de8] __generic_file_aio_write_nolock at c04576e5
 #7 [f6050e40] enqueue_entity at c042131f
 #8 [f6050e5c] generic_file_aio_write at c0457799
 #9 [f6050e8c] nfs_file_write at f8f90cee
#10 [f6050e9c] getnstimeofday at c043d3f7
#11 [f6050ed0] do_sync_write at c0472e27
#12 [f6050f7c] vfs_write at c0473689
#13 [f6050f98] sys_write at c0473c95
#14 [f6050fb4] sysenter_entry at c0404ddf

Thanks
--Chakri
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)

2007-09-28 Thread Andrew Morton
On Thu, 27 Sep 2007 23:32:36 -0700 Chakri n [EMAIL PROTECTED] wrote:

 Hi,
 
 In my testing, a unresponsive file system can hang all I/O in the system.
 This is not seen in 2.4.
 
 I started 20 threads doing I/O on a NFS share. They are just doing 4K
 writes in a loop.
 
 Now I stop NFS server hosting the NFS share and start a
 dd process to write a file on local EXT3 file system.
 
 # dd if=/dev/zero of=/tmp/x count=1000
 
 This process never progresses.

yup.

 There is plenty of HIGH MEMORY available in the system, but this
 process never progresses.
 
 ...
 
 The problem seems to be in balance_dirty_pages, which calculates
 dirty_thresh based on only ZONE_NORMAL. The same scenario works fine
 in 2.4. The dd processes finishes in no time.
 NFS file systems can go offline, due to multiple reasons, a failed
 switch, filer etc, but that should not effect other file systems in
 the machine.
 Can this behavior be fenced?, can the buffer cache be tuned so that
 other processes do not see the effect?

It's unrelated to the actual value of dirty_thresh: if the machine fills up
with dirty (or unstable) NFS pages then eventually new writers will block
until that condition clears.

2.4 doesn't have this problem at low levels of dirty data because 2.4
VFS/MM doesn't account for NFS pages at all.

I'm not sure what we can do about this from a design perspective, really. 
We have data floating about in memory which we're not allowed to discard
and if we allow it to increase without bound it will eventually either
wedge userspace _anyway_ or it will take the machine down, resulting in
data loss.

What it would be nice to do would be to write that data to local disk if
poss, then reclaim it.  Perhaps David Howells' fscache code can do that (or
could be tweaked to do so).

If you really want to fill all memory with pages whic are dirty against a
dead NFS server then you can manually increase
/proc/sys/vm/dirty_background_ratio and dirty_ratio - that should give you
the 2.4 behaviour.


thinks

Actually we perhaps could address this at the VFS level in another way. 
Processes which are writing to the dead NFS server will eventually block in
balance_dirty_pages() once they've exceeded the memory limits and will
remain blocked until the server wakes up - that's the behaviour we want.

What we _don't_ want to happen is for other processes which are writing to
other, non-dead devices to get collaterally blocked.  We have patches which
might fix that queued for 2.6.24.  Peter?
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)

2007-09-28 Thread Peter Zijlstra
On Thu, 2007-09-27 at 23:50 -0700, Andrew Morton wrote:

 What we _don't_ want to happen is for other processes which are writing to
 other, non-dead devices to get collaterally blocked.  We have patches which
 might fix that queued for 2.6.24.  Peter?

Nasty problem, don't do that :-)

But yeah, with per BDI dirty limits we get stuck at whatever ratio that
NFS server/mount (?) has - which could be 100%. Other processes will
then work almost synchronously against their BDIs but it should work.

[ They will lower the NFS-BDI's ratio, but some fancy clipping code will
  limit the other BDIs their dirty limit to not exceed the total limit.
  And with all these NFS pages stuck, that will still be nothing. ]




signature.asc
Description: This is a digitally signed message part


A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)

2007-09-28 Thread Chakri n
Hi,

In my testing, a unresponsive file system can hang all I/O in the system.
This is not seen in 2.4.

I started 20 threads doing I/O on a NFS share. They are just doing 4K
writes in a loop.

Now I stop NFS server hosting the NFS share and start a
dd process to write a file on local EXT3 file system.

# dd if=/dev/zero of=/tmp/x count=1000

This process never progresses.
There is plenty of HIGH MEMORY available in the system, but this
process never progresses.

# free
total   used  free
sharedbuffers cached
Mem:   3238004 6093402628664  0  15136
551024
-/+ buffers/cache:  431803194824
Swap:  4096532  04096532

vmstat on the machine:
# vmstat
procs ---memory-- ---swap-- -io --system--
-cpu--
 r  b   swpd   free   buff  cache   si   sobibo   in   cs us sy id wa st
 0 21  0 2628416  15152 55102400 0 0   28  344  0
0  0 100  0
 0 21  0 2628416  15152 55102400 0 08  340  0
0  0 100  0
 0 21  0 2628416  15152 55102400 0 0   26  343  0
0  0 100  0
 0 21  0 2628416  15152 55102400 0 08  341  0
0  0 100  0
 0 21  0 2628416  15152 55102400 0 0   26  357  0
0  0 100  0
 0 21  0 2628416  15152 55102400 0 08  325  0
0  0 100  0
 0 21  0 2628416  15152 55102400 0 0   26  343  0
0  0 100  0
 0 21  0 2628416  15152 55102400 0 08  325  0
0  0 100  0

The problem seems to be in balance_dirty_pages, which calculates
dirty_thresh based on only ZONE_NORMAL. The same scenario works fine
in 2.4. The dd processes finishes in no time.
NFS file systems can go offline, due to multiple reasons, a failed
switch, filer etc, but that should not effect other file systems in
the machine.
Can this behavior be fenced?, can the buffer cache be tuned so that
other processes do not see the effect?

The following is the back trace of the processes:
--
PID: 3552   TASK: cb1fc610  CPU: 0   COMMAND: dd
 #0 [f5c04c38] schedule at c0624a34
 #1 [f5c04cac] schedule_timeout at c06250ee
 #2 [f5c04cf0] io_schedule_timeout at c0624c15
 #3 [f5c04d04] congestion_wait at c045eb7d
 #4 [f5c04d28] balance_dirty_pages_ratelimited_nr at c045ab91
 #5 [f5c04d7c] generic_file_buffered_write at c0457148
 #6 [f5c04e10] __generic_file_aio_write_nolock at c04576e5
 #7 [f5c04e84] generic_file_aio_write at c0457799
 #8 [f5c04eb4] ext3_file_write at ffd7
 #9 [f5c04ed0] do_sync_write at c0472e27
#10 [f5c04f7c] vfs_write at c0473689
#11 [f5c04f98] sys_write at c0473c95
#12 [f5c04fb4] sysenter_entry at c0404ddf
--
PID: 3091  TASK: cb1f0100 CPU: 1   COMMAND: test
 #0 [f6050c10] schedule at c0624a34
 #1 [f6050c84] schedule_timeout at c06250ee
 #2 [f6050cc8] io_schedule_timeout at c0624c15
 #3 [f6050cdc] congestion_wait at c045eb7d
 #4 [f6050d00] balance_dirty_pages_ratelimited_nr at c045ab91
 #5 [f6050d54] generic_file_buffered_write at c0457148
 #6 [f6050de8] __generic_file_aio_write_nolock at c04576e5
 #7 [f6050e40] enqueue_entity at c042131f
 #8 [f6050e5c] generic_file_aio_write at c0457799
 #9 [f6050e8c] nfs_file_write at f8f90cee
#10 [f6050e9c] getnstimeofday at c043d3f7
#11 [f6050ed0] do_sync_write at c0472e27
#12 [f6050f7c] vfs_write at c0473689
#13 [f6050f98] sys_write at c0473c95
#14 [f6050fb4] sysenter_entry at c0404ddf

Thanks
--Chakri
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)

2007-09-28 Thread Chakri n
It's 2.6.23-rc6.

Thanks
--Chakri

On 9/28/07, Peter Zijlstra [EMAIL PROTECTED] wrote:
 On Fri, 2007-09-28 at 02:01 -0700, Chakri n wrote:
  Thanks for explaining the adaptive logic.
 
   However other devices will at that moment try to maintain a limit of 0,
   which ends up being similar to a sync mount.
  
   So they'll not get stuck, but they will be slow.
  
  
 
  Sync should be ok, when the situation is bad like this and some one
  hijacked all the buffers.
 
  But, I see my simple dd to write 10blocks on local disk never
  completes even after 10 minutes.
 
  [EMAIL PROTECTED] ~]# dd if=/dev/zero of=/tmp/x count=10
 
  I think the process is completely stuck and is not progressing at all.
 
  Is something going wrong in the calculations where it does not fall
  back to sync mode.

 What kernel is that?



-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)

2007-09-28 Thread Peter Zijlstra
On Fri, 2007-09-28 at 07:28 -0600, Jonathan Corbet wrote:
 Andrew wrote:
  It's unrelated to the actual value of dirty_thresh: if the machine fills up
  with dirty (or unstable) NFS pages then eventually new writers will block
  until that condition clears.
  
  2.4 doesn't have this problem at low levels of dirty data because 2.4
  VFS/MM doesn't account for NFS pages at all.
 
 Is it really NFS-related?  I was trying to back up my 2.6.23-rc8 system
 to an external USB drive the other day when something flaked and the
 drive fell off the bus.  That, too, was sufficient to wedge the entire
 system, even though the only thing which needed the dead drive was one
 rsync process.  It's kind of a bummer to have to hit the reset button
 after the failure of (what should be) a non-critical piece of hardware.
 
 Not that I have a fix to propose...:)

the per bdi work in -mm should make the system not drop dead.

Still, would a remove,re-insert of the usb media end up with the same
bdi? That is, would it recognise as the same and resume the transfer.

Anyway, it would be grand (and dangerous) if we could provide for a
button that would just kill off all outstanding pages against a dead
device.


signature.asc
Description: This is a digitally signed message part


Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)

2007-09-28 Thread Jonathan Corbet
Andrew wrote:
 It's unrelated to the actual value of dirty_thresh: if the machine fills up
 with dirty (or unstable) NFS pages then eventually new writers will block
 until that condition clears.
 
 2.4 doesn't have this problem at low levels of dirty data because 2.4
 VFS/MM doesn't account for NFS pages at all.

Is it really NFS-related?  I was trying to back up my 2.6.23-rc8 system
to an external USB drive the other day when something flaked and the
drive fell off the bus.  That, too, was sufficient to wedge the entire
system, even though the only thing which needed the dead drive was one
rsync process.  It's kind of a bummer to have to hit the reset button
after the failure of (what should be) a non-critical piece of hardware.

Not that I have a fix to propose...:)

jon
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)

2007-09-28 Thread Chakri n
It's works on .23-rc8-mm2 with out any problems.

dd process does not hang any more.

Thanks for all the help.

Cheers
--Chakri


On 9/28/07, Peter Zijlstra [EMAIL PROTECTED] wrote:
  [ and one copy for the list too ]

 On Fri, 2007-09-28 at 02:20 -0700, Chakri n wrote:
  It's 2.6.23-rc6.

 Could you try .23-rc8-mm2. It includes the per bdi stuff.


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [linux-pm] Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)

2007-09-28 Thread Alan Stern
On Fri, 28 Sep 2007, Peter Zijlstra wrote:

 On Fri, 2007-09-28 at 07:28 -0600, Jonathan Corbet wrote:

  Is it really NFS-related?  I was trying to back up my 2.6.23-rc8 system
  to an external USB drive the other day when something flaked and the
  drive fell off the bus.  That, too, was sufficient to wedge the entire
  system, even though the only thing which needed the dead drive was one
  rsync process.  It's kind of a bummer to have to hit the reset button
  after the failure of (what should be) a non-critical piece of hardware.
  
  Not that I have a fix to propose...:)
 
 the per bdi work in -mm should make the system not drop dead.
 
 Still, would a remove,re-insert of the usb media end up with the same
 bdi? That is, would it recognise as the same and resume the transfer.

Removal and replacement of the media might work.  I have never tried 
it.

But Jon described removal of the device, not the media.  Replacing the 
device definitely will not work.

Alan Stern

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)

2007-09-28 Thread Peter Zijlstra
 [ and one copy for the list too ]

On Fri, 2007-09-28 at 02:20 -0700, Chakri n wrote:
 It's 2.6.23-rc6.

Could you try .23-rc8-mm2. It includes the per bdi stuff.


signature.asc
Description: This is a digitally signed message part


Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)

2007-09-28 Thread Chakri n
Thanks.

The BDI dirty limits sounds like a good idea.

Is there already a patch for this, which I could try?

I believe it works like this,

Each BDI, will have a limit. If the dirty_thresh exceeds the limit,
all the I/O on the block device will be synchronous.

so, if I have sda  a NFS mount, the dirty limit can be different for
each of them.

I can set dirty limit for
 -  sda to be 90% and
 -  NFS mount to be 50%.

So, if the dirty limit is greater than 50%, NFS does synchronously,
but sda can work asynchronously, till dirty limit reaches 90%.

Thanks
--Chakri

On 9/27/07, Peter Zijlstra [EMAIL PROTECTED] wrote:
 On Thu, 2007-09-27 at 23:50 -0700, Andrew Morton wrote:

  What we _don't_ want to happen is for other processes which are writing to
  other, non-dead devices to get collaterally blocked.  We have patches which
  might fix that queued for 2.6.24.  Peter?

 Nasty problem, don't do that :-)

 But yeah, with per BDI dirty limits we get stuck at whatever ratio that
 NFS server/mount (?) has - which could be 100%. Other processes will
 then work almost synchronously against their BDIs but it should work.

 [ They will lower the NFS-BDI's ratio, but some fancy clipping code will
   limit the other BDIs their dirty limit to not exceed the total limit.
   And with all these NFS pages stuck, that will still be nothing. ]




-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)

2007-09-28 Thread Peter Zijlstra
[ please don't top-post! ]

On Fri, 2007-09-28 at 01:27 -0700, Chakri n wrote:

 On 9/27/07, Peter Zijlstra [EMAIL PROTECTED] wrote:
  On Thu, 2007-09-27 at 23:50 -0700, Andrew Morton wrote:
 
   What we _don't_ want to happen is for other processes which are writing to
   other, non-dead devices to get collaterally blocked.  We have patches 
   which
   might fix that queued for 2.6.24.  Peter?
 
  Nasty problem, don't do that :-)
 
  But yeah, with per BDI dirty limits we get stuck at whatever ratio that
  NFS server/mount (?) has - which could be 100%. Other processes will
  then work almost synchronously against their BDIs but it should work.
 
  [ They will lower the NFS-BDI's ratio, but some fancy clipping code will
limit the other BDIs their dirty limit to not exceed the total limit.
And with all these NFS pages stuck, that will still be nothing. ]
 
 Thanks.
 
 The BDI dirty limits sounds like a good idea.
 
 Is there already a patch for this, which I could try?

v2.6.23-rc8-mm2

 I believe it works like this,
 
 Each BDI, will have a limit. If the dirty_thresh exceeds the limit,
 all the I/O on the block device will be synchronous.
 
 so, if I have sda  a NFS mount, the dirty limit can be different for
 each of them.
 
 I can set dirty limit for
  -  sda to be 90% and
  -  NFS mount to be 50%.
 
 So, if the dirty limit is greater than 50%, NFS does synchronously,
 but sda can work asynchronously, till dirty limit reaches 90%.

Not quite, the system determines the limit itself in an adaptive
fashion.

  bdi_limit = total_limit * p_bdi

Where p is a faction [0,1], and is determined by the relative writeout
speed of the current BDI vs all other BDIs.

So if you were to have 3 BDIs (sda, sdb and 1 nfs mount), and sda is
idle, and the nfs mount gets twice as much traffic as sdb, the ratios
will look like:

 p_sda: 0
 p_sdb: 1/3
 p_nfs: 2/3

Once the traffic exceeds the write speed of the device we build up a
backlog and stuff gets throttled, so these proportions converge to the
relative write speed of the BDIs when saturated with data.

So what can happen in your case is that the NFS mount is the only one
with traffic is will get a fraction of 1. If it then disconnects like in
your case, it will still have all of the dirty limit pinned for NFS.

However other devices will at that moment try to maintain a limit of 0,
which ends up being similar to a sync mount.

So they'll not get stuck, but they will be slow.


signature.asc
Description: This is a digitally signed message part


Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)

2007-09-28 Thread Trond Myklebust
On Thu, 2007-09-27 at 23:50 -0700, Andrew Morton wrote:

 Actually we perhaps could address this at the VFS level in another way. 
 Processes which are writing to the dead NFS server will eventually block in
 balance_dirty_pages() once they've exceeded the memory limits and will
 remain blocked until the server wakes up - that's the behaviour we want.
 
 What we _don't_ want to happen is for other processes which are writing to
 other, non-dead devices to get collaterally blocked.  We have patches which
 might fix that queued for 2.6.24.  Peter?

Do these patches also cause the memory reclaimers to steer clear of
devices that are congested (and stop waiting on a congested device if
they see that it remains congested for a long period of time)? Most of
the collateral blocking I see tends to happen in memory allocation...

Cheers
  Trond

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)

2007-09-28 Thread Chakri n
Thanks for explaining the adaptive logic.

 However other devices will at that moment try to maintain a limit of 0,
 which ends up being similar to a sync mount.

 So they'll not get stuck, but they will be slow.



Sync should be ok, when the situation is bad like this and some one
hijacked all the buffers.

But, I see my simple dd to write 10 blocks on the local disk never
completes even after 10 minutes.

[EMAIL PROTECTED] ~]# dd if=/dev/zero of=/tmp/x count=10

I think the process is completely stuck and is not progressing at all.

Is something going wrong in the calculations, such that it does not fall
back to sync mode?

Thanks
--Chakri

On 9/28/07, Peter Zijlstra [EMAIL PROTECTED] wrote:
 [ please don't top-post! ]

 On Fri, 2007-09-28 at 01:27 -0700, Chakri n wrote:

  On 9/27/07, Peter Zijlstra [EMAIL PROTECTED] wrote:
   On Thu, 2007-09-27 at 23:50 -0700, Andrew Morton wrote:
  
What we _don't_ want to happen is for other processes which are writing to
other, non-dead devices to get collaterally blocked.  We have patches which
might fix that queued for 2.6.24.  Peter?
  
   Nasty problem, don't do that :-)
  
   But yeah, with per BDI dirty limits we get stuck at whatever ratio that
   NFS server/mount (?) has - which could be 100%. Other processes will
   then work almost synchronously against their BDIs but it should work.
  
    [ They will lower the NFS-BDI's ratio, but some fancy clipping code will
  limit the other BDIs' dirty limits so that they do not exceed the total
  limit. And with all these NFS pages stuck, that will still be nothing. ]
  
  Thanks.
 
  The BDI dirty limits sounds like a good idea.
 
  Is there already a patch for this, which I could try?

 v2.6.23-rc8-mm2

  I believe it works like this,
 
  Each BDI will have a limit. If the dirty_thresh exceeds the limit,
  all the I/O on the block device will be synchronous.
 
  so, if I have sda & an NFS mount, the dirty limit can be different for
  each of them.
 
  I can set dirty limit for
   -  sda to be 90% and
   -  NFS mount to be 50%.
 
  So, if the dirty limit is greater than 50%, NFS writes synchronously,
  but sda can work asynchronously, till the dirty limit reaches 90%.

 Not quite, the system determines the limit itself in an adaptive
 fashion.

   bdi_limit = total_limit * p_bdi

 Where p is a fraction in [0,1], and is determined by the relative writeout
 speed of the current BDI vs all other BDIs.

 So if you were to have 3 BDIs (sda, sdb and 1 nfs mount), and sda is
 idle, and the nfs mount gets twice as much traffic as sdb, the ratios
 will look like:

  p_sda: 0
  p_sdb: 1/3
  p_nfs: 2/3

 Once the traffic exceeds the write speed of the device we build up a
 backlog and stuff gets throttled, so these proportions converge to the
 relative write speed of the BDIs when saturated with data.

 So what can happen in your case is that the NFS mount is the only one
 with traffic, so it will get a fraction of 1. If it then disconnects, like in
 your case, it will still have all of the dirty limit pinned for NFS.

 However other devices will at that moment try to maintain a limit of 0,
 which ends up being similar to a sync mount.

 So they'll not get stuck, but they will be slow.




Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)

2007-09-28 Thread Peter Zijlstra
On Fri, 2007-09-28 at 02:01 -0700, Chakri n wrote:
 Thanks for explaining the adaptive logic.
 
  However other devices will at that moment try to maintain a limit of 0,
  which ends up being similar to a sync mount.
 
  So they'll not get stuck, but they will be slow.
 
 
 
 Sync should be ok, when the situation is bad like this and some one
 hijacked all the buffers.
 
 But, I see my simple dd to write 10 blocks on the local disk never
 completes even after 10 minutes.
 
 [EMAIL PROTECTED] ~]# dd if=/dev/zero of=/tmp/x count=10
 
 I think the process is completely stuck and is not progressing at all.
 
 Is something going wrong in the calculations, such that it does not fall
 back to sync mode?

What kernel is that?





Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)

2007-09-28 Thread Andrew Morton
On Fri, 28 Sep 2007 07:28:52 -0600 [EMAIL PROTECTED] (Jonathan Corbet) wrote:

 Andrew wrote:
  It's unrelated to the actual value of dirty_thresh: if the machine fills up
  with dirty (or unstable) NFS pages then eventually new writers will block
  until that condition clears.
  
  2.4 doesn't have this problem at low levels of dirty data because 2.4
  VFS/MM doesn't account for NFS pages at all.
 
 Is it really NFS-related?  I was trying to back up my 2.6.23-rc8 system
 to an external USB drive the other day when something flaked and the
 drive fell off the bus.  That, too, was sufficient to wedge the entire
 system, even though the only thing which needed the dead drive was one
 rsync process.  It's kind of a bummer to have to hit the reset button
 after the failure of (what should be) a non-critical piece of hardware.
 
 Not that I have a fix to propose...:)
 

That's a USB bug, surely.  What should happen is that the kernel attempts
writeback, gets an IO error and then your data gets lost.



Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)

2007-09-28 Thread Peter Zijlstra

On Fri, 2007-09-28 at 11:49 -0700, Andrew Morton wrote:

 Do you know where the stalls are occurring?  throttle_vm_writeout(), or via
 direct calls to congestion_wait() from page_alloc.c and vmscan.c?  (running
 sysrq-w five or ten times will probably be enough to determine this)

would it make sense to instrument congestion_wait() callsites with
vmstats?
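
Something like the below, purely as a sketch; COUNT_CONGESTION_WAIT is a
made-up vm_event_item (a real version would want one event per callsite),
and I'm assuming the current congestion_wait(int rw, long timeout) signature:

#include <linux/vmstat.h>
#include <linux/backing-dev.h>

/*
 * Sketch only: bump a (hypothetical) vmstat event before sleeping, so
 * /proc/vmstat would show how often we ended up in congestion_wait().
 */
static inline long congestion_wait_counted(int rw, long timeout)
{
        count_vm_event(COUNT_CONGESTION_WAIT); /* hypothetical event */
        return congestion_wait(rw, timeout);
}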



Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)

2007-09-28 Thread Andrew Morton
On Fri, 28 Sep 2007 13:00:53 -0400 Trond Myklebust [EMAIL PROTECTED] wrote:

 On Thu, 2007-09-27 at 23:50 -0700, Andrew Morton wrote:
 
  Actually we perhaps could address this at the VFS level in another way. 
  Processes which are writing to the dead NFS server will eventually block in
  balance_dirty_pages() once they've exceeded the memory limits and will
  remain blocked until the server wakes up - that's the behaviour we want.
  
  What we _don't_ want to happen is for other processes which are writing to
  other, non-dead devices to get collaterally blocked.  We have patches which
  might fix that queued for 2.6.24.  Peter?
 
 Do these patches also cause the memory reclaimers to steer clear of
 devices that are congested (and stop waiting on a congested device if
 they see that it remains congested for a long period of time)? Most of
 the collateral blocking I see tends to happen in memory allocation...
 

No, they don't attempt to do that, but I suspect they put in place
infrastructure which could be used to improve direct-reclaimer latency.  In
the throttle_vm_writeout() path, at least.

Do you know where the stalls are occurring?  throttle_vm_writeout(), or via
direct calls to congestion_wait() from page_alloc.c and vmscan.c?  (running
sysrq-w five or ten times will probably be enough to determine this)



Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)

2007-09-28 Thread Andrew Morton
On Fri, 28 Sep 2007 20:48:59 +0200 Peter Zijlstra [EMAIL PROTECTED] wrote:

 
 On Fri, 2007-09-28 at 11:49 -0700, Andrew Morton wrote:
 
  Do you know where the stalls are occurring?  throttle_vm_writeout(), or via
  direct calls to congestion_wait() from page_alloc.c and vmscan.c?  (running
  sysrq-w five or ten times will probably be enough to determine this)
 
 would it make sense to instrument congestion_wait() callsites with
 vmstats?

Better than nothing, but it isn't a great fit: we'd need one vmstat counter
per congestion_wait() callsite, and it's all rather specific to the
kernel-of-the-day.

taskstats delay accounting isn't useful either - it will aggregate all the
schedule() callsites.

profile=sleep is just about ideal for this, isn't it?  I suspect that most
people don't know it's there, or forgot about it.

It could be that profile=sleep just tells us you're spending a lot of time
in io_schedule() or congestion_wait(), so perhaps we need to teach it to
go for a walk up the stack somehow.

But lockdep knows how to do that already so perhaps we (ie: you ;)) can
bolt sleep instrumentation onto lockdep as we (ie you ;)) did with the
lockstat stuff?

(Searches for the lockstat documentation)

Did we forget to do that?


Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)

2007-09-28 Thread Andrew Morton
On Fri, 28 Sep 2007 15:16:11 -0400 Trond Myklebust [EMAIL PROTECTED] wrote:

 On Fri, 2007-09-28 at 11:49 -0700, Andrew Morton wrote:
  On Fri, 28 Sep 2007 13:00:53 -0400 Trond Myklebust [EMAIL PROTECTED] 
  wrote:
   Do these patches also cause the memory reclaimers to steer clear of
   devices that are congested (and stop waiting on a congested device if
   they see that it remains congested for a long period of time)? Most of
   the collateral blocking I see tends to happen in memory allocation...
   
  
  No, they don't attempt to do that, but I suspect they put in place
  infrastructure which could be used to improve direct-reclaimer latency.  In
  the throttle_vm_writeout() path, at least.
  
  Do you know where the stalls are occurring?  throttle_vm_writeout(), or via
  direct calls to congestion_wait() from page_alloc.c and vmscan.c?  (running
  sysrq-w five or ten times will probably be enough to determine this)
 
 Looking back, they were getting caught up in
 balance_dirty_pages_ratelimited() and friends. See the attached
 example...

that one is nfs-on-loopback, which is a special case, isn't it?

NFS on loopback used to hang, but then we fixed it.  It looks like we
broke it again sometime in the intervening four years or so.


Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)

2007-09-28 Thread Trond Myklebust
On Fri, 2007-09-28 at 11:49 -0700, Andrew Morton wrote:
 On Fri, 28 Sep 2007 13:00:53 -0400 Trond Myklebust [EMAIL PROTECTED] wrote:
  Do these patches also cause the memory reclaimers to steer clear of
  devices that are congested (and stop waiting on a congested device if
  they see that it remains congested for a long period of time)? Most of
  the collateral blocking I see tends to happen in memory allocation...
  
 
 No, they don't attempt to do that, but I suspect they put in place
 infrastructure which could be used to improve direct-reclaimer latency.  In
 the throttle_vm_writeout() path, at least.
 
 Do you know where the stalls are occurring?  throttle_vm_writeout(), or via
 direct calls to congestion_wait() from page_alloc.c and vmscan.c?  (running
 sysrq-w five or ten times will probably be enough to determine this)

Looking back, they were getting caught up in
balance_dirty_pages_ratelimited() and friends. See the attached
example...

Cheers
  Trond
---BeginMessage---
Hi,

I am testing NFS on loopback, and it locks up the entire system with the 2.6.23-rc6 kernel.

I have mounted a local ext3 partition using loopback NFS (version 3)
and started my test program. The test program forks 20 threads,
allocates 10MB for each thread, and writes & reads a file on the loopback
NFS mount. After running for about 5 min, I cannot even login to the
machine. Commands like ps etc. hang in a live session.

The machine is a DELL 1950 with 4Gig of RAM, so there is plenty of RAM
& CPU to play around with, and no other I/O-heavy processes are running on
the system.
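
For reference, a minimal reconstruction of that kind of test could look like
the following (the original program is not attached here; the mount point
/mnt/nfs, the file names and the sizes are assumptions; build with gcc -pthread):

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NTHREADS 20
#define BUFSZ    (10 * 1024 * 1024)     /* 10MB per thread */

static void *worker(void *arg)
{
        char path[64];
        char *buf = malloc(BUFSZ);

        if (!buf)
                return NULL;
        memset(buf, 'x', BUFSZ);
        snprintf(path, sizeof(path), "/mnt/nfs/test.%ld", (long)arg);

        for (;;) {              /* runs until stopped (or the box wedges) */
                FILE *f = fopen(path, "w+");

                if (!f)
                        break;
                fwrite(buf, 1, BUFSZ, f);
                rewind(f);
                fread(buf, 1, BUFSZ, f);
                fclose(f);
        }
        free(buf);
        return NULL;
}

int main(void)
{
        pthread_t tid[NTHREADS];
        long i;

        for (i = 0; i < NTHREADS; i++)
                pthread_create(&tid[i], NULL, worker, (void *)i);
        for (i = 0; i < NTHREADS; i++)
                pthread_join(tid[i], NULL);
        return 0;
}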

vmstat output shows no buffers are actually getting transferred in or
out and iowait is 100%.

[EMAIL PROTECTED] ~]# vmstat 1
procs ---memory-- ---swap-- -io --system-- -cpu--
 r  b   swpd   free   buff   cache   si   so   bi   bo   in   cs us sy id wa st
 0 24116 110080  11132 304566400 0 0   28  345  0  1  0 99  0
 0 24116 110080  11132 304566400 0 05  329  0  0  0 100  0
 0 24116 110080  11132 304566400 0 0   26  336  0  0  0 100  0
 0 24116 110080  11132 304566400 0 08  335  0  0  0 100  0
 0 24116 110080  11132 304566400 0 0   26  352  0  0  0 100  0
 0 24116 110080  11132 304566400 0 08  351  0  0  0 100  0
 0 24116 110080  11132 304566400 0 0   23  358  0  1  0 99  0
 0 24116 110080  11132 304566400 0 0   10  350  0  0  0 100  0
 0 24116 110080  11132 304566400 0 0   26  363  0  0  0 100  0
 0 24116 110080  11132 304566400 0 08  346  0  1  0 99  0
 0 24116 110080  11132 304566400 0 0   26  360  0  0  0 100  0
 0 24116 110080  11140 304565600 8 0   11  345  0  0  0 100  0
 0 24116 110080  11140 304566400 0 0   27  355  0  0  2 97  0
 0 24116 110080  11140 304566400 0 09  330  0  0  0 100  0
 0 24116 110080  11140 304566400 0 0   26  358  0  0  0 100  0


The following is the backtrace of
1. one of the threads of my test program
2. nfsd daemon and
3. a generic command like pstree, after the machine hangs:
-
crash> bt 3252
PID: 3252   TASK: f6f3c610  CPU: 0   COMMAND: test
 #0 [f6bdcc10] schedule at c0624a34
 #1 [f6bdcc84] schedule_timeout at c06250ee
 #2 [f6bdccc8] io_schedule_timeout at c0624c15
 #3 [f6bdccdc] congestion_wait at c045eb7d
 #4 [f6bdcd00] balance_dirty_pages_ratelimited_nr at c045ab91
 #5 [f6bdcd54] generic_file_buffered_write at c0457148
 #6 [f6bdcde8] __generic_file_aio_write_nolock at c04576e5
 #7 [f6bdce40] try_to_wake_up at c042342b
 #8 [f6bdce5c] generic_file_aio_write at c0457799
 #9 [f6bdce8c] nfs_file_write at f8c25cee
#10 [f6bdced0] do_sync_write at c0472e27
#11 [f6bdcf7c] vfs_write at c0473689
#12 [f6bdcf98] sys_write at c0473c95
#13 [f6bdcfb4] sysenter_entry at c0404ddf
EAX: 0004  EBX: 0013  ECX: a4966008  EDX: 0098
DS:  007b  ESI: 0098  ES:  007b  EDI: a4966008
SS:  007b  ESP: a5ae6ec0  EBP: a5ae6ef0
CS:  0073  EIP: b7eed410  ERR: 0004  EFLAGS: 0246
crash> bt 3188
PID: 3188   TASK: f74c4000  CPU: 1   COMMAND: nfsd
 #0 [f6836c7c] schedule at c0624a34
 #1 [f6836cf0] __mutex_lock_slowpath at c062543d
 #2 [f6836d0c] mutex_lock at c0625326
 #3 [f6836d18] generic_file_aio_write at c0457784
 #4 [f6836d48] ext3_file_write at ffd7
 #5 [f6836d64] do_sync_readv_writev at c0472d1f
 #6 [f6836e08] do_readv_writev at c0473486
 #7 [f6836e6c] vfs_writev at c047358e
 #8 [f6836e7c] nfsd_vfs_write at f8e7f8d7
 #9 [f6836ee0] nfsd_write at f8e80139
#10 [f6836f10] nfsd3_proc_write at f8e86afd
#11 [f6836f44] nfsd_dispatch at f8e7c20c
#12 [f6836f6c] svc_process at f89c18e0
#13 [f6836fbc] nfsd at f8e7c794
#14 [f6836fe4] kernel_thread_helper at c0405a35
crash> ps | grep ps
234  2   3  cb194000  IN   0.0   0 

Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)

2007-09-28 Thread Trond Myklebust
On Fri, 2007-09-28 at 12:26 -0700, Andrew Morton wrote:
 On Fri, 28 Sep 2007 15:16:11 -0400 Trond Myklebust [EMAIL PROTECTED] wrote:
  Looking back, they were getting caught up in
  balance_dirty_pages_ratelimited() and friends. See the attached
  example...
 
 that one is nfs-on-loopback, which is a special case, isn't it?

I'm not sure that the hang that is illustrated here is so special. It is
an example of a bog-standard ext3 write, that ends up calling the NFS
client, which is hanging. The fact that it happens to be hanging on the
nfsd process is more or less irrelevant here: the same thing could
happen to any other process in the case where we have an NFS server that
is down.

 NFS on loopback used to hang, but then we fixed it.  It looks like we
 broke it again sometime in the intervening four years or so.

It has been quirky all through the 2.6.x series because of this issue.

Cheers
  Trond



Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)

2007-09-28 Thread Daniel Phillips
On Friday 28 September 2007 12:52, Trond Myklebust wrote:
 I'm not sure that the hang that is illustrated here is so special. It
 is an example of a bog-standard ext3 write, that ends up calling the
 NFS client, which is hanging. The fact that it happens to be hanging
 on the nfsd process is more or less irrelevant here: the same thing
 could happen to any other process in the case where we have an NFS
 server that is down.

Hi Trond,

Could you clarify what you meant by "calling the NFS client"?  I don't
see any direct call in the posted backtrace.

Regards,

Daniel


Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)

2007-09-28 Thread Andrew Morton
On Fri, 28 Sep 2007 16:32:18 -0400
Trond Myklebust [EMAIL PROTECTED] wrote:

 On Fri, 2007-09-28 at 13:10 -0700, Andrew Morton wrote:
  On Fri, 28 Sep 2007 15:52:28 -0400
  Trond Myklebust [EMAIL PROTECTED] wrote:
  
   On Fri, 2007-09-28 at 12:26 -0700, Andrew Morton wrote:
On Fri, 28 Sep 2007 15:16:11 -0400 Trond Myklebust [EMAIL PROTECTED] 
wrote:
 Looking back, they were getting caught up in
 balance_dirty_pages_ratelimited() and friends. See the attached
 example...

that one is nfs-on-loopback, which is a special case, isn't it?
   
   I'm not sure that the hang that is illustrated here is so special. It is
   an example of a bog-standard ext3 write, that ends up calling the NFS
   client, which is hanging. The fact that it happens to be hanging on the
   nfsd process is more or less irrelevant here: the same thing could
   happen to any other process in the case where we have an NFS server that
   is down.
  
  hm, so ext3 got stuck in nfs via __alloc_pages direct reclaim?
  
  We should be able to fix that by marking the backing device as
  write-congested.  That'll have small race windows, but it should be a 99.9%
  fix?
 
 No. The problem would rather appear to be that we're doing
 per-backing_dev writeback (if I read sync_sb_inodes() correctly), but
 we're measuring variables which are global to the VM. The backing device
 that we are selecting may not be writing out any dirty pages, in which
 case, we're just spinning in balance_dirty_pages_ratelimited().

OK, so it's unrelated to page reclaim.

 Should we therefore perhaps be looking at adding per-backing_dev stats
 too?

That's what mm-per-device-dirty-threshold.patch and friends are doing. 
Whether it works adequately is not really known at this time. 
Unfortunately kernel developers don't test -mm much.  
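
Roughly, the idea in those patches is to compare a task's own backing device
against that device's share of the limit, rather than the global counters.
As a sketch only (the struct and the commented-out helper are placeholders,
not the actual patch code):

/*
 * Per-BDI throttling sketch: a writer is only made to wait when *its*
 * backing device exceeds its share of the global dirty limit.
 */
struct bdi_stats {
        unsigned long nr_dirty;         /* dirty pages backed by this device */
        unsigned long nr_writeback;     /* pages in flight to this device */
        unsigned long thresh;           /* this device's share of the limit */
};

static int over_bdi_limit(const struct bdi_stats *bdi)
{
        return bdi->nr_dirty + bdi->nr_writeback > bdi->thresh;
}

/*
 * balance_dirty_pages() would then loop, writing back pages belonging to
 * this BDI and re-reading the counters, until over_bdi_limit() is false,
 * instead of blocking on a global dirty count it cannot influence.
 */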


Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)

2007-09-28 Thread Trond Myklebust
On Fri, 2007-09-28 at 13:10 -0700, Andrew Morton wrote:
 On Fri, 28 Sep 2007 15:52:28 -0400
 Trond Myklebust [EMAIL PROTECTED] wrote:
 
  On Fri, 2007-09-28 at 12:26 -0700, Andrew Morton wrote:
   On Fri, 28 Sep 2007 15:16:11 -0400 Trond Myklebust [EMAIL PROTECTED] 
   wrote:
Looking back, they were getting caught up in
balance_dirty_pages_ratelimited() and friends. See the attached
example...
   
   that one is nfs-on-loopback, which is a special case, isn't it?
  
  I'm not sure that the hang that is illustrated here is so special. It is
  an example of a bog-standard ext3 write, that ends up calling the NFS
  client, which is hanging. The fact that it happens to be hanging on the
  nfsd process is more or less irrelevant here: the same thing could
  happen to any other process in the case where we have an NFS server that
  is down.
 
 hm, so ext3 got stuck in nfs via __alloc_pages direct reclaim?
 
 We should be able to fix that by marking the backing device as
 write-congested.  That'll have small race windows, but it should be a 99.9%
 fix?

No. The problem would rather appear to be that we're doing
per-backing_dev writeback (if I read sync_sb_inodes() correctly), but
we're measuring variables which are global to the VM. The backing device
that we are selecting may not be writing out any dirty pages, in which
case, we're just spinning in balance_dirty_pages_ratelimited().

Should we therefore perhaps be looking at adding per-backing_dev stats
too?

Trond



Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)

2007-09-28 Thread Chakri n
Here is a snapshot of vmstats when the problem happened. I believe
this could help a little.

crash> kmem -V
   NR_FREE_PAGES: 680853
 NR_INACTIVE: 95380
   NR_ACTIVE: 26891
   NR_ANON_PAGES: 2507
  NR_FILE_MAPPED: 1832
   NR_FILE_PAGES: 119779
   NR_FILE_DIRTY: 0
NR_WRITEBACK: 18272
 NR_SLAB_RECLAIMABLE: 1305
NR_SLAB_UNRECLAIMABLE: 2085
NR_PAGETABLE: 123
 NR_UNSTABLE_NFS: 0
   NR_BOUNCE: 0
 NR_VMSCAN_WRITE: 0

In my testing, I always saw the processes waiting in
balance_dirty_pages_ratelimited(), never in the throttle_vm_writeout()
path.

But this could be because I have about 4Gig of memory in the system
and plenty of memory is still available.

I will rerun the test limiting memory to 1024MB and let's see if it
takes a different path.

Thanks
--Chakri


On 9/28/07, Andrew Morton [EMAIL PROTECTED] wrote:
 On Fri, 28 Sep 2007 16:32:18 -0400
 Trond Myklebust [EMAIL PROTECTED] wrote:

  On Fri, 2007-09-28 at 13:10 -0700, Andrew Morton wrote:
   On Fri, 28 Sep 2007 15:52:28 -0400
   Trond Myklebust [EMAIL PROTECTED] wrote:
  
On Fri, 2007-09-28 at 12:26 -0700, Andrew Morton wrote:
 On Fri, 28 Sep 2007 15:16:11 -0400 Trond Myklebust [EMAIL PROTECTED] wrote:
  Looking back, they were getting caught up in
  balance_dirty_pages_ratelimited() and friends. See the attached
  example...

 that one is nfs-on-loopback, which is a special case, isn't it?
   
I'm not sure that the hang that is illustrated here is so special. It is
an example of a bog-standard ext3 write, that ends up calling the NFS
client, which is hanging. The fact that it happens to be hanging on the
nfsd process is more or less irrelevant here: the same thing could
happen to any other process in the case where we have an NFS server that
is down.
  
   hm, so ext3 got stuck in nfs via __alloc_pages direct reclaim?
  
   We should be able to fix that by marking the backing device as
   write-congested.  That'll have small race windows, but it should be a 99.9%
   fix?
 
  No. The problem would rather appear to be that we're doing
  per-backing_dev writeback (if I read sync_sb_inodes() correctly), but
  we're measuring variables which are global to the VM. The backing device
  that we are selecting may not be writing out any dirty pages, in which
  case, we're just spinning in balance_dirty_pages_ratelimited().

 OK, so it's unrelated to page reclaim.

  Should we therefore perhaps be looking at adding per-backing_dev stats
  too?

 That's what mm-per-device-dirty-threshold.patch and friends are doing.
 Whether it works adequately is not really known at this time.
 Unfortunately kernel developers don't test -mm much.



Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)

2007-09-28 Thread Chakri n
No change in behavior even in the case of low-memory systems. I confirmed
it by running on a 1Gig machine.

Thanks
--Chakri

On 9/28/07, Chakri n [EMAIL PROTECTED] wrote:
 Here is a snapshot of vmstats when the problem happened. I believe
 this could help a little.

 crash> kmem -V
NR_FREE_PAGES: 680853
  NR_INACTIVE: 95380
NR_ACTIVE: 26891
NR_ANON_PAGES: 2507
   NR_FILE_MAPPED: 1832
NR_FILE_PAGES: 119779
NR_FILE_DIRTY: 0
 NR_WRITEBACK: 18272
  NR_SLAB_RECLAIMABLE: 1305
 NR_SLAB_UNRECLAIMABLE: 2085
 NR_PAGETABLE: 123
  NR_UNSTABLE_NFS: 0
NR_BOUNCE: 0
  NR_VMSCAN_WRITE: 0

 In my testing, I always saw the processes waiting in
 balance_dirty_pages_ratelimited(), never in the throttle_vm_writeout()
 path.

 But this could be because I have about 4Gig of memory in the system
 and plenty of memory is still available.

 I will rerun the test limiting memory to 1024MB and let's see if it
 takes a different path.

 Thanks
 --Chakri


 On 9/28/07, Andrew Morton [EMAIL PROTECTED] wrote:
  On Fri, 28 Sep 2007 16:32:18 -0400
  Trond Myklebust [EMAIL PROTECTED] wrote:
 
   On Fri, 2007-09-28 at 13:10 -0700, Andrew Morton wrote:
On Fri, 28 Sep 2007 15:52:28 -0400
Trond Myklebust [EMAIL PROTECTED] wrote:
   
 On Fri, 2007-09-28 at 12:26 -0700, Andrew Morton wrote:
  On Fri, 28 Sep 2007 15:16:11 -0400 Trond Myklebust [EMAIL PROTECTED] wrote:
   Looking back, they were getting caught up in
   balance_dirty_pages_ratelimited() and friends. See the attached
   example...
 
  that one is nfs-on-loopback, which is a special case, isn't it?

 I'm not sure that the hang that is illustrated here is so special. It is
 an example of a bog-standard ext3 write, that ends up calling the NFS
 client, which is hanging. The fact that it happens to be hanging on the
 nfsd process is more or less irrelevant here: the same thing could
 happen to any other process in the case where we have an NFS server that
 is down.
   
hm, so ext3 got stuck in nfs via __alloc_pages direct reclaim?
   
We should be able to fix that by marking the backing device as
write-congested.  That'll have small race windows, but it should be a 99.9%
fix?
  
   No. The problem would rather appear to be that we're doing
   per-backing_dev writeback (if I read sync_sb_inodes() correctly), but
   we're measuring variables which are global to the VM. The backing device
   that we are selecting may not be writing out any dirty pages, in which
   case, we're just spinning in balance_dirty_pages_ratelimited().
 
  OK, so it's unrelated to page reclaim.
 
   Should we therefore perhaps be looking at adding per-backing_dev stats
   too?
 
  That's what mm-per-device-dirty-threshold.patch and friends are doing.
  Whether it works adequately is not really known at this time.
  Unfortunately kernel developers don't test -mm much.
 



Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)

2007-09-28 Thread Daniel Phillips
On Thursday 27 September 2007 23:50, Andrew Morton wrote:
 Actually we perhaps could address this at the VFS level in another
 way. Processes which are writing to the dead NFS server will
 eventually block in balance_dirty_pages() once they've exceeded the
 memory limits and will remain blocked until the server wakes up -
 that's the behaviour we want.

It is not necessary to restrict total dirty pages at all.  Instead it is 
necessary to restrict total writeout in flight.  This is evident from 
the fact that making progress is the one and only reason our kernel 
exists, and writeout is how we make progress clearing memory.  In other 
words, if we guarantee the progress of writeout, we will live happily 
ever after and not have to sell the farm.

The current situation has an eerily similar feeling to the VM 
instability in early 2.4, which was never solved until we convinced 
ourselves that the only way to deal with Moore's law as applied to 
number of memory pages was to implement positive control of swapout in 
the form of reverse mapping[1].  This time round, we need to add 
positive control of writeout in the form of rate limiting.

I _think_ Peter is with me on this, and not only that, but between the 
two of us we already have patches for most of the subsystems that need 
it, and we have both been busy testing (different subsets of) these 
patches to destruction for the better part of a year.

Anyway, to fix the immediate bug before the one true dirty_limit removal 
patch lands (promise) I think you are on the right track by noticing 
that balance_dirty_pages has to become aware of how congested the 
involved block device is, since blocking a writeout process on an 
underused block device is clearly a bad idea.  Note how much this idea 
looks like rate limiting.
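
As a toy illustration of what positive control of writeout could mean, here
is a userspace sketch (not a kernel patch; all the names are invented) that
caps the number of pages in flight to a device with a counting semaphore:

#include <semaphore.h>

static sem_t writeout_slots;            /* one slot per page in flight */

void writeout_init(unsigned int max_in_flight)
{
        sem_init(&writeout_slots, 0, max_in_flight);
}

void submit_page(void)                  /* before queueing a page for I/O */
{
        sem_wait(&writeout_slots);      /* blocks once the device is saturated */
}

void complete_page(void)                /* from I/O completion */
{
        sem_post(&writeout_slots);
}

Submitters then throttle against the device's own completion rate, rather
than against a global dirty page count.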

[1] We lost the scent for a number of reasons, not least because the 
experimental implementation of reverse mapping at the time was buggy 
for reasons entirely unrelated to the reverse mapping itself.

Regards,

Daniel


Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)

2007-09-28 Thread Daniel Phillips
On Friday 28 September 2007 06:35, Peter Zijlstra wrote:
 ...it would be grand (and dangerous) if we could provide for a
 button that would just kill off all outstanding pages against a dead
 device.

Substitute "resources" for "pages" and you begin to get an idea of how 
tricky that actually is.  That said, this is exactly what we have done 
with ddsnap, for the simple reason that our users, now emboldened by 
being able to stop or terminate the user space part, felt justified in 
expecting that the system continue as if nothing had happened, and 
furthermore, be able to restart ddsnap without a hiccup.  (Otherwise 
known as a sysop's deity-given right to kill.)

So this is what we do in the specific case of ddsnap:

   * When we detect some nasty state change such as our userspace
  control daemon disappearing on us, we go poking around and
  explicitly release every semaphore that the device driver could
  possibly wait on forever (interestingly they are all in our own
  driver except for BKL, which is just an artifact of device mapper
  not having gone over to unlock_ioctl for no good reason that I
  know of).

   * Then at the points where the driver falls through some lock thus
  released, we check our ready flag, and if it indicates busted,
  proceed with whatever cleanup is needed at that point (see the sketch
  below).
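
A toy userspace illustration of that release-and-recheck pattern (this is
not ddsnap code; all names here are invented):

#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static bool device_ready = true;
static bool work_available;             /* set elsewhere when work arrives */

/* Called when the userspace daemon disappears on us. */
void declare_dead(void)
{
        pthread_mutex_lock(&lock);
        device_ready = false;
        pthread_cond_broadcast(&cond);  /* wake everyone who could block forever */
        pthread_mutex_unlock(&lock);
}

/* A waiter that would otherwise sleep forever waiting for work. */
int wait_for_work(void)
{
        int ret = 0;

        pthread_mutex_lock(&lock);
        while (!work_available && device_ready)
                pthread_cond_wait(&cond, &lock);
        if (!device_ready)
                ret = -1;               /* fell through a released lock: clean up */
        pthread_mutex_unlock(&lock);
        return ret;
}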

Does not sound like an approach one would expect to work reliably, does 
it?  But there just may be some general principle to be ferretted out 
here.  (Anyone who has ideas on how bits of this procedure could be 
abstracted, please do not hesitate to step boldly forth into the 
limelight.)

Incidentally, only a small subset of locks needed special handling as 
above.  Most can be shown to have no way to block forever, short of an 
outright bug.

I shudder to think how much work it would be to bring every driver in 
the kernel up to such a standard, particularly if user space components 
are involved, as with USB.  On the other hand, every driver fixed is 
one less driver that sucks.  The next one to emerge from the pipeline 
will most likely be NBD, which we have been working on in fits and 
starts for a while.  Look for it to morph into ddbd, with cross-node 
distributed data awareness, in addition to performing its current job 
without deadlocking.

Regards,

Daniel


Linux 2.6.23-rc6-git3-krf1

2007-09-13 Thread Michal Piotrowski
Hi,

There are a few regression fixes in -krf tree

http://www.stardust.webpages.pl/files/patches/krf/2.6.23-rc6-git3/2.6.23-rc6-git3-krf1.patch.bz2
http://www.stardust.webpages.pl/files/patches/krf/2.6.23-rc6-git3/2.6.23-rc6-git3-krf1.tar.bz2

Vitaly Bordug:
oops-while-modprobing-phy-fixed-module-fix.patch

Dmitry Torokhov:
sysfs-change-of-input-event-devices-in-2.6.23rc-breaks-udev-fix.patch

J. Bruce Fields:
oops-in-nfs4_cb_recall-fix.patch

Jamal:
possible-irq-lock-inversion-dependency-fix.patch

 drivers/base/core.c   |   29 +++-
 drivers/net/phy/Kconfig   |   14 ++
 drivers/net/phy/fixed.c   |  310 +++---
 fs/nfsd/nfs4callback.c|1
 fs/nfsd/nfs4state.c   |   37 +++--
 include/linux/phy_fixed.h |   38 +
 net/sched/act_api.c   |8 -
 net/sched/act_police.c|4
 8 files changed, 258 insertions(+), 183 deletions(-)


Regards,
Michal

--
LOG
http://www.stardust.webpages.pl/log/


Linux 2.6.23-rc6

2007-09-10 Thread Linus Torvalds

So last week was a bust, with a lot of core people away for the kernel 
summit, and with -rc5 having two rather nasty (and silly) one-liner 
problems that bit a number of people - a missing NULL pointer check in 
TCP, and a missing list terminator in ata_piix.

So the fixes for those things were both pretty trivial, and they've been 
in the -git trees for the last few days, but I just pushed out an -rc6 
that also merges up some other updates that did come in during the week. 

Most of the diff is just some m32r reorg, but see the appended shortlog 
and diffstat for details.

Linus
---
Alan Cox (1):
  [MIPS] tty: add the new ioctls and definitions.

Andi Kleen (1):
  x86_64: Remove CLFLUSH in text_poke()

Christian Ehrhardt (1):
  [x86 setup] Work around bug in Xen HVM

Christoph Hellwig (4):
  [XFS] Fix sparse NULL vs 0 warnings
  [XFS] Fix sparse warning in kmem_shake_allow
  [XFS] fix ASSERT and ASSERT_ALWAYS
  [XFS] fix sparse shadowed variable warnings

Chuck Lever (4):
  NFS: mount option parser chokes on proto=
  NFS: Return a real error code from mount(2)
  NFS: Off-by-one length error in string handling
  NFS: change NFS mount error return when hostname/pathname too long

Daniel Walker (1):
  i386: fix a hang on stuck nmi watchdog

David Brownell (2):
  i2c-gpio: Fix adapter number
  i2c-algo-bit: Read block data bugfix

David Chinner (1):
  [XFS] Set filestreams object timeout to something sane.

David Howells (1):
  [MTD] Initialise s_flags in get_sb_mtd_aux()

David S. Miller (1):
  [TCP]: 'dst' can be NULL in tcp_rto_min()

Eric Sandeen (1):
  [XFS] fix nasty quota hashtable allocation bug

Geert Uytterhoeven (1):
  [POWERPC] cell/PS3: Ignore storage devices that are still being probed

Herbert Xu (2):
  [CRYPTO] blkcipher: Fix handling of kmalloc page straddling
  [CRYPTO] blkcipher: Fix inverted test in blkcipher_get_spot

Hirokazu Takata (12):
  m32r: Move defconfig files to arch/m32r/configs/
  m32r: Update defconfig files for 2.6.23-rc1
  m32r: Add defconfig file for the usrv platform.
  m32r: Rearrange platform-dependent codes
  m32r: Move dot.gdbinit files
  m32r: Define symbols to unify platform-dependent ICU checks
  m32r: Simplify ei_handler code
  m32r: Exit ei_handler directly for no IRQ case or IPI operations
  m32r: Cosmetic updates of arch/m32r/kernel/entry.S
  m32r: Separate syscall table from entry.S
  m32r: build fix of entry.S
  m32r: Rename STI/CLI macros

Ingo Molnar (4):
  sched: fix niced_granularity() shift
  sched: debug: fix cfs_rq->wait_runtime accounting
  sched: debug: fix sum_exec_runtime clearing
  sched: fix xtensa build warning

Jason Lunz (1):
  [JFFS2] fix write deadlock regression

Jean Delvare (2):
  hwmon: End of I/O region off-by-one
  i2c-pxa: Fix adapter number

Jeff Dike (1):
  UML: Fix ELF_CORE_COPY_REGS build botch

Jeff Garzik (1):
  [libata] ata_piix: properly terminate DMI system list

Jeff Norden (1):
  pata_it821x: fix lost interrupt with atapi devices

Jeremy Kerr (1):
  [POWERPC] cell/PS3: Always set master run control bit in mfc_sr1_set

Jesper Juhl (1):
  [IA64] Remove unnecessary cast of allocation return value in 
sn_hwperf_enum_objects()

Joachim Fenkes (1):
  [POWERPC] ibmebus: Prevent bus_id collisions

John Keller (1):
  [IA64] SN: Add support for CPU disable

Joseph Chan (1):
  [libata, IDE] add new VIA bridge to VIA PATA drivers

Kenji Kaneshige (2):
  [IA64] Fix unexpected interrupt vector handling
  [IA64] Clear pending interrupts at CPU boot up time

Kyungmin Park (1):
  [MIPS] i8259: Add disable method.

Laurent Riffard (1):
  Fix broken pata_via cable detection

Linus Torvalds (1):
      Linux 2.6.23-rc6

Masato Noguchi (1):
  [POWERPC] cell/PS3: Fix a bug that causes the PS3 to hang on the SPU 
Class 0 interrupt.

Maxime Bizon (1):
  [MIPS] R1: Fix wrong test in dma-default.c

Neil Brown (2):
  knfsd: Fixed problem with NFS exporting directories which are mounted on.
  knfsd: Validate filehandle type in fsid_source

Ondrej Zary (1):
  Fix sata_via write errors on PATA drive connected to VT6421

Peter Chubb (2):
  [IA64] Enable early console for Ski simulator
  [IA64] Cleanup HPSIM code (was: Re: Enable early console for Ski 
simulator)

Peter Zijlstra (3):
  sched: simplify __check_preempt_curr_fair()
  sched: improve prev_sum_exec_runtime setting
  sched: fix ideal_runtime calculations for reniced tasks

Prarit Bhargava (1):
  [IA64] Stop bogus NMI & softlockup warnings in ia64 show_mem

Ralf Baechle (5):
  [MIPS] BCM1480: Fix computation of interrupt mask address register.
  [MIPS] PCI: Set need_domain_info if controller domain index is non-zero.
  [MIPS] Kconfig: whitespace cleanup.
  [MIPS] Sibyte: Remove broken dependency on EXPERIMENTAL
