Re: 2.6.24-rc6 reproducible raid5 hang

2008-02-14 Thread Burkhard Carstens
On Tuesday 29 January 2008 23:58, Bill Davidsen wrote:
 Carlos Carvalho wrote:
  Tim Southerwood ([EMAIL PROTECTED]) wrote on 28 January 2008 17:29:
   Subtitle: Patch to mainline yet?
   
   Hi
   
   I don't see evidence of Neil's patch in 2.6.24, so I applied it
by hand on my server.
 
  I applied all 4 pending patches to .24. It's been better than .22
  and .23... Unfortunately the bitmap and raid1 patches don't go into
  .22.16.

 Neil, have these been sent up against 24-stable and 23-stable?

... and .22-stable?

Also, is this an xfs-on-raid5 bug, or would it also happen with 
ext3-on-raid5?

regards
 Burkhard



Re: 2.6.24-rc6 reproducible raid5 hang

2008-01-29 Thread Carlos Carvalho
Tim Southerwood ([EMAIL PROTECTED]) wrote on 28 January 2008 17:29:
 Subtitle: Patch to mainline yet?
 
 Hi
 
 I don't see evidence of Neil's patch in 2.6.24, so I applied it by hand
 on my server.

I applied all 4 pending patches to .24. It's been better than .22 and
.23... Unfortunately the bitmap and raid1 patches don't go into .22.16.


Re: 2.6.24-rc6 reproducible raid5 hang

2008-01-29 Thread Bill Davidsen

Carlos Carvalho wrote:

Tim Southerwood ([EMAIL PROTECTED]) wrote on 28 January 2008 17:29:
 Subtitle: Patch to mainline yet?
 
 Hi
 
 I don't see evidence of Neil's patch in 2.6.24, so I applied it by hand
 on my server.

I applied all 4 pending patches to .24. It's been better than .22 and
.23... Unfortunately the bitmap and raid1 patches don't go into .22.16.


Neil, have these been sent up against 24-stable and 23-stable?

--
Bill Davidsen [EMAIL PROTECTED]
 "Woe unto the statesman who makes war without a reason that will still
 be valid when the war is over..." - Otto von Bismarck





Re: 2.6.24-rc6 reproducible raid5 hang

2008-01-28 Thread Tim Southerwood

Subtitle: Patch to mainline yet?

Hi

I don't see evidence of Neil's patch in 2.6.24, so I applied it by hand
on my server.

Was that the correct thing to do, or did this issue get fixed in a 
different way that I wouldn't have spotted? I had a look at the git logs 
but it was not obvious - please pardon my ignorance, I'm not familiar 
enough with the code.


Many thanks,

Tim

Tim Southerwood wrote:

Carlos Carvalho wrote:

Tim Southerwood ([EMAIL PROTECTED]) wrote on 23 January 2008 13:37:
 Sorry if this breaks threaded mail readers, I only just subscribed 
to the list so don't have the original post to reply to.

 
 I believe I'm having the same problem.
 
 Regarding XFS on a raid5 md array:
 
 Kernels 2.6.22-14 (Ubuntu Gutsy generic and server builds) *and* 
 2.6.24-rc8 (pure build from virgin sources) compiled for amd64 arch.


This has already been corrected; install Neil's patches. They worked for
several people under high stress, including us.


Hi

I just coerced the patch into 2.6.23.14, reset 
/sys/block/md1/md/stripe_cache_size to default (256) and rebooted.


I can confirm that after 2 hours of heavy bashing[1] the system has not 
hung. Looks good - many thanks. But I will run with a stripe_cache_size 
of 4096 in practice as it improves write speed on my configuration by 
about 2.5 times.


Cheers

Tim



[1] Rsync >50GB to the raid plus xfs_fsr + dd 11GB of /dev/zero to the same 
filesystem.





Re: 2.6.24-rc6 reproducible raid5 hang

2008-01-24 Thread Tim Southerwood

Carlos Carvalho wrote:

Tim Southerwood ([EMAIL PROTECTED]) wrote on 23 January 2008 13:37:
 Sorry if this breaks threaded mail readers, I only just subscribed to 
 the list so don't have the original post to reply to.

 
 I believe I'm having the same problem.
 
 Regarding XFS on a raid5 md array:
 
 Kernels 2.6.22-14 (Ubuntu Gutsy generic and server builds) *and* 
 2.6.24-rc8 (pure build from virgin sources) compiled for amd64 arch.


This has already been corrected; install Neil's patches. They worked for
several people under high stress, including us.


Hi

I just coerced the patch into 2.6.23.14, reset 
/sys/block/md1/md/stripe_cache_size to default (256) and rebooted.


I can confirm that after 2 hours of heavy bashing[1] the system has not 
hung. Looks good - many thanks. But I will run with a stripe_cache_size 
of 4096 in practice as it improves write speed on my configuration by 
about 2.5 times.


Cheers

Tim



[1] Rsync >50GB to the raid plus xfs_fsr + dd 11GB of /dev/zero to the same 
filesystem.



Re: 2.6.24-rc6 reproducible raid5 hang

2008-01-23 Thread Tim Southerwood
Sorry if this breaks threaded mail readers, I only just subscribed to 
the list so don't have the original post to reply to.


I believe I'm having the same problem.

Regarding XFS on a raid5 md array:

Kernels 2.6.22-14 (Ubuntu Gutsy generic and server builds) *and* 
2.6.24-rc8 (pure build from virgin sources) compiled for amd64 arch.


RAID 5 configured across 4 x 500GB SATA disks (NForce sata_nv driver, 
Asus M2N-E mobo, Athlon X64, 4GB RAM).


MD Chunk size is 1024k. This is allocated to an LVM2 PV, then sliced up.
Taking one sample logical volume of 150GB I ran

mkfs.xfs -d su=1024k,sw=3 -L vol_linux /dev/vg00/vol_linux
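
(For context, the LVM layering described above would have been set up with 
something along these lines -- /dev/md1 as the PV is inferred from the sysfs 
paths mentioned later in this thread, and the names come from the mkfs line:

  pvcreate /dev/md1
  vgcreate vg00 /dev/md1
  lvcreate -L 150G -n vol_linux vg00
)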

I then found that putting high write load on that filesystem caused a 
hang. High load could be as little as a single rsync of a mirror of 
Ubuntu Gutsy (many tens of GB) from my old server to here. The hang 
would typically happen within a few hours.


I could generate relatively quick hangs by running xfs_fsr (defragger) 
in parallel.


Trying the workaround of upping /sys/block/md1/md/stripe_cache_size to 
4096 seems (fingers crossed) to have helped. Been running the rsync 
again, plus xfs_fsr + a few dd's of 11 GB to the same filesystem.


I did notice also that the write speed increased dramatically with a 
bigger stripe_cache_size.
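
For reference, the tunable involved here is set at runtime via sysfs and 
reverts on reboot (md1 being the array device in this setup):

  cat /sys/block/md1/md/stripe_cache_size       # default: 256
  echo 4096 > /sys/block/md1/md/stripe_cache_size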


A more detailed analysis of the problem indicated that, after the hang:

I could log in;

One CPU core was stuck in 100% IO wait.
The other core was usable, with care. So I managed to get a SysRq-T, and 
one place the system appeared blocked was via this path:


[ 2039.466258] xfs_fsr   D  0  7324   7308
[ 2039.466260]  810119399858 0082  
0046
[ 2039.466263]  810110d6c680 8101102ba998 8101102ba770 
8054e5e0
[ 2039.466265]  8101102ba998 00010014a1e6  
810110ddcb30

[ 2039.466268] Call Trace:
[ 2039.466277]  [8808a26b] :raid456:get_active_stripe+0x1cb/0x610
[ 2039.466282]  [80234000] default_wake_function+0x0/0x10
[ 2039.466289]  [88090ff8] :raid456:make_request+0x1f8/0x610
[ 2039.466293]  [80251c20] autoremove_wake_function+0x0/0x30
[ 2039.466295]  [80331121] __up_read+0x21/0xb0
[ 2039.466300]  [8031f336] generic_make_request+0x1d6/0x3d0
[ 2039.466303]  [80280bad] vm_normal_page+0x3d/0xc0
[ 2039.466307]  [8031f59f] submit_bio+0x6f/0xf0
[ 2039.466311]  [802c98cc] dio_bio_submit+0x5c/0x90
[ 2039.466313]  [802c9943] dio_send_cur_page+0x43/0xa0
[ 2039.466316]  [802c99ee] submit_page_section+0x4e/0x150
[ 2039.466319]  [802ca2e2] __blockdev_direct_IO+0x742/0xb50
[ 2039.466342]  [8832e9a2] :xfs:xfs_vm_direct_IO+0x182/0x190
[ 2039.466357]  [8832edb0] :xfs:xfs_get_blocks_direct+0x0/0x20
[ 2039.466370]  [8832e350] :xfs:xfs_end_io_direct+0x0/0x80
[ 2039.466375]  [80444fb5] __wait_on_bit_lock+0x65/0x80
[ 2039.466380]  [80272883] generic_file_direct_IO+0xe3/0x190
[ 2039.466385]  [802729a4] generic_file_direct_write+0x74/0x150
[ 2039.466402]  [88336db2] :xfs:xfs_write+0x492/0x8f0
[ 2039.466421]  [883099bc] :xfs:xfs_iunlock+0x2c/0xb0
[ 2039.466437]  [88336866] :xfs:xfs_read+0x186/0x240
[ 2039.466443]  [8029e5b9] do_sync_write+0xd9/0x120
[ 2039.466448]  [80251c20] autoremove_wake_function+0x0/0x30
[ 2039.466457]  [8029eead] vfs_write+0xdd/0x190
[ 2039.466461]  [8029f5b3] sys_write+0x53/0x90
[ 2039.466465]  [8020c29e] system_call+0x7e/0x83
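
(For anyone wanting to capture the same data: the task dump above can also 
be triggered from a shell, assuming sysrq is enabled:

  echo 1 > /proc/sys/kernel/sysrq     # enable sysrq if needed
  echo t > /proc/sysrq-trigger        # dump task states to the kernel log
)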


However, I'm of the opinion that the system should not deadlock, even if 
tunable parameters are unfavourable. I'm happy with the workaround 
(indeed the system performs better).


However, it will take me a week's worth of testing before I'm willing to 
commission this as my new fileserver.


So, if there is anything anyone would like me to try, I'm happy to 
volunteer as a guinea pig :)


Yes, I can build and patch kernels. But I'm not hot at debugging kernels 
so if kernel core dumps or whatever are needed, please point me at the 
right document or hint as to which commands I need to read about.


Cheers

Tim


Re: 2.6.24-rc6 reproducible raid5 hang

2008-01-23 Thread Carlos Carvalho
Tim Southerwood ([EMAIL PROTECTED]) wrote on 23 January 2008 13:37:
 Sorry if this breaks threaded mail readers, I only just subscribed to 
 the list so don't have the original post to reply to.
 
 I believe I'm having the same problem.
 
 Regarding XFS on a raid5 md array:
 
 Kernels 2.6.22-14 (Ubuntu Gutsy generic and server builds) *and* 
 2.6.24-rc8 (pure build from virgin sources) compiled for amd64 arch.

This has already been corrected; install Neil's patches. They worked for
several people under high stress, including us.


Re: 2.6.24-rc6 reproducible raid5 hang

2008-01-10 Thread dean gaudet
On Thu, 10 Jan 2008, Neil Brown wrote:

 On Wednesday January 9, [EMAIL PROTECTED] wrote:
  On Sun, 2007-12-30 at 10:58 -0700, dean gaudet wrote:
   i have evidence pointing to d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1
   
   http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1
   
   which was Neil's change in 2.6.22 for deferring generic_make_request 
   until there's enough stack space for it.
   
  
  Commit d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1 reduced stack utilization
  by preventing recursive calls to generic_make_request.  However the
  following conditions can cause raid5 to hang until 'stripe_cache_size' is
  increased:
  
 
 Thanks for pursuing this, guys.  That explanation certainly sounds very
 credible.
 
 The generic_make_request_immed is a good way to confirm that we have
 found the bug, but I don't like it as a long-term solution, as it
 just reintroduces the problem that we were trying to solve with the
 problematic commit.
 
 As you say, we could arrange that all request submission happens in
 raid5d and I think this is the right way to proceed.  However we can
 still take some of the work into the thread that is submitting the
 IO by calling raid5d() at the end of make_request, like this.
 
 Can you test it please?  Does it seem reasonable?
 
 Thanks,
 NeilBrown
 
 
 Signed-off-by: Neil Brown [EMAIL PROTECTED]

it has passed 11h of the untar/diff/rm linux.tar.gz workload... that's 
pretty good evidence it works for me.  thanks!

Tested-by: dean gaudet [EMAIL PROTECTED]

 
 ### Diffstat output
  ./drivers/md/md.c    |    2 +-
  ./drivers/md/raid5.c |    4 +++-
  2 files changed, 4 insertions(+), 2 deletions(-)
 
 diff .prev/drivers/md/md.c ./drivers/md/md.c
 --- .prev/drivers/md/md.c 2008-01-07 13:32:10.0 +1100
 +++ ./drivers/md/md.c 2008-01-10 11:08:02.0 +1100
 @@ -5774,7 +5774,7 @@ void md_check_recovery(mddev_t *mddev)
   if (mddev->ro)
   return;
  
 - if (signal_pending(current)) {
 + if (current == mddev->thread->tsk && signal_pending(current)) {
   if (mddev->pers->sync_request) {
   printk(KERN_INFO "md: %s in immediate safe mode\n",
  mdname(mddev));
 
 diff .prev/drivers/md/raid5.c ./drivers/md/raid5.c
 --- .prev/drivers/md/raid5.c  2008-01-07 13:32:10.0 +1100
 +++ ./drivers/md/raid5.c  2008-01-10 11:06:54.0 +1100
 @@ -3432,6 +3432,7 @@ static int chunk_aligned_read(struct req
   }
  }
  
 +static void raid5d (mddev_t *mddev);
  
  static int make_request(struct request_queue *q, struct bio * bi)
  {
 @@ -3547,7 +3548,7 @@ static int make_request(struct request_q
   goto retry;
   }
   finish_wait(&conf->wait_for_overlap, &w);
 - handle_stripe(sh, NULL);
 + set_bit(STRIPE_HANDLE, &sh->state);
   release_stripe(sh);
   } else {
   /* cannot get stripe for read-ahead, just give-up */
 @@ -3569,6 +3570,7 @@ static int make_request(struct request_q
 test_bit(BIO_UPTODATE, &bi->bi_flags)
   ? 0 : -EIO);
   }
 + raid5d(mddev);
   return 0;
  }
  
 


Re: 2.6.24-rc6 reproducible raid5 hang

2008-01-10 Thread Dan Williams
On Jan 10, 2008 12:13 AM, dean gaudet [EMAIL PROTECTED] wrote:
 w.r.t. dan's cfq comments -- i really don't know the details, but does
 this mean cfq will misattribute the IO to the wrong user/process?  or is
 it just a concern that CPU time will be spent on someone's IO?  the latter
 is fine to me... the former seems sucky because with today's multicore
 systems CPU time seems cheap compared to IO.


I do not see this affecting the time slicing feature of cfq, because
as Neil says the work has to get done at some point.   If I give up
some of my slice working on someone else's I/O chances are the favor
will be returned in kind since the code does not discriminate.  The
io-priority capability of cfq currently does not work as advertised
with current MD since the priority is tied to the current thread and
the thread that actually submits the i/o on a stripe is
non-deterministic.  So I do not see this change making the situation
any worse.  In fact, it may make it a bit better since there is a
higher chance for the thread submitting i/o to MD to do its own i/o to
the backing disks.

Reviewed-by: Dan Williams [EMAIL PROTECTED]


Re: 2.6.24-rc6 reproducible raid5 hang

2008-01-10 Thread Neil Brown
On Thursday January 10, [EMAIL PROTECTED] wrote:
 On Jan 10, 2008 12:13 AM, dean gaudet [EMAIL PROTECTED] wrote:
  w.r.t. dan's cfq comments -- i really don't know the details, but does
  this mean cfq will misattribute the IO to the wrong user/process?  or is
  it just a concern that CPU time will be spent on someone's IO?  the latter
  is fine to me... the former seems sucky because with today's multicore
  systems CPU time seems cheap compared to IO.
 
 
 I do not see this affecting the time slicing feature of cfq, because
 as Neil says the work has to get done at some point.   If I give up
 some of my slice working on someone else's I/O chances are the favor
 will be returned in kind since the code does not discriminate.  The
 io-priority capability of cfq currently does not work as advertised
 with current MD since the priority is tied to the current thread and
 the thread that actually submits the i/o on a stripe is
 non-deterministic.  So I do not see this change making the situation
 any worse.  In fact, it may make it a bit better since there is a
 higher chance for the thread submitting i/o to MD to do its own i/o to
 the backing disks.
 
 Reviewed-by: Dan Williams [EMAIL PROTECTED]

Thanks.
But I suspect you didn't test it with a bitmap :-)
I ran the mdadm test suite and it hit a problem - easy enough to fix.

I'll look out for any other possible related problem (due to raid5d
running in different processes) and then submit it.

Thanks,
NeilBrown


Re: 2.6.24-rc6 reproducible raid5 hang

2008-01-10 Thread dean gaudet
On Fri, 11 Jan 2008, Neil Brown wrote:

 Thanks.
 But I suspect you didn't test it with a bitmap :-)
 I ran the mdadm test suite and it hit a problem - easy enough to fix.

damn -- i lost my bitmap 'cause it was external and i didn't have things 
set up properly to pick it up after a reboot :)
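
(For the record, a sketch of re-attaching an external bitmap to an already 
running array -- the bitmap file path here is hypothetical:

  mdadm --grow /dev/md2 --bitmap=/somewhere/md2-bitmap

and the same --bitmap= path has to be passed again at assembly time after a 
reboot, e.g. with mdadm --assemble --bitmap=... , which is the step that was 
missed here.)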

if you send an updated patch i'll give it another spin...

-dean


Re: 2.6.24-rc6 reproducible raid5 hang

2008-01-09 Thread Dan Williams
On Sun, 2007-12-30 at 10:58 -0700, dean gaudet wrote:
 On Sat, 29 Dec 2007, Dan Williams wrote:
 
  On Dec 29, 2007 1:58 PM, dean gaudet [EMAIL PROTECTED] wrote: 
   On Sat, 29 Dec 2007, Dan Williams wrote: 
   
On Dec 29, 2007 9:48 AM, dean gaudet [EMAIL PROTECTED] wrote: 
 hmm bummer, i'm doing another test (rsync 3.5M inodes from another 
 box) on 
 the same 64k chunk array and had raised the stripe_cache_size to 
 1024... 
 and got a hang.  this time i grabbed stripe_cache_active before 
 bumping 
 the size again -- it was only 905 active.  as i recall the bug we 
 were 
 debugging a year+ ago the active was at the size when it would hang.  
 so 
 this is probably something new. 

I believe I am seeing the same issue and am trying to track down 
whether XFS is doing something unexpected, i.e. I have not been able 
to reproduce the problem with EXT3.  MD tries to increase throughput 
by letting some stripe work build up in batches.  It looks like every 
time your system has hung it has been in the 'inactive_blocked' state 
 i.e. > 3/4 of stripes active.  This state should automatically 
clear... 
   
   cool, glad you can reproduce it :) 
   
   i have a bit more data... i'm seeing the same problem on debian's 
   2.6.22-3-amd64 kernel, so it's not new in 2.6.24. 
   
  
  This is just brainstorming at this point, but it looks like xfs can 
  submit more requests in the bi_end_io path such that it can lock 
  itself out of the RAID array.  The sequence that concerns me is: 
  
  return_io->xfs_buf_end_io->xfs_buf_io_end->xfs_buf_iodone_work->xfs_buf_iorequest->make_request->hang
   
  
  I need to verify whether this path is actually triggering, but if we are 
  in an inactive_blocked condition this new request will be put on a 
  wait queue and we'll never get to the release_stripe() call after 
  return_io().  It would be interesting to see if this is new XFS 
  behavior in recent kernels.
 
 
 i have evidence pointing to d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1
 
 http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1
 
 which was Neil's change in 2.6.22 for deferring generic_make_request 
 until there's enough stack space for it.
 
 with my git tree sync'd to that commit my test cases fail in under 20 
 minutes uptime (i rebooted and tested 3x).  sync'd to the commit previous 
 to it i've got 8h of run-time now without the problem.
 
 this isn't definitive of course since it does seem to be timing 
 dependent, but since all failures have occurred much earlier than that 
 for me so far i think this indicates this change is either the cause of 
 the problem or exacerbates an existing raid5 problem.
 
 given that this problem looks like a very rare problem i saw with 2.6.18 
 (raid5+xfs there too) i'm thinking Neil's commit may just exacerbate an 
 existing problem... not that i have evidence either way.
 
 i've attached a new kernel log with a hang at d89d87965d... and the 
 reduced config file i was using for the bisect.  hopefully the hang 
 looks the same as what we were seeing at 2.6.24-rc6.  let me know.
 

Dean, could you try the patch below to see if it fixes your failure
scenario?  It passes my test case.

Thanks,
Dan

---
md: add generic_make_request_immed to prevent raid5 hang

From: Dan Williams [EMAIL PROTECTED]

Commit d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1 reduced stack utilization
by preventing recursive calls to generic_make_request.  However the
following conditions can cause raid5 to hang until 'stripe_cache_size' is
increased:

1/ stripe_cache_active is N stripes away from the 'inactive_blocked' limit
   (3/4 * stripe_cache_size)
2/ a bio is submitted that requires M stripes to be processed where M > N
3/ stripes 1 through N are up-to-date and ready for immediate processing,
   i.e. no trip through raid5d required

This results in the calling thread hanging while waiting for resources to
process stripes N through M.  This means we never return from make_request.
All other raid5 users pile up in get_active_stripe.  Increasing
stripe_cache_size temporarily resolves the blockage by allowing the blocked
make_request to return to generic_make_request.

Another way to solve this is to move all i/o submission to raid5d context.

Thanks to Dean Gaudet for bisecting this down to d89d8796.

Signed-off-by: Dan Williams [EMAIL PROTECTED]
---

 block/ll_rw_blk.c      |   16 +++++++++++++---
 drivers/md/raid5.c     |    4 ++--
 include/linux/blkdev.h |    1 +
 3 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/block/ll_rw_blk.c b/block/ll_rw_blk.c
index 8b91994..bff40c2 100644
--- a/block/ll_rw_blk.c
+++ b/block/ll_rw_blk.c
@@ -3287,16 +3287,26 @@ end_io:
 }
 
 /*
- * We only want one ->make_request_fn to be active at a time,
- * else stack usage with stacked devices could be a problem.
+ * In the general case we only want one ->make_request_fn to be 

Re: 2.6.24-rc6 reproducible raid5 hang

2008-01-09 Thread Neil Brown
On Wednesday January 9, [EMAIL PROTECTED] wrote:
 On Sun, 2007-12-30 at 10:58 -0700, dean gaudet wrote:
  i have evidence pointing to d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1
  
  http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1
  
  which was Neil's change in 2.6.22 for deferring generic_make_request 
  until there's enough stack space for it.
  
 
 Commit d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1 reduced stack utilization
 by preventing recursive calls to generic_make_request.  However the
 following conditions can cause raid5 to hang until 'stripe_cache_size' is
 increased:
 

Thanks for pursuing this, guys.  That explanation certainly sounds very
credible.

The generic_make_request_immed is a good way to confirm that we have
found the bug, but I don't like it as a long-term solution, as it
just reintroduces the problem that we were trying to solve with the
problematic commit.

As you say, we could arrange that all request submission happens in
raid5d and I think this is the right way to proceed.  However we can
still take some of the work into the thread that is submitting the
IO by calling raid5d() at the end of make_request, like this.

Can you test it please?  Does it seem reasonable?

Thanks,
NeilBrown


Signed-off-by: Neil Brown [EMAIL PROTECTED]

### Diffstat output
 ./drivers/md/md.c    |    2 +-
 ./drivers/md/raid5.c |    4 +++-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff .prev/drivers/md/md.c ./drivers/md/md.c
--- .prev/drivers/md/md.c   2008-01-07 13:32:10.0 +1100
+++ ./drivers/md/md.c   2008-01-10 11:08:02.0 +1100
@@ -5774,7 +5774,7 @@ void md_check_recovery(mddev_t *mddev)
	if (mddev->ro)
return;
 
-   if (signal_pending(current)) {
+   if (current == mddev->thread->tsk && signal_pending(current)) {
 if (mddev->pers->sync_request) {
 printk(KERN_INFO "md: %s in immediate safe mode\n",
   mdname(mddev));

diff .prev/drivers/md/raid5.c ./drivers/md/raid5.c
--- .prev/drivers/md/raid5.c2008-01-07 13:32:10.0 +1100
+++ ./drivers/md/raid5.c2008-01-10 11:06:54.0 +1100
@@ -3432,6 +3432,7 @@ static int chunk_aligned_read(struct req
}
 }
 
+static void raid5d (mddev_t *mddev);
 
 static int make_request(struct request_queue *q, struct bio * bi)
 {
@@ -3547,7 +3548,7 @@ static int make_request(struct request_q
goto retry;
}
finish_wait(&conf->wait_for_overlap, &w);
-   handle_stripe(sh, NULL);
+   set_bit(STRIPE_HANDLE, &sh->state);
release_stripe(sh);
} else {
/* cannot get stripe for read-ahead, just give-up */
@@ -3569,6 +3570,7 @@ static int make_request(struct request_q
  test_bit(BIO_UPTODATE, &bi->bi_flags)
? 0 : -EIO);
}
+   raid5d(mddev);
return 0;
 }
 


Re: 2.6.24-rc6 reproducible raid5 hang

2008-01-09 Thread Dan Williams
On Jan 9, 2008 5:09 PM, Neil Brown [EMAIL PROTECTED] wrote:
 On Wednesday January 9, [EMAIL PROTECTED] wrote:
  On Sun, 2007-12-30 at 10:58 -0700, dean gaudet wrote:
   i have evidence pointing to d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1
  
   http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1
  
   which was Neil's change in 2.6.22 for deferring generic_make_request
   until there's enough stack space for it.
  
 
  Commit d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1 reduced stack utilization
  by preventing recursive calls to generic_make_request.  However the
  following conditions can cause raid5 to hang until 'stripe_cache_size' is
  increased:
 

 Thanks for pursuing this, guys.  That explanation certainly sounds very
 credible.

 The generic_make_request_immed is a good way to confirm that we have
 found the bug, but I don't like it as a long-term solution, as it
 just reintroduces the problem that we were trying to solve with the
 problematic commit.

 As you say, we could arrange that all request submission happens in
 raid5d and I think this is the right way to proceed.  However we can
 still take some of the work into the thread that is submitting the
 IO by calling raid5d() at the end of make_request, like this.

 Can you test it please?

This passes my failure case.

However, my test is different from Dean's in that I am using tiobench
and the latest rev of my 'get_priority_stripe' patch. I believe the
failure mechanism is the same, but it would be good to get
confirmation from Dean.  get_priority_stripe has the effect of
increasing the frequency of
make_request->handle_stripe->generic_make_request sequences.

 Does it seem reasonable?

What do you think about limiting the number of stripes the submitting
thread handles to be equal to what it submitted?  If I'm a thread that
only submits 1 stripe worth of work, should I get stuck handling the
rest of the cache?

Regards,
Dan


Re: 2.6.24-rc6 reproducible raid5 hang

2008-01-09 Thread Neil Brown
On Wednesday January 9, [EMAIL PROTECTED] wrote:
 On Jan 9, 2008 5:09 PM, Neil Brown [EMAIL PROTECTED] wrote:
  On Wednesday January 9, [EMAIL PROTECTED] wrote:
 
  Can you test it please?
 
 This passes my failure case.

Thanks!

 
  Does it seem reasonable?
 
 What do you think about limiting the number of stripes the submitting
  thread handles to be equal to what it submitted?  If I'm a thread that
  only submits 1 stripe worth of work, should I get stuck handling the
 rest of the cache?

Dunno
Someone has to do the work, and leaving it all to raid5d means that it
all gets done on one CPU.
I expect that most of the time the queue of ready stripes is empty so
make_request will mostly only handle its own stripes anyway.
The times that it handles other threads' stripes will probably balance
out with the times that other threads handle this thread's stripes.

So I'm inclined to leave it as "do as much work as is available to be
done", as that is simplest.  But I can probably be talked out of it
with a convincing argument...

NeilBrown


Re: 2.6.24-rc6 reproducible raid5 hang

2008-01-09 Thread Dan Williams
On Wed, 2008-01-09 at 20:57 -0700, Neil Brown wrote:
 So I'm inclined to leave it as "do as much work as is available to be
 done", as that is simplest.  But I can probably be talked out of it
 with a convincing argument...

Well, in an age of CFS and CFQ it smacks of 'unfairness'.  But does that
trump KISS...? Probably not.

--
Dan



Re: 2.6.24-rc6 reproducible raid5 hang

2007-12-30 Thread dean gaudet
On Sat, 29 Dec 2007, Dan Williams wrote:

 On Dec 29, 2007 1:58 PM, dean gaudet [EMAIL PROTECTED] wrote:
  On Sat, 29 Dec 2007, Dan Williams wrote:
 
   On Dec 29, 2007 9:48 AM, dean gaudet [EMAIL PROTECTED] wrote:
hmm bummer, i'm doing another test (rsync 3.5M inodes from another box) 
on
the same 64k chunk array and had raised the stripe_cache_size to 1024...
and got a hang.  this time i grabbed stripe_cache_active before bumping
the size again -- it was only 905 active.  as i recall the bug we were
debugging a year+ ago the active was at the size when it would hang.  so
this is probably something new.
  
   I believe I am seeing the same issue and am trying to track down
   whether XFS is doing something unexpected, i.e. I have not been able
   to reproduce the problem with EXT3.  MD tries to increase throughput
   by letting some stripe work build up in batches.  It looks like every
   time your system has hung it has been in the 'inactive_blocked' state
    i.e. > 3/4 of stripes active.  This state should automatically 
   clear...
 
  cool, glad you can reproduce it :)
 
  i have a bit more data... i'm seeing the same problem on debian's
  2.6.22-3-amd64 kernel, so it's not new in 2.6.24.
 
 
 This is just brainstorming at this point, but it looks like xfs can
 submit more requests in the bi_end_io path such that it can lock
 itself out of the RAID array.  The sequence that concerns me is:
 
 return_io->xfs_buf_end_io->xfs_buf_io_end->xfs_buf_iodone_work->xfs_buf_iorequest->make_request->hang
 
 I need to verify whether this path is actually triggering, but if we are 
 in an inactive_blocked condition this new request will be put on a
 wait queue and we'll never get to the release_stripe() call after
 return_io().  It would be interesting to see if this is new XFS
 behavior in recent kernels.


i have evidence pointing to d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1

which was Neil's change in 2.6.22 for deferring generic_make_request
until there's enough stack space for it.

with my git tree sync'd to that commit my test cases fail in under 20
minutes uptime (i rebooted and tested 3x).  sync'd to the commit previous
to it i've got 8h of run-time now without the problem.

this isn't definitive of course since it does seem to be timing
dependent, but since all failures have occurred much earlier than that
for me so far i think this indicates this change is either the cause of
the problem or exacerbates an existing raid5 problem.

given that this problem looks like a very rare problem i saw with 2.6.18
(raid5+xfs there too) i'm thinking Neil's commit may just exacerbate an
existing problem... not that i have evidence either way.

i've attached a new kernel log with a hang at d89d87965d... and the
reduced config file i was using for the bisect.  hopefully the hang
looks the same as what we were seeing at 2.6.24-rc6.  let me know.

-dean

kern.log.d89d87965d.bz2
Description: Binary data


config-2.6.21-b1.bz2
Description: Binary data


Re: 2.6.24-rc6 reproducible raid5 hang

2007-12-29 Thread dean gaudet
hmm bummer, i'm doing another test (rsync 3.5M inodes from another box) on 
the same 64k chunk array and had raised the stripe_cache_size to 1024... 
and got a hang.  this time i grabbed stripe_cache_active before bumping 
the size again -- it was only 905 active.  as i recall the bug we were 
debugging a year+ ago the active was at the size when it would hang.  so 
this is probably something new.

anyhow raising it to 2048 got it unstuck, but i'm guessing i'll be able to 
hit that limit too if i try harder :)

btw what units are stripe_cache_size/active in?  is the memory consumed 
equal to (chunk_size * raid_disks * stripe_cache_size) or (chunk_size * 
raid_disks * stripe_cache_active)?

-dean

On Thu, 27 Dec 2007, dean gaudet wrote:

 hmm this seems more serious... i just ran into it with chunksize 64KiB 
 while just untarring a bunch of linux kernels in parallel... increasing 
 stripe_cache_size did the trick again.
 
 -dean
 
 On Thu, 27 Dec 2007, dean gaudet wrote:
 
  hey neil -- remember that raid5 hang which me and only one or two others 
  ever experienced and which was hard to reproduce?  we were debugging it 
  well over a year ago (that box has 400+ day uptime now so at least that 
  long ago :)  the workaround was to increase stripe_cache_size... i seem to 
  have a way to reproduce something which looks much the same.
  
  setup:
  
  - 2.6.24-rc6
  - system has 8GiB RAM but no swap
  - 8x750GB in a raid5 with one spare, chunksize 1024KiB.
  - mkfs.xfs default options
  - mount -o noatime
  - dd if=/dev/zero of=/mnt/foo bs=4k count=2621440
  
  that sequence hangs for me within 10 seconds... and i can unhang / rehang 
  it by toggling between stripe_cache_size 256 and 1024.  i detect the hang 
  by watching iostat -kx /dev/sd? 5.
  
  i've attached the kernel log where i dumped task and timer state while it 
  was hung... note that you'll see at some point i did an xfs mount with 
  external journal but it happens with internal journal as well.
  
  looks like it's using the raid456 module and async api.
  
  anyhow let me know if you need more info / have any suggestions.
  
  -dean
 


Re: 2.6.24-rc6 reproducible raid5 hang

2007-12-29 Thread Dan Williams
On Dec 29, 2007 9:48 AM, dean gaudet [EMAIL PROTECTED] wrote:
 hmm bummer, i'm doing another test (rsync 3.5M inodes from another box) on
 the same 64k chunk array and had raised the stripe_cache_size to 1024...
 and got a hang.  this time i grabbed stripe_cache_active before bumping
 the size again -- it was only 905 active.  as i recall the bug we were
 debugging a year+ ago the active was at the size when it would hang.  so
 this is probably something new.

I believe I am seeing the same issue and am trying to track down
whether XFS is doing something unexpected, i.e. I have not been able
to reproduce the problem with EXT3.  MD tries to increase throughput
by letting some stripe work build up in batches.  It looks like every
time your system has hung it has been in the 'inactive_blocked' state
i.e. > 3/4 of stripes active.  This state should automatically
clear...


 anyhow raising it to 2048 got it unstuck, but i'm guessing i'll be able to
 hit that limit too if i try harder :)

Once you hang, if 'stripe_cache_size' is increased such that
stripe_cache_active < 3/4 * stripe_cache_size, things will start
flowing again.
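
Both values are visible in sysfs, so the blocked state and the unblock can 
be watched directly (md2 is the array device used elsewhere in this thread):

  cat /sys/block/md2/md/stripe_cache_active
  cat /sys/block/md2/md/stripe_cache_size
  echo 2048 > /sys/block/md2/md/stripe_cache_size   # raise the limit to unblock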


 btw what units are stripe_cache_size/active in?  is the memory consumed
 equal to (chunk_size * raid_disks * stripe_cache_size) or (chunk_size *
 raid_disks * stripe_cache_active)?


memory_consumed = PAGE_SIZE * raid_disks * stripe_cache_size
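
So for the 7-in-array-disk setup described earlier in this thread (8 drives 
with one spare), with 4 KiB pages and stripe_cache_size raised to 1024, that 
works out to roughly 28 MiB:

  echo $(( 4096 * 7 * 1024 / 1048576 ))   # 28 (MiB)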


 -dean


--
Dan


Re: 2.6.24-rc6 reproducible raid5 hang

2007-12-29 Thread dean gaudet
On Sat, 29 Dec 2007, Dan Williams wrote:

 On Dec 29, 2007 9:48 AM, dean gaudet [EMAIL PROTECTED] wrote:
  hmm bummer, i'm doing another test (rsync 3.5M inodes from another box) on
  the same 64k chunk array and had raised the stripe_cache_size to 1024...
  and got a hang.  this time i grabbed stripe_cache_active before bumping
  the size again -- it was only 905 active.  as i recall the bug we were
  debugging a year+ ago the active was at the size when it would hang.  so
  this is probably something new.
 
 I believe I am seeing the same issue and am trying to track down
 whether XFS is doing something unexpected, i.e. I have not been able
 to reproduce the problem with EXT3.  MD tries to increase throughput
 by letting some stripe work build up in batches.  It looks like every
 time your system has hung it has been in the 'inactive_blocked' state
 i.e. > 3/4 of stripes active.  This state should automatically
 clear...

cool, glad you can reproduce it :)

i have a bit more data... i'm seeing the same problem on debian's 
2.6.22-3-amd64 kernel, so it's not new in 2.6.24.

i'm doing some more isolation but just grabbing kernels i have precompiled 
so far -- a 2.6.19.7 kernel doesn't show the problem, and early 
indications are a 2.6.21.7 kernel also doesn't have the problem but i'm 
giving it longer to show its head.

i'll try a stock 2.6.22 next depending on how the 2.6.21 test goes, just 
so we get the debian patches out of the way.

i was tempted to blame async api because it's newish :)  but according to 
the dmesg output it doesn't appear the 2.6.22-3-amd64 kernel used async 
API, and it still hung, so async is probably not to blame.

anyhow the test case i'm using is the dma_thrasher script i attached... it 
takes about an hour to give me confidence there's no problems so this will 
take a while.

-dean


Re: 2.6.24-rc6 reproducible raid5 hang

2007-12-29 Thread Justin Piszcz



On Sat, 29 Dec 2007, dean gaudet wrote:


On Sat, 29 Dec 2007, Dan Williams wrote:


On Dec 29, 2007 9:48 AM, dean gaudet [EMAIL PROTECTED] wrote:

hmm bummer, i'm doing another test (rsync 3.5M inodes from another box) on
the same 64k chunk array and had raised the stripe_cache_size to 1024...
and got a hang.  this time i grabbed stripe_cache_active before bumping
the size again -- it was only 905 active.  as i recall the bug we were
debugging a year+ ago the active was at the size when it would hang.  so
this is probably something new.


I believe I am seeing the same issue and am trying to track down
whether XFS is doing something unexpected, i.e. I have not been able
to reproduce the problem with EXT3.  MD tries to increase throughput
by letting some stripe work build up in batches.  It looks like every
time your system has hung it has been in the 'inactive_blocked' state
i.e. > 3/4 of stripes active.  This state should automatically
clear...


cool, glad you can reproduce it :)

i have a bit more data... i'm seeing the same problem on debian's
2.6.22-3-amd64 kernel, so it's not new in 2.6.24.

i'm doing some more isolation but just grabbing kernels i have precompiled
so far -- a 2.6.19.7 kernel doesn't show the problem, and early
indications are a 2.6.21.7 kernel also doesn't have the problem but i'm
giving it longer to show its head.

i'll try a stock 2.6.22 next depending on how the 2.6.21 test goes, just
so we get the debian patches out of the way.

i was tempted to blame async api because it's newish :)  but according to
the dmesg output it doesn't appear the 2.6.22-3-amd64 kernel used async
API, and it still hung, so async is probably not to blame.

anyhow the test case i'm using is the dma_thrasher script i attached... it
takes about an hour to give me confidence there's no problems so this will
take a while.

-dean



Dean,

Curious, btw: what filesystem size / RAID type (5, but defaults I assume, 
nothing special, right? right-symmetric vs. left-symmetric, etc.) / cache 
size / chunk size(s) are you using/testing with?


The script you sent out earlier, you are able to reproduce it easily with 
31 or so kernel tar decompressions?


Justin.


Re: 2.6.24-rc6 reproducible raid5 hang

2007-12-29 Thread Dan Williams
On Dec 29, 2007 1:58 PM, dean gaudet [EMAIL PROTECTED] wrote:
 On Sat, 29 Dec 2007, Dan Williams wrote:

  On Dec 29, 2007 9:48 AM, dean gaudet [EMAIL PROTECTED] wrote:
   hmm bummer, i'm doing another test (rsync 3.5M inodes from another box) on
   the same 64k chunk array and had raised the stripe_cache_size to 1024...
   and got a hang.  this time i grabbed stripe_cache_active before bumping
   the size again -- it was only 905 active.  as i recall the bug we were
   debugging a year+ ago the active was at the size when it would hang.  so
   this is probably something new.
 
  I believe I am seeing the same issue and am trying to track down
  whether XFS is doing something unexpected, i.e. I have not been able
  to reproduce the problem with EXT3.  MD tries to increase throughput
  by letting some stripe work build up in batches.  It looks like every
  time your system has hung it has been in the 'inactive_blocked' state
  i.e. > 3/4 of stripes active.  This state should automatically
  clear...

 cool, glad you can reproduce it :)

 i have a bit more data... i'm seeing the same problem on debian's
 2.6.22-3-amd64 kernel, so it's not new in 2.6.24.


This is just brainstorming at this point, but it looks like xfs can
submit more requests in the bi_end_io path such that it can lock
itself out of the RAID array.  The sequence that concerns me is:

return_io->xfs_buf_end_io->xfs_buf_io_end->xfs_buf_iodone_work->xfs_buf_iorequest->make_request->hang

I need to verify whether this path is actually triggering, but if we are
in an inactive_blocked condition this new request will be put on a
wait queue and we'll never get to the release_stripe() call after
return_io().  It would be interesting to see if this is new XFS
behavior in recent kernels.

--
Dan


Re: 2.6.24-rc6 reproducible raid5 hang

2007-12-29 Thread dean gaudet
On Sat, 29 Dec 2007, Justin Piszcz wrote:

 Curious, btw: what filesystem size / RAID type (5, but defaults I assume,
 nothing special, right? right-symmetric vs. left-symmetric, etc.) / cache
 size / chunk size(s) are you using/testing with?

mdadm --create --level=5 --chunk=64 -n7 -x1 /dev/md2 /dev/sd[a-h]1
mkfs.xfs -f /dev/md2

otherwise defaults

 The script you sent out earlier, you are able to reproduce it easily with 31
 or so kernel tar decompressions?

not sure, the point of the script is to untar more than there is RAM.  it 
happened with a single rsync running though -- 3.5M inodes from a remote 
box.  it also happens with the single 10GB dd write... although i've been 
using the tar method for testing different kernel revs.

-dean


Re: 2.6.24-rc6 reproducible raid5 hang

2007-12-29 Thread dean gaudet

On Sat, 29 Dec 2007, dean gaudet wrote:

 On Sat, 29 Dec 2007, Justin Piszcz wrote:
 
  Curious, btw: what filesystem size / RAID type (5, but defaults I
  assume, nothing special, right? right-symmetric vs. left-symmetric,
  etc.) / cache size / chunk size(s) are you using/testing with?
 
 mdadm --create --level=5 --chunk=64 -n7 -x1 /dev/md2 /dev/sd[a-h]1
 mkfs.xfs -f /dev/md2
 
 otherwise defaults

hmm i missed a few things, here's exactly how i created the array:

mdadm --create --level=5 --chunk=64 -n7 -x1 --assume-clean /dev/md2 
/dev/sd[a-h]1

it's reassembled automagically each reboot, but i do this each reboot:

mkfs.xfs -f /dev/md2
mount -o noatime /dev/md2 /mnt/new
./dma_thrasher linux.tar.gz /mnt/new

the --assume-clean and noatime probably make no difference though...

on the bisection front it looks like it's new behaviour between 2.6.21.7 
and 2.6.22.15 (stock kernels now, not debian).

i've got to step out for a while, but i'll go at it again later, probably 
with git bisect unless someone has some cherry picked changes to suggest.

-dean


Re: 2.6.24-rc6 reproducible raid5 hang

2007-12-27 Thread dean gaudet
hmm this seems more serious... i just ran into it with chunksize 64KiB 
while just untarring a bunch of linux kernels in parallel... increasing 
stripe_cache_size did the trick again.

-dean

On Thu, 27 Dec 2007, dean gaudet wrote:

 hey neil -- remember that raid5 hang which me and only one or two others 
 ever experienced and which was hard to reproduce?  we were debugging it 
 well over a year ago (that box has 400+ day uptime now so at least that 
 long ago :)  the workaround was to increase stripe_cache_size... i seem to 
 have a way to reproduce something which looks much the same.
 
 setup:
 
 - 2.6.24-rc6
 - system has 8GiB RAM but no swap
 - 8x750GB in a raid5 with one spare, chunksize 1024KiB.
 - mkfs.xfs default options
 - mount -o noatime
 - dd if=/dev/zero of=/mnt/foo bs=4k count=2621440
 
 that sequence hangs for me within 10 seconds... and i can unhang / rehang 
 it by toggling between stripe_cache_size 256 and 1024.  i detect the hang 
 by watching iostat -kx /dev/sd? 5.
 
 i've attached the kernel log where i dumped task and timer state while it 
 was hung... note that you'll see at some point i did an xfs mount with 
 external journal but it happens with internal journal as well.
 
 looks like it's using the raid456 module and async api.
 
 anyhow let me know if you need more info / have any suggestions.
 
 -dean


Re: 2.6.24-rc6 reproducible raid5 hang

2007-12-27 Thread Justin Piszcz



On Thu, 27 Dec 2007, dean gaudet wrote:


hey neil -- remember that raid5 hang which me and only one or two others
ever experienced and which was hard to reproduce?  we were debugging it
well over a year ago (that box has 400+ day uptime now so at least that
long ago :)  the workaround was to increase stripe_cache_size... i seem to
have a way to reproduce something which looks much the same.

setup:

- 2.6.24-rc6
- system has 8GiB RAM but no swap
- 8x750GB in a raid5 with one spare, chunksize 1024KiB.
- mkfs.xfs default options
- mount -o noatime
- dd if=/dev/zero of=/mnt/foo bs=4k count=2621440

that sequence hangs for me within 10 seconds... and i can unhang / rehang
it by toggling between stripe_cache_size 256 and 1024.  i detect the hang
by watching iostat -kx /dev/sd? 5.

i've attached the kernel log where i dumped task and timer state while it
was hung... note that you'll see at some point i did an xfs mount with
external journal but it happens with internal journal as well.

looks like it's using the raid456 module and async api.

anyhow let me know if you need more info / have any suggestions.

-dean


With a chunk size that high, the stripe_cache_size needs to be greater 
than the default to handle it.


Justin.


Re: 2.6.24-rc6 reproducible raid5 hang

2007-12-27 Thread dean gaudet
On Thu, 27 Dec 2007, Justin Piszcz wrote:

 With a chunk size that high, the stripe_cache_size needs to be greater than
 the default to handle it.

i'd argue that any deadlock is a bug...

regardless i'm still seeing deadlocks with the default chunk_size of 64k 
and stripe_cache_size of 256... in this case it's with a workload which is 
untarring 34 copies of the linux kernel at the same time.  it's a variant 
of doug ledford's memtest, and i've attached it.

-dean

#!/usr/bin/perl

# Copyright (c) 2007 dean gaudet [EMAIL PROTECTED]
#
# Permission is hereby granted, free of charge, to any person obtaining a
# copy of this software and associated documentation files (the "Software"),
# to deal in the Software without restriction, including without limitation
# the rights to use, copy, modify, merge, publish, distribute, sublicense,
# and/or sell copies of the Software, and to permit persons to whom the
# Software is furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included
# in all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
# THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR
# OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
# ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
# OTHER DEALINGS IN THE SOFTWARE.

# this idea shamelessly stolen from doug ledford

use warnings;
use strict;

# ensure stdout is not buffered
select(STDOUT); $| = 1;

my $usage = "usage: $0 linux.tar.gz /path1 [/path2 ...]\n";
defined(my $tarball = shift) or die $usage;
-f $tarball or die "$tarball does not exist or is not a file\n";

my @paths = @ARGV;
$#paths >= 0 or die $usage;

# determine size of uncompressed tarball
open(GZIP, "-|") || exec "gzip", "--quiet", "--list", $tarball;
my $line = <GZIP>;
my ($tarball_size) = $line =~ m#^\s*\d+\s*(\d+)#;
defined($tarball_size) or die "unexpected result from gzip --quiet --list $tarball\n";
close(GZIP);

# determine amount of memory
open(MEMINFO, "/proc/meminfo")
  or die "unable to open /proc/meminfo for read: $!\n";
my $total_mem;
while (<MEMINFO>) {
  if (/^MemTotal:\s*(\d+)\s*kB/) {
    $total_mem = $1;
    last;
  }
}
defined($total_mem) or die "did not find MemTotal line in /proc/meminfo\n";
close(MEMINFO);
$total_mem *= 1024;

print "total memory: $total_mem\n";
print "uncompressed tarball: $tarball_size\n";
my $nr_simultaneous = int(1.2 * $total_mem / $tarball_size);
print "nr simultaneous processes: $nr_simultaneous\n";

sub system_or_die {
  my @args = @_;
  system(@args);
  # system() returns -1 when the fork/exec itself failed
  if ($? == -1) {
    my $msg = sprintf("%s failed to exec %s: $!\n", scalar(localtime),
      $args[0]);
    die $msg;
  }
  elsif ($? & 127) {
    my $msg = sprintf("%s %s died with signal %d, %s coredump\n",
      scalar(localtime), $args[0], ($? & 127), ($? & 128) ? "with" : "without");
    die $msg;
  }
  elsif (($? >> 8) != 0) {
    my $msg = sprintf("%s %s exited with non-zero exit code %d\n",
      scalar(localtime), $args[0], $? >> 8);
    die $msg;
  }
}

sub untar($) {
  mkdir($_[0]) or die localtime()." unable to mkdir($_[0]): $!\n";
  system_or_die("tar", "-xzf", $tarball, "-C", $_[0]);
}

print localtime()." untarring golden copy\n";
my $golden = $paths[0]."/dma_tmp.$$.gold";
untar($golden);

my $pass_no = 0;
while (1) {
  print localtime()." pass $pass_no: extracting\n";
  my @outputs;
  foreach my $n (1..$nr_simultaneous) {
    # treat paths in a round-robin manner
    my $dir = shift(@paths);
    push(@paths, $dir);

    $dir .= "/dma_tmp.$$.$n";
    push(@outputs, $dir);

    my $pid = fork;
    defined($pid) or die localtime()." unable to fork: $!\n";
    if ($pid == 0) {
      untar($dir);
      exit(0);
    }
  }

  # wait for the children
  while (wait != -1) {}

  print localtime()." pass $pass_no: diffing\n";
  foreach my $dir (@outputs) {
    my $pid = fork;
    defined($pid) or die localtime()." unable to fork: $!\n";
    if ($pid == 0) {
      system_or_die("diff", "-U", "3", "-rN", $golden, $dir);
      system_or_die("rm", "-fr", $dir);
      exit(0);
    }
  }

  # wait for the children
  while (wait != -1) {}

  ++$pass_no;
}