Re: 2.6.20-rc5: cp 18gb 18gb.2 = OOM killer, reproducible just like 2.6.19.2

2007-01-24 Thread Bill Cizek

Justin Piszcz wrote:

On Mon, 22 Jan 2007, Andrew Morton wrote:
  

On Sun, 21 Jan 2007 14:27:34 -0500 (EST) Justin Piszcz [EMAIL PROTECTED] wrote:

Why does copying an 18GB file on a 74GB raptor raid1 cause the kernel to 
invoke the OOM killer and kill all of my processes?
  
Running with PREEMPT OFF lets me copy the file!!  The machine LAGS 
occasionally every 5-30-60 seconds or so VERY BADLY, talking 5-10 seconds 
of lag, but hey, it does not crash!! I will boot the older kernel with 
preempt on and see if I can get you that information you requested.
  

Justin,

According to your kernel_ring_buffer.txt (attached to another email), 
you are using anticipatory as your io scheduler:
  289  Jan 24 18:35:25 p34 kernel: [0.142130] io scheduler noop registered
  290  Jan 24 18:35:25 p34 kernel: [0.142194] io scheduler anticipatory registered (default)


I had a problem with this scheduler where my system would occasionally 
lockup during heavy I/O.  Sometimes it would fix itself, sometimes I had 
to reboot.  I changed to the CFQ io scheduler and my system has worked 
fine since then.


CFQ has to be built into the kernel (under Block Layer / IO Schedulers).  It 
can be selected as the default there, or you can set it at runtime:


echo cfq > /sys/block/<disk>/queue/scheduler
...
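That one-liner needs root and a real device node; wrapped as a small sh helper (the function name and the sda example are mine, not from the kernel), it fails cleanly instead of silently:

```shell
# Sketch of a guarded runtime scheduler switch.  The sysfs layout
# (/sys/block/<disk>/queue/scheduler) is the stock kernel interface;
# the helper refuses to write anywhere it can't.
set_io_scheduler() {
    # $1 = path to the queue/scheduler file, $2 = scheduler name (e.g. cfq)
    if [ -w "$1" ]; then
        echo "$2" > "$1"
    else
        echo "cannot write $1 (need root / valid device)" >&2
        return 1
    fi
}

# On a real system, as root:
#   set_io_scheduler /sys/block/sda/queue/scheduler cfq
```

Reading the scheduler file back shows the active scheduler in brackets, so you can confirm the switch took effect.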

Hope this helps,
Bill
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


raid5 reshape bug with XFS

2006-11-04 Thread Bill Cizek

Hi,

I'm setting up a raid 5 system and I ran across a bug when reshaping an 
array with a mounted XFS filesystem on it.  This is under Linux 2.6.18.2 
and mdadm 2.5.5.


I have a test array with three 10 GB disks plus a fourth 10 GB spare disk, 
and a mounted xfs filesystem on it:


[EMAIL PROTECTED] $ mdadm --detail /dev/md4
/dev/md4:
        Version : 00.90.03
  Creation Time : Sat Nov  4 18:58:59 2006
     Raid Level : raid5
     Array Size : 20964480 (19.99 GiB 21.47 GB)
    Device Size : 10482240 (10.00 GiB 10.73 GB)
   Raid Devices : 3
  Total Devices : 4
Preferred Minor : 4
    Persistence : Superblock is persistent
[snip]

...I Grow it:

[EMAIL PROTECTED] $ mdadm -G /dev/md4 -n4
mdadm: Need to backup 384K of critical section..
mdadm: ... critical section passed.
[EMAIL PROTECTED] $ mdadm --detail /dev/md4
/dev/md4:
        Version : 00.91.03
  Creation Time : Sat Nov  4 18:58:59 2006
     Raid Level : raid5
     Array Size : 20964480 (19.99 GiB 21.47 GB)
    Device Size : 10482240 (10.00 GiB 10.73 GB)
   Raid Devices : 4
  Total Devices : 4
Preferred Minor : 4
    Persistence : Superblock is persistent
---

It goes along and reshapes fine (from /proc/mdstat):

md4 : active raid5 dm-67[3] dm-66[2] dm-65[1] dm-64[0]
      20964480 blocks super 0.91 level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
      [====>................]  reshape = 22.0% (2314624/10482240) finish=16.7min speed=8128K/sec



When the reshape completes, the full array size gets corrupted:
/proc/mdstat:
md4 : active raid5 dm-67[3] dm-66[2] dm-65[1] dm-64[0]
      31446720 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]

- looks good, but-

[EMAIL PROTECTED] $ mdadm --detail /dev/md4
/dev/md4:
        Version : 00.90.03
  Creation Time : Sat Nov  4 18:58:59 2006
     Raid Level : raid5
     Array Size : 2086592 (2038.03 MiB 2136.67 MB)
    Device Size : 10482240 (10.00 GiB 10.73 GB)
   Raid Devices : 4
  Total Devices : 4
Preferred Minor : 4
    Persistence : Superblock is persistent

(2086592 != 31446720 -- Bad, much too small)

-
xfs_growfs /dev/md4 barfs horribly - something about reading past the 
end of the device.


If I unmount the XFS filesystem, things work ok:

[EMAIL PROTECTED] $ umount /dev/md4

[EMAIL PROTECTED] $ mdadm --detail /dev/md4
/dev/md4:
        Version : 00.90.03
  Creation Time : Sat Nov  4 18:58:59 2006
     Raid Level : raid5
     Array Size : 31446720 (29.99 GiB 32.20 GB)
    Device Size : 10482240 (10.00 GiB 10.73 GB)
   Raid Devices : 4
  Total Devices : 4
Preferred Minor : 4
    Persistence : Superblock is persistent

(31446720 == 31446720 -- Good)

If I remount the fs, I can use xfs_growfs with no ill effects.

It's a pretty easy workaround to leave the fs unmounted during the 
resize, but it doesn't seem right for the array size to get borked like 
this. If there's anything I can provide to help debug this, let me know.
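In script form, the unmount-first workaround looks like the sketch below; /dev/md4 and /mnt/data are assumed names, and the commands are only printed so the sequence can be reviewed before running it for real as root:

```shell
# Dry-run plan for growing the array with the XFS filesystem unmounted.
# /dev/md4 and /mnt/data are assumptions -- substitute your own names.
MD=/dev/md4
MNT=/mnt/data
plan="umount $MNT
mdadm --grow $MD --raid-devices=4
# ...wait for the reshape to finish (watch /proc/mdstat)...
mount $MD $MNT
xfs_growfs $MNT"
printf '%s\n' "$plan"
```

xfs_growfs is run last, after the remount, which matches the "remount then grow" order that worked above.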


Thanks,
Bill







Re: IBM xSeries stop responding during RAID1 reconstruction

2006-06-14 Thread Bill Cizek

Niccolo Rigacci wrote:


Hi to all,

I have a new IBM xSeries 206m with two SATA drives, I installed a 
Debian Testing (Etch) and configured a software RAID as shown:


Personalities : [raid1]
md1 : active raid1 sdb5[1] sda5[0]
 1951744 blocks [2/2] [UU]

md2 : active raid1 sdb6[1] sda6[0]
 2931712 blocks [2/2] [UU]

md3 : active raid1 sdb7[1] sda7[0]
 39061952 blocks [2/2] [UU]

md0 : active raid1 sdb1[1] sda1[0]
 582 blocks [2/2] [UU]

I experience this problem: whenever a volume is reconstructing 
(syncing), the system stops responding. The machine is alive, 
because it responds to the ping, the console is responsive but I 
cannot pass the login prompt. It seems that every disk activity 
is delayed and blocking.


When the sync is complete, the machine starts responding again 
perfectly.


Any hints on how to start debugging?
 



I ran into a similar problem using kernel 2.6.16.14 on an ASUS 
motherboard: when I mirrored two SATA drives it seemed to block all 
other disk I/O until the sync was complete.

My symptoms were the same: all consoles were non-responsive, and when I 
tried to login it just sat there until the sync was complete.

I was able to work around this by lowering 
/proc/sys/dev/raid/speed_limit_max to a value below my disk throughput 
(~50 MB/s) as follows:

$ echo 45000 > /proc/sys/dev/raid/speed_limit_max

That kept my system usable but didn't address the underlying problem of 
the raid resync not being appropriately throttled.  I ended up 
configuring my system differently, so this became a moot point for me.
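The same throttle can be wrapped in a small helper; the sysctl path is the stock kernel one, the function name and guard are mine, and 45000 KB/s is just "a bit below measured throughput" as above:

```shell
# Sketch: cap md resync speed so interactive I/O stays usable.
set_resync_ceiling() {
    # $1 = sysctl file, $2 = ceiling in KB/s
    if [ -w "$1" ]; then
        echo "$2" > "$1"
    else
        echo "cannot write $1 (need root)" >&2
        return 1
    fi
}

# On a live system, as root:
#   set_resync_ceiling /proc/sys/dev/raid/speed_limit_max 45000
```

Note the value is a ceiling, not a fix: the resync still competes for the disk, it just can't saturate it.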

Hope this helps,
Bill






RAID1 Array corruption when adding an extra device with mdadm

2006-01-13 Thread Bill Cizek


I've got a system running 2.6.14.6 with a raid1 array of 2 disks.

The size of the array is as follows (from mdadm --detail):
Raid Level : raid1
Array Size : 28314496 (27.00 GiB 28.99 GB)
   Device Size : 28314496 (27.00 GiB 28.99 GB)
  Raid Devices : 2
 Total Devices : 2

I'm trying to add an extra disk to make a three-way mirror using mdadm:

mdadm --grow /dev/md0 -n 3

When I do this, the disk gets added (so there are 3 raid devices) --BUT-- 
the Array Size changes to 3.0 GB.  If I immediately reboot, things end up 
ok, but if I let it run it destroys the array contents.
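For reference, the usual two-step route to a three-way RAID1 mirror is to add the third disk as a spare and then grow the device count. The device names below are assumptions, and the commands are printed rather than executed so they can be reviewed first:

```shell
# Dry-run sketch of the 3-way-mirror procedure.  /dev/md0 and /dev/sdc1
# are assumed names -- substitute your own.
MD=/dev/md0
NEW=/dev/sdc1
add_cmd="mdadm $MD --add $NEW"
grow_cmd="mdadm --grow $MD --raid-devices=3"
printf '%s\n%s\n' "$add_cmd" "$grow_cmd"
# Run for real (as root) only after reviewing:
#   eval "$add_cmd" && eval "$grow_cmd"
```

The grow step is the one that tripped the size corruption described above, so check `mdadm --detail` immediately afterwards.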

This happened under mdadm v2.1 and 2.2.  I hacked mdadm to print out what 
it's doing, and things look ok in Manage_resize() until the 
mdu_array_info_t structure is updated using ioctl(SET_ARRAY_INFO); then 
the above-mentioned size change happens.

Does anyone know what's up with this?

Thanks,
-Bill




