[PATCH 000 of 2] md: Two more bugfixes.

2007-05-10 Thread NeilBrown
Following are two bugfixes for md in current kernels.
The first is suitable for -stable, as without it drive errors can wrongly
propagate through to the filesystem.
Both are suitable for 2.6.22.

Thanks,
NeilBrown


 [PATCH 001 of 2] md: Avoid a possibility that a read error can wrongly 
propagate through md/raid1 to a filesystem.
 [PATCH 002 of 2] md: Improve the is_mddev_idle test


[PATCH 001 of 2] md: Avoid a possibility that a read error can wrongly propagate through md/raid1 to a filesystem.

2007-05-10 Thread NeilBrown

When a raid1 has only one working drive, we want read errors to
propagate up to the filesystem as there is no point failing the last
drive in an array.

Currently the code performing this check is racy.  If a write and a read
are both submitted to a device on a 2-drive raid1, and the write fails
followed by the read failing, the read will see that there is only one
working drive and will pass the failure up, even though the one
working drive is actually the *other* one.

So, tighten up the locking.
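
To make the race concrete, here is a rough stand-alone model of the old and
new decision logic (ordinary user-space C, not kernel code; the function
names are invented for illustration, and the device_lock serialisation the
patch relies on is not modelled):

/* Stand-alone model of the decision in raid1_end_read_request().
 * old_gives_up()/new_gives_up() are made-up names; the real code takes
 * conf->device_lock around the new check.
 */
#include <stdio.h>
#include <stdbool.h>

#define RAID_DISKS 2

static bool faulty[RAID_DISKS];
static int degraded;

/* Old logic: give up (report the error) whenever only one working drive is left. */
static bool old_gives_up(void)
{
	return (RAID_DISKS - degraded) <= 1;
}

/* New logic: only give up if the mirror that failed the read is itself the
 * last non-Faulty drive (or no working drive remains at all). */
static bool new_gives_up(int mirror)
{
	return degraded == RAID_DISKS ||
	       (degraded == RAID_DISKS - 1 && !faulty[mirror]);
}

int main(void)
{
	/* A write sent to mirror 0 fails first: it gets marked Faulty. */
	faulty[0] = true;
	degraded++;

	/* Now a read that had also been sent to mirror 0 fails. */
	printf("old logic: %s\n", old_gives_up()
	       ? "pass error to filesystem (wrong - mirror 1 still works)"
	       : "retry on another mirror");
	printf("new logic: %s\n", new_gives_up(0)
	       ? "pass error to filesystem"
	       : "retry on another mirror (correct)");
	return 0;
}

With the old check the read error is passed straight up even though mirror 1
is perfectly healthy; the new check notices that the failing mirror is
already Faulty and retries instead.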

Cc: [EMAIL PROTECTED]
Signed-off-by: Neil Brown [EMAIL PROTECTED]

### Diffstat output
 ./drivers/md/raid1.c |   33 +++--
 1 file changed, 19 insertions(+), 14 deletions(-)

diff .prev/drivers/md/raid1.c ./drivers/md/raid1.c
--- .prev/drivers/md/raid1.c	2007-05-10 15:51:54.000000000 +1000
+++ ./drivers/md/raid1.c	2007-05-10 15:51:58.000000000 +1000
@@ -271,21 +271,25 @@ static int raid1_end_read_request(struct
 	 */
 	update_head_pos(mirror, r1_bio);
 
-	if (uptodate || (conf->raid_disks - conf->mddev->degraded) <= 1) {
-		/*
-		 * Set R1BIO_Uptodate in our master bio, so that
-		 * we will return a good error code for to the higher
-		 * levels even if IO on some other mirrored buffer fails.
-		 *
-		 * The 'master' represents the composite IO operation to
-		 * user-side. So if something waits for IO, then it will
-		 * wait for the 'master' bio.
+	if (uptodate)
+		set_bit(R1BIO_Uptodate, &r1_bio->state);
+	else {
+		/* If all other devices have failed, we want to return
+		 * the error upwards rather than fail the last device.
+		 * Here we redefine "uptodate" to mean "Don't want to retry"
 		 */
-		if (uptodate)
-			set_bit(R1BIO_Uptodate, &r1_bio->state);
+		unsigned long flags;
+		spin_lock_irqsave(&conf->device_lock, flags);
+		if (r1_bio->mddev->degraded == conf->raid_disks ||
+		    (r1_bio->mddev->degraded == conf->raid_disks-1 &&
+		     !test_bit(Faulty, &conf->mirrors[mirror].rdev->flags)))
+			uptodate = 1;
+		spin_unlock_irqrestore(&conf->device_lock, flags);
+	}
 
+	if (uptodate)
 		raid_end_bio_io(r1_bio);
-	} else {
+	else {
 		/*
 		 * oops, read error:
 		 */
@@ -992,13 +996,14 @@ static void error(mddev_t *mddev, mdk_rd
 		unsigned long flags;
 		spin_lock_irqsave(&conf->device_lock, flags);
 		mddev->degraded++;
+		set_bit(Faulty, &rdev->flags);
 		spin_unlock_irqrestore(&conf->device_lock, flags);
 		/*
 		 * if recovery is running, make sure it aborts.
 		 */
 		set_bit(MD_RECOVERY_ERR, &mddev->recovery);
-	}
-	set_bit(Faulty, &rdev->flags);
+	} else
+		set_bit(Faulty, &rdev->flags);
 	set_bit(MD_CHANGE_DEVS, &mddev->flags);
 	printk(KERN_ALERT "raid1: Disk failure on %s, disabling device. \n"
 		"	Operation continuing on %d devices\n",


[PATCH 002 of 2] md: Improve the is_mddev_idle test

2007-05-10 Thread NeilBrown

During a 'resync' or similar activity, md checks if the devices in the
array are otherwise active and winds back resync activity when they
are.  This test is done in is_mddev_idle, and it is somewhat fragile -
it sometimes thinks there is non-sync io when there isn't.

The test compares the total sectors of io (disk_stat_read) with the sectors
of resync io (disk->sync_io).
This has problems because total sectors gets updated when a request completes,
while resync io gets updated when the request is submitted.  The time difference
can cause large differences between the two which do not actually imply 
non-resync activity.  The test currently allows for some fuzz (+/- 4096)
but there are some cases when it is not enough.

The test currently looks for any (non-fuzz) difference, either
positive or negative.  This clearly is not needed.  Any non-sync
activity will cause the total sectors to grow faster than the sync_io
count (never slower) so we only need to look for a positive difference.

If we do this then the amount of in-flight sync io will never cause
the appearance of non-sync IO.  Once enough non-sync IO to worry about
starts happening, resync will be slowed down and the measurements will
thus be more precise (as there is less in-flight) and control of resync
will still be suitably responsive.
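
As a stand-alone illustration of the difference (plain C with made-up
numbers; curr and last simply stand in for the values is_mddev_idle derives
from disk_stat_read and sync_io):

/* Compare the old and new idle tests on two scenarios. */
#include <stdio.h>

static int old_test(unsigned long curr, unsigned long last)
{
	/* old: any difference beyond +/-4096, in unsigned arithmetic */
	return (curr - last + 4096) > 8192;
}

static int new_test(unsigned long curr, unsigned long last)
{
	/* new: only a positive difference beyond 4096 counts */
	return (long)curr - (long)last > 4096;
}

int main(void)
{
	unsigned long last = 1000000;

	/* 6000 sectors of resync in flight (submitted, not yet completed):
	 * curr has dropped below last although there is no non-sync IO. */
	unsigned long curr = last - 6000;
	printf("resync in flight: old=%d (spuriously non-idle), new=%d (idle)\n",
	       old_test(curr, last), new_test(curr, last));

	/* 6000 sectors of genuine non-sync IO completed. */
	curr = last + 6000;
	printf("real non-sync IO: old=%d, new=%d (both non-idle)\n",
	       old_test(curr, last), new_test(curr, last));
	return 0;
}

In the first case the unsigned subtraction wraps around to a huge value, so
the old test flags in-flight resync as non-sync activity; the signed test
only fires when curr really runs ahead of last.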


Signed-off-by: Neil Brown [EMAIL PROTECTED]

### Diffstat output
 ./drivers/md/md.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff .prev/drivers/md/md.c ./drivers/md/md.c
--- .prev/drivers/md/md.c	2007-05-10 15:51:54.000000000 +1000
+++ ./drivers/md/md.c	2007-05-10 16:05:10.000000000 +1000
@@ -5095,7 +5095,7 @@ static int is_mddev_idle(mddev_t *mddev)
 		 *
 		 * Note: the following is an unsigned comparison.
 		 */
-		if ((curr_events - rdev->last_events + 4096) > 8192) {
+		if ((long)curr_events - (long)rdev->last_events > 4096) {
 			rdev->last_events = curr_events;
 			idle = 0;
 		}


Re: removed disk md-device

2007-05-10 Thread Neil Brown
On Wednesday May 9, [EMAIL PROTECTED] wrote:
 
 Neil Brown [EMAIL PROTECTED] [2007.04.02.0953 +0200]:
 Hmmm... this is somewhat awkward.  You could argue that udev should be
 taught to remove the device from the array before removing the device
 from /dev.  But I'm not convinced that you always want to 'fail' the
 device.   It is possible in this case that the array is quiescent and
 you might like to shut it down without registering a device failure...
 
 Hmm, the kernel advised hotplug to remove the device from /dev, but you 
 don't want to remove it from md? Do you have an example for that case?

Until there is known to be an inconsistency among the devices in an
array, you don't want to record that there is.

Suppose I have two USB drives with a mounted but quiescent filesystem
on a raid1 across them.
I pull them both out, one after the other, to take them to my friends
place.

I plug them both in and find that the array is degraded, because as
soon as I unplugged one, the other was told that it was now the only
one. 
Not good.  Best to wait for an IO request that actually returns an
error.

 
 Maybe an mdadm command that will do that for a given device, or for
 all components of a given array if the 'dev' link is 'broken', or even
 for all devices for all arrays.
 
mdadm --fail-unplugged --scan
 or
mdadm --fail-unplugged /dev/md3
 
 Ok, so one could run this as cron script. Neil, may I ask if you already 
 started to work on this? Since we have the problem on a customer system, we 
 should fix it ASAP, but at least within the next 2 or 3 weeks. If you didn't 
 start work on it yet, I will do...

No, I haven't, but it is getting near the top of my list.
If you want a script that does this automatically for every array,
something like:

  for a in /sys/block/md*/md/dev-*
  do
if [ -f $a/block/dev ]
then : still there
else
        echo faulty > $a/state
        echo remove > $a/state
fi
  done

should do what you want. (I haven't tested it though).

NeilBrown


Re: [PATCH 002 of 2] md: Improve the is_mddev_idle test

2007-05-10 Thread Andrew Morton
On Thu, 10 May 2007 16:22:31 +1000 NeilBrown [EMAIL PROTECTED] wrote:

 The test currently looks for any (non-fuzz) difference, either
 positive or negative.  This clearly is not needed.  Any non-sync
 activity will cause the total sectors to grow faster than the sync_io
 count (never slower) so we only need to look for a positive difference.
 
 ...

 --- .prev/drivers/md/md.c 2007-05-10 15:51:54.0 +1000
 +++ ./drivers/md/md.c 2007-05-10 16:05:10.0 +1000
 @@ -5095,7 +5095,7 @@ static int is_mddev_idle(mddev_t *mddev)
*
* Note: the following is an unsigned comparison.
*/
 - if ((curr_events - rdev->last_events + 4096) > 8192) {
 + if ((long)curr_events - (long)rdev->last_events > 4096) {
   rdev->last_events = curr_events;
   idle = 0;

In which case would unsigned counters be more appropriate?


Re: [PATCH 002 of 2] md: Improve the is_mddev_idle test

2007-05-10 Thread Neil Brown
On Thursday May 10, [EMAIL PROTECTED] wrote:
 On Thu, 10 May 2007 16:22:31 +1000 NeilBrown [EMAIL PROTECTED] wrote:
 
  The test currently looks for any (non-fuzz) difference, either
  positive or negative.  This clearly is not needed.  Any non-sync
  activity will cause the total sectors to grow faster than the sync_io
  count (never slower) so we only need to look for a positive difference.
  
  ...
 
  --- .prev/drivers/md/md.c   2007-05-10 15:51:54.0 +1000
  +++ ./drivers/md/md.c   2007-05-10 16:05:10.0 +1000
  @@ -5095,7 +5095,7 @@ static int is_mddev_idle(mddev_t *mddev)
   *
   * Note: the following is an unsigned comparison.
   */
  -   if ((curr_events - rdev->last_events + 4096) > 8192) {
  +   if ((long)curr_events - (long)rdev->last_events > 4096) {
  rdev->last_events = curr_events;
  idle = 0;
 
 In which case would unsigned counters be more appropriate?

I guess.

It is really the comparison that I want to be signed; I don't much
care about the counters - they are expected to wrap (though they might
not).
So maybe I really want

 if ((signed long)(curr_events - rdev->last_events) > 4096) {

to make it clear...
But people expect numbers to be signed by default, so that probably
isn't necessary.

Yeah, I'll make them signed one day.

Thanks,
NeilBrown


Re: [PATCH 002 of 2] md: Improve the is_mddev_idle test

2007-05-10 Thread Jan Engelhardt

On May 10 2007 16:22, NeilBrown wrote:

diff .prev/drivers/md/md.c ./drivers/md/md.c
--- .prev/drivers/md/md.c  2007-05-10 15:51:54.0 +1000
+++ ./drivers/md/md.c  2007-05-10 16:05:10.0 +1000
@@ -5095,7 +5095,7 @@ static int is_mddev_idle(mddev_t *mddev)
*
* Note: the following is an unsigned comparison.
*/
-  if ((curr_events - rdev->last_events + 4096) > 8192) {
+  if ((long)curr_events - (long)rdev->last_events > 4096) {
   rdev->last_events = curr_events;
   idle = 0;
   }

What did really change? Unless I am seriously mistaken,

curr_events - last_events + 4096 > 8192

is mathematically equivalent to

curr_events - last_events > 4096

The casting to (long) may however force a signed comparison, which turns
things quite upside down, and the comment does not apply anymore.


Jan


Re: [PATCH 002 of 2] md: Improve the is_mddev_idle test

2007-05-10 Thread Neil Brown
On Thursday May 10, [EMAIL PROTECTED] wrote:
 
 On May 10 2007 16:22, NeilBrown wrote:
 
 diff .prev/drivers/md/md.c ./drivers/md/md.c
 --- .prev/drivers/md/md.c2007-05-10 15:51:54.0 +1000
 +++ ./drivers/md/md.c2007-05-10 16:05:10.0 +1000
 @@ -5095,7 +5095,7 @@ static int is_mddev_idle(mddev_t *mddev)
   *
   * Note: the following is an unsigned comparison.
   */
 -if ((curr_events - rdev->last_events + 4096) > 8192) {
 +if ((long)curr_events - (long)rdev->last_events > 4096) {
  rdev->last_events = curr_events;
  idle = 0;
  }
 
 What did really change? Unless I am seriously mistaken,
 
 curr_events - last_events + 4096 > 8192
 
 is mathematically equivalent to
 
 curr_events - last_events > 4096
 
 The casting to (long) may however force a signed comparison which turns
 things quite upside down, and the comment does not apply anymore.

Yes, the use of a signed comparison is the significant difference.
And yes, the comment becomes wrong.  I'm in the process of redrafting
that.  It currently stands at:

/* sync IO will cause sync_io to increase before the disk_stats
 * as sync_io is counted when a request starts, and 
 * disk_stats is counted when it completes.
 * So resync activity will cause curr_events to be smaller than
 * when there was no such activity.
 * non-sync IO will cause disk_stat to increase without
 * increasing sync_io so curr_events will (eventually)
 * be larger than it was before.  Once it becomes
 * substantially larger, the test below will cause
 * the array to appear non-idle, and resync will slow
 * down.
 * If there is a lot of outstanding resync activity when
 * we set last_event to curr_events, then all that activity
 * completing might cause the array to appear non-idle
 * and resync will be slowed down even though there might
 * not have been non-resync activity.  This will only
 * happen once though.  'last_events' will soon reflect
 * the state where there are few or no outstanding
 * resync requests, and further resync activity will
 * always make curr_events less than last_events.
 *
 */


Does that read at all well?

NeilBrown


Re: [PATCH 002 of 2] md: Improve the is_mddev_idle test

2007-05-10 Thread Jan Engelhardt

On May 10 2007 20:04, Neil Brown wrote:
 -   if ((curr_events - rdev->last_events + 4096) > 8192) {
 +   if ((long)curr_events - (long)rdev->last_events > 4096) {
 	rdev->last_events = curr_events;
 idle = 0;
 }
 
/* sync IO will cause sync_io to increase before the disk_stats
 * as sync_io is counted when a request starts, and 
 * disk_stats is counted when it completes.
 * So resync activity will cause curr_events to be smaller than
 * when there was no such activity.
 * non-sync IO will cause disk_stat to increase without
 * increasing sync_io so curr_events will (eventually)
 * be larger than it was before.  Once it becomes
 * substantially larger, the test below will cause
 * the array to appear non-idle, and resync will slow
 * down.
 * If there is a lot of outstanding resync activity when
 * we set last_event to curr_events, then all that activity
 * completing might cause the array to appear non-idle
 * and resync will be slowed down even though there might
 * not have been non-resync activity.  This will only
 * happen once though.  'last_events' will soon reflect
 * the state where there are few or no outstanding
 * resync requests, and further resync activity will
 * always make curr_events less than last_events.
 *
 */

Does that read at all well?

It is a more verbose explanation of your patch description, yes.


Jan


Re: removed disk md-device

2007-05-10 Thread David Greaves
Neil Brown wrote:
 On Wednesday May 9, [EMAIL PROTECTED] wrote:
 Neil Brown [EMAIL PROTECTED] [2007.04.02.0953 +0200]:
 Hmmm... this is somewhat awkward.  You could argue that udev should be
 taught to remove the device from the array before removing the device
 from /dev.  But I'm not convinced that you always want to 'fail' the
 device.   It is possible in this case that the array is quiescent and
 you might like to shut it down without registering a device failure...
 Hmm, the kernel advised hotplug to remove the device from /dev, but you 
 don't want to remove it from md? Do you have an example for that case?
 
 Until there is known to be an inconsistency among the devices in an
 array, you don't want to record that there is.
 
 Suppose I have two USB drives with a mounted but quiescent filesystem
 on a raid1 across them.
 I pull them both out, one after the other, to take them to my friends
 place.
 
 I plug them both in and find that the array is degraded, because as
 soon as I unplugged one, the other was told that it was now the only
 one.
And, in truth, so it was.

Who updated the event count though?

 Not good.  Best to wait for an IO request that actually returns an
 error.
Ah, now would that be a good time to update the event count?


Maybe you should allow drives to be removed even if they aren't faulty or spare?
A write to a removed device would mark it faulty in the other devices without
waiting for a timeout.

But joggling a usb stick (similar to your use case) would probably be OK since
it would be hot-removed and then hot-added.

David


RE: [PATCH 00/16] raid acceleration and asynchronous offload api for 2.6.22

2007-05-10 Thread Tomasz Chmielewski

Ronen Shitrit wrote:


The resync numbers you sent, looks very promising :)
Do you have any performance numbers that you can share for these set of
patches, which shows the Rd/Wr IO bandwidth.


I have some simple tests made with hdparm, with the results I don't 
understand.


We see hdparm results are fine if we access the whole device:

thecus:~# hdparm -Tt /dev/sdd

/dev/sdd:
 Timing cached reads:   392 MB in  2.00 seconds = 195.71 MB/sec
 Timing buffered disk reads:  146 MB in  3.01 seconds =  48.47 MB/sec


But they are 10 times worse (Timing buffered disk reads) when we access 
partitions:


thecus:/# hdparm -Tt /dev/sdc1 /dev/sdd1

/dev/sdc1:
 Timing cached reads:   396 MB in  2.01 seconds = 197.18 MB/sec
 Timing buffered disk reads:   16 MB in  3.32 seconds =   4.83 MB/sec

/dev/sdd1:
 Timing cached reads:   394 MB in  2.00 seconds = 196.89 MB/sec
 Timing buffered disk reads:   16 MB in  3.13 seconds =   5.11 MB/sec


Why is it so much worse?


I used 2.6.21-iop1 patches from http://sf.net/projects/xscaleiop; right 
now I use 2.6.17-iop1, for which the results are ~35 MB/s when accessing 
a device (/dev/sdd) or a partition (/dev/sdd1).



In kernel config, I enabled Intel DMA engines.

The device I use is Thecus n4100, it is Platform: IQ31244 (XScale), 
and has 600 MHz CPU.



--
Tomasz Chmielewski
http://wpkg.org



Re: Please revert 5b479c91da90eef605f851508744bfe8269591a0 (md partition rescan)

2007-05-10 Thread Jan Engelhardt

On May 9 2007 18:51, Linus Torvalds wrote:

(But Andrew never saw your email, I suspect: [EMAIL PROTECTED] is probably 
some strange mixup of Andrew Morton and Andi Kleen in your mind ;)

What do the letters kp stand for?


Jan


Re: Please revert 5b479c91da90eef605f851508744bfe8269591a0 (md partition rescan)

2007-05-10 Thread Xavier Bestel
On Thu, 2007-05-10 at 16:51 +0200, Jan Engelhardt wrote:
 (But Andrew never saw your email, I suspect: [EMAIL PROTECTED] is
 probably 
 some strange mixup of Andrew Morton and Andi Kleen in your mind ;)
 
 What do the letters kp stand for?

Keep Patching ?





Re: [PATCH 00/16] raid acceleration and asynchronous offload api for 2.6.22

2007-05-10 Thread Tomasz Chmielewski

Tomasz Chmielewski schrieb:

Ronen Shitrit wrote:


The resync numbers you sent, looks very promising :)
Do you have any performance numbers that you can share for these set of
patches, which shows the Rd/Wr IO bandwidth.


I have some simple tests made with hdparm, with the results I don't 
understand.


We see hdparm results are fine if we access the whole device:

thecus:~# hdparm -Tt /dev/sdd

/dev/sdd:
 Timing cached reads:   392 MB in  2.00 seconds = 195.71 MB/sec
 Timing buffered disk reads:  146 MB in  3.01 seconds =  48.47 MB/sec


But they are 10 times worse (Timing buffered disk reads) when we access 
partitions:


There seems to be another side effect when comparing DMA engine in 
2.6.17-iop1 to 2.6.21-iop1: network performance.



For simple network tests, I use netperf tool to measure network 
performance.


With 2.6.17-iop1 and all DMA offloading options enabled (selectable in 
System type --- IOP3xx Implementation Options  ---), I get nearly 25 
MB/s throughput.


With 2.6.21-iop1 and all DMA offloading options enabled (moved to Device 
Drivers  --- DMA Engine support  ---), I get only about 10 MB/s 
throughput.
Additionally, on 2.6.21-iop1, I get lots of dma_cookie < 0 printed by 
the kernel.



--
Tomasz Chmielewski
http://wpkg.org


Re: Please revert 5b479c91da90eef605f851508744bfe8269591a0 (md partition rescan)

2007-05-10 Thread Satyam Sharma

On 5/10/07, Xavier Bestel [EMAIL PROTECTED] wrote:

On Thu, 2007-05-10 at 16:51 +0200, Jan Engelhardt wrote:
 (But Andrew never saw your email, I suspect: [EMAIL PROTECTED] is
 probably
 some strange mixup of Andrew Morton and Andi Kleen in your mind ;)

 What do the letters kp stand for?


Heh ... I've always wanted to know that myself. It's funny, no one
seems to have asked that on lkml during all these years (at least none
that a Google search would throw up).


Keep Patching ?


Unlikely. akpm seems to be a pre-Linux-kernel nick.


Re: removed disk md-device

2007-05-10 Thread Bernd Schubert
On Thursday 10 May 2007 09:12:54 Neil Brown wrote:
 On Wednesday May 9, [EMAIL PROTECTED] wrote:
  Neil Brown [EMAIL PROTECTED] [2007.04.02.0953 +0200]:
  Hmmm... this is somewhat awkward.  You could argue that udev should be
  taught to remove the device from the array before removing the device
  from /dev.  But I'm not convinced that you always want to 'fail' the
  device.   It is possible in this case that the array is quiescent and
  you might like to shut it down without registering a device failure...
 
  Hmm, the kernel advised hotplug to remove the device from /dev, but
  you don't want to remove it from md? Do you have an example for that
  case?

 Until there is known to be an inconsistency among the devices in an
 array, you don't want to record that there is.

 Suppose I have two USB drives with a mounted but quiescent filesystem
 on a raid1 across them.
 I pull them both out, one after the other, to take them to my friends
 place.

 I plug them both in and find that the array is degraded, because as
  soon as I unplugged one, the other was told that it was now the only
 one.
 Not good.  Best to wait for an IO request that actually returns an
  error.

Ok, keeping the raid working in this case would be a good idea, so we would 
need it to be configurable whether it should degrade or not.
However, have you tested whether pulling and hotplugging the drive works? Actually 
that's what our customer did. As long as md keeps the old device information, 
the re-plugged-in device will get another device name (and of course also 
another major number), so the md device will still keep the old device 
information and will never automagically add the new device.  
Probably that's even a good idea: how should the md layer know whether it is really 
the very same device, and even if it knew that, how should it know that 
no data has been modified on it while it was unplugged?


  Maybe an mdadm command that will do that for a given device, or for
  all components of a given array if the 'dev' link is 'broken', or even
  for all devices for all arrays.
  
 mdadm --fail-unplugged --scan
  or
 mdadm --fail-unplugged /dev/md3
 
  Ok, so one could run this as cron script. Neil, may I ask if you already
  started to work on this? Since we have the problem on a customer system,
  we should fix it ASAP, but at least within the next 2 or 3 weeks. If you
  didn't start work on it yet, I will do...

 No, I haven't, but it is getting near the top of my list.
 If you want a script that does this automatically for every array,
 something like:

I have never looked into the mdadm sources before, but I will try during the 
weekend (without any promises).


   for a in /sys/block/md*/md/dev-*
   do
 if [ -f $a/block/dev ]
 then : still there
 else
   echo faulty > $a/state
   echo remove > $a/state
 fi
   done

 should do what you want. (I haven't tested it though).

Thanks a lot, we will test that here. Do you propose the same logic for mdadm?


Thanks,
Bernd


-- 
Bernd Schubert
Q-Leap Networks GmbH


Re: Chaining sg lists for big I/O commands: Question

2007-05-10 Thread Jan Engelhardt
On May 9 2007 15:38, Jens Axboe wrote:
 I am a mdadm/disk/hard drive fanatic, I was curious:
 
 On i386, we can at most fit 256 scatterlist elements into a page,
 and on x86-64 we are stuck with 128. So that puts us somewhere
 between 512kb and 1024kb for a single IO.
 
 How come 32bit is 256 and 64 is only 128?

 I am sure it is something very fundamental/simple but I was curious, I 
 would think x86_64 would fit/support more scatterlists in a page.

Because of the size of the scatterlist structure. As pointers are bigger
on 64-bit archs, the scatterlist structure ends up being bigger. The
page size on x86-64 is 4kb, hence the number of structures you can fit
in a page is smaller.

I take it this problem goes away on arches with 8KB page_size?


Jan


Re: Please revert 5b479c91da90eef605f851508744bfe8269591a0 (md partition rescan)

2007-05-10 Thread Andrew Morton
On Thu, 10 May 2007 16:51:31 +0200 (MEST) Jan Engelhardt [EMAIL PROTECTED] 
wrote:

 On May 9 2007 18:51, Linus Torvalds wrote:
 
 (But Andrew never saw your email, I suspect: [EMAIL PROTECTED] is probably 
 some strange mixup of Andrew Morton and Andi Kleen in your mind ;)
 
 What do the letters kp stand for?
 

Some say Kernel Programmer.  My parents said Keith Paul.


Questions about the speed when MD-RAID array is being initialized.

2007-05-10 Thread Liang Yang

Hi,

I created a MD-RAID5 array using 8 Maxtor SAS Disk Drives (chunk size is 
256k). I have measured the data transfer speed for single SAS disk drive 
(physical drive, not filesystem on it), it is roughly about 80~90MB/s.


However, I notice MD also reports the speed for the RAID5 array when it is 
being initialized (cat /proc/mdstat). The speed reported by MD is not 
constant which is roughly from 70MB/s to 90MB/s (average is 85MB/s which is 
very close to the single disk data transfer speed).


I just have three questions:
1. What is the exact meaning of the array speed reported by MD? Is that 
measured for the whole array (I used 8 disks) or for just a single underlying 
disk? If it is for the whole array, then 70~90MB/s seems too low considering 
8 disks are used for this array.


2. How is this speed measured and what is the I/O packet size being used 
when the speed is measured?


3. From the beginning when the MD-RAID5 array is initialized to the end when 
the initialization is done, the speed reported by MD gradually decreases from 
90MB/s down to 70MB/s. Why does the speed change? Why does the speed 
gradually decrease?


Could anyone give me some explanation?

I'm using RHEL 4U4 with 2.6.18 kernel. MDADM version is 1.6.

Thanks a lot,

Liang






Re: Questions about the speed when MD-RAID array is being initialized.

2007-05-10 Thread Justin Piszcz



On Thu, 10 May 2007, Liang Yang wrote:


Hi,

I created a MD-RAID5 array using 8 Maxtor SAS Disk Drives (chunk size is 
256k). I have measured the data transfer speed for single SAS disk drive 
(physical drive, not filesystem on it), it is roughly about 80~90MB/s.


However, I notice MD also reports the speed for the RAID5 array when it is 
being initialized (cat /proc/mdstat). The speed reported by MD is not 
constant which is roughly from 70MB/s to 90MB/s (average is 85MB/s which is 
very close to the single disk data transfer speed).


I just have three questions:
1. What is the exact meaning of the array speed reported by MD? Is that 
measured for the whole array (I used 8 disks) or for just a single underlying 
disk? If it is for the whole array, then 70~90MB/s seems too low considering 8 
disks are used for this array.


2. How is this speed measured and what is the I/O packet size being used when 
the speed is measured?


3. From the beginning when the MD-RAID5 array is initialized to the end when the 
initialization is done, the speed reported by MD gradually decreases from 90MB/s 
down to 70MB/s. Why does the speed change? Why does the speed gradually 
decrease?


Could anyone give me some explanation?

I'm using RHEL 4U4 with 2.6.18 kernel. MDADM version is 1.6.

Thanks a lot,

Liang







For no. 3, it's because it starts from the fast end of the disk and works its 
way to the slower part (slower speeds).





Re: Questions about the speed when MD-RAID array is being initialized.

2007-05-10 Thread Liang Yang
Could you please give me more details about this? What do you mean by the fast 
end and the slow end of the disk? Do you mean the location on each disk 
platter?


Thanks,

Liang


- Original Message - 
From: Justin Piszcz [EMAIL PROTECTED]

To: Liang Yang [EMAIL PROTECTED]
Cc: linux-raid@vger.kernel.org
Sent: Thursday, May 10, 2007 2:33 PM
Subject: Re: Questions about the speed when MD-RAID array is being 
initialized.






On Thu, 10 May 2007, Liang Yang wrote:


Hi,

I created a MD-RAID5 array using 8 Maxtor SAS Disk Drives (chunk size is 
256k). I have measured the data transfer speed for single SAS disk drive 
(physical drive, not filesystem on it), it is roughly about 80~90MB/s.


However, I notice MD also reports the speed for the RAID5 array when it 
is being initialized (cat /proc/mdstat). The speed reported by MD is not 
constant which is roughly from 70MB/s to 90MB/s (average is 85MB/s which 
is very close to the single disk data transfer speed).


I just have three questions:
1. What is the exact meaning of the array speed reported by MD? Is that 
mesured for the whole array (I used 8 disks) or for just single 
underlying disk? If it is for the whole array, then 70~90B/s seems too 
low considering 8 disks are used for this array.


2. How is this speed measured and what is the I/O packet size being used 
when the speed is measured?


3. From the beginning when MD-RAID 5 array is initialized to the end when 
the intialization is done, the speed reports by MD gradually decrease 
from 90MB/s down to 70MB/s. Why does the speed change? Why does the speed 
gradually decrease?


Could anyone give me some explanation?

I'm using RHEL 4U4 with 2.6.18 kernel. MDADM version is 1.6.

Thanks a lot,

Liang







For no 3. because it starts from the fast end of the disk and works its 
way to the slower part (slower speeds).








Re: Questions about the speed when MD-RAID array is being initialized.

2007-05-10 Thread Justin Piszcz

http://partition.radified.com/partitioning_2.htm

System and program files that wind up at the far end of the drive take 
longer to access, and are transferred at a slower rate, which translates 
into a less-responsive system. If you look at the graph of sustained 
transfer rates (STRs) from the HD Tach benchmark posted here, you'll see 
clearly that the outermost sectors of the drive transfer data the fastest.



On Thu, 10 May 2007, Liang Yang wrote:

Could you please give me more details about this? What do you mean by the fast 
end and the slow end of the disk? Do you mean the location on each disk platter?


Thanks,

Liang


- Original Message - From: Justin Piszcz [EMAIL PROTECTED]
To: Liang Yang [EMAIL PROTECTED]
Cc: linux-raid@vger.kernel.org
Sent: Thursday, May 10, 2007 2:33 PM
Subject: Re: Questions about the speed when MD-RAID array is being 
initialized.






On Thu, 10 May 2007, Liang Yang wrote:


Hi,

I created a MD-RAID5 array using 8 Maxtor SAS Disk Drives (chunk size is 
256k). I have measured the data transfer speed for single SAS disk drive 
(physical drive, not filesystem on it), it is roughly about 80~90MB/s.


However, I notice MD also reports the speed for the RAID5 array when it is 
being initialized (cat /proc/mdstat). The speed reported by MD is not 
constant which is roughly from 70MB/s to 90MB/s (average is 85MB/s which 
is very close to the single disk data transfer speed).


I just have three questions:
1. What is the exact meaning of the array speed reported by MD? Is that 
mesured for the whole array (I used 8 disks) or for just single underlying 
disk? If it is for the whole array, then 70~90B/s seems too low 
considering 8 disks are used for this array.


2. How is this speed measured and what is the I/O packet size being used 
when the speed is measured?


3. From the beginning when MD-RAID 5 array is initialized to the end when 
the intialization is done, the speed reports by MD gradually decrease from 
90MB/s down to 70MB/s. Why does the speed change? Why does the speed 
gradually decrease?


Could anyone give me some explanation?

I'm using RHEL 4U4 with 2.6.18 kernel. MDADM version is 1.6.

Thanks a lot,

Liang







For no 3. because it starts from the fast end of the disk and works its way 
to the slower part (slower speeds).










Re: Chaining sg lists for big I/O commands: Question

2007-05-10 Thread Jens Axboe
On Thu, May 10 2007, Jan Engelhardt wrote:
 On May 9 2007 15:38, Jens Axboe wrote:
  I am a mdadm/disk/hard drive fanatic, I was curious:
  
  On i386, we can at most fit 256 scatterlist elements into a page,
  and on x86-64 we are stuck with 128. So that puts us somewhere
  between 512kb and 1024kb for a single IO.
  
  How come 32bit is 256 and 64 is only 128?
 
  I am sure it is something very fundamental/simple but I was curious, I 
  would think x86_64 would fit/support more scatterlists in a page.
 
 Because of the size of the scatterlist structure. As pointers are bigger
 on 64-bit archs, the scatterlist structure ends up being bigger. The
 page size on x86-64 is 4kb, hence the number of structures you can fit
 in a page is smaller.
 
 I take it this problem goes away on arches with 8KB page_size?

Not really, the 8kb page size just doubles the sg size. On a 64-bit
arch, that would still only get you 1mb IO size.
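
For anyone curious where those numbers come from, here is a rough
stand-alone sketch (the struct below only mimics the 2.6.21-era scatterlist
layout, and dma_addr_t width really depends on the architecture and config,
so treat the sizes as illustrative):

/* Approximate the scatterlist entry size to show why one 4k page holds
 * about 256 entries on i386 (16-byte entries) and about 128 on x86-64
 * (32-byte entries), and what that caps a one-page sg table at.
 * This is an illustration, not the kernel's actual definition.
 */
#include <stdio.h>

struct fake_scatterlist {
	void		*page;		/* struct page * in the kernel */
	unsigned int	offset;
	unsigned long	dma_address;	/* stands in for dma_addr_t */
	unsigned int	length;
};

#define PAGE_SZ 4096UL

int main(void)
{
	unsigned long entries = PAGE_SZ / sizeof(struct fake_scatterlist);

	printf("entry size: %zu bytes, entries per page: %lu\n",
	       sizeof(struct fake_scatterlist), entries);
	/* With one page per sg entry, a one-page sg table bounds a single IO at: */
	printf("max single IO: %lu KB\n", entries * PAGE_SZ / 1024);
	return 0;
}

Built 32-bit (-m32) this gives 16-byte entries, 256 per page and a 1024 KB
cap; built 64-bit it gives 32-byte entries, 128 per page and 512 KB.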

-- 
Jens Axboe



Re: Questions about the speed when MD-RAID array is being initialized.

2007-05-10 Thread Robin Hill
On Thu May 10, 2007 at 05:33:17PM -0400, Justin Piszcz wrote:

 
 
 On Thu, 10 May 2007, Liang Yang wrote:
 
 Hi,
 
 I created a MD-RAID5 array using 8 Maxtor SAS Disk Drives (chunk size is 
 256k). I have measured the data transfer speed for single SAS disk drive 
 (physical drive, not filesystem on it), it is roughly about 80~90MB/s.
 
 However, I notice MD also reports the speed for the RAID5 array when it is 
 being initialized (cat /proc/mdstat). The speed reported by MD is not 
 constant which is roughly from 70MB/s to 90MB/s (average is 85MB/s which 
 is very close to the single disk data transfer speed).
 
 I just have three questions:
 1. What is the exact meaning of the array speed reported by MD? Is that 
 mesured for the whole array (I used 8 disks) or for just single underlying 
 disk? If it is for the whole array, then 70~90B/s seems too low 
 considering 8 disks are used for this array.
 
 2. How is this speed measured and what is the I/O packet size being used 
 when the speed is measured?
 
 3. From the beginning when MD-RAID 5 array is initialized to the end when 
 the intialization is done, the speed reports by MD gradually decrease from 
 90MB/s down to 70MB/s. Why does the speed change? Why does the speed 
 gradually decrease?
 
 Could anyone give me some explanation?
 
 I'm using RHEL 4U4 with 2.6.18 kernel. MDADM version is 1.6.
 
 Thanks a lot,
 
 Liang
 
 
 For no 3. because it starts from the fast end of the disk and works its 
 way to the slower part (slower speeds).
 
And I'd assume for no. 1 it's because it's only writing to a single disk
at this point, so it will obviously be limited to the transfer rate of a
single disk.  RAID5 arrays are created as a degraded array, then the
final disk is recovered - this is done so that the array is ready for
use very quickly.  So what you're seeing in /proc/mdstat is the speed of
calculating and writing the data for the final drive (and it is, unless
computationally limited, going to be the write speed of a single
drive).

HTH,
Robin

-- 
 ___
( ' } |   Robin Hill[EMAIL PROTECTED] |
   / / )  | Little Jim says |
  // !!   |  He fallen in de water !! |




Re: Please revert 5b479c91da90eef605f851508744bfe8269591a0 (md partition rescan)

2007-05-10 Thread H. Peter Anvin
Satyam Sharma wrote:
 On 5/10/07, Xavier Bestel [EMAIL PROTECTED] wrote:
 On Thu, 2007-05-10 at 16:51 +0200, Jan Engelhardt wrote:
  (But Andrew never saw your email, I suspect: [EMAIL PROTECTED] is
  probably
  some strange mixup of Andrew Morton and Andi Kleen in your mind ;)
 
  What do the letters kp stand for?
 
 Heh ... I've always wanted to know that myself. It's funny, no one
 seems to have asked that on lkml during all these years (at least none
 that a Google search would throw up).
 
 Keep Patching ?
 
 Unlikely. akpm seems to be a pre-Linux-kernel nick.

http://en.wikipedia.org/wiki/Andrew_Morton_%28computer_programmer%29

-hpa



Re: removed disk md-device

2007-05-10 Thread Neil Brown
On Thursday May 10, [EMAIL PROTECTED] wrote:
 Neil Brown wrote:
  On Wednesday May 9, [EMAIL PROTECTED] wrote:
  Neil Brown [EMAIL PROTECTED] [2007.04.02.0953 +0200]:
  Hmmm... this is somewhat awkward.  You could argue that udev should be
  taught to remove the device from the array before removing the device
  from /dev.  But I'm not convinced that you always want to 'fail' the
  device.   It is possible in this case that the array is quiescent and
  you might like to shut it down without registering a device failure...
  Hmm, the kernel advised hotplug to remove the device from /dev, but 
  you 
  don't want to remove it from md? Do you have an example for that case?
  
  Until there is known to be an inconsistency among the devices in an
  array, you don't want to record that there is.
  
  Suppose I have two USB drives with a mounted but quiescent filesystem
  on a raid1 across them.
  I pull them both out, one after the other, to take them to my friends
  place.
  
  I plug them both in and find that the array is degraded, because as
   soon as I unplugged one, the other was told that it was now the only
  one.
 And, in truth, so it was.

So what was?
It is true that now one drive is the only one plugged in, but is
that relevant?
Is it true that the one drive is the only drive in the array??
That depends on what you mean by the array.  If I am moving the
array to another computer, then the one drive still plugged into the
first computer is not the only drive in the array from my
perspective.

If there is a write request, and it can only be written to one drive
(because the other is unplugged), then it becomes appropriate to tell
the still-present drive that it is the only drive in the array.

 
 Who updated the event count though?

Sorry, not enough words.  I don't know what you are asking.

 
  Not good.  Best to wait for an IO request that actually returns an
  error.
 Ah, now would that be a good time to update the event count?

Yes.  Of course.  It is an event (IO failed).  That makes it a good
time to update the event count.. am I missing something here?

 
 
 Maybe you should allow drives to be removed even if they aren't faulty or 
 spare?
 A write to a removed device would mark it faulty in the other devices without
 waiting for a timeout.

Maybe, but I'm not sure what the real gain would be.

 
 But joggling a usb stick (similar to your use case) would probably be OK since
 it would be hot-removed and then hot-added.

This still needs user-space interaction.
If the USB layer detects a removal and a re-insert, sdb may well come
back as something different (sdp?) - though I'm not completely familiar
with how USB storage works.

In any case, it should really be a user-space decision what happens
then.  A hot re-add may well be appropriate, but I wouldn't want to
have the kernel make that decision.

NeilBrown
