Re: raid1 with 1.2 superblock never marked healthy?

2006-02-20 Thread Janos Farkas
On 2006-02-20 at 09:30:22, Neil Brown wrote:
 If you use an 'internal' bitmap (which is mirrored across all drives
 much like the superblock) then you don't need to specify a file name.
 However if you want the bitmap on a separate file, you have to have
 that name 'hard coded' in mdadm.conf (or similar).

I was under the impression that mdadm.conf is not extended for this :)
That would be a nice place compared to /etc/rc.d/whatever..
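
For illustration, one alternative to hard-coding the name in a boot
script is to hand the bitmap file to mdadm at assembly time; a sketch,
reusing the array and bitmap path from the create command quoted
further down:

  # an external bitmap is never found automatically; pass it to --assemble
  mdadm --assemble /dev/md1 --bitmap=/etc/md/test1.bin /dev/hda3 /dev/hdc3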

I was considering using external bitmap only because I've been bitten a
few times with journaling filesystems vs. cheap hard disks.  I'm not
sure if a few hard disks are a significant sample, but some of them
started developing bad sectors where the journal is stored.  I hoped
that having an external bitmap would reduce the wear on the mirrored
parts.  This way, if the bitmap (even on the same hard disk) is getting
flawed, probably both of the whole mirrors are intact enough for a last
(additional) backup.

  I remember stopping/starting the array correctly does a resync again,
  even without a reboot.
 Hmm... it seems to work for me... How exactly do you start it again?

Oops, I did not mean resync, but that spare confusion stuff.  When I do
mdadm -S /dev/md0, and then mdadm -A /dev/md0, I get:
raid1: raid set md0 active with 1 out of 2 mirrors
And I have to -r /dev/hda3, -a /dev/hda3, and that results in another
resync.
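
The sequence, as a sketch (device names as above), is roughly:

  mdadm -S /dev/md0            # stop the array
  mdadm -A /dev/md0            # re-assemble: comes up with 1 of 2 mirrors
  mdadm /dev/md0 -r /dev/hda3  # remove the member that looks like a spare
  mdadm /dev/md0 -a /dev/hda3  # re-add it, which kicks off a full resync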

 No.  mdadm does not record the name of the bitmap file in the
 superblock.  Just like it does not record the names of component
 devices in the superblock.

Would it be a bad idea (apart from someone having to do the work :)?
(Though probably a bit better would be to do it the way jfs/ext3 external
journals do: store a UUID connecting the journal with the device itself.)

 Array State : uu 1 failed
 Something is definitely wrong here... hda3 looks like a spare, but
 isn't.  I'll have a look and see what I can find out.

The only unusual thing is how it got set up: on a semi-live system, I
started with the magic 'missing' component to create another half-mirror
while the previous one was still running.  Unusual, though I never
thought of it as a bad idea; but maybe somehow it did cause what I'm
seeing.

The original command (then, trying to use bitmaps :) was:

# mdadm --create /dev/md1 --level 1 -n 2 -d 4 -e 1.2 \
  -b /etc/md/test1.bin --bitmap-chunk 64 missing /dev/hdc3

At some later time, I added /dev/hda3, which is the troubling spare
now.
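
From memory, that later addition was presumably just:

  mdadm /dev/md1 -a /dev/hda3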

Janos


Re: NVRAM support

2006-02-20 Thread Mirko Benz

Hello,

We have applications where large data sets (e.g. 100 MB) are sequentially 
written.
Software RAID could do a full stripe update (without reading/using 
existing data).
Does this happen in parallel? If yes, isn't that data vulnerable when a 
crash occurs?


Thanks,
Mirko

Neil Brown schrieb:

On Wednesday February 15, [EMAIL PROTECTED] wrote:
  

Hi,

My intention was not to use a NVRAM device for swap.

Enterprise storage systems use NVRAM for better data protection/faster 
recovery in case of a crash.
Modern CPUs can do RAID calculations very fast.  But Linux RAID is 
vulnerable when a crash occurs during a write operation.
E.g. data and parity write requests are issued in parallel but only one 
finishes.  This will lead to inconsistent data.  It will go undetected 
and cannot be repaired.  Right?



Wrong.  Well, maybe 5% right.

If the array is degraded, then the inconsistency cannot be detected.
If the array is fully functioning, then any inconsistency will be
corrected by a 'resync'.
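
You can watch that resync happen after an unclean shutdown; a minimal
sketch, assuming a standard kernel and an array at /dev/md0:

  cat /proc/mdstat             # shows a resync progress line while md
                               # rewrites the parity/mirror copies
  mdadm --detail /dev/md0      # 'State' reports clean once it completes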

  

How can journaling be implemented within linux-raid?



With a fair bit of work. :-)

  

I have seen a paper that tries this in cooperation with a file system:
"Journal-guided Resynchronization for Software RAID"
www.cs.wisc.edu/adsl/Publications



This is using the ext3 journal to make the 'resync' (mentioned above)
faster.  Write-intent bitmaps can achieve similar speedups with
different costs.
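
As a sketch of the bitmap route (hypothetical device name; mdadm 2.x and
a reasonably recent kernel assumed):

  # add an internal write-intent bitmap; after a crash only the regions
  # marked dirty in the bitmap are resynced, not the whole array
  mdadm --grow --bitmap=internal /dev/md0

  # and it can be removed again later
  mdadm --grow --bitmap=none /dev/md0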

  
But I would rather see a solution within md so that other file systems 
or LVM can be used on top of md.



Currently there is no solution to the 'crash while writing, then
degraded on restart, means possible silent data corruption' problem.
However it is, in reality, a very small problem (unless you regularly
run with a degraded array - don't do that).

The only practical fix at the filesystem level is, as you suggest,
journalling to NVRAM.  There is work underway to restructure md/raid5
to be able to off-load the xor and raid6 calculations to dedicated
hardware.  This restructure would also make it a lot easier to journal
raid5 updates thus closing this hole (and also improving write
latency).

NeilBrown

  




Re: raid1 with 1.2 superblock never marked healthy?

2006-02-20 Thread Mr. James W. Laferriere

Hello Neil  All ,

On Mon, 20 Feb 2006, Janos Farkas wrote:

On 2006-02-20 at 09:30:22, Neil Brown wrote:

If you use an 'internal' bitmap (which is mirrored across all drives
much like the superblock) then you don't need to specify a file name.
However if you want the bitmap on a separate file, you have to have
that name 'hard coded' in mdadm.conf (or similar).


I was under the impression that mdadm.conf is not extended for this :)
That would be a nice place compared to /etc/rc.d/whatever..

I was considering using external bitmap only because I've been bitten a
few times with journaling filesystems vs. cheap hard disks.  I'm not
sure if a few hard disks are a significant sample, but some of them
started developing bad sectors where the journal is stored.  I hoped
that having an external bitmap would reduce the wear on the mirrored
parts.  This way, if the bitmap (even on the same hard disk) is getting
flawed, probably both of the whole mirrors are intact enough for a last
(additional) backup.

...snip...
How hard would it be for mdadm & md to allow use of both internal
and external bitmaps?  And then mdadm could be extended to compare
an external bitmap against an internal one when the operator asks it
to.  Does this sound reasonable?  Thoughts?
This is probably an edge case, I guess.  But it might save
someone's bacon if this functionality were available.
Tia,  JimL
--
+--+
| James   W.   Laferriere | SystemTechniques | Give me VMS |
| NetworkEngineer | 3542 Broken Yoke Dr. |  Give me Linux  |
| [EMAIL PROTECTED] | Billings , MT. 59105 |   only  on  AXP |
|  http://www.asteriskhelpdesk.com/cgi-bin/astlance/r.cgi?babydr   |
+--+


Avoiding resync of RAID1 during creation

2006-02-20 Thread Bryan Wann
With FC2, when installing a fresh new system we would create a RAID10 
array by creating several RAID1s, then adding all of those to a RAID0 
array.  To make the RAID1 devices, we'd use the command:

  /sbin/mkraid --really-force --dangerous-no-resync /dev/mdX

Then we'd set up the RAID0 and mke2fs our filesystems on top of it. 
This worked well for us, never had any problems later.  As soon as the 
kickstart was finished, the system was ready to go.
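
With mdadm, the layering itself translates to something like the sketch
below (the md numbers past md4 are made up for illustration; the RAID1
create command is shown further down):

  /sbin/mdadm --create /dev/md10 --level=0 --chunk=256 --raid-devices=4 \
      /dev/md4 /dev/md5 /dev/md6 /dev/md7
  /sbin/mke2fs -j /dev/md10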


Now with FC4, raidtools is gone and I'm left with mdadm tools.  As far 
as I can tell, mdadm has nothing resembling --dangerous-no-resync.  I've 
updated my kickstart to use mdadm instead of mkraid using:


  /sbin/mdadm --create /dev/md4 --force --run --level=1 --chunk=256 \
  --raid-disks=2 --spare-devices=0 /dev/sda5 /dev/sde5

This causes all of the newly created RAID1 devices to start syncing.  On 
a system with many large disks and RAID1 arrays, syncing takes a 
considerably long time.


Is there any way to avoid the sync after creation when using mdadm like 
I could with mkraid?


The compelling argument I've read in the archives indicates this would 
run counter to ensuring both partitions were completely clean at a block 
level.  I would think creation of the filesystem on top of the array 
would ensure they're clean, at least on that level for all intents and 
purposes.



--bryan


Re: block level vs. file level

2006-02-20 Thread Molle Bestefich
it wrote:
 Ouch.

 How does hardware raid deal with this? Does it?

Hardware RAID controllers deal with this by rounding the size of
participant devices down to the nearest GB, on the assumption that no
drive manufacturer would have the guts to actually sell e.g. a 250 GB
drive with less than exactly 250.000.000.000 bytes of space on it.

(It would be nice if the various flavors of Linux fdisk had an option
to do this. It would be very nice if anaconda had an option to do
this.)
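
Not mentioned above, but md can get a similar effect per-array: --size
caps how much of each component device is used, so a slightly smaller
replacement drive still fits.  A sketch with made-up devices and numbers
(--size is given in KiB, here roughly 249 GB per member):

  mdadm --create /dev/md0 --level=1 --raid-devices=2 --size=243000000 \
      /dev/sda1 /dev/sdb1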


Re: Avoiding resync of RAID1 during creation

2006-02-20 Thread Bryan Wann

Tuomas Leikola wrote:

mdadm --assume-clean

from the man page:

It can also be used when creating a RAID1 or RAID10 if you want to
avoid the initial resync, however this  practice - while normally safe
- is not recommended.


What version of mdadm was that from?  From mdadm(8) in 
mdadm-1.11.0-4.fc4 on my systems:


  --assume-clean
Tell  mdadm that the array pre-existed and is known to be clean.
This is only really useful for Building RAID1 array.   Only  use
this  if  you really know what you are doing.  This is currently
only supported for --build.

I tried with --assume-clean, it still wanted to sync.  From what my man 
page was telling me, it only works with --build.  If I use --build it'll 
go ahead without syncing, but I need per-device superblocks.  Why mdadm 
didn't error when I used --assume-clean with --create, I don't know.


--bryan


Re: Avoiding resync of RAID1 during creation

2006-02-20 Thread Tuomas Leikola
On 2/20/06, Bryan Wann [EMAIL PROTECTED] wrote:
  mdadm --assume-clean
 
 What version of mdadm was that from?  From mdadm(8) in
 mdadm-1.11.0-4.fc4 on my systems:

cut
 I tried with --assume-clean, it still wanted to sync.

The man page I quoted was from 2.3.1 (6 Feb) - relatively new.

I tested this with 2 boxes: 1.9.0 starts the resync and 2.3.1 doesn't.
Used kernel 2.6.14 - although I don't expect that to make much of a
difference.
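
So with a recent enough mdadm the kickstart line would presumably become
something like (untested sketch):

  /sbin/mdadm --create /dev/md4 --force --run --level=1 --chunk=256 \
      --raid-disks=2 --assume-clean /dev/sda5 /dev/sde5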

-tuomas


Re: NVRAM support

2006-02-20 Thread Neil Brown
On Monday February 20, [EMAIL PROTECTED] wrote:
 Hello,
 
 We have applications where large data sets (e.g. 100 MB) are sequentially 
 written.
 Software RAID could do a full stripe update (without reading/using 
 existing data).
 Does this happen in parallel? If yes, isn't that data vulnerable when a 
 crash occurs?

md/raid5 does full stripe writes about 80% of the time when I've
measured it while doing large writes.  I don't know why it is not
closer to 100%.  I suspect some subtle scheduling issue that I
haven't managed to get to the bottom of yet (I should get back to
that).

Data is only vulnerable if, after the crash, the array is degraded.
If the array is still complete after the crash, then there is no loss
of data.

NeilBrown


Re: Bigendian issue with mdadm

2006-02-20 Thread Luca Berra

On Tue, Feb 21, 2006 at 10:44:22AM +1100, Neil Brown wrote:

On Monday February 20, [EMAIL PROTECTED] wrote:

Hi All,

Please, Help !

I've created a raid5 array on an x86 platform, and now wish to use it
on a Mac mini (G4 based). But the problem is: the first is
little-endian, the second big-endian...
And it seems like the md superblock disk format is host-endian, so how
should I tell mdadm which endianness to use?



Read the man page several times?

Look for --update=byteorder

You need mdadm-2.0 or later.


Besides, IIRC the version 1 superblock is always little-endian.

L.

--
Luca Berra -- [EMAIL PROTECTED]
   Communication Media  Services S.r.l.
/\
\ / ASCII RIBBON CAMPAIGN
 X   AGAINST HTML MAIL
/ \


Re: Bigendian issue with mdadm

2006-02-20 Thread Neil Brown
On Tuesday February 21, [EMAIL PROTECTED] wrote:
 On Tue, Feb 21, 2006 at 10:44:22AM +1100, Neil Brown wrote:
 On Monday February 20, [EMAIL PROTECTED] wrote:
  Hi All,
  
  Please, Help !
  
  I've created a raid5 array on an x86 platform, and now wish to use it
  on a Mac mini (G4 based). But the problem is: the first is
  little-endian, the second big-endian...
  And it seems like the md superblock disk format is host-endian, so how
  should I tell mdadm which endianness to use?
  
 
 Read the man page several times?
 
 Look for --update=byteorder
 
 You need mdadm-2.0 or later.
 
 besides IIRC version 1 super block is always little-endan.
 

True.  v1 is little-endian, not host-endian, so this issue won't appear
if you are using v1 metadata.  However the default is 0.90, and I'm still
finding occasional bugs with the v1 code, so I'm not likely to change
the default any time soon... probably not until a year after I'm as
confident of the v1 code as I am of v0.90.
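
For reference, the byte-order fix-up is done at assembly time; a sketch
with hypothetical device names on the big-endian box:

  # rewrites the 0.90 superblocks in the local byte order while assembling
  mdadm --assemble /dev/md0 --update=byteorder /dev/sda1 /dev/sdb1 /dev/sdc1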

NeilBrown