Re: RAID1 and data safety?

2005-04-10 Thread Peter T. Breuer
Doug Ledford [EMAIL PROTECTED] wrote:
   Now, if I recall correctly, Peter posted a patch that changed this
   semantic in the raid1 code.  The raid1 code does not complete a write to
   the upper layers of the kernel until it's been completed on all devices
   and his patch made it such that as soon as it hit 1 device it returned
   the write to the upper layers of the kernel.
  
  I am glad to hear that the behaviour is such that the barrier stops until
  *all* media got written. That was one of the things that really made me
  worry. I hope the patch is backed out and didn't go into any distros.
 
 No it never went anywhere.  It was just a "Hey guys, I played with this
 optimization, here's the patch" type posting and no one picked it up for
 inclusion in any upstream or distro kernels.

I'll just remark that the patch depended on a bitmap, so it _couldn't_
have been picked up (until now?).

And anyway, async writes (that's the name) were switched on by a
module/kernel parameter, and were off by default.

I suppose maybe Paul's 2.6 patches also offer the possibility of async
writes (I haven't checked).

It isn't very dangerous - the bitmap marks the write as not done until
all the components have been written, even though the write is acked
back to the kernel after the first of the components has been written.
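
To make the ordering concrete, here is a small user-space model of that
semantics (made-up names, just an illustration - not the actual raid1
code or the patch):

/* toy model: the ack goes up on the FIRST completion, the bitmap bit
 * is only cleared on the LAST, so a crash in between leaves the
 * region marked for resync */
#include <stdio.h>

typedef unsigned long sector_t;

struct mirror_write {
    int remaining;      /* component writes still outstanding          */
    int acked;          /* already completed back to the upper layer?  */
    sector_t sector;    /* region covered, for the intent bitmap       */
};

/* stand-ins for the real kernel hooks */
static void ack_to_upper_layer(struct mirror_write *w) { printf("ack %lu\n", w->sector); }
static void bitmap_clear_dirty(sector_t s)             { printf("clear %lu\n", s); }

/* called once per component as its write completes */
static void component_write_done(struct mirror_write *w)
{
    if (!w->acked) {                    /* async mode: ack on first completion */
        ack_to_upper_layer(w);
        w->acked = 1;
    }
    if (--w->remaining == 0)            /* only now is the region really clean */
        bitmap_clear_dirty(w->sector);
}

int main(void)
{
    struct mirror_write w = { .remaining = 2, .acked = 0, .sector = 1234 };
    component_write_done(&w);           /* first mirror finishes: acked        */
    component_write_done(&w);           /* second finishes: bitmap bit cleared */
    return 0;
}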

There are extra openings for data loss if you choose that mode, but
they're relatively improbable.  You're likely to lose data under several
circumstances during normal raid1 operation (see for example the split
brain discussion!).  Choosing to decrease write latency by half against
some minor extra opportunity for data loss is an admin decision that
should be available to you, I think.

Umm ... what's the extra vulnerability? Well, I suppose that with ONE
bitmap, writes could be somewhat delayed to TWO DIFFERENT components in
turn.  Then if we lose the array node at that point, writes will be
outstanding to both components, and when we resync neither will have
perfect data to copy back over the other.  And we won't even be able to
know which was right, because of the single bitmap.

Shrug. We probably wouldn't have known which mirror component was the
good one in any case.

But with TWO bitmaps, we'd know which components were lacking what, and
we could maybe do a better recovery job. Or not. We'd always choose one
component to copy from, and that would overwrite the right data that
the other had.

Even with sync (not async) writes, we could get an array node crash
that left BOTH components of the mirror missing some info that the other
component had already had written to it, and then copying from either
component over the other would lose data. Yer pays yer money and yer
takes yer choice.

So I don't see it as a big thing. It's a question of evaluating
probabilities, and benefits.


BTW - async writes without the presence of a bitmap also seem to me to
be a valid admin choice. Surely if a single component dies, and the
array stays up, everything will be fine. The problem is when the array
node crashes on its own. And that may cause data loss anyway.

Peter



AW: AW: RAID1 and data safety?

2005-04-08 Thread Schuett Thomas EXT
Hi Doug,

(in case you are short on time, please just answer the last part)

 (or at the very
 absolute least, it should be in a small enough queue of pending writes
 that should the power get lost, it can still write the last bits out
 during spin down).

Wow. I have heard about this before, but was not sure if it was fiction.
I imagine the problem is that most buyers can't see the benefit of such
features, and so there is no big incentive for the manufacturers to build
them in. I have never seen HD descriptions mentioning anything like this
(on the other hand, I am just an ordinary end user).


 It [write barriers] doesn't bring down system performance because the
 journaling filesystem isn't "single task", so to speak.  What this means
 is that when
 ...
 Make sense?

Makes perfect sense! (As 7 universes and time backwards ;-)

 So event counters are the 2nd type of information that gets written with
 write barriers. [...]

Not really.  The event counter is *much* coarser-grained than journal
entries.

I see. As long as the HDs are fine, write barriers for jfs writes (which
block until both media got written) are all you need to make it power
failure safe. Only when the HDs are not fine any more does the event
counter come into play.

 You don't have to wait for *all* writes to
 the drive to complete, just the journal writes.  This is why performance
 isn't killed by journaling.  The filesystem proper writes for previous
 journal transactions can be taking place while you are doing this
 waiting.

I wonder if a logic error slipped in here, or if I just still don't have
the right understanding of what writing with a write barrier really means.
The end-of-transaction cannot be written before the *data* is *really*
written. I thought this is what "the data must be written with a write
barrier" means. But perhaps a write with a write barrier means:
1. wait for the completion event, 2. write, 3. wait for the completion
event again. Then it would be the same after all.
In any case, for the reasons you explained earlier, it wouldn't kill the
performance at all.

 I use ext3 personally.  But that's as much because it's the default
 filesystem and I know Stephen Tweedie will fix it if it's broken ;-)

Good point :-)



 Of course, if it's supported on your system, you could
 also just enable the SMART daemon and have it tell the drives to do
 continuous background media checks to detect sectors that are either
 already bad or getting ready to go bad (corrected error conditions).

This is something of very big interest to me!
I have asked about it several times in several newsgroups with no answer:
AFAIK a CD player can correct one-bit errors, but not two-bit errors
(or make it two-bit vs. three-bit, it doesn't make a difference).
And: a CD player doesn't tell you when it had to correct one-bit errors.

Big problem: your CD (= valuable backup) gets worse and worse, with
more and more one-bit errors, but you never notice, until one rainy day
you suddenly cannot read the CD (a particular sector/file) at all any more.
And then it is too late.

Now you are saying that errors get corrected on HDs too (which I
didn't know), but that it is possible to find out about it (so either the
HDs report it when they make a correction, or the correction happens
somewhere higher up, in the driver or so). My question: do CD players
offer an interface for being notified about error corrections too, or do
they really hide this without any chance of getting that info?

(And of course I will have a look at the SMART daemon.)

best regards,
  Thomas


Re: RAID1 and data safety?

2005-04-08 Thread Peter T. Breuer
I forgot to say thanks! Thanks for the breakdown.

Doug Ledford [EMAIL PROTECTED] wrote:
(of event count increment)
 I think the best explanation is this:  any change in array state that

OK ..

 would necessitate kicking a drive out of the array if it didn't also
 make this change in state with the rest of the drives in the array

Hmmm.  

 results in an increment to the event counter and a flush of the
 superblocks.


 Transition from ro -> rw or from rw -> ro, transition from clean to
 dirty or dirty to clean, any change in the distribution of disks in the
 superblock (aka, change in number of working disks, active disks, spare
 disks, failed disks, etc.), or any ordering updates of disk devices in
 the rdisk array (for example, when a spare is done being rebuilt to
 replace a failed device, it gets moved from its current position in the
 array to the position it was just rebuilt to replace as part of the
 final transition from being rebuilt to being an active, live component
 in the array).

I still see about 8-10 changes in the event count between faulting a
disk out and bringing it back into the array for hot-repair, even if
nothing is written in the meantime. I suppose I could investigate!

Of concern to me (only) is that I observe that a faulted disk seems to
have an event count that is 1-2 counts behind the one stamped on the
bitmap left behind on the array as it starts up in response to the fault.
How far behind it is varies.  Something races.

Peter



Re: AW: RAID1 and data safety?

2005-04-07 Thread Doug Ledford
On Thu, 2005-04-07 at 17:35 +0200, Schuett Thomas EXT wrote:
 [Please excuse me, my mailtool breaks threads ...]
 Reply to mail from 2005-04-05
 
 Hello Doug,
 
 many thanks for this highly detailed and structured posting.

You're welcome.

 A few questions are left: Is it common today that an (EIDE) HD does
 not report a write as finished (aka send completion events, if I got this
 right) before it was written to *media*?

Depends on the state of the Write Cache bit in the drive's configuration
page.  If this bit is enabled, the drive is allowed to cache writes in
the on-board RAM and complete the command.  Should the drive have a
power failure event before the data is written to the media, then it
might get lost.  If the bit is not set, then the drive is supposed to
actually have the data on media before returning (or at the very
absolute least, it should be in a small enough queue of pending writes
that should the power get lost, it can still write the last bits out
during spin down).

 I am happy to hear about these write barriers, even as I am astonished
 that it doesn't bring down the whole system performance (at least for raid1).

It doesn't bring down system performance because the journaling
filesystem isn't "single task", so to speak.  What this means is that when
you have a large number of writes queued up to be flushed, the
journaling fs can create a journal transaction for just some of the
writes, then issue an end of journal transaction, wait for that to
complete, then it can proceed to release all those writes to the
filesystem proper.  At the same time that the filesystem proper writes
are getting under way, it can issue another stream of writes to start
the next journal transaction.  As soon as all the journal writes are
complete, it can issue an end of journal transaction, wait for it to
complete, then issue all those writes to the filesystem proper.  So you
see, it's not that writes to the filesystem and the journal are
exclusive of each other so that one waits entirely on the other, it's
that writes from a single journal transaction are exclusive of writes to
the filesystem for *that particular transaction*.  By keeping ongoing
journal transactions in process, the journaling filesystem is able to
also stream data to the filesystem proper without much degradation, it's
just that the filesystem proper writes are delayed somewhat from the
corresponding journal transaction writes.  Make sense?
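
If it helps, here is a rough user-space illustration of that ordering
(made-up layout, not ext3 code), with fsync() standing in for "wait for
the completion events":

#include <sys/types.h>
#include <string.h>
#include <unistd.h>

/* error handling omitted; the point is only the ordering of the waits */
static void commit_transaction(int journal_fd, int fs_fd,
                               const char *buf, size_t len,
                               off_t journal_off, off_t fs_off)
{
    char commit[512];
    memset(commit, 'C', sizeof(commit));        /* stands in for the commit record */

    pwrite(journal_fd, buf, len, journal_off);  /* 1. journal copy of the data     */
    fsync(journal_fd);                          /*    wait for it: first barrier   */

    pwrite(journal_fd, commit, sizeof(commit),  /* 2. end-of-transaction record    */
           journal_off + (off_t)len);
    fsync(journal_fd);                          /*    wait again: second barrier   */

    /* 3. only now release the writes to the filesystem proper; the NEXT
     *    journal transaction can already be streaming while these run   */
    pwrite(fs_fd, buf, len, fs_off);
}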

 
  This is where the event counters
  come into play.  That's what md uses to be able to tell which drives in
  an array are up to date versus those that aren't, which is what's needed
  to satisfy C.
 
 So event counters are the 2nd type of information that gets written with
 write barriers. One is the journal data from the (j)fs (and actually the
 real data too, to make it make sense, otherwise the end-of-transaction
 write is like a semaphore with only one of the two parties using it), and
 the other is the event counter.

Not really.  The event counter is *much* coarser-grained than journal
entries.  A raid array may be in use for years and never have the event
counter get above 20 or so if it stays up most of the time and doesn't
suffer disk add/remove events.  It's really only intended to mark events
like drive failures, so that if you have a drive fail on shutdown, then
on reboot we know that it failed, because we did an immediate superblock
event counter update on all drives except the failed one when the
failure happened.
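
As a very rough illustration of that rule (made-up names, not the real
md code):

enum md_event { DISK_FAILED, DISK_ADDED, CLEAN_TO_DIRTY, DIRTY_TO_CLEAN,
                ARRAY_START, ARRAY_STOP };

struct toy_array {
    unsigned long events;   /* the event count, monotonically increasing */
    int           ndisks;   /* working disks that get a superblock flush */
};

static void write_superblock(struct toy_array *a, int disk)
{
    (void)a; (void)disk;    /* stand-in for the real superblock write */
}

/* any state change that would otherwise leave the disks disagreeing
 * bumps the counter once and rewrites the superblock on every
 * still-working disk (a failed disk is of course skipped) */
static void note_state_change(struct toy_array *a, enum md_event ev)
{
    (void)ev;
    a->events++;
    for (int i = 0; i < a->ndisks; i++)
        write_superblock(a, i);
}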

  Now, if I recall correctly, Peter posted a patch that changed this
  semantic in the raid1 code.  The raid1 code does not complete a write to
  the upper layers of the kernel until it's been completed on all devices
  and his patch made it such that as soon as it hit 1 device it returned
  the write to the upper layers of the kernel.
 
 I am glad to hear that the behaviour is such that the barrier stops until
 *all* media got written. That was one of the things that really made me
 worry. I hope the patch is backed out and didn't go into any distros.

No it never went anywhere.  It was just a "Hey guys, I played with this
optimization, here's the patch" type posting and no one picked it up for
inclusion in any upstream or distro kernels.

  had in its queue.  Being a nice, smart SCSI disk with tagged queuing
  enabled, it then proceeds to complete the whole queue of writes in
  whatever order is most efficient for it.
 
 But just to make sure: your previous statement "...when the linux block
 layer did not provide any means of write barriers. As a result, they used
 completion events as write barriers." indicates that even a nice, smart
 SCSI disk with tagged queuing enabled will act as demanded, because the
 special way of writing with appended completion-event testing will make
 sure it does?

Yes.  We know that drives are allowed to reorder writes, so anytime we
want a barrier for a given write (say you want all journal transactions
complete before 

Re: RAID1 and data safety?

2005-04-04 Thread Doug Ledford
On Tue, 2005-03-29 at 13:26 +0200, Peter T. Breuer wrote:
 Neil Brown [EMAIL PROTECTED] wrote:
  On Tuesday March 29, [EMAIL PROTECTED] wrote:
   
   Don't put the journal on the raid device, then - I'm not ever sure why
   people do that!  (they probably have a reason that is good - to them).
  
  Not good advice.  DO put the journal on a raid device.  It is much
  safer there.
 
 Two journals means two possible sources of unequal information - plus the
 two datasets.  We have been through this before. You get the journal
 you deserve.

No, you don't.  You've been through this before and it wasn't any more
correct than it is now.  Most of this seems to center on the fact that
you aren't aware of a few constraints that the linux md subsystem and
the various linux journaling filesystems were written under and how each
of those meets those constraints at an implementation level, so allow me
to elucidate that for you.

1) All linux filesystems are designed to work on actual, physical hard
drives.

2) The md subsystem is designed to provide fault tolerance for hard
drive failures via redundant storage of information (except raid0 and
linear, those are ignored throughout the rest of this email).

3) The md subsystem is designed to seamlessly operate underneath any
linux filesystem.  This implies that it must *act* like an actual,
physical hard drive in order to not violate assumptions made at the
filesystem level.

So here's how those constraints are satisfied in linux.

For constraint #1, specifically as it relates to journaling filesystems,
all journaling filesystems currently in use started their lives at a time
when the linux block layer did not provide any means of write barriers.
As a result, they used completion events as write barriers.  That is to
say, if you needed a write barrier between the end of journal
transaction write and the start of the actual data writes to the drive,
you simply waited for the drive to say that the actual end of journal
transaction data had been written prior to issuing any of the writes to
the actual filesystem.  You then waited for all filesystem writes to
complete before allowing that journal transaction to be overwritten.

Additionally, people have mentioned the concept of rollbacks relating to
journaling filesystems.  At least ext3, and likely all journaling
filesystems on linux, don't do rollbacks.  They do replays.  In order to
do a rollback, you would have to first read the data you are going to
update, save it somewhere, then start the update and if you crash
somewhere in the update you then read the saved data and put it back in
place of the partially completed update.  Obviously, this has
performance impact because it means that any update requires a
corresponding read/write cycle to save the old data.  What they actually
do is transactional updates where they write the update to the journal,
wait for all of the journal writes relevant to a specific transaction
group to complete, then start the writes to the actual filesystem.  If
you crash during the update to the filesystem, you replay any and all
whole journal transactions in the ext3 journal, which simply re-issues
the writes so that any that didn't complete get completed.  You never
start the writes until you know they are already committed to the
journal, and you never remove them from the journal until you know they
are all committed to the filesystem proper.  That way you are 100%
guaranteed to be able to complete whatever group of filesystem proper
writes were in process at the time of a crash, returning you to a
consistent state.  The main assumption that the filesystem relies upon
to make this true is that an issued write request is not returned
until it is complete and on media (or in the drive buffer, with the drive
claiming that even in the event of a power failure it will still make it
to media).
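
As a rough user-space model of that replay-not-rollback idea (hypothetical
structures, not the actual ext3/jbd code):

#include <sys/types.h>
#include <unistd.h>

struct journal_rec {
    off_t fs_offset;            /* where the block belongs in the fs proper   */
    char  data[4096];           /* the block as it was written to the journal */
};

struct transaction {
    int                 committed;  /* did the end-of-transaction record make it? */
    int                 nrecs;
    struct journal_rec *recs;
};

/* re-issue every write from every fully committed transaction; partially
 * journalled transactions are simply ignored, nothing is rolled back.
 * Re-issuing is harmless - the worst case is rewriting a block that had
 * already made it to the filesystem proper before the crash. */
static void replay_journal(int fs_fd, struct transaction *txns, int ntxns)
{
    for (int t = 0; t < ntxns; t++) {
        if (!txns[t].committed)
            continue;
        for (int r = 0; r < txns[t].nrecs; r++)
            pwrite(fs_fd, txns[t].recs[r].data,
                   sizeof(txns[t].recs[r].data),
                   txns[t].recs[r].fs_offset);
    }
    fsync(fs_fd);
}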

OK, that's the filesystem issues.  For constraint #2, md satisfies this
by storing data in a way that any single drive failure can be
compensated for transparently (or more if using a more than 2 disk raid1
array or using raid6).  The primary thing here is that on a recoverable
failure scenario, the layers above md must A) not know the error
occurred and B) must get the right data when reading and C) must be able
to continue writing to the device and those writes must be preserved
across reboots and other recovery operations that might take place to
bring the array out of degraded mode.  This is where the event counters
come into play.  That's what md uses to be able to tell which drives in
an array are up to date versus those that aren't, which is what's needed
to satisfy C.

Now, what Peter has been saying can happen on a raid1 array (but which
can't) is creeping data corruption that's only noticed later because a
write to the md array gets completed on one device but not the other and it
isn't until you read it later that this shows up.  Under normal failure
scenarios (aka, not the rather unlikely one posted 

Re: AW: RAID1 and data safety?

2005-03-29 Thread Molle Bestefich
 Does this sound reasonable?

Does to me.  Great example!
Thanks for painting the pretty picture :-).

Seeing as you're clearly the superior thinker, I'll address your brain
instead of wasting wattage on my own.

Let's say that MD had the feature to read from both disks in a mirror
and perform a comparison on read.
Let's say that I had that feature turned on for 2 mirror arrays (4 disks).
I want to get a bit of performance back though, so I stripe the two
mirrored arrays.

Do you see any problem in this scenario?
Are we back to "corruption could happen" then, or are we still OK?


Re: RAID1 and data safety?

2005-03-29 Thread Peter T. Breuer
Schuett Thomas EXT [EMAIL PROTECTED] wrote:
 And here the fault happens:
 By chance, it reads the transaction log from hda, then sees that the
 transaction was finished, and clears the overall unclean bit.
 This clearing is a write, so it goes to *both* HDs.

Don't put the journal on the raid device, then - I'm not ever sure why
people do that!  (they probably have a reason that is good - to them).

Or put it on another raid partition/device on the same media, but one
set to error unless replication is perfect (there does seem to be a use
for that policy!).

Peter



Re: AW: AW: RAID1 and data safety?

2005-03-29 Thread Neil Brown
On Tuesday March 29, [EMAIL PROTECTED] wrote:
 But:
 If you have a raid1 and a journaling fs, see the following:
 If the system crashes at the end of a write transaction,
 then the end-of-transaction information may have got written
 to hda already, but not to hdb. On the next boot, the
 journaling fs may see an overall unclean bit (*probably* a transaction
 is pending), so it reads the transaction log.
 
 And here the fault happens:
 By chance, it reads the transaction log from hda, then sees that the
 transaction was finished, and clears the overall unclean bit.
 This clearing is a write, so it goes to *both* HDs.
 
 Situation now: on hdb there is a pending transaction in the transaction
 log, but the overall unclean bit is cleared. This may not be realised
 until by chance a year later hda crashes, and you finally face the fact
 that there is a corrupt situation on the remaining HD.

Wrong.  There is nothing of the sort on hdb.  
Due to the system crash the data on hdb is completely ignored.  Data
from hda is copied over onto it.  Until that copy has completed,
nothing is read from hdb.

You could possibly come up with a scenario where the above happens but
while the copy from hda -> hdb is happening, hda dies completely, so
reads have to start happening from hdb.
md could possibly handle this situation better (ensure a copy has
happened for any block before a read of that block succeeds), but I
don't think it is at all likely to be a real-life problem.

NeilBrown


Re: RAID1 and data safety?

2005-03-29 Thread Peter T. Breuer
Neil Brown [EMAIL PROTECTED] wrote:
 Due to the system crash the data on hdb is completely ignored.  Data

Neil - can you explain the algorithm that stamps the superblocks with
an event count, once and for all? (until further amendment :-).

It goes without saying that sb's are not stamped at every write, and the
event count is not incremented at every write, so when and when?

Thanks

Peter



Re: RAID1 and data safety?

2005-03-29 Thread Peter T. Breuer
Neil Brown [EMAIL PROTECTED] wrote:
 On Tuesday March 29, [EMAIL PROTECTED] wrote:
  
  Don't put the journal on the raid device, then - I'm not ever sure why
  people do that!  (they probably have a reason that is good - to them).
 
 Not good advice.  DO put the journal on a raid device.  It is much
 safer there.

Two journals means two possible sources of unequal information - plus the
two datasets.  We have been through this before. You get the journal
you deserve.

Peter



AW: AW: RAID1 and data safety?

2005-03-29 Thread Schuett Thomas EXT
 Does this sound reasonable?

Does to me.  Great example!

Thanks for the flowers :)
However, I am sure the raid developers have thought through
all this over and over, and still have some aces up their sleeves.

I'd like to hear from them about the event count in the superblock
Peter mentioned, and the algorithm that decides which blocks still
need to be synced.
As Luca wrote:
  "there isn't one [non-volatile storage about blocks needing sync] for
  lack of a non-volatile storage for dirty cache"
but probably Neil knows a bit more about that?


Probably, to be on the safe side, one would have to perform
real HD-internal write cache flushes after each
- write of start-of-transaction info
- write of data
- write of end-of-transaction info
I think this is necessary because otherwise the HD write cache
flush might start with a write that came in later, so it might
first write the end-of-transaction info, then the data, and then
the start-of-transaction info. A crash in between would
smash everything.

Actually this should be a problem for journaling fs writers in the
first place, but as raid subsystems in between do some caching of
their own in a very special way, it becomes a topic for raid designers
too. What do I mean by "very special way"? I mean that they write,
and then say that they have written OK. And if you read back the
written data (after a crash in between), you may by chance (= by having
the faster HD chosen for the read) find everything fine, even if it
actually did write to only one of the HDs.

I still believe that things would be better if reads went to both HDs
and the results were compared. Even if a difference would not be
resolvable for data (and so would not improve that situation), it would
improve the situation for reading transaction info (a rough sketch
follows the list below):

difference in start-of-transaction info
 -> the data write has not started yet, so just
    delete the start-of-transaction info

difference in end-of-transaction info
 -> the data write has finished already, so just
    update the end-of-transaction info

difference in data
 -> cannot happen, because the jfs would have rolled back
    at boot after the crash
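
A rough user-space sketch of the read-both-and-compare step itself
(hypothetical code - md does not do this today):

#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* returns 1 if both copies agree, 0 if they differ; the caller then
 * applies a rule like the three cases above, depending on whether the
 * block was a start record, an end record, or data */
static int mirrored_read(int fd_a, int fd_b, off_t off, char *buf, size_t len)
{
    char other[4096];

    assert(len <= sizeof(other));
    pread(fd_a, buf,   len, off);   /* copy from the first mirror  */
    pread(fd_b, other, len, off);   /* same block from the second  */
    return memcmp(buf, other, len) == 0;
}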



Thomas


PS:
"Do you see any problem in this [more complex 4 HD] scenario?"
It looks like the easier example is still not clarified, so let's stay with
that one for now :-)


Re: RAID1 and data safety?

2005-03-29 Thread Luca Berra
On Tue, Mar 29, 2005 at 01:29:22PM +0200, Peter T. Breuer wrote:
 Neil Brown [EMAIL PROTECTED] wrote:
  Due to the system crash the data on hdb is completely ignored.  Data
 Neil - can you explain the algorithm that stamps the superblocks with
 an event count, once and for all? (until further amendment :-).
IIRC it is updated at every event (start, stop, add, remove, fail etc...)
 It goes without saying that sb's are not stamped at every write, and the
 event count is not incremented at every write, so when and when?
the event count is not incremented at every write, but the dirty flag
is, and it is cleared lazily after some idle time.
in older code it was set at array start and cleared only at stop.
so in case of a disk failure the other disks get updated about the
failure.
in case of a restart (crash) the array will be dirty and a coin tossed
to choose which mirror to use as an authoritative source (the coin is
biased, but it doesn't matter). At this point any possible parallel
reality is squashed out of existence.
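
As a rough model of that choice (made-up code, not the real md logic):

/* the freshest superblock wins; after a crash all event counts are
 * usually equal, so the loop falls through to disk 0 - that is the
 * "biased coin" */
static int choose_resync_source(const unsigned long events[], int ndisks)
{
    int best = 0;
    for (int i = 1; i < ndisks; i++)
        if (events[i] > events[best])
            best = i;
    return best;
}
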
L.
--
Luca Berra -- [EMAIL PROTECTED]
   Communication Media  Services S.r.l.


Re: RAID1 and data safety?

2005-03-29 Thread Peter T. Breuer
Luca Berra [EMAIL PROTECTED] wrote:
 On Tue, Mar 29, 2005 at 01:29:22PM +0200, Peter T. Breuer wrote:
 Neil Brown [EMAIL PROTECTED] wrote:
  Due to the system crash the data on hdb is completely ignored.  Data
 
 Neil - can you explain the algorithm that stamps the superblocks with
 an event count, once and for all? (until further amendment :-).
 
 IIRC it is updated at every event (start, stop, add, remove, fail etc...)

Hmm .. I see it updated sometimes twice and sometimes once between a
setfaulty and a hotadd (no writes between). There may be a race.

It's a bit of a problem because when I start a bitmap (which is when a
disk is faulted from the array), I copy the event count at that time to
the bitmap.  When the disk is re-inserted, I look at the event count on
its sb, and see that it may sometimes be one, sometimes two behind the
count on the bitmap.

And then sometimes the array event count jumps by ten or so.

Here's an example:

  md0: repairing old mirror component 300015 (disk 306 >= bitmap 294)

I had done exactly one write on the degraded array. And maybe a
setfaulty and a hotadd. The test cycle before that (exactly the same)
I got:

  md0: repairing old mirror component 300015 (disk 298 >= bitmap 294)

and at the very first separation (first test cycle) I saw

  md0: warning - new disk 300015 nearly too old for repair (disk 292 < bitmap 294)

(Yeah, these are my printk's - so what).

So it's all consistent with the idea that the event count is
incremented more frequently than you say.

Anyway, what you are saying is that if a crash occurs on the node with the
array, then the event counts on BOTH mirrors will be the same.  Thus
there is no way of knowing which is the more up to date.

 It goes without saying that sb's are not stamped at every write, and the
 event count is not incremented at every write, so when and when?
 
 the event count is not incremented at every write, but the dirty flag
 is, and it is cleared lazily after some idle time.
 in older code it was set at array start and cleared only at stop.

Hmmm. You mean this

   int sb_dirty;

in the mddev?  I don't think that's written out .. well, it may be, if
the whole sb is written, but that's very big. What exactly are you
referencing with the dirty flag above?

 so in case of a disk failure the other disks get updated about the
 failure.

Well, yes, but in the case of an array node crash ...

 in case of a restart (crash) the array will be dirty and a coin tossed
 to chose which mirror to use as an authoritative source (the coin is
 biased, but it doesn't matter). At this point any possible parallel
 reality is squashed out of existance.

It is my opinion that one ought always to roll back anything in the
journal (any journal) on a restart. On the grounds that you can't know
for sure if it went to the other mirror.

Would you like me to make a patch to make sure that writes go to all
mirrors or else error back to the user?  The only question in my mind is
how to turn such a policy on or off per array. Any suggestion? I'm not
familiar with most of mdadm's newer capabilities. I'd use the sysctl
interface, but it's not set up to be per array. It should be.

Peter



Re: RAID1 and data safety?

2005-03-22 Thread Molle Bestefich
Neil Brown wrote:
 Is there any way to tell MD to do verify-on-write and
 read-from-all-disks on a RAID1 array?

 No.
 I would have thought that modern disk drives did some sort of
 verify-on-write, else how would they detect write errors, and they are
 certainly in the best place to do verify-on-write.

Really?  My guess was that they wouldn't, because it would cost
performance.
And that's why read errors crop up at read time.

 Doing it at the md level would be problematic as you would have to
 ensure that you really were reading from the media and not from some
 cache somewhere in the data path.  I doubt it would be a mechanism
 that would actually increase confidence in the safety of the data.

Hmm.  Could hack it by reading / writing blocks larger than the cache.  Ugly.

 Imagine a filesystem that could access multiple devices, and where, when it
 kept index information, it didn't just keep one block address, but
 rather kept two block addresses, each on a different device, and a strong
 checksum of the data block.  This would allow much the same robustness
 as read-from-all-drives with much lower overhead.

As in, if the checksum fails, try loading the data blocks [again]
from the other device?
Not sure why a checksum of X data blocks should be cheaper
performance-wise than a comparison between X data blocks, but I can
see the point in that you only have to load the data once and check
the checksum.  Not quite the same security, but almost.

 In summary:
  - you cannot do it now.
  - I don't think md is at the right level to solve these sort of problems.
I think a filesystem could do it much better. (I'm working on a
filesystem  slowly...)
  - read-from-all-disks might get implemented one day. verify-on-write
is much less likely.
 
 Apologies if the answer is in the docs.
 
 It isn't.  But it is in the list archives now

Thanks! :-)

(Guess I'll drop the idea for the time being...)


AW: RAID1 and data safety?

2005-03-22 Thread Schuett Thomas EXT
Neil Brown wrote:
 Is there any way to tell MD to [...] and
 read-from-all-disks on a RAID1 array?

Not sure why a checksum of X data blocks should be cheaper
performance-wise than a comparison between X data blocks, but I can
see the point in that you only have to load the data once and check
the checksum.  Not quite the same security, but almost.

Still, if there is different data on the two disks due to a previous
power failure, the comparison could really be the better choice, couldn't it?




Re: RAID1 and data safety?

2005-03-21 Thread Neil Brown
On Wednesday March 16, [EMAIL PROTECTED] wrote:
 Just wondering;
 
 Is there any way to tell MD to do verify-on-write and
 read-from-all-disks on a RAID1 array?

No.
I would have thought that modern disk drives did some sort of
verify-on-write, else how would they detect write errors, and they are
certainly in the best place to do verify-on-write.
Doing it at the md level would be problematic as you would have to
ensure that you really were reading from the media and not from some
cache somewhere in the data path.  I doubt it would be a mechanism
that would actually increase confidence in the safety of the data.

read-from-all-disks would require at least three drives before there
would be any real value in it.  There would be an enormous overhead,
but possibly that could be justified in some circumstances.  If we
ever implement background-data-checking, it might become relatively
easy to implement this.

However I think that checksum based checking would be more effective,
and that it should be done at the filesystem level.

Imagine a filesystem that could access multiple devices, and where, when it
kept index information, it didn't just keep one block address, but
rather kept two block addresses, each on a different device, and a strong
checksum of the data block.  This would allow much the same robustness
as read-from-all-drives with much lower overhead.
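
A rough sketch of what such an index entry and read path could look like
(made-up structures, not any real filesystem's code):

#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

struct block_ptr {
    off_t         addr[2];  /* the same logical block, one address per device */
    unsigned long csum;     /* strong checksum of the block's contents        */
};

/* toy checksum - a real filesystem would use a proper strong hash */
static unsigned long toy_csum(const char *buf, size_t len)
{
    unsigned long c = 0;
    while (len--)
        c = c * 131 + (unsigned char)*buf++;
    return c;
}

/* read one copy and verify it; fall back to the second device only if
 * the checksum fails, so the common case costs a single read */
static int checked_read(int fd[2], const struct block_ptr *bp,
                        char *buf, size_t len)
{
    for (int i = 0; i < 2; i++) {
        pread(fd[i], buf, len, bp->addr[i]);
        if (toy_csum(buf, len) == bp->csum)
            return 0;       /* good copy found */
    }
    return -1;              /* both copies bad: report the error */
}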

It is very possible that Sun's new ZFS filesystem works like this,
though I haven't seen precise technical details.

In summary:
 - you cannot do it now.
 - I don't think md is at the right level to solve these sort of problems.
   I think a filesystem could do it much better. (I'm working on a
   filesystem  slowly...)
 - read-from-all-disks might get implemented one day. verify-on-write
   is much less likely.

 
 Apologies if the answer is in the docs.

It isn't.  But it is in the list archives now

NeilBrown


RAID1 and data safety?

2005-03-16 Thread Molle Bestefich
Just wondering;

Is there any way to tell MD to do verify-on-write and
read-from-all-disks on a RAID1 array?

I was thinking of setting up a couple of RAID1s with maximum data safety.
I'd like to verify after each write to a disk, plus I'd like to read
from all disks and perform a data comparison whenever something is read.
I'd then run a RAID0 over the RAID1 arrays, to regain some of the
speed lost to all of the excessive checking.

Just wondering if it could be done :-).

Apologies if the answer is in the docs.