On Sun, 2011-05-01 at 15:06 -0700, Jameson Graef Rollins wrote:
> On Fri, 29 Apr 2011 05:39:40 +0100, Ben Hutchings <b...@decadent.org.uk> wrote:
> > On Wed, 2011-04-27 at 09:19 -0700, Jameson Graef Rollins wrote:
> > > I run what I imagine is a fairly unusual disk setup on my laptop,
> > > consisting of:
> > > 
> > >   ssd -> raid1 -> dm-crypt -> lvm -> ext4
> > > 
> > > I use the raid1 as a backup.  The raid1 operates normally in degraded
> > > mode.  For backups I then hot-add a usb hdd, let the raid1 sync, and
> > > then fail/remove the external hdd. 
> > 
> > Well, this is not expected to work.  Possibly the hot-addition of a disk
> > with different bio restrictions should be rejected.  But I'm not sure,
> > because it is safe to do that if there is no mounted filesystem or
> > stacking device on top of the RAID.
> 
> Hi, Ben.  Can you explain why this is not expected to work?  Which part
> exactly is not expected to work and why?

Adding another type of disk controller (USB storage versus whatever the
SSD interface is) to a RAID that is already in use.

> > I would recommend using filesystem-level backup (e.g. dirvish or
> > backuppc).  Aside from this bug, if the SSD fails during a RAID resync
> > you will be left with an inconsistent and therefore useless 'backup'.
> 
> I appreciate your recommendation, but it doesn't really have anything to
> do with this bug report.  Unless I am doing something that is
> *expressly* not supposed to work, then it should work, and if it doesn't
> then it's either a bug or a documentation failure (ie. if this setup is
> not supposed to work then it should be clearly documented somewhere what
> exactly the problem is).

The normal state of a RAID set is that all disks are online.  You have
deliberately turned this on its head; the normal state of your RAID set
is that one disk is missing.  This is such a basic principle that most
documentation won't mention it.

> > The block layer correctly returns an error after logging this message.
> > If it's due to a read operation, the error should be propagated up to
> > the application that tried to read.  If it's due to a write operation, I
> > would expect the error to result in the RAID becoming desynchronised.
> > In some cases it might be propagated to the application that tried to
> > write.
> 
> Can you say what is "correct" about the returned error?  That's what I'm
> still not understanding.  Why is there an error and what is it coming
> from?

The error is that you changed the I/O capabilities of the RAID while it
was already in use.  But what I was describing as 'correct' was that an
error code was returned, rather than the error condition only being
logged.  If the error condition is not properly propagated then it could
lead to data loss.
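
As a rough illustration of what "I/O capabilities" means here: the block
layer exports per-device request queue limits under
/sys/block/<dev>/queue/.  The sketch below compares a few of those limits
between two devices.  The function names and the device names in the
usage note are made up for illustration, and this is not a statement of
which particular limit triggered the error in this report.

```python
from pathlib import Path

# Request queue attributes exported by the block layer in sysfs.
# Not every device exposes all of them, so missing files are skipped.
LIMITS = ("max_sectors_kb", "max_hw_sectors_kb", "max_segments", "max_segment_size")


def read_limits(dev, sysfs_root="/sys/block"):
    """Return {attribute: int} for the queue limits found for `dev`.

    `sysfs_root` is a parameter only so the function can be pointed at a
    different directory tree; on a real system the default applies.
    """
    queue = Path(sysfs_root) / dev / "queue"
    limits = {}
    for name in LIMITS:
        path = queue / name
        if path.is_file():
            limits[name] = int(path.read_text().strip())
    return limits


def stricter_limits(active, candidate):
    """Return the attributes for which `candidate` allows less than `active`.

    For all attributes in LIMITS a smaller value is the more restrictive one.
    """
    return sorted(
        name
        for name in active
        if name in candidate and candidate[name] < active[name]
    )
```

For example, stricter_limits(read_limits("md0"), read_limits("sdb")) would
list the limits on which a hypothetical USB disk sdb is more restrictive
than a hypothetical array md0.  A non-empty result suggests that layers
stacked on top of the array may have been set up with limits the new disk
cannot meet.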

Ben.

-- 
Ben Hutchings
Once a job is fouled up, anything done to improve it makes it worse.
