Re: Strange crash, possibly vinum-related

2003-03-18 Thread Scott Mitchell
On Tue, Mar 18, 2003 at 10:22:36AM +1030, Greg 'groggy' Lehey wrote:
 On Monday, 17 March 2003 at 10:58:28 +0000, Scott Mitchell wrote:
  On Mon, Mar 10, 2003 at 11:15:32PM +0000, Scott Mitchell wrote:
 
  No takers? 
 
 I've been intending to do so, but there's not much I can do based on
 the information you've supplied.

Hi Greg,

Thanks for replying -- I didn't bother you personally with this before
precisely because I didn't think I had enough information there to diagnose
the problem fully.  Is there anything in particular I can send you that
might help?

  Maybe someone who's done this (replacing a failed Vinum drive on
  hot-swap SCSI hardware) before can at least tell me whether:
 
  - I should have done some camcontrol magic before rebuilding
  the drive?
 
 I can't see anything in particular you would need to do, but then I
 haven't seen the details.

I guess it wouldn't have hurt to do a 'camcontrol rescan' after plugging
the new drive.  Disklabel seemed perfectly happy, so it didn't occur to me
until much later that I hadn't done that.

  - Rebuilding the drive without unmounting the volume first was
  just asking for trouble?
 
 There have been reports of this kind of problem, mainly from Vallo
 Kallaste, who has also responded.  I haven't seen it myself, and I
 haven't heard of panics as a result.  But yes, umounting is a good
 precaution.

I've added that to my checklist for next time :-)

Please let me know if there's any other logs you'd like to see or anything
else I can try.  I'm actually planning to set up a very similar array on
another identical machine once 4.8 is released, so there's a window for
experimentation there, on a non-production machine.

Thanks again,

Scott

-- 
===
Scott Mitchell   | PGP Key ID | Eagles may soar, but weasels
Cambridge, England   | 0x54B171B9 |  don't get sucked into jet engines
scott at fishballoon.org | 0xAA775B8B |  -- Anon

To Unsubscribe: send mail to [EMAIL PROTECTED]
with unsubscribe freebsd-questions in the body of the message
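
For reference, the checklist that emerges from this thread (rescan the bus after the swap, unmount the volume, then revive the subdisk) might be sketched as the dry-run script below. It only prints the commands and does not run them. The mount point /raid is a hypothetical name, the subdisk name is the one from the log excerpt quoted in the thread, and `mount /raid` assumes the volume has an fstab entry. Labelling the disk and re-creating the Vinum drive depend on the local setup, so they appear only as a reminder line.

```shell
#!/bin/sh
# Dry-run sketch: prints the replacement steps discussed in this thread
# instead of executing them. Nothing here touches a disk.

# rebuild_plan MOUNTPOINT SUBDISK
#   MOUNTPOINT - where the Vinum volume is mounted (hypothetical: /raid)
#   SUBDISK    - the stale subdisk to revive (raid.p0.s0 in the thread)
rebuild_plan() {
    mountpoint=$1
    subdisk=$2

    # Let CAM notice the newly inserted drive.
    echo "camcontrol rescan all"
    # Quiesce the volume before the revive starts.
    echo "umount $mountpoint"
    # Labelling the new disk and re-creating the Vinum drive are
    # site-specific, so they are left as a reminder only.
    echo "# label the new disk and re-create the Vinum drive here"
    # Starting a stale subdisk begins the revive.
    echo "vinum start $subdisk"
    # Check object states before remounting.
    echo "vinum list"
    echo "mount $mountpoint"
}

rebuild_plan /raid raid.p0.s0
```

Printing the plan first makes it easy to review the order of operations before typing anything on a production machine.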


Re: Strange crash, possibly vinum-related

2003-03-18 Thread Greg 'groggy' Lehey
On Tuesday, 18 March 2003 at 11:28:19 +0000, Scott Mitchell wrote:
 On Tue, Mar 18, 2003 at 10:22:36AM +1030, Greg 'groggy' Lehey wrote:
 On Monday, 17 March 2003 at 10:58:28 +0000, Scott Mitchell wrote:
 On Mon, Mar 10, 2003 at 11:15:32PM +0000, Scott Mitchell wrote:

 No takers?

 I've been intending to do so, but there's not much I can do based on
 the information you've supplied.

 Thanks for replying -- I didn't bother you personally with this
 before precisely because I didn't think I had enough information
 there to diagnose the problem fully.  Is there anything in
 particular I can send you that might help?

The canonical list is at
http://www.vinumvm.org/vinum/how-to-debug.html.

 - Rebuilding the drive without unmounting the volume first was
 just asking for trouble?

 There have been reports of this kind of problem, mainly from Vallo
 Kallaste, who has also responded.  I haven't seen it myself, and I
 haven't heard of panics as a result.  But yes, umounting is a good
 precaution.

 I've added that to my checklist for next time :-)

 Please let me know if there's any other logs you'd like to see or anything
 else I can try.  I'm actually planning to set up a very similar array on
 another identical machine once 4.8 is released, so there's a window for
 experimentation there, on a non-production machine.

Well, feel free to try to break it and reproduce Vallo's problems, but
I really need to reproduce it here so that I can poke at the problem
and fix it.

Greg
--
When replying to this message, please copy the original recipients.
If you don't, I may ignore the reply or reply to the original recipients.
For more information, see http://www.lemis.com/questions.html
See complete headers for address and phone numbers


Re: Strange crash, possibly vinum-related

2003-03-17 Thread Scott Mitchell
On Mon, Mar 10, 2003 at 11:15:32PM +0000, Scott Mitchell wrote:
 Hi all,
 
 I wonder if anyone out there can shed any light on this:
 
 A drive failed on one of our Vinum-powered RAID-5 arrays over the weekend.
 This morning, we swapped out the offending drive (hot-swappable SCSI
 hardware), disklabel-ed it and restarted the offending subdisk.  Everything
 seemed fine at this point, with vinum happily reviving the stale subdisk.
 
 However, twenty minutes later, with the revive 29% complete, I got this in
 /var/log/messages:
 
 Mar 10 11:39:50 kokako vinum[12708]: can't revive raid.p0.s0: Invalid argument
 
 'vinum list' was also showing an error message, which I foolishly didn't
 capture, something along the lines of 'the revive process died'.  Lacking
 any better ideas, I started the subdisk again.  The revival seemed to pick
 up where it left off.
 
 Half an hour later, the box rebooted :-(  I wasn't actually watching it at
 the time, so I don't know if it finished reviving the subdisk or not.
 There's no indication in the logs as to what happened, but the timing of
 the reboot is consistent with it happening around the time the subdisk
 would have come back to life.
 
 Once the box came back up, I restarted the subdisk yet again (I had to
 create the drive again first), with the RAID volume unmounted.  This time
 the process finished without complaints and things seem to be working as
 well as ever since then.
[logs, etc. snipped...]


No takers?  Maybe someone who's done this (replacing a failed Vinum drive
on hot-swap SCSI hardware) before can at least tell me whether:

- I should have done some camcontrol magic before rebuilding the drive?
- Rebuilding the drive without unmounting the volume first was just
  asking for trouble?
- -hackers or even -stable is a better venue for this kind of problem?


Many thanks in advance,

Scott

-- 
===
Scott Mitchell   | PGP Key ID | Eagles may soar, but weasels
Cambridge, England   | 0x54B171B9 |  don't get sucked into jet engines
scott at fishballoon.org | 0xAA775B8B |  -- Anon
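
The revive failure described in this message was only visible in /var/log/messages, and nobody was watching when the box rebooted. As a rough aid, a small filter like the one below could pull such lines out of a log file. The pattern is modelled on the single log line quoted in the thread, so other vinum messages may well be worded differently. The scratch file and its second line are made up for the demonstration.

```shell
#!/bin/sh
# revive_failures FILE: print syslog lines in which vinum reported a
# failed revive. The pattern follows the one line quoted in this thread.
revive_failures() {
    grep -E "vinum\[[0-9]+\]: can't revive " "$1"
}

# Self-contained demo: a scratch file stands in for /var/log/messages.
log=$(mktemp /tmp/vinumlog.XXXXXX)
cat > "$log" <<'EOF'
Mar 10 11:39:50 kokako vinum[12708]: can't revive raid.p0.s0: Invalid argument
Mar 10 11:40:02 kokako sshd[311]: an unrelated line (made up for the demo)
EOF
revive_failures "$log"
rm -f "$log"
```

On a live system the function would be pointed at /var/log/messages instead of the scratch file.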


Re: Strange crash, possibly vinum-related

2003-03-17 Thread Vallo Kallaste
On Mon, Mar 17, 2003 at 10:58:28AM +0000, Scott Mitchell wrote:

 No takers?  Maybe someone who's done this (replacing a failed Vinum drive
 on hot-swap SCSI hardware) before can at least tell me whether:
 
   - I should have done some camcontrol magic before rebuilding the drive?
   - Rebuilding the drive without unmounting the volume first was just
 asking for trouble?
   - -hackers or even -stable is a better venue for this kind of problem?

What do you want to hear? I'd advise you not to put critical data on a
Vinum R5 volume. You did test this configuration thoroughly before
putting it online, didn't you? All this failing/replacing/rebuilding
stuff? Better to quiesce the volume (umount it) before rebuilding;
doing otherwise causes data loss in my experience. You should use
camcontrol when hot-swapping drives. Finally, collect
the debug information Greg Lehey (the Vinum author) needs and submit it
directly to him.
-- 

Vallo Kallaste
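
The advice to quiesce the volume before rebuilding could be backed by a small guard that checks whether the volume is still mounted. The sketch below reads mount(8)-style output ("device on directory (options)") from standard input so that it can be tried without a real array. The device path /dev/vinum/raid and the mount point /raid are assumptions based on the subdisk name raid.p0.s0 and are not stated anywhere in the thread.

```shell
#!/bin/sh
# is_mounted DEVICE: succeed if DEVICE is listed as a mounted filesystem
# in mount(8)-style output read from standard input.
is_mounted() {
    awk -v dev="$1" '$1 == dev && $2 == "on" { found = 1 } END { exit !found }'
}

# Demo with canned text standing in for the output of `mount`; the device
# and mount point are assumptions, not taken from the thread.
sample='/dev/da0s1a on / (ufs, local)
/dev/vinum/raid on /raid (ufs, local)'

if printf '%s\n' "$sample" | is_mounted /dev/vinum/raid; then
    echo "volume is mounted: umount it before starting the revive"
else
    echo "volume is not mounted"
fi
```

On a live system the check would read `mount | is_mounted /dev/vinum/raid`.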


Re: Strange crash, possibly vinum-related

2003-03-17 Thread Vallo Kallaste
On Mon, Mar 17, 2003 at 12:19:32PM +0000, Scott Mitchell wrote:

 Thanks -- you've confirmed what I suspected, that I could have avoided the
 problems I saw by being a bit more cautious.  My bad.
 
 Out of interest though, why do you advise not putting critical data on a
 Vinum R5 volume?  This one has been running fine for ~2 years under
 reasonable loads.  The disk failure was the first time it's required any
 attention at all, and it seems the problems I had with that were mostly of
 my own making.  The mailing lists don't seem to be overrun with people
 complaining that 'Vinum ate my files' :-)

Because the main features of RAID5 are increased data redundancy _and_
data availability. As you have discovered, it runs until it fails
and then you'll have a hard time recovering it. Recovery is the most
important (and difficult) part of it. When it fails to recover from
the disk loss, what is it worth, then? The 2 years of uninterrupted
service don't matter when it happens. Your data is unavailable and
services are down. Critical data is, by definition, critical :) I did
put lots of data onto Vinum R5, because I knew that a day of
downtime per half a year isn't a problem. Recovery on the quiet
(unmounted) volume did work and all was well. But for critical data
I don't trust it (yet). Just my point of view.
-- 

Vallo Kallaste


Re: Strange crash, possibly vinum-related

2003-03-17 Thread Greg 'groggy' Lehey
On Monday, 17 March 2003 at 10:58:28 +0000, Scott Mitchell wrote:
 On Mon, Mar 10, 2003 at 11:15:32PM +0000, Scott Mitchell wrote:
 Hi all,

 I wonder if anyone out there can shed any light on this:

 A drive failed on one of our Vinum-powered RAID-5 arrays over the weekend.
 This morning, we swapped out the offending drive (hot-swappable SCSI
 hardware), disklabel-ed it and restarted the offending subdisk.  Everything
 seemed fine at this point, with vinum happily reviving the stale subdisk.

 However, twenty minutes later, with the revive 29% complete, I got this in
 /var/log/messages:

 Mar 10 11:39:50 kokako vinum[12708]: can't revive raid.p0.s0: Invalid argument

 'vinum list' was also showing an error message, which I foolishly didn't
 capture, something along the lines of 'the revive process died'.  Lacking
 any better ideas, I started the subdisk again.  The revival seemed to pick
 up where it left off.

 Half an hour later, the box rebooted :-(  I wasn't actually watching it at
 the time, so I don't know if it finished reviving the subdisk or not.
 There's no indication in the logs as to what happened, but the timing of
 the reboot is consistent with it happening around the time the subdisk
 would have come back to life.

 Once the box came back up, I restarted the subdisk yet again (I had to
 create the drive again first), with the RAID volume unmounted.  This time
 the process finished without complaints and things seem to be working as
 well as ever since then.
 [logs, etc. snipped...]


 No takers? 

I've been intending to do so, but there's not much I can do based on
the information you've supplied.

 Maybe someone who's done this (replacing a failed Vinum drive on
 hot-swap SCSI hardware) before can at least tell me whether:

   - I should have done some camcontrol magic before rebuilding
 the drive?

I can't see anything in particular you would need to do, but then I
haven't seen the details.

   - Rebuilding the drive without unmounting the volume first was
   just asking for trouble?

There have been reports of this kind of problem, mainly from Vallo
Kallaste, who has also responded.  I haven't seen it myself, and I
haven't heard of panics as a result.  But yes, umounting is a good
precaution.

   - -hackers or even -stable is a better venue for this kind of problem?

-questions will do fine.

Greg
--
When replying to this message, please copy the original recipients.
If you don't, I may ignore the reply or reply to the original recipients.
For more information, see http://www.lemis.com/questions.html
See complete headers for address and phone numbers


Re: Strange crash, possibly vinum-related

2003-03-17 Thread Greg 'groggy' Lehey
On Monday, 17 March 2003 at 18:37:46 +0200, Vallo Kallaste wrote:
 On Mon, Mar 17, 2003 at 12:19:32PM +0000, Scott Mitchell wrote:

 Thanks -- you've confirmed what I suspected, that I could have avoided the
 problems I saw by being a bit more cautious.  My bad.

 Out of interest though, why do you advise not putting critical data on a
 Vinum R5 volume?  This one has been running fine for ~2 years under
 reasonable loads.  The disk failure was the first time it's required any
 attention at all, and it seems the problems I had with that were mostly of
 my own making.  The mailing lists don't seem to be overrun with people
 complaining that 'Vinum ate my files' :-)

 Because the main features of RAID5 are increased data redundancy _and_
 data availability. As you have discovered, it runs until it fails
 and then you'll have a hard time recovering it. Recovery is the most
 important (and difficult) part of it.

Well, everybody else seems to manage fine.  It's not difficult, just
unreliable in your experience.  And yes, I take your experience
seriously, but it's not what most other people see.

Greg
--
When replying to this message, please copy the original recipients.
If you don't, I may ignore the reply or reply to the original recipients.
For more information, see http://www.lemis.com/questions.html
See complete headers for address and phone numbers

