Re: Strange crash, possibly vinum-related
On Tue, Mar 18, 2003 at 10:22:36AM +1030, Greg 'groggy' Lehey wrote:
> On Monday, 17 March 2003 at 10:58:28 +0000, Scott Mitchell wrote:
>> On Mon, Mar 10, 2003 at 11:15:32PM +0000, Scott Mitchell wrote:
>>
>> No takers?
>
> I've been intending to do so, but there's not much I can do based on the information you've supplied.

Hi Greg,

Thanks for replying -- I didn't bother you personally with this before precisely because I didn't think I had enough information there to diagnose the problem fully. Is there anything in particular I can send you that might help?

>> Maybe someone who's done this (replacing a failed Vinum drive on hot-swap SCSI hardware) before can at least tell me whether:
>>
>> - I should have done some camcontrol magic before rebuilding the drive?
>
> I can't see anything in particular you would need to do, but then I haven't seen the details.

I guess it wouldn't have hurt to do a 'camcontrol rescan' after plugging the new drive. Disklabel seemed perfectly happy, so it didn't occur to me until much later that I hadn't done that.

>> - Rebuilding the drive without unmounting the volume first was just asking for trouble?
>
> There have been reports of this kind of problem, mainly from Vallo Kallaste, who has also responded. I haven't seen it myself, and I haven't heard of panics as a result. But yes, umounting is a good precaution.

I've added that to my checklist for next time :-)

Please let me know if there's any other logs you'd like to see or anything else I can try. I'm actually planning to set up a very similar array on another identical machine once 4.8 is released, so there's a window for experimentation there, on a non-production machine.

Thanks again,

Scott

--
===
Scott Mitchell           | PGP Key ID | Eagles may soar, but weasels
Cambridge, England       | 0x54B171B9 | don't get sucked into jet engines
scott at fishballoon.org | 0xAA775B8B |      -- Anon

To Unsubscribe: send mail to [EMAIL PROTECTED] with unsubscribe freebsd-questions in the body of the message
Re: Strange crash, possibly vinum-related
On Tuesday, 18 March 2003 at 11:28:19 +0000, Scott Mitchell wrote:
> On Tue, Mar 18, 2003 at 10:22:36AM +1030, Greg 'groggy' Lehey wrote:
>> On Monday, 17 March 2003 at 10:58:28 +0000, Scott Mitchell wrote:
>>> On Mon, Mar 10, 2003 at 11:15:32PM +0000, Scott Mitchell wrote:
>>>
>>> No takers?
>>
>> I've been intending to do so, but there's not much I can do based on the information you've supplied.
>
> Thanks for replying -- I didn't bother you personally with this before precisely because I didn't think I had enough information there to diagnose the problem fully. Is there anything in particular I can send you that might help?

The canonical list is at http://www.vinumvm.org/vinum/how-to-debug.html.

>>> - Rebuilding the drive without unmounting the volume first was just asking for trouble?
>>
>> There have been reports of this kind of problem, mainly from Vallo Kallaste, who has also responded. I haven't seen it myself, and I haven't heard of panics as a result. But yes, umounting is a good precaution.
>
> I've added that to my checklist for next time :-)
>
> Please let me know if there's any other logs you'd like to see or anything else I can try. I'm actually planning to set up a very similar array on another identical machine once 4.8 is released, so there's a window for experimentation there, on a non-production machine.

Well, feel free to try to break it and reproduce Vallo's problems, but I really need to reproduce it here so that I can poke at the problem and fix it.

Greg
--
When replying to this message, please copy the original recipients.
If you don't, I may ignore the reply or reply to the original recipients.
For more information, see http://www.lemis.com/questions.html
See complete headers for address and phone numbers
Re: Strange crash, possibly vinum-related
On Mon, Mar 10, 2003 at 11:15:32PM +0000, Scott Mitchell wrote:
> Hi all,
>
> I wonder if anyone out there can shed any light on this:
>
> A drive failed on one of our Vinum-powered RAID-5 arrays over the weekend. This morning, we swapped out the offending drive (hot-swappable SCSI hardware), disklabel-ed it and restarted the offending subdisk. Everything seemed fine at this point, with vinum happily reviving the stale subdisk. However, twenty minutes later, with the revive 29% complete, I got this in /var/log/messages:
>
>   Mar 10 11:39:50 kokako vinum[12708]: can't revive raid.p0.s0: Invalid argument
>
> 'vinum list' was also showing an error message, which I foolishly didn't capture, something along the lines of 'the revive process died'.
>
> Lacking any better ideas, I started the subdisk again. The revival seemed to pick up where it left off. Half an hour later, the box rebooted :-( I wasn't actually watching it at the time, so I don't know if it finished reviving the subdisk or not. There's no indication in the logs as to what happened, but the timing of the reboot is consistent with it happening around the time the subdisk would have come back to life.
>
> Once the box came back up, I restarted the subdisk yet again (I had to create the drive again first), with the RAID volume unmounted. This time the process finished without complaints and things seem to be working as well as ever since then.
>
> [logs, etc. snipped...]

No takers?

Maybe someone who's done this (replacing a failed Vinum drive on hot-swap SCSI hardware) before can at least tell me whether:

- I should have done some camcontrol magic before rebuilding the drive?
- Rebuilding the drive without unmounting the volume first was just asking for trouble?
- -hackers or even -stable is a better venue for this kind of problem?

Many thanks in advance,

Scott
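
For anyone who wants to check their own logs for the same symptom, here is a minimal sketch of a filter for the failure message quoted above. It is only an illustration: the regular expression is inferred from that single log line, so the exact field layout is an assumption, and `find_revive_failures` is a hypothetical helper name, not part of Vinum.

```python
import re

# Pattern inferred from the one syslog line quoted in the message above
# (timestamp, host, "vinum[pid]:", text). Treat the layout as an assumption,
# not a documented format.
REVIVE_FAILURE = re.compile(
    r"^(?P<when>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) vinum\[(?P<pid>\d+)\]: "
    r"can't revive (?P<subdisk>\S+): (?P<reason>.+)$"
)


def find_revive_failures(lines):
    """Return one dict per line that reports a failed revive."""
    failures = []
    for line in lines:
        match = REVIVE_FAILURE.match(line.strip())
        if match:
            failures.append(match.groupdict())
    return failures


if __name__ == "__main__":
    sample = [
        "Mar 10 11:39:50 kokako vinum[12708]: "
        "can't revive raid.p0.s0: Invalid argument",
    ]
    for failure in find_revive_failures(sample):
        print(failure["when"], failure["subdisk"], failure["reason"])
```

In practice the lines would come from /var/log/messages. Nothing here explains the cause of the failure; it only makes the failed revive attempts easier to spot and to include in a report.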
Re: Strange crash, possibly vinum-related
On Mon, Mar 17, 2003 at 10:58:28AM +0000, Scott Mitchell wrote:
> No takers?
>
> Maybe someone who's done this (replacing a failed Vinum drive on hot-swap SCSI hardware) before can at least tell me whether:
>
> - I should have done some camcontrol magic before rebuilding the drive?
> - Rebuilding the drive without unmounting the volume first was just asking for trouble?
> - -hackers or even -stable is a better venue for this kind of problem?

What do you want to hear? I'll advise you not to put critical data on a Vinum R5 volume. You did test this configuration thoroughly before putting it online, didn't you? All this failing/replacing/rebuilding stuff?

Better to quiesce the volume (umount it) before rebuilding; doing otherwise causes data loss in my experience. You should use camcontrol when hot-swapping drives. Finally, collect the debug information Greg Lehey (Vinum author) needs and submit it directly to him.
--
Vallo Kallaste
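
The advice in this message (quiesce the volume, rescan with camcontrol, then rebuild) can be written down as a checklist. The following is a dry-run sketch only: it builds a list of steps and executes nothing. The mount point `/raid`, the `all` argument to `camcontrol rescan`, and the helper name `replacement_plan` are assumptions made for illustration; the thread itself names only 'camcontrol rescan', disklabel, 'vinum list', and the subdisk raid.p0.s0.

```python
def replacement_plan(mountpoint, subdisk):
    """Return the ordered steps as (kind, text) pairs; nothing is executed.

    "run" entries are commands to type; "manual" entries are steps whose
    exact commands depend on the machine and are not given in the thread.
    """
    return [
        # Quiesce the volume before touching the array.
        ("run", f"umount {mountpoint}"),
        # Make CAM notice the newly inserted disk (bus argument assumed).
        ("run", "camcontrol rescan all"),
        ("manual", "disklabel the new disk and recreate the Vinum drive if needed"),
        # Restarting the stale subdisk starts the revive.
        ("run", f"vinum start {subdisk}"),
        ("run", "vinum list"),
        ("manual", "wait until 'vinum list' shows the subdisk up"),
        ("run", f"mount {mountpoint}"),
    ]


if __name__ == "__main__":
    # "/raid" is a hypothetical mount point used only for this example.
    for kind, text in replacement_plan("/raid", "raid.p0.s0"):
        print(f"[{kind}] {text}")
```

Keeping the plan as data rather than running it directly is deliberate: it is meant to be read and adapted, not trusted blindly on a production array.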
Re: Strange crash, possibly vinum-related
On Mon, Mar 17, 2003 at 12:19:32PM +0000, Scott Mitchell wrote:
> Thanks -- you've confirmed what I suspected, that I could have avoided the problems I saw by being a bit more cautious. My bad.
>
> Out of interest though, why do you advise not putting critical data on a Vinum R5 volume? This one has been running fine for ~2 years under reasonable loads. The disk failure was the first time it's required any attention at all, and it seems the problems I had with that were mostly of my own making. The mailing lists don't seem to be overrun with people complaining that 'Vinum ate my files' :-)

Because the main purposes of RAID5 are to increase data redundancy _and_ data availability. As you have discovered, it runs until it fails and then you'll have a hard time recovering it. Recovery is the most important (and difficult) part of it. When it fails to recover from the loss of a disk, what is it worth then? The 2 years of uninterrupted service don't matter when that happens. Your data is unavailable and your services are down. Critical data is, by definition, critical :)

I did put lots of data onto Vinum R5, because I knew that a day of downtime every half year isn't a problem. Recovery on the quiet (unmounted) volume did work and all was well. But for critical data I don't trust it (yet). Just my point of view.
--
Vallo Kallaste
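
To put the "day of downtime every half year" figure in perspective, here is a small back-of-the-envelope calculation. It assumes half a year is 182.5 days and that the downtime really is one full day; both numbers are illustrative readings of the remark above, not measurements.

```python
def availability(downtime_days, period_days):
    """Fraction of the period during which the service is up."""
    return 1.0 - downtime_days / period_days


if __name__ == "__main__":
    # One day down in an assumed 182.5-day half year.
    fraction = availability(1, 182.5)
    print(f"{fraction * 100:.2f}% available")
```

Under those assumptions the volume is available roughly 99.45% of the time, which may be acceptable for bulk data but falls short of what is usually expected for critical services.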
Re: Strange crash, possibly vinum-related
On Monday, 17 March 2003 at 10:58:28 +0000, Scott Mitchell wrote:
> On Mon, Mar 10, 2003 at 11:15:32PM +0000, Scott Mitchell wrote:
>> Hi all,
>>
>> I wonder if anyone out there can shed any light on this:
>>
>> A drive failed on one of our Vinum-powered RAID-5 arrays over the weekend. This morning, we swapped out the offending drive (hot-swappable SCSI hardware), disklabel-ed it and restarted the offending subdisk. Everything seemed fine at this point, with vinum happily reviving the stale subdisk. However, twenty minutes later, with the revive 29% complete, I got this in /var/log/messages:
>>
>>   Mar 10 11:39:50 kokako vinum[12708]: can't revive raid.p0.s0: Invalid argument
>>
>> 'vinum list' was also showing an error message, which I foolishly didn't capture, something along the lines of 'the revive process died'.
>>
>> Lacking any better ideas, I started the subdisk again. The revival seemed to pick up where it left off. Half an hour later, the box rebooted :-( I wasn't actually watching it at the time, so I don't know if it finished reviving the subdisk or not. There's no indication in the logs as to what happened, but the timing of the reboot is consistent with it happening around the time the subdisk would have come back to life.
>>
>> Once the box came back up, I restarted the subdisk yet again (I had to create the drive again first), with the RAID volume unmounted. This time the process finished without complaints and things seem to be working as well as ever since then.
>>
>> [logs, etc. snipped...]
>
> No takers?

I've been intending to do so, but there's not much I can do based on the information you've supplied.

> Maybe someone who's done this (replacing a failed Vinum drive on hot-swap SCSI hardware) before can at least tell me whether:
>
> - I should have done some camcontrol magic before rebuilding the drive?

I can't see anything in particular you would need to do, but then I haven't seen the details.

> - Rebuilding the drive without unmounting the volume first was just asking for trouble?

There have been reports of this kind of problem, mainly from Vallo Kallaste, who has also responded. I haven't seen it myself, and I haven't heard of panics as a result. But yes, umounting is a good precaution.

> - -hackers or even -stable is a better venue for this kind of problem?

-questions will do fine.

Greg
Re: Strange crash, possibly vinum-related
On Monday, 17 March 2003 at 18:37:46 +0200, Vallo Kallaste wrote:
> On Mon, Mar 17, 2003 at 12:19:32PM +0000, Scott Mitchell wrote:
>> Thanks -- you've confirmed what I suspected, that I could have avoided the problems I saw by being a bit more cautious. My bad.
>>
>> Out of interest though, why do you advise not putting critical data on a Vinum R5 volume? This one has been running fine for ~2 years under reasonable loads. The disk failure was the first time it's required any attention at all, and it seems the problems I had with that were mostly of my own making. The mailing lists don't seem to be overrun with people complaining that 'Vinum ate my files' :-)
>
> Because the main purposes of RAID5 are to increase data redundancy _and_ data availability. As you have discovered, it runs until it fails and then you'll have a hard time recovering it. Recovery is the most important (and difficult) part of it.

Well, everybody else seems to manage fine. It's not difficult, just unreliable in your experience. And yes, I take your experience seriously, but it's not what most other people see.

Greg