Re: ZFS zpool replace problems
I'm removing the In-Reply-To mail headers for this thread, as you've now hijacked it for a different purpose. Please don't do this; start a new thread altogether. :-)

On Tue, Jan 26, 2010 at 02:57:20PM +0100, Gerrit Kühn wrote:
> I am still busy replacing RE2 disks with updated drives. I came across a very strange thing with zfs. I had the following pool layout:
>
> mclane# zpool status
>   pool: tank
>  state: ONLINE
>  scrub: none requested
> config:
>
>         NAME        STATE     READ WRITE CKSUM
>         tank        ONLINE       0     0     0
>           raidz1    ONLINE       0     0     0
>             ad8     ONLINE       0     0     0
>             ad10    ONLINE       0     0     0
>             ad12    ONLINE       0     0     0
>         spares
>           ad14      AVAIL
>
> errors: No known data errors
>
> All disks still have the firmware bug, so I want to replace them with disks that I already fixed. I put in an updated drive as ad18 and wanted to replace ad12 to get the drive with the broken firmware out:
>
> mclane# zpool replace tank /dev/ad12 /dev/ad18
> mclane# zpool status
>   pool: tank
>  state: ONLINE
> status: One or more devices is currently being resilvered. The pool will
>         continue to function, possibly in a degraded state.
> action: Wait for the resilver to complete.
>  scrub: resilver in progress for 0h0m, 0.01% done, 52h51m to go
> config:
>
>         NAME           STATE     READ WRITE CKSUM
>         tank           ONLINE       0     0     0
>           raidz1       ONLINE       0     0     0
>             ad8        ONLINE       0     0     0  7.21M resilvered
>             ad10       ONLINE       0     0     0  7.22M resilvered
>             replacing  ONLINE       0     0     0
>               ad12     ONLINE       0     0     0
>               ad18     ONLINE       0     0     0  10.7M resilvered
>         spares
>           ad14         AVAIL
>
> errors: No known data errors
>
> However, something must have gone wrong during the resilvering process and it now looks like this:
>
> mclane# zpool status
>   pool: tank
>  state: DEGRADED
> status: One or more devices has experienced an unrecoverable error. An
>         attempt was made to correct the error. Applications are unaffected.
> action: Determine if the device needs to be replaced, and clear the errors
>         using 'zpool clear' or replace the device with 'zpool replace'.
>    see: http://www.sun.com/msg/ZFS-8000-9P
>  scrub: resilver completed after 2h39m with 0 errors on Tue Jan 26 14:00:00 2010
> config:
>
>         NAME           STATE     READ WRITE CKSUM
>         tank           DEGRADED     0     0     0
>           raidz1       DEGRADED     0     0     0
>             ad8        ONLINE       0     0     0  975M resilvered
>             ad10       ONLINE       0     0   142  974M resilvered
>             replacing  DEGRADED     0 7.25M     0
>               ad12     ONLINE       0     0     0
>               ad18     REMOVED      0     1     0  79.4M resilvered
>         spares
>           ad14         AVAIL
>
> errors: No known data errors
>
> What is going on here? ad18 obviously detached during the process. /var/log/messages just gives me
>
> Jan 26 11:23:33 mclane kernel: ad18: FAILURE - device detached
>
> Additionally, ad10 obviously produced checksum errors. What do I do about the degraded replacing process? Can I terminate it somehow and maybe replace ad10 first? Any other hints?

I'm not sure how the above is supposed to work (I haven't personally tried it), but:

1) Why didn't you offline the ad10 disk first?
   zpool offline tank ad10

2) How did you attach ad18? Did you tell the system about it using atacontrol? If so, what commands did you use?

3) Can you please provide uname -a output, as well as relevant dmesg output to show what kind of SATA controller you have, what's attached to what, etc.?

--
| Jeremy Chadwick                                   j...@parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org
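Editorial sketch, not part of the original message: the information requested in point 3 could be gathered in one pass along the following lines. The helper names run and collect_report are hypothetical, and the script defaults to a dry run that only prints the commands. /var/run/dmesg.boot is where FreeBSD keeps the boot-time kernel messages; whether it contains the relevant controller lines on this particular machine is an assumption.

```shell
#!/bin/sh
# Sketch only. With DRY_RUN=1 (the default here) each command is printed
# instead of executed; set DRY_RUN=0 on the FreeBSD host to really run them.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Gather kernel version, boot messages, ATA channel layout, PCI controller
# details and the pool state, in that order.
collect_report() {
    run uname -a
    run cat /var/run/dmesg.boot
    run atacontrol list
    run pciconf -lv
    run zpool status -v
}
```

Calling collect_report with DRY_RUN=0 and redirecting its output to a file gives a single report that can be pasted into a reply.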
Re: ZFS zpool replace problems
On Tue, 26 Jan 2010 06:30:21 -0800 Jeremy Chadwick free...@jdc.parodius.com wrote about Re: ZFS zpool replace problems:

JC I'm removing the In-Reply-To mail headers for this thread, as you've
JC now hijacked it for a different purpose. Please don't do this; start
JC a new thread altogether. :-)

Thanks. You're perfectly right, I should have done that.

JC I'm not sure how the above is supposed to work (I haven't personally
JC tried it), but:
JC
JC 1) Why didn't you offline the ad10 disk first?
JC    zpool offline tank ad10

Well, probably because I thought that zfs would simply handle the situation. I just wanted to replace drive A with drive B, so this seemed quite straightforward to me.

JC 2) How did you attach ad18? Did you tell the system about it using
JC    atacontrol? If so, what commands did you use?

Yes. The drives did not appear automatically (verified with atacontrol list). I first tried reinit ata9, but that did not work, so I did a detach/attach for ata9; then the drive was there (it showed up in list and the device node appeared).

JC 3) Can you please provide uname -a output, as well as relevant dmesg
JC    output to show what kind of SATA controller you have, what's
JC    attached to what, etc.?

Of course (dmesg is not there anymore, so I use pciconf -vl and atacontrol instead):

ATA channel 0:
    Master:      no device present
    Slave:  acd0 Optiarc DVD RW AD-7540A/1.01 ATA/ATAPI revision 0
ATA channel 1:
    Master:      no device present
    Slave:       no device present
ATA channel 2:
    Master:  ad4 ST380815AS/3.AAC SATA revision 2.x
    Slave:       no device present
ATA channel 3:
    Master:  ad6 ST380815AS/3.AAC SATA revision 2.x
    Slave:       no device present
ATA channel 4:
    Master:  ad8 WDC WD1000FYPS-01ZKB0/02.01B01 SATA revision 2.x
    Slave:       no device present
ATA channel 5:
    Master: ad10 WDC WD1000FYPS-01ZKB0/02.01B01 SATA revision 2.x
    Slave:       no device present
ATA channel 6:
    Master: ad12 WDC WD1000FYPS-01ZKB0/02.01B01 SATA revision 2.x
    Slave:       no device present
ATA channel 7:
    Master: ad14 WDC WD1000FYPS-01ZKB0/02.01B01 SATA revision 2.x
    Slave:       no device present
ATA channel 8:
    Master:      no device present
    Slave:       no device present
ATA channel 9:
    Master:      no device present
    Slave:       no device present

FreeBSD mclane.rt.aei.uni-hannover.de 7.2-STABLE FreeBSD 7.2-STABLE #0: Mon Sep 7 11:01:56 CEST 2009 r...@mclane.rt.aei.uni-hannover.de:/usr/obj/usr/src/sys/MCLANE.72 amd64

The first six drives (up to ad14) are connected onboard (Supermicro dual Opteron board with MCP55):

atap...@pci0:0:5:0: class=0x010485 card=0x161115d9 chip=0x037f10de rev=0xa3 hdr=0x00
    vendor   = 'Nvidia Corp'
    device   = 'MCP55 SATA/RAID Controller (MCP55S)'
    class    = mass storage
    subclass = RAID
atap...@pci0:0:5:1: class=0x010485 card=0x161115d9 chip=0x037f10de rev=0xa3 hdr=0x00
    vendor   = 'Nvidia Corp'
    device   = 'MCP55 SATA/RAID Controller (MCP55S)'
    class    = mass storage
    subclass = RAID
atap...@pci0:0:5:2: class=0x010485 card=0x161115d9 chip=0x037f10de rev=0xa3 hdr=0x00
    vendor   = 'Nvidia Corp'
    device   = 'MCP55 SATA/RAID Controller (MCP55S)'
    class    = mass storage
    subclass = RAID

The other two (ad16 and ad18; the chassis has 8 slots and the last two were only intended to be used in situations like the one I have now) are connected to an extra PCI card:

atap...@pci0:3:6:0: class=0x010401 card=0x02409005 chip=0x02401095 rev=0x02 hdr=0x00
    vendor   = 'Silicon Image Inc (Was: CMD Technology Inc)'
    device   = 'SATA/Raid controller(2XSATA150) (SIL3112)'
    class    = mass storage
    subclass = RAID

Meanwhile I took out the ad18 drive again and tried to use a different drive. But that was listed as UNAVAIL with corrupted data by zfs. Probably it already branded the disk for resilvering and is looking for exactly this one now. I also put the disk which caused the problem above back in. The resilvering process started again, but very soon the drive got detached again, resulting in the same situation I described above.

Any help is greatly appreciated.

cu
Gerrit
Re: ZFS zpool replace problems
Hi--

On Jan 26, 2010, at 7:03 AM, Gerrit Kühn wrote:
[ ... ]
> atap...@pci0:3:6:0: class=0x010401 card=0x02409005 chip=0x02401095 rev=0x02 hdr=0x00
>     vendor   = 'Silicon Image Inc (Was: CMD Technology Inc)'
>     device   = 'SATA/Raid controller(2XSATA150) (SIL3112)'
>     class    = mass storage
>     subclass = RAID
>
> Meanwhile I took out the ad18 drive again and tried to use a different drive. But that was listed as UNAVAIL with corrupted data by zfs.

There's your problem-- the Silicon Image 3112/4 chips are remarkably buggy and exhibit data corruption:

http://unix.derkeiler.com/Mailing-Lists/FreeBSD/stable/2005-08/0208.html

Regards,
--
-Chuck
Re: ZFS zpool replace problems
On Tue, 26 Jan 2010 08:15:27 -0800 Chuck Swiger cswi...@mac.com wrote about Re: ZFS zpool replace problems:

CS > Meanwhile I took out the ad18 drive again and tried to use a
CS > different drive. But that was listed as UNAVAIL with corrupted
CS > data by zfs.

CS There's your problem-- the Silicon Image 3112/4 chips are remarkably
CS buggy and exhibit data corruption:

Hm, sure? I would expect the same behaviour (detaching) as with the first drive if the controller was the reason in this case.

CS http://unix.derkeiler.com/Mailing-Lists/FreeBSD/stable/2005-08/0208.html

I already thought about replacing the controller to get rid of the detach problem. However, I cannot do this online and I really would prefer fixing the disk firmware problem first. I could remove the hot spare drive ad14 and use this slot for putting in a replacement disk. Is it possible to get ad18 out of zfs' replacing process? Maybe by detaching the disk from the pool?

cu
Gerrit
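Editorial sketch, not part of the original message and not tested on this 7.2-STABLE system: the idea floated above (take ad18 out of the replacing vdev, free the hot spare slot, and restart the replacement there) might look roughly like this. zpool detach, zpool clear, zpool remove and zpool replace are real subcommands, but whether detach cleanly cancels this particular degraded replacement is an assumption. The function name cancel_and_retry and the run helper are hypothetical, and the script defaults to a dry run.

```shell
#!/bin/sh
# Sketch only. With DRY_RUN=1 (the default here) each command is printed
# instead of executed.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# $1 = pool, $2 = disk being replaced, $3 = failed new disk, $4 = hot spare
cancel_and_retry() {
    pool=$1
    old=$2
    failed_new=$3
    spare=$4
    # Assumption: detaching the new half of the "replacing" vdev cancels
    # the replacement and leaves the old disk in place.
    run zpool detach "$pool" "$failed_new"
    run zpool clear "$pool"
    # Release the hot spare so its slot can hold the replacement disk.
    run zpool remove "$pool" "$spare"
    echo "# manual step: swap the disk in the $spare slot for the updated drive"
    run zpool replace "$pool" "$old" "$spare"
    run zpool status "$pool"
}
```

With the names from this thread the call would be cancel_and_retry tank ad12 ad18 ad14. Checking the dry-run output before setting DRY_RUN=0 is the point of the helper.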
Re: ZFS zpool replace problems
On Tue, Jan 26, 2010 at 08:15:27AM -0800, Chuck Swiger wrote:
> Hi--
>
> On Jan 26, 2010, at 7:03 AM, Gerrit Kühn wrote:
> [ ... ]
> > atap...@pci0:3:6:0: class=0x010401 card=0x02409005 chip=0x02401095 rev=0x02 hdr=0x00
> >     vendor   = 'Silicon Image Inc (Was: CMD Technology Inc)'
> >     device   = 'SATA/Raid controller(2XSATA150) (SIL3112)'
> >     class    = mass storage
> >     subclass = RAID
> >
> > Meanwhile I took out the ad18 drive again and tried to use a different drive. But that was listed as UNAVAIL with corrupted data by zfs.
>
> There's your problem-- the Silicon Image 3112/4 chips are remarkably buggy and exhibit data corruption:
>
> http://unix.derkeiler.com/Mailing-Lists/FreeBSD/stable/2005-08/0208.html

Well, to be fair, we can't be 100% certain he got bit by that bug. It's possible/likely, but we don't know for certain at this point. We also don't know what brand hard disks he had connected to ad16 and/or ad18.

Older Silicon Image controllers are known for... well, just read the Wikipedia entry for details.

http://en.wikipedia.org/wiki/Silicon_Image_Inc.#Product_alerts

I don't have any experience with their newer models, but I'm told they're significantly improved (throughput and reliability-wise).

But it is amusing, almost ironic, how Silicon Image bought CMD -- the same company who was infamous for their CMD640 IDE controller causing data corruption... back in 1995.

As others have stated already: Intel could make a fortune off of a simple PCIe or PCI-X SATA controller card that's ICH9/ICH10-based. I guess there's more money in forcing people to buy motherboards with said southbridge.

--
| Jeremy Chadwick                                   j...@parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |
Re: ZFS zpool replace problems
On Tue, 26 Jan 2010 08:27:37 -0800 Jeremy Chadwick free...@jdc.parodius.com wrote about Re: ZFS zpool replace problems:

JC Well, to be fair, we can't be 100% certain he got bit by that bug.
JC It's possible/likely, but we don't know for certain at this point. We
JC also don't know what brand hard disks he had connected to ad16 and/or
JC ad18.

The same as on the others (WD RE2GP), just with the updated firmware (02.01B02, that is) to get rid of the lcc problem.

JC Older Silicon Image controllers are known for... well, just read the
JC Wikipedia entry for details.
JC http://en.wikipedia.org/wiki/Silicon_Image_Inc.#Product_alerts

I knew the card is not top of the line, but I didn't know that it is /that/ bad. When I set up the system 1 or 2 years ago, I just thought it might be nice to be able to use the two extra slots in case any drives had to be replaced, and the card was just lying around (well, maybe I have an idea now why nobody else wanted to use it :-).

I guess I will try to offline the hot spare slot (connected to the MCP55 on the motherboard) and plug the replacement disk in there. Maybe zfs recognizes it and picks up the resilvering there. Otherwise I'll have to look into how to get rid of the degraded resilvering process and restart it with the drive in the other slot.

JC As others have stated already: Intel could make a fortune off of a
JC simple PCIe or PCI-X SATA controller card that's ICH9/ICH10-based.

Indeed. I use these 8-channel Supermicro controllers (I think I recommended them here some time ago) with an LSI chipset, and they work really nicely. But the bracket does not fit into standard slots and there is no PCI-X version. I would certainly prefer a regular card by Intel.

cu
Gerrit
Re: ZFS zpool replace problems
On Tue, Jan 26, 2010 at 04:03:20PM +0100, Gerrit Kühn wrote:
> On Tue, 26 Jan 2010 06:30:21 -0800 Jeremy Chadwick free...@jdc.parodius.com wrote about Re: ZFS zpool replace problems:
>
> JC 2) How did you attach ad18? Did you tell the system about it using
> JC    atacontrol? If so, what commands did you use?
>
> Yes. The drives did not appear automatically (verified with atacontrol list). Then I first tried reinit ata9, but that did not work out, so I did a detach/attach for ata9, then the drive was there (with list and also the device node appeared).

The procedure -- at least on Intel controllers in AHCI mode -- is:

- zpool offline pool disk
- atacontrol detach ataX (where X = channel associated with disk)
- Physically remove bad disk
- Physically insert new disk
- Wait 15 seconds for stuff to settle
- atacontrol attach ataX (where X = previous channel detached)
- zpool replace pool disk
- zpool online pool disk

reinit shouldn't be needed at all -- in fact, I've seen reinit cause some craziness (even on Intel controllers), including a system deadlock, but this was back during the RELENG_6 and RELENG_7 days. Great improvements have been made to ata(4) since then.

If you need me to validate the above procedure (it's been a while since I've had to hot-swap a disk), I can do so. I do have a 4-disk Supermicro SuperServer 5015B-MTB (ICH9-based) sitting on my workbench which I can test with.

> Meanwhile I took out the ad18 drive again and tried to use a different drive. But that was listed as UNAVAIL with corrupted data by zfs. Probably it already branded the disk for resilvering and is looking for exactly this one now. I also put in the disk which caused the problem above again. The resilvering process started again, but very soon the drive got detached again resulting in the same situation I described above.

It honestly sounds like hot-swapping is causing some chaos on your system. Are all of the controllers involved configured for AHCI?

If not, physical removal/insertion should be done only when the system power is off. If so, mav@ or others may be able to help figure out what's going on in the underlying ata(4) layer.

--
| Jeremy Chadwick                                   j...@parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |
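Editorial sketch, not part of the original message: the eight-step hot-swap procedure listed in the message above can be written down as a dry-run script so the exact command order is visible before anything touches the pool. The commands are taken verbatim from the list; the function name hot_swap and the run helper are hypothetical, and the author of the procedure notes it had not been re-validated recently.

```shell
#!/bin/sh
# Sketch only. With DRY_RUN=1 (the default here) each command is printed
# instead of executed.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# $1 = pool, $2 = disk (e.g. ad12), $3 = ATA channel (e.g. ata6)
hot_swap() {
    pool=$1
    disk=$2
    channel=$3
    run zpool offline "$pool" "$disk"
    run atacontrol detach "$channel"
    echo "# manual step: remove the old disk, insert the new one"
    run sleep 15                      # let the hardware settle
    run atacontrol attach "$channel"
    run zpool replace "$pool" "$disk"
    run zpool online "$pool" "$disk"
}
```

For the pool in this thread, hot_swap tank ad12 ata6 would print the sequence for the disk on ATA channel 6. In a real run the manual step needs a pause for the operator, which this sketch does not implement.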
Re: ZFS zpool replace problems
On Tue, 26 Jan 2010 08:46:19 -0800 Jeremy Chadwick free...@jdc.parodius.com wrote about Re: ZFS zpool replace problems:

JC - zpool offline pool disk
JC - atacontrol detach ataX (where X = channel associated with disk)
JC - Physically remove bad disk
JC - Physically insert new disk
JC - Wait 15 seconds for stuff to settle
JC - atacontrol attach ataX (where X = previous channel detached)
JC - zpool replace pool disk
JC - zpool online pool disk

JC reinit shouldn't be needed at all -- in fact, I've seen reinit cause
JC some craziness (even on Intel controllers), including a system
JC deadlock, but this was back during the RELENG_6 and RELENG_7 days.
JC Great improvements have been made to ata(4) since then.

Thanks for pointing that out. I would have gone exactly this way if I did not have the extra slots or if one of the drives was actually faulty. But in this case I just wanted to replace every drive one-by-one and (at least I thought) I had extra slots, so I did not want to give up the redundancy during the replacement (knowing very well that the drives to be replaced are already beyond WD's specification due to the load-cycle bug).

JC If you need me to validate the above procedure (it's been a while since
JC I've had to hot-swap a disk), I can do so. I do have a 4-disk
JC Supermicro SuperServer 5015B-MTB (ICH9-based) sitting on my workbench
JC which I can test with.

I'm quite sure this will work fine. I just don't know how to get rid of the degraded replacement zfs sees.

JC It honestly sounds like hot-swapping is causing some chaos on your
JC system. Are all of the controllers involved configured for AHCI?

I think so. How could I verify this?

cu
Gerrit
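Editorial sketch, not part of the original message: one rough way to approach the AHCI question is the class= field that pciconf -lv prints for each controller. In the PCI class code, base class 0x01 is mass storage, subclass 0x06 is SATA, and programming interface 0x01 within that subclass is AHCI, so an AHCI controller reports class=0x010601. The function name classify_pci_class is hypothetical, and the class code is only a hint about how the controller presents itself, not proof of what the driver is doing.

```shell
#!/bin/sh
# Sketch: interpret the "class=0x......" value from pciconf -lv.
# Layout is 0xBBSSPP: base class, subclass, programming interface.
classify_pci_class() {
    case "$1" in
        0x010601) echo "SATA controller, AHCI interface" ;;
        0x0106??) echo "SATA controller, non-AHCI interface" ;;
        0x0104??) echo "RAID-class controller, not presenting as AHCI" ;;
        0x0101??) echo "IDE-class controller, not presenting as AHCI" ;;
        0x01????) echo "other mass-storage controller" ;;
        *)        echo "not a mass-storage class code" ;;
    esac
}
```

Applied to the values posted earlier in the thread, classify_pci_class 0x010485 (the MCP55 ports) and classify_pci_class 0x010401 (the SiI3112 card) both fall into the RAID-class branch, which suggests neither controller is presenting an AHCI interface.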
Re: ZFS zpool replace problems
On Tue, 26 Jan 2010 08:59:27 -0800 Chuck Swiger cswi...@mac.com wrote about Re: ZFS zpool replace problems:

CS As a general matter of maintaining RAID systems, however, the approach
CS to upgrading drive firmware on members of a RAID array should be to
CS take down the entire container and offline the drives, update one
CS drive, test it (via SMART self-test and read-only checksum comparison
CS or similar), and then proceed to update all of the drives (preferably
CS doing the SMART self-test for each, if time allows) before returning
CS them to the RAID container and onlining them.

Well, I had several spare drives sitting on the shelf. So I updated the firmware of these spare drives and now want to replace the drives with the old firmware by the new ones, one-by-one. Taking the system offline for longer than a few minutes is not really an option. I'd rather roll in a new machine to take over the job in that case.

CS Pulling individual drives from a RAID set while live and updating the
CS firmware one at a time is not an approach I would take-- running with
CS mixed firmware versions doesn't thrill me, and I know of multiple
CS cases where someone made a mistake reconnecting a drive with the wrong
CS SCSI id or something like that, taking out a second drive while the
CS RAID was not redundant, resulting in massive data corruption or even
CS total loss of the RAID contents.

This scenario was exactly the reason why I plugged the new drive into an extra slot and asked zfs to replace an old one with it. Well, I did not know what kind of fiasco the controller for this extra slot would turn out to be; otherwise I would have used the hot spare slot for this in the first place.

cu
Gerrit
Re: ZFS zpool replace problems
Hi--

On Jan 26, 2010, at 8:25 AM, Gerrit Kühn wrote:
> CS There's your problem-- the Silicon Image 3112/4 chips are remarkably
> CS buggy and exhibit data corruption:
>
> Hm, sure?

I'm sure that the SII 3112 is buggy. I am not sure that it is the primary or only cause of the problems you describe.

[ ... ]
> I already thought about replacing the controller to get rid of the detach-problem. However, I cannot do this online and I really would prefer fixing the disk firmware problem first. I could remove the hotspare drive ad14 and use this slot for putting in a replacement disk. Is it possible to get ad18 out of zfs' replacing process? Maybe by detaching the disk from the pool?

I don't know enough about ZFS to provide specific advice for recovery attempts (aside from the notion of restoring your data from a backup instead).

As a general matter of maintaining RAID systems, however, the approach to upgrading drive firmware on members of a RAID array should be to take down the entire container and offline the drives, update one drive, test it (via SMART self-test and read-only checksum comparison or similar), and then proceed to update all of the drives (preferably doing the SMART self-test for each, if time allows) before returning them to the RAID container and onlining them.

Pulling individual drives from a RAID set while live and updating the firmware one at a time is not an approach I would take-- running with mixed firmware versions doesn't thrill me, and I know of multiple cases where someone made a mistake reconnecting a drive with the wrong SCSI id or something like that, taking out a second drive while the RAID was not redundant, resulting in massive data corruption or even total loss of the RAID contents.

Regards,
--
-Chuck
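Editorial sketch, not part of the original message: the SMART self-test step recommended above could be scripted roughly as follows. It assumes smartctl from the smartmontools package is installed (it is not part of the FreeBSD base system), and it covers only the self-test part; the read-only checksum comparison is left out. The function name smart_check and the run helper are hypothetical, and the script defaults to a dry run.

```shell
#!/bin/sh
# Sketch only. With DRY_RUN=1 (the default here) each command is printed
# instead of executed.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# $1 = device node of a freshly updated drive, e.g. /dev/ad18
smart_check() {
    dev=$1
    run smartctl -t long "$dev"        # start an extended self-test
    echo "# manual step: wait until the drive reports the test as finished"
    run smartctl -l selftest "$dev"    # show the self-test log
    run smartctl -A "$dev"             # vendor attributes, incl. load cycle count
}
```

The attribute listing is included because the firmware update in this thread targets the load-cycle behaviour, so comparing the load cycle count before and after a few hours of use is one way to see whether the update took effect.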