Re: Storage 'failover' largely kills FreeBSD 10.x under XenServer?
> Hi All,
>
> We recently experienced an "unplanned storage" fail over on our XenServer
> pool. The pool is 7.1 based (on certified HP kit), and runs a mix of
> FreeBSD (all 10.3 based except for a legacy 9.x VM) - and a few Windows
> VMs - storage is provided by two Citrix certified Synology storage boxes.
>
> During the fail over - Xen sees the storage paths go down, and come up
> again (re-attaching when they are available again). Timing this - it takes
> around a minute, worst case.
>
> The process killed 99% of our FreeBSD VMs :(
>
> The earlier 9.x FreeBSD box survived, and all the Windows VMs survived.
>
> Is there some 'tuneable' we can set to make the 10.3 boxes more tolerant of
> the I/O delays that occur during a storage fail over?
>
> I've enclosed some of the errors we observed below. I realise a full storage
> fail over is a 'stressful time' for VMs - but the Windows VMs, and
> earlier FreeBSD version survived without issue. All the 10.3 boxes logged
> I/O errors, and then panic'd / rebooted.
>
> We've set up a test lab with the same kit - and can now replicate this at
> will (every time most to all the FreeBSD 10.x boxes panic and reboot, but
> Windows prevails) - so we can test any potential fixes.
>
> So if anyone can suggest anything we can tweak to minimize the chances of
> this happening (i.e. make I/O more timeout tolerant, or set larger
> timeouts?) that'd be great.

As you found one of these, let me point out the pair of them:

kern.cam.ada.default_timeout: 30
kern.cam.ada.retry_count: 4

Rather than increasing default_timeout you might try increasing
retry_count. Though it would seem that the default settings should
have allowed for a 2 minute failure window, it may be that these are
not working as I expect in this situation.

...
>
> Errors we observed:
>
> ada0: disk error cmd=write 11339752-11339767 status:
> ada0: disk error cmd=write

Did you actually get this 4 times, then it fell through to the next error?
There should be some retry counts in here some place counting up to 4,
then cam/ada should give up and pass the error up the stack.

> g_vfs_done():11340544-11340607gpt/root[WRITE(offset=4731097088,
> length=8192)] status: error = 5
> (repeated a couple of times with different values)
>
> Machine then goes on to panic:

Ah, okay, so it is repeating. These messages should be 30 seconds apart,
and there should be exactly 4 of them before you get the panic. If that is
the case, try cranking kern.cam.ada.retry_count up and see if that
resolves your issue.

> g_vfs_done():panic: softdep_setup_freeblocks: inode busy
> cpuid = 0
> KDB: stack backtrace:
> #0 0x8098e810 at kdb_backtrace+0x60
> #1 0x809514e6 at vpanic+0x126
> #2 0x809513b3 at panic+0x43
> #3 0x80b9c685 at softdep_setup_freeblocks+0xaf5
> #4 0x80b86bae at ffs_truncate+0x44e
> #5 0x80bbec49 at ufs_setattr+0x769
> #6 0x80e81891 at VOP_SETATTR_APV+0xa1
> #7 0x80a053c5 at vn_truncate+0x165
> #8 0x809ff236 at kern_openat+0x326
> #9 0x80d56e6f at amd64_syscall+0x40f
> #10 0x80d3c0cb at Xfast_syscall+0xfb
>
>
> Another box also logged:
>
> ada0: disk error cmd=read 9970080-9970082 status:
> g_vfs_done():gpt/root[READ(offset=4029825024, length=1536)]error = 5
> vnode_pager_getpages: I/O read error
> vm_fault: pager read error, pid 24219 (make)
>
> And again, went on to panic shortly thereafter.
>
> I had to hand transcribe the above from screen shots / video, so apologies
> if any errors crept in.
>
> I'm hoping there's just a magic sysctl / kernel option we can set to up the
> timeouts? (if it is as simple as timeouts killing things)

Yes, FreeBSD does not live long when its disk drive goes away...
2.5 minutes to panic in almost all cases of a drive failure.

--
Rod Grimes rgri...@freebsd.org
___
freebsd-xen@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-xen
To unsubscribe, send any mail to "freebsd-xen-unsubscr...@freebsd.org"
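The "2 minute" and "2.5 minutes" figures in the message above both fall out of multiplying the timeout by the number of attempts. A minimal sketch of that arithmetic follows. The helper name `failure_window` is hypothetical, and it assumes every attempt waits the full timeout. Whether the initial attempt counts in addition to the retries is not confirmed in this thread, so both readings are printed.

```sh
#!/bin/sh
# Hypothetical helper (not part of FreeBSD): estimate how long cam/ada keeps
# trying before an I/O error is passed up the stack.
# Assumption: each attempt blocks for the full timeout before failing.
failure_window() {
    timeout=$1   # kern.cam.ada.default_timeout, in seconds
    retries=$2   # kern.cam.ada.retry_count
    # retries_only: timeout * retries
    # with_initial: timeout * (initial attempt + retries)
    echo "retries_only=$((timeout * retries)) with_initial=$((timeout * (retries + 1)))"
}

# Defaults quoted in the thread: 30 second timeout, 4 retries.
failure_window 30 4
# prints: retries_only=120 with_initial=150
```

With the defaults this gives 120 or 150 seconds, either of which is longer than the roughly one minute the fail over was timed at. That would fit the suspicion that the settings are not behaving as expected here.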
Re: Storage 'failover' largely kills FreeBSD 10.x under XenServer?
Karl Pielorz wrote on 2017/09/20 16:54:

> --On 20 September 2017 at 12:44:18 +0100 Roger Pau Monné wrote:
>
>>> Is there some 'tuneable' we can set to make the 10.3 boxes more tolerant of
>>> the I/O delays that occur during a storage fail over?
>>
>> Do you know whether the VMs saw the disks disconnecting and then
>> connecting again?
>
> I can't see any evidence the drives actually get 'disconnected' from the
> VM's point of view. Plenty of I/O errors - but no "device destroyed" type
> stuff.
>
> I have seen that kind of error logged on our test kit - when we deliberately
> failed non-HA storage, but I don't see it this time.
>
>> Hm, I have the feeling that part of the problem is that in-flight
>> requests are basically lost when a disconnect/reconnect happens.
>
> So if a disconnect doesn't happen (as it appears it doesn't) - is there any
> tunable to set the I/O timeout?
>
> 'sysctl -a | grep timeout' finds things like:
>
> kern.cam.ada.default_timeout=30

Yes, you can try to set kern.cam.ada.default_timeout to 60 or more, but
it can have downsides too.

Miroslav Lachman
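One way to try the two tunables mentioned in this thread is to generate the settings in sysctl.conf form and apply them on a test VM. This is only a sketch: the helper name `emit_cam_tunables` is hypothetical, the value 60 comes from the suggestion above, and the retry count of 8 is an arbitrary illustrative value, not a recommendation from anyone in the thread.

```sh
#!/bin/sh
# Hypothetical helper: print the two CAM ada tunables discussed in this
# thread as name=value lines, the format used by /etc/sysctl.conf.
emit_cam_tunables() {
    # $1: timeout in seconds, $2: retry count
    printf 'kern.cam.ada.default_timeout=%s\n' "$1"
    printf 'kern.cam.ada.retry_count=%s\n' "$2"
}

# 60 is the value suggested above; 8 is an illustrative assumption.
emit_cam_tunables 60 8
```

On a FreeBSD guest the output could be appended to /etc/sysctl.conf so it persists across reboots, or a single value could be tried immediately with `sysctl kern.cam.ada.default_timeout=60`, assuming the OID is writable at runtime on that release. Since the failure can be reproduced at will in the test lab, each change can be checked there first.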
Re: Storage 'failover' largely kills FreeBSD 10.x under XenServer?
--On 20 September 2017 at 12:44:18 +0100 Roger Pau Monné wrote:

>> Is there some 'tuneable' we can set to make the 10.3 boxes more tolerant of
>> the I/O delays that occur during a storage fail over?
>
> Do you know whether the VMs saw the disks disconnecting and then
> connecting again?

I can't see any evidence the drives actually get 'disconnected' from the
VM's point of view. Plenty of I/O errors - but no "device destroyed" type
stuff.

I have seen that kind of error logged on our test kit - when we deliberately
failed non-HA storage, but I don't see it this time.

> Hm, I have the feeling that part of the problem is that in-flight
> requests are basically lost when a disconnect/reconnect happens.

So if a disconnect doesn't happen (as it appears it doesn't) - is there any
tunable to set the I/O timeout?

'sysctl -a | grep timeout' finds things like:

kern.cam.ada.default_timeout=30

I might see if that has any effect (from memory - as I'm out of the office
now - it did seem to be about 30 seconds before the VMs started logging
I/O related errors to the console).

As it's a pure test setup - I can try adjusting this without fear of
breaking anything :)

Though I'm open to other suggestions...

fwiw - Whose responsibility is it to re-send lost "in flight" data, e.g. if
a write is 'in flight' when an I/O error occurs in the lower layers of
XenServer, is it XenServer's responsibility to retry that before giving up,
or does it just push the error straight back to the VM, expecting the VM
to retry it? [or a bit of both?] - just curious.

-Karl
Re: Storage 'failover' largely kills FreeBSD 10.x under XenServer?
On Wed, Sep 20, 2017 at 11:35:26AM +0100, Karl Pielorz wrote:
>
> Hi All,
>
> We recently experienced an "unplanned storage" fail over on our XenServer
> pool. The pool is 7.1 based (on certified HP kit), and runs a mix of FreeBSD
> (all 10.3 based except for a legacy 9.x VM) - and a few Windows VMs -
> storage is provided by two Citrix certified Synology storage boxes.
>
> During the fail over - Xen sees the storage paths go down, and come up
> again (re-attaching when they are available again). Timing this - it takes
> around a minute, worst case.
>
> The process killed 99% of our FreeBSD VMs :(
>
> The earlier 9.x FreeBSD box survived, and all the Windows VMs survived.
>
> Is there some 'tuneable' we can set to make the 10.3 boxes more tolerant of
> the I/O delays that occur during a storage fail over?

Do you know whether the VMs saw the disks disconnecting and then
connecting again?

> I've enclosed some of the errors we observed below. I realise a full storage
> fail over is a 'stressful time' for VMs - but the Windows VMs, and earlier
> FreeBSD version survived without issue. All the 10.3 boxes logged I/O
> errors, and then panic'd / rebooted.
>
> We've set up a test lab with the same kit - and can now replicate this at
> will (every time most to all the FreeBSD 10.x boxes panic and reboot, but
> Windows prevails) - so we can test any potential fixes.
>
> So if anyone can suggest anything we can tweak to minimize the chances of
> this happening (i.e. make I/O more timeout tolerant, or set larger
> timeouts?) that'd be great.

Hm, I have the feeling that part of the problem is that in-flight
requests are basically lost when a disconnect/reconnect happens.

Thanks, Roger.