Was thinking along the same lines as the last resort... If you can pin down the problem as a HostOS problem, shouldn't your Xen guests (with the Docker containers inside them) be easily migratable to a newly built HostOS? I don't know what else is running on your system, but it's typically a recommended best practice to keep the HostOS in a multi-tenant system as simple and uncomplicated as possible.
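For the record, a rough sketch of what that move could look like with the xl toolstack. The domain and host names below are made-up examples, and live migration additionally assumes the guest's disk backend is reachable from both dom0s (it's a dry-run sketch, not a tested recipe):

```shell
#!/bin/sh
# Dry-run sketch of moving a Xen PV guest to a rebuilt dom0.
# "myguest" and "newdom0" are example names, not from this thread.
guest=myguest
newhost=newdom0

run() { echo "+ $*"; }   # print the commands instead of executing them

# Option 1: live migrate (needs xl migration enabled on both dom0s
# and the guest's disks visible from both, e.g. shared storage):
run xl migrate "$guest" "$newhost"

# Option 2: cold move - shut down, copy the config (and disk image,
# if file-backed), then recreate on the new host:
run xl shutdown -w "$guest"
run scp "/etc/xen/$guest.cfg" "root@$newhost:/etc/xen/"
run ssh "root@$newhost" xl create "/etc/xen/$guest.cfg"
```

Drop the `run` wrapper once the names and storage layout match your setup.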
And a general observation: if this machine was originally installed as 13.1 and has survived every openSUSE version upgrade from then to now, that's quite an accomplishment, IMO.

Tony

On Wed, Jan 8, 2020 at 9:01 AM Glen <[email protected]> wrote:
>
> Hi everyone -
>
> So my two threads really turn out to be one thread, and I'm replying
> to both here. I apologize for the mess.
>
> First, I really appreciate all the responses and pointers; I'm very
> grateful for all the help. THANK YOU to all those who responded on
> either thread!
>
> To recap, essentially what I have is:
>
> 1. I've always run a bunch of high-traffic 42.3 Xen PV guests on 42.3 hosts.
> 2. I upgraded the 42.3 hosts to 15.1, via online upgrade and then via
> fresh load.
> 3. When I did that, *one* of my 42.3 guests started hanging at random
> every 2-7 days.
>
> The hangs seemed to be related to high network and/or disk traffic. I
> discovered by accident that if I did an "xl trigger nmi" I could
> "unhang" the guest and make it resume duty, more or less, without a
> reboot, but I have no idea why the hang occurs or why it's recoverable
> in that way.
>
> Chasing this down has been painful. It initially looked like sshd was
> the culprit, but it wasn't. I thought the kernel mismatch might be
> the issue, but other 42.3 guests run on their 15.1 hosts without a
> problem, and upgrading the guest to 15.1 didn't solve it. Olaf has
> been pointing me to new kernels, and that helped somewhat - moving to
> the SLE kernel extended the guest uptime from a few days to a few
> weeks (buying me much needed sleep, thank you!) but I still don't
> have a solution.
>
> The problem seems to be in this particular guest... somewhere. The
> problem seems to travel with the guest: if I clone the guest and
> bring it up elsewhere, that clone also has the problem. So I've
> resorted to making a copy of the guest and staging it on a different
> host just so I can stress-test it.
>
> To stress-test it, I basically initiate lots of high-traffic requests
> against the troubled guest from an outside source. Initially, the
> guest was hanging during a single full outbound rsync. To prevent SSD
> wear I modified the command and used a hack to simulate the traffic.
> If I boot the guest, and, from a different connected machine, do stuff
> like:
>
> nohup ssh 192.168.1.11 tar cf - --one-file-system /a | cat > /dev/null &
> nohup ssh 192.168.1.11 cat /dev/zero | cat > /dev/null &
>
> (where 1.11 is the troubled guest, and /a is a 4TB filesystem full of
> data) I can make the troubled guest hang in somewhere between 45
> minutes and 12 hours.
>
> Thanks to Olaf, Jan, Tony and Fajar, I've been able to try a number of
> things, but so far, I've had no luck:
>
> Upgrading openssh to the latest version did not solve it.
> Upgrading the guest to 15.1 (unifying the kernels) did not solve it.
> Upgrading the 42.3 guest kernel to a different version helped... but
> did not solve it.
> Removing some possibly problematic kernel modules did not solve it.
> Removing Docker did not solve it.
> I had optimizations from the Xen best practices page in
> /etc/sysctl.conf for just this guest - removing those did not solve
> it.
>
> The only solution I've found seems to be starting the guest over
> fresh. If I do a fresh load of 15.1 as a guest, mount that same /a
> filesystem, and run those same tests... the freshly-loaded guest works
> fine... it's rock solid. I had those same tests running against the
> freshly-loaded guest for over 24 hours and it did just fine. I can
> literally just swap out root filesystem images - booting the troubled
> guest's root filesystem results in the hangs; booting the fresh load
> seems completely reliable.
>
> In short, it seems to me now that there's something in this particular
> guest's root filesystem image... something I can't find... that is
> causing this.
> The image started as a 13.1 (thirteen point one) fresh
> load (years ago) and has been in-place upgraded ever since... so I am
> concluding that something bad has been brought forward that I'm not
> aware of.
>
> It seems at this point that I just need to rebuild the guest as a
> fresh 15.1 load, reinstalling only what I currently need, and going
> from there, and so that's what I'm going to do. The usual absence of
> useful log data when the machine crashes is frustrating, and the time
> it takes (1-12 hours) to make a test machine crash makes the test
> process slow, so I'm feeling like I should just abandon this and
> replace the guest.
>
> If any of this triggers anything for anyone, please let me know.
> Otherwise, I'm continuing to stress-test my freshly-loaded guest for a
> few more days (just to be sure) and then I'll start the reconnect and
> replacement process. It really would have been nice to find out what
> on that troubled guest was causing the issue, but it's probably some
> legacy thing brought forward that's causing instability, and since
> each test cycle takes so long, the "process of elimination" could take
> months or more.
>
> And of course as soon as I send this, something new will break, making
> this all invalid. :-)
>
> Anyway, THANK YOU ALL for your support and help here, I am very grateful!
>
> Glen
> --
> To unsubscribe, e-mail: [email protected]
> To contact the owner, e-mail: [email protected]
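One cheap starting point for that "process of elimination" which doesn't need a crash cycle at all: since the troubled and fresh root images can be swapped freely, mount both read-only and diff the configuration trees. This is only a sketch; the mount points and function name below are hypothetical, not from the thread:

```shell
#!/bin/sh
# Hedged sketch: diff the /etc trees of two mounted root filesystems
# (e.g. the troubled guest image vs. a fresh 15.1 load) to surface
# configuration carried forward across years of in-place upgrades.
# The mount points in the usage comment are examples only.

compare_roots() {
    old=$1   # e.g. /mnt/troubled (mounted troubled image)
    new=$2   # e.g. /mnt/fresh    (mounted fresh-load image)
    # -r: recurse, -q: report only which files differ
    diff -rq "$old/etc" "$new/etc"
}

# Usage (hypothetical mounts):
#   compare_roots /mnt/troubled /mnt/fresh | less
# On the troubled image, "rpm -Va --root /mnt/troubled" can also flag
# packaged files that were modified in place.
```

It won't catch everything (stale kernel modules, initrd contents, leftover services outside /etc), but it narrows the haystack considerably compared to 1-12 hour crash tests.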
