Was thinking along the same lines as the last resort...
If you can pin down the problem as a HostOS problem,
shouldn't your Xen guests (with their Docker containers inside) be easily
migratable to a newly built HostOS?
Don't know what else is running on your system, but it's generally
recommended best practice to keep the HostOS in a multi-tenant
system as simple as possible.
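For a file-backed guest that's often just a matter of copying the disk image and xl config over and starting it on the rebuilt host. A rough sketch only; "guest1", "newhost", and the paths are placeholders, not your actual setup:

```shell
#!/bin/sh
# Sketch of moving a file-backed Xen guest to a rebuilt HostOS.
# "guest1", "newhost", and the paths below are placeholders.
GUEST="guest1"
NEWHOST="newhost"
IMG_DIR="/var/lib/xen/images"

# Emit the commands rather than running them, so they can be
# reviewed before being executed by hand.
migrate_cmds() {
  echo "xl shutdown -w $GUEST"
  echo "rsync -aP $IMG_DIR/$GUEST.img $NEWHOST:$IMG_DIR/"
  echo "scp /etc/xen/$GUEST.cfg $NEWHOST:/etc/xen/"
  echo "ssh $NEWHOST xl create /etc/xen/$GUEST.cfg"
}

migrate_cmds   # review the output, then run it step by step
```

(Live migration with "xl migrate" is also an option if both hosts are up and share storage.)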

And a general observation:
if this machine was originally installed as 13.1 and has survived every
openSUSE version upgrade from then to now, IMO that's quite
an accomplishment.

IMO,
Tony

On Wed, Jan 8, 2020 at 9:01 AM Glen <[email protected]> wrote:
>
> Hi everyone -
>
> So my two threads really turn out to be one thread, and I'm replying
> to both here.  I apologize for the mess.
>
> First, I really appreciate all the responses and pointers, I'm very
> grateful for all the help.  THANK YOU to all those who responded on
> either thread!
>
> To recap, essentially what I have is:
>
> 1. I've always run a bunch of high-traffic 42.3 Xen PV guests on 42.3 hosts.
> 2. I upgraded the 42.3 hosts to 15.1, via online upgrade and then via
> fresh load.
> 3. When I did that, *one* of my 42.3 guests started hanging at random
> every 2-7 days.
>
> The hangs seemed to be related to high network and/or disk traffic. I
> discovered by accident that if I did an "xl trigger nmi" I could
> "unhang" the guest and make it resume duty, more or less, without a
> reboot, but I have no idea why the hang occurs or why it's recoverable
> in that way.
>
> Chasing this down has been painful.  It initially looked like sshd was
> the culprit, but it wasn't.  I thought the kernel mismatch might be
> the issue, but other 42.3 guests run on their 15.1 hosts without a
> problem, and upgrading the guest to 15.1 didn't solve it.  Olaf has
> been pointing me to new kernels, and that helped somewhat - moving to
> the SLE kernel extended the guest uptime from a few days to a few
> weeks (buying me much-needed sleep, thank you!), but I still don't
> have a solution.
>
> The problem seems to be in this particular guest... somewhere, and it
> seems to travel with the guest: if I clone the guest and
> bring it up elsewhere, that clone also has the problem. So I've
> resorted to making a copy of the guest and staging it on a different
> host just so I can stress-test it.
>
> To stress-test it, I basically initiate lots of high-traffic requests
> against the troubled guest from an outside source.   Initially, the
> guest was hanging during a single full outbound rsync.  To prevent SSD
> wear I modified the command and used a hack to simulate the traffic.
> If I boot the guest, and, from a different connected machine, do stuff
> like:
>
> nohup ssh 192.168.1.11 tar cf - --one-file-system /a | cat > /dev/null &
> nohup ssh 192.168.1.11 cat /dev/zero | cat > /dev/null &
>
> (where 1.11 is the troubled guest, and /a is a 4TB filesystem full of
> data)  I can make the troubled guest hang in somewhere between 45
> minutes and 12 hours.
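For repeatability, those two loads could be wrapped in a small script that also logs when each run started, so the hang time can be bracketed from the log afterward. A sketch, using the same IP and path as the commands above (both placeholders for the troubled guest and its big filesystem):

```shell
#!/bin/sh
# Stress-load sketch based on the two commands above.
# GUEST and SRC are placeholders (troubled guest, large filesystem).
GUEST="192.168.1.11"
SRC="/a"

# Build the remote command for each load type: "tar" exercises
# disk+network together, "zero" exercises the network path alone.
stress_cmd() {
  case "$1" in
    tar)  echo "ssh $GUEST tar cf - --one-file-system $SRC" ;;
    zero) echo "ssh $GUEST cat /dev/zero" ;;
  esac
}

run_stress() {
  date "+%F %T starting stress run" >> stress.log
  nohup sh -c "$(stress_cmd tar) | cat > /dev/null" >/dev/null 2>&1 &
  nohup sh -c "$(stress_cmd zero) | cat > /dev/null" >/dev/null 2>&1 &
}
# Call run_stress, then watch for the guest to stop responding.
```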
>
> Thanks to Olaf, Jan, Tony and Fajar, I've been able to try a number of
> things, but so far, I've had no luck:
>
> Upgrading openssh to the latest version did not solve it.
> Upgrading the guest to 15.1 (unifying the kernels) did not solve it.
> Upgrading the 42.3 guest kernel to a different version helped... but
> did not solve it.
> Removing some possibly problematic kernel modules did not solve it.
> Removing Docker did not solve it.
> I had optimizations from the Xen best practices page in
> /etc/sysctl.conf for just this guest - removing those did not solve
> it.
>
> The only solution I've found seems to be starting the guest over
> fresh.  If I do a fresh load of 15.1 as a guest and mount that same /a
> filesystem, and do those same tests... the freshly-loaded guest works
> fine.... it's rock solid.  I had those same tests running against the
> freshly-loaded guest for over 24 hours and it did just fine.  I can
> literally just swap out root filesystem images - booting the troubled
> guest's root filesystem results in the hangs - booting the fresh-load
> seems completely reliable.
>
> In short it seems to me now that there's something in this particular
> guest's root filesystem image... something I can't find... that is
> causing this.   The image started as a 13.1 (thirteen point one) fresh
> load (years ago) and has been in-place upgraded ever since.... so I am
> concluding that something bad has been brought forward that I'm not
> aware of.
>
> It seems at this point that I just need to rebuild the guest as a
> fresh 15.1 load, reinstalling only what I currently need, and go on
> from there, so that's what I'm going to do.  The usual absence of
> useful log data when the machine crashes is frustrating, and the time
> it takes (1-12 hours) to make a test machine crash makes the test
> process slow, so I'm feeling like I should just abandon this and
> replace the guest.
>
> If any of this triggers anything for anyone, please let me know.
> Otherwise, I'm continuing to stress test my freshly-loaded guest for a
> few more days (just to be sure) and then I'll start the reconnect and
> replacement process.   It really would have been nice to find out what
> on that troubled guest was causing the issue, but it's probably some
> legacy thing brought forward that's causing instability, and since
> each test cycle takes so long, the "process of elimination" could take
> months or more.
>
> And of course as soon as I send this, something new will break, making
> this all invalid.  :-)
>
> Anyway THANK YOU ALL for your support and help here, I am very grateful!
>
> Glen
> --
> To unsubscribe, e-mail: [email protected]
> To contact the owner, e-mail: [email protected]
>
