Jan, Tony - Thank you both for your responses. I have more information now which might be helpful; I'll provide it after I answer your comments.
On Mon, Dec 23, 2019 at 1:26 AM Jan Beulich <[email protected]> wrote:
> > br_netfilter
> > bridge
> One of these two is, according to my experience, a fair candidate for
> your problems.

Thank you. I'll focus in on these and see what I can do.

On Mon, Dec 23, 2019 at 11:00 AM Tony Su <[email protected]> wrote:
> I'm going to guess that you didn't install your Xen on your HostOS
> using "the" recommended standard procedure... Which is to use the YaST
> Virtualization module. If you did that, then you shouldn't have
> variations. Also, you would be prompted to install a bridge device.

Sorry, I was not clear; my fault. My HostOS is OpenSuse 15.1 across all hosts. On two of the hosts, it is a fresh load from the downloaded ISO, with only the defaults plus the Xen patterns selected. Following the fresh load, I did a zypper update, and then I did the recommended standard procedure, using YaST2 to install the virtualization support. That process did indeed prompt me to create a bridge, and I did. It seemed to me to be the same procedure.

The other two hosts were fresh loaded (in the past) at 42.3, using the same procedure then, and have since been zypper-dup'ped to 15.0 and then 15.1 per the upgrade procedure.

All four hosts seem "clean"... and the problem exists with guests on all four hosts. But to be clear - the hosts are not freezing up or losing network connectivity at all. The hosts are fine. It is only the guests that are having issues.

> That said, I don't know how old your HostOS installations are (except
> for any you say you just installed)

The two fresh hosts were loaded about 6 weeks ago. The other two were dup'ped about 12 weeks ago. All have been zypper-updated to the latest stuff since then. All four hosts run only Xen, nothing else at the same time. Just stock OpenSuse 15.1 and Xen Dom0.

> Blabbering away...

Please continue! I read everything else you said with interest.
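As a side note for anyone else following along: a quick, read-only way to check Jan's two candidates on each dom0 is to scan /proc/modules. This is just a sketch of my own; the mod_status helper is a name I made up, not a standard tool.

```shell
#!/bin/sh
# Sketch: report whether the modules Jan flagged (br_netfilter, bridge)
# are currently loaded, by scanning /proc/modules. mod_status is a
# made-up helper name, not a standard utility.
mod_status() {
    if grep -q "^$1 " /proc/modules 2>/dev/null; then
        echo "$1: loaded"
    else
        echo "$1: not loaded"
    fi
}

for m in br_netfilter bridge; do
    mod_status "$m"
done
```

Running that on each of the four hosts would at least tell me whether the loaded-module sets actually differ between the fresh-loaded and dup'ped dom0s.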
It's often the case that one thing one person says can trigger something for someone else, and I'm hoping that happens here. I am very grateful for the background, history, and detail.

On Mon, Dec 23, 2019 at 11:20 AM Tony Su <[email protected]> wrote:
> AFAIC all your /etc/sysctl/ settings look benign,

Thanks.

> But be aware that there is an effort to deprecate /etc/sysctl.

I wasn't aware of that *sigh* but thank you.

> Your post suggests you think that your problem might be network or
> disk related...

Maybe. I honestly don't know. All I know is that when I have a guest running on any of my hosts, even if that guest is idle (because, for example, it's a copy of a production machine) and therefore not getting internet traffic or usage, I can make the guest crash by rsyncing a lot of data over a crossover cable at maximum speed. I haven't tested letting the guest "just sit there"; I suppose that'd be a good bracketing test. For now, it seems like an rsync read triggers the issue, so I assume (perhaps incorrectly) it's network- or disk-related.

Everything else in your email was read, understood, and appreciated, and:

> https://sites.google.com/site/4techsecrets/optimize-and-fix-your-network-connection

I will read that after I send this.

So where I am now is this: I have a guest machine image running 42.3. There are four copies of this guest, running on four different hosts. The hosts are running 15.1. Two are fresh loads, two are zypper-dup'ped from 42.3 fresh loads. The guest machines have had problems on all four hosts.

1. The "production" guest is running 42.3. When its host was also on 42.3, it was rock solid. When I dup'ped the host to 15.1, the guest started going into the weeds every 5-7 days at random. This is the one I first reported in the 42.3 thread. Olaf suggested installing the SLE12-SP5 kernel on that guest. I did that roughly 72 hours ago. So far no issues, but it needs more time.
I previously thought that I had to destroy/recreate this guest (as I mentioned in my thread); I now realize (see #2 below) that if it crashes again I should be able to recover it with an "xl trigger nmi".

2. One of the backup guests, whose job it was to just rsync from production, is a stock 42.3 guest still running the 4.4 (42.3) kernel. Its host has also been dup'ped to 15.1. It had never had a problem until today, when it went into the weeds. I was able to recover it using "xl trigger nmi".

3. A third backup guest is also a stock 42.3 guest running the 4.4 kernel. Its host, however, was a fresh load of 15.1. It has locked up once as well.

4. My fourth guest is a copy of the original 42.3 guest which has itself been zypper-dup'ped to 15.1. I have been copying data from this machine, which causes it to freeze up; this is also recoverable with an NMI.

So....

* Problem exists on any 15.1 host, whether fresh loaded or dup'ped.
* Problem exists on multiple copies of this particular guest, whether at 42.3 or dup'ped to 15.1. The SLE12-SP5 kernel *might* have resolved this, but more time is needed to be sure.
* Problem seems to be confined to (copies of) this particular guest, for this particular client.
* Problem *seems* to not exist on any of my other guests from different origins/other clients running 42.3 or 15.1, dup'ped or fresh, although I suppose there could still be broken guests that just haven't crashed yet. But testing a fresh-loaded 15.1 guest, I could not get it to crash.
* Problem *seems* to be related to utilization of network or disk; the more utilization, the more frequent the hang.

No log output on the guest at all to indicate why. The virtual hardware just... stops. I'm literally just guessing at this point, but that is why I suspect something in this particular guest. It could be a legacy thing - this guest was last freshly loaded at 13.1, and has been zypper-dup'ped step by step ever since.
That could be it, but I have other guests with similar histories that are not malfunctioning. This guest is also running the extra modules I mentioned, and I'm going to look at that. In addition, this guest runs things like Docker, Elasticsearch, Kibana, and other programs that tend to eat CPU and IO even when they're idle (grumble grumble). I can't help but wonder if one of these might be contributing.

But the thing is, when the machine hangs, it literally just... hangs. The host can't detect it; it still shows b/r states and normal usage, and the only thing the host sees is "Guest network stalled". But the guest console is frozen, and you'd think you'd have to destroy/recreate it (as I did). Only by chance did I discover that an NMI recovered it. After recovery, the guest literally just starts running again, just as it was, right from where it left off, except for the clock. So it's as if the guest is stopping at a (virtual) hardware level... hanging... and then continuing on when I NMI it.

So it seems to me that processes on the guest, even Docker, would show some sign of trouble beforehand. But there is none. Loads are normal, iotop is normal; I've literally sat on the regular top with a 1.0-second refresh and had a guest hang on me right while I was looking at it - and there is literally no warning at all.

I mean, at this point I'm toying with an every-minute cron job on the host like:

* * * * * ping -c4 -w5 [myhostip] &>/dev/null || xl trigger [guestdomid] nmi

Meaning that, as soon as the host can't ping the guest, assume it's in the weeds and NMI it. I shouldn't have to live like that, but at least I'd sleep through the night.

So I hope this clarifies. I'm kind of depressed that zypper-dup'ping the guest to 15.1 didn't solve this - I hope the SLES kernel does.
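For what it's worth, the cron one-liner above could also be written as a slightly more defensive script. This is only a sketch: GUEST_IP and GUEST_DOM are placeholder values I made up, and the actual xl call is left commented out so the script is harmless to run anywhere.

```shell
#!/bin/sh
# Watchdog sketch for the cron idea above. GUEST_IP and GUEST_DOM are
# placeholders, not values from my real setup.
GUEST_IP="192.0.2.10"   # assumption: the guest's address as seen from dom0
GUEST_DOM="mydomain"    # assumption: the guest's domain name or id

# Map ping's exit status (0 = guest answered) to an action.
decide_action() {
    if [ "$1" -eq 0 ]; then
        echo "ok"
    else
        echo "nmi"
    fi
}

ping -c4 -w5 "$GUEST_IP" >/dev/null 2>&1
action=$(decide_action $?)

if [ "$action" = "nmi" ]; then
    echo "guest $GUEST_DOM unreachable, would send NMI" >&2
    # On a real dom0 this line would be uncommented:
    # xl trigger "$GUEST_DOM" nmi
fi
```

Dropped into dom0's crontab every minute, it would do the same job as the one-liner, but with a spot to add logging or a retry count before firing the NMI.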
But if not, it seems like a fresh load/recreate of the guest is all I can do, and I'll do it if I have to. Still, I'm hoping this rings a bell for someone who can point me to some additional data or a solution.

Thank you all for your patience and support!

With great respect and appreciation,
Glen
--
To unsubscribe, e-mail: [email protected]
To contact the owner, e-mail: [email protected]
