Jan, Tony - Thank you both for your responses. I have more information now which might be helpful; I'll provide it after I answer your comments.
On Mon, Dec 23, 2019 at 1:26 AM Jan Beulich <[email protected]> wrote:
> > br_netfilter
> > bridge
> One of these two is, according to my experience, a fair candidate for
> your problems.

Thank you. I'll focus in on these and see what I can do.

On Mon, Dec 23, 2019 at 11:00 AM Tony Su <[email protected]> wrote:
> I'm going to guess that you didn't install your Xen on your HostOS
> using "the" recommended standard procedure... Which is to use the YaST
> Virtualization module. If you did that, then you shouldn't have
> variations. Also, you would be prompted to install a bridge device.

Sorry, I was not clear; my fault. My HostOS is OpenSuse 15.1 across all hosts. On two of the hosts, it is a fresh load from the downloaded ISO, with only the defaults plus the Xen patterns selected. Following the fresh load, I did a zypper update, and then I did the recommended standard procedure, using YaST2 to install the virtualization support. That process did indeed prompt me to create a bridge, and I did. It seemed to me to be the same procedure.

The other two hosts were fresh loaded (in the past) at 42.3, using the same procedure then, and have since been zypper-dup'ped to 15.0 and then 15.1 per the upgrade procedure.

All four hosts seem "clean"... and the problem exists with guests on all four hosts. But to be clear - the hosts are not freezing up or losing network connectivity at all. The hosts are fine. It is only the guests that are having issues.

> That said, I don't know how old your HostOS installations are (except
> for any you say you just installed)

The two fresh hosts were loaded about 6 weeks ago. The other two were dup'ped about 12 weeks ago. All have been zypper-updated to the latest stuff since then. All four hosts run only Xen, nothing else at the same time. Just stock OpenSuse 15.1 and Xen Dom0.

> Blabbering away...

Please continue! I read everything else you said with interest.
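As a side note for anyone else following along: a quick, read-only way to check Jan's two candidates on each dom0 is to scan /proc/modules. This is just a sketch of my own; the mod_status helper is a name I made up, not a standard tool.

```shell
#!/bin/sh
# Sketch: report whether the modules Jan flagged (br_netfilter, bridge)
# are currently loaded, by scanning /proc/modules. mod_status is a
# made-up helper name, not a standard utility.
mod_status() {
    if grep -q "^$1 " /proc/modules 2>/dev/null; then
        echo "$1: loaded"
    else
        echo "$1: not loaded"
    fi
}

for m in br_netfilter bridge; do
    mod_status "$m"
done
```

Running that on each of the four hosts would at least tell me whether the loaded-module sets actually differ between the fresh-loaded and dup'ped dom0s.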
It's often the case that one thing one person says can trigger something for someone else, and I'm hoping that happens here. I am very grateful for the background, history, and detail.

On Mon, Dec 23, 2019 at 11:20 AM Tony Su <[email protected]> wrote:
> AFAIC all your /etc/sysctl/ settings look benign,

Thanks.

> But be aware that there is an effort to deprecate /etc/sysctl.

I wasn't aware of that *sigh* but thank you.

> Your post suggests you think that your problem might be network or
> disk related...

Maybe. I honestly don't know. All I know is that when I have a guest running on any of my hosts, even if that guest is idle (because, for example, it's a copy of a production machine) and therefore not getting internet traffic or usage, I can make the guest crash by rsyncing a lot of data over a crossover cable at maximum speed. I haven't tested letting the guest "just sit there"; I suppose that'd be a good bracketing test. For now, it seems like an rsync read triggers the issue, so I assume (perhaps incorrectly) it's network- or disk-related.

Everything else in your email was read, understood, and appreciated, and:

> https://sites.google.com/site/4techsecrets/optimize-and-fix-your-network-connection

I will read that after I send this.

So where I am now is this: I have a guest machine image running 42.3. There are four copies of this guest, running on four different hosts. The hosts are running 15.1. Two are fresh loads, two are zypper-dup'ped from 42.3 fresh loads. The guest machines have had problems on all four hosts.

1. The "production" guest is running 42.3. When its host was also on 42.3, it was rock solid. When I dup'ped the host to 15.1, the guest started going into the weeds every 5-7 days at random. This is the one I first reported in the 42.3 thread. Olaf suggested installing the SLE12-SP5 kernel on that guest. I did that roughly 72 hours ago. So far no issues, but it needs more time.
I previously thought that I had to destroy/recreate this guest (as I mentioned in my thread); I now realize (see #2 below) that if it crashes again I should be able to recover it with an "xl trigger nmi".

2. One of the backup guests, whose job it was to just rsync from production, is a stock 42.3 guest still running the 4.4 (42.3) kernel. Its host has also been dup'ped to 15.1. It had never had a problem until today, when it went into the weeds. I was able to recover it using "xl trigger nmi".

3. A third backup guest is also a stock 42.3 guest running the 4.4 kernel. Its host, however, was a fresh load of 15.1. It has locked up once as well.

4. My fourth guest is a copy of the original 42.3 guest which has itself been zypper-dup'ped to 15.1. I have been copying data from this machine, which causes it to freeze up; this is also recoverable with an NMI.

So....

* Problem exists on any 15.1 host, whether fresh loaded or dup'ped.
* Problem exists on multiple copies of this particular guest, whether at 42.3 or dup'ped to 15.1. The SLE12-SP5 kernel *might* have resolved this, but more time is needed to be sure.
* Problem seems to be confined to (copies of) this particular guest, for this particular client.
* Problem *seems* to not exist on any of my other guests from different origins/other clients running 42.3 or 15.1, dup'ped or fresh, although I suppose there could still be broken guests that just haven't crashed yet. But testing a fresh-loaded 15.1 guest, I could not get it to crash.
* Problem *seems* to be related to utilization of network or disk; the more utilization, the more frequent the hang.

No log output on the guest at all to indicate why. The virtual hardware just... stops. I'm literally just guessing at this point, but that is why I suspect something in this particular guest. It could be a legacy thing - this guest was last freshly loaded at 13.1, and has been zypper-dup'ped step by step ever since.
That could be it, but I have other guests with similar histories that are not malfunctioning. This guest is also running the extra modules I mentioned, and I'm going to look at that. In addition, this guest runs things like Docker, Elasticsearch, Kibana, and other programs that tend to eat CPU and IO even when they're idle (grumble grumble). I can't help but wonder if one of these might be contributing.

But the thing is, when the machine hangs, it literally just... hangs. The host can't detect it; it still shows b/r states and normal usage, and the only thing the host sees is "Guest network stalled". But the guest console is frozen, and you'd think you'd have to destroy/recreate it (as I did). Only by chance did I discover that an NMI recovered it. After recovery, the guest literally just starts running again, just as it was, right from where it left off, except for the clock. So it's as if the guest is stopping at a (virtual) hardware level... hanging... and then continuing on when I NMI it.

So it seems to me that processes on the guest, even Docker, would show some sign of trouble beforehand. But there is none. Loads are normal, iotop is normal; I've literally sat on the regular top with a 1.0-second refresh and had a guest hang on me right while I was looking at it - and there is literally no warning at all.

I mean, at this point I'm toying with an every-minute cron job on the host like:

* * * * * ping -c4 -w5 [myhostip] &>/dev/null || xl trigger [guestdomid] nmi

Meaning that, as soon as the host can't ping the guest, assume it's in the weeds and NMI it. I shouldn't have to live like that, but at least I'd sleep through the night.

So I hope this clarifies. I'm kind of depressed that zypper-dup'ping the guest to 15.1 didn't solve this - I hope the SLES kernel does.
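For what it's worth, the cron one-liner above could also be written as a slightly more defensive script. This is only a sketch: GUEST_IP and GUEST_DOM are placeholder values I made up, and the actual xl call is left commented out so the script is harmless to run anywhere.

```shell
#!/bin/sh
# Watchdog sketch for the cron idea above. GUEST_IP and GUEST_DOM are
# placeholders, not values from my real setup.
GUEST_IP="192.0.2.10"   # assumption: the guest's address as seen from dom0
GUEST_DOM="mydomain"    # assumption: the guest's domain name or id

# Map ping's exit status (0 = guest answered) to an action.
decide_action() {
    if [ "$1" -eq 0 ]; then
        echo "ok"
    else
        echo "nmi"
    fi
}

ping -c4 -w5 "$GUEST_IP" >/dev/null 2>&1
action=$(decide_action $?)

if [ "$action" = "nmi" ]; then
    echo "guest $GUEST_DOM unreachable, would send NMI" >&2
    # On a real dom0 this line would be uncommented:
    # xl trigger "$GUEST_DOM" nmi
fi
```

Dropped into dom0's crontab every minute, it would do the same job as the one-liner, but with a spot to add logging or a retry count before firing the NMI.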
But if not, it seems like a fresh load/recreate of the guest is all I can do, and I'll do it if I have to. Still, I'm hoping this rings a bell for someone who can point me to some additional data or a solution.

Thank you all for your patience and support!

With great respect and appreciation,
Glen
--
To unsubscribe, e-mail: [email protected]
To contact the owner, e-mail: [email protected]
