To circle back: I can reproduce the VM lock-up 100% of the time by typing too quickly into the VM's virtual serial console, for example my password or longer command strings that I know by muscle memory.
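The procedure described below, condensed into a loop, looks roughly like this. (A sketch only: it just echoes the commands rather than running them, since wsconsctl(8) needs a real wscons console and vmctl(8) needs a running vmd; the actual testing was done by hand as root on the host.)

```shell
# Dry-run sketch of the repro loop: print each step instead of
# executing it. 100 is the default keyboard.repeat.deln value.
for delay in 100 75 50 25 15 10 5; do
    echo "wsconsctl keyboard.repeat.deln=$delay"
    echo "doas vmctl console 1    # hold a key ~2s, then detach"
done
```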
I tried a few things, such as slowly typing several kilobytes of text into the console one character at a time. If I mash the keyboard inside cu, the VM locks up.

I went to the text console of the VM host (my daily-driver laptop) and stepped down the keyboard repeat time with:

wsconsctl keyboard.repeat.deln=<n>

I then attached to the VM's virtual console using "doas vmctl console 1", held a key down, let a few lines of text show up before exiting the console, decreased the deln delay further, and repeated the experiment. The default value is 100, so holding a key down (longer than the default 400 msec del1 initial delay) results in a 100 msec delay between repeated keystrokes. I reduced this first to 75, then to 50, 25, 15, 10, and 5. With a repeat delay of 5 msec on the virtual console, I was able to reliably lock up VMs in a few dozen "keystrokes" (a matter of a second or two of holding a key down). I was able to get three different VMs to lock up: one running the October 22 snapshot, and two others running OpenBSD 6.0-RELEASE, one i386, the other amd64.

I cannot reproduce this through an SSH session to any of the VMs, even with a high keyboard repeat rate.

Mike and I have been in touch off-list (thanks again!), but I thought the results of my testing were relevant to misc@.

On Wed, Oct 26, 2016 at 7:15 PM, Mike Larkin <mlar...@azathoth.net> wrote:
> On Wed, Oct 26, 2016 at 06:36:25PM -0500, Ax0n wrote:
> > I'm running vmd with the options you specified, and using tee(1) to peel it
> > off to a file while I can still watch what happens in the foreground. It
> > hasn't happened again yet, but I haven't been messing with the VMs as much
> > this week as I was over the weekend.
> >
> > One thing of interest: inside the VM running the Oct 22 snapshot, top(1)
> > reports the CPU utilization hovering over 1.0 load, with nearly 100% in
> > interrupt state, which seems pretty odd to me.
> > I am also running an i386 and an amd64 VM at the same time, both on
> > 6.0-RELEASE, and neither of them is exhibiting this high load. I'll
> > probably update the snapshot of the -CURRENT(ish) VM tonight, and the
> > snapshot of my host system (which is also my daily driver) this weekend.
> >
>
> I've seen that (and have seen it reported) from time to time as well. This
> is unlikely to be time actually spent in interrupt; it's more likely a time
> accounting error that's making the guest think it's spending more time in
> interrupt servicing than it actually is. This is due to the fact that both
> the statclock and hardclock are running at 100Hz (or close to it), because
> the host is unable to inject more frequent interrupts.
>
> You might try running the host at 1000Hz and see if that fixes the problem.
> It did, for me. Note that such an adjustment is really a hack and should
> just be viewed as a temporary workaround. Of course, don't run your guests
> at 1000Hz as well (that would defeat the purpose of cranking the host).
> That parameter can be adjusted in param.c.
>
> -ml
>
> > load averages: 1.07, 1.09, 0.94    vmmbsd.labs.h-i-r.net 05:05:27
> > 26 processes: 1 running, 24 idle, 1 on processor    up 0:28
> > CPU states:  0.0% user,  0.0% nice,  0.4% system, 99.6% interrupt,  0.0% idle
> > Memory: Real: 21M/130M act/tot Free: 355M Cache: 74M Swap: 0K/63M
> >
> >   PID USERNAME PRI NICE  SIZE   RES STATE   WAIT    TIME   CPU COMMAND
> >     1 root      10    0  420K  496K idle    wait    0:01  0.00% init
> > 13415 _ntp       2  -20  888K 2428K sleep   poll    0:00  0.00% ntpd
> > 15850 axon       3    0  724K  760K sleep   ttyin   0:00  0.00% ksh
> > 42990 _syslogd   2    0  972K 1468K sleep   kqread  0:00  0.00% syslogd
> > 89057 _pflogd    4    0  672K  424K sleep   bpf     0:00  0.00% pflogd
> >  2894 root       2    0  948K 3160K sleep   poll    0:00  0.00% sshd
> > 85054 _ntp       2    0  668K 2316K idle    poll    0:00  0.00% ntpd
> >
> >
> > On Tue, Oct 25, 2016 at 2:09 AM, Mike Larkin <mlar...@azathoth.net> wrote:
> > > On Mon, Oct 24, 2016 at 11:07:32PM -0500, Ax0n wrote:
> > > > Thanks for the update, ml.
> > > >
> > > > The VM just did it again in the middle of backspacing over uname -a...
> > > >
> > > > $ uname -a
> > > > OpenBSD vmmbsd.labs.h-i-r.net 6.0 GENERIC.MP#0 amd64
> > > > $ un   <-- frozen
> > > >
> > > > Spinning like mad.
> > > >
> > >
> > > Bizarre. If it were I, I'd next try killing all vmd processes and
> > > running vmd -dvvv from a root console window and look for what it dumps
> > > out when it hangs like this (if anything).
> > >
> > > You'll see a fair number of "vmd: unknown exit code 1" (and 48); those
> > > are harmless and can be ignored, as can anything that vmd dumps out
> > > before the vm gets stuck like this.
> > >
> > > If you capture this and post it somewhere, I can take a look. You may
> > > need to extract the content out of /var/log/messages if a bunch gets
> > > printed.
> > >
> > > If this fails to diagnose what happens, I can work with you off-list on
> > > how to debug further.
> > >
> > > -ml
> > >
> > > > [axon@transient ~]$ vmctl status
> > > >    ID   PID VCPUS  MAXMEM  CURMEM      TTY NAME
> > > >     2  2769     1   512MB   149MB /dev/ttyp3 -c
> > > >     1 48245     1   512MB   211MB /dev/ttyp0 obsdvmm.vm
> > > > [axon@transient ~]$ ps aux | grep 48245
> > > > _vmd  48245 98.5  2.3 526880 136956 ??  Rp  1:54PM 47:08.30 vmd: obsdvmm.vm (vmd)
> > > >
> > > > load averages: 2.43, 2.36, 2.26    transient.my.domain 18:29:10
> > > > 56 processes: 53 idle, 3 on processor    up 4:35
> > > > CPU0 states:  3.8% user,  0.0% nice, 15.4% system,  0.6% interrupt, 80.2% idle
> > > > CPU1 states: 15.3% user,  0.0% nice, 49.3% system,  0.0% interrupt, 35.4% idle
> > > > CPU2 states:  6.6% user,  0.0% nice, 24.3% system,  0.0% interrupt, 69.1% idle
> > > > CPU3 states:  4.7% user,  0.0% nice, 18.1% system,  0.0% interrupt, 77.2% idle
> > > > Memory: Real: 1401M/2183M act/tot Free: 3443M Cache: 536M Swap: 0K/4007M
> > > >
> > > >   PID USERNAME PRI NICE  SIZE   RES STATE   WAIT     TIME    CPU COMMAND
> > > > 48245 _vmd      43    0  515M  134M onproc  thrslee 47:37  98.00% vmd
> > > >  7234 axon       2    0  737M  715M sleep   poll    33:18  19.14% firefox
> > > > 42481 _x11      55    0   16M   42M onproc  -        2:53   9.96% Xorg
> > > >  2769 _vmd      29    0  514M   62M idle    thrslee  2:29   9.62% vmd
> > > > 13503 axon      10    0  512K 2496K sleep   nanosle  0:52   1.12% wmapm
> > > > 76008 axon      10    0  524K 2588K sleep   nanosle  0:10   0.73% wmmon
> > > > 57059 axon      10    0  248M  258M sleep   nanosle  0:08   0.34% wmnet
> > > > 23088 axon       2    0  580K 2532K sleep   select   0:10   0.00% wmclockmon
> > > > 64041 axon       2    0 3752K   10M sleep   poll     0:05   0.00% wmaker
> > > > 16919 axon       2    0 7484K   20M sleep   poll     0:04   0.00% xfce4-terminal
> > > >     1 root      10    0  408K  460K idle    wait     0:01   0.00% init
> > > > 80619 _ntp       2  -20  880K 2480K sleep   poll     0:01   0.00% ntpd
> > > >  9014 _pflogd    4    0  672K  408K sleep   bpf      0:01   0.00% pflogd
> > > > 58764 root      10    0 2052K 7524K idle    wait     0:01   0.00% slim
> > > >
> > > >
> > > > On Mon, Oct 24, 2016 at 10:47 PM, Mike Larkin <mlar...@azathoth.net> wrote:
> > > > > On Mon, Oct 24, 2016 at 07:36:48PM -0500, Ax0n wrote:
> > > > > > I suppose I'll ask here, since it seems on-topic for this thread.
> > > > > > Let me know if I shouldn't do this in the future. I've been testing
> > > > > > vmm for exactly a week on two different snapshots. I have two VMs:
> > > > > > one running the same snapshot (amd64, Oct 22) I'm running on the
> > > > > > host, the other running amd64 6.0-RELEASE with no patches of any
> > > > > > kind.
> > > > > >
> > > > > > For some reason, the VM running a recent snapshot locks up
> > > > > > occasionally while I'm interacting with it via cu, or occasionally
> > > > > > ssh. Should I expect a ddb prompt and/or kernel panic messages via
> > > > > > the virtualized serial console? Is there some kind of "break"
> > > > > > command on the console to get into ddb when it appears to hang? A
> > > > > > "No" or "Not yet" on those two questions would suffice if not
> > > > > > possible. I know this isn't supported, and appreciate the hard
> > > > > > work.
> > > > > >
> > > > > > Host dmesg:
> > > > > > http://stuff.h-i-r.net/2016-10-22.Aspire5733Z.dmesg.txt
> > > > > >
> > > > > > VM (Oct 22 snapshot) dmesg:
> > > > > > http://stuff.h-i-r.net/2016-10-22.vmm.dmesg.txt
> > > > > >
> > > > >
> > > > > These look fine. Not sure why it would have locked up. Is the
> > > > > associated vmd process idle, or spinning like mad?
> > > > >
> > > > > -ml
> > > > >
> > > > > > Second:
> > > > > > I'm using vm.conf (contents below) to start the aforementioned
> > > > > > snapshot VM at boot. There's a "disable" line inside vm.conf to
> > > > > > keep one VM from spinning up with vmd.
> > > > > > Is there a way to start this one with vmctl, aside from passing
> > > > > > all the options to vmctl as below?
> > > > > >
> > > > > > doas vmctl start -c -d OBSD-RELa -i 1 -k /home/axon/obsd/amd64/bsd -m 512M
> > > > > >
> > > > > > I've tried stuff along the lines of:
> > > > > > doas vmctl start OBSD-RELa.vm
> > > > > >
> > > > > > vm "obsdvmm.vm" {
> > > > > >     memory 512M
> > > > > >     kernel "bsd"
> > > > > >     disk "/home/axon/vmm/OBSD6"
> > > > > >     interface tap
> > > > > > }
> > > > > > vm "OBSD-RELa.vm" {
> > > > > >     memory 512M
> > > > > >     kernel "/home/axon/obsd/amd64/bsd"
> > > > > >     disk "/home/axon/vmm/OBSD-RELa"
> > > > > >     interface tap
> > > > > >     disable
> > > > > > }
> > > > > >
> > > > >
> > > > > I think this is being worked on, but not done yet.
> > > > >
> > > > > -ml
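For anyone skimming this in the archive, here is the vm.conf fragment from the thread again, unquoted for readability. The paths and VM names are the poster's; the comment on "disable" reflects its documented meaning in vm.conf(5) (define the VM but don't start it when vmd starts).

```
# /etc/vm.conf as posted in the thread (comment added for readability)
vm "obsdvmm.vm" {
        memory 512M
        kernel "bsd"
        disk "/home/axon/vmm/OBSD6"
        interface tap
}

vm "OBSD-RELa.vm" {
        memory 512M
        kernel "/home/axon/obsd/amd64/bsd"
        disk "/home/axon/vmm/OBSD-RELa"
        interface tap
        disable         # defined, but not autostarted with vmd
}
```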