Bug#961676: 5.6 kernel crashes host when using VFIO on KVM guests

2020-07-19 Thread Simon John

I think we can close this as #notabug.

I did a fresh install and it's working on 5.7.

It worked 100% fine on qemu, but when I imported it into virt-manager it 
crashed until I removed all of the clock bits, so I think maybe there's 
a bug in the clock/pit parts of libvirt that qemu doesn't seem to use.


Regards.

--
Simon John



Bug#961676: 5.6 kernel crashes host when using VFIO on KVM guests

2020-07-18 Thread Simon John
Ah no, turns out I had disabled the gpu passthrough. When I re-enabled 
it I just got a kernel panic from 5.8.0-rc5.



On 19/07/2020 00:34, Simon John wrote:
> I tried mainline 5.8.0-rc5 and I couldn't even get into gnome shell
> after the gdm3 password prompt, just a black screen!
>
> I tried running virsh start which worked but my monitor never got a
> signal. It also didn't crash the host or qemu though, although virsh
> destroy finished instantly, so I assume the guest didn't even start.
>
> regards.




--
Simon John



Bug#961676: 5.6 kernel crashes host when using VFIO on KVM guests

2020-07-18 Thread Simon John
I tried mainline 5.8.0-rc5 and I couldn't even get into gnome shell 
after the gdm3 password prompt, just a black screen!


I tried running virsh start which worked but my monitor never got a 
signal. It also didn't crash the host or qemu though, although virsh 
destroy finished instantly, so I assume the guest didn't even start.


regards.

--
Simon John



Bug#961676: 5.6 kernel crashes host when using VFIO on KVM guests

2020-07-18 Thread Simon John

Where would I look for some follow-up on this - upstream kernel.org?


--
Simon John



Bug#961676: 5.6 kernel crashes host when using VFIO on KVM guests

2020-07-03 Thread Simon John

Done some more digging.

I can boot my macos vm using QXL instead of a passthrough GPU, even if I 
pass through the USB device (not the pci controller). So it's definitely only 
passing through PCI devices like graphics cards that breaks things.


So this works in qemu terms:

-usb -device usb-host,hostbus=1,hostaddr=3 \

This breaks things:

-device vfio-pci,host=02:00.0,multifunction=on \
-device vfio-pci,host=02:00.1
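
One quick check before starting a guest with the vfio-pci options above is to see which host driver currently owns each function. This is a minimal sketch, assuming the usual sysfs layout where /sys/bus/pci/devices/<address>/driver is a symlink to the bound driver. The function name bound_driver, its sysfs_root parameter and the 0000: domain prefix are assumptions made for illustration, not something taken from the gist.

```python
import os


def bound_driver(pci_addr, sysfs_root="/sys"):
    """Return the name of the driver bound to a PCI device, or None.

    pci_addr is a full address such as "0000:02:00.0" (the 0000: domain
    prefix is an assumption about this host). sysfs_root is a hypothetical
    parameter that only exists so the function can be tried on a fake tree.
    """
    link = os.path.join(sysfs_root, "bus", "pci", "devices", pci_addr, "driver")
    # No "driver" symlink means no driver is bound (or no such device).
    if not os.path.islink(link):
        return None
    return os.path.basename(os.readlink(link))


if __name__ == "__main__":
    # The two functions passed to qemu in the breaking example.
    for addr in ("0000:02:00.0", "0000:02:00.1"):
        print(addr, bound_driver(addr))
```

For passthrough both functions would normally be expected to report vfio-pci rather than a graphics or audio driver.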

Oddly enough an Ubuntu 20.04 vm works fine with the gpu and usb passed 
through.


Added the qemu shell scripts to the gist:

https://gist.github.com/sej7278/766043a69c76308f84cfa14b3f3a924f

Regards.

--
Simon John



Bug#961676: 5.6 kernel crashes host when using VFIO on KVM guests

2020-06-28 Thread Simon John
On Sun, 28 Jun 2020 10:41:41 +0200 Salvatore Bonaccorso wrote:

Hi Simon,


Hi Salvatore, thanks for looking into this.


On Sun, Jun 28, 2020 at 01:01:01AM +0100, Simon John wrote:
> This looks a likely culprit:
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=207489


The issue you are seeing now seems different, though afaics.

The fixing commit from the above reference, 8be8f932e3db ("kvm:
ioapic: Restrict lazy EOI update to edge-triggered interrupts") was
applied in v5.7-rc3, which was as well backported to v5.6.13.


Yes, I was going to patch the kernel but noticed it's already got that 
patch. I also tried passing through only one pci device and not the usb 
device but it didn't help, so it probably isn't that issue.



Now your kernel is tainted, can you check if you see the issue as well
when not loading the modules which taint the kernel?


I removed virtualbox, which was tainting the kernel, and confirmed that cat 
/proc/sys/kernel/tainted returned 0 after a reboot.


I still managed to trigger the issue though, which then tainted the 
kernel itself, output from kernel-chktaint:


Kernel is "tainted" for the following reasons:
 * kernel died recently, i.e. there was an OOPS or BUG (#7)
 * kernel issued warning (#9)
 * soft lockup occurred (#14)
For a more detailed explanation of the various taint flags see
 Documentation/admin-guide/tainted-kernels.rst in the the Linux kernel sources
 or https://kernel.org/doc/html/latest/admin-guide/tainted-kernels.html
Raw taint value as int/string: 17024/'G  D WL   '
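
The raw value is a bitmask, so 17024 = 128 + 512 + 16384, i.e. bits 7, 9 and 14, which matches the three reasons listed above. A small sketch of that decoding follows. The table is only a subset of the flags in tainted-kernels.rst, and the function names are made up for this example.

```python
# Subset of the taint bits documented in
# Documentation/admin-guide/tainted-kernels.rst: bit -> (letter, reason).
TAINT_FLAGS = {
    0: ("P", "proprietary module was loaded"),
    7: ("D", "kernel died recently (OOPS or BUG)"),
    9: ("W", "kernel issued warning"),
    12: ("O", "externally-built (out-of-tree) module was loaded"),
    14: ("L", "soft lockup occurred"),
}


def decode_taint(value):
    """Return the bit numbers that are set in a raw taint value."""
    return [bit for bit in range(value.bit_length()) if value & (1 << bit)]


def describe_taint(value):
    """Return one human-readable line per set taint bit."""
    lines = []
    for bit in decode_taint(value):
        # Bits missing from the subset above are reported as unknown.
        letter, reason = TAINT_FLAGS.get(bit, ("?", "not in this subset"))
        lines.append(f"bit {bit} ({letter}): {reason}")
    return lines


if __name__ == "__main__":
    for line in describe_taint(17024):
        print(line)
```

On a live system the same functions could be fed the integer read from /proc/sys/kernel/tainted.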

The kvm-pit process was using 100% of a cpu core. I've never even 
noticed that process before, but it looks like the culprit now:


watchdog: BUG: soft lockup - CPU#7 stuck for 22s! [kvm-pit/2888:2913]

CPU: 7 PID: 2913 Comm: kvm-pit/2888 Tainted: G  D W 5.7.0-1-amd64 #1 Debian 5.7.6-1
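
To pick these lockups out of a long dmesg dump, the watchdog line can be matched with a regular expression. This is only a sketch based on the single line quoted above. The pattern and the name parse_soft_lockup are assumptions, and other kernels may format the message differently.

```python
import re

# Pattern modelled on the one watchdog line quoted in this report.
SOFT_LOCKUP_RE = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^\]]+):(?P<pid>\d+)\]"
)


def parse_soft_lockup(line):
    """Return cpu, seconds, comm and pid from a soft lockup line, or None."""
    m = SOFT_LOCKUP_RE.search(line)
    if m is None:
        return None
    return {
        "cpu": int(m["cpu"]),
        "seconds": int(m["secs"]),
        "comm": m["comm"],
        "pid": int(m["pid"]),
    }


if __name__ == "__main__":
    sample = "watchdog: BUG: soft lockup - CPU#7 stuck for 22s! [kvm-pit/2888:2913]"
    print(parse_soft_lockup(sample))
```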


I've added a strace and dmesg to the gist here:

https://gist.github.com/sej7278/766043a69c76308f84cfa14b3f3a924f

Any other diagnostics I can run?


Next, if it still does show up, does it show up as well in current
mainline?


Not sure what you mean there. Do you mean the upstream kernel.org kernel? 
I'm not sure how I'd run that.


Regards.

--
Simon John



Bug#961676: 5.6 kernel crashes host when using VFIO on KVM guests

2020-06-28 Thread Salvatore Bonaccorso
Hi Simon,

On Sun, Jun 28, 2020 at 01:01:01AM +0100, Simon John wrote:
> This looks a likely culprit:
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=207489

The issue you are seeing now seems different, though afaics.

The fixing commit from the above reference, 8be8f932e3db ("kvm:
ioapic: Restrict lazy EOI update to edge-triggered interrupts") was
applied in v5.7-rc3, which was as well backported to v5.6.13.

Now your kernel is tainted, can you check if you see the issue as well
when not loading the modules which taint the kernel?

Next, if it still does show up, does it show up as well in current
mainline?

Regards,
Salvatore



Bug#961676: 5.6 kernel crashes host when using VFIO on KVM guests

2020-06-27 Thread Simon John

This looks a likely culprit:

https://bugzilla.kernel.org/show_bug.cgi?id=207489

--
Simon John



Bug#961676: 5.6 kernel crashes host when using VFIO on KVM guests

2020-06-27 Thread Simon John
Bug still exists in today's 5.7 kernel, however it took long enough to 
trigger that it managed to write some logs:


https://gist.github.com/sej7278/766043a69c76308f84cfa14b3f3a924f

"BUG: soft lockup - CPU#0 stuck for 22s!" seems to go back to 2013 or 
earlier e.g. https://bugzilla.redhat.com/show_bug.cgi?id=1038929


kvm-pit seemed to be eating a lot of cpu, qemu went defunct, macos guest 
got a little further into its boot sequence.


Can we upgrade this bug so it gets some visibility? That's two major 
kernel versions it's not been fixed in now, let alone commented on.


--
Simon John



Bug#961676: 5.6 kernel crashes host when using VFIO on KVM guests

2020-06-13 Thread Simon John
I tried to get some sort of useful logging today, best I could do was 
running strace.


So I noticed that virsh exited 0 and the host didn't crash instantly, it 
took 2-3 seconds more.


Photo of screen with strace output here:

https://i.imgur.com/kOSEXXA.jpg

Managed to install 5.5.0-2 from snapshot.debian.org which works fine.

Regards.

--
Simon John



Bug#961676: 5.6 kernel crashes host when using VFIO on KVM guests

2020-06-11 Thread Simon John
The new linux-image-5.6.0-2-amd64 (5.6.14-2) from today is no better, 
the only change is that it's signed I guess?


Also I noticed the 5.3/5.4/5.5 kernels have all been removed from my 
system and the servers so I have no way to boot back into a kernel that 
doesn't break VFIO.


Has this issue been looked at or is 5.7 due soon?

Regards.

--
Simon John



Bug#961676: 5.6 kernel crashes host when using VFIO on KVM guests

2020-05-27 Thread Simon John
On Wed, 27 May 2020 21:31:46 +0200 Salvatore Bonaccorso wrote:

Source: linux
Version: 5.6.7-1
Control: submitter -1 deb...@the-jedi.co.uk

Forwarding this report as bug in the BTS.

On Mon, May 25, 2020 at 09:03:43PM +0100, Simon John wrote:
> Sorry to email directly but I've tried reportbug but it doesn't seem to work
> with the kernel packages.

What exactly did not work?


I tried to report a bug in the 5.5 kernel regarding HDMI audio not working 
after a while until you kill pulseaudio, but reportbug did not create a 
valid report I guess, as it never got created. I tried using the email 
template too.


Same problem when trying to report this bug.

Anyway, at least this bug has been created now, hopefully will get 
looked into.



> The 5.6.0-1 kernel in Sid when used in conjunction with VFIO (PCI
> passthrough) guests in Qemu-KVM causes KVM to crash with no useful logs -
> the guest partially starts. Eventually the host slows down and needs to be
> powered off. Doesn't affect non-VFIO guests using spice/headless, only when
> passing through a PCIe graphics card or USB keyboard/mouse.
> 
> 5.6.0-2 that just landed makes this infinitely worse - as soon as you start
> a VFIO guest the host hard crashes.
> 
> 5.3/5.4/5.5 kernels do not have this problem with the same version of Qemu
> and reverting from qemu 5.0 to 4.2 doesn't help, so it must be the 5.6
> kernel causing the issue.
> 
> Not sure what logs or any useful information I can supply you, there's some
> interesting comments on this reddit thread suggesting a couple of bugs that
> have fixes upstream:
> 
> https://www.reddit.com/r/VFIO/comments/glfgqs/qemu_5_vfio_no_longer_works_on_debian/
> 
> I'm using macos catalina and win10 guests to reproduce the issue, and using
> Intel not AMD chips.
> 
> Cheers.
> 
> -- 
> Simon John


Regards,
Salvatore




Best regards.

--
Simon John



Bug#961676: 5.6 kernel crashes host when using VFIO on KVM guests

2020-05-27 Thread Salvatore Bonaccorso
Source: linux
Version: 5.6.7-1
Control: submitter -1 deb...@the-jedi.co.uk

Forwarding this report as bug in the BTS.

On Mon, May 25, 2020 at 09:03:43PM +0100, Simon John wrote:
> Sorry to email directly but I've tried reportbug but it doesn't seem to work
> with the kernel packages.

What exactly did not work?

> The 5.6.0-1 kernel in Sid when used in conjunction with VFIO (PCI
> passthrough) guests in Qemu-KVM causes KVM to crash with no useful logs -
> the guest partially starts. Eventually the host slows down and needs to be
> powered off. Doesn't affect non-VFIO guests using spice/headless, only when
> passing through a PCIe graphics card or USB keyboard/mouse.
> 
> 5.6.0-2 that just landed makes this infinitely worse - as soon as you start
> a VFIO guest the host hard crashes.
> 
> 5.3/5.4/5.5 kernels do not have this problem with the same version of Qemu
> and reverting from qemu 5.0 to 4.2 doesn't help, so it must be the 5.6
> kernel causing the issue.
> 
> Not sure what logs or any useful information I can supply you, there's some
> interesting comments on this reddit thread suggesting a couple of bugs that
> have fixes upstream:
> 
> https://www.reddit.com/r/VFIO/comments/glfgqs/qemu_5_vfio_no_longer_works_on_debian/
> 
> I'm using macos catalina and win10 guests to reproduce the issue, and using
> Intel not AMD chips.
> 
> Cheers.
> 
> -- 
> Simon John

Regards,
Salvatore