Thanks Vanessa for the testing on the PPA!

@Halil - I'd leave the debugging of the remaining issue to you: even
though you can't reproduce it yet, it is still much closer to you than
it is to me :-/ Thanks in advance, let us know what you find.

In the meantime I have prepared the SRU content and got a PR review, so
once we are convinced it is good - or have found what we need to change
- we should be ready to continue.

** Description changed:

+ [Impact]
+ 
+  * There are situations where disk I/O from guests to host SCSI
+    devices can fail to return the right status. The guest then
+    believes that the I/O was successful while it was not, which
+    silently leads to data corruption later on.
+ 
+  * Upstream fixed this in later versions; backport those fixes.
+ 
+ [Test Plan]
+ 
+  * The IBM test lab can run tests with (virtual) cable pulls and
+    similar fault injection. This kind of testing revealed the issue
+    initially (we don't know which subset exactly). We'd rely on IBM
+    to run those tests against the builds in the PPA and in -proposed
+    - and if they can only be run once, at least against those in
+    -proposed. (One way to emulate such a path failure on the host is
+    sketched at the end of this test plan.)
+ 
+  * Any kind of SCSI-attached disk would be worth testing.
+    That can be any of (all details here:
+    https://libvirt.org/formatdomain.html#usb-pci-scsi-devices):
+    - a SCSI device via hostdev
+    - a SCSI device using iSCSI
+    - a SCSI adapter via scsi-host + vhost
+ 
+    But that can only show that there is no regression in formerly
+    working simple setups. For the original case we have to rely on IBM
+    (see above). A minimal guest-side check for these simple setups is
+    sketched at the end of this test plan.
+ 
+  * Due to the complexity I'd suggest keeping this in -proposed a bit
+    longer than usual.
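
One way to emulate such a path failure on the host (for illustration
only): a minimal sketch that toggles the sysfs "state" attribute of the
SCSI device backing the guest disk. "sdX" is a placeholder you would
have to adjust, and IBM's lab setup remains the authoritative test.

/* Sketch: take a host SCSI device offline and back online to emulate a
 * (virtual) cable pull while the guest keeps doing I/O.
 * "sdX" is a placeholder -- point it at the device backing the guest disk. */
#include <stdio.h>
#include <unistd.h>

static int set_state(const char *dev, const char *state)
{
    char path[256];
    snprintf(path, sizeof(path), "/sys/block/%s/device/state", dev);

    FILE *f = fopen(path, "w");
    if (!f) {
        perror(path);
        return -1;
    }
    fprintf(f, "%s\n", state);
    return fclose(f);
}

int main(void)
{
    const char *dev = "sdX";           /* placeholder host device */

    if (set_state(dev, "offline"))     /* simulate the path going away */
        return 1;
    sleep(30);                         /* give the guest time to hit the dead path */
    return set_state(dev, "running") ? 1 : 0;   /* bring the path back */
}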
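
A minimal sketch of the guest-side check mentioned above: write a known
pattern, read it back and compare. It assumes a dedicated scratch disk
/dev/sdX inside the guest (a placeholder, and it will be overwritten).
The point is that any failure must be reported to the application; the
bug was that transport failures could turn into silent "success".

/* Guest-side sketch: write a pattern with O_DIRECT, read it back and
 * compare.  /dev/sdX is a placeholder scratch disk -- do not point this
 * at a disk that holds data, it will be overwritten. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const char *dev = "/dev/sdX";      /* placeholder scratch device */
    const size_t len = 1 << 20;        /* 1 MiB test pattern */
    void *wbuf, *rbuf;

    if (posix_memalign(&wbuf, 4096, len) || posix_memalign(&rbuf, 4096, len))
        return 1;
    memset(wbuf, 0x5a, len);

    int fd = open(dev, O_RDWR | O_DIRECT);   /* bypass the page cache */
    if (fd < 0) { perror("open"); return 1; }

    if (pwrite(fd, wbuf, len, 0) != (ssize_t)len || fsync(fd) != 0) {
        perror("write");               /* an error here must be visible */
        return 1;
    }
    if (pread(fd, rbuf, len, 0) != (ssize_t)len) {
        perror("read");
        return 1;
    }
    if (memcmp(wbuf, rbuf, len) != 0) {
        fprintf(stderr, "MISCOMPARE: I/O reported success but data differs\n");
        return 1;
    }
    puts("pattern written and verified");
    return 0;
}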
+ 
+ [Where problems could occur]
+ 
+  * QEMU does a lot of things; problems caused by this change would
+    show up in, and be limited to, the handling of SCSI disks.
+    - There is the usual kind of regression potential if our backports
+      missed anything or are bad. The code isn't easy, but by now
+      three developers have had a look at it and it looks ok.
+    - But then there is also the "intended regression": we now deliver
+      error codes correctly. If a setup had bad I/O errors and relied
+      on not seeing them, this will change - with this upload those
+      guests will get the errors reported. We can't change that, as it
+      is the main purpose of this fix. But one would assume that
+      people prefer it over silent corruption.
+ 
+ [Other Info]
+  
+  * n/a
+ 
+ --- original report ---
+ 
  == Comment: #63 - Halil Pasic <pa...@de.ibm.com> - 2022-03-28 17:33:34 ==
  I'm pretty confident I've figured out what is going on.
  
  From the guest side, the decision whether the SCSI command was
  completed successfully or not comes down to looking at the sense data.
  Prior to commit a108557bbf ("scsi: inline sg_io_sense_from_errno()
  into the callers."), we don't build sense data as a response to seeing
  a host status presented by the host SCSI stack (e.g. kernel).
  
  Thus when the kernel tells us that a given SCSI command did not get
  completed via SCSI_HOST_TRANSPORT_DISRUPTED or SCSI_HOST_NO_LUN, we
  end up fooling the guest into believing that the command completed
  successfully.
  
  The guest kernel, and especially virtio and multipath, are at no
  fault (AFAIU). Given these facts, it isn't all that surprising that
  we end up with corruptions.
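
For illustration (this is not QEMU's actual code; the enum values and
the chosen sense codes are simplified placeholders), the idea behind
the fix is that a failing host status has to be translated into a
CHECK CONDITION with sense data - otherwise the guest only sees a GOOD
SCSI status and assumes the command succeeded:

/* Illustration only -- not QEMU code.  Translate a transport-level host
 * status into a SCSI status plus fixed-format sense data so the guest
 * can see that the command failed. */
#include <stdint.h>
#include <string.h>

enum host_status {                 /* placeholder values, not the kernel's */
    HOST_OK = 0,
    HOST_NO_LUN,                   /* cf. SCSI_HOST_NO_LUN above */
    HOST_TRANSPORT_DISRUPTED,      /* cf. SCSI_HOST_TRANSPORT_DISRUPTED */
};

#define SCSI_STATUS_GOOD            0x00
#define SCSI_STATUS_CHECK_CONDITION 0x02

/* Fill in fixed-format sense data (byte 0 = 0x70, byte 2 = sense key,
 * bytes 12/13 = ASC/ASCQ) and return the SCSI status the guest sees. */
int scsi_status_for(enum host_status hs, uint8_t sense[18])
{
    uint8_t key, asc;

    switch (hs) {
    case HOST_OK:
        return SCSI_STATUS_GOOD;   /* no sense data needed */
    case HOST_NO_LUN:
        key = 0x05; asc = 0x25;    /* ILLEGAL REQUEST, LUN not supported */
        break;
    case HOST_TRANSPORT_DISRUPTED:
    default:
        key = 0x0b; asc = 0x00;    /* ABORTED COMMAND */
        break;
    }
    memset(sense, 0, 18);
    sense[0]  = 0x70;              /* current error, fixed format */
    sense[2]  = key;
    sense[7]  = 10;                /* additional sense length */
    sense[12] = asc;               /* sense[13] (ASCQ) stays 0 */
    return SCSI_STATUS_CHECK_CONDITION;
}

Before the backported fix, a failing host status on this path simply did
not result in sense data like the above being built, which is what
fooled the guest into seeing success.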
  
  All we have to do is do backports for QEMU (when necessary). I didn't
  investigate vhost-scsi -- my guess is that it ain't affected.
  
  How do we want to handle the back-ports?
  
  == Comment: #66 - Halil Pasic <pa...@de.ibm.com> - 2022-04-04 05:36:33 ==
  This is a proposed backport containing 7 patches in mbox format. I
  tried to pick patches sanely, and all I had to do was basically
  resolve merge conflicts.
  
  I have to admit I have no extensive experience in doing such invasive
  backports, and my knowledge of the QEMU SCSI stack is very limited. I
  would be happy if the Ubuntu folks would have a good look at this, and
  if possible improve on it.

** Description changed:

  [Test Plan]
  
-  * The IBM test lab can run tests with (virtual) cable pulls and
-    similar fault injection. This kind of testing revealed the issue
-    initially (we don't know which subset exactly). We'd rely on IBM
-    to run those tests against the builds in the PPA and in -proposed
-    - and if they can only be run once, at least against those in
-    -proposed.
+  * The IBM test lab can run tests with (virtual) cable pulls and
+    similar fault injection. This kind of testing revealed the issue
+    initially (we don't know which subset exactly). We'd rely on IBM
+    to run those tests against the builds in -proposed.
+    IBM already did that on the PPA, which was quite helpful (see
+    below).
  

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1967814

Title:
  Ubuntu 20.04.3 - ilzlnx3g1 - virtio-scsi devs on KVM guest having
  miscompares on disktests when there is a failed path.


