On Fri, May 31, 2024 at 9:03 AM Peter Maydell <peter.mayd...@linaro.org>
wrote:

> On Tue, 21 May 2024 at 00:26, David Hubbard <dmamf...@gmail.com> wrote:
> >
> > From: Cord Amfmgm <dmamf...@gmail.com>
> >
> > This changes the way the ohci emulation handles a Transfer Descriptor
> with
> > "Current Buffer Pointer" set to "Buffer End" + 1.
> >
> > The OHCI spec 4.3.1.2 Table 4-2 allows td.cbp to be one byte more than
> td.be
> > to signal the buffer has zero length. Currently qemu only accepts
> zero-length
> > Transfer Descriptors if the td.cbp is equal to 0, while actual OHCI
> hardware
> > accepts both cases.
> >
> > The qemu ohci emulation has a regression in ohci_service_td. Version 4.2
> > and earlier matched the spec. (I haven't taken the time to bisect exactly
> > where the logic was changed.)
> >
> > With a tiny OS[1] that boots and executes a test, the issue can be seen:
> >
> > * OS that sends USB requests to a USB mass storage device
> >   but sends td.cbp = td.be + 1
> > * qemu 4.2
> > * qemu HEAD (4e66a0854)
> > * Actual OHCI controller (hardware)
> >
> > Command line:
> > qemu-system-x86_64 -m 20 \
> >  -device pci-ohci,id=ohci \
> >  -drive if=none,format=raw,id=d,file=testmbr.raw \
> >  -device usb-storage,bus=ohci.0,drive=d \
> >  --trace "usb_*" --trace "ohci_*" -D qemu.log
> >
> > Results are:
> >
> >  qemu 4.2   | qemu HEAD  | actual HW
> > ------------+------------+------------
> >  works fine | ohci_die() | works fine
> >
> > Tip: if the flags "-serial pty -serial stdio" are added to the command
> line
> > the test will output USB requests like this:
> >
> > Testing qemu HEAD:
> >
> > > Free mem 2M ohci port2 conn FS
> > > setup { 80 6 0 1 0 0 8 0 }
> > > ED info=80000 { mps=8 en=0 d=0 } tail=c20920
> > >   td0 c20880 nxt=c20960 f2000000 setup cbp=c20900 be=c20907
> > >   td1 c20960 nxt=c20980 f3140000    in cbp=c20908 be=c2090f
> > >   td2 c20980 nxt=c20920 f3080000   out cbp=c20910 be=c2090f ohci20
> host err
> > > usb stopped
> >
> > And in qemu.log:
> >
> > usb_ohci_iso_td_bad_cc_overrun ISO_TD start_offset=0x00c20910 >
> next_offset=0x00c2090f
> >
> > Testing qemu 4.2:
> >
> > > Free mem 2M ohci port2 conn FS
> > > setup { 80 6 0 1 0 0 8 0 }
> > > ED info=80000 { mps=8 en=0 d=0 } tail=620920
> > >   td0 620880 nxt=620960 f2000000 setup cbp=620900 be=620907
>  cbp=0 be=620907
> > >   td1 620960 nxt=620980 f3140000    in cbp=620908 be=62090f
>  cbp=0 be=62090f
> > >   td2 620980 nxt=620920 f3080000   out cbp=620910 be=62090f
>  cbp=0 be=62090f
> > >    rx { 12 1 0 2 0 0 0 8 }
> > > setup { 0 5 1 0 0 0 0 0 } tx {}
> > > ED info=80000 { mps=8 en=0 d=0 } tail=620880
> > >   td0 620920 nxt=620960 f2000000 setup cbp=620900 be=620907
>  cbp=0 be=620907
> > >   td1 620960 nxt=620880 f3100000    in cbp=620908 be=620907
>  cbp=0 be=620907
> > > setup { 80 6 0 1 0 0 12 0 }
> > > ED info=80001 { mps=8 en=0 d=1 } tail=620960
> > >   td0 620880 nxt=6209c0 f2000000 setup cbp=620920 be=620927
>  cbp=0 be=620927
> > >   td1 6209c0 nxt=6209e0 f3140000    in cbp=620928 be=620939
>  cbp=0 be=620939
> > >   td2 6209e0 nxt=620960 f3080000   out cbp=62093a be=620939
>  cbp=0 be=620939
> > >    rx { 12 1 0 2 0 0 0 8 f4 46 1 0 0 0 1 2 3 1 }
> > > setup { 80 6 0 2 0 0 0 1 }
> > > ED info=80001 { mps=8 en=0 d=1 } tail=620880
> > >   td0 620960 nxt=6209a0 f2000000 setup cbp=620a20 be=620a27
>  cbp=0 be=620a27
> > >   td1 6209a0 nxt=6209c0 f3140004    in cbp=620a28 be=620b27
>  cbp=620a48 be=620b27
> > >   td2 6209c0 nxt=620880 f3080000   out cbp=620b28 be=620b27
>  cbp=0 be=620b27
> > >    rx { 9 2 20 0 1 1 4 c0 0 9 4 0 0 2 8 6 50 0 7 5 81 2 40 0 0 7 5 2 2
> 40 0 0 }
> > > setup { 0 9 1 0 0 0 0 0 } tx {}
> > > ED info=80001 { mps=8 en=0 d=1 } tail=620900
> > >   td0 620880 nxt=620940 f2000000 setup cbp=620a00 be=620a07
>  cbp=0 be=620a07
> > >   td1 620940 nxt=620900 f3100000    in cbp=620a08 be=620a07
>  cbp=0 be=620a07
> >
> > [1] The OS disk image has been emailed to phi...@linaro.org,
> m...@tls.msk.ru,
> > and kra...@redhat.com:
> >
> > * testCbpOffBy1.img.xz
> > * sha256:
> f87baddcb86de845de12f002c698670a426affb40946025cc32694f9daa3abed
> >
> > Signed-off-by: Cord Amfmgm <dmamf...@gmail.com>
> > ---
> >  hw/usb/hcd-ohci.c   | 4 ++--
> >  hw/usb/trace-events | 1 +
> >  2 files changed, 3 insertions(+), 2 deletions(-)
> >
> > diff --git a/hw/usb/hcd-ohci.c b/hw/usb/hcd-ohci.c
> > index acd6016980..71b54914d3 100644
> > --- a/hw/usb/hcd-ohci.c
> > +++ b/hw/usb/hcd-ohci.c
> > @@ -941,8 +941,8 @@ static int ohci_service_td(OHCIState *ohci, struct
> ohci_ed *ed)
> >          if ((td.cbp & 0xfffff000) != (td.be & 0xfffff000)) {
> >              len = (td.be & 0xfff) + 0x1001 - (td.cbp & 0xfff);
> >          } else {
> > -            if (td.cbp > td.be) {
> > -                trace_usb_ohci_iso_td_bad_cc_overrun(td.cbp, td.be);
> > +            if (td.cbp - 1 > td.be) {  /* rely on td.cbp != 0 */
> > +                trace_usb_ohci_td_bad_buf(td.cbp, td.be);
> >                  ohci_die(ohci);
> >                  return 1;
> >              }
>
> This patch seems to me to do what the commit message sets out to
> do, and it looks unlikely to have any unintended side effects
> because it turns a "USB controller flags an error" case into
> a "treat as zero length packet" case, and I have trouble
> imagining that any guest could be relying on looking for the
> controller error. On that basis I'm inclined to accept it.
>
> What I would like to see is what we could classify under
> "rationale", which is to say "what prompted us to make this
> change?". In my experience it's important to record this
> (including in the commit message). There are of course
> many cases in QEMU's git history where we failed to do that,
> but in general I think it's a good standard to meet. (I
> am also erring on the side of caution in reviewing this
> particular patch, because I don't know the relevant standards
> or this bit of the code very well.)
>
> Why do we care about the motivation for a patch?
>
> (1) In the present: it helps to raise confidence that the
> proposed new behaviour is right. This is good because QEMU's
> test suite is far from comprehensive, so in some sense any
> change to the codebase is a risk.
>
> For instance, if a change is being made because the QNX demo
> floppy doesn't run, then the fact that the change fixes that
> failure-to-run indicates that our interpretation of the
> meaning of the standard, or of what should happen in the
> grey areas that the documentation doesn't clearly describe,
> matches what the QNX driver author (an unrelated third party)
> thought and probably also what a lot of in-the-field hardware
> does (since QNX was no doubt tested on a lot of different PCs
> back in the day).
>
> On the other hand, if a change is proposed because it fixes
> booting with older Linux kernels prior to commit XYZ, and
> kernel commit XYZ turns out to be "make this device driver
> program the hardware according to the specification rather
> than relying on an accident of timing", then we might want
> to look at where we want to be in the tradeoff of "run older
> kernels" versus "put workaround for a guest software issue
> into QEMU". (Workarounds for guest software bugs are something
> I'm very reluctant to put into QEMU, because my experience
> is that once they're in the codebase we can essentially never
> remove them, because we don't know what guest code might
> be relying on them. But sometimes they're a necessary evil.)
>
> (2) In the future: if in a year's time or more, somebody
> reports that a particular commit has regressed some specific
> guest workload they have, knowing why we made the change in
> the first place is really useful in investigating the
> regression.
>
> If we need to change code that was initially added to solve
> a problem when running FreeBSD, we know we need to re-test
> with FreeBSD.
>
> If the change went in to fix a buffer overrun, we know we
> need to be careful and cross-check that we don't reintroduce
> the overrun in the course of fixing a regression.
>
> If a change is one that we made on the grounds of "reading
> the spec and the code, this looked like it was clearly wrong,
> but we don't have a definite repro case of it breaking a guest"
> then that might put "revert the change, we were mistaken" on
> the table as a response to a future regression report.
> And so on.
>
> thanks
> -- PMM
>

Thanks, in responding to that, I'm breaking down my responses into the
following answers:

Q1: (summarizing) What prompted us to make this change?

A1: I'm summarizing what I wrote in previous emails and the commit message -

* Bring qemu closer to actual hw with a neatly packaged test case to
demonstrate the behavior
* I interpret the spec (Table 4-2) as saying the "be = cbp - 1" is valid,
in addition to setting "cbp = 0"
* I interpret it that way due to a comment in an old linux kernel version
in the 2.x range, ohci-hcd.c file. It said (paraphrasing) some misbehaving
ohci controllers would fetch physical memory at 0 when cbp = 0 was in the
Transfer Descriptor

Q2: (summarizing) is the proposed new behaviour right?

A2: Like what Peter Maydell wrote, I believe this turns a "USB controller
flags an error" case into "USB controller treats it as a zero length
packet" case.

I came across this corner case as a result of that comment in an old 2.x
linux kernel. I am not an expert on computer history and old OSes that
might want this behavior.

Q3: (summarizing, take 1) if this patch is called up later and folks are
wondering if there are older OSes that need to be tested?

A3: I do not know of computer history and old OSes where this change is
needed.

qemu up until commit 1328fe0c32d54 ("hw: usb: hcd-ohci: check len and
frame_number variables") allowed this behavior, so it is possible that
older committers wanted "be = cbp - 1" to signify a zero length packet, or
knew of some guest OS that relied on this behavior.

I think there is a non trivial chance some guest OS would rely on this
behavior because:

* using code like "be = cbp + len - 1" is *required* in every Transfer
Descriptor
* ohci Transfer Descriptors can be fragmented by the controller (e.g. if
the Info word sets Max Packet Size = 8)
* ohci Transfer Descriptors have physical memory 4K-page boundary
requirements
* ohci Transfer Descriptors for a zero-length packet are required to
complete certain USB1 transfers (such as the status transfer after an OUT,
or the empty packet during an IN)

in other words, there are just enough simultaneous requirements that a
guest OS might choose to lay out a string of Transfer Descriptors where the
guest OS uses this "be = cbp - 1" for a zero-length transfer.

But I only have a test case I created myself, and am not an expert on
computer history here. I think "be liberal in what you accept, by
conservative in what you send" applies when I don't know which historical
OSes, if any, need this behavior. I think the behavior of actual hardware
weighs more heavily since that *is* available and testable. Are there
additional checks that would expand on what's known about actual ohci hw?

Q3: (summarizing, take 2) if this patch is called up later and folks are
wondering whether it is correct?

A3: In my understanding of the discussion, this thread contains the most
significant discussion of the spec, the implementation and all the details.

In typical internet fashion, we're probably writing what future folks will
look at to answer that question.

In the future, actual hardware will become harder and harder to obtain.

It is not inconceivable that qemu will become the primary source of truth
for "what was ohci behavior?" to future folks.

In my understanding, behaving like actual hardware is a great way to
support the future folks looking for such info.

Reply via email to