On Mon, 2021-11-29 at 23:48 +0000, Eugene Bordenkircher wrote:
> Agreed,
>
> We are happy pick up the torch on this, but I'd like to try and hear from
> Joakim first before we do. The patch set is his, so I'd like to give him the
> opportunity. I think he's the only one that can add a truly proper
> description as well because he mentioned that this includes a "few more
> fixes" than just the one we ran into. I'd rather hear from him than try to
> reverse engineer what was being addressed.
>
> Joakim, if you are still watching the thread, would you like to take a stab
> at it? If I don't hear from you in a couple days, we'll pick up the torch
> and do what we can.
>
I am far away from this now and still on 4.19. I don't mind if you tweak tweak
the patches for better "upstreamability"
Regards
Joakim
> Eugene T. Bordenkircher
>
> -----Original Message-----
> From: Leo Li <[email protected]>
> Sent: Monday, November 29, 2021 3:37 PM
> To: Eugene Bordenkircher <[email protected]>; Thorsten Leemhuis
> <[email protected]>; [email protected]
> <[email protected]>; [email protected];
> [email protected]
> Cc: [email protected]; [email protected]
> Subject: RE: bug: usb: gadget: FSL_UDC_CORE Corrupted request list leads to
> unrecoverable loop.
>
> [Caution - External]
>
> > -----Original Message-----
> > From: Eugene Bordenkircher <[email protected]>
> > Sent: Monday, November 29, 2021 11:25 AM
> > To: Thorsten Leemhuis <[email protected]>; [email protected]
> > <[email protected]>; [email protected]; linux-
> > [email protected]
> > Cc: Leo Li <[email protected]>; [email protected];
> > [email protected]
> > Subject: RE: bug: usb: gadget: FSL_UDC_CORE Corrupted request list
> > leads to unrecoverable loop.
> >
> > The final result of our testing is that the patch set posted seems to
> > address all known defects in the Linux kernel. The mentioned
> > additional problems are entirely caused by the antivirus solution on
> > the windows box. The antivirus solution blocks the disconnect
> > messages from reaching the RNDIS driver so it has no idea the USB
> > device went away. There is nothing we can do to address this in the Linux
> > kernel.
>
> Thanks for the confirmation.
>
> >
> > I propose we move forward with the patchset.
>
> I think that we should proceed to merge the patchset but it seems to need
> some cleanup for coding style issues and better description before submitted
> formally.
>
> >
> > Eugene T. Bordenkircher
> >
> > -----Original Message-----
> > From: Thorsten Leemhuis <[email protected]>
> > Sent: Thursday, November 25, 2021 5:59 AM
> > To: Eugene Bordenkircher <[email protected]>; Thorsten
> > Leemhuis <[email protected]>; Joakim Tjernlund
> > <[email protected]>; [email protected]; linux-
> > [email protected]
> > Cc: [email protected]; [email protected]; [email protected]
> > Subject: Re: bug: usb: gadget: FSL_UDC_CORE Corrupted request list
> > leads to unrecoverable loop.
> >
> > Hi, this is your Linux kernel regression tracker speaking.
> >
> > Top-posting for once, to make this easy to process for everyone:
> >
> > Li Yang and Felipe Balbi: how to move on with this? It's quite an old
> > regression, but nevertheless it is one and thus should be fixed. Part
> > of my position is to make that happen and thus remind developers and
> > maintainers about this until the regression is resolved.
> >
> > Ciao, Thorsten
> >
> > On 16.11.21 20:11, Eugene Bordenkircher wrote:
> > > On 02.11.21 22:15, Joakim Tjernlund wrote:
> > > > On Sat, 2021-10-30 at 14:20 +0000, Joakim Tjernlund wrote:
> > > > > On Fri, 2021-10-29 at 17:14 +0000, Eugene Bordenkircher wrote:
> > > >
> > > > > > We've discovered a situation where the FSL udc driver
> > (drivers/usb/gadget/udc/fsl_udc_core.c) will enter a loop iterating
> > over the request queue, but the queue has been corrupted at some point
> > so it loops infinitely. I believe we have narrowed into the offending
> > code, but we are in need of assistance trying to find an appropriate
> > fix for the problem. The identified code appears to be in all
> > versions of the Linux kernel the driver exists in.
> > > > > >
> > > > > > The problem appears to be when handling a USB_REQ_GET_STATUS
> > request. The driver gets this request and then calls the
> > ch9getstatus() function. In this function, it starts a request by
> > "borrowing" the per device status_req, filling it in, and then queuing
> > it with a call to list_add_tail() to add the request to the endpoint
> > queue. Right before it exits the function however, it's calling
> > ep0_prime_status(), which is filling out that same status_req
> > structure and then queuing it with another call to list_add_tail() to
> > add the request to the endpoint queue. This adds two instances of the
> > exact same LIST_HEAD to the endpoint queue, which breaks the list
> > since the prev and next pointers end up pointing to the wrong things.
> > This ends up causing a hard loop the next time nuke() gets called, which
> > happens on the next setup IRQ.
> > > > > >
> > > > > > I'm not sure what the appropriate fix to this problem is, mostly
> > > > > > due to
> > my lack of expertise in USB and this driver stack. The code has been
> > this way in the kernel for a very long time, which suggests that it
> > has been working, unless USB_REQ_GET_STATUS requests are never made.
> > This further suggests that there is something else going on that I don't
> > understand.
> > Deleting the call to ep0_prime_status() and the following ep0stall()
> > call appears, on the surface, to get the device working again, but may
> > have side effects that I'm not seeing.
> > > > > >
> > > > > > I'm hopeful someone in the community can help provide some
> > information on what I may be missing or help come up with a solution
> > to the problem. A big thank you to anyone who would like to help out.
> > > > >
> > > > > Run into this to a while ago. Found the bug and a few more fixes.
> > > > > This is against 4.19 so you may have to tweak them a bit.
> > > > > Feel free to upstream them.
> > > >
> > > > Curious, did my patches help? Good to known once we upgrade as well.
> > >
> > > There's good news and bad news.
> > >
> > > The good news is that this appears to stop the driver from entering
> > > an infinite loop, which prevents the Linux system from locking up
> > > and never recovering. So I'm willing to say we've made the behavior
> > > better.
> > >
> > > The bad news is that once we get past this point, there is new bad
> > > behavior. What is on top of this driver in our system is the RNDIS
> > > gadget driver communicating to a Laptop running Win10 -1809.
> > > Everything appears to work fine with the Linux system until there is
> > > a USB disconnect. After the disconnect, the Linux side appears to
> > > continue on just fine, but the Windows side doesn't seem to
> > > recognize the disconnect, which causes the USB driver on that side
> > > to hang forever and eventually blue screen the box. This doesn't happen
> > > on
> > > all machines, just a select few. I think we can isolate the
> > > behavior to a specific antivirus/security software driver that is
> > > inserting itself into the USB stack and filtering the disconnect
> > > message, but we're still proving that.
> > >
> > > I'm about 90% certain this is a different problem and we can call
> > > this patchset good, at least for our test setup. My only hesitation
> > > is if the Linux side is sending a set of responses that are
> > > confusing the Windows side (specifically this antivirus) or not.
> > > I'd be content calling that a separate defect though and letting
> > > this one close up with that patchset.
> >
> > P.S.: As a Linux kernel regression tracker I'm getting a lot of
> > reports on my table. I can only look briefly into most of them.
> > Unfortunately therefore I sometimes will get things wrong or miss something
> > important.
> > I hope that's not the case here; if you think it is, don't hesitate to
> > tell me about it in a public reply. That's in everyone's interest, as
> > what I wrote above might be misleading to everyone reading this; any
> > suggestion I gave they thus might sent someone reading this down the
> > wrong rabbit hole, which none of us wants.
> >
> > BTW, I have no personal interest in this issue, which is tracked using
> > regzbot, my Linux kernel regression tracking bot
> > (https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Furldefense.com%2Fv3%2F__https%3A%2F%2Feur01.safelinks.protection.outloo&data=04%7C01%7Cjoakim.tjernlund%40infinera.com%7Cb302ff817a8f4b3184c408d9b392bd1c%7C285643de5f5b4b03a1530ae2dc8aaf77%7C1%7C0%7C637738265108962168%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=D2TOGHLaeLnnmbJQG5VEY3CQ66GKtkpBOkFZ16WeW%2F4%3D&reserved=0
> > k.com/?url=https*3A*2F*2Furld__;JSUl!!O7uE89YCNVw!a6nsIMfn544OIzmshw3H
> > bMBVcbwor4cV2Q5OsST7-86jy_YZKvDsN-558Ris4wh8Zawz4puN$
> > efense.com%2Fv3%2F__https%3A%2F%2Flinux-
> > regtracking.leemhuis.info%2Fregzbot%2F__%3B!!O7uE89YCNVw!aHa5_mLM
> > nBeDjINlAtV19tBHm-
> > He9jbusXucMA5h7oonHvNFwYpOHAaaqqewPOuGK9HAzJUz%24&data
> > =04%7C01%7Cleoyang.li%40nxp.com%7C859ce1560a7344729cea08d9b35d2e
> > 67%7C686ea1d3bc2b4c6fa92cd99c5c301635%7C0%7C0%7C6377380350721308
> > 84%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luM
> > zIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=ONQZyAKXNgok
> > 6LgYvnaAL7LVY%2B5Wl7pXglZDqWUJZMc%3D&reserved=0 ). I'm only
> > posting this mail to get things rolling again and hence don't need to
> > be CC on all further activities wrt to this regression.
> >
> > #regzbot title: usb: fsl_udc_core: corrupted request list leads to
> > unrecoverable loop