On Mon, 2021-11-29 at 23:48 +0000, Eugene Bordenkircher wrote:
> Agreed,
> 
> We are happy to pick up the torch on this, but I'd like to try and hear from 
> Joakim first before we do.  The patch set is his, so I'd like to give him the 
> opportunity.  I think he's the only one that can add a truly proper 
> description as well because he mentioned that this includes a "few more 
> fixes" than just the one we ran into.  I'd rather hear from him than try to 
> reverse engineer what was being addressed.  
> 
> Joakim, if you are still watching the thread, would you like to take a stab 
> at it?  If I don't hear from you in a couple days, we'll pick up the torch 
> and do what we can.
> 

I am far away from this now and still on 4.19. I don't mind if you tweak 
the patches for better "upstreamability".

  Regards
           Joakim

> Eugene T. Bordenkircher
> 
> -----Original Message-----
> From: Leo Li <leoyang...@nxp.com> 
> Sent: Monday, November 29, 2021 3:37 PM
> To: Eugene Bordenkircher <eugene_bordenkirc...@selinc.com>; Thorsten Leemhuis 
> <regressi...@leemhuis.info>; jo...@infinera.com 
> <joakim.tjernl...@infinera.com>; linuxppc-dev@lists.ozlabs.org; 
> linux-...@vger.kernel.org
> Cc: gre...@linuxfoundation.org; ba...@kernel.org
> Subject: RE: bug: usb: gadget: FSL_UDC_CORE Corrupted request list leads to 
> unrecoverable loop.
> 
> > -----Original Message-----
> > From: Eugene Bordenkircher <eugene_bordenkirc...@selinc.com>
> > Sent: Monday, November 29, 2021 11:25 AM
> > To: Thorsten Leemhuis <regressi...@leemhuis.info>; jo...@infinera.com 
> > <joakim.tjernl...@infinera.com>; linuxppc-dev@lists.ozlabs.org; linux- 
> > u...@vger.kernel.org
> > Cc: Leo Li <leoyang...@nxp.com>; gre...@linuxfoundation.org; 
> > ba...@kernel.org
> > Subject: RE: bug: usb: gadget: FSL_UDC_CORE Corrupted request list 
> > leads to unrecoverable loop.
> > 
> > The final result of our testing is that the patch set posted seems to 
> > address all known defects in the Linux kernel.  The mentioned 
> > additional problems are entirely caused by the antivirus solution on 
> > the windows box.  The antivirus solution blocks the disconnect 
> > messages from reaching the RNDIS driver so it has no idea the USB 
> > device went away.  There is nothing we can do to address this in the Linux 
> > kernel.
> 
> Thanks for the confirmation.
> 
> > 
> > I propose we move forward with the patchset.
> 
> I think that we should proceed to merge the patchset, but it seems to need 
> some cleanup for coding style issues and a better description before being 
> submitted formally.
> 
> > 
> > Eugene T. Bordenkircher
> > 
> > -----Original Message-----
> > From: Thorsten Leemhuis <regressi...@leemhuis.info>
> > Sent: Thursday, November 25, 2021 5:59 AM
> > To: Eugene Bordenkircher <eugene_bordenkirc...@selinc.com>; Thorsten 
> > Leemhuis <regressi...@leemhuis.info>; Joakim Tjernlund 
> > <joakim.tjernl...@infinera.com>; linuxppc-dev@lists.ozlabs.org; linux- 
> > u...@vger.kernel.org
> > Cc: leoyang...@nxp.com; gre...@linuxfoundation.org; ba...@kernel.org
> > Subject: Re: bug: usb: gadget: FSL_UDC_CORE Corrupted request list 
> > leads to unrecoverable loop.
> > 
> > Hi, this is your Linux kernel regression tracker speaking.
> > 
> > Top-posting for once, to make this easy to process for everyone:
> > 
> > Li Yang and Felipe Balbi: how to move on with this? It's quite an old 
> > regression, but nevertheless it is one and thus should be fixed. Part 
> > of my position is to make that happen and thus remind developers and 
> > maintainers about this until the regression is resolved.
> > 
> > Ciao, Thorsten
> > 
> > On 16.11.21 20:11, Eugene Bordenkircher wrote:
> > > On 02.11.21 22:15, Joakim Tjernlund wrote:
> > > > On Sat, 2021-10-30 at 14:20 +0000, Joakim Tjernlund wrote:
> > > > > On Fri, 2021-10-29 at 17:14 +0000, Eugene Bordenkircher wrote:
> > > > 
> > > > > > We've discovered a situation where the FSL udc driver
> > > > > > (drivers/usb/gadget/udc/fsl_udc_core.c) will enter a loop iterating
> > > > > > over the request queue, but the queue has been corrupted at some
> > > > > > point so it loops infinitely.  I believe we have narrowed into the
> > > > > > offending code, but we are in need of assistance trying to find an
> > > > > > appropriate fix for the problem.  The identified code appears to be
> > > > > > in all versions of the Linux kernel the driver exists in.
> > > > > > 
> > > > > > The problem appears to be when handling a USB_REQ_GET_STATUS
> > > > > > request.  The driver gets this request and then calls the
> > > > > > ch9getstatus() function.  In this function, it starts a request by
> > > > > > "borrowing" the per device status_req, filling it in, and then
> > > > > > queuing it with a call to list_add_tail() to add the request to the
> > > > > > endpoint queue.  Right before it exits the function however, it's
> > > > > > calling ep0_prime_status(), which is filling out that same
> > > > > > status_req structure and then queuing it with another call to
> > > > > > list_add_tail() to add the request to the endpoint queue.  This adds
> > > > > > two instances of the exact same LIST_HEAD to the endpoint queue,
> > > > > > which breaks the list since the prev and next pointers end up
> > > > > > pointing to the wrong things.  This ends up causing a hard loop the
> > > > > > next time nuke() gets called, which happens on the next setup IRQ.
> > > > > > 
> > > > > > I'm not sure what the appropriate fix to this problem is, mostly
> > > > > > due to my lack of expertise in USB and this driver stack.  The code
> > > > > > has been this way in the kernel for a very long time, which suggests
> > > > > > that it has been working, unless USB_REQ_GET_STATUS requests are
> > > > > > never made.  This further suggests that there is something else
> > > > > > going on that I don't understand.  Deleting the call to
> > > > > > ep0_prime_status() and the following ep0stall() call appears, on the
> > > > > > surface, to get the device working again, but may have side effects
> > > > > > that I'm not seeing.
> > > > > > 
> > > > > > I'm hopeful someone in the community can help provide some
> > > > > > information on what I may be missing or help come up with a solution
> > > > > > to the problem.  A big thank you to anyone who would like to help
> > > > > > out.
> > > > > 
> > > > > Ran into this too a while ago. Found the bug and a few more fixes.
> > > > > This is against 4.19 so you may have to tweak them a bit.
> > > > > Feel free to upstream them.
> > > > 
> > > > Curious, did my patches help? Good to know once we upgrade as well.
> > > 
> > > There's good news and bad news.
> > > 
> > > The good news is that this appears to stop the driver from entering 
> > > an infinite loop, which prevents the Linux system from locking up 
> > > and never recovering.  So I'm willing to say we've made the behavior 
> > > better.
> > > 
> > > The bad news is that once we get past this point, there is new bad 
> > > behavior.  What is on top of this driver in our system is the RNDIS 
> > > gadget driver communicating to a Laptop running Win10 -1809.
> > > Everything appears to work fine with the Linux system until there is 
> > > a USB disconnect.  After the disconnect, the Linux side appears to 
> > > continue on just fine, but the Windows side doesn't seem to 
> > > recognize the disconnect, which causes the USB driver on that side 
> > > to hang forever and eventually blue screen the box.  This doesn't 
> > > happen on all machines, just a select few.  I think we can isolate the
> > > behavior to a specific antivirus/security software driver that is 
> > > inserting itself into the USB stack and filtering the disconnect 
> > > message, but we're still proving that.
> > > 
> > > I'm about 90% certain this is a different problem and we can call 
> > > this patchset good, at least for our test setup.  My only hesitation 
> > > is if the Linux side is sending a set of responses that are 
> > > confusing the Windows side (specifically this antivirus) or not.  
> > > I'd be content calling that a separate defect though and letting 
> > > this one close up with that patchset.
> > 
> > P.S.: As a Linux kernel regression tracker I'm getting a lot of 
> > reports on my table. I can only look briefly into most of them. 
> > Unfortunately therefore I sometimes will get things wrong or miss something 
> > important.
> > I hope that's not the case here; if you think it is, don't hesitate to 
> > tell me about it in a public reply. That's in everyone's interest, as 
> > what I wrote above might be misleading to everyone reading this; any 
> > suggestion I gave thus might send someone reading this down the 
> > wrong rabbit hole, which none of us wants.
> > 
> > BTW, I have no personal interest in this issue, which is tracked using 
> > regzbot, my Linux kernel regression tracking bot 
> > (https://linux-regtracking.leemhuis.info/regzbot/). I'm only 
> > posting this mail to get things rolling again and hence don't need to 
> > be CCed on all further activities wrt this regression.
> > 
> > #regzbot title: usb: fsl_udc_core: corrupted request list leads to 
> > unrecoverable loop
