On Monday 08 January 2007 5:55 am, Phil Endecott wrote: > Test 0-9 and 11-13 pass. Test 14 fails: > > # strace ./testusb -D /proc/bus/usb/004/010 -t14 -c 15000 -s 256 -v 1 > .. > open("/proc/bus/usb/004/010", O_RDWR) = 4 > ioctl(4, USBDEVFS_IOCTL, 0xbf8df42c) = -1 EOPNOTSUPP (Operation not > supported) > > I thought this had worked before, but maybe I am getting confused with > my g_zero testing; I believe it's not expected to work with gadgetfs > and usb.c, is this right?
Right, that sample program doesn't implement the control requests needed to support test #14. That combo is not expected to work. (Which is not to say it couldn't be _made_ to work on some platforms.) > Anyway the current issue is test 10, which fails more often than not. > Perhaps it passes one time in three. That's not quite what I understood from you before. Intermittent failure means that some race has opened up. The point of test 10 is exactly to turn up such races, by issuing lots of back-to-back requests. It's rather perversely good at doing that. :) I tried this on an at91 system would say that it worked a bit more often than not, but it would indeed fail. And based on what you show below, it looks like a similar failure. > I have tried to spot a pattern; > maybe it is more likely to work when usb.c is first started, more more > likely to fail later. It has certainly been known to fail on the first > run though. Here are the symptoms when it fails. [Note that the > output below is not all from the same run.] > > # ./testusb -D /proc/bus/usb/004/010 -t10 > (Never completes, no output) > dmesg: > > Jan 8 12:37:38 egypt kernel: usbtest 4-4:3.0: TEST 10: queue 32 > control calls, 1000 times > (no further output) Actually if you unplug the device at that point you should get a short burst of additional diagnostic output, saying among other useful data just which subtest failed. Also, unless this is with UHCI, enabling USB debugging will produce files /sys/class/usb_host/usb_hostN/async showing the schedule of USB transactions. What you could do is snapshot that file when this hang happens ... the top few entries for the endpoint for this test will be informative, showing exactly which subtest saw trouble. (In the case of test #1, the actual failure would likely have been triggered by the preceding request, or maybe the one before that...) I'm not sure what to make of this usbmon stuff. All the urbs are submitted into a queue, so that the responses look like they're out of order. What's needed is a bit of postprocessing to associate the responses with the requests. > c7848500 33910869 S Ci:024:00 s 80 06 0400 0000 0009 9 < > c7848260 33911110 C Co:024:00 0 0 subtest #7 (by inference; should have stalled). I don't see a response for c7848500, or a request for c7848260. > c7848260 33911112 S Co:024:00 s 02 01 0000 0000 0000 0 > c7848380 33911117 C Ci:024:00 0 2 = 0000 subtest #8 (by inference; should have stalled) > c7848380 33911118 S Ci:024:00 s 82 00 0000 0000 0002 2 < > c7848560 33911124 C Ci:024:00 -121 39 = 09022700 010304c0 01090400 > 0003ff00 0005 > 0705 87020002 00070503 02000201 subtest #9 (by inference, the next one being #10) > c7848560 33911125 S Ci:024:00 s 80 06 0200 0000 0400 1024 < > c78481a0 33911235 C Ci:024:00 -32 0 This "S" was obviously subtest #10, but the STALL response coming back for c78481a0 doesn't match any URB submitted in the subset of this log you included. It's clear that after that STALL was issued, something got confused in terms of handling the next request ... what ever it was. This is not uncommon ... fault paths are trouble, and test #10 makes a point of triggering them. It even makes a point of triggering several fault paths in a row, so that bugs in cleanup of one are able to trigger additional bugs when processing later ones. :) > c78481a0 33911236 S Ci:024:00 s 80 06 0500 0000 0009 9 < > > > On the device end, usb -v output is: > > /dev/gadget/net2280 ep0 configured > serial="13hz78wkvlc3179o5r9bj6rbm28wotubury5m1jhsqxas9xbceylqit6yvj0tqx" > CONNECT high speed > DISCONNECT > read 2 ep0 events > CONNECT high speed > SETUP 80.06 v0300 i0000 255 > SETUP 80.06 v0302 i0409 255 > SETUP 80.06 v0301 i0409 255 > SETUP 80.06 v0303 i0409 255 > SETUP 00.09 v0003 i0000 0 > CONFIG #3 > simple_source_thread start -1209627728 fd 4 > simple_sink_thread start -1218020432 fd 5 > SETUP 80.06 v0304 i0409 255 > SETUP 80.06 v0305 i0409 255 Those are GET_DESCRIPTOR requests for strings 4 and 5 in US-English. > Then these two lines are repeated: > SETUP 81.0a v0000 i0000 1 Which is a GET_INTERFACE request; that gets bounced up to userspace, unlike most of the test10 requests (which get handled directly in the kernel). > SETUP 80.06 v0300 i0000 9 Which is a GET_DESCRIPTOR request for string zero > > In one failing case the lines were output 1312 times, in another case > 1495 times, then no further output. In another case they were output > only once. When the test passes they are output 2000 times. Normally > the last line to be output is the 80.06 line, but I have seen a case > where it was the other one. Test #10 issues a set of about 15 requests in loop, a few thousand times. So each of those messages should appear once per loop. > I have tried strace -f on usb.c and its output finishes like this: > > [pid 2177] open("ep-a", O_RDWR <unfinished ...> > [pid 2178] open("ep-b", O_RDWR <unfinished ...> > [pid 2177] <... open resumed> ) = 4 > [pid 2178] <... open resumed> ) = 5 > [pid 2177] write(4, "[EMAIL PROTECTED]", 18 > <unfinished ...> > [pid 2178] write(5, "[EMAIL PROTECTED]", 18 > <unfinished ...> > [pid 2177] <... write resumed> ) = 18 > [pid 2178] <... write resumed> ) = 18 Configured the endpoints; context switching in the middle. > [pid 2177] write(4, > "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 7168 > <unfinished ...> > [pid 2178] read(5, <unfinished ...> > [pid 2176] <... sched_yield resumed> ) = 0 > [pid 2176] rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 > [pid 2176] read(3, "", 0) = 0 > [pid 2176] --- SIGIO (I/O possible) @ 0 (0) --- > [pid 2176] rt_sigreturn(0x3) = 0 > [pid 2176] rt_sigtimedwait(~[RTMIN RT_1], 0, 0, 8) = 29 > [pid 2176] read(3, "\3\0\0\0\200\6\4\3\t\4\2\0", 60) = 12 > [pid 2176] write(3, "$\3", 2) = 2 > [pid 2176] --- SIGIO (I/O possible) @ 0 (0) --- > [pid 2176] rt_sigreturn(0x3) = 2 > [pid 2176] rt_sigtimedwait(~[RTMIN RT_1], 0, 0, 8) = 29 > [pid 2176] read(3, "\3\0\0\0\200\6\5\3\t\4\377\0", 60) = 12 > [pid 2176] write(3, "\30\3S\0o\0u\0r\0c\0e\0/\0S\0i\0n\0k\0", 24) = 24 That was writing a control response ... 24 bytes, looks like subcase #14, triggering a host-side short read. That'd be the GET_DESCRIPTOR case. > [pid 2176] rt_sigtimedwait(~[RTMIN RT_1], 0, 0, 8) = 29 > [pid 2176] read(3, "\3\0\0\0\201\n\0\0\0\0\1\0", 60) = 12 > [pid 2176] write(3, "\0", 1) = 1 This would seem to be the GET_INTERFACE case. > [pid 2176] --- SIGIO (I/O possible) @ 0 (0) --- > [pid 2176] rt_sigreturn(0x3) = 1 > [pid 2176] rt_sigtimedwait(~[RTMIN RT_1], <unfinished ...> OK, that's interesting. The failure I looked at on at91 was also the GET_INTERFACE case (as I deduced by looking at the host side's async message queue) ... and in that case, the byte never got sent on the wire, since the host never saw it. That's suggestive that the particular race in question is between subtests 1 (gadgetfs returns config descriptor) and 2 (GET_INTERFACE). Or at least _one_ of the races ... I'd guess there's a similar one on the other request. > I have also tessted with as823.patch reverted. In this case I get: > > # ./testusb -D /proc/bus/usb/004/032 -t10 > unknown speed /proc/bus/usb/004/032 > /proc/bus/usb/004/032 test 10 --> 121 (error 121) > > Is this expected? Yes. > > Let me know what else I can do. I'm thinking the failure is probably a race in how gadgetfs cleans up after the (two) control requests handled in userspace. You could verify that hypothesis by collecting information about exactly which requests fail (do the unplug, see what subtest dmesg says failed, say if it's not one of the two we know gadgetfs handles), and how they're failing (host side, does 'async' queue also show the response data never arrived?). What I'm going to do is look a bit more at what gadgetfs does when it issues a control-IN response. - Dave > > Cheers, > > Phil. > > > > > > > ------------------------------------------------------------------------- > Take Surveys. Earn Cash. Influence the Future of IT > Join SourceForge.net's Techsay panel and you'll get the chance to share your > opinions on IT & business topics through brief surveys - and earn cash > http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV > _______________________________________________ > linux-usb-devel@lists.sourceforge.net > To unsubscribe, use the last form field at: > https://lists.sourceforge.net/lists/listinfo/linux-usb-devel > ------------------------------------------------------------------------- Take Surveys. Earn Cash. Influence the Future of IT Join SourceForge.net's Techsay panel and you'll get the chance to share your opinions on IT & business topics through brief surveys - and earn cash http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV _______________________________________________ linux-usb-devel@lists.sourceforge.net To unsubscribe, use the last form field at: https://lists.sourceforge.net/lists/listinfo/linux-usb-devel