Major A wrote:

Make sure you're running a kernel with the kernel hacking "debug memory
allocations" option enabled.  Then when it pauses, copy the "async"
file for that controller and send it to me.  In fact, just make a
copy of every file in the relevant sysfs directory (it'll be something
like /sys/bus/pci/devices/00:09.2).
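
If it helps, the copying step can be scripted. The following is only a
minimal sketch: the function name `snapshot_dir`, the timestamp format,
and the destination layout are all made up for illustration, and the
source directory is whatever sysfs directory your controller actually
uses.

```python
import os
import time


def snapshot_dir(src, dest_root, stamp=None):
    """Copy every readable regular file in src into dest_root/<stamp>/.

    Hypothetical helper: the name and directory layout are illustrative,
    not part of any kernel tooling.
    """
    if stamp is None:
        # Name the snapshot by wall-clock time so it can be matched
        # against timestamps in the log.
        stamp = time.strftime("%H%M%S")
    dest = os.path.join(dest_root, stamp)
    os.makedirs(dest, exist_ok=True)

    copied = []
    for name in sorted(os.listdir(src)):
        path = os.path.join(src, name)
        if not os.path.isfile(path):
            continue  # skip subdirectories
        try:
            with open(path, "rb") as f:
                data = f.read()
        except OSError:
            continue  # some sysfs attributes are write-only or unreadable
        with open(os.path.join(dest, name), "wb") as out:
            out.write(data)
        copied.append(name)
    return dest, copied
```

Calling it several times in a row while the transfer is paused gives a
series of directories that can be compared file by file.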


Done that -- I've attached a complete log and a tarball of the sysfs
copies, each a directory with the time in its name so you can match it
against the log.

Thanks. Some of this does look like hardware/firmware issues.


* The first problem seemed to be at about 38:37, where the EHCI
  async schedule looks completely normal ... _unless_ the "nak3"
  values in the ep1in qh really are _not_ changing. Those values
  can be recycled very quickly; two samples are rarely enough to
  show changes, and four or five samples are better.

  If they're not changing, then the EHCI silicon is for some reason
  not polling for I/O.  One way that might happen is if some
  "short-read" logic misfired -- patch in the works.

* The second problem was at 38:42, where something called some
  usbcore code which complains "control/bulk timeout".

  This is curious; it doesn't seem to be associated with an EHCI
  device (no such request in the async schedule snapshots).  If
  it's some other host controller, that may suggest some sort of
  electrical interference happening (why?).  If you're using
  current code, usb-storage isn't making such calls any more...

* The third problem was at about 39:06, where the ep0out control
  transfer was clearly misbehaving.  Its STATUS stage (IN) was
  just waiting, doing nothing ... but the SETUP stage had worked,
  as well as any DATA stage (both OUT packets).

  Notice a pattern emerging:  returning status IN to the host
  isn't working.  But we know the OUT direction worked.  This
  is on ep0out, vs the others being on ep1in.

  No, devices are not allowed to make ep0 control requests
  wait until something else completes.

* Then at about 39:29 the TEST_UNIT_READY command (OUT) was
  accepted but the response (IN) timed out and was aborted.
  Only the one snapshot; can't really see if it was NAKing.

  I'd hope that such requests wouldn't be accepted by hardware
  that's not ready to handle them, but unfortunately it seems
  some developers prefer to leave their OUT FIFOs in "accept
  packets till full" mode ... maybe that's what's happening.

* Fifth, at 39:34 two things happened:  (a) bulk reset requested,
  and (b) more "control/bulk timeout" messages right away.

  Now (a) timed out at 39:54, as with problem #3:  there's one
  snapshot (39:37) showing the same symptom, IN transfer not
  happening ... same as 40:02 and 40:10, which show NAKing.

  That is, it looks like the device really isn't responding.
  And on the other hand (b) is the same as problem #2.
  Both of those are rather suspicious.
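
For the question raised under the first problem (are the "nak" values
really not changing?), a rough way to check several snapshots of the
"async" file at once is sketched below. It assumes each qh line carries
an endpoint token such as "ep1in" and a counter written as "nak"
followed by digits; that line layout, and the names `nak_values` and
`nak_changed`, are assumptions for illustration, so adjust the pattern
to whatever your kernel actually prints.

```python
import re

# Assumed format: the counter appears as "nak<digits>" on the qh line.
NAK_RE = re.compile(r"\bnak(\d+)\b")


def nak_values(text, ep_token):
    """Return the nak counters found on lines mentioning ep_token."""
    values = []
    for line in text.splitlines():
        if ep_token in line:
            match = NAK_RE.search(line)
            if match:
                values.append(int(match.group(1)))
    return values


def nak_changed(samples, ep_token):
    """True if the nak counters for ep_token differ between any samples.

    samples is a list of "async" file contents, one string per snapshot.
    """
    seen = {tuple(nak_values(text, ep_token)) for text in samples}
    return len(seen) > 1
```

With four or five samples, a False result from `nak_changed(samples,
"ep1in")` would be consistent with the controller not polling that
endpoint, though it cannot prove it, since a counter that recycles
quickly could land on the same value each time.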

So, two electrically suspicious issues (#2, #5b), two cases
where it seems the device isn't handling control requests
legally (#3, #5a), and what seems like a pattern of IN
transfers not behaving correctly on _any_ endpoint once
the first problem (#1) appears.

That's my interpretation of your data, at any rate.

- Dave






