On Thu, 2015-01-08 at 14:36 -0800, Roland Dreier wrote:
> Hi,
>
> So we've managed to find one magic setup and workload that reproduces
> this reliably, and I've done a bit more debugging that leads me to
> believe we probably need Intel's help to really get to the bottom of
> this.

Can you share your test case?  My test case was also using vfio,
repeatedly booting a pair of VMs with assigned GPUs, but I haven't seen
any sign of the issue without NET_DMA.  I agree that it seems like the
hardware is getting somewhere that it shouldn't regardless of the
software, but are you able to reproduce on a more recent kernel?  I've
also seen a report of the same problem without vfio, triggered by
manipulating IRQ affinity, but it was not a reliable test case.

> One question though: you mentioned that you saw this behavior until
> you turned off CONFIG_NET_DMA.  What platform was that on?  We see
> this on dual socket Xeon E5 v3 (Haswell EP / Grantley), and I don't
> really have any other setup I can try.  Did you see this on other
> platforms (Ivy Bridge / Romley maybe)?

I saw it on a dual socket Xeon E5 v2 (Ivy Bridge EP / Patsburg), and
the other report I mentioned above was on the same platform, though on
a different system.

> Anyway, I added the debugging patch at the end of this mail to our
> kernel to dump some status when the driver detects a hung queue.
> Below we see some example output.  Things I notice:
>
>  - IQH == IQT, in other words the QI hardware thinks it has picked up
>    and validated all the descriptors submitted by software.
>  - The queue has successfully processed many operations (although
>    everything that we can see succeeded is type 2h, ie "IOTLB
>    invalidate" as opposed to the type 4h "Interrupt Entry Cache
>    invalidate" that we're stuck on).
>  - There haven't been any faults and FSTS is clear.
>  - The hardware hasn't executed the "Invalidation wait" descriptor.

This is all consistent with my observations as well.

> Beyond that I'm not sure how to make much more progress without
> insight into the hardware -- it looks like the driver is doing
> everything right.  Anyone from Intel have any thoughts?

Agreed, the hardware appears to be getting wedged without any sort of
fault indication or recovery mechanism.
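
The four observations in the quoted list can be combined into a single
check on a register dump.  What follows is only a minimal sketch, not
the debugging patch referred to above: the names QiSnapshot, desc_type
and classify are hypothetical, and the bit positions and descriptor
type codes are assumptions taken from the VT-d definitions as mirrored
in Linux's intel-iommu.h (16-byte descriptors, so head/tail are shifted
by 4; IQE/ICE/ITE in FSTS bits 4-6; type in the low 4 bits of the
descriptor).

```python
from dataclasses import dataclass

# Assumed layout, mirroring the definitions in Linux's intel-iommu.h.
IQ_SHIFT = 4          # IQH/IQT hold a byte offset; descriptors are 16 bytes
IQ_INDEX_MASK = 0x7FFF  # queue index lives in bits 18:4 of IQH/IQT
FSTS_IQE = 1 << 4     # Invalidation Queue Error
FSTS_ICE = 1 << 5     # Invalidation Completion Error
FSTS_ITE = 1 << 6     # Invalidation Time-out Error

# Descriptor type codes (low 4 bits of the descriptor's first qword).
QI_TYPE_NAMES = {
    0x1: "context-cache invalidate",
    0x2: "IOTLB invalidate",
    0x3: "device-TLB invalidate",
    0x4: "interrupt entry cache invalidate",
    0x5: "invalidation wait",
}


@dataclass
class QiSnapshot:
    """Hypothetical container for one dump of the queue state."""
    iqh: int          # raw Invalidation Queue Head register value
    iqt: int          # raw Invalidation Queue Tail register value
    fsts: int         # raw Fault Status register value
    wait_done: bool   # did the wait descriptor's status write land?


def desc_type(desc_low: int) -> str:
    """Name the descriptor type encoded in the low 4 bits."""
    return QI_TYPE_NAMES.get(desc_low & 0xF, "unknown")


def classify(snap: QiSnapshot) -> str:
    """Classify a snapshot as 'fault', 'complete', 'pending' or 'wedged'."""
    if snap.fsts & (FSTS_IQE | FSTS_ICE | FSTS_ITE):
        return "fault"      # hardware reported an invalidation error
    if snap.wait_done:
        return "complete"   # wait descriptor executed, nothing is stuck
    head = (snap.iqh >> IQ_SHIFT) & IQ_INDEX_MASK
    tail = (snap.iqt >> IQ_SHIFT) & IQ_INDEX_MASK
    if head != tail:
        return "pending"    # hardware has not fetched everything yet
    # Everything fetched, no fault flagged, yet the wait never completed.
    return "wedged"
```

A snapshot matching the quoted dump (head equal to tail, FSTS clear,
wait descriptor not completed) falls through to "wedged", which is the
case that gives software no fault indication to act on.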

Thanks,
Alex

_______________________________________________
iommu mailing list
[email protected]
https://lists.linuxfoundation.org/mailman/listinfo/iommu
