Re: [PATCH] iommu/vt-d: Ratelimit fault handler

2016-03-19 Thread Alex Williamson
On Tue, 15 Mar 2016 19:47:56 +
David Woodhouse  wrote:

> On Tue, 2016-03-15 at 10:35 -0600, Alex Williamson wrote:
> > Fault rates can easily overwhelm the console and make the system
> > unresponsive.  Ratelimit to allow an opportunity for maintenance.
> > 
> > Signed-off-by: Alex Williamson   
> 
> Rather than just rate-limiting the printk, I'd prefer to handle this
> explicitly. There's a bit in the context-entry which can tell the IOMMU
> not to bother raising an interrupt at all. And then we can re-enable it
> if/when the driver recovers the device. (Or perhaps just when it next
> does a mapping).

Seems like we need to keep statistics per context entry for that, are
you prepared for that sort of overhead?  IME, a device that's spewing
faults at this rate is broken to the point where it needs to be removed
from the system or is actively being tested and debugged for driver or
assignment work.  In those case, I think we want to keep reminding the
user that something is very wrong and it probably explains why the
device isn't working properly.  If the device is using the DMA API,
maybe clearing FPD on each mapping event is a way to do that, but an
IOMMU API managed device might have very long lived mapping entries.
It seems impractical to setup statistics per context entry and timers
to check back on them for things that really ought to be rare events.
My goal was only to reduce the overall impact on the system so that
it's usable when this occurs.

> We really ought to be reporting faults to drivers too, FWIW. I keep
> meaning to take a look at that.

Yes, that path has been absent for far too long.  Thanks,

Alex

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH] iommu/vt-d: Ratelimit fault handler

2016-03-15 Thread Joe Perches
On Tue, 2016-03-15 at 10:35 -0600, Alex Williamson wrote:
> Fault rates can easily overwhelm the console and make the system
> unresponsive.  Ratelimit to allow an opportunity for maintenance.

A few suggestions:

o Use a single ratelimit state.
o The multiple lines output are unnecessary and hard to grep
  in the dmesg output because of inconsistent prefixing as
  second and subsequent output lines are not prefixed by pr_fmt.
o The DMAR prefix on the second block is also unnecessary as
  it's already prefixed by pr_fmt
o Coalesce the formats for easier grep.

so maybe:
---
 drivers/iommu/dmar.c | 28 
 1 file changed, 16 insertions(+), 12 deletions(-)

diff --git a/drivers/iommu/dmar.c b/drivers/iommu/dmar.c
index 8ffd756..59dcaaa 100644
--- a/drivers/iommu/dmar.c
+++ b/drivers/iommu/dmar.c
@@ -1575,23 +1575,27 @@ static int dmar_fault_do_one(struct intel_iommu *iommu, 
int type,
 {
    const char *reason;
    int fault_type;
+   static DEFINE_RATELIMIT_STATE(rs,
+     DEFAULT_RATELIMIT_INTERVAL,
+     DEFAULT_RATELIMIT_BURST);
+
+   if (__ratelimit())
+   return 0;
 
    reason = dmar_get_fault_reason(fault_reason, _type);
 
    if (fault_type == INTR_REMAP)
-   pr_err("INTR-REMAP: Request device [[%02x:%02x.%d] "
-      "fault index %llx\n"
-   "INTR-REMAP:[fault reason %02d] %s\n",
-   (source_id >> 8), PCI_SLOT(source_id & 0xFF),
-   PCI_FUNC(source_id & 0xFF), addr >> 48,
-   fault_reason, reason);
+   pr_err("[INTR-REMAP] Request device [%02x:%02x.%d] fault index 
%llx [fault reason %02d] %s\n",
+      source_id >> 8, PCI_SLOT(source_id & 0xFF),
+      PCI_FUNC(source_id & 0xFF), addr >> 48,
+      fault_reason, reason);
    else
-   pr_err("DMAR:[%s] Request device [%02x:%02x.%d] "
-      "fault addr %llx \n"
-      "DMAR:[fault reason %02d] %s\n",
-      (type ? "DMA Read" : "DMA Write"),
-      (source_id >> 8), PCI_SLOT(source_id & 0xFF),
-      PCI_FUNC(source_id & 0xFF), addr, fault_reason, reason);
+   pr_err("[%s] Request device [%02x:%02x.%d] fault addr %llx 
[fault reason %02d] %s\n",
+      type ? "DMA Read" : "DMA Write",
+      source_id >> 8, PCI_SLOT(source_id & 0xFF),
+      PCI_FUNC(source_id & 0xFF), addr,
+      fault_reason, reason);
+
    return 0;
 }
 

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH] iommu/vt-d: Ratelimit fault handler

2016-03-15 Thread David Woodhouse
On Tue, 2016-03-15 at 10:35 -0600, Alex Williamson wrote:
> Fault rates can easily overwhelm the console and make the system
> unresponsive.  Ratelimit to allow an opportunity for maintenance.
> 
> Signed-off-by: Alex Williamson 

Rather than just rate-limiting the printk, I'd prefer to handle this
explicitly. There's a bit in the context-entry which can tell the IOMMU
not to bother raising an interrupt at all. And then we can re-enable it
if/when the driver recovers the device. (Or perhaps just when it next
does a mapping).

We really ought to be reporting faults to drivers too, FWIW. I keep
meaning to take a look at that.

-- 
David WoodhouseOpen Source Technology Centre
david.woodho...@intel.com  Intel Corporation



smime.p7s
Description: S/MIME cryptographic signature
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH] iommu/vt-d: Ratelimit fault handler

2016-03-15 Thread Joe Perches
On Tue, 2016-03-15 at 10:10 -0700, Joe Perches wrote:
> o Use a single ratelimit state.
[]
> diff --git a/drivers/iommu/dmar.c b/drivers/iommu/dmar.c
[]
> + if (__ratelimit())
> + return 0;

That of course should be:

if (!__ratelimit())
return 0;
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH] iommu/vt-d: Ratelimit fault handler

2016-03-15 Thread Alex Williamson
Fault rates can easily overwhelm the console and make the system
unresponsive.  Ratelimit to allow an opportunity for maintenance.

Signed-off-by: Alex Williamson 
---
 drivers/iommu/dmar.c |   28 +++-
 1 file changed, 15 insertions(+), 13 deletions(-)

diff --git a/drivers/iommu/dmar.c b/drivers/iommu/dmar.c
index 8ffd756..628987e 100644
--- a/drivers/iommu/dmar.c
+++ b/drivers/iommu/dmar.c
@@ -1579,19 +1579,20 @@ static int dmar_fault_do_one(struct intel_iommu *iommu, 
int type,
reason = dmar_get_fault_reason(fault_reason, _type);
 
if (fault_type == INTR_REMAP)
-   pr_err("INTR-REMAP: Request device [[%02x:%02x.%d] "
-  "fault index %llx\n"
-   "INTR-REMAP:[fault reason %02d] %s\n",
-   (source_id >> 8), PCI_SLOT(source_id & 0xFF),
-   PCI_FUNC(source_id & 0xFF), addr >> 48,
-   fault_reason, reason);
+   pr_err_ratelimited("INTR-REMAP: Request device [[%02x:%02x.%d] "
+  "fault index %llx\n"
+  "INTR-REMAP:[fault reason %02d] %s\n",
+  (source_id >> 8), PCI_SLOT(source_id & 0xFF),
+  PCI_FUNC(source_id & 0xFF), addr >> 48,
+  fault_reason, reason);
else
-   pr_err("DMAR:[%s] Request device [%02x:%02x.%d] "
-  "fault addr %llx \n"
-  "DMAR:[fault reason %02d] %s\n",
-  (type ? "DMA Read" : "DMA Write"),
-  (source_id >> 8), PCI_SLOT(source_id & 0xFF),
-  PCI_FUNC(source_id & 0xFF), addr, fault_reason, reason);
+   pr_err_ratelimited("DMAR:[%s] Request device [%02x:%02x.%d] "
+  "fault addr %llx \n"
+  "DMAR:[fault reason %02d] %s\n",
+  (type ? "DMA Read" : "DMA Write"),
+  (source_id >> 8), PCI_SLOT(source_id & 0xFF),
+  PCI_FUNC(source_id & 0xFF), addr,
+  fault_reason, reason);
return 0;
 }
 
@@ -1606,7 +1607,8 @@ irqreturn_t dmar_fault(int irq, void *dev_id)
raw_spin_lock_irqsave(>register_lock, flag);
fault_status = readl(iommu->reg + DMAR_FSTS_REG);
if (fault_status)
-   pr_err("DRHD: handling fault status reg %x\n", fault_status);
+   pr_err_ratelimited("DRHD: handling fault status reg %x\n",
+  fault_status);
 
/* TBD: ignore advanced fault log currently */
if (!(fault_status & DMA_FSTS_PPF))

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu