Re: [PATCH 3/4] iommu/arm-smmu: Disable stalling faults for all endpoints

2017-09-18 Thread Rob Clark
On Mon, Sep 18, 2017 at 1:33 PM, Will Deacon  wrote:
> On Wed, Sep 13, 2017 at 03:31:20PM -0400, Rob Clark wrote:
>> On Fri, Dec 16, 2016 at 6:54 AM, Will Deacon  wrote:
>> > Hi Rob,
>> >
>> > On Tue, Dec 06, 2016 at 06:30:21PM -0500, Rob Clark wrote:
>> >> On Thu, Aug 18, 2016 at 9:05 AM, Will Deacon  wrote:
>> >> > Enabling stalling faults can result in hardware deadlock on poorly
>> >> > designed systems, particularly those with a PCI root complex upstream of
>> >> > the SMMU.
>> >> >
>> >> > Although it's not really Linux's job to save hardware integrators from
>> >> > their own misfortune, it *is* our job to stop userspace (e.g. VFIO
>> >> > clients) from hosing the system for everybody else, even if they might
>> >> > already be required to have elevated privileges.
>> >> >
>> >> > Given that the fault handling code currently executes entirely in IRQ
>> >> > context, there is nothing that can sensibly be done to recover from
>> >> > things like page faults anyway, so let's rip this code out for now and
>> >> > avoid the potential for deadlock.
>> >>
>> >> so, I'd like to re-introduce this feature, I *guess* as some sort of
>> >> opt-in quirk (ie. disabled by default unless something in DT tells you
>> >> otherwise??  But I'm open to suggestions.  I'm not entirely sure what
>> >> hw was having problems due to this feature.)
>> >>
>> >> On newer snapdragon devices we are using arm-smmu for the GPU, and
>> >> halting the GPU so the driver's fault handler can dump some GPU state
>> >> on faults is enormously helpful for debugging and tracking down where
>> >> in the gpu cmdstream the fault was triggered.  In addition, we will
>> >> eventually want the ability to update pagetables from fault handler
>> >> and resuming the faulting transition.
>> >
>> > I'm not against reintroducing this, but it would certainly need to be
>> > opt-in, as you suggest. If we want to try to use stall faults to enable
>> > demand paging for DMA, then that means running core mm code to resolve
>> > the fault and we can't do that in irq context. Consequently, we have to
>> > hand this off to a thread, which means the hardware must allow the SS
>> > bit to remain set without immediately reasserting the interrupt line.
>> > Furthermore, we can't handle multiple faults on a context-bank, so we'd
>> > need to restrict ourselves to one device (i.e. faulting stream) per
>> > domain (CB).
>> >
>> > I think that means we want both specific compatible strings to identify
>> > the SS bit behaviour, but also a way to opt-in for the stall model as a
>> > separate property to indicate that the SoC integration can support this
>> > without e.g. deadlocking.
>> >
>>
>> How do you feel about a short-term step to keep things in irq context,
>> but enable stall mode?  I'm debugging an issue, where it appears that
>> the gpu cannot handle a non-stalled fault (triggers some fairly
>> bizarre follow-on problems).  So I think even if we don't add a fault
>> handler callback, we still want to set the CFCFG bit and
>> RESUME_TERMINATE in the fault handler on this hardware.
>
> Hmm, colour me unconvinced. Why does enabling stalls fix this problem?
> I'd rather we bite the bullet and implement things properly on top of
> a workqueue so that you can build on the same basic infrastructure as
> the SVM work that Jean-Philippe is looking at, particular as you also
> have a use-case for running fault code in non-atomic context.
>

So, it seems like setting either CFCFG or HUPCF (which is what
downstream android kernel apparently does) avoids this issue.

It seems like some sort of hw bug, but not sure if in the iommu or the
gpu.. you'd probably have to ask someone @qcom about the details, but
without setting either of these bits, it seems that other concurrent
memory transactions to the one triggering fault and up returning bogus
values to the GPU.  So the CP (which is reading cmdstream somewhat far
ahead of the DRAW or COMPUTE shader that triggers a fault) ends up
reading bogus cmdstream values and triggers all sorts of spectacular
fireworks.

BR,
-R
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 3/4] iommu/arm-smmu: Disable stalling faults for all endpoints

2017-09-18 Thread Will Deacon
On Wed, Sep 13, 2017 at 03:31:20PM -0400, Rob Clark wrote:
> On Fri, Dec 16, 2016 at 6:54 AM, Will Deacon  wrote:
> > Hi Rob,
> >
> > On Tue, Dec 06, 2016 at 06:30:21PM -0500, Rob Clark wrote:
> >> On Thu, Aug 18, 2016 at 9:05 AM, Will Deacon  wrote:
> >> > Enabling stalling faults can result in hardware deadlock on poorly
> >> > designed systems, particularly those with a PCI root complex upstream of
> >> > the SMMU.
> >> >
> >> > Although it's not really Linux's job to save hardware integrators from
> >> > their own misfortune, it *is* our job to stop userspace (e.g. VFIO
> >> > clients) from hosing the system for everybody else, even if they might
> >> > already be required to have elevated privileges.
> >> >
> >> > Given that the fault handling code currently executes entirely in IRQ
> >> > context, there is nothing that can sensibly be done to recover from
> >> > things like page faults anyway, so let's rip this code out for now and
> >> > avoid the potential for deadlock.
> >>
> >> so, I'd like to re-introduce this feature, I *guess* as some sort of
> >> opt-in quirk (ie. disabled by default unless something in DT tells you
> >> otherwise??  But I'm open to suggestions.  I'm not entirely sure what
> >> hw was having problems due to this feature.)
> >>
> >> On newer snapdragon devices we are using arm-smmu for the GPU, and
> >> halting the GPU so the driver's fault handler can dump some GPU state
> >> on faults is enormously helpful for debugging and tracking down where
> >> in the gpu cmdstream the fault was triggered.  In addition, we will
> >> eventually want the ability to update pagetables from fault handler
> >> and resuming the faulting transition.
> >
> > I'm not against reintroducing this, but it would certainly need to be
> > opt-in, as you suggest. If we want to try to use stall faults to enable
> > demand paging for DMA, then that means running core mm code to resolve
> > the fault and we can't do that in irq context. Consequently, we have to
> > hand this off to a thread, which means the hardware must allow the SS
> > bit to remain set without immediately reasserting the interrupt line.
> > Furthermore, we can't handle multiple faults on a context-bank, so we'd
> > need to restrict ourselves to one device (i.e. faulting stream) per
> > domain (CB).
> >
> > I think that means we want both specific compatible strings to identify
> > the SS bit behaviour, but also a way to opt-in for the stall model as a
> > separate property to indicate that the SoC integration can support this
> > without e.g. deadlocking.
> >
> 
> How do you feel about a short-term step to keep things in irq context,
> but enable stall mode?  I'm debugging an issue, where it appears that
> the gpu cannot handle a non-stalled fault (triggers some fairly
> bizarre follow-on problems).  So I think even if we don't add a fault
> handler callback, we still want to set the CFCFG bit and
> RESUME_TERMINATE in the fault handler on this hardware.

Hmm, colour me unconvinced. Why does enabling stalls fix this problem?
I'd rather we bite the bullet and implement things properly on top of
a workqueue so that you can build on the same basic infrastructure as
the SVM work that Jean-Philippe is looking at, particular as you also
have a use-case for running fault code in non-atomic context.

Will
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 3/4] iommu/arm-smmu: Disable stalling faults for all endpoints

2017-09-13 Thread Rob Clark
On Fri, Dec 16, 2016 at 6:54 AM, Will Deacon  wrote:
> Hi Rob,
>
> On Tue, Dec 06, 2016 at 06:30:21PM -0500, Rob Clark wrote:
>> On Thu, Aug 18, 2016 at 9:05 AM, Will Deacon  wrote:
>> > Enabling stalling faults can result in hardware deadlock on poorly
>> > designed systems, particularly those with a PCI root complex upstream of
>> > the SMMU.
>> >
>> > Although it's not really Linux's job to save hardware integrators from
>> > their own misfortune, it *is* our job to stop userspace (e.g. VFIO
>> > clients) from hosing the system for everybody else, even if they might
>> > already be required to have elevated privileges.
>> >
>> > Given that the fault handling code currently executes entirely in IRQ
>> > context, there is nothing that can sensibly be done to recover from
>> > things like page faults anyway, so let's rip this code out for now and
>> > avoid the potential for deadlock.
>>
>> so, I'd like to re-introduce this feature, I *guess* as some sort of
>> opt-in quirk (ie. disabled by default unless something in DT tells you
>> otherwise??  But I'm open to suggestions.  I'm not entirely sure what
>> hw was having problems due to this feature.)
>>
>> On newer snapdragon devices we are using arm-smmu for the GPU, and
>> halting the GPU so the driver's fault handler can dump some GPU state
>> on faults is enormously helpful for debugging and tracking down where
>> in the gpu cmdstream the fault was triggered.  In addition, we will
>> eventually want the ability to update pagetables from fault handler
>> and resuming the faulting transition.
>
> I'm not against reintroducing this, but it would certainly need to be
> opt-in, as you suggest. If we want to try to use stall faults to enable
> demand paging for DMA, then that means running core mm code to resolve
> the fault and we can't do that in irq context. Consequently, we have to
> hand this off to a thread, which means the hardware must allow the SS
> bit to remain set without immediately reasserting the interrupt line.
> Furthermore, we can't handle multiple faults on a context-bank, so we'd
> need to restrict ourselves to one device (i.e. faulting stream) per
> domain (CB).
>
> I think that means we want both specific compatible strings to identify
> the SS bit behaviour, but also a way to opt-in for the stall model as a
> separate property to indicate that the SoC integration can support this
> without e.g. deadlocking.
>

How do you feel about a short-term step to keep things in irq context,
but enable stall mode?  I'm debugging an issue, where it appears that
the gpu cannot handle a non-stalled fault (triggers some fairly
bizarre follow-on problems).  So I think even if we don't add a fault
handler callback, we still want to set the CFCFG bit and
RESUME_TERMINATE in the fault handler on this hardware.

BR,
-R
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 3/4] iommu/arm-smmu: Disable stalling faults for all endpoints

2016-12-20 Thread Will Deacon
On Mon, Dec 19, 2016 at 02:33:36PM +0530, Sricharan wrote:
> >On Tue, Dec 06, 2016 at 06:30:21PM -0500, Rob Clark wrote:
> >> On Thu, Aug 18, 2016 at 9:05 AM, Will Deacon  wrote:
> >> > Enabling stalling faults can result in hardware deadlock on poorly
> >> > designed systems, particularly those with a PCI root complex upstream of
> >> > the SMMU.
> >> >
> >> > Although it's not really Linux's job to save hardware integrators from
> >> > their own misfortune, it *is* our job to stop userspace (e.g. VFIO
> >> > clients) from hosing the system for everybody else, even if they might
> >> > already be required to have elevated privileges.
> >> >
> >> > Given that the fault handling code currently executes entirely in IRQ
> >> > context, there is nothing that can sensibly be done to recover from
> >> > things like page faults anyway, so let's rip this code out for now and
> >> > avoid the potential for deadlock.
> >>
> >> so, I'd like to re-introduce this feature, I *guess* as some sort of
> >> opt-in quirk (ie. disabled by default unless something in DT tells you
> >> otherwise??  But I'm open to suggestions.  I'm not entirely sure what
> >> hw was having problems due to this feature.)
> >>
> >> On newer snapdragon devices we are using arm-smmu for the GPU, and
> >> halting the GPU so the driver's fault handler can dump some GPU state
> >> on faults is enormously helpful for debugging and tracking down where
> >> in the gpu cmdstream the fault was triggered.  In addition, we will
> >> eventually want the ability to update pagetables from fault handler
> >> and resuming the faulting transition.
> >
> >I'm not against reintroducing this, but it would certainly need to be
> >opt-in, as you suggest. If we want to try to use stall faults to enable
> >demand paging for DMA, then that means running core mm code to resolve
> >the fault and we can't do that in irq context. Consequently, we have to
> >hand this off to a thread, which means the hardware must allow the SS
> >bit to remain set without immediately reasserting the interrupt line.
> >Furthermore, we can't handle multiple faults on a context-bank, so we'd
> >need to restrict ourselves to one device (i.e. faulting stream) per
> >domain (CB).
> >
> >I think that means we want both specific compatible strings to identify
> >the SS bit behaviour, but also a way to opt-in for the stall model as a
> >separate property to indicate that the SoC integration can support this
> >without e.g. deadlocking.
> >
> 
> To understand the reason on the need for the quirk based on SS bit behavior,
> if the platform supports stall model and enabled, then SS bit should be 
> implemented
> and remain set until the RESUME register is written back, means same behavior
> always ?

The behaviour of the SS bit is IMPLEMENTATION DEFINED per the architecture,
so we need to know which way the given implementation chose to go. If we
want to support paging, then we absolutely need a way to return from the
interrupt handler without having handled the stall (i.e. without having
written to the RESUME register). That means that we mustn't take the same
interrupt immediately, otherwise we'll end up getting stuck in an infinite
fault. One hacky option would be to mask the interrupt at the GIC, but
that adds an additional requirement of one interrupt per context bank,
which isn't typically implemented in my experience.

Will
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


RE: [PATCH 3/4] iommu/arm-smmu: Disable stalling faults for all endpoints

2016-12-19 Thread Sricharan
Hi Will,

>On Tue, Dec 06, 2016 at 06:30:21PM -0500, Rob Clark wrote:
>> On Thu, Aug 18, 2016 at 9:05 AM, Will Deacon  wrote:
>> > Enabling stalling faults can result in hardware deadlock on poorly
>> > designed systems, particularly those with a PCI root complex upstream of
>> > the SMMU.
>> >
>> > Although it's not really Linux's job to save hardware integrators from
>> > their own misfortune, it *is* our job to stop userspace (e.g. VFIO
>> > clients) from hosing the system for everybody else, even if they might
>> > already be required to have elevated privileges.
>> >
>> > Given that the fault handling code currently executes entirely in IRQ
>> > context, there is nothing that can sensibly be done to recover from
>> > things like page faults anyway, so let's rip this code out for now and
>> > avoid the potential for deadlock.
>>
>> so, I'd like to re-introduce this feature, I *guess* as some sort of
>> opt-in quirk (ie. disabled by default unless something in DT tells you
>> otherwise??  But I'm open to suggestions.  I'm not entirely sure what
>> hw was having problems due to this feature.)
>>
>> On newer snapdragon devices we are using arm-smmu for the GPU, and
>> halting the GPU so the driver's fault handler can dump some GPU state
>> on faults is enormously helpful for debugging and tracking down where
>> in the gpu cmdstream the fault was triggered.  In addition, we will
>> eventually want the ability to update pagetables from fault handler
>> and resuming the faulting transition.
>
>I'm not against reintroducing this, but it would certainly need to be
>opt-in, as you suggest. If we want to try to use stall faults to enable
>demand paging for DMA, then that means running core mm code to resolve
>the fault and we can't do that in irq context. Consequently, we have to
>hand this off to a thread, which means the hardware must allow the SS
>bit to remain set without immediately reasserting the interrupt line.
>Furthermore, we can't handle multiple faults on a context-bank, so we'd
>need to restrict ourselves to one device (i.e. faulting stream) per
>domain (CB).
>
>I think that means we want both specific compatible strings to identify
>the SS bit behaviour, but also a way to opt-in for the stall model as a
>separate property to indicate that the SoC integration can support this
>without e.g. deadlocking.
>

To understand the reason on the need for the quirk based on SS bit behavior,
if the platform supports stall model and enabled, then SS bit should be 
implemented
and remain set until the RESUME register is written back, means same behavior
always ?

Regards,
 Sricharan

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 3/4] iommu/arm-smmu: Disable stalling faults for all endpoints

2016-12-16 Thread Will Deacon
Hi Rob,

On Tue, Dec 06, 2016 at 06:30:21PM -0500, Rob Clark wrote:
> On Thu, Aug 18, 2016 at 9:05 AM, Will Deacon  wrote:
> > Enabling stalling faults can result in hardware deadlock on poorly
> > designed systems, particularly those with a PCI root complex upstream of
> > the SMMU.
> >
> > Although it's not really Linux's job to save hardware integrators from
> > their own misfortune, it *is* our job to stop userspace (e.g. VFIO
> > clients) from hosing the system for everybody else, even if they might
> > already be required to have elevated privileges.
> >
> > Given that the fault handling code currently executes entirely in IRQ
> > context, there is nothing that can sensibly be done to recover from
> > things like page faults anyway, so let's rip this code out for now and
> > avoid the potential for deadlock.
> 
> so, I'd like to re-introduce this feature, I *guess* as some sort of
> opt-in quirk (ie. disabled by default unless something in DT tells you
> otherwise??  But I'm open to suggestions.  I'm not entirely sure what
> hw was having problems due to this feature.)
> 
> On newer snapdragon devices we are using arm-smmu for the GPU, and
> halting the GPU so the driver's fault handler can dump some GPU state
> on faults is enormously helpful for debugging and tracking down where
> in the gpu cmdstream the fault was triggered.  In addition, we will
> eventually want the ability to update pagetables from fault handler
> and resuming the faulting transition.

I'm not against reintroducing this, but it would certainly need to be
opt-in, as you suggest. If we want to try to use stall faults to enable
demand paging for DMA, then that means running core mm code to resolve
the fault and we can't do that in irq context. Consequently, we have to
hand this off to a thread, which means the hardware must allow the SS
bit to remain set without immediately reasserting the interrupt line.
Furthermore, we can't handle multiple faults on a context-bank, so we'd
need to restrict ourselves to one device (i.e. faulting stream) per
domain (CB).

I think that means we want both specific compatible strings to identify
the SS bit behaviour, but also a way to opt-in for the stall model as a
separate property to indicate that the SoC integration can support this
without e.g. deadlocking.

Will
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


RE: [PATCH 3/4] iommu/arm-smmu: Disable stalling faults for all endpoints

2016-12-10 Thread Sricharan
Hi Rob,

>> > Although it's not really Linux's job to save hardware integrators from
>> > their own misfortune, it *is* our job to stop userspace (e.g. VFIO
>> > clients) from hosing the system for everybody else, even if they might
>> > already be required to have elevated privileges.
>> >
>> > Given that the fault handling code currently executes entirely in IRQ
>> > context, there is nothing that can sensibly be done to recover from
>> > things like page faults anyway, so let's rip this code out for now and
>> > avoid the potential for deadlock.
>>
>> Hi Will,
>>
>> so, I'd like to re-introduce this feature, I *guess* as some sort of
>> opt-in quirk (ie. disabled by default unless something in DT tells you
>> otherwise??  But I'm open to suggestions.  I'm not entirely sure what
>> hw was having problems due to this feature.)
>>
>> On newer snapdragon devices we are using arm-smmu for the GPU, and
>> halting the GPU so the driver's fault handler can dump some GPU state
>> on faults is enormously helpful for debugging and tracking down where
>> in the gpu cmdstream the fault was triggered.  In addition, we will
>> eventually want the ability to update pagetables from fault handler
>> and resuming the faulting transition.
>>
>> Some additional comments below..
>>
>> > Cc: 
>> > Reported-by: Matt Evans 
>> > Signed-off-by: Will Deacon 
>> > ---
>> >  drivers/iommu/arm-smmu.c | 34 +++---
>> >  1 file changed, 7 insertions(+), 27 deletions(-)
>> >
>> > diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
>> > index 4f49fe29f202..2db74ebc3240 100644
>> > --- a/drivers/iommu/arm-smmu.c
>> > +++ b/drivers/iommu/arm-smmu.c
>> > @@ -686,8 +686,7 @@ static struct iommu_gather_ops arm_smmu_gather_ops = {
>> >
>> >  static irqreturn_t arm_smmu_context_fault(int irq, void *dev)
>> >  {
>> > -   int flags, ret;
>> > -   u32 fsr, fsynr, resume;
>> > +   u32 fsr, fsynr;
>> > unsigned long iova;
>> > struct iommu_domain *domain = dev;
>> > struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
>> > @@ -701,34 +700,15 @@ static irqreturn_t arm_smmu_context_fault(int irq, 
>> > void *dev)
>> > if (!(fsr & FSR_FAULT))
>> > return IRQ_NONE;
>> >
>> > -   if (fsr & FSR_IGN)
>> > -   dev_err_ratelimited(smmu->dev,
>> > -   "Unexpected context fault (fsr 
>> > 0x%x)\n",
>> > -   fsr);
>> > -
>> > fsynr = readl_relaxed(cb_base + ARM_SMMU_CB_FSYNR0);
>> > -   flags = fsynr & FSYNR0_WNR ? IOMMU_FAULT_WRITE : IOMMU_FAULT_READ;
>> > -
>> > iova = readq_relaxed(cb_base + ARM_SMMU_CB_FAR);
>> > -   if (!report_iommu_fault(domain, smmu->dev, iova, flags)) {
>> > -   ret = IRQ_HANDLED;
>> > -   resume = RESUME_RETRY;
>> > -   } else {
>> > -   dev_err_ratelimited(smmu->dev,
>> > -   "Unhandled context fault: iova=0x%08lx, fsynr=0x%x, 
>> > cb=%d\n",
>> > -   iova, fsynr, cfg->cbndx);
>>
>> I would like to decouple this dev_err_ratelimit() print from the
>> RESUME_RETRY vs RESUME_TERMINATE behaviour.  I need the ability to
>> indicate by return from my fault handler whether to resume or
>> terminate.  But I already have my own ratelimted prints and would
>> prefer not to spam dmesg twice.
>>
>> I'm thinking about report_iommu_fault() returning:
>>
>>   0 => RESUME_RETRY
>>   -EFAULT => RESUME_TERMINATE but don't print
>>   anything else (or specifically -ENOSYS?) => RESUME_TERMINATE and print
>>
>> thoughts?
>>
>> > -   ret = IRQ_NONE;
>> > -   resume = RESUME_TERMINATE;
>> > -   }
>> > -
>> > -   /* Clear the faulting FSR */
>> > -   writel(fsr, cb_base + ARM_SMMU_CB_FSR);
>> >
>> > -   /* Retry or terminate any stalled transactions */
>> > -   if (fsr & FSR_SS)
>> > -   writel_relaxed(resume, cb_base + ARM_SMMU_CB_RESUME);
>>
>> This might be a bug in qcom's implementation of the smmu spec, but
>> seems like we don't have SS bit set, yet we still require RESUME reg
>> to be written, otherwise gpu is perma-wedged.  Maybe topic for a
>> separate quirk?  I'm not sure if writing RESUME reg on other hw when
>> SS bit is not set is likely to cause problems?  If not I suppose we
>> could just unconditionally write it.
>>
>> Anyways, I'm not super-familiar w/ arm-smmu so suggestions welcome..
>> in between debugging freedreno I'll try to put together some patches.
>
>From what I can tell we need SCTLR_CFCFG to make the stall happen otherwise
>the operation just gets terminated immediately and *then* we get notification
>but by then the system keeps going.
>

Yes thats right, SCTLR_CFCFG was removed as a part of this patch.

>I think SCTLR_HUPCF helps control that behavior (i.e. we don't go off faulting
>through eternity) but I don't know how it 

Re: [PATCH 3/4] iommu/arm-smmu: Disable stalling faults for all endpoints

2016-12-06 Thread Jordan Crouse
On Tue, Dec 06, 2016 at 06:30:21PM -0500, Rob Clark wrote:
> On Thu, Aug 18, 2016 at 9:05 AM, Will Deacon  wrote:
> > Enabling stalling faults can result in hardware deadlock on poorly
> > designed systems, particularly those with a PCI root complex upstream of
> > the SMMU.
> >
> > Although it's not really Linux's job to save hardware integrators from
> > their own misfortune, it *is* our job to stop userspace (e.g. VFIO
> > clients) from hosing the system for everybody else, even if they might
> > already be required to have elevated privileges.
> >
> > Given that the fault handling code currently executes entirely in IRQ
> > context, there is nothing that can sensibly be done to recover from
> > things like page faults anyway, so let's rip this code out for now and
> > avoid the potential for deadlock.
> 
> Hi Will,
> 
> so, I'd like to re-introduce this feature, I *guess* as some sort of
> opt-in quirk (ie. disabled by default unless something in DT tells you
> otherwise??  But I'm open to suggestions.  I'm not entirely sure what
> hw was having problems due to this feature.)
> 
> On newer snapdragon devices we are using arm-smmu for the GPU, and
> halting the GPU so the driver's fault handler can dump some GPU state
> on faults is enormously helpful for debugging and tracking down where
> in the gpu cmdstream the fault was triggered.  In addition, we will
> eventually want the ability to update pagetables from fault handler
> and resuming the faulting transition.
> 
> Some additional comments below..
> 
> > Cc: 
> > Reported-by: Matt Evans 
> > Signed-off-by: Will Deacon 
> > ---
> >  drivers/iommu/arm-smmu.c | 34 +++---
> >  1 file changed, 7 insertions(+), 27 deletions(-)
> >
> > diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
> > index 4f49fe29f202..2db74ebc3240 100644
> > --- a/drivers/iommu/arm-smmu.c
> > +++ b/drivers/iommu/arm-smmu.c
> > @@ -686,8 +686,7 @@ static struct iommu_gather_ops arm_smmu_gather_ops = {
> >
> >  static irqreturn_t arm_smmu_context_fault(int irq, void *dev)
> >  {
> > -   int flags, ret;
> > -   u32 fsr, fsynr, resume;
> > +   u32 fsr, fsynr;
> > unsigned long iova;
> > struct iommu_domain *domain = dev;
> > struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
> > @@ -701,34 +700,15 @@ static irqreturn_t arm_smmu_context_fault(int irq, 
> > void *dev)
> > if (!(fsr & FSR_FAULT))
> > return IRQ_NONE;
> >
> > -   if (fsr & FSR_IGN)
> > -   dev_err_ratelimited(smmu->dev,
> > -   "Unexpected context fault (fsr 0x%x)\n",
> > -   fsr);
> > -
> > fsynr = readl_relaxed(cb_base + ARM_SMMU_CB_FSYNR0);
> > -   flags = fsynr & FSYNR0_WNR ? IOMMU_FAULT_WRITE : IOMMU_FAULT_READ;
> > -
> > iova = readq_relaxed(cb_base + ARM_SMMU_CB_FAR);
> > -   if (!report_iommu_fault(domain, smmu->dev, iova, flags)) {
> > -   ret = IRQ_HANDLED;
> > -   resume = RESUME_RETRY;
> > -   } else {
> > -   dev_err_ratelimited(smmu->dev,
> > -   "Unhandled context fault: iova=0x%08lx, fsynr=0x%x, 
> > cb=%d\n",
> > -   iova, fsynr, cfg->cbndx);
> 
> I would like to decouple this dev_err_ratelimit() print from the
> RESUME_RETRY vs RESUME_TERMINATE behaviour.  I need the ability to
> indicate by return from my fault handler whether to resume or
> terminate.  But I already have my own ratelimted prints and would
> prefer not to spam dmesg twice.
> 
> I'm thinking about report_iommu_fault() returning:
> 
>   0 => RESUME_RETRY
>   -EFAULT => RESUME_TERMINATE but don't print
>   anything else (or specifically -ENOSYS?) => RESUME_TERMINATE and print
> 
> thoughts?
> 
> > -   ret = IRQ_NONE;
> > -   resume = RESUME_TERMINATE;
> > -   }
> > -
> > -   /* Clear the faulting FSR */
> > -   writel(fsr, cb_base + ARM_SMMU_CB_FSR);
> >
> > -   /* Retry or terminate any stalled transactions */
> > -   if (fsr & FSR_SS)
> > -   writel_relaxed(resume, cb_base + ARM_SMMU_CB_RESUME);
> 
> This might be a bug in qcom's implementation of the smmu spec, but
> seems like we don't have SS bit set, yet we still require RESUME reg
> to be written, otherwise gpu is perma-wedged.  Maybe topic for a
> separate quirk?  I'm not sure if writing RESUME reg on other hw when
> SS bit is not set is likely to cause problems?  If not I suppose we
> could just unconditionally write it.
> 
> Anyways, I'm not super-familiar w/ arm-smmu so suggestions welcome..
> in between debugging freedreno I'll try to put together some patches.

>From what I can tell we need SCTLR_CFCFG to make the stall happen otherwise
the operation just gets terminated immediately and *then* we get notification
but by then the system keeps going.

I 

Re: [PATCH 3/4] iommu/arm-smmu: Disable stalling faults for all endpoints

2016-12-06 Thread Rob Clark
On Thu, Aug 18, 2016 at 9:05 AM, Will Deacon  wrote:
> Enabling stalling faults can result in hardware deadlock on poorly
> designed systems, particularly those with a PCI root complex upstream of
> the SMMU.
>
> Although it's not really Linux's job to save hardware integrators from
> their own misfortune, it *is* our job to stop userspace (e.g. VFIO
> clients) from hosing the system for everybody else, even if they might
> already be required to have elevated privileges.
>
> Given that the fault handling code currently executes entirely in IRQ
> context, there is nothing that can sensibly be done to recover from
> things like page faults anyway, so let's rip this code out for now and
> avoid the potential for deadlock.

Hi Will,

so, I'd like to re-introduce this feature, I *guess* as some sort of
opt-in quirk (ie. disabled by default unless something in DT tells you
otherwise??  But I'm open to suggestions.  I'm not entirely sure what
hw was having problems due to this feature.)

On newer snapdragon devices we are using arm-smmu for the GPU, and
halting the GPU so the driver's fault handler can dump some GPU state
on faults is enormously helpful for debugging and tracking down where
in the gpu cmdstream the fault was triggered.  In addition, we will
eventually want the ability to update pagetables from fault handler
and resuming the faulting transition.

Some additional comments below..

> Cc: 
> Reported-by: Matt Evans 
> Signed-off-by: Will Deacon 
> ---
>  drivers/iommu/arm-smmu.c | 34 +++---
>  1 file changed, 7 insertions(+), 27 deletions(-)
>
> diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
> index 4f49fe29f202..2db74ebc3240 100644
> --- a/drivers/iommu/arm-smmu.c
> +++ b/drivers/iommu/arm-smmu.c
> @@ -686,8 +686,7 @@ static struct iommu_gather_ops arm_smmu_gather_ops = {
>
>  static irqreturn_t arm_smmu_context_fault(int irq, void *dev)
>  {
> -   int flags, ret;
> -   u32 fsr, fsynr, resume;
> +   u32 fsr, fsynr;
> unsigned long iova;
> struct iommu_domain *domain = dev;
> struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
> @@ -701,34 +700,15 @@ static irqreturn_t arm_smmu_context_fault(int irq, void 
> *dev)
> if (!(fsr & FSR_FAULT))
> return IRQ_NONE;
>
> -   if (fsr & FSR_IGN)
> -   dev_err_ratelimited(smmu->dev,
> -   "Unexpected context fault (fsr 0x%x)\n",
> -   fsr);
> -
> fsynr = readl_relaxed(cb_base + ARM_SMMU_CB_FSYNR0);
> -   flags = fsynr & FSYNR0_WNR ? IOMMU_FAULT_WRITE : IOMMU_FAULT_READ;
> -
> iova = readq_relaxed(cb_base + ARM_SMMU_CB_FAR);
> -   if (!report_iommu_fault(domain, smmu->dev, iova, flags)) {
> -   ret = IRQ_HANDLED;
> -   resume = RESUME_RETRY;
> -   } else {
> -   dev_err_ratelimited(smmu->dev,
> -   "Unhandled context fault: iova=0x%08lx, fsynr=0x%x, 
> cb=%d\n",
> -   iova, fsynr, cfg->cbndx);

I would like to decouple this dev_err_ratelimit() print from the
RESUME_RETRY vs RESUME_TERMINATE behaviour.  I need the ability to
indicate by return from my fault handler whether to resume or
terminate.  But I already have my own ratelimted prints and would
prefer not to spam dmesg twice.

I'm thinking about report_iommu_fault() returning:

  0 => RESUME_RETRY
  -EFAULT => RESUME_TERMINATE but don't print
  anything else (or specifically -ENOSYS?) => RESUME_TERMINATE and print

thoughts?

> -   ret = IRQ_NONE;
> -   resume = RESUME_TERMINATE;
> -   }
> -
> -   /* Clear the faulting FSR */
> -   writel(fsr, cb_base + ARM_SMMU_CB_FSR);
>
> -   /* Retry or terminate any stalled transactions */
> -   if (fsr & FSR_SS)
> -   writel_relaxed(resume, cb_base + ARM_SMMU_CB_RESUME);

This might be a bug in qcom's implementation of the smmu spec, but
seems like we don't have SS bit set, yet we still require RESUME reg
to be written, otherwise gpu is perma-wedged.  Maybe topic for a
separate quirk?  I'm not sure if writing RESUME reg on other hw when
SS bit is not set is likely to cause problems?  If not I suppose we
could just unconditionally write it.

Anyways, I'm not super-familiar w/ arm-smmu so suggestions welcome..
in between debugging freedreno I'll try to put together some patches.

BR,
-R

> +   dev_err_ratelimited(smmu->dev,
> +   "Unhandled context fault: fsr=0x%x, iova=0x%08lx, fsynr=0x%x, 
> cb=%d\n",
> +   fsr, iova, fsynr, cfg->cbndx);
>
> -   return ret;
> +   writel(fsr, cb_base + ARM_SMMU_CB_FSR);
> +   return IRQ_HANDLED;
>  }
>
>  static irqreturn_t arm_smmu_global_fault(int irq, void *dev)
> @@ -837,7 +817,7 @@ static void arm_smmu_init_context_bank(struct 
> arm_smmu_domain *smmu_domain,
>