Re: [PATCH 1/2] arm64: dts: stratix10: Add Stratix10 SMMU support

2018-08-16 Thread Thor Thayer

Hi Robin,

On 08/08/2018 12:38 PM, Robin Murphy wrote:

On 25/07/18 19:24, thor.tha...@linux.intel.com wrote:

From: Thor Thayer 

Add SMMU support to the Stratix10 Device Tree which
includes adding the SMMU node and adding IOMMU stream
ids to the SMMU peripherals. Update bindings.

Signed-off-by: Thor Thayer 
---
This patch is dependent on the patch series
"iommu/arm-smmu: Add runtime pm/sleep support"
(https://patchwork.ozlabs.org/cover/946160/)
---
  .../devicetree/bindings/iommu/arm,smmu.txt         | 25 ++
  arch/arm64/boot/dts/altera/socfpga_stratix10.dtsi  | 30 ++

  2 files changed, 55 insertions(+)

diff --git a/Documentation/devicetree/bindings/iommu/arm,smmu.txt b/Documentation/devicetree/bindings/iommu/arm,smmu.txt
index 7c71a6ed465a..8e3fe0594e3e 100644
--- a/Documentation/devicetree/bindings/iommu/arm,smmu.txt
+++ b/Documentation/devicetree/bindings/iommu/arm,smmu.txt
@@ -18,6 +18,7 @@ conditions.
  "arm,mmu-500"
  "cavium,smmu-v2"
  "qcom,-smmu-v2", "qcom,smmu-v2"
+    "altr,smmu-v2"


Can we guarantee that no Altera SoC will ever exist with a different 
SMMU implementation, configuration, or clock tree? If we must have 
compatibles for SoC-specific integrations, I'd be a lot happier if they 
were actually SoC-specific, i.e. at least "altr,stratix10-smmu", or even 
something like "altr,gx5500-smmu" if there's a chance of new 
incompatible designs being added to the Stratix 10 family in future.

Good point. I'll get a better compatible string if I pursue this.

I'm still dubious that we actually need this for MMU-500, though, since 
we will always need the TCU clock enabled to do anything, and given the 
difficulty in associating particular TBU clocks with whichever domains 
might cause allocations into which TBU's TLBs, it seems highly unlikely 
that it'll ever be feasible to work at a granularity finer than "all of 
the clocks". And at that point the names don't really matter, and we 
merely need something like the proposed of_clk_bulk_get()[1], which 
works fine regardless of how many TBUs and distinct clocks exist for a 
particular MMU-500 configuration and integration.

Yes, I would prefer to use the standard "arm,mmu-500", but with the
changes proposed by [2], which this patch depends on, it seemed I would
need to define a new structure and type.


I like the patch series for devm_clk_bulk_get_all() that includes 
of_clk_bulk_get(). That would enable my patch to work with minor changes 
to add bulk_clk to arm-smmu.c. However, the patchset doesn't seem to 
have been accepted yet.
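
For what it's worth, the probe-side change I tested looks roughly like
the sketch below (against the proposed devm_clk_bulk_get_all() API; the
clks/num_clks members are hypothetical additions, not part of today's
arm-smmu driver):

	/* hypothetical members added to struct arm_smmu_device */
	struct clk_bulk_data	*clks;
	int			num_clks;

	/* in arm_smmu_device_probe(): grab and enable however many
	 * clocks the DT provides, without knowing their names */
	smmu->num_clks = devm_clk_bulk_get_all(smmu->dev, &smmu->clks);
	if (smmu->num_clks < 0)
		return smmu->num_clks;

	ret = clk_bulk_prepare_enable(smmu->num_clks, smmu->clks);
	if (ret)
		return ret;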

    depending on the particular implementation and/or the
    version of the architecture implemented.
@@ -179,3 +180,27 @@ conditions.
 <&mmcc SMMU_MDP_AHB_CLK>;
  clock-names = "bus", "iface";
  };
+
+    /* Stratix10 arm,smmu-v2 implementation */
+    smmu5: iommu@fa000000 {
+        compatible = "altr,smmu-v2", "arm,mmu-500",
+                     "arm,smmu-v2";
+        reg = <0xfa000000 0x40000>;
+        #global-interrupts = <2>;
+        #iommu-cells = <1>;
+        clocks = <&clkmgr STRATIX10_L4_MAIN_CLK>;
+        clock-names = "masters";


This isn't documented as an actual property, and if it also clocks the 
TCU then I'm not sure it's really the most accurate name.


Robin.

[1] https://patchwork.kernel.org/patch/10427095/


In the patch I'll remove the clock-names.

I'll keep track of the status of this patch (and [3] from the same 
patchset). I tested a simple patch that uses devm_clk_bulk_get_all() 
from these bulk_clk patches and it works well with the standard 
"arm,mmu-500" compatible.


This patch has dependencies on [2]. It seems like [2] could use the 
bulk_clk patches in [1] above (in particular [2]'s patch 1/4).  The 
function arm_smmu_fill_clk_data() wouldn't be needed because everything 
happens in devm_clk_bulk_get_all(). However, those bulk_clk patches have 
been hanging out there since May.


I'm unclear on how to proceed. Do I continue with dependency on [2] or 
create a new patch adding the bulk clocks changes in [1] (and [3])?


Thanks for reviewing!

Thor

[2] https://patchwork.kernel.org/patch/10546677/
[3] https://patchwork.kernel.org/patch/10427079/

+    interrupt-parent = <&intc>;
+    interrupts = <0 128 4>,    /* Global Secure Fault */
+    <0 129 4>, /* Global Non-secure Fault */
+    /* Non-secure Context Interrupts (32) */
+    <0 138 4>, <0 139 4>, <0 140 4>, <0 141 4>,
+    <0 142 4>, <0 143 4>, <0 144 4>, <0 145 4>,
+    <0 146 4>, <0 147 4>, <0 148 4>, <0 149 4>,
+    <0 150 4>, <0 151 4>, <0 152 4>, <0 153 4>,
+    <0 154 4>, <0 155 4>, <0 156 4>, <0 157 4>,
+    <0 158 4>, <0 159 4>, <0 160 4>, <0 161 4>,
+    <0 162 4>, <0 163 4>, <0 164 4>, <0 165 4>,
+    <0 166 4>, <0 167 4>, <0 168 4>, <0 169 

Re: IOAT DMA w/IOMMU

2018-08-16 Thread Logan Gunthorpe

On 16/08/18 12:53 PM, Kit Chow wrote:
> 
> 
> On 08/16/2018 10:21 AM, Logan Gunthorpe wrote:
>>
>> On 16/08/18 11:16 AM, Kit Chow wrote:
>>> I only have access to intel hosts for testing (and possibly an AMD
>>> host currently collecting dust) and am not sure how to go about getting
>>> the proper test coverage for other architectures.
>> Well, I thought you were only changing the Intel IOMMU implementation...
>> So testing on Intel hardware seems fine to me.
> For the ntb change, I wasn't sure if there was some other arch where
> ntb_async_tx_submit would work ok with the pci bar address but
> would break with the dma map (assuming map_resource has been implemented).
> If it's all x86 at the moment, I guess I'm good to go. Please confirm.

I expect lots of arches have issues with dma_map_resource(): it's pretty
new, and if it didn't work correctly on x86, I imagine many other arches
are wrong too. If an arch doesn't work, someone will have to fix that
arch's IOMMU implementation. But using dma_map_resource() in
ntb_transport is definitely more correct than what was done before.
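
(For anyone following along, the mapping in ntb_async_tx_submit() would
look roughly like the sketch below; the variable names and error
handling are illustrative, not the exact patch:)

	dma_addr_t dst;

	/* Translate the peer's BAR address through the DMA device's
	 * IOMMU rather than programming the raw physical address into
	 * the descriptor. */
	dst = dma_map_resource(device->dev, dest, len, DMA_FROM_DEVICE, 0);
	if (dma_mapping_error(device->dev, dst))
		return -EIO;

	/* ... program the DMA engine with 'dst' ... */

	/* and once the transfer completes: */
	dma_unmap_resource(device->dev, dst, len, DMA_FROM_DEVICE, 0);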

Logan


Re: IOAT DMA w/IOMMU

2018-08-16 Thread Kit Chow

On 08/16/2018 10:21 AM, Logan Gunthorpe wrote:


On 16/08/18 11:16 AM, Kit Chow wrote:

I only have access to intel hosts for testing (and possibly an AMD
host currently collecting dust) and am not sure how to go about getting
the proper test coverage for other architectures.

Well, I thought you were only changing the Intel IOMMU implementation...
So testing on Intel hardware seems fine to me.

For the ntb change, I wasn't sure if there was some other arch where
ntb_async_tx_submit would work ok with the pci bar address but
would break with the dma map (assuming map_resource has been implemented).
If it's all x86 at the moment, I guess I'm good to go. Please confirm.

Thanks
Kit


Any suggestions? Do you anticipate any issues with dma mapping the pci bar
address in ntb_async_tx_submit on non-x86 archs? Do you know of folks
who have ntb test setups with non-x86 hosts who might be able to help?

I expect other IOMMUs will need to be updated to properly support
dma_map_resource but that will probably need to be done on a case by
case basis as the need arises. Unfortunately it seems the work to add
dma_map_resource() didn't really consider that the IOMMUs had to have
proper support and probably should have errored out instead of just
passing along the physical address if the IOMMU didn't have support.

I don't know of anyone with an NTB setup on anything but x86 at the moment.

Logan



Re: [RFC PATCH v1 0/6] Resolve unwanted DMA backing with IOMMU

2018-08-16 Thread Robin Murphy

On 15/08/18 20:56, Dmitry Osipenko wrote:

On Friday, 3 August 2018 18:43:41 MSK Robin Murphy wrote:

On 02/08/18 19:24, Dmitry Osipenko wrote:

On Friday, 27 July 2018 20:16:53 MSK Dmitry Osipenko wrote:

On Friday, 27 July 2018 20:03:26 MSK Jordan Crouse wrote:

On Fri, Jul 27, 2018 at 05:02:37PM +0100, Robin Murphy wrote:

On 27/07/18 15:10, Dmitry Osipenko wrote:

On Friday, 27 July 2018 12:03:28 MSK Will Deacon wrote:

On Fri, Jul 27, 2018 at 10:25:13AM +0200, Joerg Roedel wrote:

On Fri, Jul 27, 2018 at 02:16:18AM +0300, Dmitry Osipenko wrote:

The proposed solution adds a new option to the base device driver
structure that allows device drivers to explicitly convey to the
drivers
core that the implicit IOMMU backing for devices must not happen.


Why is IOMMU mapping a problem for the Tegra GPU driver?

If we add something like this then it should not be the choice of
the
device driver, but of the user and/or the firmware.


Agreed, and it would still need somebody to configure an identity
domain
so
that transactions aren't aborted immediately. We currently allow the
identity domain to be used by default via a command-line option, so I
guess
we'd need a way for firmware to request that on a per-device basis.


The IOMMU mapping itself is not a problem, the problem is the
management
of
the IOMMU. For Tegra we don't want anything to intrude into the IOMMU
activities because:

1) GPU HW require additional configuration for the IOMMU usage and
dumb
mapping of the allocations simply doesn't work.


Generally, that's already handled by the DRM drivers allocating
their own unmanaged domains. The only problem we really need to
solve in that regard is that currently the device DMA ops don't get
updated when moving away from the managed domain. That's been OK for
the VFIO case where the device is bound to a different driver which
we know won't make any explicit DMA API calls, but for the more
general case of IOMMU-aware drivers we could certainly do with a bit
of cooperation between the IOMMU API, DMA API, and arch code to
update the DMA ops dynamically to cope with intermediate subsystems
making DMA API calls on behalf of devices they don't know the
intimate details of.
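
(For concreteness, the "own unmanaged domain" pattern is roughly the
sketch below; the function name and error handling are illustrative:)

	#include <linux/iommu.h>

	static int example_use_own_domain(struct device *dev,
					  unsigned long iova,
					  phys_addr_t paddr, size_t size)
	{
		struct iommu_domain *domain;
		int ret;

		/* Allocates an IOMMU_DOMAIN_UNMANAGED domain. */
		domain = iommu_domain_alloc(dev->bus);
		if (!domain)
			return -ENOMEM;

		/* Detach from the default DMA domain and take over. */
		ret = iommu_attach_device(domain, dev);
		if (ret)
			goto out_free;

		/* The driver now owns the IOVA space outright. */
		ret = iommu_map(domain, iova, paddr, size,
				IOMMU_READ | IOMMU_WRITE);
		if (ret)
			goto out_detach;

		return 0;

	out_detach:
		iommu_detach_device(domain, dev);
	out_free:
		iommu_domain_free(domain);
		return ret;
	}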


2) Older Tegra generations have a limited resource and capabilities in
regards to IOMMU usage, allocating IOMMU domain per-device is just
impossible for example.

3) HW performs context switches and so particular allocations have to
be
assigned to a particular contexts IOMMU domain.


I understand Qualcomm SoCs have a similar thing too, and AFAICS that
case just doesn't fit into the current API model at all. We need the
IOMMU driver to somehow know about the specific details of which
devices have magic associations with specific contexts, and we
almost certainly need a more expressive interface than
iommu_domain_alloc() to have any hope of reliable results.


This is correct for Qualcomm GPUs - The GPU hardware context switching
requires a specific context and there are some restrictions around
secure contexts as well.

We don't really care if the DMA attaches to a context just as long as it
doesn't attach to the one(s) we care about. Perhaps a "valid context"
mask
would work in from the DT or the device struct to give the subsystems a
clue as to which domains they were allowed to use. I recognize that
there
isn't a one-size-fits-all solution to this problem so I'm open to
different
ideas.


Designating whether implicit IOMMU backing is appropriate for a device
via
device-tree property sounds a bit awkward because that will be a kinda
software description (of a custom Linux driver model), while device-tree
is
supposed to describe HW.

What about to grant IOMMU drivers with ability to decide whether the
implicit backing for a device is appropriate? Like this:

bool implicit_iommu_for_dma_is_allowed(struct device *dev)
{
	const struct iommu_ops *ops = dev->bus->iommu_ops;
	struct iommu_group *group;

	group = iommu_group_get(dev);
	if (!group)
		return false;

	iommu_group_put(group);

	if (!ops->implicit_iommu_for_dma_is_allowed)
		return true;

	return ops->implicit_iommu_for_dma_is_allowed(dev);
}

Then arch_setup_dma_ops() could have a clue whether implicit IOMMU
backing
for a device is appropriate.
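
(Roughly, on the arch side -- a sketch only; the hook above is of
course not an existing iommu_ops member, and the fallback detail is
illustrative:)

	void arch_setup_dma_ops(struct device *dev, u64 dma_base, u64 size,
				const struct iommu_ops *iommu, bool coherent)
	{
		/* Skip IOMMU-backed DMA ops if the driver opted out. */
		if (iommu && !implicit_iommu_for_dma_is_allowed(dev))
			iommu = NULL;

		/* ... existing setup continues unchanged ... */
	}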


Guys, does it sound good to you or maybe you have something else on your
mind? Even if it's not an ideal solution, it fixes the immediate problem
and should be good enough for the starter.


To me that looks like a step in the wrong direction that won't help at
all in actually addressing the underlying issues.

If the GPU driver wants to explicitly control IOMMU mappings instead of
relying on the IOMMU_DOMAIN_DMA abstraction, then it should use its own
unmanaged domain. At that point it shouldn't matter if a DMA ops domain
was allocated, since the GPU device will no longer be attached to it.


It is not obvious 

Re: IOAT DMA w/IOMMU

2018-08-16 Thread Logan Gunthorpe


On 16/08/18 11:16 AM, Kit Chow wrote:
> I only have access to intel hosts for testing (and possibly an AMD
> host currently collecting dust) and am not sure how to go about getting
> the proper test coverage for other architectures.

Well, I thought you were only changing the Intel IOMMU implementation...
So testing on Intel hardware seems fine to me.

> Any suggestions? Do you anticipate any issues with dma mapping the pci bar
> address in ntb_async_tx_submit on non-x86 archs? Do you know of folks
> who have ntb test setups with non-x86 hosts who might be able to help?

I expect other IOMMUs will need to be updated to properly support
dma_map_resource but that will probably need to be done on a case by
case basis as the need arises. Unfortunately it seems the work to add
dma_map_resource() didn't really consider that the IOMMUs had to have
proper support and probably should have errored out instead of just
passing along the physical address if the IOMMU didn't have support.

I don't know of anyone with an NTB setup on anything but x86 at the moment.

Logan

Re: IOAT DMA w/IOMMU

2018-08-16 Thread Kit Chow


On 08/09/2018 02:36 PM, Logan Gunthorpe wrote:


On 09/08/18 03:31 PM, Eric Pilmore wrote:

On Thu, Aug 9, 2018 at 12:35 PM, Logan Gunthorpe  wrote:

Hey,

On 09/08/18 12:51 PM, Eric Pilmore wrote:

Was wondering if anybody here has used IOAT DMA engines with an
IOMMU turned on (Xeon based system)? My specific question is really
whether it is possible to DMA (w/IOAT) to a PCI BAR address as the
destination without having to map that address to the IOVA space of
the DMA engine first (assuming the IOMMU is on)?

I haven't tested this scenario but my guess would be that IOAT would
indeed go through the IOMMU and the PCI BAR address would need to be
properly mapped into the IOAT's IOVA. The fact that you see DMAR errors
is probably a good indication that this is the case. I really don't know
why you'd want to DMA something without mapping it.

The thought was to avoid paying the price of going through yet another
translation, and also that a mapping was not believed to be necessary
anyway, since the DMA device could go straight to a PCI BAR address. We
have been playing with two DMA engines, IOAT and PLX. The PLX does not
have any issues going straight to the PCI BAR address, but unlike
IOAT, PLX is sitting "in the PCI tree".

Yes, you can do true P2P transactions with the PLX DMA engine. (Though
you do have to be careful as technically you need to translate to a PCI
bus address not the CPU physical address which are the same on x86; and
you need to watch that ACS doesn't mess it up).
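
(Concretely, that translation is the pci_bus_address() helper -- a
sketch with illustrative variables:)

	/* Address of the peer's BAR as seen from the PCI bus; this is
	 * identical to the CPU physical address on x86, but not
	 * necessarily on other platforms. */
	pci_bus_addr_t dest = pci_bus_address(peer_pdev, bar) + offset;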

It doesn't sound like it's possible to avoid the IOMMU when using the
IOAT so you need the mapping. I would not expect the extra translation
in the IOMMU to be noticeable, performance wise.

Logan

Hi Logan,

Based on Dave Jiang's suggestion, I am looking into upstreaming the
ntb change to dma map the pci bar address in ntb_async_tx_submit()
(along with the intel iommu change to support dma_map_resource).

I only have access to intel hosts for testing (and possibly an AMD
host currently collecting dust) and am not sure how to go about getting
the proper test coverage for other architectures.

Any suggestions? Do you anticipate any issues with dma mapping the pci bar
address in ntb_async_tx_submit on non-x86 archs? Do you know of folks
who have ntb test setups with non-x86 hosts who might be able to help?

Thanks
Kit



Re: [PATCH v3 1/2] iommu/arm-smmu-v3: fix unexpected CMD_SYNC timeout

2018-08-16 Thread Robin Murphy

On 2018-08-16 10:18 AM, Will Deacon wrote:

On Thu, Aug 16, 2018 at 04:21:17PM +0800, Leizhen (ThunderTown) wrote:

On 2018/8/15 20:26, Robin Murphy wrote:

On 15/08/18 11:23, Zhen Lei wrote:

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 1d64710..3f5c236 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -566,7 +566,7 @@ struct arm_smmu_device {

 	int				gerr_irq;
 	int				combined_irq;
-	atomic_t			sync_nr;
+	u32				sync_nr;

 	unsigned long			ias; /* IPA */
 	unsigned long			oas; /* PA */
@@ -775,6 +775,11 @@ static int queue_remove_raw(struct arm_smmu_queue *q, u64 *ent)
 	return 0;
 }

+static inline void arm_smmu_cmdq_sync_set_msidata(u64 *cmd, u32 msidata)


If we *are* going to go down this route then I think it would make sense
to move the msiaddr and CMDQ_SYNC_0_CS_MSI logic here as well; i.e.
arm_smmu_cmdq_build_cmd() always generates a "normal" SEV-based sync
command, then calling this guy would convert it to an MSI-based one.
As-is, having bits of mutually-dependent data handled across two
separate places just seems too messy and error-prone.


Yes. How about creating a new function "arm_smmu_cmdq_build_sync_msi_cmd"?

static inline
void arm_smmu_cmdq_build_sync_msi_cmd(u64 *cmd, struct arm_smmu_cmdq_ent *ent)
{
	cmd[0]  = FIELD_PREP(CMDQ_0_OP, ent->opcode);
	cmd[0] |= FIELD_PREP(CMDQ_SYNC_0_CS, CMDQ_SYNC_0_CS_IRQ);
	cmd[0] |= FIELD_PREP(CMDQ_SYNC_0_MSH, ARM_SMMU_SH_ISH);
	cmd[0] |= FIELD_PREP(CMDQ_SYNC_0_MSIATTR, ARM_SMMU_MEMATTR_OIWB);
	cmd[1]  = ent->sync.msiaddr & CMDQ_SYNC_1_MSIADDR_MASK;
}


None of this seems justified given the numbers from John, so please just do
the simple thing and build the command with the lock held.


Agreed - sorry if my wording was unclear, but that suggestion was only 
for the possibility of it proving genuinely worthwhile to build the 
command outside the lock. Since that isn't the case, I definitely prefer 
the simpler approach too.
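
(i.e. I'd expect the MSI-based sync path to end up as simply something
like the sketch below, with the whole command built while the lock is
held -- the shape of the thing, not the final patch:)

	spin_lock_irqsave(&smmu->cmdq.lock, flags);
	ent.sync.msidata = ++smmu->sync_nr;
	arm_smmu_cmdq_build_cmd(cmd, &ent);
	arm_smmu_cmdq_insert_cmd(smmu, cmd);
	spin_unlock_irqrestore(&smmu->cmdq.lock, flags);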


Robin.


Re: [PATCH v3 1/2] iommu/arm-smmu-v3: fix unexpected CMD_SYNC timeout

2018-08-16 Thread Will Deacon
On Thu, Aug 16, 2018 at 04:21:17PM +0800, Leizhen (ThunderTown) wrote:
> On 2018/8/15 20:26, Robin Murphy wrote:
> > On 15/08/18 11:23, Zhen Lei wrote:
> >> diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
> >> index 1d64710..3f5c236 100644
> >> --- a/drivers/iommu/arm-smmu-v3.c
> >> +++ b/drivers/iommu/arm-smmu-v3.c
> >> @@ -566,7 +566,7 @@ struct arm_smmu_device {
> >>
> >>  	int				gerr_irq;
> >>  	int				combined_irq;
> >> -	atomic_t			sync_nr;
> >> +	u32				sync_nr;
> >>
> >>  	unsigned long			ias; /* IPA */
> >>  	unsigned long			oas; /* PA */
> >> @@ -775,6 +775,11 @@ static int queue_remove_raw(struct arm_smmu_queue *q, u64 *ent)
> >>  	return 0;
> >>  }
> >>
> >> +static inline void arm_smmu_cmdq_sync_set_msidata(u64 *cmd, u32 msidata)
> > 
> > If we *are* going to go down this route then I think it would make sense
> > to move the msiaddr and CMDQ_SYNC_0_CS_MSI logic here as well; i.e.
> > arm_smmu_cmdq_build_cmd() always generates a "normal" SEV-based sync
> > command, then calling this guy would convert it to an MSI-based one.
> > As-is, having bits of mutually-dependent data handled across two
> > separate places just seems too messy and error-prone.
> 
> Yes. How about creating a new function "arm_smmu_cmdq_build_sync_msi_cmd"?
> 
> static inline
> void arm_smmu_cmdq_build_sync_msi_cmd(u64 *cmd, struct arm_smmu_cmdq_ent *ent)
> {
>   cmd[0]  = FIELD_PREP(CMDQ_0_OP, ent->opcode);
>   cmd[0] |= FIELD_PREP(CMDQ_SYNC_0_CS, CMDQ_SYNC_0_CS_IRQ);
>   cmd[0] |= FIELD_PREP(CMDQ_SYNC_0_MSH, ARM_SMMU_SH_ISH);
>   cmd[0] |= FIELD_PREP(CMDQ_SYNC_0_MSIATTR, ARM_SMMU_MEMATTR_OIWB);
>   cmd[1]  = ent->sync.msiaddr & CMDQ_SYNC_1_MSIADDR_MASK;
> }

None of this seems justified given the numbers from John, so please just do
the simple thing and build the command with the lock held.

Will


Re: [PATCH v3 1/2] iommu/arm-smmu-v3: fix unexpected CMD_SYNC timeout

2018-08-16 Thread Leizhen (ThunderTown)

On 2018/8/15 20:26, Robin Murphy wrote:
> On 15/08/18 11:23, Zhen Lei wrote:
>> The condition "(int)(VAL - sync_idx) >= 0" to break the loop in function
>> __arm_smmu_sync_poll_msi requires that sync_idx must be increased
>> monotonically according to the sequence of the CMDs in the cmdq.
>>
>> But ".msidata = atomic_inc_return_relaxed(&smmu->sync_nr)" is not protected
>> by the spinlock, so the following scenario may appear:
>>     cpu0                        cpu1
>>     msidata=0
>>                                 msidata=1
>>                                 insert cmd1
>>     insert cmd0
>>                                 smmu execute cmd1
>>     smmu execute cmd0
>>                                 poll timeout, because msidata=1 is
>>                                 overridden by cmd0, that means VAL=0,
>>                                 sync_idx=1.
>>
>> This is not a functional problem; it just makes the caller wait for a
>> long time, until TIMEOUT. It rarely happens, because any other CMD_SYNC
>> during the waiting period will break it.
>>
>> Signed-off-by: Zhen Lei 
>> ---
>>   drivers/iommu/arm-smmu-v3.c | 12 
>>   1 file changed, 8 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
>> index 1d64710..3f5c236 100644
>> --- a/drivers/iommu/arm-smmu-v3.c
>> +++ b/drivers/iommu/arm-smmu-v3.c
>> @@ -566,7 +566,7 @@ struct arm_smmu_device {
>>
>>  	int				gerr_irq;
>>  	int				combined_irq;
>> -	atomic_t			sync_nr;
>> +	u32				sync_nr;
>>
>>  	unsigned long			ias; /* IPA */
>>  	unsigned long			oas; /* PA */
>> @@ -775,6 +775,11 @@ static int queue_remove_raw(struct arm_smmu_queue *q, u64 *ent)
>>  	return 0;
>>  }
>>
>> +static inline void arm_smmu_cmdq_sync_set_msidata(u64 *cmd, u32 msidata)
> 
> If we *are* going to go down this route then I think it would make sense to 
> move the msiaddr and CMDQ_SYNC_0_CS_MSI logic here as well; i.e. 
> arm_smmu_cmdq_build_cmd() always generates a "normal" SEV-based sync command, 
> then calling this guy would convert it to an MSI-based one. As-is, having 
> bits of mutually-dependent data handled across two separate places just seems 
> too messy and error-prone.

Yes. How about creating a new function "arm_smmu_cmdq_build_sync_msi_cmd"?

static inline
void arm_smmu_cmdq_build_sync_msi_cmd(u64 *cmd, struct arm_smmu_cmdq_ent *ent)
{
	cmd[0]  = FIELD_PREP(CMDQ_0_OP, ent->opcode);
	cmd[0] |= FIELD_PREP(CMDQ_SYNC_0_CS, CMDQ_SYNC_0_CS_IRQ);
	cmd[0] |= FIELD_PREP(CMDQ_SYNC_0_MSH, ARM_SMMU_SH_ISH);
	cmd[0] |= FIELD_PREP(CMDQ_SYNC_0_MSIATTR, ARM_SMMU_MEMATTR_OIWB);
	cmd[1]  = ent->sync.msiaddr & CMDQ_SYNC_1_MSIADDR_MASK;
}


> 
> That said, I still don't think that just building the whole command under the 
> lock is really all that bad - even when it doesn't get optimised into one of 
> the assignments that memset you call out is only a single "stp xzr, xzr, 
> ...", and a couple of extra branches doesn't seem a huge deal compared to the 
> DSB and MMIO accesses (and potentially polling) that we're about to do 
> anyway. I've tried hacking things up enough to convince GCC to inline a 
> specialisation of the relevant switch case when ent->opcode is known, and 
> that reduces the "overhead" down to just a handful of ALU instructions. I 
> still need to try cleaning said hack up and double-check that it doesn't have 
> any adverse impact on all the other SMMUv3 stuff in development, but watch 
> this space...
> 
> Robin.
> 
>> +{
>> +	cmd[0] |= FIELD_PREP(CMDQ_SYNC_0_MSIDATA, msidata);
>> +}
>> +
>>   /* High-level queue accessors */
>>   static int arm_smmu_cmdq_build_cmd(u64 *cmd, struct arm_smmu_cmdq_ent *ent)
>>   {
>> @@ -836,7 +841,6 @@ static int arm_smmu_cmdq_build_cmd(u64 *cmd, struct arm_smmu_cmdq_ent *ent)
>>   		cmd[0] |= FIELD_PREP(CMDQ_SYNC_0_CS, CMDQ_SYNC_0_CS_SEV);
>>   		cmd[0] |= FIELD_PREP(CMDQ_SYNC_0_MSH, ARM_SMMU_SH_ISH);
>>   		cmd[0] |= FIELD_PREP(CMDQ_SYNC_0_MSIATTR, ARM_SMMU_MEMATTR_OIWB);
>> -		cmd[0] |= FIELD_PREP(CMDQ_SYNC_0_MSIDATA, ent->sync.msidata);
>>   		cmd[1] |= ent->sync.msiaddr & CMDQ_SYNC_1_MSIADDR_MASK;
>>   		break;
>>   	default:
>> @@ -947,7 +951,6 @@ static int __arm_smmu_cmdq_issue_sync_msi(struct arm_smmu_device *smmu)
>>   	struct arm_smmu_cmdq_ent ent = {
>>   		.opcode = CMDQ_OP_CMD_SYNC,
>>   		.sync	= {
>> -			.msidata = atomic_inc_return_relaxed(&smmu->sync_nr),
>>   			.msiaddr = virt_to_phys(&smmu->sync_count),
>>   		},
>>   	};
>> @@ -955,6 +958,8 @@ static int __arm_smmu_cmdq_issue_sync_msi(struct arm_smmu_device *smmu)
>>   	arm_smmu_cmdq_build_cmd(cmd, &ent);
>>
>>   	spin_lock_irqsave(&smmu->cmdq.lock, flags);
>> +	ent.sync.msidata = ++smmu->sync_nr;
>> +	arm_smmu_cmdq_sync_set_msidata(cmd, ent.sync.msidata);
>>   	arm_smmu_cmdq_insert_cmd(smmu, cmd);
>>   	spin_unlock_irqrestore(&smmu->cmdq.lock, flags);
>>
>> @@ -2179,7 +2184,6 @@ static int arm_smmu_init_structures(struct arm_smmu_device *smmu)