Re: [PATCH v4 00/14] x86: Trenchboot secure dynamic launch Linux kernel support

2021-12-02 Thread Daniel P. Smith
Hi Paul!

On 11/30/21 8:06 PM, Paul Moore wrote:
> On Fri, Aug 27, 2021 at 9:20 AM Ross Philipson
>  wrote:
>>
>> The larger focus of the Trechboot project (https://github.com/TrenchBoot) is 
>> to
>> enhance the boot security and integrity in a unified manner. The first area 
>> of
>> focus has been on the Trusted Computing Group's Dynamic Launch for 
>> establishing
>> a hardware Root of Trust for Measurement, also know as DRTM (Dynamic Root of
>> Trust for Measurement).
> 
> My apologies for such a late reply, but I'm just getting around to
> looking at this and I have a few questions on the basic design/flow
> (below) ...

No worries, thank you so much for taking the time to review.

>> The basic flow is:
>>
>>  - Entry from the dynamic launch jumps to the SL stub
> 
> So I'm clear, at this point the combined stub+kernel+initramfs+cmdline
> image has already been loaded into memory and the SL stub is
> executing, yes?

That is correct.

> As TrenchBoot seems to be focused on boot measurement and not
> enforcing policy, I'm guessing this is considered out-of-scope (not to
> mention that the combined stub+kernel image makes this less
> interesting), but has any thought been given to leveraging the TXT
> launch control policy, or is it simply an empty run-everything policy?

The TrenchBoot model is a bit different and takes a more flexible
approach to allow users to build tailored solutions. For instance Secure
Launch is able to be used in a configuration that is similar to tboot.
Consider the functions of tboot, it has a portion that is the
post-launch kernel that handles the handover from the ACM and a portion
that provides the Verified Launch policy engine, which is only capable
of enforcing policy on what is contained in the Multiboot chain. The
TrenchBoot approach is to introduce the Secure Launch capability into a
kernel, in this case Linux, to handle the handover from the ACM, and
then transition to a running user space that can contain a distribution
specific policy enforcement. As an example, the TrenchBoot project
contributed to the uroot project a Secure Launch policy engine which
enables the creation of an initramfs image which can then be embedded
into a minimal configuration Secure Launch Linux kernel. This results in
a single binary that functions like tboot but with a far richer and more
extensible policy engine.

With regard to TXT's Launch Control Policy, it is a function of SINIT
ACM and so it is still very much possible to be used with Secure Launch.
In fact such a configuration has been tested and used. Now it is out of
scope in the sense that the tboot project already maintains and provides
the lcptools suite for managing LCPs. If there is a requirement to use
an LCP, then the lcptools can be used to create a policy to only allow
the specific instance(s) of a Secure Launch kernel.

>>  - SL stub fixes up the world on the BSP
>>  - For TXT, SL stub wakes the APs, fixes up their worlds
>>  - For TXT, APs are left halted waiting for an NMI to wake them
>>  - SL stub jumps to startup_32
>>  - SL main locates the TPM event log and writes the measurements of
>>configuration and module information into it.
> 
> Since the stub+kernel image are combined it looks like the kernel
> measurement comes from the ACM via the MLE measurement into PCR 18,
> while the stub generated measurements are extended into PCR 19 or 20
> depending on the configuration, yes?

If TXT is launched in Legacy PCR usage mode, then the bzImage, as loaded
into memory by the bootloader (GRUB), will be hashed into PCR 18. If it
is launched in the default Details and Authorities (DA) PCR usage mode,
then the bzImage will be hashed into PCR 17. This is because the kernel
has been promoted to being the MLE.

> I realize that moving the TXT code into the kernel makes this
> difficult (not possible?), but one of the things that was nice about
> the tboot based approach (dynamic, early launch) was that it could be
> extended to do different types of measurements, e.g. a signing
> authority measurement similar to UEFI Secure Boot and PCR 7.  If that
> is possible, I think it is something worth including in the design,
> even if it isn't initially implemented.  The only thing that
> immediately comes to mind would be a section/region based approach
> similar to systemd-boot/gummiboot where the (signed) kernel is kept in
> a well known region and verified/measured by the stub prior to jumping
> into its start point.

Revisiting the tboot-like configuration: the Secure Launch kernel, its
configuration (cmdline), and uroot initramfs (which may be embedded or
separate) are all part of the MLE started by the ACM. For tboot there is
the tboot binary and the VL policy, though uncertain if it was
configurable where the policy hash would be extended. Like the tboot VL
policy engine, the u-root policy engine is configurable where the
measurements are stored.

As highlighted, more components are measured as part of Secure Launch
than 

Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-02 Thread Thomas Gleixner
Jason,

On Thu, Dec 02 2021 at 09:55, Jason Gunthorpe wrote:
> On Thu, Dec 02, 2021 at 01:01:42AM +0100, Thomas Gleixner wrote:
>> On Wed, Dec 01 2021 at 21:21, Thomas Gleixner wrote:
>> > On Wed, Dec 01 2021 at 14:14, Jason Gunthorpe wrote:
>> > Which in turn is consistent all over the place and does not require any
>> > special case for anything. Neither for interrupts nor for anything else.
>> 
>> that said, feel free to tell me that I'm getting it all wrong.
>> 
>> The reason I'm harping on this is that we are creating ABIs on several
>> ends and we all know that getting that wrong is a major pain.
>
> I don't really like coupling the method to fetch IRQs with needing
> special struct devices. Struct devices have a sysfs presence and it is
> not always appropriate to create sysfs stuff just to allocate some
> IRQs.
>
> A queue is simply not a device, it doesn't make any sense. A queue is
> more like a socket().

Let's put the term 'device' for a bit please. 

> That said, we often have enough struct devices floating about to make
> this work. Between netdev/ib_device/aux device/mdev we can use them to
> do this.
>
> I think it is conceptual nonsense to attach an IMS IRQ domain to a
> netdev or a cdev, but it will solve this problem.

The IMS irqdomain is _NOT_ part of the netdev or cdev or whatever. I
explained that several times now.

We seem to have a serious problem of terminology and the understanding
of topology which is why we continue to talk past each other forever.

Let's take a big step back and clarify these things first.

An irq domain is an entity which has two parts:

   1) The resource management part
   
   2) An irqchip which knows how to access the actual hardware
  implementation for the various operations related to irq chips
  (mask, unmask, ack, eoi, )

irq domains are used for uniform resource spaces which have a uniform
hardware interrupt chip. Modern computer systems have several resource
levels which form a hierarchy of several irqdomain levels.

The root of that hierarchy is at the CPU level:

 |---|
 | CPU | BUS |
 | |-|
 | | IRQ |
 |

The CPU exposes the bus and a mechanism to raise interrupts in the CPU
from external components. Whether those interrupts are raised by wire or
by writing a message to a given address is an implementation detail.

So the root IRQ domain is the one which handles the CPU level interrupt
controller and manages the associated resources: The per CPU vector
space on x86, the global vector space on ARM64.

If there's an IOMMU in the system them the IOMMU is the next level
interrupt controller:

1) It manages the translation table resources

2) Provides an interrupt chip which knows how to access the
   table entries

Of course there can be several IOMMUs depending on the system topology,
but from an irq domain view they are at the same level.

On top of that we have on x86 the next level irq domains:

   IO/APIC, HPET and PCI/MSI

which again do resource management and provide an irqchip which
handles the underlying hardware implementation.

So the full hierarchy so far looks like this on X86

   VECTOR --|
|-- IOMMU --|
|-- IO/APIC
|-- HPET
|-- PCI/MSI

So allocating an interrupt on the outer level which is exposed to
devices requires allocation through the hierarchy down to the root
domain (VECTOR). The reason why we have this hierarchy is that this
makes it trivial to support systems with and without IOMMU because on a
system w/o IOMMU the outer domains have the vector domain as their
parent. ARM(64) has even deeper hierarchies, but operates on the same
principle.

Let's look at the PCI/MSI domain more detailed:

  It's either one irqdomain per system or one irqdomain per IOMMU domain

  The PCI/MCI irqdomain does not really do resource allocations, it's
  all delegated down the hierarchy.

  It provides also an interrupt chip which knows to access the MSI[X]
  entries and can handle masking etc.

Now you might ask how the MSI descriptors come into play. The MSI
descriptors are allocated by the PCI/MSI interface functions in order to
enable the PCI/MSI irqdomain to actually allocate resources.

They could be allocated as part of the irqdomain allocation mechanism
itself. That's what I do now for simple irq domains which only have a
linear MSI descriptor space where the only information in the descriptor
is the index and not all the PCI/MSI tidbits.

It's kinda blury, but that's moslty due to the fact that we still have
to support the legacy PCI/MSI implementations on architectures which do
not use hierarchical irq domains.

The MSI descriptors are stored in device->msi.data.store. That makes
sense because they are allocated on behalf of that device. But there is
no requirement to store them there. We could just store a cookie in the
device and have some central storage. It's an 

Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-02 Thread Jason Gunthorpe via iommu
On Thu, Dec 02, 2021 at 08:25:48PM +0100, Thomas Gleixner wrote:
> Jason,
> 
> On Thu, Dec 02 2021 at 09:55, Jason Gunthorpe wrote:
> > On Thu, Dec 02, 2021 at 01:01:42AM +0100, Thomas Gleixner wrote:
> >> On Wed, Dec 01 2021 at 21:21, Thomas Gleixner wrote:
> >> > On Wed, Dec 01 2021 at 14:14, Jason Gunthorpe wrote:
> >> > Which in turn is consistent all over the place and does not require any
> >> > special case for anything. Neither for interrupts nor for anything else.
> >> 
> >> that said, feel free to tell me that I'm getting it all wrong.
> >> 
> >> The reason I'm harping on this is that we are creating ABIs on several
> >> ends and we all know that getting that wrong is a major pain.
> >
> > I don't really like coupling the method to fetch IRQs with needing
> > special struct devices. Struct devices have a sysfs presence and it is
> > not always appropriate to create sysfs stuff just to allocate some
> > IRQs.
> >
> > A queue is simply not a device, it doesn't make any sense. A queue is
> > more like a socket().
> 
> Let's put the term 'device' for a bit please. 
> 
> > That said, we often have enough struct devices floating about to make
> > this work. Between netdev/ib_device/aux device/mdev we can use them to
> > do this.
> >
> > I think it is conceptual nonsense to attach an IMS IRQ domain to a
> > netdev or a cdev, but it will solve this problem.
> 
> The IMS irqdomain is _NOT_ part of the netdev or cdev or whatever. I
> explained that several times now.
> 
> We seem to have a serious problem of terminology and the understanding
> of topology which is why we continue to talk past each other forever.

I think I understand and agree with everything you said below.

The point we diverge is where to put the vector storage:
 
> Of course we can store them in pci_dev.dev.msi.data.store. Either with a
> dedicated xarray or by partitioning the xarray space. Both have their
> pro and cons.

This decision seems to drive the question of how many 'struct devices'
do we need, and where do we get them..

> But what I really struggle with is how this is further represented when
> the queues are allocated for VFIO, cdev, whatever usage.

Yes, this seems to be the primary question

> The fact that the irqdomain is instantiated by the device driver does
> not make it any different. As explained above it's just an
> implementation detail which makes it possible to handle the device
> specific message storage requirement in a sane way. The actual interrupt
> resources come still from the underlying irqdomains as for PCI/MSI.

Yes! This is not under debate

> Now I was looking for a generic representation of such a container and
> my initial reaction was to bind it to a struct device, which also makes
> it trivial to store these MSI descriptors in that struct device.

Right, I've been trying to argue around just this choice.
 
> I can understand your resistance against that to some extent, but I
> think we really have to come up with a proper abstract representation
> for these.

Ok

> Such a logical function would be the entity to hand out for VFIO or
> cdev.

What is a logical function, concretely? 

Does it have struct device?

Can I instead suggest a name like 'message interrupt table' ?

Ie a device has two linearly indexed message interrupt tables - the
PCI SIG defined MSI/MSI-X one created by the PCI core and the IMS one
created by the driver.

Both start at 0 index and they have different irq_domains.

Instead of asking the driver to create a domain we ask the driver to
create a new 'message interrupt table'. The driver provides the
irq_chip to program the messages and the pci_device. The core code
manages the irq domain setup.

Using what you say below:

> If this is not split out, then every driver and wrapper has to come up
> with it's own representation of this instead of being able to do:
> 
>  request_irq(msi_get_virq(lfunc, idx=0), handler0, ...);
>  request_irq(msi_get_virq(lfunc, idx=1), handler1, ...);

We could say:
  msi_get_virq(device.pci_msi_table, index=0)

Is the 0th PCI SIG MSI vector

Something like:

 ims_table = pci_create_msi_table(pci_dev, my_irq_chip,..)
 msi_get_virq(ims_table, index=0)

Is the 0th IMS vector

Is it close to what you are thinking with lfunc?

Regards,
Jason
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-02 Thread Thomas Gleixner
Jason,

On Thu, Dec 02 2021 at 16:00, Jason Gunthorpe wrote:
> On Thu, Dec 02, 2021 at 08:25:48PM +0100, Thomas Gleixner wrote:
>> We seem to have a serious problem of terminology and the understanding
>> of topology which is why we continue to talk past each other forever.
>
> I think I understand and agree with everything you said below.

Good!

> The point we diverge is where to put the vector storage:

Kinda. The vector, i.e. message storage is either:

  - MSI entry in the PCI config space
  - MSI-X table in the PCI config space
  - Device specific IMS storage

The software representation aka struct msi_desc is a different
story. That's what we are debating.

>> Of course we can store them in pci_dev.dev.msi.data.store. Either with a
>> dedicated xarray or by partitioning the xarray space. Both have their
>> pro and cons.
>
> This decision seems to drive the question of how many 'struct devices'
> do we need, and where do we get them..

Not really. There is nothing what enforces to make the MSI irqdomain
storage strictly hang off struct device. There has to be a connection to
a struct device in some way obviously to make IOMMU happy.

>> Such a logical function would be the entity to hand out for VFIO or
>> cdev.
>
> What is a logical function, concretely?

That's a name I came up with for a abstract representation of such a
queue container. I came up with that as a obvious consequence of my
previous reasoning about PF -> VF -> XF.

> Does it have struct device?

It does not necessarily have to .

> Can I instead suggest a name like 'message interrupt table' ?

Well yes, but that's not what I meant. See below.

> Ie a device has two linearly indexed message interrupt tables - the
> PCI SIG defined MSI/MSI-X one created by the PCI core and the IMS one
> created by the driver.
>
> Both start at 0 index and they have different irq_domains.
>
> Instead of asking the driver to create a domain we ask the driver to
> create a new 'message interrupt table'. The driver provides the
> irq_chip to program the messages and the pci_device. The core code
> manages the irq domain setup.
>
> Using what you say below:
>
>> If this is not split out, then every driver and wrapper has to come up
>> with it's own representation of this instead of being able to do:
>> 
>>  request_irq(msi_get_virq(lfunc, idx=0), handler0, ...);
>>  request_irq(msi_get_virq(lfunc, idx=1), handler1, ...);
>
> We could say:
>   msi_get_virq(device.pci_msi_table, index=0)
>
> Is the 0th PCI SIG MSI vector
>
> Something like:
>
>  ims_table = pci_create_msi_table(pci_dev, my_irq_chip,..)
>  msi_get_virq(ims_table, index=0)

Which is pretty much a wrapper around two different irqdomains for the
device and either partitioned index space in the xarray or two xarrays.

Just badly named because the table itself is where the resulting message
is stored, which is composed with the help of the relevant MSI
descriptor. See above.

We really should not try to make up an artifical table representation
for something which does not necessarily have a table at all, i.e. the
devices you talk about which store the message in queue specific system
memory. Pretending that this is a table is just silly.

Also I disagree that this has to be tied to a PCI specific interface,
except for creating a PCI specific wrapper for it to not make a driver
developer have to write '>dev', which is the very least of our
problems.

IMS as a technical concept is absolutely not PCI specific at all and not
invented by PCI/SIG. It's a marketing brandname for something which
existed way before they thought about it: Message signaled interrupts.

Aside of that 'my_irq_chip' does not cut it at all because of the way
how the resulting messages are stored. IDXD has IOMEM storage and a
storage space limitation while your device uses system memory storage
and has other limitations, i.e. system memory and the number of queues
the device can provide.

An irqchip is a just set of functions to talk to hardware either
directly or via some indirect transport (I2C, SPI, MLX queue management
magic...). These functions require irqdomain and/or device specific
information to function.

Trying to create a universal pci_create_foo() wrapper around this is
going to be like the 13th Herkulean task.

Seriously, you cannot make something uniform which is by definition
non-uniform.

Let's not even try to pretend that it is possible.

> Is the 0th IMS vector
>
> Is it close to what you are thinking with lfunc?

Not really. I was really reasoning about an abstract representation for
a functional queue, which is more than just a queue allocated from the
PF or VF device.

I really meant a container like this:

struct logical_function {
/* Pointer to the physical device */
struct device   *phys_device;
/* MSI descriptor storage */
struct msi_data msi;
/* The queue number */
unsigned intqueue_nr;
/* Add more information 

Re: [PATCH v3 2/5] iommu/virtio: Support bypass domains

2021-12-02 Thread Eric Auger



On 12/1/21 6:33 PM, Jean-Philippe Brucker wrote:
> The VIRTIO_IOMMU_F_BYPASS_CONFIG feature adds a new flag to the ATTACH
> request, that creates a bypass domain. Use it to enable identity
> domains.
>
> When VIRTIO_IOMMU_F_BYPASS_CONFIG is not supported by the device, we
> currently fail attaching to an identity domain. Future patches will
> instead create identity mappings in this case.
>
> Reviewed-by: Kevin Tian 
> Signed-off-by: Jean-Philippe Brucker 
Reviewed-by: Eric Auger 

Eric
> ---
>  drivers/iommu/virtio-iommu.c | 20 +++-
>  1 file changed, 19 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio-iommu.c
> index 80930ce04a16..14dfee76fd19 100644
> --- a/drivers/iommu/virtio-iommu.c
> +++ b/drivers/iommu/virtio-iommu.c
> @@ -71,6 +71,7 @@ struct viommu_domain {
>   struct rb_root_cached   mappings;
>  
>   unsigned long   nr_endpoints;
> + boolbypass;
>  };
>  
>  struct viommu_endpoint {
> @@ -587,7 +588,9 @@ static struct iommu_domain *viommu_domain_alloc(unsigned 
> type)
>  {
>   struct viommu_domain *vdomain;
>  
> - if (type != IOMMU_DOMAIN_UNMANAGED && type != IOMMU_DOMAIN_DMA)
> + if (type != IOMMU_DOMAIN_UNMANAGED &&
> + type != IOMMU_DOMAIN_DMA &&
> + type != IOMMU_DOMAIN_IDENTITY)
>   return NULL;
>  
>   vdomain = kzalloc(sizeof(*vdomain), GFP_KERNEL);
> @@ -630,6 +633,17 @@ static int viommu_domain_finalise(struct viommu_endpoint 
> *vdev,
>   vdomain->map_flags  = viommu->map_flags;
>   vdomain->viommu = viommu;
>  
> + if (domain->type == IOMMU_DOMAIN_IDENTITY) {
> + if (!virtio_has_feature(viommu->vdev,
> + VIRTIO_IOMMU_F_BYPASS_CONFIG)) {
> + ida_free(>domain_ids, vdomain->id);
> + vdomain->viommu = NULL;
> + return -EOPNOTSUPP;
> + }
> +
> + vdomain->bypass = true;
> + }
> +
>   return 0;
>  }
>  
> @@ -691,6 +705,9 @@ static int viommu_attach_dev(struct iommu_domain *domain, 
> struct device *dev)
>   .domain = cpu_to_le32(vdomain->id),
>   };
>  
> + if (vdomain->bypass)
> + req.flags |= cpu_to_le32(VIRTIO_IOMMU_ATTACH_F_BYPASS);
> +
>   for (i = 0; i < fwspec->num_ids; i++) {
>   req.endpoint = cpu_to_le32(fwspec->ids[i]);
>  
> @@ -1132,6 +1149,7 @@ static unsigned int features[] = {
>   VIRTIO_IOMMU_F_DOMAIN_RANGE,
>   VIRTIO_IOMMU_F_PROBE,
>   VIRTIO_IOMMU_F_MMIO,
> + VIRTIO_IOMMU_F_BYPASS_CONFIG,
>  };
>  
>  static struct virtio_device_id id_table[] = {

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH] mm: kmemleak: Ignore kmemleak scanning on CMA regions

2021-12-02 Thread Catalin Marinas
On Sun, Nov 28, 2021 at 09:50:53AM +0800, Calvin Zhang wrote:
> On Sat, Nov 27, 2021 at 04:07:18PM -0800, Andrew Morton wrote:
> >On Fri, 26 Nov 2021 10:47:11 +0800 Calvin Zhang  
> >wrote:
> >> Just like this:
> >> commit 620951e27457 ("mm/cma: make kmemleak ignore CMA regions").
> >> 
> >> Add kmemleak_ignore_phys() for CMA created from of reserved node.
[...]
> >The 620951e27457 changelog says "Without this, the kernel crashes...". 
> >Does your patch also fix a crash?  If so under what circumstances and
> >should we backport this fix into -stable kernels?
> 
> No crash occurred. 620951e27457 avoids crashes caused by accessing
> highmem and it was fixed later. Now kmemleak_alloc_phys() and
> kmemleak_ignore_phys() skip highmem. This patch is based on the
> point that CMA regions don't contain pointers to other kmemleak
> objects, and ignores CMA regions from reserved memory as what
> 620951e27457 did.

Note that kmemleak_ignore() only works if there was a prior
kmemleak_alloc() on that address range. With the previous commit we get
this via the memblock_alloc_range() but I fail to see one on the
rmem_cma_setup() path.

-- 
Catalin
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v4 04/14] Documentation/x86: Secure Launch kernel documentation

2021-12-02 Thread Robin Murphy

On 2021-08-27 14:28, Ross Philipson wrote:
[...]

+IOMMU Configuration
+---
+
+When doing a Secure Launch, the IOMMU should always be enabled and the drivers
+loaded. However, IOMMU passthrough mode should never be used. This leaves the
+MLE completely exposed to DMA after the PMR's [2]_ are disabled. First, IOMMU
+passthrough should be disabled by default in the build configuration::
+
+  "Device Drivers" -->
+  "IOMMU Hardware Support" -->
+  "IOMMU passthrough by default [ ]"
+
+This unset the Kconfig value CONFIG_IOMMU_DEFAULT_PASSTHROUGH.


Note that the config structure has now changed, and if set, passthrough 
is deselected by choosing a different default domain type.



+In addition, passthrough must be disabled on the kernel command line when doing
+a Secure Launch as follows::
+
+  iommu=nopt iommu.passthrough=0


This part is a bit silly - those options are literally aliases for the 
exact same thing, and furthermore if the config is already set as 
required then the sole effect either of them will have is to cause "(set 
by kernel command line)" to be printed. There is no value in explicitly 
overriding the default value with the default value - if anyone can 
append an additional "iommu.passthrough=1" (or "iommu=pt") to the end of 
the command line they'll still win.


Robin.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v3 1/5] iommu/virtio: Add definitions for VIRTIO_IOMMU_F_BYPASS_CONFIG

2021-12-02 Thread Eric Auger
Hi Jean,

On 12/1/21 6:33 PM, Jean-Philippe Brucker wrote:
> Add definitions for the VIRTIO_IOMMU_F_BYPASS_CONFIG, which supersedes
> VIRTIO_IOMMU_F_BYPASS.
>
> Reviewed-by: Kevin Tian 
> Signed-off-by: Jean-Philippe Brucker 
Reviewed-by: Eric Auger 

Eric
> ---
>  include/uapi/linux/virtio_iommu.h | 8 +++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/include/uapi/linux/virtio_iommu.h 
> b/include/uapi/linux/virtio_iommu.h
> index 237e36a280cb..1ff357f0d72e 100644
> --- a/include/uapi/linux/virtio_iommu.h
> +++ b/include/uapi/linux/virtio_iommu.h
> @@ -16,6 +16,7 @@
>  #define VIRTIO_IOMMU_F_BYPASS3
>  #define VIRTIO_IOMMU_F_PROBE 4
>  #define VIRTIO_IOMMU_F_MMIO  5
> +#define VIRTIO_IOMMU_F_BYPASS_CONFIG 6
>  
>  struct virtio_iommu_range_64 {
>   __le64  start;
> @@ -36,6 +37,8 @@ struct virtio_iommu_config {
>   struct virtio_iommu_range_32domain_range;
>   /* Probe buffer size */
>   __le32  probe_size;
> + __u8bypass;
> + __u8reserved[3];
>  };
>  
>  /* Request types */
> @@ -66,11 +69,14 @@ struct virtio_iommu_req_tail {
>   __u8reserved[3];
>  };
>  
> +#define VIRTIO_IOMMU_ATTACH_F_BYPASS (1 << 0)
> +
>  struct virtio_iommu_req_attach {
>   struct virtio_iommu_req_headhead;
>   __le32  domain;
>   __le32  endpoint;
> - __u8reserved[8];
> + __le32  flags;
> + __u8reserved[4];
>   struct virtio_iommu_req_tailtail;
>  };
>  

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-02 Thread Jason Gunthorpe via iommu
On Thu, Dec 02, 2021 at 11:31:11PM +0100, Thomas Gleixner wrote:

> The software representation aka struct msi_desc is a different
> story. That's what we are debating.

Okay, I did mean msi_desc storage, so we are talking about the same thigns

> >> Of course we can store them in pci_dev.dev.msi.data.store. Either with a
> >> dedicated xarray or by partitioning the xarray space. Both have their
> >> pro and cons.
> >
> > This decision seems to drive the question of how many 'struct devices'
> > do we need, and where do we get them..
> 
> Not really. There is nothing what enforces to make the MSI irqdomain
> storage strictly hang off struct device. There has to be a connection to
> a struct device in some way obviously to make IOMMU happy.

Yes, I thought this too, OK we agree

> Just badly named because the table itself is where the resulting message
> is stored, which is composed with the help of the relevant MSI
> descriptor. See above.

I picked the name because it looks like it will primarily contain an
xarray and the API you suggested to query it is idx based. To me that
is a table. A table of msi_desc storage to be specific.

It seems we have some agreement here as your lfunc also included the
same xarray and uses an idx as part of the API, right?

> We really should not try to make up an artifical table representation
> for something which does not necessarily have a table at all, i.e. the
> devices you talk about which store the message in queue specific system
> memory. Pretending that this is a table is just silly.

Now I'm a bit confused, what is the idx in the lfunc? 

I think this is language again, I would call idx an artificial table
representation?

> Also I disagree that this has to be tied to a PCI specific interface,
> except for creating a PCI specific wrapper for it to not make a driver
> developer have to write '>dev', which is the very least of our
> problems.

Agreed, just trying to be illustrative at a higher level
 
> Aside of that 'my_irq_chip' does not cut it at all because of the way
> how the resulting messages are stored. IDXD has IOMEM storage and a
> storage space limitation while your device uses system memory storage
> and has other limitations, i.e. system memory and the number of queues
> the device can provide.

Okay, so the device must setup the domain because it must provide
information during setup. Makes sense

> Not really. I was really reasoning about an abstract representation for
> a functional queue, which is more than just a queue allocated from the
> PF or VF device.
> 
> I really meant a container like this:
> 
> struct logical_function {
> /* Pointer to the physical device */
> struct device *phys_device;
> /* MSI descriptor storage */
>   struct msi_data msi;

Up to here is basically what I thought the "message table" would
contain. Possibly a pointer to the iommu_domain too?

> /* The queue number */
> unsigned int  queue_nr;
> /* Add more information which is common to these things */

Not sure why do we need this?

Lets imagine a simple probe function for a simple timer device. It has
no cdev, vfio, queues, etc. However, the device only supports IMS. No
MSI, MSI-X, nothing else.

What does the probe function look like to get to request_irq()?

Why does this simple driver create something called a 'logical
function' to access its only interrupt?

I think you have the right idea here, just forget about VFIO, the IDXD
use case, all of it. Provide a way to use IMS cleanly and concurrently
with MSI.

Do that and everything else should have sane solutions too.

> The idea is to have a common representation for these type of things
> which allows:
> 
>  1) Have common code for exposing queues to VFIO, cdev, sysfs...
> 
> You still need myqueue specific code, but the common stuff which is
> in struct logical_function can be generic and device independent.

I will quote you: "Seriously, you cannot make something uniform which
is by definition non-uniform." :)

We will find there is no common stuff here, we did this exercise
already when we designed struct auxiliary_device, and came up
empty. 

>  2) Having the MSI storage per logical function (queue) allows to have
> a queue relative 0 based MSI index space.

Can you explain a bit what you see 0 meaning in this idx numbering?

I also don't understand what 'queue relative' means?

> The actual index in the physical table (think IMS) would be held in
> the msi descriptor itself.

This means that a device like IDXD would store the phsical IMS table
entry # in the msi descriptor? What is idx then?

For a device like IDXD with a simple linear table, how does the driver
request a specific entry in the IMS table? eg maybe IMS table entry #0
is special like we often see in MSI?

> msi_get_virq(>lfunc.msi, idx = 0)
>
> v.s.
>
> idx = myqueue->msidx[0];
> msi_get_virq(pcidev->dev, idx);
 
>

Re: [PATCH] mm: kmemleak: Ignore kmemleak scanning on CMA regions

2021-12-02 Thread Calvin Zhang
On Thu, Dec 02, 2021 at 06:11:12PM +, Catalin Marinas wrote:
>On Sun, Nov 28, 2021 at 09:50:53AM +0800, Calvin Zhang wrote:
>> On Sat, Nov 27, 2021 at 04:07:18PM -0800, Andrew Morton wrote:
>> >On Fri, 26 Nov 2021 10:47:11 +0800 Calvin Zhang 
>> > wrote:
>> >> Just like this:
>> >> commit 620951e27457 ("mm/cma: make kmemleak ignore CMA regions").
>> >> 
>> >> Add kmemleak_ignore_phys() for CMA created from of reserved node.
>[...]
>> >The 620951e27457 changelog says "Without this, the kernel crashes...". 
>> >Does your patch also fix a crash?  If so under what circumstances and
>> >should we backport this fix into -stable kernels?
>> 
>> No crash occurred. 620951e27457 avoids crashes caused by accessing
>> highmem and it was fixed later. Now kmemleak_alloc_phys() and
>> kmemleak_ignore_phys() skip highmem. This patch is based on the
>> point that CMA regions don't contain pointers to other kmemleak
>> objects, and ignores CMA regions from reserved memory as what
>> 620951e27457 did.
>
>Note that kmemleak_ignore() only works if there was a prior
>kmemleak_alloc() on that address range. With the previous commit we get
>this via the memblock_alloc_range() but I fail to see one on the
>rmem_cma_setup() path.

rmem is from memblock_reserve() or early_init_dt_alloc_reserved_memory_arch()
kmemleak_alloc() is not called in the first case. And It's bad to add one.

I think all the reserved regions should be allocated from memblock without
kmemleak_alloc() and let rmem handler choose to add it as kmemleak object
by kmemleak_alloc(). Because MEMBLOCK_ALLOC_NOLEAKTRACE conflicts with range
parameter in memlbock_alloc_* series, all reserved regions and default CMA
region are allocated with kmemleak_alloc().

I think it's better to add memblock_alloc_* series a spearate flag paramter
(like "NOLEAKTRACE") instead of encoding MEMBLOCK_ALLOC_NOLEAKTRACE in `end`
parameter.

--
Calvin

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH 1/4] dt-bindings: memory: mediatek: Correct the minItems of clk for larbs

2021-12-02 Thread Yong Wu
If a platform's larb support gals, there will be some larbs have a one
more "gals" clock while the others still only need "apb"/"smi" clocks.
then the minItems is 2 and the maxItems is 3.

Fixes: 27bb0e42855a ("dt-bindings: memory: mediatek: Convert SMI to DT schema")
Signed-off-by: Yong Wu 
---
 .../bindings/memory-controllers/mediatek,smi-larb.yaml  | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git 
a/Documentation/devicetree/bindings/memory-controllers/mediatek,smi-larb.yaml 
b/Documentation/devicetree/bindings/memory-controllers/mediatek,smi-larb.yaml
index eaeff1ada7f8..a1402f3b8344 100644
--- 
a/Documentation/devicetree/bindings/memory-controllers/mediatek,smi-larb.yaml
+++ 
b/Documentation/devicetree/bindings/memory-controllers/mediatek,smi-larb.yaml
@@ -81,7 +81,7 @@ allOf:
   properties:
 clock:
   items:
-minItems: 3
+minItems: 2
 maxItems: 3
 clock-names:
   items:
-- 
2.18.0

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH 2/4] dt-bindings: memory: mediatek: Add mt8186 support

2021-12-02 Thread Yong Wu
Add mt8186 smi support in the bindings.

Signed-off-by: Yong Wu 
---
 .../bindings/memory-controllers/mediatek,smi-common.yaml  | 4 +++-
 .../bindings/memory-controllers/mediatek,smi-larb.yaml| 3 +++
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git 
a/Documentation/devicetree/bindings/memory-controllers/mediatek,smi-common.yaml 
b/Documentation/devicetree/bindings/memory-controllers/mediatek,smi-common.yaml
index 3a82b0b27fa0..fbe5077408d8 100644
--- 
a/Documentation/devicetree/bindings/memory-controllers/mediatek,smi-common.yaml
+++ 
b/Documentation/devicetree/bindings/memory-controllers/mediatek,smi-common.yaml
@@ -16,7 +16,7 @@ description: |
   MediaTek SMI have two generations of HW architecture, here is the list
   which generation the SoCs use:
   generation 1: mt2701 and mt7623.
-  generation 2: mt2712, mt6779, mt8167, mt8173, mt8183, mt8192 and mt8195.
+  generation 2: mt2712, mt6779, mt8167, mt8173, mt8183, mt8186, mt8192 and 
mt8195.
 
   There's slight differences between the two SMI, for generation 2, the
   register which control the iommu port is at each larb's register base. But
@@ -35,6 +35,7 @@ properties:
   - mediatek,mt8167-smi-common
   - mediatek,mt8173-smi-common
   - mediatek,mt8183-smi-common
+  - mediatek,mt8186-smi-common
   - mediatek,mt8192-smi-common
   - mediatek,mt8195-smi-common-vdo
   - mediatek,mt8195-smi-common-vpp
@@ -127,6 +128,7 @@ allOf:
   enum:
 - mediatek,mt6779-smi-common
 - mediatek,mt8183-smi-common
+- mediatek,mt8186-smi-common
 - mediatek,mt8192-smi-common
 - mediatek,mt8195-smi-common-vdo
 - mediatek,mt8195-smi-common-vpp
diff --git 
a/Documentation/devicetree/bindings/memory-controllers/mediatek,smi-larb.yaml 
b/Documentation/devicetree/bindings/memory-controllers/mediatek,smi-larb.yaml
index a1402f3b8344..e019990f892a 100644
--- 
a/Documentation/devicetree/bindings/memory-controllers/mediatek,smi-larb.yaml
+++ 
b/Documentation/devicetree/bindings/memory-controllers/mediatek,smi-larb.yaml
@@ -23,6 +23,7 @@ properties:
   - mediatek,mt8167-smi-larb
   - mediatek,mt8173-smi-larb
   - mediatek,mt8183-smi-larb
+  - mediatek,mt8186-smi-larb
   - mediatek,mt8192-smi-larb
   - mediatek,mt8195-smi-larb
 
@@ -75,6 +76,7 @@ allOf:
 compatible:
   enum:
 - mediatek,mt8183-smi-larb
+- mediatek,mt8186-smi-larb
 - mediatek,mt8195-smi-larb
 
 then:
@@ -109,6 +111,7 @@ allOf:
   - mediatek,mt2712-smi-larb
   - mediatek,mt6779-smi-larb
   - mediatek,mt8167-smi-larb
+  - mediatek,mt8186-smi-larb
   - mediatek,mt8192-smi-larb
   - mediatek,mt8195-smi-larb
 
-- 
2.18.0

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH 0/4] MT8186 SMI SUPPORT

2021-12-02 Thread Yong Wu
This patchset add mt8186 smi support.
mainly add a sleep control function.

Base on v5.16-rc1.

Yong Wu (4):
  dt-bindings: memory: mediatek: Correct the minItems of clk for larbs
  dt-bindings: memory: mediatek: Add mt8186 support
  memory: mtk-smi: Add sleep ctrl function
  memory: mtk-smi: mt8186: Add smi support

 .../mediatek,smi-common.yaml  |  4 +-
 .../memory-controllers/mediatek,smi-larb.yaml |  5 +-
 drivers/memory/mtk-smi.c  | 52 +--
 3 files changed, 55 insertions(+), 6 deletions(-)

-- 
2.18.0


___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH 3/4] memory: mtk-smi: Add sleep ctrl function

2021-12-02 Thread Yong Wu
sleep control means that when the larb go to sleep, we should wait a bit
until all the current commands are finished. thus, when the larb runtime
suspend, we need enable this function to wait until all the existed
command are finished. when the larb resume, just disable this function.
This function only improve the safe of bus. Add a new flag for this
function. Prepare for mt8186.

Signed-off-by: Anan Sun 
Signed-off-by: Yong Wu 
---
 drivers/memory/mtk-smi.c | 39 +++
 1 file changed, 35 insertions(+), 4 deletions(-)

diff --git a/drivers/memory/mtk-smi.c b/drivers/memory/mtk-smi.c
index b883dcc0bbfa..4b59b28e4d73 100644
--- a/drivers/memory/mtk-smi.c
+++ b/drivers/memory/mtk-smi.c
@@ -8,6 +8,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -32,6 +33,10 @@
 #define SMI_DUMMY  0x444
 
 /* SMI LARB */
+#define SMI_LARB_SLP_CON0x00c
+#define SLP_PROT_EN BIT(0)
+#define SLP_PROT_RDYBIT(16)
+
 #define SMI_LARB_CMD_THRT_CON  0x24
 #define SMI_LARB_THRT_RD_NU_LMT_MSKGENMASK(7, 4)
 #define SMI_LARB_THRT_RD_NU_LMT(5 << 4)
@@ -81,6 +86,7 @@
 
 #define MTK_SMI_FLAG_THRT_UPDATE   BIT(0)
 #define MTK_SMI_FLAG_SW_FLAG   BIT(1)
+#define MTK_SMI_FLAG_SLEEP_CTL BIT(2)
 #define MTK_SMI_CAPS(flags, _x)(!!((flags) & (_x)))
 
 struct mtk_smi_reg_pair {
@@ -371,6 +377,24 @@ static const struct of_device_id mtk_smi_larb_of_ids[] = {
{}
 };
 
+static int mtk_smi_larb_sleep_ctrl(struct device *dev, bool to_sleep)
+{
+   struct mtk_smi_larb *larb = dev_get_drvdata(dev);
+   int ret = 0;
+   u32 tmp;
+
+   if (to_sleep) {
+   writel_relaxed(SLP_PROT_EN, larb->base + SMI_LARB_SLP_CON);
+   ret = readl_poll_timeout_atomic(larb->base + SMI_LARB_SLP_CON,
+   tmp, !!(tmp & SLP_PROT_RDY), 
10, 1000);
+   if (ret)
+   dev_warn(dev, "sleep ctrl is not ready(0x%x).\n", tmp);
+   } else {
+   writel_relaxed(0, larb->base + SMI_LARB_SLP_CON);
+   }
+   return ret;
+}
+
 static int mtk_smi_device_link_common(struct device *dev, struct device 
**com_dev)
 {
struct platform_device *smi_com_pdev;
@@ -477,24 +501,31 @@ static int __maybe_unused mtk_smi_larb_resume(struct 
device *dev)
 {
struct mtk_smi_larb *larb = dev_get_drvdata(dev);
const struct mtk_smi_larb_gen *larb_gen = larb->larb_gen;
-   int ret;
+   int ret = 0;
 
ret = clk_bulk_prepare_enable(larb->smi.clk_num, larb->smi.clks);
-   if (ret < 0)
+   if (ret)
return ret;
 
+   if (MTK_SMI_CAPS(larb->larb_gen->flags_general, MTK_SMI_FLAG_SLEEP_CTL))
+   ret = mtk_smi_larb_sleep_ctrl(dev, false);
+
/* Configure the basic setting for this larb */
larb_gen->config_port(dev);
 
-   return 0;
+   return ret;
 }
 
 static int __maybe_unused mtk_smi_larb_suspend(struct device *dev)
 {
struct mtk_smi_larb *larb = dev_get_drvdata(dev);
+   int ret = 0;
+
+   if (MTK_SMI_CAPS(larb->larb_gen->flags_general, MTK_SMI_FLAG_SLEEP_CTL))
+   ret = mtk_smi_larb_sleep_ctrl(dev, true);
 
clk_bulk_disable_unprepare(larb->smi.clk_num, larb->smi.clks);
-   return 0;
+   return ret;
 }
 
 static const struct dev_pm_ops smi_larb_pm_ops = {
-- 
2.18.0

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH 4/4] memory: mtk-smi: mt8186: Add smi support

2021-12-02 Thread Yong Wu
Add mt8186 SMI support.

Signed-off-by: Yong Wu 
---
 drivers/memory/mtk-smi.c | 13 +
 1 file changed, 13 insertions(+)

diff --git a/drivers/memory/mtk-smi.c b/drivers/memory/mtk-smi.c
index 4b59b28e4d73..29d7cd1cc8f8 100644
--- a/drivers/memory/mtk-smi.c
+++ b/drivers/memory/mtk-smi.c
@@ -355,6 +355,11 @@ static const struct mtk_smi_larb_gen mtk_smi_larb_mt8183 = 
{
  /* IPU0 | IPU1 | CCU */
 };
 
+static const struct mtk_smi_larb_gen mtk_smi_larb_mt8186 = {
+   .config_port= mtk_smi_larb_config_port_gen2_general,
+   .flags_general  = MTK_SMI_FLAG_SLEEP_CTL,
+};
+
 static const struct mtk_smi_larb_gen mtk_smi_larb_mt8192 = {
.config_port= mtk_smi_larb_config_port_gen2_general,
 };
@@ -372,6 +377,7 @@ static const struct of_device_id mtk_smi_larb_of_ids[] = {
{.compatible = "mediatek,mt8167-smi-larb", .data = 
_smi_larb_mt8167},
{.compatible = "mediatek,mt8173-smi-larb", .data = 
_smi_larb_mt8173},
{.compatible = "mediatek,mt8183-smi-larb", .data = 
_smi_larb_mt8183},
+   {.compatible = "mediatek,mt8186-smi-larb", .data = 
_smi_larb_mt8186},
{.compatible = "mediatek,mt8192-smi-larb", .data = 
_smi_larb_mt8192},
{.compatible = "mediatek,mt8195-smi-larb", .data = 
_smi_larb_mt8195},
{}
@@ -575,6 +581,12 @@ static const struct mtk_smi_common_plat 
mtk_smi_common_mt8183 = {
F_MMU1_LARB(7),
 };
 
+static const struct mtk_smi_common_plat mtk_smi_common_mt8186 = {
+   .type = MTK_SMI_GEN2,
+   .has_gals = true,
+   .bus_sel  = F_MMU1_LARB(1) | F_MMU1_LARB(4) | F_MMU1_LARB(7),
+};
+
 static const struct mtk_smi_common_plat mtk_smi_common_mt8192 = {
.type = MTK_SMI_GEN2,
.has_gals = true,
@@ -609,6 +621,7 @@ static const struct of_device_id mtk_smi_common_of_ids[] = {
{.compatible = "mediatek,mt8167-smi-common", .data = 
_smi_common_gen2},
{.compatible = "mediatek,mt8173-smi-common", .data = 
_smi_common_gen2},
{.compatible = "mediatek,mt8183-smi-common", .data = 
_smi_common_mt8183},
+   {.compatible = "mediatek,mt8186-smi-common", .data = 
_smi_common_mt8186},
{.compatible = "mediatek,mt8192-smi-common", .data = 
_smi_common_mt8192},
{.compatible = "mediatek,mt8195-smi-common-vdo", .data = 
_smi_common_mt8195_vdo},
{.compatible = "mediatek,mt8195-smi-common-vpp", .data = 
_smi_common_mt8195_vpp},
-- 
2.18.0

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v2 00/33] Separate struct slab from struct page

2021-12-02 Thread Vlastimil Babka
On 12/1/21 19:14, Vlastimil Babka wrote:
> Folks from non-slab subsystems are Cc'd only to patches affecting them, and
> this cover letter.
> 
> Series also available in git, based on 5.16-rc3:
> https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=slab-struct_slab-v2r2

I have pushed a v3, but not going to resent immediately to avoid unnecessary
spamming, the differences is just that some patches are removed and other
reordered, so the current v2 posting should be still sufficient for on-list
review:

https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=slab-struct_slab-v3r1

patch 29/33 iommu: Use put_pages_list
- removed as this version is broken and Robin Murphy has meanwhile
incorporated it partially to his series:
https://lore.kernel.org/lkml/cover.1637671820.git.robin.mur...@arm.com/

patch 30/33 mm: Remove slab from struct page
- removed and postponed for later as this can be only be applied after the
iommu use of page.freelist is resolved

patch 27/33 zsmalloc: Stop using slab fields in struct page
patch 28/33 bootmem: Use page->index instead of page->freelist
- moved towards the end of series, to further separate the part that adjusts
non-slab users of slab fields towards removing those fields from struct page.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-02 Thread Jason Gunthorpe via iommu
On Thu, Dec 02, 2021 at 01:01:42AM +0100, Thomas Gleixner wrote:
> Jason,
> 
> On Wed, Dec 01 2021 at 21:21, Thomas Gleixner wrote:
> > On Wed, Dec 01 2021 at 14:14, Jason Gunthorpe wrote:
> > Which in turn is consistent all over the place and does not require any
> > special case for anything. Neither for interrupts nor for anything else.
> 
> that said, feel free to tell me that I'm getting it all wrong.
> 
> The reason I'm harping on this is that we are creating ABIs on several
> ends and we all know that getting that wrong is a major pain.

I don't really like coupling the method to fetch IRQs with needing
special struct devices. Struct devices have a sysfs presence and it is
not always appropriate to create sysfs stuff just to allocate some
IRQs.

A queue is simply not a device, it doesn't make any sense. A queue is
more like a socket().

That said, we often have enough struct devices floating about to make
this work. Between netdev/ib_device/aux device/mdev we can use them to
do this.

I think it is conceptual nonsense to attach an IMS IRQ domain to a
netdev or a cdev, but it will solve this problem.

However, I would really prefer that there be no uAPI here. 

I looked at the msi_irqs/ stuff and could not find a user. Debian code
search found nothing, Greg redid the entire uAPI in 2013
(1c51b50c2995), so I think it is just dead. (maybe delete it?)

So lets not create any sysfs for IMS with the msi_irqs/ dir.  We can
revise the in-kenel mechanism someday if it turns out to be a problem.

As to your question:

> So again, why would we want to make software managed subdevices look
> exactly the opposite way like hardware/firmware managed subdevices?

That isn't my thinking at all.

Something like mlx5 has a hw/fw managed VF and there is an RPC call
from driver to device to 'create a queue'. The device has no hard
division inside along a VF, the device simply checks resource limits
and security properties and returns one of the >>10k queues. Again
think more like socket() than a hard partitioning.

It is the same as I suggest for IDXD & VFIO where the PCIe IDXD layer
takes the place of hw/fw and has a 'create a queue' API call for the
VFIO layer to use. Instead of using a VF as the security identity, it
uses a PASID.

This is a logical partitioning and it matches the partioning we'd have
if it was a real device.

> So if a queue is represented as a subdevice, then VFIO can just build
> a wrapper around that subdevice.

I think that oversimplifies the picture.

IDXD is a multi queue device that uses PASID as a security context. It
has a cdev /dev/idxd interface where userspace can use an IOCTL and
get a queue to use. The queue is bound to a PASID that is linked to an
IO Page table that mirrors the process page table. Userspace operates
the queue and does whatever with it.

VFIO is just another interface that should logically be considered a
peer of the cdev. Using VFIO userspace can get a queue, bind it to a
PASID and operate it. The primary difference between the cdev and the
VFIO mdev is user programming API - VFIO uses IOCTLs that carry
emulated MMIO read/write operations.

I consider *neither* to be a subdevice. They are just a user API,
however convoluted, to create a queue, associate it with a PASID
security context and allow userspace to operate the queue. It is much
closer to socket() than a PCI VF subdevice.

Internally the driver should be built so that the PCI driver is doing
all the device operation and the two uAPI layers are only concerend
with translating their repsective uAPIs to the internal device API.

Further, there is no reason why IMS should be reserved exclusively for
VFIO! Why shouldn't the cdev be able to use IMS vectors too? It is
just a feature of the PCI device like MSI. If the queue has a PASID it
can use IDXD's IMS.

If we really need a 2nd struct device to turn on IMS then, I'd suggest
picking the cdev, as it keeps IMS and its allocator inside the IDXD
PCIe driver and not in the VFIO world.

Regards,
Jason
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-02 Thread Greg Kroah-Hartman
On Thu, Dec 02, 2021 at 09:55:02AM -0400, Jason Gunthorpe wrote:
> Further, there is no reason why IMS should be reserved exclusively for
> VFIO! Why shouldn't the cdev be able to use IMS vectors too? It is
> just a feature of the PCI device like MSI. If the queue has a PASID it
> can use IDXD's IMS.

No, sorry, but a cdev is not for anything resembling any real resource
at all.

It is ONLY for the /dev/NODE interface that controls the character
device api to userspace.  The struct device involved in it is ONLY for
that, nothing else.  Any attempt to add things to it will be gleefully
rejected.

The cdev api today (in the kernel) exposes too much mess and there's at
least 4 or 5 different ways to use it.  It's on my long-term TODO list
to fix this up to not even allow abuses like you are considering here,
so please don't do that.

> If we really need a 2nd struct device to turn on IMS then, I'd suggest
> picking the cdev, as it keeps IMS and its allocator inside the IDXD
> PCIe driver and not in the VFIO world.

No!  Again, a cdev is to control the lifespan/lifecycle of the /dev/NODE
only.  Anything other than that is not ok to do at all.

thanks,

greg k-h
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH V3 2/5] x86/hyper-v: Add hyperv Isolation VM check in the cc_platform_has()

2021-12-02 Thread Wei Liu
On Wed, Dec 01, 2021 at 11:02:53AM -0500, Tianyu Lan wrote:
> From: Tianyu Lan 
> 
> Hyper-V provides Isolation VM which has memory encrypt support. Add
> hyperv_cc_platform_has() and return true for check of GUEST_MEM_ENCRYPT
> attribute.
> 
> Signed-off-by: Tianyu Lan 
> ---
>  arch/x86/kernel/cc_platform.c | 15 +++
>  1 file changed, 15 insertions(+)
> 
> diff --git a/arch/x86/kernel/cc_platform.c b/arch/x86/kernel/cc_platform.c
> index 03bb2f343ddb..f3bb0431f5c5 100644
> --- a/arch/x86/kernel/cc_platform.c
> +++ b/arch/x86/kernel/cc_platform.c
> @@ -11,6 +11,7 @@
>  #include 
>  #include 
>  
> +#include 
>  #include 
>  
>  static bool __maybe_unused intel_cc_platform_has(enum cc_attr attr)
> @@ -58,9 +59,23 @@ static bool amd_cc_platform_has(enum cc_attr attr)
>  #endif
>  }
>  
> +static bool hyperv_cc_platform_has(enum cc_attr attr)
> +{
> +#ifdef CONFIG_HYPERV
> + if (attr == CC_ATTR_GUEST_MEM_ENCRYPT)
> + return true;
> + else
> + return false;

This can be simplified as

return attr == CC_ATTR_GUEST_MEM_ENCRYPT;


Wei.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH V3 1/5] Swiotlb: Add Swiotlb bounce buffer remap function for HV IVM

2021-12-02 Thread Tom Lendacky via iommu

On 12/1/21 10:02 AM, Tianyu Lan wrote:

From: Tianyu Lan 

In Isolation VM with AMD SEV, bounce buffer needs to be accessed via
extra address space which is above shared_gpa_boundary (E.G 39 bit
address line) reported by Hyper-V CPUID ISOLATION_CONFIG. The access
physical address will be original physical address + shared_gpa_boundary.
The shared_gpa_boundary in the AMD SEV SNP spec is called virtual top of
memory(vTOM). Memory addresses below vTOM are automatically treated as
private while memory above vTOM is treated as shared.

Expose swiotlb_unencrypted_base for platforms to set unencrypted
memory base offset and platform calls swiotlb_update_mem_attributes()
to remap swiotlb mem to unencrypted address space. memremap() can
not be called in the early stage and so put remapping code into
swiotlb_update_mem_attributes(). Store remap address and use it to copy
data from/to swiotlb bounce buffer.

Signed-off-by: Tianyu Lan 


This patch results in the following stack trace during a bare-metal boot
on my EPYC system with SME active (e.g. mem_encrypt=on):

[0.123932] BUG: Bad page state in process swapper  pfn:108001
[0.123942] page:(ptrval) refcount:0 mapcount:-128 
mapping: index:0x0 pfn:0x108001
[0.123946] flags: 0x17c000(node=0|zone=2|lastcpupid=0x1f)
[0.123952] raw: 0017c000 88904f2d5e80 88904f2d5e80 

[0.123954] raw:   ff7f 

[0.123955] page dumped because: nonzero mapcount
[0.123957] Modules linked in:
[0.123961] CPU: 0 PID: 0 Comm: swapper Not tainted 5.16.0-rc3-sos-custom #2
[0.123964] Hardware name: AMD Corporation
[0.123967] Call Trace:
[0.123971]  
[0.123975]  dump_stack_lvl+0x48/0x5e
[0.123985]  bad_page.cold+0x65/0x96
[0.123990]  __free_pages_ok+0x3a8/0x410
[0.123996]  memblock_free_all+0x171/0x1dc
[0.124005]  mem_init+0x1f/0x14b
[0.124011]  start_kernel+0x3b5/0x6a1
[0.124016]  secondary_startup_64_no_verify+0xb0/0xbb
[0.124022]  

I see ~40 of these traces, each for different pfns.

Thanks,
Tom


---
Change since v2:
* Leave mem->vaddr with phys_to_virt(mem->start) when fail
  to remap swiotlb memory.

Change since v1:
* Rework comment in the swiotlb_init_io_tlb_mem()
* Make swiotlb_init_io_tlb_mem() back to return void.
---
  include/linux/swiotlb.h |  6 ++
  kernel/dma/swiotlb.c| 47 -
  2 files changed, 48 insertions(+), 5 deletions(-)

diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 569272871375..f6c3638255d5 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -73,6 +73,9 @@ extern enum swiotlb_force swiotlb_force;
   * @end:  The end address of the swiotlb memory pool. Used to do a quick
   *range check to see if the memory was in fact allocated by this
   *API.
+ * @vaddr: The vaddr of the swiotlb memory pool. The swiotlb memory pool
+ * may be remapped in the memory encrypted case and store virtual
+ * address for bounce buffer operation.
   * @nslabs:   The number of IO TLB blocks (in groups of 64) between @start and
   *@end. For default swiotlb, this is command line adjustable via
   *setup_io_tlb_npages.
@@ -92,6 +95,7 @@ extern enum swiotlb_force swiotlb_force;
  struct io_tlb_mem {
phys_addr_t start;
phys_addr_t end;
+   void *vaddr;
unsigned long nslabs;
unsigned long used;
unsigned int index;
@@ -186,4 +190,6 @@ static inline bool is_swiotlb_for_alloc(struct device *dev)
  }
  #endif /* CONFIG_DMA_RESTRICTED_POOL */
  
+extern phys_addr_t swiotlb_unencrypted_base;

+
  #endif /* __LINUX_SWIOTLB_H */
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 8e840fbbed7c..adb9d06af5c8 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -50,6 +50,7 @@
  #include 
  #include 
  
+#include 

  #include 
  #include 
  #include 
@@ -72,6 +73,8 @@ enum swiotlb_force swiotlb_force;
  
  struct io_tlb_mem io_tlb_default_mem;
  
+phys_addr_t swiotlb_unencrypted_base;

+
  /*
   * Max segment that we can provide which (if pages are contingous) will
   * not be bounced (unless SWIOTLB_FORCE is set).
@@ -155,6 +158,27 @@ static inline unsigned long nr_slots(u64 val)
return DIV_ROUND_UP(val, IO_TLB_SIZE);
  }
  
+/*

+ * Remap swioltb memory in the unencrypted physical address space
+ * when swiotlb_unencrypted_base is set. (e.g. for Hyper-V AMD SEV-SNP
+ * Isolation VMs).
+ */
+void *swiotlb_mem_remap(struct io_tlb_mem *mem, unsigned long bytes)
+{
+   void *vaddr = NULL;
+
+   if (swiotlb_unencrypted_base) {
+   phys_addr_t paddr = mem->start + swiotlb_unencrypted_base;
+
+   vaddr = memremap(paddr, bytes, MEMREMAP_WB);
+   if (!vaddr)
+   pr_err("Failed to map the 

Re: [PATCH V3 3/5] hyperv/IOMMU: Enable swiotlb bounce buffer for Isolation VM

2021-12-02 Thread Wei Liu
On Wed, Dec 01, 2021 at 11:02:54AM -0500, Tianyu Lan wrote:
[...]
> diff --git a/arch/x86/xen/pci-swiotlb-xen.c b/arch/x86/xen/pci-swiotlb-xen.c
> index 46df59aeaa06..30fd0600b008 100644
> --- a/arch/x86/xen/pci-swiotlb-xen.c
> +++ b/arch/x86/xen/pci-swiotlb-xen.c
> @@ -4,6 +4,7 @@
>  
>  #include 
>  #include 
> +#include 
>  #include 
>  
>  #include 
> @@ -91,6 +92,6 @@ int pci_xen_swiotlb_init_late(void)
>  EXPORT_SYMBOL_GPL(pci_xen_swiotlb_init_late);
>  
>  IOMMU_INIT_FINISH(pci_xen_swiotlb_detect,
> -   NULL,
> +   hyperv_swiotlb_detect,

It is not immediately obvious why this is needed just by reading the
code. Please consider copying some of the text in the commit message to
a comment here.

Thanks,
Wei.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-02 Thread Jason Gunthorpe via iommu
On Thu, Dec 02, 2021 at 03:23:38PM +0100, Greg Kroah-Hartman wrote:
> On Thu, Dec 02, 2021 at 09:55:02AM -0400, Jason Gunthorpe wrote:
> > Further, there is no reason why IMS should be reserved exclusively for
> > VFIO! Why shouldn't the cdev be able to use IMS vectors too? It is
> > just a feature of the PCI device like MSI. If the queue has a PASID it
> > can use IDXD's IMS.
> 
> No, sorry, but a cdev is not for anything resembling any real resource
> at all.

My point is that when the user asks the driver to allocate a queue
through a cdev ioctl it should be able to get the queue attached to an
IMS, today it can only get a queue attached to a MSI.

> It is ONLY for the /dev/NODE interface that controls the character
> device api to userspace.  The struct device involved in it is ONLY for
> that, nothing else.  Any attempt to add things to it will be gleefully
> rejected.

I agree with you!
 
> > If we really need a 2nd struct device to turn on IMS then, I'd suggest
> > picking the cdev, as it keeps IMS and its allocator inside the IDXD
> > PCIe driver and not in the VFIO world.
> 
> No!  Again, a cdev is to control the lifespan/lifecycle of the /dev/NODE
> only.  Anything other than that is not ok to do at all.

Said the same thing in a prior email - which is why I think the only
logical choice here is to make IMS work on the pci_device

FWIW I feel the same way about the VFIO mdev - its *ONLY* purpose is
to control the lifecycle and we are close to stripping away all the
other abuses using it for other things.

Jason
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu