Re: [PATCH v17 3/3] vfio/nvgrace-gpu: Add vfio pci variant module for grace hopper

2024-02-09 Thread Jason Gunthorpe
On Fri, Feb 09, 2024 at 08:55:31AM -0700, Alex Williamson wrote:
> I think Kevin's point is also relative to this latter scenario, in the
> L1 instance of the nvgrace-gpu driver the mmap of the usemem BAR is
> cachable, but in the L2 instance of the driver where we only use the
> vfio-pci-core ops nothing maintains that cachable mapping.  Is that a
> problem?  An uncached mapping on top of a cachable mapping is often
> prone to problems.  

On these CPUs the ARM architecture won't permit it, the L0 level
blocks uncachable using FWB and page table attributes. The VM, no
matter what it does, cannot make the cachable memory uncachable.

Jason



Re: [RFC v1 0/8] vhost-vdpa: add support for iommufd

2023-11-10 Thread Jason Gunthorpe
On Thu, Nov 09, 2023 at 06:48:46PM -0500, Michael S. Tsirkin wrote:
> On Tue, Nov 07, 2023 at 11:52:17AM -0400, Jason Gunthorpe wrote:
> > On Tue, Nov 07, 2023 at 09:30:21AM -0500, Michael S. Tsirkin wrote:
> > > On Tue, Nov 07, 2023 at 10:12:37AM -0400, Jason Gunthorpe wrote:
> > > > Big company's should take the responsibility to train and provide
> > > > skill development for their own staff.
> > > 
> > > That would result in a beautiful cathedral of a patch. I know this is
> > > how some companies work. We are doing more of a bazaar thing here,
> > > though. In a bunch of subsystems it seems that you don't get the
> > > necessary skills until you have been publicly shouted at by
> > > maintainers - better to start early ;). Not a nice environment for
> > > novices, for sure.
> > 
> > In my view the "shouting from maintainers" is harmful to the people
> > building skills and it is an unkind thing to dump employees into that
> > kind of situation.
> > 
> > They should have help to establish the basic level of competence where
> > they may do the wrong thing, but all the process and presentation of
> > the wrong thing is top notch. You get a much better reception.
> 
> What - like e.g. mechanically fixing checkpatch warnings without
> understanding? 

No, not at all. I mean actually going through and explaining what the
idea is to another person and ensuring that the commit messages convey
that idea, that the patches reflect the idea, that everything is
conveyed, and it isn't obviously internally illogical.

Like, why did this series have a giant block of #ifdef 0'd code with
no explanation at all? That isn't checkpatch nitpicks, that is not
meeting the minimum standard to convey an idea in an RFC.

Jason



Re: [PATCH for-next v3 4/7] RDMA/rxe: Add page invalidation support

2023-01-16 Thread Jason Gunthorpe
On Fri, Dec 23, 2022 at 03:51:55PM +0900, Daisuke Matsuda wrote:

> +static bool rxe_ib_invalidate_range(struct mmu_interval_notifier *mni,
> + const struct mmu_notifier_range *range,
> + unsigned long cur_seq)
> +{
> + struct ib_umem_odp *umem_odp =
> + container_of(mni, struct ib_umem_odp, notifier);
> + unsigned long start;
> + unsigned long end;
> +
> + if (!mmu_notifier_range_blockable(range))
> + return false;
> +
> + mutex_lock(&umem_odp->umem_mutex);
> + mmu_interval_set_seq(mni, cur_seq);
> +
> + start = max_t(u64, ib_umem_start(umem_odp), range->start);
> + end = min_t(u64, ib_umem_end(umem_odp), range->end);
> +
> + ib_umem_odp_unmap_dma_pages(umem_odp, start, end);

After Bob's xarray conversion this can be done a lot faster, it is just
an xa_for_each_range() and making the xarray items non-present.

non-present is probably just a null struct page in the xarray.
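
Something like this rough sketch, assuming the conversion stores the MR
pages in an xarray ("pages" here is a hypothetical field name) indexed
by page index within the umem:

static void rxe_odp_clear_range(struct ib_umem_odp *umem_odp,
				unsigned long start, unsigned long end)
{
	unsigned long lo = (start - ib_umem_start(umem_odp)) >> PAGE_SHIFT;
	unsigned long hi = (end - ib_umem_start(umem_odp) - 1) >> PAGE_SHIFT;
	unsigned long index;
	void *entry;

	/* storing NULL makes the entry non-present; no page table walk */
	xa_for_each_range(&umem_odp->pages, index, entry, lo, hi)
		xa_store(&umem_odp->pages, index, NULL, GFP_ATOMIC);
}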

Jason



Re: [PATCH for-next v3 3/7] RDMA/rxe: Cleanup code for responder Atomic operations

2023-01-16 Thread Jason Gunthorpe
On Fri, Dec 23, 2022 at 03:51:54PM +0900, Daisuke Matsuda wrote:
> @@ -733,60 +734,83 @@ static enum resp_states process_flush(struct rxe_qp *qp,
>  /* Guarantee atomicity of atomic operations at the machine level. */
>  static DEFINE_SPINLOCK(atomic_ops_lock);
>  
> -static enum resp_states atomic_reply(struct rxe_qp *qp,
> -  struct rxe_pkt_info *pkt)
> +enum resp_states rxe_process_atomic(struct rxe_qp *qp,
> + struct rxe_pkt_info *pkt, u64 *vaddr)
>  {
> - u64 *vaddr;
>   enum resp_states ret;
> - struct rxe_mr *mr = qp->resp.mr;
>   struct resp_res *res = qp->resp.res;
>   u64 value;
>  
> - if (!res) {
> - res = rxe_prepare_res(qp, pkt, RXE_ATOMIC_MASK);
> - qp->resp.res = res;
> + /* check vaddr is 8 bytes aligned. */
> + if (!vaddr || (uintptr_t)vaddr & 7) {
> + ret = RESPST_ERR_MISALIGNED_ATOMIC;
> + goto out;
>   }
>  
> - if (!res->replay) {
> - if (mr->state != RXE_MR_STATE_VALID) {
> - ret = RESPST_ERR_RKEY_VIOLATION;
> - goto out;
> - }
> + spin_lock(&atomic_ops_lock);
> + res->atomic.orig_val = value = *vaddr;
>  
> - vaddr = iova_to_vaddr(mr, qp->resp.va + qp->resp.offset,
> - sizeof(u64));

I think you need to properly fix the lifetime problem with iova_to_vaddr
function, not hack around it like this.

iova_to_vaddr should be able to return an IOVA for ODP just fine - the
reason it can't is the same bug it has with normal MRs, the mapping
can just change under the feet and there is no protective locking.

If you are going to follow the same ODP design as mlx5 then
fundamentally all ODP does to the MR is add a not-present bit and
allow the MR pages to churn rapidly.

Make the MR safe to changes in the page references against races and
ODP will work just fine.
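
The usual shape of that is the collision-retry pattern the interval
notifier API is built for - a sketch only, not rxe's actual code:

static enum resp_states rxe_odp_atomic(struct rxe_mr *mr, u64 iova)
{
	struct ib_umem_odp *umem_odp = to_ib_umem_odp(mr->umem);
	unsigned long seq;
	u64 *vaddr;

	do {
		seq = mmu_interval_read_begin(&umem_odp->notifier);
		/* ... fault in the pages for iova if non-present ... */
		mutex_lock(&umem_odp->umem_mutex);
		if (mmu_interval_read_retry(&umem_odp->notifier, seq)) {
			mutex_unlock(&umem_odp->umem_mutex);
			continue;
		}
		/* invalidation is excluded until we unlock, so the
		 * mapping cannot churn under us */
		vaddr = iova_to_vaddr(mr, iova, sizeof(u64));
		/* ... do the atomic op on *vaddr ... */
		mutex_unlock(&umem_odp->umem_mutex);
		return RESPST_NONE;
	} while (true);
}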

This will be easier on top of Bob's xarray patch, please check what he
has there and test it.

Thanks,
Jason



Re: [PATCH] mm: fix missing wake-up event for FSDAX pages

2022-07-21 Thread Jason Gunthorpe
On Mon, Jul 11, 2022 at 07:39:17PM -0700, Dan Williams wrote:
> Muchun Song wrote:
> > On Mon, Jul 04, 2022 at 11:38:16AM +0100, Matthew Wilcox wrote:
> > > On Mon, Jul 04, 2022 at 03:40:54PM +0800, Muchun Song wrote:
> > > > FSDAX page refcounts are 1-based, rather than 0-based: if refcount is
> > > > 1, then the page is freed.  The FSDAX pages can be pinned through GUP,
> > > > then they will be unpinned via unpin_user_page() using a folio variant
> > > > to put the page, however, folio variants did not consider this special
> > > > case, the result will be to miss a wakeup event (like the user of
> > > > __fuse_dax_break_layouts()).
> > > 
> > > Argh, no.  The 1-based refcounts are a blight on the entire kernel.
> > > They need to go away, not be pushed into folios as well.  I think
> > 
> > I would be happy if this could go away.
> 
> Continue to agree that this blight needs to end.
> 
> One of the pre-requisites to getting back to normal accounting of FSDAX
> page pin counts was to first drop the usage of get_dev_pagemap() in the
> GUP path:
> 
> https://lore.kernel.org/linux-mm/161604048257.1463742.1374527716381197629.st...@dwillia2-desk3.amr.corp.intel.com/
> 
> That work stalled on notifying mappers of surprise removal events of FSDAX 
> pfns.

We already talked about this - once we have proper refcounting the
above is protected naturally by the proper refcounting. The reason it
is there is only because the refcount goes to 1 and the page is
re-used so the natural protection in GUP doesn't work.

We don't need surprise removal events to fix this, we need the FS side
to hold a reference when it puts the pages into the PTEs..

> So, once I dig out from a bit of CXL backlog and review that effort the
> next step that I see will be convert the FSDAX path to take typical
> references vmf_insert() time. Unless I am missing a shorter path to get
> this fixed up?

Yeah, just do this. IIRC Christoph already did all the infrastructure now,
just take the correct references and remove the special cases that
turn off the new infrastructure for fsdax.
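
The rough shape, as a sketch only (assuming vmf_insert_mixed() stays
the insertion primitive; dax_fault_insert() is a made-up name):

static vm_fault_t dax_fault_insert(struct vm_fault *vmf, struct page *page)
{
	/* the FS holds a normal reference while the page is mapped,
	 * so GUP's ordinary get/put accounting just works and the
	 * 1-based scheme can be deleted */
	get_page(page);
	return vmf_insert_mixed(vmf->vma, vmf->address,
				page_to_pfn_t(page));
}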

When I looked at it a long while ago it seemed to require some
understanding of the fsdax code and exactly what the lifecycle model
was supposed to be there.

Jason



Re: RDMA/rpma + fsdax(ext4) was broken since 36f30e486d

2021-08-27 Thread Jason Gunthorpe
On Fri, Aug 27, 2021 at 09:42:21AM -0700, Dan Williams wrote:
> On Fri, Aug 27, 2021 at 6:05 AM Li, Zhijian  wrote:
> >
> >
> > on 2021/8/27 20:10, Jason Gunthorpe wrote:
> > > On Fri, Aug 27, 2021 at 08:15:40AM +, lizhij...@fujitsu.com wrote:
> > >> i looked over the change-log of hmm_vma_handle_pte(), and found that 
> > >> before
> > >> 4055062 ("mm/hmm: add missing call to hmm_pte_need_fault in 
> > >> HMM_PFN_SPECIAL handling")
> > >>
> > >> hmm_vma_handle_pte() will not check pte_special(pte) if pte_devmap(pte) 
> > >> is true.
> > >>
> > >> when we reached
> > >> "if (pte_special(pte) && !is_zero_pfn(pte_pfn(pte))) {"
> > >> the pte is already present and its flags already fulfilled the
> > >> request flags.
> > >>
> > >>
> > >> My question is that
> > >> Per https://01.org/blogs/dave/2020/linux-consumption-x86-page-table-bits,
> > >> pte_devmap(pte) and pte_special(pte) could be both true in fsdax user 
> > >> case, right ?
> > > How? what code creates that?
> > >
> > > I see:
> > >
> > > insert_pfn():
> > >   /* Ok, finally just insert the thing.. */
> > >   if (pfn_t_devmap(pfn))
> > >   entry = pte_mkdevmap(pfn_t_pte(pfn, prot));
> > >   else
> > >   entry = pte_mkspecial(pfn_t_pte(pfn, prot));
> > >
> > > So what code path ends up setting both bits?
> >
> >  pte_mkdevmap() will set both _PAGE_SPECIAL | _PAGE_DEVMAP
> >
> >   395 static inline pte_t pte_mkdevmap(pte_t pte)
> >   396 {
> >   397 return pte_set_flags(pte, _PAGE_SPECIAL|_PAGE_DEVMAP);
> >   398 }
> 
> I can't recall why _PAGE_SPECIAL is there. I'll take a look, but I
> think setting _PAGE_SPECIAL in pte_mkdevmap() is overkill.

This is my feeling too, but every arch does it, so hmm should check
it, at least for now as a stable fix
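
The stable fix amounts to something like this in hmm_vma_handle_pte()
(a sketch of the change being discussed):

	/* a devmap pte has a refcounted struct page, so don't take
	 * the pte_special() bailout path for it */
	if (pte_special(pte) && !pte_devmap(pte) &&
	    !is_zero_pfn(pte_pfn(pte))) {
		if (hmm_pte_need_fault(hmm_vma_walk, pfn_req_flags, 0)) {
			pte_unmap(ptep);
			return -EFAULT;
		}
		*hmm_pfn = HMM_PFN_ERROR;
		return 0;
	}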

devmap has a struct page so it should be refcounted inside the VMA and
that is the main thing that PAGE_SPECIAL disabled, AFAICR..

The only places where pte_special are used that I wonder if are OK for
devmap have to do with CPU cache maintenance

vm_normal_page(), hmm_vma_handle_pte(), gup_pte_range() all look OK to
drop the special bit

Jason



Re: RDMA/rpma + fsdax(ext4) was broken since 36f30e486d

2021-08-27 Thread Jason Gunthorpe
On Fri, Aug 27, 2021 at 09:05:21PM +0800, Li, Zhijian wrote:
> 
> on 2021/8/27 20:10, Jason Gunthorpe wrote:
> > On Fri, Aug 27, 2021 at 08:15:40AM +, lizhij...@fujitsu.com wrote:
> > > i looked over the change-log of hmm_vma_handle_pte(), and found that 
> > > before
> > > 4055062 ("mm/hmm: add missing call to hmm_pte_need_fault in 
> > > HMM_PFN_SPECIAL handling")
> > > 
> > > hmm_vma_handle_pte() will not check pte_special(pte) if pte_devmap(pte) 
> > > is true.
> > > 
> > > when we reached
> > > "if (pte_special(pte) && !is_zero_pfn(pte_pfn(pte))) {"
> > > the pte is already present and its flags already fulfilled the
> > > request flags.
> > > 
> > > 
> > > My question is that
> > > Per https://01.org/blogs/dave/2020/linux-consumption-x86-page-table-bits,
> > > pte_devmap(pte) and pte_special(pte) could be both true in fsdax user 
> > > case, right ?
> > How? what code creates that?
> > 
> > I see:
> > 
> > insert_pfn():
> > /* Ok, finally just insert the thing.. */
> > if (pfn_t_devmap(pfn))
> > entry = pte_mkdevmap(pfn_t_pte(pfn, prot));
> > else
> > entry = pte_mkspecial(pfn_t_pte(pfn, prot));
> > 
> > So what code path ends up setting both bits?
> 
>  pte_mkdevmap() will set both _PAGE_SPECIAL | _PAGE_DEVMAP
> 
>  395 static inline pte_t pte_mkdevmap(pte_t pte)
>  396 {
>  397 return pte_set_flags(pte, _PAGE_SPECIAL|_PAGE_DEVMAP);
>  398 }
> 
> below is a calltrace example
> 
> [  400.728559] Call Trace:
> [  400.731595] dump_stack+0x6d/0x8b
> [  400.735536] insert_pfn+0x16c/0x180
> [  400.739596] __vm_insert_mixed+0x84/0xc0
> [  400.744144] dax_iomap_pte_fault+0x845/0x870
> [  400.749089] ext4_dax_huge_fault+0x171/0x1e0
> [  400.754096] __do_fault+0x31/0xe0
> [  400.758090]  ? pmd_devmap_trans_unstable+0x37/0x90
> [  400.763541] handle_mm_fault+0x11b1/0x1680
> [  400.768260] exc_page_fault+0x2f4/0x570
> [  400.772788]  ? asm_exc_page_fault+0x8/0x30
> [  400.777539]  asm_exc_page_fault+0x1e/0x30
> 
> 
> So is my previous change reasonable ?

Yes, can you send a proper patch and include the mm mailing list?

Jason



Re: RDMA/rpma + fsdax(ext4) was broken since 36f30e486d

2021-08-27 Thread Jason Gunthorpe
On Fri, Aug 27, 2021 at 08:15:40AM +, lizhij...@fujitsu.com wrote:
> i looked over the change-log of hmm_vma_handle_pte(), and found that before
> 4055062 ("mm/hmm: add missing call to hmm_pte_need_fault in HMM_PFN_SPECIAL 
> handling")
> 
> hmm_vma_handle_pte() will not check pte_special(pte) if pte_devmap(pte) is 
> true.
> 
> when we reached
> "if (pte_special(pte) && !is_zero_pfn(pte_pfn(pte))) {"
> the pte is already present and its flags already fulfilled the
> request flags.
> 
> 
> My question is that
> Per https://01.org/blogs/dave/2020/linux-consumption-x86-page-table-bits,
> pte_devmap(pte) and pte_special(pte) could be both true in fsdax user case, 
> right ?

How? what code creates that?

I see:

insert_pfn():
/* Ok, finally just insert the thing.. */
if (pfn_t_devmap(pfn))
entry = pte_mkdevmap(pfn_t_pte(pfn, prot));
else
entry = pte_mkspecial(pfn_t_pte(pfn, prot));

So what code path ends up setting both bits?

Jason



Re: [RFC PATCH 2/3] vfio/hisilicon: register the driver to vfio

2021-04-20 Thread Jason Gunthorpe
On Tue, Apr 20, 2021 at 04:04:57PM -0600, Alex Williamson wrote:

> > The migration control registers must be on a different VF from the VF
> > being plugged into a guest and the two VFs have to be in different
> > IOMMU groups to ensure they are isolated from each other.
> 
> I think that's a solution, I don't know if it's the only solution.

Maybe, but that approach does offer DMA access for the migration. For
instance to implement something that needs a lot of data like
migrating a complicated device state, or dirty page tracking or
whatever.

This driver seems very simple - it has only 17 state elements - and
doesn't use DMA.

I can't quite tell, but does this pass the hypervisor BAR into the
guest anyhow? That would certainly be an adequate statement that it is
safe, assuming someone did a good security analysis.

> ways and it's not very interesting.  If the user can manipulate device
> state in order to trigger an exploit of the host-side kernel driver,
> that's obviously more of a problem.

Well, for instance, we have an implementation of
(VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RUNNING) which means the
guest CPUs are still running and a hostile guest can be manipulating
the device.

But this driver is running code, like vf_qm_state_pre_save() in this
state. Looks very suspicious.

One quick attack I can imagine is to use the guest CPU to DOS the
migration and permanently block it, eg by causing qm_mb() or other
looping functions to fail.

There may be worse things possible, it is a bit hard to tell just from
the code.

.. also drivers should not be open coding ARM assembly as in
qm_mb_write()

.. and also, code cannot randomly call pci_get_drvdata() on a struct
device it isn't attached to, hasn't verified the right driver is
bound to, or hasn't locked correctly.

> manipulate the BAR size to expose only the operational portion of MMIO
> to the VM and use the remainder to support migration itself.  I'm
> afraid that just like mdev, the vfio migration uAPI is going to be used
> as an excuse to create kernel drivers simply to be able to make use of
> that uAPI.

I thought that is the general direction people had agreed on during
the IDXD mdev discussion?

People want the IOCTLs from VFIO to be the single API to program all
the VMMs to and to not implement user space drivers..

This actually seems like a great candidate for a userspace driver.

I would like to know we are still settled on this direction as the
mlx5 drivers we are working on also have some complicated option to be
user space only.

Jason


Re: [PATCH] RDMA/mlx4: remove an unused variable

2021-04-20 Thread Jason Gunthorpe
On Tue, Apr 20, 2021 at 05:28:33PM +0200, Christophe JAILLET wrote:
> 'in6' is unused. It is just declared and filled-in.
> It can be removed.
> 
> This is a left over from commit 5ea8bbfc4929
> ("mlx4: Implement IP based gids support for RoCE/SRIOV")
> 
> Signed-off-by: Christophe JAILLET 
> ---
>  drivers/infiniband/hw/mlx4/qp.c | 3 ---
>  1 file changed, 3 deletions(-)

Applied to for-next, thanks

Jason


Re: [PATCH] infiniband: ulp: Remove unnecessary struct declaration

2021-04-20 Thread Jason Gunthorpe
On Thu, Apr 15, 2021 at 05:21:16PM +0800, Wan Jiabing wrote:
> struct ipoib_cm_tx is defined at line 245, and the definition is
> independent of the macro. The declaration here is unnecessary.
> Remove it.
> 
> Signed-off-by: Wan Jiabing 
> Reviewed-by: Christoph Lameter 
> ---
>  drivers/infiniband/ulp/ipoib/ipoib.h | 2 --
>  1 file changed, 2 deletions(-)

Applied to for-next, thanks

Jason


Re: [RFC PATCH 2/3] vfio/hisilicon: register the driver to vfio

2021-04-20 Thread Jason Gunthorpe
On Tue, Apr 20, 2021 at 09:28:46PM +0800, liulongfang wrote:
> >> So, I still don't understand what the security risk you are talking about 
> >> is,
> >> and what do you think the security design should look like?
> >> Can you elaborate on it?
> > 
> > Each security domain must have its own PCI BDF.
> > 
> The basic unit that performs the live migration function is the VF, and
> the basic function of the VF is the business function of the device. If
> the live migration function and the business function are completely
> separated and bound to their respective VFs, the result is that a VF
> able to execute business functions in the guest cannot perform live
> migration, and a VF with the live migration function cannot perform
> business functions in the guest.
> 
> In fact, the scenario that requires live migration is to migrate its
> business functions to the VFs of other hosts when the VF's business
> functions are heavily loaded. This usage scenario itself requires that
> the VF have these two functions at the same time.

Other vendors are managing, it is not an unsolvable problem.

I think these patches can't advance in any form without somehow
addressing the isolation issue.

Jason


Re: [RFC PATCH 2/3] vfio/hisilicon: register the driver to vfio

2021-04-20 Thread Jason Gunthorpe
On Tue, Apr 20, 2021 at 08:50:12PM +0800, liulongfang wrote:
> On 2021/4/19 20:33, Jason Gunthorpe wrote:
> > On Mon, Apr 19, 2021 at 08:24:40PM +0800, liulongfang wrote:
> > 
> >>> I'm also confused how this works securely at all, as a general rule a
> >>> VFIO PCI driver cannot access the MMIO memory of the function it is
> >>> planning to assign to the guest. There is a lot of danger that the
> >>> guest could access that MMIO space one way or another.
> >>
> >> VF's MMIO memory is divided into two parts, one is the guest part,
> >> and the other is the live migration part. They do not affect each other,
> >> so there is no security problem.
> > 
> > AFAIK there are several scenarios where a guest can access this MMIO
> > memory using DMA even if it is not mapped into the guest for CPU
> > access.
> > 
> The hardware divides the VF's MMIO memory into two parts. The live
> migration driver in the host uses the live migration part, and the
> device driver in the guest uses the guest part. They obtain the address
> of the VF's MMIO memory in their respective drivers. Although the two
> parts are contiguous on the hardware device, due to the needs of the
> driver function neither will operate on the other part of the memory,
> and the device hardware also responds independently to the operation
> commands of the two parts.

It doesn't matter, the memory is still under the same PCI BDF and VFIO
supports scenarios where devices in the same IOMMU group are not
isolated from each other.

This is why the granule of isolation is a PCI BDF - VFIO directly
blocks kernel drivers from attaching to PCI BDFs that are not
completely isolated from the VFIO BDF.

Bypassing this prevention and attaching a kernel driver directly to
the same BDF being exposed to the guest breaks that isolation model.

> So, I still don't understand what the security risk you are talking about is,
> and what do you think the security design should look like?
> Can you elaborate on it?

Each security domain must have its own PCI BDF.

The migration control registers must be on a different VF from the VF
being plugged into a guest and the two VFs have to be in different
IOMMU groups to ensure they are isolated from each other.

Jason


Re: KASAN: use-after-free Read in cma_cancel_operation, rdma_listen

2021-04-19 Thread Jason Gunthorpe
On Tue, Apr 13, 2021 at 10:19:25PM +0800, Hao Sun wrote:
> Jason Gunthorpe  wrote on Tue, Apr 13, 2021 at 9:45 PM:
> >
> > On Tue, Apr 13, 2021 at 09:42:43PM +0800, Hao Sun wrote:
> > > Jason Gunthorpe  wrote on Tue, Apr 13, 2021 at 9:34 PM:
> > > >
> > > > On Tue, Apr 13, 2021 at 11:36:41AM +0800, Hao Sun wrote:
> > > > > Hi
> > > > >
> > > > > When using Healer(https://github.com/SunHao-0/healer/tree/dev) to fuzz
> > > > > the Linux kernel, I found two use-after-free bugs which have been
> > > > > reported a long time ago by Syzbot.
> > > > > Although the corresponding patches have been merged into upstream,
> > > > > these two bugs can still be triggered easily.
> > > > > The original information about Syzbot report can be found here:
> > > > > https://syzkaller.appspot.com/bug?id=8dc0bcd9dd6ec915ba10b3354740eb420884acaa
> > > > > https://syzkaller.appspot.com/bug?id=95f89b8fb9fdc42e28ad586e657fea074e4e719b
> > > >
> > > > Then why hasn't syzbot seen this in a year's time? Seems strange
> > > >
> > >
> > > Seems strange to me too, but the fact is that the reproduction program
> > > in attachment can trigger these two bugs quickly.
> >
> > Do you have this in the C format?
> >
> 
> Just tried to use syz-prog2c to convert the repro-prog to C format.
> The repro program for rdma_listen was successfully converted
> (uploaded in attachment), the other one failed. It looks like
> syz-prog2c may not be able to do the equivalent conversion.
> You can use syz-execprog to execute the repro program directly; this
> method can reproduce both crashes, I have tried it.

I tried this program and it reliably deadlocks my kernel :|

So there is certainly something here

Jason


Re: [RFC PATCH 2/3] vfio/hisilicon: register the driver to vfio

2021-04-19 Thread Jason Gunthorpe
On Mon, Apr 19, 2021 at 08:24:40PM +0800, liulongfang wrote:

> > I'm also confused how this works securely at all, as a general rule a
> > VFIO PCI driver cannot access the MMIO memory of the function it is
> > planning to assign to the guest. There is a lot of danger that the
> > guest could access that MMIO space one way or another.
> 
> VF's MMIO memory is divided into two parts, one is the guest part,
> and the other is the live migration part. They do not affect each other,
> so there is no security problem.

AFAIK there are several scenarios where a guest can access this MMIO
memory using DMA even if it is not mapped into the guest for CPU
access.

> If pci_release_mem_regions() is not added here,
> The guests using pci_request_mem_regions() will return an error.
> Then, the guest will not be able to obtain the MMIO address of the VF.

Which is why VFIO has this protection to prevent sharing MMIO regions
on the VF assigned to the guest

Jason


Re: [PATCH] node: fix device cleanups in error handling code

2021-04-16 Thread Jason Gunthorpe
On Fri, Apr 09, 2021 at 02:01:57PM +0300, Dan Carpenter wrote:
> We can't use kfree() to free device managed resources so the kfree(dev)
> is against the rules.
> 
> It's easier to write this code if we open code the device_register() as
> a device_initialize() and device_add().  That way if dev_set_name() set
> name fails we can call put_device() and it will clean up correctly.
> 
> Fixes: acc02a109b04 ("node: Add memory-side caching attributes")
> Signed-off-by: Dan Carpenter 
> ---
>  drivers/base/node.c | 26 --
>  1 file changed, 12 insertions(+), 14 deletions(-)

Wow, yikes, that "kfree_const(dev->kobj.name);" is really creative

Reviewed-by: Jason Gunthorpe 

Though I dislike ignoring the error code from dev_set_name(), I think
Greg would prefer this:

diff --git a/drivers/base/node.c b/drivers/base/node.c
index f449dbb2c74666..80079e440add12 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -140,20 +140,16 @@ static struct node_access_nodes *node_init_node_access(struct node *node,
	dev->parent = &node->dev;
dev->release = node_access_release;
dev->groups = node_access_node_groups;
-   if (dev_set_name(dev, "access%u", access))
-   goto free;
 
-   if (device_register(dev))
-   goto free_name;
+   dev_set_name(dev, "access%u", access);
+   if (device_register(dev)) {
+   put_device(dev);
+   return NULL;
+   }
 
pm_runtime_no_callbacks(dev);
	list_add_tail(&access_node->list_node, &node->access_list);
return access_node;
-free_name:
-   kfree_const(dev->kobj.name);
-free:
-   kfree(access_node);
-   return NULL;
 }
 

(arguably a device_register_name() would be even better, if you are
handy with coccinelle there could quickly be a lot of users)
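
Something like this, as a sketch (kobject_set_name_vargs() is what
dev_set_name() uses internally):

static inline int device_register_name(struct device *dev,
				       const char *fmt, ...)
{
	va_list args;
	int ret;

	va_start(args, fmt);
	ret = kobject_set_name_vargs(&dev->kobj, fmt, args);
	va_end(args);
	if (ret)
		return ret;
	/* same contract as device_register(): caller must
	 * put_device() on failure */
	return device_register(dev);
}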

Jason


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-16 Thread Jason Gunthorpe
On Fri, Apr 16, 2021 at 10:23:32AM -0700, Jacob Pan wrote:

> Perhaps similar to cgroup v1 vs v2, it took a long time and with known
> limitations in v1.

cgroup v2 is still having transition problems, if anything it is a
cautionary tale to think really hard about uAPI because transitioning
can be really hard.

It might be very wise to make /dev/ioasid and /dev/vfio ioctl
compatible in some way so existing software has a smoother upgrade
path.

For instance by defining a default IOASID

Jason


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-16 Thread Jason Gunthorpe
On Fri, Apr 16, 2021 at 04:26:19PM +0200, Auger Eric wrote:

> This was largely done during several confs including plumber, KVM forum,
> for several years. Also API docs were shared on the ML. I don't remember
> any voice was raised at those moments.

I don't think anyone objects to the high level ideas, but
implementation does matter. I don't think anyone presented "hey we
will tunnel an uAPI through VFIO to the IOMMU subsystem" - did they?

Look at the fairly simple IMS situation, for example. This was
presented at plumbers too, and the slides were great - but the
implementation was too hacky. It required a major rework of the x86
interrupt handling before it was OK.

Jason


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-16 Thread Jason Gunthorpe
On Fri, Apr 16, 2021 at 03:38:02PM +0200, Auger Eric wrote:

> The redesign requirement came pretty late in the development process.
> The iommu user API is upstream for a while, the VFIO interfaces have
> been submitted a long time ago and under review for a bunch of time.
> Redesigning everything with a different API, undefined at this point, is
> a major setback for our work and will have a large impact on the
> introduction of features companies are looking forward, hence our
> frustration.

I will answer both you and Jacob at once.

This is uAPI, once it is set it can never be changed.

The kernel process and philosophy is to invest heavily in uAPI
development and review to converge on the best uAPI possible.

Many past submissions have taken a long time to get this right, there
are several high profile uAPI examples.

Do you think this case is so special, or the concerns so minor, that it
should get to bypass all of the normal process?

Ask yourself, is anyone advocating for the current direction on
technical merits alone?

Certainly the patches I last saw where completely disgusting from a
uAPI design perspective.

It was against the development process to organize this work the way
it was done. Merging a whack of dead code to the kernel to support a
uAPI vision that was never clearly articulated was a big mistake.

Start from the beginning. Invest heavily in defining a high quality
uAPI. Clearly describe the uAPI to all stake holders. Break up the
implementation into patch series without dead code. Make the
patches. Remove the dead code this group has already added.

None of this should be a surprise. The VDPA discussion and related
"what is a mdev" over a year ago made it pretty clear VFIO is not the
exclusive user of "IOMMU in userspace" and that places limits on what
kind of uAPIs expansion it should experience going forward.

Jason


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-15 Thread Jason Gunthorpe
On Thu, Apr 15, 2021 at 03:11:19PM +0200, Auger Eric wrote:
> Hi Jason,
> 
> On 4/1/21 6:03 PM, Jason Gunthorpe wrote:
> > On Thu, Apr 01, 2021 at 02:08:17PM +, Liu, Yi L wrote:
> > 
> >> DMA page faults are delivered to root-complex via page request message and
> >> it is per-device according to PCIe spec. Page request handling flow is:
> >>
> >> 1) iommu driver receives a page request from device
> >> 2) iommu driver parses the page request message. Get the RID,PASID, faulted
> >>page and requested permissions etc.
> >> 3) iommu driver triggers fault handler registered by device driver with
> >>iommu_report_device_fault()
> > 
> > This seems confused.
> > 
> > The PASID should define how to handle the page fault, not the driver.
> 
> In my series I don't use PASID at all. I am just enabling nested stage
> and the guest uses a single context. I don't allocate any user PASID at
> any point.
> 
> When there is a fault at physical level (a stage 1 fault that concerns
> the guest), this latter needs to be reported and injected into the
> guest. The vfio pci driver registers a fault handler to the iommu layer
> and in that fault handler it fills a circ buffer and triggers an eventfd
> that is listened to by the VFIO-PCI QEMU device. This latter retrieves
> the fault from the mmapped circ buffer, it knows which vIOMMU it is
> attached to, and passes the fault to the vIOMMU.
> Then the vIOMMU triggers an IRQ in the guest.
> 
> We are reusing the existing concepts from VFIO, region, IRQ to do that.
> 
> For that use case, would you also use /dev/ioasid?

/dev/ioasid could do all the things you described vfio-pci as doing,
it can even do them the same way you just described.

Stated another way, do you plan to duplicate all of this code someday
for vfio-cxl? What about for vfio-platform? ARM SMMU can be hooked to
platform devices, right?

I feel what you guys are struggling with is some choice in the iommu
kernel APIs that causes the events to be delivered to the pci_device
owner, not the PASID owner.

That feels solvable.

Jason


Re: [RFC PATCH 2/3] vfio/hisilicon: register the driver to vfio

2021-04-15 Thread Jason Gunthorpe
On Tue, Apr 13, 2021 at 11:36:22AM +0800, Longfang Liu wrote:
> Register the live migration driver of the accelerator module to vfio
> 
> Signed-off-by: Longfang Liu 
>  drivers/vfio/pci/vfio_pci.c | 11 +++
>  drivers/vfio/pci/vfio_pci_private.h |  9 +
>  2 files changed, 20 insertions(+)
> 
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 65e7e6b..e1b0e37 100644
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -407,6 +407,17 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev)
>   }
>   }
>  
> + if (pdev->vendor == PCI_VENDOR_ID_HUAWEI &&
> + IS_ENABLED(CONFIG_VFIO_PCI_HISI_MIGRATION)) {
> + ret = vfio_pci_hisilicon_acc_init(vdev);

This is exactly what we want to avoid with the work we are doing on
the vfio-pci modularity.

It is a complete mess to just keep adding more stuff here, and as
we've discussed to death really ugly to have such a ridiculously wide
ID match.

You should be working with us on that project and base your work on
top of Max's series.. Alex given the interest here I think we should
find some way forward while we work on a completed version of the mlx5
pci vfio driver..

I'm also confused how this works securely at all, as a general rule a
VFIO PCI driver cannot access the MMIO memory of the function it is
planning to assign to the guest. There is a lot of danger that the
guest could access that MMIO space one way or another.

Here I see the driver obtaining a devm_ioremap() of the same pdev it
is going to assign (and I really wonder why pci_release_mem_regions()
exists at all..)

This is why the mlx5 RFC was showing how to juggle two PCI devices via
the aux device connector.

Jason


Re: linux-next: manual merge of the vfio tree with the drm tree

2021-04-15 Thread Jason Gunthorpe
On Thu, Apr 15, 2021 at 04:47:34PM +1000, Stephen Rothwell wrote:
> Hi all,
> 
> Today's linux-next merge of the vfio tree got a conflict in:
> 
>   drivers/gpu/drm/i915/gvt/gvt.c
> 
> between commit:
> 
>   9ff06c385300 ("drm/i915/gvt: Remove references to struct drm_device.pdev")
> 
> from the drm tree and commit:
> 
>   383987fd15ba ("vfio/gvt: Use mdev_get_type_group_id()")
> 
> from the vfio tree.
> 
> I fixed it up (I used the latter version) and can carry the fix as
> necessary.

Yes that is right, thank you

Jason


Re: [PATCH 2/2] iommu/sva: Remove mm parameter from SVA bind API

2021-04-15 Thread Jason Gunthorpe
On Thu, Apr 15, 2021 at 01:33:24PM +0800, Lu Baolu wrote:
> Hi Jason,
> 
> On 4/14/21 7:26 PM, Jason Gunthorpe wrote:
> > On Wed, Apr 14, 2021 at 02:22:09PM +0800, Lu Baolu wrote:
> > 
> > > I still worry about supervisor pasid allocation.
> > > 
> > > If we use iommu_sva_alloc_pasid() to allocate a supervisor pasid, which
> > > mm should the pasid be set? I've ever thought about passing _mm to
> > > iommu_sva_alloc_pasid(). But if you add "mm != current->mm", this seems
> > > not to work. Or do you prefer a separated interface for supervisor pasid
> > > allocation/free?
> > 
> > Without a mm_struct it is not SVA, so don't use SVA APIs for whatever
> > a 'supervisor pasid' is
> 
> The supervisor PASID has its mm_struct. The only difference is that the
> device will set priv=1 in its DMA transactions with the PASID.

Something like init_mm should not be used with 'SVA'

Jason


Re: [PATCH rdma-next 00/10] Enable relaxed ordering for ULPs

2021-04-14 Thread Jason Gunthorpe
On Wed, Apr 14, 2021 at 02:41:52PM +, David Laight wrote:

> So whatever driver initialises the target needs to configure whatever
> target-specific register enables the RO transfers themselves.

RDMA in general, and mlx5 in particular, is a layered design:

mlx5_core <- owns the PCI function, should turn on RO at the PCI
 function level
 mlx5_en  <- Commands the chip to use RO for queues used in ethernet
ib_core
  ib_uverbs
mlx5_ib <- Commands the chip to use RO for some queues used in
   userspace
  ib_srp* <- A ULP driver built on RDMA - this patch commands the chip
 to use RO on SRP queues
  nvme-rdma <- Ditto
  ib_iser <- Ditto
  rds <- Ditto

So this series is about expanding the set of queues running on mlx5
that have RO turned on when the PCI function is already running with RO
enabled.

We want as many queues as possible RO enabled because it brings big
performance wins

Jason


Re: [PATCH rdma-next 00/10] Enable relaxed ordering for ULPs

2021-04-14 Thread Jason Gunthorpe
On Wed, Apr 14, 2021 at 10:16:28AM -0400, Tom Talpey wrote:
> On 4/12/2021 6:48 PM, Jason Gunthorpe wrote:
> > On Mon, Apr 12, 2021 at 04:20:47PM -0400, Tom Talpey wrote:
> > 
> > > So the issue is only in testing all the providers and platforms,
> > > to be sure this new behavior isn't tickling anything that went
> > > unnoticed all along, because no RDMA provider ever issued RO.
> > 
> > The mlx5 ethernet driver has run in RO mode for a long time, and it
> > operates in basically the same way as RDMA. The issues with Haswell
> > have been worked out there already.
> > 
> > The only open question is if the ULPs have errors in their
> > implementation, which I don't think we can find out until we apply
> > this series and people start running their tests aggressively.
> 
> I agree that the core RO support should go in. But turning it on
> by default for a ULP should be the decision of each ULP maintainer.
> It's a huge risk to shift all the storage drivers overnight. How
> do you propose to ensure the aggressive testing happens?

Realistically we do test most of the RDMA storage ULPs at NVIDIA over
mlx5 which is the only HW that will enable this for now.

I disagree it is a "huge risk".

Additional wider testing is welcomed and can happen over the 16 week
release cycle for a kernel. I would aim to get the relaxed ordering
change merged to linux-next a week or so after the merge window.

Further testing happens before these changes would get picked up in a
distro on something like MLNX_OFED.

I don't think we need to make the patch design worse or overthink the
submission process for something that, so far, hasn't discovered any
issues and already has a proven track record in other ULPs.

Any storage ULP that has a problem here is mis-using verbs and the DMA
API and thus has an existing data-corruption bug that they are simply
lucky to have not yet discovered.

> One thing that worries me is the patch02 on-by-default for the dma_lkey.
> There's no way for a ULP to prevent IB_ACCESS_RELAXED_ORDERING
> from being set in __ib_alloc_pd().

The ULPs are being forced into relaxed_ordering. They don't get to
turn it off one by one. The v2 will be more explicit about this as
there will be no ULP patches, just the verbs core code being updated.
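
Roughly - a sketch of the core-only shape, not the actual v2 patch -
inside __ib_alloc_pd(), when the device has no local_dma_lkey, the
global DMA MR simply always gets the hint:

	int mr_access_flags = IB_ACCESS_LOCAL_WRITE |
			      IB_ACCESS_RELAXED_ORDERING;
	struct ib_mr *mr = device->ops.get_dma_mr(pd, mr_access_flags);

The flag is an optional hint, so devices that do not implement relaxed
ordering can simply ignore it.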

Jason


Re: [PATCH 2/2] iommu/sva: Remove mm parameter from SVA bind API

2021-04-14 Thread Jason Gunthorpe
On Wed, Apr 14, 2021 at 02:22:09PM +0800, Lu Baolu wrote:

> I still worry about supervisor pasid allocation.
> 
> If we use iommu_sva_alloc_pasid() to allocate a supervisor pasid, which
> mm should the pasid be set? I've ever thought about passing _mm to
> iommu_sva_alloc_pasid(). But if you add "mm != current->mm", this seems
> not to work. Or do you prefer a separated interface for supervisor pasid
> allocation/free?

Without a mm_struct it is not SVA, so don't use SVA APIs for whatever
a 'supervisor pasid' is

Jason


Re: [syzbot] WARNING in unsafe_follow_pfn

2021-04-13 Thread Jason Gunthorpe
On Tue, Apr 13, 2021 at 07:20:12PM +0200, Dmitry Vyukov wrote:
> > > Plus users are going to be seeing this as well.  According to the commit
> > > message for 69bacee7f9ad ("mm: Add unsafe_follow_pfn") "Unfortunately
> > > there's some users where this is not fixable (like v4l userptr of iomem
> > > mappings)".  It sort of seems crazy to dump this giant splat and then
> > > tell users to ignore it forever because it can't be fixed...  0_0
> >
> > I think the discussion conclusion was that this interface should not
> > be used by userspace anymore, it is obsolete by some new interface?
> >
> > It should be protected by some kconfig and the kconfig should be
> > turned off for syzkaller runs.
> 
> If this is not a kernel bug, then it must not use WARN_ON[_ONCE]. It
> makes the kernel untestable for both automated systems and humans:

It is a kernel security bug triggerable by userspace.

> And if it's a kernel bug reachable from user-space, then I think this
> code should be removed entirely, not just on all testing systems. Or
> otherwise if we are not removing it for some reason, then it needs to
> be fixed.

Legacy embedded systems apparently require it.

It should be blocked by a kconfig. Distributions and syzkaller runs
should not enable that kconfig. What else can we do for insane uapi?

Jason


Re: [PATCH] RDMA/hw/qib/qib_iba7322: remove useless function

2021-04-13 Thread Jason Gunthorpe
On Tue, Apr 13, 2021 at 05:11:03PM +0800, Jiapeng Chong wrote:
> Fix the following clang warning:
> 
> drivers/infiniband/hw/qib/qib_iba7322.c:803:19: warning: unused function
> 'qib_read_ureg' [-Wunused-function].
> 
> Reported-by: Abaci Robot 
> Signed-off-by: Jiapeng Chong 
> ---
>  drivers/infiniband/hw/qib/qib_iba7322.c | 22 --
>  1 file changed, 22 deletions(-)

Applied to for-next, thanks

Jason


Re: [PATCH for-next v3 0/2] Introduce rdma_set_min_rnr_timer() and use it in RDS

2021-04-13 Thread Jason Gunthorpe
On Tue, Apr 13, 2021 at 09:29:41AM +0300, Leon Romanovsky wrote:
> On Mon, Apr 12, 2021 at 07:58:47PM -0300, Jason Gunthorpe wrote:
> > On Wed, Mar 31, 2021 at 08:43:12PM +0200, Håkon Bugge wrote:
> > > ib_modify_qp() is an expensive operation on some HCAs running
> > > virtualized. This series removes two ib_modify_qp() calls from RDS.
> > > 
> > > I am sending this as a v3, even though it is the first sent to
> > > net. This because the IB Core commit has reach v3.
> > > 
> > > Håkon Bugge (2):
> > >   IB/cma: Introduce rdma_set_min_rnr_timer()
> > >   rds: ib: Remove two ib_modify_qp() calls
> > 
> > Applied to rdma for-next, thanks
> 
> Jason,
> 
> It should be 
> + WARN_ON(id->qp_type != IB_QPT_RC && id->qp_type != IB_QPT_XRC_TGT);
> 
> and not
> + if (WARN_ON(id->qp_type != IB_QPT_RC && id->qp_type != IB_QPT_XRC_TGT))
> + return -EINVAL;

Unless we can completely remove the return code, the if statement is a
reasonable way to use the WARN_ON here

Jason


Re: KASAN: use-after-free Read in cma_cancel_operation, rdma_listen

2021-04-13 Thread Jason Gunthorpe
On Tue, Apr 13, 2021 at 09:42:43PM +0800, Hao Sun wrote:
> Jason Gunthorpe  wrote on Tue, Apr 13, 2021 at 9:34 PM:
> >
> > On Tue, Apr 13, 2021 at 11:36:41AM +0800, Hao Sun wrote:
> > > Hi
> > >
> > > When using Healer(https://github.com/SunHao-0/healer/tree/dev) to fuzz
> > > the Linux kernel, I found two use-after-free bugs which have been
> > > reported a long time ago by Syzbot.
> > > Although the corresponding patches have been merged into upstream,
> > > these two bugs can still be triggered easily.
> > > The original information about Syzbot report can be found here:
> > > https://syzkaller.appspot.com/bug?id=8dc0bcd9dd6ec915ba10b3354740eb420884acaa
> > > https://syzkaller.appspot.com/bug?id=95f89b8fb9fdc42e28ad586e657fea074e4e719b
> >
> > Then why hasn't syzbot seen this in a year's time? Seems strange
> >
> 
> Seems strange to me too, but the fact is that the reproduction program
> in attachment can trigger these two bugs quickly.

Do you have this in the C format?

Jason


Re: [PATCH] nvme: Drop WQ_MEM_RECLAIM flag from core workqueues

2021-04-13 Thread Jason Gunthorpe
On Tue, Apr 13, 2021 at 10:54:04AM +0200, Daniel Wagner wrote:

> Hmm, I am struggling with your last statement. If a worker does an
> allocation it might block. I understand this is something which a worker
> in a WQ_MEM_RECLAIM context is not allowed to do.
> 
> My aim is still to get rid of the warning triggered by the rdma
> code.

The WQ_MEM_RECLAIM is placed on a workqueue not because of what is
*inside* the work, but because of what is *waiting* on the work.

Jason


Re: KASAN: use-after-free Read in cma_cancel_operation, rdma_listen

2021-04-13 Thread Jason Gunthorpe
On Tue, Apr 13, 2021 at 11:36:41AM +0800, Hao Sun wrote:
> Hi
> 
> When using Healer(https://github.com/SunHao-0/healer/tree/dev) to fuzz
> the Linux kernel, I found two use-after-free bugs which have been
> reported a long time ago by Syzbot.
> Although the corresponding patches have been merged into upstream,
> these two bugs can still be triggered easily.
> The original information about Syzbot report can be found here:
> https://syzkaller.appspot.com/bug?id=8dc0bcd9dd6ec915ba10b3354740eb420884acaa
> https://syzkaller.appspot.com/bug?id=95f89b8fb9fdc42e28ad586e657fea074e4e719b

Then why hasn't syzbot seen this in a year's time? Seems strange
 
> Current report Information:
> commit:   89698becf06d341a700913c3d89ce2a914af69a2

What commit is this?

Jason


Re: [PATCH for-next v3 0/2] Introduce rdma_set_min_rnr_timer() and use it in RDS

2021-04-12 Thread Jason Gunthorpe
On Wed, Mar 31, 2021 at 08:43:12PM +0200, Håkon Bugge wrote:
> ib_modify_qp() is an expensive operation on some HCAs running
> virtualized. This series removes two ib_modify_qp() calls from RDS.
> 
> I am sending this as a v3, even though it is the first sent to
> net. This because the IB Core commit has reach v3.
> 
> Håkon Bugge (2):
>   IB/cma: Introduce rdma_set_min_rnr_timer()
>   rds: ib: Remove two ib_modify_qp() calls

Applied to rdma for-next, thanks

Jason


Re: [PATCH rdma-next 00/10] Enable relaxed ordering for ULPs

2021-04-12 Thread Jason Gunthorpe
On Mon, Apr 12, 2021 at 04:20:47PM -0400, Tom Talpey wrote:

> So the issue is only in testing all the providers and platforms,
> to be sure this new behavior isn't tickling anything that went
> unnoticed all along, because no RDMA provider ever issued RO.

The mlx5 ethernet driver has run in RO mode for a long time, and it
operates in basically the same way as RDMA. The issues with Haswell
have been worked out there already.

The only open question is if the ULPs have errors in their
implementation, which I don't think we can find out until we apply
this series and people start running their tests aggressively.

Jason


Re: [PATCH for-next v3 0/2] Introduce rdma_set_min_rnr_timer() and use it in RDS

2021-04-12 Thread Jason Gunthorpe
On Mon, Apr 12, 2021 at 06:35:35PM +, Haakon Bugge wrote:
> 
> 
> > On 7 Apr 2021, at 18:41, Haakon Bugge  wrote:
> > 
> > 
> > 
> >> On 1 Apr 2021, at 19:51, Jason Gunthorpe  wrote:
> >> 
> >> On Wed, Mar 31, 2021 at 07:54:17PM +, Santosh Shilimkar wrote:
> >>> [...]
> >>> 
> >>> Thanks Haakon. Patchset looks fine by me.
> >>> Acked-by: Santosh Shilimkar 
> >> 
> >> Jakub/Dave are you OK if I take this RDS patch rdma to rdma's tree?
> > 
> > Let me know if this is lingering due to Leon's comment about using 
> > WARN_ON() instead of error returns.
> 
> A gentle ping.

I will take it with Santosh's ack.

Jason


Re: [PATCH -next] IB/hfi1: Fix error return code in parse_platform_config()

2021-04-12 Thread Jason Gunthorpe
On Thu, Apr 08, 2021 at 11:31:40AM +, Wang Wensheng wrote:
> Fix to return a negative error code from the error handling
> case instead of 0, as done elsewhere in this function.
> 
> Fixes: 7724105686e7 ("IB/hfi1: add driver files")
> Reported-by: Hulk Robot 
> Signed-off-by: Wang Wensheng 
> ---
>  drivers/infiniband/hw/hfi1/firmware.c | 1 +
>  1 file changed, 1 insertion(+)

Applied to for-next, thanks

Jason


Re: [PATCH -next] RDMA/bnxt_re: Fix error return code in bnxt_qplib_cq_process_terminal()

2021-04-12 Thread Jason Gunthorpe
On Thu, Apr 08, 2021 at 11:31:37AM +, Wang Wensheng wrote:
> Fix to return a negative error code from the error handling
> case instead of 0, as done elsewhere in this function.
> 
> Fixes: 1ac5a4047975 ("RDMA/bnxt_re: Add bnxt_re RoCE driver")
> Reported-by: Hulk Robot 
> Signed-off-by: Wang Wensheng 
> ---
>  drivers/infiniband/hw/bnxt_re/qplib_fp.c | 1 +
>  1 file changed, 1 insertion(+)

Applied to for-next, thanks

Jason


Re: [PATCH -next] RDMA/qedr: Fix error return code in qedr_iw_connect()

2021-04-12 Thread Jason Gunthorpe
On Thu, Apr 08, 2021 at 11:31:35AM +, Wang Wensheng wrote:
> Fix to return a negative error code from the error handling
> case instead of 0, as done elsewhere in this function.
> 
> Fixes: 82af6d19d8d9 ("RDMA/qedr: Fix synchronization methods and memory leaks 
> in qedr")
> Reported-by: Hulk Robot 
> Signed-off-by: Wang Wensheng 
> Acked-by: Michal Kalderon 
> ---
>  drivers/infiniband/hw/qedr/qedr_iw_cm.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)

Applied to for-next, thanks

Jason


Re: [PATCH -next] RDMA/srpt: Fix error return code in srpt_cm_req_recv()

2021-04-12 Thread Jason Gunthorpe
On Thu, Apr 08, 2021 at 09:50:30AM -0700, Bart Van Assche wrote:
> On 4/8/21 4:31 AM, Wang Wensheng wrote:
> > Fix to return a negative error code from the error handling
> > case instead of 0, as done elsewhere in this function.
> > 
> > Fixes: db7683d7deb2 ("IB/srpt: Fix login-related race conditions")
> > Reported-by: Hulk Robot 
> > Signed-off-by: Wang Wensheng 
> >  drivers/infiniband/ulp/srpt/ib_srpt.c | 1 +
> >  1 file changed, 1 insertion(+)
> > 
> > diff --git a/drivers/infiniband/ulp/srpt/ib_srpt.c 
> > b/drivers/infiniband/ulp/srpt/ib_srpt.c
> > index 98a393d..ea44780 100644
> > +++ b/drivers/infiniband/ulp/srpt/ib_srpt.c
> > @@ -2382,6 +2382,7 @@ static int srpt_cm_req_recv(struct srpt_device *const sdev,
> > pr_info("rejected SRP_LOGIN_REQ because target %s_%d is not enabled\n",
> > dev_name(&sport->sdev->device->dev), port_num);
> > mutex_unlock(&sport->mutex);
> > +   ret = -EINVAL;
> > goto reject;
> > }
> 
> Please fix the Hulk Robot. The following code occurs three lines above
> the modified code:
> 
>   ret = -EINVAL;

That is a different if.

The patch looks correct to me, please check again:

ret = srpt_create_ch_ib(ch);
if (ret) {
// Ret is proven to be 0

[..]

if (!sport->enabled) {
rej->reason = cpu_to_be32(
SRP_LOGIN_REJ_INSUFFICIENT_RESOURCES);
pr_info("rejected SRP_LOGIN_REQ because target %s_%d is not 
enabled\n",
dev_name(>device->dev), port_num);
goto reject; // Ret is 0

If there is an assignment to ret between those two blocks it is hidden..

Jason


Re: [PATCH] nvme: Drop WQ_MEM_RECLAIM flag from core workqueues

2021-04-12 Thread Jason Gunthorpe
On Mon, Apr 12, 2021 at 02:49:09PM +0200, Daniel Wagner wrote:

> I've grepped through the code and didn't find anything which supports
> the guarantee claim. Neither mm nor schedule seems to care about this
> flag nor workqueue.c (except the early init bits). Or I must miss
> something.

It is pretty complicated, but the WQ_MEM_RECLAIM preallocates a thread:

static int init_rescuer(struct workqueue_struct *wq)
{
if (!(wq->flags & WQ_MEM_RECLAIM))
return 0;

rescuer = alloc_worker(NUMA_NO_NODE);

This comment explains it:

 * Workqueue rescuer thread function.  There's one rescuer for each
 * workqueue which has WQ_MEM_RECLAIM set.
 *
 * Regular work processing on a pool may block trying to create a new
 * worker which uses GFP_KERNEL allocation which has slight chance of
 * developing into deadlock if some works currently on the same queue
 * need to be processed to satisfy the GFP_KERNEL allocation.  This is
 * the problem rescuer solves.
 *
 * When such condition is possible, the pool summons rescuers of all
 * workqueues which have works queued on the pool and let them process
 * those works so that forward progress can be guaranteed.
 *
 * This should happen rarely.

Basically the allocation of importance in the workqueue is assigning a
worker, so pre-allocating a worker ensures the work can continue to
progress without becoming dependent on allocations.

This is why work under the WQ_MEM_RECLAIM cannot recurse back into the
allocator as it would get a rescuer thread stuck at a point when all
other threads are already stuck.

To remove WQ_MEM_RECLAIM you have to make assertions about the calling
contexts and blocking contexts of the workqueue, not what the work
itself is doing.

Jason


Re: [PATCH] nvme: Drop WQ_MEM_RECLAIM flag from core workqueues

2021-04-12 Thread Jason Gunthorpe
On Mon, Apr 12, 2021 at 02:23:30PM +0200, Daniel Wagner wrote:
> Drop the WQ_MEM_RECLAIM flag as it is not needed and introduces
> warnings.
> 
> The documentation says "all wq which might be used in the memory
> reclaim paths MUST have this flag set. The wq is guaranteed to have at
> least one execution context regardless of memory pressure."
> 
> By setting WQ_MEM_RECLAIM the threads are ready be running during
> early init. The claim it guarantees at least one execution context
> regardless of memory pressure is not supported by the implementation.
> 
> As the nvme core does not depend on early init we can remove the
> WQ_MEM_RECLAIM flag. This resolves a warning in the rdma path:

What does early init have to do with WQ_MEM_RECLAIM?

WQ_MEM_RECLAIM is required when any thread in a reclaim context goes
to sleep waiting for a WQ to complete. For instance by calling
flush_workqueue() or many other things.

The sleeping reclaim context must be guaranteed that the work can be
completed without the work, work queue machinery, or anything the work
has become interconnected with, recursing back into a reclaim.
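
For example, the dependency that forces the flag looks like this
(illustrative only, not nvme code - demo_wq and demo_scan are made up):

static struct workqueue_struct *demo_wq;

/* a shrinker runs in reclaim context; if it can ever wait on the
 * queue like this, the queue must be WQ_MEM_RECLAIM so a rescuer
 * thread guarantees forward progress when worker creation cannot
 * allocate memory */
static unsigned long demo_scan(struct shrinker *s,
			       struct shrink_control *sc)
{
	flush_workqueue(demo_wq);
	return SHRINK_STOP;
}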

IIRC the issue here was some destroy or flush work in some error
condition that happened to be under a reclaim context?

I don't see the kind of analysis I'd expect in this commit message to
justify this change.

Jason


Re: [PATCH] HSI: core: fix resource leaks in hsi_add_client_from_dt()

2021-04-09 Thread Jason Gunthorpe
On Fri, Apr 09, 2021 at 02:08:17PM +0300, Dan Carpenter wrote:
> If some of the allocations fail between the dev_set_name() and the
> device_register() then the name will not be freed.  Fix this by
> moving dev_set_name() directly in front of the call to device_register().
> 
> Fixes: a2aa24734d9d ("HSI: Add common DT binding for HSI client devices")
> Signed-off-by: Dan Carpenter 
> ---
> Jason, this is the most common type of error I see with device_register().
> Is there a downside to calling dev_set_name() later?  Presumably it's
> printed out somewhere, but I feel like just moving the dev_set_name() is
> almost always the best and simplest fix.

It is hard to tell without detailed analysis.. ie a dev_err()/etc call
will use the name. It is why I don't like this design pattern of
avoiding device_initialize() and using device_register()

This movement looks OK though

Reviewed-by: Jason Gunthorpe 

Jason


Re: [PATCH v1 01/14] vfio: Create vfio_fs_type with inode per device

2021-04-09 Thread Jason Gunthorpe
On Fri, Apr 09, 2021 at 08:24:00AM -0600, Alex Williamson wrote:

> > >  #define DRIVER_VERSION   "0.3"
> > >  #define DRIVER_AUTHOR"Alex Williamson "
> > >  #define DRIVER_DESC  "VFIO - User Level meta-driver"
> > > 
> > > +#define VFIO_MAGIC 0x5646494f /* "VFIO" */  
> > Move to include/uapi/linux/magic.h ? 
> 
> Hmm, yeah, I suppose it probably should go there.  Thanks.

From my review we haven't been doing that unless it is actually uapi
relavent for some reason (this is not)

In particular with CH having a patch series to eliminate all of these
cases and drop the constants..

Jason


Re: linux-next: Signed-off-by missing for commit in the rdma tree

2021-04-09 Thread Jason Gunthorpe
On Fri, Apr 09, 2021 at 12:20:09PM -0400, Dennis Dalessandro wrote:
> On 4/8/2021 6:00 PM, Stephen Rothwell wrote:
> > Hi all,
> > 
> > Commit
> > 
> >042a00f93aad ("IB/{ipoib,hfi1}: Add a timeout handler for rdma_netdev")
> > 
> > is missing a Signed-off-by from its author.
> 
> Doh! That's my fault. I must have fat fingered the delete button instead of
> editing the line when I was converting our email addresses to the new name.
> 
> Jason do you want a v2 of the patch?

It is a bit late at this point..

Jason


Re: [PATCH rdma-next 00/10] Enable relaxed ordering for ULPs

2021-04-09 Thread Jason Gunthorpe
On Fri, Apr 09, 2021 at 10:26:21AM -0400, Tom Talpey wrote:

> My belief is that the biggest risk is from situations where completions
> are batched, and therefore polling is used to detect them without
> interrupts (which explicitly).

We don't do this in the kernel.

All kernel ULPs only read data after they observe the CQE. We do not
have "last data polling" and our interrupt model does not support some
hacky "interrupt means go and use the data" approach.

ULPs have to be designed this way to use the DMA API properly.
Fencing a DMA before it is completed by the HW will cause IOMMU
errors.
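
i.e. every kernel ULP consumes data only behind the CQE - a sketch,
with process_buffer()/ctx as stand-ins:

	struct ib_wc wc;

	while (ib_poll_cq(cq, 1, &wc) > 0) {
		if (wc.status != IB_WC_SUCCESS)
			continue;
		/* the receive buffer for wc.wr_id becomes valid only
		 * here, after the CQE has been observed */
		process_buffer(ctx, wc.wr_id, wc.byte_len);
	}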

Userspace is a different story, but that will remain as-is with
optional relaxed ordering.

Jason


[GIT PULL] Please pull RDMA subsystem changes

2021-04-08 Thread Jason Gunthorpe
Hi Linus,

Nothing very exciting here, just a few small bug fixes. No red flags
for this release have shown up.

Thanks,
Jason

The following changes since commit a5e13c6df0e41702d2b2c77c8ad41677ebb065b3:

  Linux 5.12-rc5 (2021-03-28 15:48:16 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git tags/for-linus

for you to fetch changes up to d1c803a9ccd7bd3aff5e989ccfb39ed3b799b975:

  RDMA/addr: Be strict with gid size (2021-04-08 16:14:56 -0300)


RDMA 5.12 third rc pull request

Several bug fixes:

- Regression from the last pull request in cxgb4 related to the ipv6 fixes

- KASAN crasher in rtrs

- oops in hfi1 related to a buggy BIOS

- Userspace could oops qedr's XRC support

- Uninitialized memory when parsing a LS_NLA_TYPE_DGID netlink message


Kamal Heib (1):
  RDMA/qedr: Fix kernel panic when trying to access recv_cq

Leon Romanovsky (1):
  RDMA/addr: Be strict with gid size

Md Haris Iqbal (1):
  RDMA/rtrs-clt: Close rtrs client conn before destroying rtrs clt session 
files

Mike Marciniszyn (1):
  IB/hfi1: Fix probe time panic when AIP is enabled with a buggy BIOS

Potnuri Bharat Teja (1):
  RDMA/cxgb4: check for ipv6 address properly while destroying listener

 drivers/infiniband/core/addr.c |  4 +++-
 drivers/infiniband/hw/cxgb4/cm.c   |  3 ++-
 drivers/infiniband/hw/hfi1/affinity.c  | 21 +
 drivers/infiniband/hw/hfi1/hfi.h   |  1 +
 drivers/infiniband/hw/hfi1/init.c  | 10 +-
 drivers/infiniband/hw/hfi1/netdev_rx.c |  3 +--
 drivers/infiniband/hw/qedr/verbs.c |  3 ++-
 drivers/infiniband/ulp/rtrs/rtrs-clt.c |  2 +-
 8 files changed, 24 insertions(+), 23 deletions(-)




Re: [PATCH v1 2/2] drivers/gpu/drm: don't select DMA_CMA or CMA from aspeed or etnaviv

2021-04-08 Thread Jason Gunthorpe
On Thu, Apr 08, 2021 at 01:38:17PM +0200, Daniel Vetter wrote:

> If you want to change this, we need automatic conflict resolution like apt
> and other package managers have, with suggestions how to fix the config if
> you want to enable a driver, but some of its requirements are missing. The
> current approach of hiding driver symbols complete if any of their
> dependencies are off is really not great.

+1 to this.. 

Though Kconfig is basically already unusable unless you have hours to
carefully craft the kconfig you need to get out.

I'm not sure trying to optimize this by abusing the existing language
rules is such a good idea.

I gave a very half-hearted go at a simple heuristic solver for
kconfig a while ago. It is good enough to sometimes automate a kconfig
task, but it is not so nice.

I use it to do things like "turn on all RDMA drivers" which is quite
hard to do by hand.

It looks like heuristics need a lot of fine tuning, as the
conditionals are complex enough that it is hard to guess which branch
is going to yield a success.

Jason


Re: [PATCH rdma-next 4/8] IB/core: Skip device which doesn't have necessary capabilities

2021-04-08 Thread Jason Gunthorpe
On Wed, Apr 07, 2021 at 03:44:35PM +, Parav Pandit wrote:

> > If it returns EOPNOTSUPP then the remove is never called so if it allocated
> > memory and left it allocated then it is leaking memory.
> > 
> I probably confused you. There is no leak today because add_one
> allocates memory, and later on when SA/CM etc per port cap is not
> present, it is unused left there which is freed on remove_one().
> Returning EOPNOTUPP is fine at start of add_one() before allocation.

Most of ULPs are OK, eg umad does:

umad_dev = kzalloc(struct_size(umad_dev, ports, e - s + 1), GFP_KERNEL);
if (!umad_dev)
	return -ENOMEM;
for (i = s; i <= e; ++i) {
	if (!rdma_cap_ib_mad(device, i))
		continue;
[..]
if (!count) {
	ret = -EOPNOTSUPP;
	goto free;
[..]
free:
	/* balances kref_init */
	ib_umad_dev_put(umad_dev);

It looks like only cm.c and cma.c need fixing, just fix those two.

The CM-using ULPs have a different issue though..

Jason


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-08 Thread Jason Gunthorpe
On Wed, Apr 07, 2021 at 11:50:02PM +, Tian, Kevin wrote:
> > From: Jason Gunthorpe 
> > Sent: Wednesday, April 7, 2021 8:21 PM
> > 
> > On Wed, Apr 07, 2021 at 02:08:33AM +, Tian, Kevin wrote:
> > 
> > > > Because if you don't then we enter insane world where a PASID is being
> > > > created under /dev/ioasid but its translation path flows through setup
> > > > done by VFIO and the whole user API becomes an incomprehensible
> > mess.
> > > >
> > > > How will you even associate the PASID with the other translation??
> > >
> > > PASID is attached to a specific iommu domain (created by VFIO/VDPA),
> > which
> > > has GPA->HPA mappings already configured. If we view that mapping as an
> > > attribute of the iommu domain, it's reasonable to have the userspace-
> > bound
> > > pgtable through /dev/ioasid to nest on it.
> > 
> > A user controlled page table should absolutely not be an attribute of
> > a hidden kernel object, nor should two parts of the kernel silently
> > connect to each other via a hidden internal objects like this.
> > 
> > Security is important - the kind of connection must use some explicit
> > FD authorization to access shared objects, not be made implicit!
> > 
> > IMHO this direction is a dead end for this reason.
> > 
> 
> Could you elaborate what exact security problem is brought with this 
> approach? Isn't ALLOW_PASID the authorization interface for the
> connection?

If the kernel objects don't come out of FDs then no.

> Is it really the only practice in Linux that any new feature has to be
> blocked as long as a refactoring work is identified? 

The practice is to define uAPIs that make sense and have a good chance
to be supported over a long time period, as the software evolves, not
to hacky hacky a giant uAPI mess just to get some feature out the
door. 

This proposal as it was originally shown is exactly the kind of hacky
hacky uapi nobody wants to see. Tunneling an IOMMU uapi through a
whole bunch of other FDs is completely nutz.

Intel should basically be investing most of its time building a robust
and well designed uAPI here, and don't complain that the community is
not doing Intel's job for free.

> Don't people accept any balance between enabling new features and
> completing refactoring work through a staging approach, as long as
> we don't introduce an uAPI specifically for the staging purpose? ☹

Since this is all uapi I don't see it as applicable here.

Jason


Re: [PATCH v3] IB/mlx5: Reduce max order of memory allocated for xlt update

2021-04-07 Thread Jason Gunthorpe
On Sat, Apr 03, 2021 at 04:53:55AM +, Praveen Kumar Kannoju wrote:
> To update xlt (during mlx5_ib_reg_user_mr()), the driver can request up to
> 1 MB (order-8) memory, depending on the size of the MR. This costly
> allocation can sometimes take very long to return (a few seconds). This
> causes the calling application to hang for a long time, especially when the
> system is fragmented.  To avoid these long latency spikes, the calls the
> higher order allocations need to fail faster in case they are not
> available. In order to acheive this we need __GFP_NORETRY flag in the
> gfp_mask before during fetching the free pages. This patch adds this flag
> to the mask.
> 
> Signed-off-by: Praveen Kumar Kannoju 
> Acked-by: Leon Romanovsky 
> ---
>  drivers/infiniband/hw/mlx5/mr.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
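
(Sketched, the effect: with __GFP_NORETRY a high-order attempt fails
fast under fragmentation instead of stalling in reclaim/compaction,
and the caller can drop to a smaller order. Illustrative code only,
not the actual mlx5 call site:)

	gfp_t gfp = GFP_KERNEL | __GFP_ZERO | __GFP_NORETRY;
	void *xlt = (void *)__get_free_pages(gfp, order);

	/* Fail fast instead of spending seconds compacting memory */
	while (!xlt && order > 0)
		xlt = (void *)__get_free_pages(gfp, --order);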

Applied to for-next, thanks

Jason


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-07 Thread Jason Gunthorpe
On Wed, Apr 07, 2021 at 08:43:50PM +0200, Jean-Philippe Brucker wrote:

> * Get a container handle out of /dev/ioasid (or /dev/iommu, really.)
>   No operation available since we don't know what the device and IOMMU
>   capabilities are.
>
> * Attach the handle to a VF. With VFIO that would be
>   VFIO_GROUP_SET_CONTAINER. That causes the kernel to associate an IOMMU
>   with the handle, and decide which operations are available.

Right, this is basically the point - the VFIO container (/dev/vfio)
and the /dev/ioasid we are talking about have a core of
similarity. ioasid is the generalized, modernized, and cross-subsystem
version of the same idea. Instead of calling it "vfio container" we
call it something that evokes the idea of controlling the iommu.

The issue is to separate /dev/vfio generic functionality from vfio and
share it with every subsystem.

It may be that /dev/vfio and /dev/ioasid end up sharing a lot of code,
with a different IOCTL interface around it. The vfio_iommu_driver_ops
is not particularly VFIOy.

Creating /dev/ioasid may primarily start as a code reorganization
exercise.

> * With a map/unmap vIOMMU (or shadow mappings), a single translation level
>   is supported. With a nesting vIOMMU, we're populating the level-2
>   translation (some day maybe by binding the KVM page tables, but
>   currently with map/unmap ioctl).
> 
>   Single-level translation needs single VF per container. 

Really? Why?

Jason


Re: [PATCH rdma-next 01/10] RDMA: Add access flags to ib_alloc_mr() and ib_mr_pool_init()

2021-04-07 Thread Jason Gunthorpe
On Tue, Apr 06, 2021 at 04:54:57PM +0200, Christoph Hellwig wrote:
> On Tue, Apr 06, 2021 at 11:40:39AM -0300, Jason Gunthorpe wrote:
> > Yes, but the complexity is how the drivers are constructed they are
> > designed to reject flags they don't know about..
> > 
> > Hum, it looks like someone has already been in here and we now have an
> > IB_ACCESS_OPTIONAL concept. 
> > 
> > Something like this would be the starting point:
> >
> > [...]
> >
> > However I see only EFA actually uses IB_ACCESS_OPTIONAL, so the lead
> > up would be to audit all the drivers to process optional access_flags
> > properly. Maybe this was done, but I don't see much evidence of it..
> >
> > Sigh. It is a big mess cleaning adventure in drivers really.
> 
> Yes.  When passing flags to drivers we need to have a good pattern
> in place for distinguishing between mandatory and optional flags.
> That is something everyone gets wrong at first (yourself included)
> and which then later needs painful untangling.  So I think we need
> to do that anyway.

It looks like we did actually do this right here: when
RELAXED_ORDERING was originally added it was done as an optional flag,
and a range of bits was set aside in the uAPI for future optional
flags.

This should be quite easy to do then

Jason


Re: [PATCH rdma-next 4/8] IB/core: Skip device which doesn't have necessary capabilities

2021-04-07 Thread Jason Gunthorpe
On Wed, Apr 07, 2021 at 03:06:35PM +, Parav Pandit wrote:
> 
> 
> > From: Jason Gunthorpe 
> > Sent: Tuesday, April 6, 2021 9:17 PM
> > 
> > On Mon, Apr 05, 2021 at 08:49:56AM +0300, Leon Romanovsky wrote:
> > > @@ -2293,6 +2295,17 @@ static void ib_sa_event(struct ib_event_handler *handler,
> > >   }
> > >  }
> > >
> > > +static bool ib_sa_client_supported(struct ib_device *device) {
> > > + unsigned int i;
> > > +
> > > + rdma_for_each_port(device, i) {
> > > + if (rdma_cap_ib_sa(device, i))
> > > + return true;
> > > + }
> > > + return false;
> > > +}
> > 
> > This is already done though:

> It is but, ib_sa_device() allocates ib_sa_device worth of struct for
> each port without checking the rdma_cap_ib_sa().  This results into
> allocating 40 * 512 = 20480 rounded of to power of 2 to 32K bytes of
> memory for the rdma device with 512 ports.  Other modules are also
> similarly wasting such memory.

If it returns EOPNOTSUPP then the remove is never called so if it
allocated memory and left it allocated then it is leaking memory.

If you are saying 32k bytes of temporary allocation matters during
device startup then it needs benchmarks and a use case.

> > The add_one function should return -EOPNOTSUPP if it doesn't want to run
> > on this device and any supported checks should just be at the front - this 
> > is
> > how things work right now

> I am ok to fold this check at the beginning of add callback.  When
> 512 to 1K RoCE devices are used, they do not have SA, CM, CMA etc
> caps on and all the client needs to go through refcnt + xa + sem and
> unroll them.  Is_supported() routine helps to cut down all of it. I
> didn't calculate the usec saved with it.

If that is the reason then explain in the cover letter and provide
benchmarks

Jason


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-07 Thread Jason Gunthorpe
On Wed, Apr 07, 2021 at 02:08:33AM +, Tian, Kevin wrote:

> > Because if you don't then we enter insane world where a PASID is being
> > created under /dev/ioasid but its translation path flows through setup
> > done by VFIO and the whole user API becomes an incomprehensible mess.
> > 
> > How will you even associate the PASID with the other translation??
> 
> PASID is attached to a specific iommu domain (created by VFIO/VDPA), which
> has GPA->HPA mappings already configured. If we view that mapping as an
> attribute of the iommu domain, it's reasonable to have the userspace-bound
> pgtable through /dev/ioasid to nest on it.

A user controlled page table should absolutely not be an attribute of
a hidden kernel object, nor should two parts of the kernel silently
connect to each other via a hidden internal objects like this.

Security is important - the kind of connection must use some explicit
FD authorization to access shared objects, not be made implicit!

IMHO this direction is a dead end for this reason.

> > The entire translation path for any ioasid or PASID should be defined
> > only by /dev/ioasid. Everything else is a legacy API.
> > 
> > > If following your suggestion then VFIO must deny VFIO MAP operations
> > > on sva1 (assume userspace should not mix sva1 and sva2 in the same
> > > container and instead use /dev/ioasid to map for sva1)?
> > 
> > No, userspace creates an ioasid for the guest physical mapping and
> > passes this ioasid to VFIO PCI which will assign it as the first layer
> > mapping on the RID
> 
> Is it a dummy ioasid just for providing GPA mappings for nesting purposes
> of other IOASIDs? Then we waste one per VM?

Generic ioasid's are "free" - they are just software constructs in the
kernel.

> > When PASIDs are allocated the uAPI will be told to logically nested
> > under the first ioasid. When VFIO authorizes a PASID for a RID it
> > checks that all the HW rules are being followed.
> 
> As I explained above, why cannot we just use iommu domain to connect 
> the dots? 

Security.

> Every passthrough framework needs to create an iommu domain
> first. and It needs to support both devices w/ PASID and devices w/o
> PASID.  For devices w/o PASID it needs to invent its own MAP
> interface anyway.  

No, it should consume a ioasid from /dev/ioasid, use a common ioasid
map interface and assign that ioasid to a RID.

Don't get so fixated on PASID as a special case

> Then why do we bother creating another MAP interface through
> /dev/ioasid which not only duplicates but also creating transition
> burden between two set of MAP interfaces when the guest turns on/off
> the pasid capability on the device?

Don't transition. Always use the new interface. qemu detects the
kernel supports /dev/ioasid and *all iommu page table configuration*
goes through there. VFIO and VDPA APIs become unused for iommu
configuration.

> 'universally' upon from which angle you look at this problem. From IOASID
> p.o.v possibly yes, but from device passthrough p.o.v. it's the opposite
> since the passthrough framework needs to handle devices w/o PASID anyway
> (or even for device w/ PASID it could send traffic w/o PASID) thus 
> 'universally'
> makes more sense if the passthrough framework can use one interface of its
> own to manage GPA mappings for all consumers (apply to the case when a
> PASID is allowed/authorized).

You correctly named it /dev/ioasid: it is a generic way to allocate,
manage and assign IOMMU page tables, of which only some may consume a
limited PASID.

RID and RID,PASID are the same thing, just a small difference in how
they match TLPs.

Jason


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-07 Thread Jason Gunthorpe
On Wed, Apr 07, 2021 at 08:17:50AM +, Tian, Kevin wrote:
> > From: Jason Gunthorpe
> > Sent: Tuesday, April 6, 2021 8:43 PM
> > 
> > On Tue, Apr 06, 2021 at 09:35:17AM +0800, Jason Wang wrote:
> > 
> > > > VFIO and VDPA has no business having map/unmap interfaces once we
> > have
> > > > /dev/ioasid. That all belongs in the ioasid side.
> > > >
> > > > I know they have those interfaces today, but that doesn't mean we have
> > > > to keep using them for PASID use cases, they should be replaced with a
> > > > 'do dma from this pasid on /dev/ioasid' interface certainly not a
> > > > 'here is a pasid from /dev/ioasid, go ahead and configure it yourself'
> > > > interface
> > >
> > > So it looks like the PASID was bound to SVA in this design. I think it's 
> > > not
> > > necessairly the case:
> > 
> > No, I wish people would stop talking about SVA.
> > 
> > SVA and vSVA are a very special narrow configuration of a PASID. There
> > are lots of other PASID configurations! That is the whole point, a
> > PASID is complicated, there are many configuration scenarios, they
> > need to be in one place with a very clearly defined uAPI
> > 
> 
> I feel it also makes sense to allow a subsystem to specify which 
> configurations
> are permitted when allowing a PASID on its device

huh? why?

> e.g. excluding things like
> GPA mappings that existing subsystems (VFIO/VDPA) already handle well:

They don't "handle well", they have some historical baggage that is no
longer suitable for the complexity this area has in the modern world.

Forget about the existing APIs and replace them in /dev/ioasid.

> - Share GPA mappings between multiple devices (w/ or w/o PASID) for 
> better IOTLB efficiency;
>
> - Share GPA mappings between transactions w/ PASID and transactions w/o
> PASID from the same device (e.g. GPU) for better IOTLB efficiency;
> 
> - Use the same page table for GPA mappings before and after the guest 
> turns on/off the PASID capability;

All of these are cases you need to design the /dev/ioasid to handle.

It is pretty clear to me that you'll need non-PASID IOASID's as
well.

Ideally a generic IOASID would just be a page table and it doesn't
crystalize into a RID or RID,PASID routing until devices are attached
to it.

Since IOASID can be nested the only thing that makes any sense is for
each level of the nest to be visible under /dev/ioasid. 

What a complete mess it would be if vfio-pci owns the GPA table,
/dev/ioasid has a nested PASID, and vfio-mdev is running a mdev on top
of that PASID.

> All above are given as long as we continue to let VFIO/VDPA manage the
> iommu domain and associated GPA mappings for PASID.

So don't do that. Don't I keep saying this weird split is making a
horrible mess?

You can't reasonably build the complex PASID scenarios you talk about
above unless the entire translation path is owned by one entity:
/dev/ioasid.

You need to focus on figuring out what that looks like then figure out
how to move VDPA and VFIO to consume /dev/ioasid for all of their
translation instead of open-coding half-baked internal versions.

Jason


Re: [PATCH v8 7/9] vfio/mdev: Add iommu related member in mdev_device

2021-04-07 Thread Jason Gunthorpe
On Wed, Apr 07, 2021 at 09:58:05AM +0800, Lu Baolu wrote:

> I've ever tried to implement a bus iommu_ops for mdev devices.
> 
> https://lore.kernel.org/lkml/20201030045809.957927-1-baolu...@linux.intel.com/
> 
> Any comments?

You still have the symbol_get, so something continues to be wrong with
that series:

+   mdev_bus = symbol_get(mdev_bus_type);
+   if (mdev_bus) {
+   if (bus == mdev_bus && !iommu_present(bus)) {
+   symbol_put(mdev_bus_type);

Jason


Re: [PATCH v8 7/9] vfio/mdev: Add iommu related member in mdev_device

2021-04-06 Thread Jason Gunthorpe
On Mon, Mar 25, 2019 at 09:30:34AM +0800, Lu Baolu wrote:
> A parent device might create different types of mediated
> devices. For example, a mediated device could be created
> by the parent device with full isolation and protection
> provided by the IOMMU. One usage case could be found on
> Intel platforms where a mediated device is an assignable
> subset of a PCI, the DMA requests on behalf of it are all
> tagged with a PASID. Since IOMMU supports PASID-granular
> translations (scalable mode in VT-d 3.0), this mediated
> device could be individually protected and isolated by an
> IOMMU.
> 
> This patch adds a new member in the struct mdev_device to
> indicate that the mediated device represented by mdev could
> be isolated and protected by attaching a domain to a device
> represented by mdev->iommu_device. It also adds a helper to
> add or set the iommu device.
> 
> * mdev_device->iommu_device
>   - This, if set, indicates that the mediated device could
> be fully isolated and protected by IOMMU via attaching
> an iommu domain to this device. If empty, it indicates
> using vendor defined isolation, hence bypass IOMMU.
> 
> * mdev_set/get_iommu_device(dev, iommu_device)
>   - Set or get the iommu device which represents this mdev
> in IOMMU's device scope. Drivers don't need to set the
> iommu device if it uses vendor defined isolation.
> 
> Cc: Ashok Raj 
> Cc: Jacob Pan 
> Cc: Kevin Tian 
> Cc: Liu Yi L 
> Suggested-by: Kevin Tian 
> Suggested-by: Alex Williamson 
> Signed-off-by: Lu Baolu 
> Reviewed-by: Jean-Philippe Brucker 
> --- 
>  drivers/vfio/mdev/mdev_core.c| 18 ++
>  drivers/vfio/mdev/mdev_private.h |  1 +
>  include/linux/mdev.h | 14 ++
>  3 files changed, 33 insertions(+)
> 
> diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
> index b96fedc77ee5..1b6435529166 100644
> +++ b/drivers/vfio/mdev/mdev_core.c
> @@ -390,6 +390,24 @@ int mdev_device_remove(struct device *dev, bool force_remove)
>   return 0;
>  }
>  
> +int mdev_set_iommu_device(struct device *dev, struct device *iommu_device)
> +{
> + struct mdev_device *mdev = to_mdev_device(dev);
> +
> + mdev->iommu_device = iommu_device;
> +
> + return 0;
> +}
> +EXPORT_SYMBOL(mdev_set_iommu_device);

I was looking at these functions when touching the mdev stuff and I
have some concerns.

1) Please don't merge dead code. It is a year later and there is still
   no in-tree user for any of this. This is not our process. Even
   worse it was exported so it looks like this dead code is supporting
   out of tree modules.

2) Why is this like this? Every struct device already has a connection
   to the iommu layer and every mdev has a struct device all its own.

   Why did we need to add special 'if (mdev)' stuff all over the
   place? This smells like the same abuse Thomas
   and I pointed out for the interrupt domains.

   After my next series the mdev drivers will have direct access to
   the vfio_device. So an alternative to using the struct device, or
   adding 'if mdev' is to add an API to the vfio_device world to
   inject what iommu configuration is needed from that direction
   instead of trying to discover it from a struct device.

3) The vfio_bus_is_mdev() and related symbol_get() nonsense in
   drivers/vfio/vfio_iommu_type1.c has to go, for the same reasons
   it was not acceptable to do this for the interrupt side either.

4) It seems pretty clear to me this will be heavily impacted by the
   /dev/ioasid discussion. Please consider removing the dead code now.

Basically, please fix this before trying to get idxd mdev merged as
the first user.

Jason


Re: [PATCH rdma-next 4/8] IB/core: Skip device which doesn't have necessary capabilities

2021-04-06 Thread Jason Gunthorpe
On Mon, Apr 05, 2021 at 08:49:56AM +0300, Leon Romanovsky wrote:
> @@ -2293,6 +2295,17 @@ static void ib_sa_event(struct ib_event_handler *handler,
>   }
>  }
>  
> +static bool ib_sa_client_supported(struct ib_device *device)
> +{
> + unsigned int i;
> +
> + rdma_for_each_port(device, i) {
> + if (rdma_cap_ib_sa(device, i))
> + return true;
> + }
> + return false;
> +}

This is already done though:

for (i = 0; i <= e - s; ++i) {
	spin_lock_init(&sa_dev->port[i].ah_lock);
	if (!rdma_cap_ib_sa(device, i + 1))
		continue;
[..]

if (!count) {
	ret = -EOPNOTSUPP;
	goto free;

Why does it need to be duplicated? The other patches are all basically
like that too.

The add_one function should return -EOPNOTSUPP if it doesn't want to
run on this device and any supported checks should just be at the
front - this is how things work right now
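
Roughly, a sketch with the check folded into the front of add_one
(nothing is allocated yet, so the error return cannot leak):

static int ib_sa_add_one(struct ib_device *device)
{
	bool supported = false;
	unsigned int i;

	rdma_for_each_port(device, i)
		supported |= rdma_cap_ib_sa(device, i);
	if (!supported)
		return -EOPNOTSUPP;	/* remove_one will never run */

	/* ... only now allocate the per-port state ... */
	return 0;
}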

Jason


Re: [PATCH rdma-next 01/10] RDMA: Add access flags to ib_alloc_mr() and ib_mr_pool_init()

2021-04-06 Thread Jason Gunthorpe
On Tue, Apr 06, 2021 at 04:15:52PM +0200, Christoph Hellwig wrote:
> On Tue, Apr 06, 2021 at 11:04:37AM -0300, Jason Gunthorpe wrote:
> > > It might be idiotic, but I have to keep the uverbs thing working
> > too.
> > 
> > > There are a lot of assumptions baked into all the drivers that
> > user/kernel is the same thing, we'd have to go in and break this.
> > 
> > Essentially #2 ends up as deleting IB_ACCESS_RELAXED_ORDERING kernel
> > side and instead doing some IB_ACCESS_DISABLE_RO in kernel,
> > translating uverbs IBV_ACCESS_* to this then finding and inverting all
> > the driver logic and also finding and unblocking all the places that
> > enforce valid access flags in the drivers. It is complicated enough
> 
> Inverting the polarity of a flag at the uapi boundary is pretty
> trivial and we already do it all over the kernel.

Yes, but the complexity is in how the drivers are constructed: they
are designed to reject flags they don't know about..

Hum, it looks like someone has already been in here and we now have an
IB_ACCESS_OPTIONAL concept. 

Something like this would be the starting point:

diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index bed4cfe50554f7..fcb107df0eefc6 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -1440,9 +1440,11 @@ enum ib_access_flags {
IB_ZERO_BASED = IB_UVERBS_ACCESS_ZERO_BASED,
IB_ACCESS_ON_DEMAND = IB_UVERBS_ACCESS_ON_DEMAND,
IB_ACCESS_HUGETLB = IB_UVERBS_ACCESS_HUGETLB,
-   IB_ACCESS_RELAXED_ORDERING = IB_UVERBS_ACCESS_RELAXED_ORDERING,
 
IB_ACCESS_OPTIONAL = IB_UVERBS_ACCESS_OPTIONAL_RANGE,
+   _IB_ACCESS_RESERVED1 = IB_UVERBS_ACCESS_RELAXED_ORDERING,
+   IB_ACCESS_DISABLE_RELAXED_ORDERING,
+
IB_ACCESS_SUPPORTED =
((IB_ACCESS_HUGETLB << 1) - 1) | IB_ACCESS_OPTIONAL,
 };

However I see only EFA actually uses IB_ACCESS_OPTIONAL, so the lead
up would be to audit all the drivers to process optional access_flags
properly. Maybe this was done, but I don't see much evidence of it..

Sigh. It is a big mess cleaning adventure in drivers really.
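
(The per-driver pattern the audit would establish, sketched - a
driver silently strips optional bits it doesn't implement and only
rejects unknown mandatory ones:)

static int check_access_flags(int access_flags)
{
	int supported = IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_READ |
			IB_ACCESS_REMOTE_WRITE;

	/* Optional flags are hints; ignoring them must be harmless */
	access_flags &= ~IB_ACCESS_OPTIONAL;
	if (access_flags & ~supported)
		return -EOPNOTSUPP;
	return 0;
}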

> Do we actually ever need the strict ordering semantics in the kernel?

No, only for uverbs.

Jason


Re: [PATCH rdma-next 01/10] RDMA: Add access flags to ib_alloc_mr() and ib_mr_pool_init()

2021-04-06 Thread Jason Gunthorpe
On Tue, Apr 06, 2021 at 02:30:34PM +0200, Christoph Hellwig wrote:
> On Tue, Apr 06, 2021 at 09:13:12AM -0300, Jason Gunthorpe wrote:
> > So we broadly have two choices
> >  1) Diverge the kernel and user interfaces and make the RDMA drivers
> > special case all the kernel stuff to force
> > ACCESS_RELAXED_ORDERING when they are building MRs and processing
> > FMR WQE's
> >  2) Keep the two interfaces the same and push the
> > ACCESS_RELAXED_ORDERING to a couple of places in the ULPs so the
> > drivers see a consistent API
> > 
> > They are both poor choices, but I think #2 has a greater chance of
> > everyone doing their parts correctly.
> 
> No, 1 is the only sensible choice.  The userspace verbs interface is
> a mess and should not inflict pain on the kernel in any way.  We've moved
> away from a lot of the idiotic "Verbs API" concepts with things like
> how we handle the global lkey, the new CQ API and the RDMA R/W
> abstraction and that massively helped the kernel ecosystem.

It might be idiotic, but I have to keep the uverbs thing working
too.

There are a lot of assumptions baked into all the drivers that
user/kernel is the same thing, we'd have to go in and break this.

Essentially #2 ends up as deleting IB_ACCESS_RELAXED_ORDERING kernel
side and instead doing some IB_ACCESS_DISABLE_RO in kernel,
translating uverbs IBV_ACCESS_* to this then finding and inverting all
the driver logic and also finding and unblocking all the places that
enforce valid access flags in the drivers. It is complicated enough

Maybe Avihai will give it a try

Jason


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-06 Thread Jason Gunthorpe
On Tue, Apr 06, 2021 at 09:35:17AM +0800, Jason Wang wrote:

> > VFIO and VDPA has no business having map/unmap interfaces once we have
> > /dev/ioasid. That all belongs in the ioasid side.
> > 
> > I know they have those interfaces today, but that doesn't mean we have
> > to keep using them for PASID use cases, they should be replaced with a
> > 'do dma from this pasid on /dev/ioasid' interface certainly not a
> > 'here is a pasid from /dev/ioasid, go ahead and configure it yourself'
> > interface
>  
> So it looks like the PASID was bound to SVA in this design. I think it's not
> necessairly the case:

No, I wish people would stop talking about SVA.

SVA and vSVA are a very special narrow configuration of a PASID. There
are lots of other PASID configurations! That is the whole point, a
PASID is complicated, there are many configuration scenarios, they
need to be in one place with a very clearly defined uAPI

> 1) PASID can be implemented without SVA, in this case a map/unmap interface
> is still required

Any interface to manipulate a PASID should be under /dev/ioasid. We do
not want to duplicate this into every subsystem.

> 2) For the case that hypervisor want to do some mediation in the middle for
> a virtqueue. e.g in the case of control vq that is implemented in the
> VF/ADI/SF itself, the hardware virtqueue needs to be controlled by Qemu,
> Though binding qemu's page table to cvq can work but it looks like a
> overkill, a small dedicated buffers that is mapped for this PASID seems more
> suitalbe.

/dev/ioasid should allow userspace to setup any PASID configuration it
wants. There are many choices. That is the whole point, instead of
copying all the PASID configuration options into every
subsystem we have one place to configure it.

If you want a PASID (or generic ioasid) that has the guest physical
map, which is probably all that VDPA would ever want, then /dev/ioasid
should be able to prepare that.

If you just want to map a few buffers into a PASID then it should be
able to do that too.

> So do you mean the device should not expose the PASID confiugration API to
> guest? I think it could happen if we assign the whole device and let guest
> to configure it for nested VMs.

This always needs co-operation with the vIOMMU driver. We can't have
nested PASID use without both parts working together.

The vIOMMU driver configures the PASID and assigns the mappings
(however complicated that turns out to be)

The VDPA/mdev driver authorizes the HW to use the ioasid mapping, eg
by authorizing a queue to issue PCIe TLPs with a specific PASID.

The authorization is triggered by the guest telling the vIOMMU to
allow a vRID to talk to a PASID, which qemu would have to translate to
telling something like the VDPA driver under the vRID that it can use
a PASID from /dev/ioasid

For security a VDPA/mdev device MUST NOT issue PASIDs that the vIOMMU
has not authorized its vRID to use. Otherwise the security model of
something like VFIO in the guest becomes completely broken.

Jason


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-06 Thread Jason Gunthorpe
On Tue, Apr 06, 2021 at 01:27:15AM +, Tian, Kevin wrote:
> 
> and here is one example why using existing VFIO/VDPA interface makes
> sense. say dev1 (w/ sva) and dev2 (w/o sva) are placed in a single VFIO 
> container. 

Forget about SVA, it is an irrelevant detail of how a PASID is
configured.

> The container is associated to an iommu domain which contains a
> single 2nd-level page table, shared by both devices (when attached
> to the domain).

This level should be described by an ioasid.

> The VFIO MAP operation is applied to the 2nd-level
> page table thus naturally applied to both devices. Then userspace
> could use /dev/ioasid to further allocate IOASIDs and bind multiple
> 1st-level page tables for dev1, nested on the shared 2nd-level page
> table.

Because if you don't then we enter insane world where a PASID is being
created under /dev/ioasid but its translation path flows through setup
done by VFIO and the whole user API becomes an incomprehensible mess.

How will you even associate the PASID with the other translation??

The entire translation path for any ioasid or PASID should be defined
only by /dev/ioasid. Everything else is a legacy API.

> If following your suggestion then VFIO must deny VFIO MAP operations
> on sva1 (assume userspace should not mix sva1 and sva2 in the same
> container and instead use /dev/ioasid to map for sva1)? 

No, userspace creates an ioasid for the guest physical mapping and
passes this ioasid to VFIO PCI which will assign it as the first layer
mapping on the RID

When PASIDs are allocated the uAPI will be told to logically nested
under the first ioasid. When VFIO authorizes a PASID for a RID it
checks that all the HW rules are being followed.

If there are rules like groups of VFIO devices must always use the
same IOASID then VFIO will check these too (and realistically qemu
will have only one guest physical map ioasid anyhow)

There is no real difference between setting up an IOMMU table for a
(RID,PASID) tuple or just a RID. We can do it universally with
one interface for all consumers.
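
To make that concrete, a rough sketch of what such a single attach
interface could look like - every name below is invented for
illustration, nothing like it exists yet:

/* Purely hypothetical uAPI: attach an IOASID (an IOMMU page table)
 * to a device, optionally scoped to a PASID */
struct ioasid_attach_device {
	__u32 argsz;
	__u32 ioasid;		/* allocated from /dev/ioasid earlier */
	__s32 device_fd;	/* VFIO/VDPA device FD authorizing the use */
	__u32 flags;
#define IOASID_ATTACH_PASID	(1 << 0)   /* match (RID, PASID) TLPs, not just RID */
	__u32 pasid;		/* valid only if IOASID_ATTACH_PASID is set */
};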

I wanted this when we were doing VDPA for the first time, now that we
are doing pasid and more difficult stuff I view it as essential.

Jason


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-06 Thread Jason Gunthorpe
On Tue, Apr 06, 2021 at 01:02:05AM +, Tian, Kevin wrote:
> > From: Jason Gunthorpe 
> > Sent: Tuesday, April 6, 2021 7:40 AM
> > 
> > On Fri, Apr 02, 2021 at 07:58:02AM +, Tian, Kevin wrote:
> > > > From: Jason Gunthorpe 
> > > > Sent: Thursday, April 1, 2021 9:47 PM
> > > >
> > > > On Thu, Apr 01, 2021 at 01:43:36PM +, Liu, Yi L wrote:
> > > > > > From: Jason Gunthorpe 
> > > > > > Sent: Thursday, April 1, 2021 9:16 PM
> > > > > >
> > > > > > On Thu, Apr 01, 2021 at 01:10:48PM +, Liu, Yi L wrote:
> > > > > > > > From: Jason Gunthorpe 
> > > > > > > > Sent: Thursday, April 1, 2021 7:47 PM
> > > > > > > [...]
> > > > > > > > I'm worried Intel views the only use of PASID in a guest is with
> > > > > > > > ENQCMD, but that is not consistent with the industry. We need to
> > see
> > > > > > > > normal nested PASID support with assigned PCI VFs.
> > > > > > >
> > > > > > > I'm not quite following here. Intel also allows PASID usage in guest
> > without
> > > > > > > ENQCMD. e.g. Passthru a PF to guest, and use PASID on it without
> > > > > > ENQCMD.
> > > > > >
> > > > > > Then you need all the parts, the hypervisor calls from the vIOMMU,
> > and
> > > > > > you can't really use a vPASID.
> > > > >
> > > > > This is a diagram shows the vSVA setup.
> > > >
> > > > I'm not talking only about vSVA. Generic PASID support with arbitrary
> > > > mappings.
> > > >
> > > > And how do you deal with the vPASID vs pPASID issue if the system has
> > > > a mix of physical devices and mdevs?
> > > >
> > >
> > > We plan to support two schemes. One is vPASID identity-mapped to
> > > pPASID then the mixed scenario just works, with the limitation of
> > > lacking of live migration support. The other is non-identity-mapped
> > > scheme, where live migration is supported but physical devices and
> > > mdevs should not be mixed in one VM if both expose SVA capability
> > > (requires some filtering check in Qemu).
> > 
> > That just becomes "block vPASID support if any device that
> > doesn't use ENQCMD is plugged into the guest"
> 
> The limitation is only for physical device. and in reality it is not that
> bad. To support live migration with physical device we anyway need 
> additional work to migrate the device state (e.g. based on Max's work), 
> then it's not unreasonable to also mediate guest programming of 
> device specific PASID register to enable vPASID (need to translate in
> the whole VM lifespan but likely is not a hot path).

IMHO that is pretty unreasonable.. More likely we end up with vPASID
tables in each migratable device like KVM has.
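
A sketch of that idea - a per-device vPASID to pPASID table, all
names invented here:

struct mdev_pasid_table {
	struct xarray map;	/* index: guest vPASID, entry: host pPASID */
};

static u32 mdev_translate_pasid(struct mdev_pasid_table *t, u32 vpasid)
{
	void *entry = xa_load(&t->map, vpasid);

	/* U32_MAX standing in for "no translation installed" */
	return entry ? (u32)xa_to_value(entry) : U32_MAX;
}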

> > Which needs a special VFIO capability of some kind so qemu knows to
> > block it. This really needs to all be layed out together so someone
> > can understand it :(
> 
> Or could simply based on whether the VFIO device supports live migration.

You need to define affirmative caps that indicate that vPASID will be
supported by the VFIO device.

> > Why doesn't the siov cookbook explain this stuff??
> > 
> > > We hope the /dev/ioasid can support both schemes, with the minimal
> > > requirement of allowing userspace to tag a vPASID to a pPASID and
> > > allowing mdev to translate vPASID into pPASID, i.e. not assuming that
> > > the guest will always use pPASID.
> > 
> > What I'm unclear on is if /dev/ioasid even needs to care about
> > vPASID or if vPASID is just a hidden artifact of the KVM connection to
> > setup the translation table and the vIOMMU driver in qemu.
> 
> Not just for KVM. Also required by mdev, which needs to translate
> vPASID into pPASID when ENQCMD is not used.

Do we have any mdev's that will do this?

> should only care about the operations related to pPASID. VFIO could
> carry vPASID information to mdev.

It depends how common this is, I suppose

Jason


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-06 Thread Jason Gunthorpe
On Tue, Apr 06, 2021 at 12:37:35AM +, Tian, Kevin wrote:

> With nested translation it is GVA->GPA->HPA. The kernel needs to
> fix fault related to GPA->HPA (managed by VFIO/VDPA) while 
> handle_mm_fault only handles HVA->HPA. In this case, the 2nd-level
> page fault is expected to be delivered to VFIO/VDPA first which then
> find HVA related to GPA, call handle_mm_fault to fix HVA->HPA,
> and then call iommu_map to fix GPA->HPA in the IOMMU page table.
> This is exactly like how CPU EPT violation is handled.

No, it should all be in the /dev/ioasid layer not duplicated into
every user.

> > If the fault needs to be fixed in the guest, then it needs to be
> > delivered over /dev/ioasid in some way and injected into the
> > vIOMMU. VFIO and VDPA have nothing to do with the vIOMMU driver in qemu.
> > 
> > You need to have an interface under /dev/ioasid to create both page
> > table levels and part of that will be to tell the kernel what VA is
> > mapped and how to handle faults.
> 
> VFIO/VDPA already have their own interface to manage GPA->HPA
> mappings. Why do we want to duplicate it in /dev/ioasid? 

They have their own interface to manage other types of HW, we should
not duplicate PASID programming into there too.

Jason


Re: [PATCH rdma-next 01/10] RDMA: Add access flags to ib_alloc_mr() and ib_mr_pool_init()

2021-04-06 Thread Jason Gunthorpe
On Tue, Apr 06, 2021 at 08:58:54AM +0300, Leon Romanovsky wrote:
> On Tue, Apr 06, 2021 at 07:27:17AM +0200, Christoph Hellwig wrote:
> > On Tue, Apr 06, 2021 at 08:23:28AM +0300, Leon Romanovsky wrote:
> > > The same proposal (enable unconditionally) was raised during
> > > submission preparations and we decided to follow same pattern
> > > as other verbs objects which receive flag parameter.
> > 
> > A flags argument can be added when it actually is needed.  Using it
> > to pass an argument enabled by all ULPs just gets us back to the bad
> > old days of complete crap APIs someone drew up on a whiteboard.
> 
> Let's wait till Jason wakes up, before jumping to conclusions.
> It was his request to update all ULPs.

We are stuck in a bad spot here

Only the kernel can universally support ACCESS_RELAXED_ORDERING
because all the ULPs are required to work sanely already, but
userspace does not have this limitation and we know of enough popular
verbs users that will break if we unconditionally switch on
ACCESS_RELAXED_ORDERING.

Further, we have the problem with the FMR interface that technically
lets the caller control the entire access_flags, including
ACCESS_RELAXED_ORDERING.

So we broadly have two choices
 1) Diverge the kernel and user interfaces and make the RDMA drivers
special case all the kernel stuff to force
ACCESS_RELAXED_ORDERING when they are building MRs and processing
FMR WQE's
 2) Keep the two interfaces the same and push the
ACCESS_RELAXED_ORDERING to a couple of places in the ULPs so the
drivers see a consistent API

They are both poor choices, but I think #2 has a greater chance of
everyone doing their parts correctly.

Jason


Re: [PATCH rdma-next 00/10] Enable relaxed ordering for ULPs

2021-04-06 Thread Jason Gunthorpe
On Tue, Apr 06, 2021 at 08:09:43AM +0300, Leon Romanovsky wrote:
> On Tue, Apr 06, 2021 at 10:37:38AM +0800, Honggang LI wrote:
> > On Mon, Apr 05, 2021 at 08:23:54AM +0300, Leon Romanovsky wrote:
> > > From: Leon Romanovsky 
> > > 
> > > From Avihai,
> > > 
> > > Relaxed Ordering is a PCIe mechanism that relaxes the strict ordering
> > > imposed on PCI transactions, and thus, can improve performance.
> > > 
> > > Until now, relaxed ordering could be set only by user space applications
> > > for user MRs. The following patch series enables relaxed ordering for the
> > > kernel ULPs as well. Relaxed ordering is an optional capability, and as
> > > such, it is ignored by vendors that don't support it.
> > > 
> > > The following test results show the performance improvement achieved
> > 
> > Did you test this patchset with CPU does not support relaxed ordering?
> 
> I don't think so, the CPUs that don't support RO are Intel's 
> fourth/fifth-generation
> and they are not interesting from performance point of view.
> 
> > 
> > We observed significantly performance degradation when run perftest with
> > relaxed ordering enabled over old CPU.
> > 
> > https://github.com/linux-rdma/perftest/issues/116
> 
> The perftest is slightly different, but you pointed to the valid point.
> We forgot to call pcie_relaxed_ordering_enabled() before setting RO bit
> and arguably this was needed to be done in perftest too.

No, the PCI device should not have the RO bit set in this situation.
It is something mlx5_core needs to do. We can't push this into
applications.

There should be no performance difference from asking for
IBV_ACCESS_RELAXED_ORDERING when RO is disabled at the PCI config and
not asking for it at all.

Either the platform has working relaxed ordering that gives a
performance gain and the RO config spec bit should be set, or it
doesn't and the bit should be clear.

This is not something to decide in userspace, or in RDMA. At worst it
becomes another platform specific PCI tunable people have to set.

I thought the old haswell systems were quirked to disable RO globally
anyhow?
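
(For reference, the kind of check mlx5_core could make - sketch only;
pcie_relaxed_ordering_enabled() is the existing kernel helper the
platform quirks feed into, and the cap bit name is my assumption of
the relevant mlx5 one:)

static bool mlx5_relaxed_ordering_usable(struct mlx5_core_dev *mdev)
{
	/* Device support AND platform support: the PCI core clears
	 * this on quirked roots such as the old Haswell systems */
	return MLX5_CAP_GEN(mdev, relaxed_ordering_write) &&
	       pcie_relaxed_ordering_enabled(mdev->pdev);
}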

Jason


Re: [PATCH rdma-next 00/10] Enable relaxed ordering for ULPs

2021-04-06 Thread Jason Gunthorpe
On Mon, Apr 05, 2021 at 11:42:31PM +, Chuck Lever III wrote:
 
> We need to get a better idea what correctness testing has been done,
> and whether positive correctness testing results can be replicated
> on a variety of platforms.

RO has been rolling out slowly on mlx5 over a few years and storage
ULPs are the last to change. eg the mlx5 ethernet driver has had RO
turned on for a long time, userspace HPC applications have been using
it for a while now too.

We know there are platforms with broken RO implementations (like
Haswell) but the kernel is supposed to globally turn off RO on all
those cases. I'd be a bit surprised if we discover any more from this
series.

On the other hand there are platforms that get huge speed ups from
turning this on, AMD is one example, there are a bunch in the ARM
world too.

Still, obviously people should test on the platforms they have.

Jason


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-05 Thread Jason Gunthorpe
On Fri, Apr 02, 2021 at 08:22:28AM +, Tian, Kevin wrote:
> > From: Jason Gunthorpe 
> > Sent: Tuesday, March 30, 2021 9:29 PM
> > 
> > >
> > > First, userspace may use ioasid in a non-SVA scenario where ioasid is
> > > bound to specific security context (e.g. a control vq in vDPA) instead of
> > > tying to mm. In this case there is no pgtable binding initiated from user
> > > space. Instead, ioasid is allocated from /dev/ioasid and then programmed
> > > to the intended security context through specific passthrough framework
> > > which manages that context.
> > 
> > This sounds like the exact opposite of what I'd like to see.
> > 
> > I do not want to see every subsystem gaining APIs to program a
> > PASID. All of that should be consolidated in *one place*.
> > 
> > I do not want to see VDPA and VFIO have two nearly identical sets of
> > APIs to control the PASID.
> > 
> > Drivers consuming a PASID, like VDPA, should consume the PASID and do
> > nothing more than authorize the HW to use it.
> > 
> > quemu should have general code under the viommu driver that drives
> > /dev/ioasid to create PASID's and manage the IO mapping according to
> > the guest's needs.
> > 
> > Drivers like VDPA and VFIO should simply accept that PASID and
> > configure/authorize their HW to do DMA's with its tag.
> > 
> 
> I agree with you on consolidating things in one place (especially for the
> general SVA support). But here I was referring to an usage without 
> pgtable binding (Possibly Jason. W can say more here), where the 
> userspace just wants to allocate PASIDs, program/accept PASIDs to 
> various workqueues (device specific), and then use MAP/UNMAP 
> interface to manage address spaces associated with each PASID. 
> I just wanted to point out that the latter two steps are through 
> VFIO/VDPA specific interfaces. 

No, don't do that.

VFIO and VDPA has no business having map/unmap interfaces once we have
/dev/ioasid. That all belongs in the ioasid side.

I know they have those interfaces today, but that doesn't mean we have
to keep using them for PASID use cases, they should be replaced with a
'do dma from this pasid on /dev/ioasid' interface certainly not a
'here is a pasid from /dev/ioasid, go ahead and configure it yourself'
interface

This is because PASID is *complicated* in the general case! For
instance all the two level stuff you are talking about must not leak
into every user!

Jason


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-05 Thread Jason Gunthorpe
On Fri, Apr 02, 2021 at 07:58:02AM +, Tian, Kevin wrote:
> > From: Jason Gunthorpe 
> > Sent: Thursday, April 1, 2021 9:47 PM
> > 
> > On Thu, Apr 01, 2021 at 01:43:36PM +, Liu, Yi L wrote:
> > > > From: Jason Gunthorpe 
> > > > Sent: Thursday, April 1, 2021 9:16 PM
> > > >
> > > > On Thu, Apr 01, 2021 at 01:10:48PM +, Liu, Yi L wrote:
> > > > > > From: Jason Gunthorpe 
> > > > > > Sent: Thursday, April 1, 2021 7:47 PM
> > > > > [...]
> > > > > > I'm worried Intel views the only use of PASID in a guest is with
> > > > > > ENQCMD, but that is not consistent with the industry. We need to see
> > > > > > normal nested PASID support with assigned PCI VFs.
> > > > >
> > > > > I'm not quite following here. Intel also allows PASID usage in guest
> > > > > without
> > > > > ENQCMD. e.g. Passthru a PF to guest, and use PASID on it without
> > > > ENQCMD.
> > > >
> > > > Then you need all the parts, the hypervisor calls from the vIOMMU, and
> > > > you can't really use a vPASID.
> > >
> > > This is a diagram showing the vSVA setup.
> > 
> > I'm not talking only about vSVA. Generic PASID support with arbitrary
> > mappings.
> > 
> > And how do you deal with the vPASID vs pPASID issue if the system has
> > a mix of physical devices and mdevs?
> > 
> 
> We plan to support two schemes. One is vPASID identity-mapped to
> pPASID then the mixed scenario just works, with the limitation of 
> lacking of live migration support. The other is non-identity-mapped 
> scheme, where live migration is supported but physical devices and 
> mdevs should not be mixed in one VM if both expose SVA capability 
> (requires some filtering check in Qemu). 

That just becomes "block vPASID support if any device that
doesn't use ENQCMD is plugged into the guest"

Which needs a special VFIO capability of some kind so qemu knows to
block it. This really needs to all be layed out together so someone
can understand it :(

Why doesn't the siov cookbook explain this stuff??

> We hope the /dev/ioasid can support both schemes, with the minimal
> requirement of allowing userspace to tag a vPASID to a pPASID and
> allowing mdev to translate vPASID into pPASID, i.e. not assuming that
> the guest will always use pPASID.

What I'm unclear on is if /dev/ioasid even needs to care about
vPASID or if vPASID is just a hidden artifact of the KVM connection to
setup the translation table and the vIOMMU driver in qemu.

Since the physical HW never sees the vPASID I'm inclined to think the
latter.

Jason


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-05 Thread Jason Gunthorpe
On Fri, Apr 02, 2021 at 07:30:23AM +, Tian, Kevin wrote:
> > From: Jason Gunthorpe 
> > Sent: Friday, April 2, 2021 12:04 AM
> > 
> > On Thu, Apr 01, 2021 at 02:08:17PM +, Liu, Yi L wrote:
> > 
> > > DMA page faults are delivered to root-complex via page request message
> > and
> > > it is per-device according to PCIe spec. Page request handling flow is:
> > >
> > > 1) iommu driver receives a page request from device
> > > 2) iommu driver parses the page request message. Get the RID,PASID,
> > faulted
> > >page and requested permissions etc.
> > > 3) iommu driver triggers fault handler registered by device driver with
> > >iommu_report_device_fault()
> > 
> > This seems confused.
> > 
> > The PASID should define how to handle the page fault, not the driver.
> > 
> > I don't remember any device specific actions in ATS, so what is the
> > driver supposed to do?
> > 
> > > 4) device driver's fault handler signals an event FD to notify userspace 
> > > to
> > >fetch the information about the page fault. If it's VM case, inject the
> > >page fault to VM and let guest to solve it.
> > 
> > If the PASID is set to 'report page fault to userspace' then some
> > event should come out of /dev/ioasid, or be reported to a linked
> > eventfd, or whatever.
> > 
> > If the PASID is set to 'SVM' then the fault should be passed to
> > handle_mm_fault
> > 
> > And so on.
> > 
> > Userspace chooses what happens based on how they configure the PASID
> > through /dev/ioasid.
> > 
> > Why would a device driver get involved here?
> > 
> > > Eric has sent below series for the page fault reporting for VM with 
> > > passthru
> > > device.
> > > https://lore.kernel.org/kvm/20210223210625.604517-5-
> > eric.au...@redhat.com/
> > 
> > It certainly should not be in vfio pci. Everything using a PASID needs
> > this infrastructure, VDPA, mdev, PCI, CXL, etc.
> > 
> 
> This touches an interesting fact:
> 
> The fault may be triggered in either 1st-level or 2nd-level page table, 
> when nested translation is enabled (in vSVA case). The 1st-level is bound 
> by the user space, which therefore needs to receive the fault event. The 
> 2nd-level is managed by VFIO (or vDPA), which needs to fix the fault in 
> kernel (e.g. find HVA per faulting GPA, call handle_mm_fault and map 
> GPA->HPA to IOMMU). Yi's current proposal lets VFIO to register the 
> device fault handler, which then forward the event through /dev/ioasid 
> to userspace only if it is a 1st-level fault. Are you suggesting a pgtable-
> centric fault reporting mechanism to separate handlers in each level, 
> i.e. letting VFIO register handler only for 2nd-level fault and then /dev/
> ioasid register handler for 1st-level fault?

This I'm struggling to understand. /dev/ioasid should handle all the
fault cases, why would VFIO ever get involved in a fault? What would
it even do?

If the fault needs to be fixed in the hypervisor then it is a kernel
fault and it does handle_mm_fault. This absolutely should not be in
VFIO or VDPA

If the fault needs to be fixed in the guest, then it needs to be
delivered over /dev/ioasid in some way and injected into the
vIOMMU. VFIO and VDPA have nothing to do with the vIOMMU driver in qemu.

You need to have an interface under /dev/ioasid to create both page
table levels and part of that will be to tell the kernel what VA is
mapped and how to handle faults.

VFIO/VDPA do *nothing* more than authorize the physical device to use
the given PASID.

In the VDPA case you might need to have SW access to the PASID, but
that should be provided by a generic iommu layer interface like
'copy_to/from_pasid()' not by involving VDPA in the address mapping.
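
(Shaped roughly like this, purely as a sketch of the kind of generic
helper meant - no such interface exists today:)

/* Hypothetical: copy into @iova within the address space identified
 * by @pasid, walking the IOMMU page table in software */
int iommu_copy_to_pasid(struct iommu_domain *domain, u32 pasid,
			unsigned long iova, const void *from, size_t len);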

Jason


Re: [PATCH rdma-next 00/10] Enable relaxed ordering for ULPs

2021-04-05 Thread Jason Gunthorpe
On Mon, Apr 05, 2021 at 03:41:15PM +0200, Christoph Hellwig wrote:
> On Mon, Apr 05, 2021 at 08:23:54AM +0300, Leon Romanovsky wrote:
> > From: Leon Romanovsky 
> > 
> > From Avihai,
> > 
> > Relaxed Ordering is a PCIe mechanism that relaxes the strict ordering
> > imposed on PCI transactions, and thus, can improve performance.
> > 
> > Until now, relaxed ordering could be set only by user space applications
> > for user MRs. The following patch series enables relaxed ordering for the
> > kernel ULPs as well. Relaxed ordering is an optional capability, and as
> > such, it is ignored by vendors that don't support it.
> > 
> > The following test results show the performance improvement achieved
> > with relaxed ordering. The test was performed on a NVIDIA A100 in order
> > to check performance of storage infrastructure over xprtrdma:
> 
> Isn't the Nvidia A100 a GPU not actually supported by Linux at all?
> What does that have to do with storage protocols?

I think it is a typo (or at least it makes no sense to be talking
about NFS with a GPU chip). Probably it should be a DGX A100, which is
a dual socket AMD server with a lot of PCIe, and xprtrdma is an
NFS-RDMA workload.

AMD dual socket systems are well known to benefit from relaxed
ordering, people have been doing this in userspace for a while now
with the opt in.

What surprises me is the performance difference, I hadn't heard it is
4x!

Jason


Re: CFI violation in drivers/infiniband/core/sysfs.c

2021-04-04 Thread Jason Gunthorpe
On Fri, Apr 02, 2021 at 06:29:55PM -0700, Kees Cook wrote:
> On Fri, Apr 02, 2021 at 08:30:18PM -0300, Jason Gunthorpe wrote:
> > On Fri, Apr 02, 2021 at 04:03:30PM -0700, Kees Cook wrote:
> > 
> > > > relevant. It seems to me that the hw_counters 'struct attribute_group'
> > > > should probably be its own kobj within both of these structures so they
> > > > can have their own sysfs ops (unless there is some other way to do this
> > > > that I am missing).
> > 
> > Err, yes, every subclass of the attribute should be accompanied by a
> > distinct kobject type to relay the show methods with type safety, this
> > is how this design pattern is intended to be used.
> > 
> > If I understand your report properly the hw_stats_attribute is being
> > assigned to a 'port_type' kobject and it only works by pure luck because
> > the show/store happens to overlap between port and hsa attributes?
> 
> "happens to" :) Yeah, they're all like this, unfortunately:
> https://lore.kernel.org/lkml/202006112217.2E6CE093@keescook/

All? I think these are all bugs, no?

struct kobj_attribute is only to be used with a kobj_sysfs_ops type
kobject

To cross it over to a 'struct device' kobj is completely wrong, the
same basic wrongness being done here.
 
> I'm not convinced that just backing everything off to kobject isn't
> simpler?

It might be simpler, but isn't right - everything should continue to
work after a patch like this:

--- a/drivers/infiniband/core/sysfs.c
+++ b/drivers/infiniband/core/sysfs.c
@@ -67,6 +67,7 @@ struct ib_port {
 
 struct port_attribute {
struct attribute attr;
+   u64 pad[2];
ssize_t (*show)(struct ib_port *, struct port_attribute *, char *buf);
ssize_t (*store)(struct ib_port *, struct port_attribute *,
 const char *buf, size_t count);

If it doesn't it is still broken.

Using container_of() with the wrong types is an unconditional
error. A KASAN test to catch this would be very cool (think RTTI
and dynamic_cast<>() in C++)

> > And then two show/set functions that bounce through the correct types
> > to the data.
> 
> I'd like to make these things compile-time safe (there is not type
> associated with use the __ATTR() macro, for example). That I haven't
> really figured out how to do right.

They are in many places, for instance.

int device_create_file(struct device *dev,
   const struct device_attribute *attr)

We lose the type safety when working with attribute arrays, and
people can just bypass the "proper" APIs to raw sysfs ones whenever
they like.

It is fundamentally completely wrong to attach a 'struct
kobject_attribute' to a 'struct device' kobject.

Which is what is happening here and the link above.

Jason


Re: CFI violation in drivers/infiniband/core/sysfs.c

2021-04-02 Thread Jason Gunthorpe
On Fri, Apr 02, 2021 at 04:03:30PM -0700, Kees Cook wrote:

> > relevant. It seems to me that the hw_counters 'struct attribute_group'
> > should probably be its own kobj within both of these structures so they
> > can have their own sysfs ops (unless there is some other way to do this
> > that I am missing).

Err, yes, every subclass of the attribute should be accompanied by a
distinct kobject type to relay the show methods with type safety, this
is how this design pattern is intended to be used.

If I understand your report properly the hw_stats_attribute is being
assigned to a 'port_type' kobject and it only works by pure luck because
the show/store happens to overlap between port and hsa attributes?

> > I would appreciate someone else taking a look and seeing if I am off
> > base or if there is an easier way to solve this.
> 
> So, it seems that the reason for a custom kobj_type here is to use the
> .release callback. 

Every kobject should be associated with a specific, fixed, attribute
type. The purpose of the wrappers is to inject type safety so the
attribute implementations know they are working on the right stuff.

In turn, an attribute of a specific attribute type can only be
injected into a kobject that has the matching type.

This code is insane because it does this:

hsag->attrs[i] = alloc_hsa(i, port_num, stats->names[i]);

// This is port kobject
struct kobject *kobj = &port->kobj;
ret = sysfs_create_group(kobj, hsag); 
[..]
// This is a struct device kobject
struct kobject *kobj = &device->dev.kobj;
ret = sysfs_create_group(kobj, hsag); 

Which is absolutely not allowed, you can't have a generic attribute
handler and stuff it into two different types! Things MUST use the
proper attribute subclass for what they are being attached to.

The answer is that the setup_hw_stats() for port and device must
be split up and the attribute implementations for each of them have to
subclass starting from the correct type.

So we'd end up with two attributes for hw_stats

struct port_hw_stats {
struct port_attr attr;
struct hw_stats_data data;
};


struct device_hw_stats {
struct device_attr attr;
struct hw_stats_data data;
};

And then two show/set functions that bounce through the correct types
to the data.

And so on.
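
As a sketch, the port flavour's show relay could look like this
(struct names from the sketch above; read_stat() is a stand-in for
whatever extracts the counter from hw_stats_data):

static ssize_t port_hw_stat_show(struct ib_port *port,
				 struct port_attr *attr, char *buf)
{
	struct port_hw_stats *stat =
		container_of(attr, struct port_hw_stats, attr);

	/* container_of() is type safe here: a port_attr handed to this
	 * function can only ever be embedded in a port_hw_stats */
	return sysfs_emit(buf, "%llu\n", read_stat(&stat->data));
}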

Jason


Re: [PATCH -next] RDMA/uverbs: Fix -Wunused-function warning

2021-04-01 Thread Jason Gunthorpe
On Thu, Apr 01, 2021 at 10:10:28AM +0800, YueHaibing wrote:
> make W=1 warns this:
> 
> In file included from drivers/infiniband/sw/rdmavt/mmap.c:51:0:
> ./include/rdma/uverbs_ioctl.h:937:1:
>  warning: ‘_uverbs_get_const_unsigned’ defined but not used 
> [-Wunused-function]
>  _uverbs_get_const_unsigned(u64 *to,
>  ^~
> ./include/rdma/uverbs_ioctl.h:930:1:
>  warning: ‘_uverbs_get_const_signed’ defined but not used [-Wunused-function]
>  _uverbs_get_const_signed(s64 *to, const struct uverbs_attr_bundle 
> *attrs_bundle,
>  ^~~~
> 
> Make these functions inline to fix this warnings.
> 
> Signed-off-by: YueHaibing 
> ---
>  include/rdma/uverbs_ioctl.h | 7 ---
>  1 file changed, 4 insertions(+), 3 deletions(-)

Applied to for-next

I added
Fixes: 2904bb37b35d ("IB/core: Split uverbs_get_const/default to consider target type")
 
Jason


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-01 Thread Jason Gunthorpe
On Thu, Apr 01, 2021 at 07:04:01AM +, Liu, Yi L wrote:

> After reading your reply in 
> https://lore.kernel.org/linux-iommu/20210331123801.gd1463...@nvidia.com/#t
> So you mean /dev/ioasid FD is per-VM instead of per-ioasid, so above skeleton
> doesn't suit your idea.

You can do it one PASID per FD or multiple PASID's per FD. Most likely
we will have high numbers of PASID's in a qemu process so I assume
that number of FDs will start to be a constraining factor, thus
multiplexing is reasonable.

It doesn't really change anything about the basic flow.

Without digging deeply into it, either seems like a reasonable choice.

> +-----------------------------+-----------------------------------------------+
> |  userspace                  |   kernel space                                |
> +-----------------------------+-----------------------------------------------+
> | ioasid_fd =                 | /dev/ioasid does below:                       |
> | open("/dev/ioasid", O_RDWR);|   struct ioasid_fd_ctx {                      |
> |                             |       struct list_head ioasid_list;           |
> |                             |       ...                                     |
> |                             |   } ifd_ctx; // ifd_ctx is per ioasid_fd      |

Sure, possibly an xarray not a list

> +-----------------------------+-----------------------------------------------+
> | ioctl(ioasid_fd,            | /dev/ioasid does below:                       |
> |       ALLOC, &ioasid);      |   struct ioasid_data {                        |
> |                             |       ioasid_t ioasid;                        |
> |                             |       struct list_head device_list;           |
> |                             |       struct list_head next;                  |
> |                             |       ...                                     |
> |                             |   } id_data; // id_data is per ioasid         |
> |                             |                                               |
> |                             |   list_add(&id_data.next,                     |
> |                             |            &ifd_ctx.ioasid_list);             |

Yes, this should have a kref in it too
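
(Sketch of that shape, with the list replaced by a per-FD xarray and
a kref added - all names hypothetical:)

struct ioasid_data {
	struct kref kref;	/* held by the FD and by every device using it */
	ioasid_t ioasid;
	struct list_head device_list;
};

struct ioasid_fd_ctx {
	struct xarray ioasids;	/* ioasid -> struct ioasid_data */
};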

> +-----------------------------+-----------------------------------------------+
> | ioctl(device_fd,            | VFIO does below:                              |
> |   DEVICE_ALLOW_IOASID,      | 1) get ioasid_fd, check if ioasid_fd is valid |
> |   ioasid_fd,                | 2) check if ioasid is allocated from ioasid_fd|
> |   ioasid);                  | 3) register device/domain info to /dev/ioasid |
> |                             |    tracked in id_data.device_list             |
> |                             | 4) record the ioasid in VFIO's per-device     |
> |                             |    ioasid list for future security check      |

You would provide a function that does steps 1 & 2 - look at how
eventfd does this, for instance.

I'm not sure we need to register the device with the ioasid. The
device should incr the kref on the ioasid_data at this point.

> +-----------------------------+-----------------------------------------------+
> | ioctl(ioasid_fd,            | /dev/ioasid does below:                       |
> |   BIND_PGTBL,               | 1) find ioasid's id_data                      |
> |   pgtbl_data,               | 2) loop the id_data.device_list and tell iommu|
> |   ioasid);                  |    give ioasid access to the devices          |

This seems backwards, DEVICE_ALLOW_IOASID should tell the iommu to
give the ioasid to the device.

Here the ioctl should be about assigning a memory map from the
current mm_struct to the pasid

> +-----------------------------+-----------------------------------------------+
> | ioctl(ioasid_fd,            | /dev/ioasid does below:                       |
> |   UNBIND_PGTBL,             | 1) find ioasid's id_data                      |
> |   ioasid);                  | 2) loop the id_data.device_list and tell iommu|
> |                             |    clear ioasid access to the devices         |

Also seems backwards. The ioctl here should be 'destroy ioasid' which
wipes out the page table, halts DMA access and parks the PASID until
all users are done.

> +-----------------------------+-----------------------------------------------+
> | ioctl(device_fd,            | VFIO does below:                              |
> |  DEVICE_DISALLOW_IOASID,    | 1) check if ioasid is associated in VFIO's    |
> |   ioasid_fd,                |    device ioasid list.                        |
> |   ioasid);                  | 2) unregister device/domain info from         |
> |                             |    /dev/ioasid, clear in id_data.device_list  |

This should disconnect the iommu and kref_put the ioasid_data

Remember the layering, only the device_fd knows what the pci_device is.

Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-01 Thread Jason Gunthorpe
On Thu, Apr 01, 2021 at 01:43:36PM +, Liu, Yi L wrote:
> > From: Jason Gunthorpe 
> > Sent: Thursday, April 1, 2021 9:16 PM
> > 
> > On Thu, Apr 01, 2021 at 01:10:48PM +, Liu, Yi L wrote:
> > > > From: Jason Gunthorpe 
> > > > Sent: Thursday, April 1, 2021 7:47 PM
> > > [...]
> > > > I'm worried Intel views the only use of PASID in a guest is with
> > > > ENQCMD, but that is not consistent with the industry. We need to see
> > > > normal nested PASID support with assigned PCI VFs.
> > >
> > > I'm not quite following here. Intel also allows PASID usage in guest without
> > > ENQCMD. e.g. Passthru a PF to guest, and use PASID on it without
> > ENQCMD.
> > 
> > Then you need all the parts, the hypervisor calls from the vIOMMU, and
> > you can't really use a vPASID.
> 
> This is a diagram shows the vSVA setup.

I'm not talking only about vSVA. Generic PASID support with arbitrary
mappings.

And how do you deal with the vPASID vs pPASID issue if the system has
a mix of physical devices and mdevs?

Jason


Re: [PATCH -next] IB/srpt: Fix passing zero to 'PTR_ERR'

2021-04-01 Thread Jason Gunthorpe
On Wed, Mar 24, 2021 at 10:09:39PM +0800, YueHaibing wrote:
> Fix smatch warning:
> 
> drivers/infiniband/ulp/srpt/ib_srpt.c:2341 srpt_cm_req_recv() warn: passing 
> zero to 'PTR_ERR'
> 
> Use PTR_ERR_OR_ZERO instead of PTR_ERR
> 
> Fixes: 847462de3a0a ("IB/srpt: Fix srpt_cm_req_recv() error path (1/2)")
> Signed-off-by: YueHaibing 
>  drivers/infiniband/ulp/srpt/ib_srpt.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/infiniband/ulp/srpt/ib_srpt.c 
> b/drivers/infiniband/ulp/srpt/ib_srpt.c
> index 6be60aa5ffe2..3ff24b5048ac 100644
> +++ b/drivers/infiniband/ulp/srpt/ib_srpt.c
> @@ -2338,7 +2338,7 @@ static int srpt_cm_req_recv(struct srpt_device *const 
> sdev,
>  
>   if (IS_ERR_OR_NULL(ch->sess)) {
>   WARN_ON_ONCE(ch->sess == NULL);
> - ret = PTR_ERR(ch->sess);
> + ret = PTR_ERR_OR_ZERO(ch->sess);

This whole flow is a mess, if someone fixes it properly then fine, but
I'm not convinced returning 0 here is correct either.

Jason


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-01 Thread Jason Gunthorpe
On Thu, Apr 01, 2021 at 01:10:48PM +, Liu, Yi L wrote:
> > From: Jason Gunthorpe 
> > Sent: Thursday, April 1, 2021 7:47 PM
> [...]
> > I'm worried Intel views the only use of PASID in a guest is with
> > ENQCMD, but that is not consistent with the industry. We need to see
> > normal nested PASID support with assigned PCI VFs.
> 
> I'm not quite following here. Intel also allows PASID usage in guest without
> ENQCMD. e.g. Passthru a PF to guest, and use PASID on it without ENQCMD.

Then you need all the parts, the hypervisor calls from the vIOMMU, and
you can't really use a vPASID.

I'm not sure how Intel intends to resolve all of this.

> > > - this per-ioasid SVA operations is not aligned with the native SVA usage
> > >   model. Native SVA bind is per-device.
> > 
> > Seems like that is an error in native SVA.
> > 
> > SVA is a particular mode of the PASID's memory mapping table, it has
> > nothing to do with a device.
> 
> I think it still has relationship with device. This is determined by the
> DMA remapping hierarchy in hardware. e.g. Intel VT-d, the DMA isolation is
> enforced first in device granularity and then PASID granularity. SVA makes
> usage of both PASID and device granularity isolation.

When the device driver authorizes a PASID the VT-d stuff should set up
the isolation parameters for the given pci_device and PASID.

Do not leak implementation details like this as uAPI. Authorization
and memory map are distinct ideas with distinct interfaces. Do not mix
them.

Jason


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-01 Thread Jason Gunthorpe
On Thu, Apr 01, 2021 at 01:38:46PM +, Liu, Yi L wrote:
> > From: Jean-Philippe Brucker 
> > Sent: Thursday, April 1, 2021 8:05 PM
> [...]
> > 
> > Also wondering about:
> > 
> > * Querying IOMMU nesting capabilities before binding page tables (which
> >   page table formats are supported?). We were planning to have a VFIO cap,
> >   but I'm guessing we need to go back to the sysfs solution?
> 
> I think it can also be with /dev/ioasid.

Sure, anything to do with page table formats and setting page tables
should go through ioasid.

> > * Invalidation, probably an ioasid_fd ioctl?
> 
> yeah, if we are doing bind/unbind_pagetable via ioasid_fd, then yes,
> invalidation should go this way as well. This is why I worried it may
> fail to meet the requirement from you and Eric.

Yes, all manipulation of page tables, including removing memory ranges, or
setting memory ranges to trigger a page fault behavior should go
through here.

> > * Page faults, page response. From and to devices, and don't necessarily
> >   have a PASID. But needed by vdpa as well, so that's also going through
> >   /dev/ioasid?
> 
> page faults should still be per-device, but the fault event fd may be stored
> in /dev/ioasid. page response would be in /dev/ioasid just like invalidation.

Here you mean non-SVA page faults that are delegated to userspace to handle?

Why would that be per-device?

Can you show the flow you imagine?

Jason


Re: [bug report] node: Add memory-side caching attributes

2021-04-01 Thread Jason Gunthorpe
On Thu, Apr 01, 2021 at 05:06:52PM +0300, Dan Carpenter wrote:

> > diff --git a/drivers/base/node.c b/drivers/base/node.c
> > index f449dbb2c74666..89c28952863977 100644
> > +++ b/drivers/base/node.c
> > @@ -319,25 +319,24 @@ void node_add_cache(unsigned int nid, struct 
> > node_cache_attrs *cache_attrs)
> > return;
> >  
> > dev = >dev;
> > +   device_initialize(dev);
> > dev->parent = node->cache_dev;
> > dev->release = node_cacheinfo_release;
> > dev->groups = cache_groups;
> > if (dev_set_name(dev, "index%d", cache_attrs->level))
> 
> Is calling dev_set_name() without doing a device_initialize() a bug?  I
> could write a check for that.

IMHO, yes.

However, Greg may not agree, as dev_set_name() with no error check
followed by device_register() is a very common pattern. If the user
omits the device_initialize() then dev_set_name() must come
immediately before device_register() with no error unwind between
them. It must not error unwind dev_set_name() to kfree. (This is
really tricky)

Greg and I have argued on the merits of device_initialize() several
times before.

I argue the error control flow is simpler and easier to get right, he
argues the extra statement is unneeded complexity and people will get
the error unwind wrong.

Every time you find a bug like this is someone getting the complexity
around error handling and device_register() wrong, so my advice is to
stop using device_register (aka device_init_and_add) and make it
explicit so the goto unwind has logical pairing. put_device() pairs
with device_initialize().

The tricky bit is establishing the release function, as complex
release functions often have complex init sequences and you need the
setup done enough to go to release. When things become this complex I
advocate splitting alloc into a function:

/* Caller must use put_device(&foo->dev) */
struct foo *foo_allocate()
{
   foo = kzalloc(sizeof(*foo), GFP_KERNEL);
   allocate release-freed thing 1;
   allocate release-freed thing 2;
   allocate release-freed thing 3;
   device_initialize(&foo->dev);
   return foo;

err_thing3:
   free thing 3;
err_thing2:
   free thing 2;
err_thing1:
   free thing 1;
err_foo:
   kfree(foo);
   return ERR_PTR(rc);
}

Thus there is a clear logical separation between the world of 'unwind
to kfree' and the world of 'unwind to put_device'. dev_set_name()
cannot be inside an alloc function.

Simple cases, like here, should just do device_initialize()
immediately after kzalloc() and never have a kfree() error unwind.
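
Something like this, as a sketch:

static int foo_create(int id)
{
	struct foo *foo = kzalloc(sizeof(*foo), GFP_KERNEL);
	int rc;

	if (!foo)
		return -ENOMEM;
	device_initialize(&foo->dev);
	foo->dev.release = foo_release;	/* just kfree(foo) */

	rc = dev_set_name(&foo->dev, "foo%d", id);
	if (rc)
		goto err;
	rc = device_add(&foo->dev);
	if (rc)
		goto err;
	return 0;

err:
	put_device(&foo->dev);	/* release() does the kfree, never kfree here */
	return rc;
}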

I saw Christoph had a similar opinion on 'init and add' patterns being
bad, maybe he has additional colour to share.

Jason


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-01 Thread Jason Gunthorpe
On Thu, Apr 01, 2021 at 02:05:00PM +0200, Jean-Philippe Brucker wrote:
> On Thu, Apr 01, 2021 at 07:04:01AM +, Liu, Yi L wrote:
> > > - how about AMD and ARM's vSVA support? Their PASID allocation and page
> > > table
> > >   happens within guest. They only need to bind the guest PASID table to
> > > host.
> 
> In this case each VM has its own IOASID space, and the host IOASID
> allocator doesn't participate. Plus this only makes sense when assigning a
> whole VF to a guest, and VFIO is the tool for this. So I wouldn't shoehorn
> those ops into /dev/ioasid, though we do need a transport for invalidate
> commands.

How does security work? Devices still have to be authorized to use the
PASID and this approach seems like it completely excludes mdev/vdpa
from ever using a PASID, and those are the most logical users.

> > >   Above model seems unable to fit them. (Jean, Eric, Jacob please feel 
> > > free
> > >   to correct me)
> > > - this per-ioasid SVA operations is not aligned with the native SVA usage
> > >   model. Native SVA bind is per-device.
> 
> Bare-metal SVA doesn't need /dev/ioasid either. 

It depends what you are doing. /dev/ioasid would provide fine grained
control over the memory mapping. It is not strict SVA, but I can see
applications where using a GPU with a pre-configured optimized mapping
could be interesting.

> A program uses a device handle to either ask whether SVA is enabled,
> or to enable it explicitly.  With or without /dev/ioasid, that step
> is required. OpenCL uses the first method - automatically enable
> "fine-grain system SVM" if available, and provide a flag to
> userspace.

SVA can be done with ioasid, we can decide if it makes sense to have
shortcuts in every driver

> So userspace does not need to know about PASID. It's only one method for
> doing SVA (some GPUs are context-switching page tables instead).

Sure, there are lots of approaches. Here we are only talking about
PASID enablement. PASID has more options.
 
> * Page faults, page response. From and to devices, and don't necessarily
>   have a PASID. But needed by vdpa as well, so that's also going through
>   /dev/ioasid?

Only real PASID's should use this interface. All the not-PASID stuff
is on its own.

VDPA should accept a PASID from here and configure the real
HW to attach the PASID to all DMA's connected to the virtio queues.

Jason


Re: [v2 PATCH] mm: gup: remove FOLL_SPLIT

2021-04-01 Thread Jason Gunthorpe
On Tue, Mar 30, 2021 at 01:38:59PM -0700, Yang Shi wrote:
> Since commit 5a52c9df62b4 ("uprobe: use FOLL_SPLIT_PMD instead of FOLL_SPLIT")
> and commit ba925fa35057 ("s390/gmap: improve THP splitting") FOLL_SPLIT
> has not been used anymore.  Remove the dead code.
> 
> Reviewed-by: John Hubbard 
> Signed-off-by: Yang Shi 
> ---
> v2: Remove the reference in documentation.
> 
>  Documentation/vm/transhuge.rst |  5 -
>  include/linux/mm.h |  1 -
>  mm/gup.c   | 28 ++--
>  3 files changed, 2 insertions(+), 32 deletions(-)

Reviewed-by: Jason Gunthorpe 

Jason


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-01 Thread Jason Gunthorpe
On Thu, Apr 01, 2021 at 10:23:55AM -0700, Jacob Pan wrote:
> Hi Jason,
> 
> On Wed, 31 Mar 2021 21:37:05 -0300, Jason Gunthorpe  wrote:
> 
> > On Wed, Mar 31, 2021 at 04:46:21PM -0700, Jacob Pan wrote:
> > > Hi Jason,
> > > 
> > > On Wed, 31 Mar 2021 09:38:01 -0300, Jason Gunthorpe 
> > > wrote: 
> > > > > > Get rid of the ioasid set.
> > > > > >
> > > > > > Each driver has its own list of allowed ioasids.
> > > >  [...]  
> > > > 
> > > > The /dev/ioasid FD replaces this security check. By becoming FD
> > > > centric you don't need additional kernel security objects.
> > > > 
> > > > Any process with access to the /dev/ioasid FD is allowed to control
> > > > those PASID. The separation between VMs falls naturally from the
> > > > separation of FDs without creating additional, complicated, security
> > > > infrastructure in the kernel.
> > > > 
> > > > This is why all APIs must be FD focused, and you need to have a
> > > > logical layering of responsibility.
> > > > 
> > > >  Allocate a /dev/ioasid FD
> > > >  Allocate PASIDs inside the FD
> Just to be super clear. Do we allocate a FD for each PASID and return the
> FD to the user? Or return the plain PASID number back to the user space?

I would do multiple PASID's per /dev/ioasid FD because we expect a lot
of PASIDs to be in use and we'd run into FD number limits.

> > > >  Assign memory to the PASIDS
> > > > 
> > > >  Open a device FD, eg from VFIO or VDP
> > > >  Instruct the device FD to authorize the device to access PASID A in
> > > >  an ioasid FD  
> > > How do we know user provided PASID A was allocated by the ioasid FD?  
> > 
> > You pass in the ioasid FD and use a 'get pasid from fdno' API to
> > extract the required kernel structure.
> > 
> Seems you are talking about two FDs:
> - /dev/ioasid FD

No, just this one.

> - per IOASID FD
> This API ioasid = get_pasid_from_fd(dev_ioasid_fd, ioasid_fd);
> dev_ioasid_fd will find the xarray for all the PASIDs allocated under it,
> ioasid_fd wil be the index into the xarray to retrieve the actual ioasid.
> Correct?

'ioasid_fd' is just the ioasid number in whatever numberspace the
/dev/ioasid FD's use.

> > Why only one?  Each interaction with the other FDs should include the
> > PASID/FD pair. There is no restriction to just one.

> OK, one per subsystem-VM. For example, if a VM has a VFIO and a VDPA
> device, it should have only two /dev/ioasid FDs respectively. Correct?

No, only one.

For something like qemu's use case I mostly expect the vIOMMU driver
will open /dev/ioasid for each vIOMMU instance it creates (basically
only one)

> > The act of programming the page tables and the act of authorizing a
> > PCI BDF to use a PASID are distinct things with two different IOCTLs.
> > 
> Why separate? 

Because they have different owners and different layers in the
software.

It is not about use case, it is about putting the control points where
they naturally belong.

> For a complex stack like vSVA, I feel we have to reduce moving parts and do
> some divide and conquer.

The uAPI should have all come together with a user and a user
application.

The uAPI is the hardest and most important part.

Jason


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-01 Thread Jason Gunthorpe
On Thu, Apr 01, 2021 at 02:08:17PM +, Liu, Yi L wrote:

> DMA page faults are delivered to root-complex via page request message and
> it is per-device according to PCIe spec. Page request handling flow is:
> 
> 1) iommu driver receives a page request from device
> 2) iommu driver parses the page request message. Get the RID,PASID, faulted
>page and requested permissions etc.
> 3) iommu driver triggers fault handler registered by device driver with
>iommu_report_device_fault()

This seems confused.

The PASID should define how to handle the page fault, not the driver.

I don't remember any device specific actions in ATS, so what is the
driver supposed to do?

> 4) device driver's fault handler signals an event FD to notify userspace to
>fetch the information about the page fault. If it's VM case, inject the
>page fault to VM and let guest to solve it.

If the PASID is set to 'report page fault to userspace' then some
event should come out of /dev/ioasid, or be reported to a linked
eventfd, or whatever.

If the PASID is set to 'SVM' then the fault should be passed to
handle_mm_fault

And so on.

Userspace chooses what happens based on how they configure the PASID
through /dev/ioasid.
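
i.e., a per-PASID dispatch, roughly (all names made up):

static void ioasid_handle_fault(struct ioasid_data *pasid,
				struct vm_area_struct *vma,
				struct iommu_fault *fault)
{
	switch (pasid->fault_mode) {
	case IOASID_FAULTS_TO_USER:
		/* emit an event on /dev/ioasid or the linked eventfd */
		ioasid_queue_fault(pasid, fault);
		break;
	case IOASID_FAULTS_SVA:
		/* the kernel resolves it against the bound mm */
		handle_mm_fault(vma, fault->prm.addr, FAULT_FLAG_REMOTE, NULL);
		break;
	}
}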

Why would a device driver get involved here?

> Eric has sent below series for the page fault reporting for VM with passthru
> device.
> https://lore.kernel.org/kvm/20210223210625.604517-5-eric.au...@redhat.com/

It certainly should not be in vfio pci. Everything using a PASID needs
this infrastructure, VDPA, mdev, PCI, CXL, etc.

Jason


Re: [PATCH for-next v3 0/2] Introduce rdma_set_min_rnr_timer() and use it in RDS

2021-04-01 Thread Jason Gunthorpe
On Wed, Mar 31, 2021 at 07:54:17PM +, Santosh Shilimkar wrote:
> [...]
> 
> Thanks Haakon. Patchset looks fine by me.
> Acked-by: Santosh Shilimkar 

Jakub/Dave, are you OK if I take this RDS patch to rdma's tree?

Thanks,
Jason


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-01 Thread Jason Gunthorpe
On Thu, Apr 01, 2021 at 04:38:44AM +, Liu, Yi L wrote:
> > From: Jason Gunthorpe 
> > Sent: Wednesday, March 31, 2021 8:41 PM
> > 
> > On Wed, Mar 31, 2021 at 07:38:36AM +, Liu, Yi L wrote:
> > 
> > > The reason is /dev/ioasid FD is per-VM since the ioasid allocated to
> > > the VM should be able to be shared by all assigned device for the VM.
> > > But the SVA operations (bind/unbind page table, cache_invalidate) should
> > > be per-device.
> > 
> > It is not *per-device* it is *per-ioasid*
> >
> > And as /dev/ioasid is an interface for controlling multiple ioasid's
> > there is no issue to also multiplex the page table manipulation for
> > multiple ioasids as well.
> > 
> > What you should do next is sketch out in some RFC the exact ioctls
> > each FD would have and show how the parts I outlined would work and
> > point out any remaining gaps.
> > 
> > The device FD is something like the vfio_device FD from VFIO, it has
> > *nothing* to do with PASID beyond having a single ioctl to authorize
> > the device to use the PASID. All control of the PASID is in
> > /dev/ioasid.
> 
> good to see this reply. Your idea is much clearer to me now. If I'm getting
> you correctly. I think the skeleton is something like below:
> 
> 1) userspace opens a /dev/ioasid, meanwhile there will be an ioasid
>allocated and a per-ioasid context which can be used to do bind page
>table and cache invalidate, an ioasid FD returned to userspace.
> 2) userspace passes the ioasid FD to VFIO, let it associated with a device
>FD (like vfio_device FD).
> 3) userspace binds page table on the ioasid FD with the page table info.
> 4) userspace unbinds the page table on the ioasid FD
> 5) userspace de-associates the ioasid FD and device FD
> 
> Does above suit your outline?

Seems so

> If yes, I still have below concern and wish to see your opinion.
> - the ioasid FD and device association will happen at runtime instead of
>   just happen in the setup phase.

Of course, this is required for security. The vIOMMU must perform the
device association when the guest requires it. Otherwise a guest
cannot isolate a PASID to a single process/device pair.

I'm worried Intel views the only use of PASID in a guest is with
ENQCMD, but that is not consistent with the industry. We need to see
normal nested PASID support with assigned PCI VFs.

> - how about AMD and ARM's vSVA support? Their PASID allocation and page table
>   happens within guest. They only need to bind the guest PASID table to host.
>   Above model seems unable to fit them. (Jean, Eric, Jacob please feel free
>   to correct me)

No, everything needs the device association step or it is not
secure. 

You can give a PASID to a guest and allow it to manipulate its memory
map directly, nested under the guest's CPU page tables.

However the guest cannot authorize a PCI BDF to utilize that PASID
without going through some kind of step in the hypervisor. A Guest
should not be able to authorize a PASID for a BDF it doesn't have
access to - only the hypervisor can enforce this.

This all must also fit into the mdev model where only the
device-specific mdev driver can do the device specific PASID
authorization. A hypercall is essential, or we need to stop pretending
mdev is a good idea.

I'm sure there will be some small differences, and you should clearly
explain the entire uAPI surface so that someone from AMD and ARM can
review it.

> - this per-ioasid SVA operations is not aligned with the native SVA usage
>   model. Native SVA bind is per-device.

Seems like that is an error in native SVA.

SVA is a particular mode of the PASID's memory mapping table, it has
nothing to do with a device.

Jason


Re: [PATCH v7 5/8] mm: Device exclusive memory access

2021-04-01 Thread Jason Gunthorpe
On Thu, Apr 01, 2021 at 01:20:05PM +1100, Alistair Popple wrote:
> On Thursday, 1 April 2021 11:48:13 AM AEDT Jason Gunthorpe wrote:
> > On Thu, Apr 01, 2021 at 11:45:57AM +1100, Alistair Popple wrote:
> > > On Thursday, 1 April 2021 12:46:04 AM AEDT Jason Gunthorpe wrote:
> > > > On Thu, Apr 01, 2021 at 12:27:52AM +1100, Alistair Popple wrote:
> > > > > On Thursday, 1 April 2021 12:18:54 AM AEDT Jason Gunthorpe wrote:
> > > > > > On Wed, Mar 31, 2021 at 11:59:28PM +1100, Alistair Popple wrote:
> > > > > > 
> > > > > > > I guess that makes sense as the split could go either way at the
> > > > > > > moment but I should add a check to make sure this isn't used with
> > > > > > > pinned pages anyway.
> > > > > > 
> > > > > > Is it possible to have a pinned page under one of these things? If I
> > > > > > pin it before you migrate it then it remains pinned but hidden under
> > > > > > the swap entry?
> > > > > 
> > > > > At the moment yes. But I had planned (and this reminded me) to add a
> > > > > check to prevent marking pinned pages for exclusive access.
> > > > 
> > > > How do you even do that without races with GUP fast?
> > > 
> > > Unless I've missed something I think I've convinced myself it should be
> > > safe to do the pin check after make_device_exclusive() has replaced all
> > > the PTEs with exclusive entries.
> > > 
> > > GUP fast sequence:
> > > 1. Read PTE
> > > 2. Pin page
> > > 3. Check PTE
> > > 4. if PTE changed -> unpin and fallback
> > > 
> > > If make_device_exclusive() runs after (1) it will either succeed or see
> > > the pin from (2) and fail (as desired). GUP should always see the PTE
> > > change and fallback which will revoke the exclusive access.
> > 
> > AFAICT the user can trigger fork at that instant and fork will try to
copy the deposited migration entry before it has been checked
> 
> In that case the child will get a read-only exclusive entry and eventually a 
> page copy via do_wp_page() 

Having do_wp_page() do a copy is a security bug. We closed it with the
at-fork checks.

Jason


Re: [PATCH 8/9] vfio/pci: export nvlink2 support into vendor vfio_pci drivers

2021-04-01 Thread Jason Gunthorpe
On Mon, Mar 29, 2021 at 05:10:53PM -0600, Alex Williamson wrote:
> On Tue, 23 Mar 2021 16:32:13 -0300
> Jason Gunthorpe  wrote:
> 
> > On Mon, Mar 22, 2021 at 10:40:16AM -0600, Alex Williamson wrote:
> > 
> > > Of course if you start looking at features like migration support,
> > > that's more than likely not simply an additional region with optional
> > > information, it would need to interact with the actual state of the
> > > device.  For those, I would very much support use of a specific
> > > id_table.  That's not these.  
> > 
> > What I don't understand is why do we need two different ways of
> > inserting vendor code?
> 
> Because a PCI id table only identifies the device, these drivers are
> looking for a device in the context of firmware dependencies.

The firmware dependencies only exist for a defined list of ID's, so I
don't entirely agree with this statement. I agree with below though,
so let's leave it be.

> > I understood he meant that NVIDIA GPUs *without* NVLINK can exist, but
> > the ID table we have here is supposed to be the NVLINK compatible
> > ID's.
> 
> Those IDs are just for the SXM2 variants of the device that can
> exist on a variety of platforms, only one of which includes the
> firmware tables to activate the vfio support.

AFAIK, SXM2 is a special physical form factor that has the nvlink
physical connection - it is only for this specific generation of power
servers that can accept the specific nvlink those cards have.

> I think you're looking for a significant inflection in vendor's stated
> support for vfio use cases, beyond the "best-effort, give it a try",
> that we currently have.

I see, so they don't want to. Let's leave it then.

Though if Xe breaks everything they need to add/maintain a proper ID
table, not more hackery.

> > And again, I feel this is all a big tangent, especially now that HCH
> > wants to delete the nvlink stuff we should just leave igd alone.
> 
> Determining which things stay in vfio-pci-core and which things are
> split to variant drivers and how those variant drivers can match the
> devices they intend to support seems very inline with this series.  

IMHO, the main litmus test for core is if variant drivers will need it
or not.

No variant driver should be stacked on an igd device, or if it someday
is, it should implement the special igd hackery internally (and have a
proper ID table). So when we split it up igd goes into vfio_pci.ko as
some special behavior vfio_pci.ko's universal driver provides for IGD.

Every variant driver will still need the zdev data to be exposed to
userspace, and every PCI device on s390 has that extra information. So
zdev goes to vfio_pci_core.ko

Future things going into vfio_pci.ko need a really good reason why
they can't be variant drivers instead.

Jason


Re: [bug report] node: Add memory-side caching attributes

2021-04-01 Thread Jason Gunthorpe
On Thu, Apr 01, 2021 at 12:00:39PM +0300, Dan Carpenter wrote:
> Hi Keith,
> 
> I've been trying to figure out ways Smatch can check for device managed
> resources.  Like adding rules that if we call dev_set_name(&foo->bar)
> then it's device managed and if there is a kfree(foo) without calling
> device_put(&foo->bar) then that's a resource leak.

It seems to be working from what I can see

Also I wasn't able to convince myself that any locking around
node->cache_attrs exists..

> Of course one of the rules is that if you call device_register(dev) then
> you can't kfree(dev), it has to released with device_put(dev) and that's
> true even if the register fails.  But this code here feels very
> intentional so maybe there is an exception to the rule?

There is no exception. Open coding this:

>282  free_name:
>283  kfree_const(dev->kobj.name);

To avoid leaking memory from dev_set_name is a straight up layering
violation, WTF?

node_cacheinfo_release() is just kfree(), so there is no need.
Instead (please feel free to send this Dan):

diff --git a/drivers/base/node.c b/drivers/base/node.c
index f449dbb2c74666..89c28952863977 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -319,25 +319,24 @@ void node_add_cache(unsigned int nid, struct 
node_cache_attrs *cache_attrs)
return;
 
dev = >dev;
+   device_initialize(dev);
dev->parent = node->cache_dev;
dev->release = node_cacheinfo_release;
dev->groups = cache_groups;
if (dev_set_name(dev, "index%d", cache_attrs->level))
-   goto free_cache;
+   goto put_device;
 
info->cache_attrs = *cache_attrs;
-   if (device_register(dev)) {
+   if (device_add(dev)) {
dev_warn(>dev, "failed to add cache level:%d\n",
 cache_attrs->level);
-   goto free_name;
+   goto put_device;
}
pm_runtime_no_callbacks(dev);
list_add_tail(>node, >cache_attrs);
return;
-free_name:
-   kfree_const(dev->kobj.name);
-free_cache:
-   kfree(info);
+put_device:
+   put_device(dev);
 }
 
 static void node_remove_caches(struct node *node)


Re: [PATCH v6 00/27] Memory Folios

2021-04-01 Thread Jason Gunthorpe
On Thu, Apr 01, 2021 at 01:52:01PM +0100, Matthew Wilcox wrote:
> On Thu, Apr 01, 2021 at 09:28:03AM -0300, Jason Gunthorpe wrote:
> > On Thu, Apr 01, 2021 at 12:26:56PM +0100, Matthew Wilcox wrote:
> > > On Thu, Apr 01, 2021 at 08:05:37AM +0100, Christoph Hellwig wrote:
> > > > On Wed, Mar 31, 2021 at 07:47:01PM +0100, Matthew Wilcox (Oracle) wrote:
> > > > >  - Mirror members of struct page (for pagecache / anon) into struct 
> > > > > folio,
> > > > >so (eg) you can use folio->mapping instead of folio->page.mapping
> > > > 
> > > > Eww, why?
> > > 
> > > So that eventually we can rename page->mapping to page->_mapping and
> > > prevent the bugs from people doing page->mapping on a tail page.  eg
> > > https://lore.kernel.org/linux-mm/alpine.LSU.2.11.2103102214170.7159@eggly.anvils/
> > 
> > Is that gcc structure layout randomization stuff going to be a problem
> > here?
> > 
> > Add some 
> >   static_assert(offsetof(struct folio,..) == offsetof(struct page,..))
> > 
> > tests to force it?
> 
> You sound like the kind of person who hasn't read patch 1.

Yes, I missed this hunk :)

Jason


Re: [syzbot] WARNING in unsafe_follow_pfn

2021-04-01 Thread Jason Gunthorpe
On Wed, Mar 31, 2021 at 07:29:22AM +0300, Dan Carpenter wrote:
> On Tue, Mar 30, 2021 at 07:04:30PM +0200, Paolo Bonzini wrote:
> > On 30/03/21 17:26, syzbot wrote:
> > > Hello,
> > > 
> > > syzbot found the following issue on:
> > > 
> > > HEAD commit:93129492 Add linux-next specific files for 20210326
> > > git tree:   linux-next
> > > console output: https://syzkaller.appspot.com/x/log.txt?x=169ab21ad0
> > > kernel config:  https://syzkaller.appspot.com/x/.config?x=6f2f73285ea94c45
> > > dashboard link: 
> > > https://syzkaller.appspot.com/bug?extid=015dd7cdbbbc2c180c65
> > > syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=119b8d06d0
> > > C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=112e978ad0
> > > 
> > > The issue was bisected to:
> > > 
> > > commit d40b9fdee6dc819d8fc35f70c345cbe0394cde4c
> > > Author: Daniel Vetter 
> > > Date:   Tue Mar 16 15:33:01 2021 +
> > > 
> > >  mm: Add unsafe_follow_pfn
> > > 
> > > bisection log:  
> > > https://syzkaller.appspot.com/x/bisect.txt?x=122d2016d0
> > > final oops: 
> > > https://syzkaller.appspot.com/x/report.txt?x=112d2016d0
> > > console output: https://syzkaller.appspot.com/x/log.txt?x=162d2016d0
> > > 
> > > IMPORTANT: if you fix the issue, please add the following tag to the 
> > > commit:
> > > Reported-by: syzbot+015dd7cdbbbc2c180...@syzkaller.appspotmail.com
> > > Fixes: d40b9fdee6dc ("mm: Add unsafe_follow_pfn")
> > 
> > This is basically intentional because get_vaddr_frames is broken, isn't it?
> > I think it needs to be ignored in syzkaller.
> 
> What?
> 
> The bisect is wrong (because it's blaming the commit which added the
> warning instead of the commit which added the buggy caller) but the
> warning is correct.
> 
> Plus users are going to be seeing this as well.  According to the commit
> message for 69bacee7f9ad ("mm: Add unsafe_follow_pfn") "Unfortunately
> there's some users where this is not fixable (like v4l userptr of iomem
> mappings)".  It sort of seems crazy to dump this giant splat and then
> tell users to ignore it forever because it can't be fixed...  0_0

I think the discussion conclusion was that this interface should not
be used by userspace anymore, it is obsoleted by some new interface?

It should be protected by some kconfig and the kconfig should be
turned off for syzkaller runs.

Jason


Re: [PATCH v6 00/27] Memory Folios

2021-04-01 Thread Jason Gunthorpe
On Thu, Apr 01, 2021 at 12:26:56PM +0100, Matthew Wilcox wrote:
> On Thu, Apr 01, 2021 at 08:05:37AM +0100, Christoph Hellwig wrote:
> > On Wed, Mar 31, 2021 at 07:47:01PM +0100, Matthew Wilcox (Oracle) wrote:
> > >  - Mirror members of struct page (for pagecache / anon) into struct folio,
> > >so (eg) you can use folio->mapping instead of folio->page.mapping
> > 
> > Eww, why?
> 
> So that eventually we can rename page->mapping to page->_mapping and
> prevent the bugs from people doing page->mapping on a tail page.  eg
> https://lore.kernel.org/linux-mm/alpine.LSU.2.11.2103102214170.7159@eggly.anvils/

Is that gcc structure layout randomization stuff going to be a problem
here?

Add some 
  static_assert(offsetof(struct folio,..) == offsetof(struct page,..))

tests to force it?
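
e.g., for one of the mirrored members, something like:

static_assert(offsetof(struct folio, mapping) ==
	      offsetof(struct page, mapping));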

Jason


Re: [PATCH v7 5/8] mm: Device exclusive memory access

2021-03-31 Thread Jason Gunthorpe
On Thu, Apr 01, 2021 at 11:45:57AM +1100, Alistair Popple wrote:
> On Thursday, 1 April 2021 12:46:04 AM AEDT Jason Gunthorpe wrote:
> > On Thu, Apr 01, 2021 at 12:27:52AM +1100, Alistair Popple wrote:
> > > On Thursday, 1 April 2021 12:18:54 AM AEDT Jason Gunthorpe wrote:
> > > > On Wed, Mar 31, 2021 at 11:59:28PM +1100, Alistair Popple wrote:
> > > > 
> > > > > I guess that makes sense as the split could go either way at the
> > > > > moment but I should add a check to make sure this isn't used with
> > > > > pinned pages anyway.
> > > > 
> > > > Is it possible to have a pinned page under one of these things? If I
> > > > pin it before you migrate it then it remains pinned but hidden under
> > > > the swap entry?
> > > 
> > > At the moment yes. But I had planned (and this reminded me) to add a
> > > check to prevent marking pinned pages for exclusive access.
> > 
> > How do you even do that without races with GUP fast?
> 
> Unless I've missed something I think I've convinced myself it should be safe 
> to do the pin check after make_device_exclusive() has replaced all the PTEs 
> with exclusive entries.
> 
> GUP fast sequence:
> 1. Read PTE
> 2. Pin page
> 3. Check PTE
> 4. if PTE changed -> unpin and fallback
> 
> If make_device_exclusive() runs after (1) it will either succeed or see the 
> pin from (2) and fail (as desired). GUP should always see the PTE change and 
> fallback which will revoke the exclusive access.

AFAICT the user can trigger fork at that instant and fork will try to
copy the deposited migration entry before it has been checked

Jason


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-03-31 Thread Jason Gunthorpe
On Wed, Mar 31, 2021 at 04:46:21PM -0700, Jacob Pan wrote:
> Hi Jason,
> 
> On Wed, 31 Mar 2021 09:38:01 -0300, Jason Gunthorpe  wrote:
> 
> > > > Get rid of the ioasid set.
> > > >
> > > > Each driver has its own list of allowed ioasids.  
> >  [...]  
> > 
> > The /dev/ioasid FD replaces this security check. By becoming FD
> > centric you don't need additional kernel security objects.
> > 
> > Any process with access to the /dev/ioasid FD is allowed to control
> > those PASID. The separation between VMs falls naturally from the
> > separation of FDs without creating additional, complicated, security
> > infrastructure in the kernel.
> > 
> > This is why all APIs must be FD focused, and you need to have a
> > logical layering of responsibility.
> > 
> >  Allocate a /dev/ioasid FD
> >  Allocate PASIDs inside the FD
> >  Assign memory to the PASIDS
> > 
> >  Open a device FD, eg from VFIO or VDP
> >  Instruct the device FD to authorize the device to access PASID A in
> >  an ioasid FD
> How do we know user provided PASID A was allocated by the ioasid FD?

You pass in the ioasid FD and use a 'get pasid from fdno' API to
extract the required kernel structure.

> Shouldn't we validate user input by tracking which PASIDs are
> allocated by which ioasid FD?

Yes, but it is integral to the ioasid FD, not something separated.

> > VFIO extracts some kernel representation of the ioasid from the ioasid
> > fd using an API
> > 
> This lookup API seems to be asking for per ioasid FD storage array. Today,
> the ioasid_set is per mm and contains a Xarray. 

Right, put the xarray per FD. A set per mm is fairly nonsensical, we
don't use the mm as that kind of security key.

> Since each VM, KVM can only open one ioasid FD, this per FD array
> would be equivalent to the per mm ioasid_set, right?

Why only one?  Each interaction with the other FDs should include the
PASID/FD pair. There is no restriction to just one.

> > VFIO does some kernel call to IOMMU/IOASID layer that says 'tell the
> > IOMMU that this PCI device is allowed to use this PASID'
>
> Would it be redundant to what iommu_uapi_sva_bind_gpasid() does? I thought
> the idea is to use ioasid FD IOCTL to issue IOMMU uAPI calls. Or we can
> skip this step for now and wait for the user to do SVA bind.

I'm not sure what you are asking.

Possibly some of the IOMMU API will need a bit of adjusting to split
things up this way.

The act of programming the page tables and the act of authorizing a
PCI BDF to use a PASID are distinct things with two different IOCTLs.

iommu_uapi_sva_bind_gpasid() is never called by anything, and its
uAPI is never implemented.

Joerg? Why did you merge dead uapi and dead code?

Jason


Re: [PATCH -next] RDMA/iw_cxgb4: Use DEFINE_SPINLOCK() for spinlock

2021-03-31 Thread Jason Gunthorpe
On Wed, Mar 31, 2021 at 10:01:05AM +0800, Tang Yizhou wrote:
> spinlock can be initialized automatically with DEFINE_SPINLOCK()
> rather than explicitly calling spin_lock_init().
> 
> Reported-by: Hulk Robot 
> Signed-off-by: Tang Yizhou 
> ---
>  drivers/infiniband/hw/cxgb4/cm.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)

Applied to for-next, thanks

Jason


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-03-31 Thread Jason Gunthorpe
On Wed, Mar 31, 2021 at 11:20:30AM -0700, Jacob Pan wrote:
> Hi Jason,
> 
> On Wed, 31 Mar 2021 14:31:48 -0300, Jason Gunthorpe  wrote:
> 
> > > > We should try to avoid hidden behind the scenes kernel
> > > > interconnections between subsystems.
> > > >   
> > > Can we? in case of exception. Since all these IOCTLs are coming from the
> > > unreliable user space, we must deal with all exceptions.
> > >
> > > For example, when user closes /dev/ioasid FD before (or w/o) unbind
> > > IOCTL for VFIO, KVM, kernel must do cleanup and coordinate among
> > > subsystems. In this patchset, we have a per mm(ioasid_set) notifier to
> > > inform mdev, KVM to clean up and drop its refcount. Do you have any
> > > suggestion on this?  
> > 
> > The ioasid should be a reference counted object.
> > 
> yes, this is done in this patchset.
> 
> > When the FD is closed, or the ioasid is "destroyed" it just blocks DMA
> > and parks the PASID until *all* places release it. Upon a zero
> > refcount the PASID is recycled for future use.
> > 
> Just to clarify, you are saying (when FREE happens before proper
> teardown) there is no need to proactively notify all users of the IOASID to
> drop their reference. Instead, just wait for the other parties to naturally
> close and drop their references. Am I understanding you correctly?

Yes. What are receivers going to do when you notify them anyhow? What
will a mdev do? This is how you get into the crazy locking problems.

It is an error for userspace to shutdown like this, recover sensibly
and don't crash the kernel. PCIe error TLPs are expected, suppress
them. That is what we decided on the mmu notifier discussion.

> I feel having the notifications can add two values:
> 1. Shorten the duration of errors (as you mentioned below), FD close can
> take a long and unpredictable time. e.g. FD shared.

Only if userspace exits in some uncontrolled way. In a controlled exit
it can close all the FDs in the right order.

It is OK if userspace does something weird and ends up with disabled
IOASIDs. It shouldn't do that if it cares.

> 2. Provide teardown ordering among PASID users. i.e. vCPU, IOMMU, mdev.

This is a hard ask too, there is no natural ordering here I can see,
obviously we want vcpu, mdev, iommu for qemu but that doesn't seem to
fall out unless we explicitly hard wire it into the kernel.

Doesn't kvm always kill the vCPU first based on the mmu notifier
shooting down all the memory? IIRC this happens before FD close?

> > The duration between unmapping the ioasid and releasing all HW access
> > will have HW see PCIE TLP errors due to the blocked access. If
> > userspace messes up the order it is fine to cause this. We already had
> > this discussion when talking about how to deal with process exit in the
> > simple SVA case.
> Yes, we have disabled fault reporting during this period. The slight
> differences vs. the simple SVA case is that KVM is also involved and there
> might be an ordering requirement to stop vCPU first.

KVM can continue to use the PASIDs, they are parked and DMA is
permanently blocked. When KVM reaches a natural point in its teardown
it can release them.

If you have to stop the vcpu from a iommu notifier you are in the
crazy locking world I mentioned. IMHO don't create exciting locking
problems just to avoid PCI errors in uncontrolled shutdown.

Suppress the errors instead.

Jason


Re: [PATCH v2] IB/mlx5: Reduce max order of memory allocated for xlt update

2021-03-31 Thread Jason Gunthorpe
On Thu, Mar 25, 2021 at 11:39:28AM -0300, Jason Gunthorpe wrote:
> On Tue, Mar 23, 2021 at 09:27:38PM -0700, Aruna Ramakrishna wrote:
> 
> > > Do you have benchmarks that show the performance of the high order
> > > pages is not relavent? I'm a bit surprised to hear that
> > > 
> > 
> > I guess my point was more to the effect that an order-8 alloc will
> > fail more often than not, in this flow. For instance, when we were
> > debugging the latency spikes here, this was the typical buddyinfo
> > output on that system:
> > 
> > Node 0, zone    DMA      0      1      1      2      3      0      1      0      1      1      3
> > Node 0, zone  DMA32      7      7      7      6     10      2      6      7      6   2306
> > Node 0, zone Normal   3390  51354  17574   6556   1586     26      2      1      0      0      0
> > Node 1, zone Normal  11519  23315  23306   9738     73      2      0      1      0      0      0
> > 
> > I think this level of fragmentation is pretty normal on long running
> > systems. Here, in the reg_mr flow, the first try (order-8) alloc
> > will probably fail 9 times out of 10 (esp. after the addition of
> > GFP_NORETRY flag), and then as fallback, the code tries to allocate
> > a lower order, and if that too fails, it allocates a page. I think
> > it makes sense to just avoid trying an order-8 alloc here.
> 
> But a system like this won't get THPs either, so I'm not sure it is
> relevant. The function was designed as it is to consume a "THP" if it
> is available.

So can we do this with just the addition of __GFP_NORETRY ?

Jason


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-03-31 Thread Jason Gunthorpe
On Wed, Mar 31, 2021 at 09:34:57AM -0700, Jacob Pan wrote:

> "3.3 PASID translation
> To support PASID isolation for Shared Work Queues used by VMs, the CPU must
> provide a way for the PASID to be communicated to the device in the DMWr
> transaction. On Intel CPUs, the CPU provides a PASID translation table in
> the vCPUs virtual machine control structures. During ENQCMD/ENQCMDS
> instruction execution in a VM, the PASID translation table is used by the
> CPU to replace the guest PASID in the work descriptor with a host PASID
> before the descriptor is sent to the device.3.3 PASID translation"

Yikes, a special ENQCMD table in the hypervisor!

Still, pass the /dev/ioasid into a KVM IOCTL and tell it to populate
this table. KVM only adds to the table when userspace presents a
/dev/ioasid FD.

> > Doesn't work submission go either to the mdev driver or through the
> > secure PASID of #1?
> 
> No, once a PASID is bound with IOMMU, KVM, and the mdev, work
> submission is all done in HW.  But I don't think this will change
> for either uAPI design.

The big note here is "only for things that use ENQCMD" and that is
basically nothing these days.

> > Everything should revolve around the /dev/ioasid FD. qemu should pass
> > it to all places that need to know about PASID's in the VM.
> 
> I guess we need to extend KVM interface to support PASIDs. Our original
> intention was to avoid introducing new interfaces.

New features need new interfaces, especially if there is a security
sensitivity! KVM should *not* automatically opt into security
sensitive stuff without being explicitly told what to do.

Here you'd need to authorize *two* things for IDXD:
 - The mdev needs to be told it is allowed to use PASID, this tells
   the IOMMU driver to connect the pci device under the mdev
 - KVM needs to be told to populate a vPASID to the 'ENQCMD'
   security table translated to a physical PASID.

If qemu doesn't explicitly enable the ENQCMD security table it should
be *left disabled* by KVM - even if someone else is using PASID in the
same process. And the API should be narrow like this just to the
ENQCMD table as who knows what will come down the road, or how it will
work.

Having a PASID wrongly leak out into the VM would be a security
disaster. Be explicit.

> > We should try to avoid hidden behind the scenes kernel
> > interconnections between subsystems.
> > 
> Can we? in case of exception. Since all these IOCTLs are coming from the
> unreliable user space, we must deal with all exceptions.
>
> For example, when user closes /dev/ioasid FD before (or w/o) unbind IOCTL
> for VFIO, KVM, kernel must do cleanup and coordinate among subsystems.
> In this patchset, we have a per mm(ioasid_set) notifier to inform mdev, KVM
> to clean up and drop its refcount. Do you have any suggestion on this?

The ioasid should be a reference counted object.

When the FD is closed, or the ioasid is "destroyed" it just blocks DMA
and parks the PASID until *all* places release it. Upon a zero
refcount the PASID is recycled for future use.
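
As a sketch of that lifecycle (names hypothetical):

static void ioasid_data_release(struct kref *kref)
{
	struct ioasid_data *data =
		container_of(kref, struct ioasid_data, kref);

	ioasid_recycle(data->ioasid);	/* now free for future use */
	kfree(data);
}

void ioasid_destroy(struct ioasid_data *data)
{
	ioasid_block_dma(data);	/* park it, DMA now fails with TLP errors */
	kref_put(&data->kref, ioasid_data_release);
}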

The duration between unmapping the ioasid and releasing all HW access
will have HW see PCIE TLP errors due to the blocked access. If
userspace messes up the order it is fine to cause this. We already had
this discussion when talking about how to deal with process exit in the
simple SVA case.

> Let me try to paraphrase, you are suggesting common helper code and data
> format but still driver specific storage of the mapping, correct?

The driver just needs to hold the datastructure in its memory.

Like an xarray, the driver can have an xarray inside its struct
device, but the xarray library provides all the implementation.

Jason


Re: [PATCH v3 3/4] cxl/mem: Do not rely on device_add() side effects for dev_set_name() failures

2021-03-31 Thread Jason Gunthorpe
On Wed, Mar 31, 2021 at 09:04:32AM -0700, Dan Williams wrote:
> On Wed, Mar 31, 2021 at 6:10 AM Jason Gunthorpe  wrote:
> >
> > On Tue, Mar 30, 2021 at 04:36:42PM -0700, Dan Williams wrote:
> > > +static int cxl_mem_add_memdev(struct cxl_mem *cxlm)
> > > +{
> > > + struct cxl_memdev *cxlmd;
> > > + struct device *dev;
> > > + struct cdev *cdev;
> > > + int rc;
> > > +
> > > + cxlmd = cxl_memdev_alloc(cxlm);
> > > + if (IS_ERR(cxlmd))
> > > + return PTR_ERR(cxlmd);
> > > +
> > > + dev = &cxlmd->dev;
> > > + rc = dev_set_name(dev, "mem%d", cxlmd->id);
> > > + if (rc)
> > > + goto err;
> > >
> > > + cdev = &cxlmd->cdev;
> > >   cxl_memdev_activate(cxlmd, cxlm);
> > >   rc = cdev_device_add(cdev, dev);
> > >   if (rc)
> > > - goto err_add;
> > > + goto err;
> >
> > It might read nicer to have the error unwind here just call 
> > cxl_memdev_unregister()
> 
> Perhaps, but I don't think cdev_del() and device_del() are prepared to
> deal with an object that was not successfully added.

Oh, probably not, yuk yuk yuk.

Ideally cdev_device_add should not fail in a way that allows an open,
I think that is just an artifact of it being composed of smaller
functions..

For instance if we replace the kobj_map with xarray then we can
use xa_reserve and xa_store to avoid this condition.

This actually looks like a good fit because the dev_t has pretty
"lumpy" allocations and this isn't really performance sensitive.

A clever person could then make the dev_t self allocating and solve
another pain point with this interface. Hum..

Jason


Re: [PATCH v7 5/8] mm: Device exclusive memory access

2021-03-31 Thread Jason Gunthorpe
On Thu, Apr 01, 2021 at 12:27:52AM +1100, Alistair Popple wrote:
> On Thursday, 1 April 2021 12:18:54 AM AEDT Jason Gunthorpe wrote:
> > On Wed, Mar 31, 2021 at 11:59:28PM +1100, Alistair Popple wrote:
> > 
> > > I guess that makes sense as the split could go either way at the
> > > moment but I should add a check to make sure this isn't used with
> > > pinned pages anyway.
> > 
> > Is it possible to have a pinned page under one of these things? If I
> > pin it before you migrate it then it remains pinned but hidden under
> > the swap entry?
> 
> At the moment yes. But I had planned (and this reminded me) to add a check to 
> prevent marking pinned pages for exclusive access. 

How do you even do that without races with GUP fast?

Jason

