Re: [PATCH v3 03/18] driver core: platform: Rename platform_dma_configure()

2021-12-07 Thread Dan Williams
On Mon, Dec 6, 2021 at 7:04 AM Jason Gunthorpe  wrote:
>
> On Mon, Dec 06, 2021 at 06:47:45AM -0800, Christoph Hellwig wrote:
> > On Mon, Dec 06, 2021 at 10:45:35AM -0400, Jason Gunthorpe via iommu wrote:
> > > IIRC the only thing this function does is touch ACPI and OF stuff?
> > > Isn't that firmware?
> > >
> > > AFAICT amba uses this because AMBA devices might be linked to DT
> > > descriptions?
> >
> > But DT descriptions aren't firmware.  They are usually either passed on
> > by the bootloader or in some deeply embedded setups embedded into the
> > kernel image.
>
> Pedantically yes, but do you know of a common word to refer to both
> OF and ACPI that is better than firmware? :)
>
> AFAICT we already use firmware for this in a few places, eg
> fwnode_handle and so on.

I've always thought 'platform' was the generic name for otherwise
non-enumerable platform-firmware/data things enumerated by ACPI / OF.


Re: [PATCH v2 9/9] iommu/vt-d: Use pci core's DVSEC functionality

2021-09-28 Thread Dan Williams
On Thu, Sep 23, 2021 at 10:27 AM Ben Widawsky  wrote:
>
> Reduce maintenance burden of DVSEC query implementation by using the
> centralized PCI core implementation.
>
> Cc: iommu@lists.linux-foundation.org
> Cc: David Woodhouse 
> Cc: Lu Baolu 
> Signed-off-by: Ben Widawsky 
> ---
>  drivers/iommu/intel/iommu.c | 15 +--
>  1 file changed, 1 insertion(+), 14 deletions(-)
>
> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> index d75f59ae28e6..30c97181f0ae 100644
> --- a/drivers/iommu/intel/iommu.c
> +++ b/drivers/iommu/intel/iommu.c
> @@ -5398,20 +5398,7 @@ static int intel_iommu_disable_sva(struct device *dev)
>   */
>  static int siov_find_pci_dvsec(struct pci_dev *pdev)
>  {
> -   int pos;
> -   u16 vendor, id;
> -
> -   pos = pci_find_next_ext_capability(pdev, 0, 0x23);
> -   while (pos) {
> > -   pci_read_config_word(pdev, pos + 4, &vendor);
> > -   pci_read_config_word(pdev, pos + 8, &id);
> -   if (vendor == PCI_VENDOR_ID_INTEL && id == 5)
> -   return pos;
> -
> -   pos = pci_find_next_ext_capability(pdev, pos, 0x23);
> -   }
> -
> -   return 0;
> +   return pci_find_dvsec_capability(pdev, PCI_VENDOR_ID_INTEL, 5);
>  }

Same comments as the CXL patch, siov_find_pci_dvsec() doesn't seem to
have a reason to exist anymore. What is 5?
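
Something self-documenting would be preferable, e.g. (a sketch; the
macro name is hypothetical, not an existing header definition):

#define PCI_DVSEC_ID_INTEL_SIOV 5	/* illustrative name for dvsec id 5 */

	/* ...at the caller, no wrapper function needed: */
	pos = pci_find_dvsec_capability(pdev, PCI_VENDOR_ID_INTEL,
					PCI_DVSEC_ID_INTEL_SIOV);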


Re: [PATCH v2 7/9] cxl/pci: Use pci core's DVSEC functionality

2021-09-28 Thread Dan Williams
On Thu, Sep 23, 2021 at 10:27 AM Ben Widawsky  wrote:
>
> Reduce maintenance burden of DVSEC query implementation by using the
> centralized PCI core implementation.
>
> Signed-off-by: Ben Widawsky 
> ---
>  drivers/cxl/pci.c | 20 +---
>  1 file changed, 1 insertion(+), 19 deletions(-)
>
> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> index 5eaf2736f779..79d4d9b16d83 100644
> --- a/drivers/cxl/pci.c
> +++ b/drivers/cxl/pci.c
> > @@ -340,25 +340,7 @@ static void cxl_pci_unmap_regblock(struct cxl_mem *cxlm, struct cxl_register_map
>
>  static int cxl_pci_dvsec(struct pci_dev *pdev, int dvsec)

cxl_pci_dvsec() has no reason to exist anymore. Let's just have the
caller use pci_find_dvsec_capability() directly.
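
I.e. something like this sketch (PCI_DVSEC_VENDOR_ID_CXL is the
driver's existing vendor-id define; the register-locator id name here
is approximate):

-	regloc = cxl_pci_dvsec(pdev, PCI_DVSEC_ID_CXL_REGLOC);
+	regloc = pci_find_dvsec_capability(pdev, PCI_DVSEC_VENDOR_ID_CXL,
+					   PCI_DVSEC_ID_CXL_REGLOC);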


Re: [PATCH v2 5/9] cxl/pci: Make more use of cxl_register_map

2021-09-28 Thread Dan Williams
On Thu, Sep 23, 2021 at 10:27 AM Ben Widawsky  wrote:
>
> The structure exists to pass around information about register mapping.
> Using it more extensively cleans up many existing functions.

I would have liked to have seen "add @base to cxl_register_map" and
"use @map for @bar and @offset arguments" somewhere in this changelog
to set expectations for what changes are included. That would have
also highlighted that adding a @base to cxl_register_map deserves its
own patch vs the conversion of @bar and @offset to instead use
@map->bar and @map->offset. Can you resend with that split and those
mentions?


Re: [PATCH v2 4/9] cxl/pci: Refactor cxl_pci_setup_regs

2021-09-28 Thread Dan Williams
On Thu, Sep 23, 2021 at 10:27 AM Ben Widawsky  wrote:
>
> In preparation for moving parts of register mapping to cxl_core, the
> cxl_pci driver is refactored to utilize a new helper to find register
> blocks by type.
>
> Previously, cxl_pci scanned through all register blocks and mapped the
> ones that the driver will use. This logic is inverted so that the driver
> specifically requests the register blocks from a new helper. Under the
> hood, the same implementation of scanning through all register locator
> DVSEC entries exists.
>
> There are 2 behavioral changes (#2 is arguable):
> 1. A dev_err is introduced if cxl_map_regs fails.
> 2. The previous logic would try to map component registers and device
>    registers multiple times if they were present, keeping the mapping
>    of the last one found (furthest offset in the register locator).
>    While this is disallowed by the spec (CXL 2.0 8.1.9: "Each register
>    block identifier shall only occur once in the Register Locator DVSEC
>    structure"), it was how the driver would respond to the spec
>    violation. The new logic will take the first register block found
>    by type and move on.
>
> Signed-off-by: Ben Widawsky 
>
> ---
> Changes since v1:

No changes? Luckily git am strips this section...

Overall I think this refactor can be broken down further for
readability and cleanup the long standing problem that the driver maps
component registers for no reason. The main contributing factor to
readability is that cxl_setup_pci_regs() still exists after the
refactor, which also contributes to the component register problem. If
the register mapping is up leveled to the caller of
cxl_setup_pci_regs() (and drops mapping component registers) then a
follow-on patch to rename cxl_setup_pci_regs to find_register_block
becomes easier to read. Moving the cxl_register_map array out of
cxl_setup_pci_regs() also makes a later patch to pass component
register enumeration details to the endpoint-port that much cleaner.


Re: [PATCH v2 3/9] cxl/pci: Remove pci request/release regions

2021-09-28 Thread Dan Williams
On Thu, Sep 23, 2021 at 10:26 AM Ben Widawsky  wrote:
>
> Quoting Dan, "... the request + release regions should probably just be
> dropped. It's not like any of the register enumeration would collide
> with someone else who already has the registers mapped. The collision
> only comes when the registers are mapped for their final usage, and that
> will have more precision in the request."

Looks good to me:

Reviewed-by: Dan Williams 

>
> Recommended-by: Dan Williams 

This isn't one of the canonical tags:
Documentation/process/submitting-patches.rst

I'll change this to Suggested-by:

> Signed-off-by: Ben Widawsky 
> ---
>  drivers/cxl/pci.c | 5 -
>  1 file changed, 5 deletions(-)
>
> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> index ccc7c2573ddc..7256c236fdb3 100644
> --- a/drivers/cxl/pci.c
> +++ b/drivers/cxl/pci.c
> @@ -453,9 +453,6 @@ static int cxl_pci_setup_regs(struct cxl_mem *cxlm)
> return -ENXIO;
> }
>
> -   if (pci_request_mem_regions(pdev, pci_name(pdev)))
> -   return -ENODEV;
> -
> /* Get the size of the Register Locator DVSEC */
> pci_read_config_dword(pdev, regloc + PCI_DVSEC_HEADER1, &regloc_size);
> regloc_size = FIELD_GET(PCI_DVSEC_HEADER1_LENGTH_MASK, regloc_size);
> @@ -499,8 +496,6 @@ static int cxl_pci_setup_regs(struct cxl_mem *cxlm)
> n_maps++;
> }
>
> -   pci_release_mem_regions(pdev);
> -
> for (i = 0; i < n_maps; i++) {
> ret = cxl_map_regs(cxlm, &maps[i]);
> if (ret)
> --
> 2.33.0
>


Re: [PATCH v2 2/9] cxl/pci: Remove dev_dbg for unknown register blocks

2021-09-28 Thread Dan Williams
On Thu, Sep 23, 2021 at 10:27 AM Ben Widawsky  wrote:
>
> While interesting to driver developers, the dev_dbg message doesn't do
> much except clutter up logs. This information should be attainable
> through sysfs, and someday lspci like utilities. This change
> additionally helps reduce the LOC in a subsequent patch to refactor some
> of cxl_pci register mapping.

Looks good to me:

Reviewed-by: Dan Williams 


Re: [PATCH v2 1/9] cxl: Convert "RBI" to enum

2021-09-27 Thread Dan Williams
Please spell out "register block indicator" in the subject so that the
shortlog remains somewhat readable.

On Thu, Sep 23, 2021 at 10:27 AM Ben Widawsky  wrote:
>
> In preparation for passing around the Register Block Indicator (RBI) as
> a parameter, it is desirable to convert the type to an enum so that the
> interface can use a well defined type checked parameter.

C wouldn't type check this unless it failed an integer conversion,
right? It would need to be a struct to get useful type checking.

I don't mind this for the self documenting properties it has for the
functions that will take this as a parameter, but maybe clarify what
you mean by type checked parameter?
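
For example (a stand-alone sketch, not code from this patch; all names
hypothetical):

enum cxl_regloc_type { CXL_REGLOC_RBI_COMPONENT = 1 };
void take_enum(enum cxl_regloc_type type);

struct cxl_rbi { int id; };		/* wrapper-struct alternative */
void take_struct(struct cxl_rbi type);

void example(void)
{
	take_enum(42);	/* compiles: C silently converts int to enum */
	take_struct((struct cxl_rbi){ 42 });	/* ok */
	/* take_struct(42);  <-- this one would be a hard compile error */
}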

>
> As a result of this change, should future versions of the spec add
> sparsely defined identifiers, checking for invalid identifiers could
> become a problem since the code currently checks against the max
> identifier. This is not an issue with the current spec, and the algorithm to
> obtain the register blocks will change before those possible additions
> are made.

In general let's not spend changelog space trying to guess what future
specs may or may not do. I.e. I think this text can be dropped,
especially because enums can support sparse number spaces.
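
E.g. an enum has no trouble encoding a sparse id space directly
(identifiers hypothetical):

enum cxl_regloc_type {
	CXL_REGLOC_RBI_EMPTY = 0,
	CXL_REGLOC_RBI_COMPONENT = 1,
	CXL_REGLOC_RBI_VENDOR = 0x80,	/* sparse assignments just work */
};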


Re: [PATCH 00/16] Add new DMA mapping operation for P2PDMA

2021-04-27 Thread Dan Williams
On Tue, Apr 27, 2021 at 1:22 PM John Hubbard  wrote:
>
> On 4/27/21 12:28 PM, Jason Gunthorpe wrote:
> > On Thu, Apr 08, 2021 at 11:01:07AM -0600, Logan Gunthorpe wrote:
> >> Hi,
> >>
> >> This patchset continues my work to add P2PDMA support to the common
> >> dma map operations. This allows for creating SGLs that have both P2PDMA
> >> and regular pages which is a necessary step to allowing P2PDMA pages in
> >> userspace.
> >>
> >> The earlier RFC[1] generated a lot of great feedback and I heard no show
> >> stopping objections. Thus, I've incorporated all the feedback and have
> >> decided to post this as a proper patch series with hopes of eventually
> >> getting it in mainline.
> >>
> >> I'm happy to do a few more passes if anyone has any further feedback
> >> or better ideas.
> >
> > For the user of the DMA API the idea seems reasonable enough, the next
> > steps to integrate with pin_user_pages() seem fairly straightforward
> > too
> >
> > Was there no feedback on this at all?
> >
>
> oops, I meant to review this a lot sooner, because this whole p2pdma thing is
> actually very interesting and important...somehow it slipped but I'll take
> a look now.

Still in my queue as well behind Joao's memmap consolidation series,
and a recent copy_mc_to_iter() fix series from Al.


Re: [RFC PATCH 04/15] lib/scatterlist: Add flag for indicating P2PDMA segments in an SGL

2020-12-09 Thread Dan Williams
On Wed, Dec 9, 2020 at 6:07 PM Logan Gunthorpe  wrote:
>
>
>
> On 2020-12-09 6:22 p.m., Dan Williams wrote:
> > On Mon, Nov 9, 2020 at 8:47 AM Logan Gunthorpe  wrote:
> >>
> >>
> >>
> >> On 2020-11-09 2:12 a.m., Christoph Hellwig wrote:
> >>> On Fri, Nov 06, 2020 at 10:00:25AM -0700, Logan Gunthorpe wrote:
> >>>> We make use of the top bit of the dma_length to indicate a P2PDMA
> >>>> segment.
> >>>
> >>> I don't think "we" can.  There is nothing limiting the size of a SGL
> >>> segment.
> >>
> >> Yes, I expected this would be the unacceptable part. Any alternative ideas?
> >
> > Why is the SG_P2PDMA_FLAG needed as compared to checking the SGL
> > segment-pages for is_pci_p2pdma_page()?
>
> Because the DMA and page segments in the SGL aren't necessarily aligned...
>
> The IOMMU implementations can coalesce multiple pages into fewer DMA
> address ranges, so the page pointed to by sg->page_link may not be the
> one that corresponds to the address in sg->dma_address for a given segment.
>
> If that makes sense -- it's not the easiest thing to explain.

It does...

Did someone already grab, or did you already consider the 3rd
available bit in page_link? AFAICS only SG_CHAIN and SG_END are
reserved. However, if you have a CONFIG_64BIT dependency for
user-directed p2pdma that would seem to allow SG_P2PDMA_FLAG to be
(0x4) in page_link.
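
I.e. something along these lines (a sketch following current
scatterlist.h conventions; SG_P2PDMA_FLAG is the proposed name, the
other values mirror SG_CHAIN / SG_END):

#define SG_CHAIN	0x01UL
#define SG_END		0x02UL
#define SG_P2PDMA_FLAG	0x04UL	/* valid given 8-byte-aligned struct page */

static inline bool sg_is_p2pdma(struct scatterlist *sg)
{
	return sg->page_link & SG_P2PDMA_FLAG;
}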


Re: [RFC PATCH 04/15] lib/scatterlist: Add flag for indicating P2PDMA segments in an SGL

2020-12-09 Thread Dan Williams
On Mon, Nov 9, 2020 at 8:47 AM Logan Gunthorpe  wrote:
>
>
>
> On 2020-11-09 2:12 a.m., Christoph Hellwig wrote:
> > On Fri, Nov 06, 2020 at 10:00:25AM -0700, Logan Gunthorpe wrote:
> >> We make use of the top bit of the dma_length to indicate a P2PDMA
> >> segment.
> >
> > I don't think "we" can.  There is nothing limiting the size of a SGL
> > segment.
>
> Yes, I expected this would be the unacceptable part. Any alternative ideas?

Why is the SG_P2PDMA_FLAG needed as compared to checking the SGL
segment-pages for is_pci_p2pdma_page()?


Re: [PATCH] dma-debug: fix debug_dma_assert_idle(), use rcu_read_lock()

2020-08-13 Thread Dan Williams
On Thu, Aug 13, 2020 at 12:03 PM Linus Torvalds
 wrote:
>
> On Wed, Aug 12, 2020 at 8:17 PM Hugh Dickins  wrote:
> >
> > Since commit 2a9127fcf229 ("mm: rewrite wait_on_page_bit_common() logic")
> > improved unlock_page(), it has become more noticeable how cow_user_page()
> > in a kernel with CONFIG_DMA_API_DEBUG=y can create and suffer from heavy
> > contention on DMA debug's radix_lock in debug_dma_assert_idle().
>
> Ooh.
>
> Yeah, that's ridiculously expensive, and serializes things for no good reason.
>
> Your patch looks obviously correct to me (Christoph?), but it also
> makes me go "why are we doing this in the first place"?
>
> Because it looks to me like
>  (a) the debug check is wrong
>  (b) this is left-over from early debugging
>
> In particular, I don't see why we couldn't do a COW on a page that is
> under writeback at the same time. We're not changing the page that is
> doing DMA.
>
> In fact, the whole "COW with DMA" makes me feel like the real bug may
> have been due that whole "ambiguous COW" thing, which was fixed in
> 17839856fd58 ("gup: document and work around "COW can break either
> way" issue")
>
> That debug thing goes back almost 7 years, and I don't think it has
> caught anything in those seven years, but I could be wrong.
>
> The commit that adds it does talk about a bug, but that code was
> removed entirely eventually. And google shows no hits for
> debug_dma_assert_idle() since - until your email.
>
> So my gut feel is that we should remove the check entirely, although
> your patch does seem like a big improvement.
>
> Christoph?
>
> (And Dan too, of course, in case he happens to be relaxing in front of
> the computer away from a newborn baby ;)
>

I can at least confirm that it has not caught anything in a long while
except a false positive that needed a fix up.

https://lore.kernel.org/lkml/capcyv4hy_nne8g0o8smrz9a8hcdrzaukgxmvdjkusaaa3fo...@mail.gmail.com/

Part of me says it's not doing anything worthwhile upstream, but I
wonder if it is keeping some people from submitting patches that play
these page reference shenanigans? I know they're out there. The land
of gup and truncate is where questionable kernel changes go to die.

Outside of that, Hugh's patch looks like a definite improvement so I'd
be inclined to run with that, but rip the whole facility out at the
next sign of a false positive.


Re: [PATCH 09/15] device core: Add ability to handle multiple dma offsets

2020-05-20 Thread Dan Williams
On Wed, May 20, 2020 at 11:27 AM Jim Quinlan  wrote:
>
> Sorry, I meant to put you on the to-list for all patches.   The last
> time I sent out this many patches using a collective cc-list for all
> patches I was told to reduce my cc-list.

You'd be forgiven. Some developers are ok with going to read the thread
on something like lore if they are cc'd on only a subset, and some
require the whole thread copied to them. Perhaps we need an
entry in MAINTAINERS that makes this preference discoverable? To date
I have been manually keeping track of those who want full threads and
those that would prefer to just be cc'd on the cover letter and the
one patch that directly affects their maintenance area.

Certainly blindly cc'ing everyone recommended by
scripts/get_maintainers.pl is overkill, but finding that subset is a
bit of an art.


Re: [PATCH v8 13/14] IB/core: use MAP_DIRECT to fix / enable RDMA to DAX mappings

2017-10-11 Thread Dan Williams
On Wed, Oct 11, 2017 at 4:54 AM, Joerg Roedel <j...@8bytes.org> wrote:
> On Tue, Oct 10, 2017 at 07:50:12AM -0700, Dan Williams wrote:
>> +static void ib_umem_lease_break(void *__umem)
>> +{
>> + struct ib_umem *umem = __umem;
>> + struct ib_device *idev = umem->context->device;
>> + struct device *dev = idev->dma_device;
>> + struct scatterlist *sgl = umem->sg_head.sgl;
>> +
>> + iommu_unmap(umem->iommu, sg_dma_address(sgl) & PAGE_MASK,
>> + iommu_sg_num_pages(dev, sgl, umem->npages));
>> +}
>
> This looks like an invitation to break your code by random iommu-driver
> changes. There is no guarantee that an iommu-backed dma-api
> implementation will map exactly iommu_sg_num_pages() pages for a given
> sg-list. In other words, you are mixing the use of the IOMMU-API and the
> DMA-API in an incompatible way that only works because you know the
> internals of the iommu-drivers.
>
> I've seen in another patch that your changes strictly require an IOMMU,
> so what you should do instead is to switch from the DMA-API to the
> IOMMU-API and do the address-space management yourself.
>

Ok, I'll switch over completely to the iommu api for this. It will
also address Robin's concern.
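
For reference, the shape of that conversion (just a sketch of the
IOMMU-API calls involved, with placeholder iova/phys/size, not the
eventual patch):

struct iommu_domain *domain = iommu_domain_alloc(&pci_bus_type);

iommu_attach_device(domain, dev);
iommu_map(domain, iova, phys, size, IOMMU_READ | IOMMU_WRITE);
/* ... RDMA traffic ... */
iommu_unmap(domain, iova, size);
iommu_detach_device(domain, dev);
iommu_domain_free(domain);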


Re: [PATCH v8 01/14] mm: introduce MAP_SHARED_VALIDATE, a mechanism to safely define new mmap flags

2017-10-11 Thread Dan Williams
On Wed, Oct 11, 2017 at 12:43 AM, Jan Kara <j...@suse.cz> wrote:
> On Tue 10-10-17 07:49:01, Dan Williams wrote:
>> The mmap(2) syscall suffers from the ABI anti-pattern of not validating
>> unknown flags. However, proposals like MAP_SYNC and MAP_DIRECT need a
>> mechanism to define new behavior that is known to fail on older kernels
>> without the support. Define a new MAP_SHARED_VALIDATE flag pattern that
>> is guaranteed to fail on all legacy mmap implementations.
>>
>> It is worth noting that the original proposal was for a standalone
>> MAP_VALIDATE flag. However, when that could not be supported by all
>> archs Linus observed:
>>
>> I see why you *think* you want a bitmap. You think you want
>> a bitmap because you want to make MAP_VALIDATE be part of MAP_SYNC
>> etc, so that people can do
>>
>> ret = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED
>>   | MAP_SYNC, fd, 0);
>>
>> and "know" that MAP_SYNC actually takes.
>>
>> And I'm saying that whole wish is bogus. You're fundamentally
>> depending on special semantics, just make it explicit. It's already
>> not portable, so don't try to make it so.
>>
>> Rename that MAP_VALIDATE as MAP_SHARED_VALIDATE, make it have a value
>> of 0x3, and make people do
>>
>> ret = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED_VALIDATE
>>   | MAP_SYNC, fd, 0);
>>
>> and then the kernel side is easier too (none of that random garbage
>> playing games with looking at the "MAP_VALIDATE bit", but just another
>> case statement in that map type thing.
>>
>> Boom. Done.
>>
>> Similar to ->fallocate() we also want the ability to validate the
>> support for new flags on a per ->mmap() 'struct file_operations'
>> instance basis.  Towards that end arrange for flags to be generically
>> validated against a mmap_supported_mask exported by 'struct
>> file_operations'. By default all existing flags are implicitly
>> supported, but new flags require MAP_SHARED_VALIDATE and
>> per-instance-opt-in.
>>
>> Cc: Jan Kara <j...@suse.cz>
>> Cc: Arnd Bergmann <a...@arndb.de>
>> Cc: Andy Lutomirski <l...@kernel.org>
>> Cc: Andrew Morton <a...@linux-foundation.org>
>> Suggested-by: Christoph Hellwig <h...@lst.de>
>> Suggested-by: Linus Torvalds <torva...@linux-foundation.org>
>> Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
>> ---
>>  arch/alpha/include/uapi/asm/mman.h   |1 +
>>  arch/mips/include/uapi/asm/mman.h|1 +
>>  arch/mips/kernel/vdso.c  |2 +
>>  arch/parisc/include/uapi/asm/mman.h  |1 +
>>  arch/tile/mm/elf.c   |3 +-
>>  arch/xtensa/include/uapi/asm/mman.h  |1 +
>>  include/linux/fs.h   |2 +
>>  include/linux/mm.h   |2 +
>>  include/linux/mman.h |   39 ++
>>  include/uapi/asm-generic/mman-common.h   |1 +
>>  mm/mmap.c|   21 --
>>  tools/include/uapi/asm-generic/mman-common.h |1 +
>>  12 files changed, 69 insertions(+), 6 deletions(-)
>>
>> diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
>> index 3b26cc62dadb..92823f24890b 100644
>> --- a/arch/alpha/include/uapi/asm/mman.h
>> +++ b/arch/alpha/include/uapi/asm/mman.h
>> @@ -14,6 +14,7 @@
>>  #define MAP_TYPE 0x0f/* Mask for type of mapping (OSF/1 is _wrong_) */
>>  #define MAP_FIXED0x100   /* Interpret addr exactly */
>>  #define MAP_ANONYMOUS0x10/* don't use a file */
>> +#define MAP_SHARED_VALIDATE 0x3  /* share + validate extension flags */
>
> Just a nit but I'd put definition of MAP_SHARED_VALIDATE close to the
> definition of MAP_SHARED and MAP_PRIVATE where it logically belongs (for
> all archs).

Will do.

>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index f8c10d336e42..5c4c98e4adc9 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -2133,7 +2133,7 @@ extern unsigned long get_unmapped_area(struct file *, unsigned long, unsigned lo
>>
>>  extern unsigned long mmap_region(struct file *file, unsigned long addr,
>>   unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
>> - struct list_

Re: [PATCH v8 06/14] xfs: wire up MAP_DIRECT

2017-10-10 Thread Dan Williams
On Tue, Oct 10, 2017 at 6:09 PM, Dave Chinner <da...@fromorbit.com> wrote:
> On Tue, Oct 10, 2017 at 07:49:30AM -0700, Dan Williams wrote:
>> @@ -1009,6 +1019,22 @@ xfs_file_llseek(
>>  }
>>
>>  /*
>> + * MAP_DIRECT faults can only be serviced while the FL_LAYOUT lease is
>> + * valid. See map_direct_invalidate.
>> + */
>> +static int
>> +xfs_can_fault_direct(
>> + struct vm_area_struct   *vma)
>> +{
>> + if (!xfs_vma_is_direct(vma))
>> + return 0;
>> +
>> + if (!test_map_direct_valid(vma->vm_private_data))
>> + return VM_FAULT_SIGBUS;
>> + return 0;
>> +}
>
> Better, but I'm going to be an annoying pedant here: a "can
> " check should return a boolean true/false.
>
> Also, it's a bit jarring to see that a non-direct VMA that /can't/
> do direct faults returns the same thing as a direct-vma that /can/
> do direct faults, so a couple of extra comments for people who will
> quickly forget how this code works (i.e. me) will be helpful. Say
> something like this:
>
> /*
>  * MAP_DIRECT faults can only be serviced while the FL_LAYOUT lease is
>  * valid. See map_direct_invalidate.
>  */
> static bool
> xfs_vma_has_direct_lease(
> struct vm_area_struct   *vma)
> {
> /* Non MAP_DIRECT vmas do not require layout leases */
> if (!xfs_vma_is_direct(vma))
> return true;
>
> if (!test_map_direct_valid(vma->vm_private_data))
> return false;
>
> /* We have a valid lease */
> return true;
> }
>
> .
> if (!xfs_vma_has_direct_lease(vma)) {
> ret = VM_FAULT_SIGBUS;
> goto out_unlock;
> }
> 


Looks good to me.


Re: [PATCH v8 04/14] xfs: prepare xfs_break_layouts() for reuse with MAP_DIRECT

2017-10-10 Thread Dan Williams
On Tue, Oct 10, 2017 at 5:46 PM, Dave Chinner <da...@fromorbit.com> wrote:
> On Tue, Oct 10, 2017 at 07:49:17AM -0700, Dan Williams wrote:
>> Move xfs_break_layouts() to its own compilation unit so that it can be
>> used for both pnfs layouts and MAP_DIRECT mappings.
> .
>> diff --git a/fs/xfs/xfs_pnfs.h b/fs/xfs/xfs_pnfs.h
>> index b587cb99b2b7..4135b2482697 100644
>> --- a/fs/xfs/xfs_pnfs.h
>> +++ b/fs/xfs/xfs_pnfs.h
>> @@ -1,19 +1,13 @@
>>  #ifndef _XFS_PNFS_H
>>  #define _XFS_PNFS_H 1
>>
>> +#include "xfs_layout.h"
>> +
>
> I missed this the first time through - we try not to put includes
> in header files, and instead make sure each C file has all the
> includes they require. Can you move this to all the C files that
> need layouts and remove the include of the xfs_pnfs.h include from
> them?

Sure, will do.


[PATCH v8 12/14] iommu/vt-d: use iommu_num_sg_pages

2017-10-10 Thread Dan Williams
Use the common helper for accounting the size of the IOVA range for a
scatterlist so that iommu and dma apis agree on the size of a
scatterlist. This is in support of using iommu_unmap() in advance of
dma_unmap_sg() to invalidate an io-mapping in advance of the IOVA range
being deallocated. MAP_DIRECT needs this functionality for force
revoking RDMA access to a DAX mapping when userspace fails to respond
within a lease break timeout period.

Cc: Ashok Raj <ashok@intel.com>
Cc: David Woodhouse <dw...@infradead.org>
Cc: Joerg Roedel <j...@8bytes.org>
Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
---
 drivers/iommu/intel-iommu.c |   19 +--
 1 file changed, 5 insertions(+), 14 deletions(-)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index f3f4939cebad..94a5fbe62fb8 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -3785,14 +3785,9 @@ static void intel_unmap_sg(struct device *dev, struct scatterlist *sglist,
   unsigned long attrs)
 {
dma_addr_t startaddr = sg_dma_address(sglist) & PAGE_MASK;
-   unsigned long nrpages = 0;
-   struct scatterlist *sg;
-   int i;
-
-   for_each_sg(sglist, sg, nelems, i) {
-   nrpages += aligned_nrpages(sg_dma_address(sg), sg_dma_len(sg));
-   }
+   unsigned long nrpages;
 
+   nrpages = iommu_sg_num_pages(dev, sglist, nelems);
intel_unmap(dev, startaddr, nrpages << VTD_PAGE_SHIFT);
 }
 
@@ -3813,14 +3808,12 @@ static int intel_nontranslate_map_sg(struct device *hddev,
 static int intel_map_sg(struct device *dev, struct scatterlist *sglist, int nelems,
enum dma_data_direction dir, unsigned long attrs)
 {
-   int i;
struct dmar_domain *domain;
size_t size = 0;
int prot = 0;
unsigned long iova_pfn;
int ret;
-   struct scatterlist *sg;
-   unsigned long start_vpfn;
+   unsigned long start_vpfn, npages;
struct intel_iommu *iommu;
 
BUG_ON(dir == DMA_NONE);
@@ -3833,11 +3826,9 @@ static int intel_map_sg(struct device *dev, struct scatterlist *sglist, int nelems,
 
iommu = domain_get_iommu(domain);
 
-   for_each_sg(sglist, sg, nelems, i)
-   size += aligned_nrpages(sg->offset, sg->length);
+   npages = iommu_sg_num_pages(dev, sglist, nelems);
 
-   iova_pfn = intel_alloc_iova(dev, domain, dma_to_mm_pfn(size),
-   *dev->dma_mask);
+   iova_pfn = intel_alloc_iova(dev, domain, npages, *dev->dma_mask);
if (!iova_pfn) {
sglist->dma_length = 0;
return 0;



[PATCH v8 13/14] IB/core: use MAP_DIRECT to fix / enable RDMA to DAX mappings

2017-10-10 Thread Dan Williams
Currently the ibverbs core in the kernel is completely unaware of the
dangers of filesystem-DAX mappings. Specifically, the filesystem is free
to move file blocks at will. In the case of DAX, it means that RDMA to a
given file offset can dynamically switch to another file offset, another
file, or free space with no notification to RDMA device to cease
operations. Historically, this lack of communication between the ibverbs
core and filesystem was not a problem because RDMA always targeted
dynamically allocated page cache, so at least the RDMA device would have
valid memory to target even if the file was being modified. With DAX we
need to add coordination since RDMA is bypassing page-cache and going
direct to on-media pages of the file. RDMA to DAX can cause damage if
filesystem blocks move / change state.

Use the new ->lease_direct() operation to get a notification when the
filesystem is invalidating the block map of the file and needs RDMA
operations to stop. Given that the kernel can not be in a position where
it needs to wait indefinitely for userspace to stop a device we need a
mechanism where the kernel can force-revoke access. Towards that end, use
dma_get_iommu_domain() to both check if the device has domain
mappings that can be invalidated and retrieve the iommu_domain for use
with iommu_unmap.

Once we have that assurance that we can block in-flight I/O when the
file's block map changes then we can safely allow RDMA to DAX.

Cc: Sean Hefty <sean.he...@intel.com>
Cc: Doug Ledford <dledf...@redhat.com>
Cc: Hal Rosenstock <hal.rosenst...@gmail.com>
Cc: Jan Kara <j...@suse.cz>
Cc: Jeff Moyer <jmo...@redhat.com>
Cc: Christoph Hellwig <h...@lst.de>
Cc: Dave Chinner <da...@fromorbit.com>
Cc: Joerg Roedel <j...@8bytes.org>
Cc: David Woodhouse <dw...@infradead.org>
Cc: Ashok Raj <ashok@intel.com>
Cc: "Darrick J. Wong" <darrick.w...@oracle.com>
Cc: Ross Zwisler <ross.zwis...@linux.intel.com>
Cc: Jeff Layton <jlay...@poochiereds.net>
Cc: "J. Bruce Fields" <bfie...@fieldses.org>
Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
---
 drivers/infiniband/core/umem.c |   90 +++-
 include/rdma/ib_umem.h |8 
 2 files changed, 86 insertions(+), 12 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 21e60b1e2ff4..5e4598982359 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -36,6 +36,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -46,10 +47,16 @@
 
 static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int dirty)
 {
+   struct lease_direct *ld, *_ld;
struct scatterlist *sg;
struct page *page;
int i;
 
+   list_for_each_entry_safe(ld, _ld, &umem->leases, list) {
+   list_del_init(&ld->list);
+   map_direct_lease_destroy(ld);
+   }
+
if (umem->nmap > 0)
ib_dma_unmap_sg(dev, umem->sg_head.sgl,
umem->npages,
@@ -64,10 +71,20 @@ static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int d
	}

	sg_free_table(&umem->sg_head);
-   return;
 
 }
 
+static void ib_umem_lease_break(void *__umem)
+{
+   struct ib_umem *umem = __umem;
+   struct ib_device *idev = umem->context->device;
+   struct device *dev = idev->dma_device;
+   struct scatterlist *sgl = umem->sg_head.sgl;
+
+   iommu_unmap(umem->iommu, sg_dma_address(sgl) & PAGE_MASK,
+   iommu_sg_num_pages(dev, sgl, umem->npages));
+}
+
 /**
  * ib_umem_get - Pin and DMA map userspace memory.
  *
@@ -96,7 +113,10 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
struct scatterlist *sg, *sg_list_start;
int need_release = 0;
unsigned int gup_flags = FOLL_WRITE;
+   struct vm_area_struct *vma_prev = NULL;
+   struct device *dma_dev;
 
+   dma_dev = context->device->dma_device;
if (dmasync)
dma_attrs |= DMA_ATTR_WRITE_BARRIER;
 
@@ -120,6 +140,7 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
umem->address= addr;
umem->page_shift = PAGE_SHIFT;
umem->pid= get_task_pid(current, PIDTYPE_PID);
+   INIT_LIST_HEAD(&umem->leases);
/*
 * We ask for writable memory if any of the following
 * access flags are set.  "Local write" and "remote write"
@@ -147,19 +168,21 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
umem->hugetlb   = 1;
 
page_list = (struct page **) __get_free_page(GFP_KERNEL);
-   if (!page_list) {
-   put_pid(umem->pid);
-   kfree(umem);
-

[PATCH v8 14/14] tools/testing/nvdimm: enable rdma unit tests

2017-10-10 Thread Dan Williams
Provide a mock dma_get_iommu_domain() for the ibverbs core. Enable
ib_umem_get() to satisfy its DAX safety checks for a controlled test.

Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
---
 tools/testing/nvdimm/Kbuild |   31 +++
 tools/testing/nvdimm/config_check.c |2 ++
 tools/testing/nvdimm/test/iomap.c   |   14 ++
 3 files changed, 47 insertions(+)

diff --git a/tools/testing/nvdimm/Kbuild b/tools/testing/nvdimm/Kbuild
index d870520da68b..f4a007090950 100644
--- a/tools/testing/nvdimm/Kbuild
+++ b/tools/testing/nvdimm/Kbuild
@@ -15,11 +15,13 @@ ldflags-y += --wrap=insert_resource
 ldflags-y += --wrap=remove_resource
 ldflags-y += --wrap=acpi_evaluate_object
 ldflags-y += --wrap=acpi_evaluate_dsm
+ldflags-y += --wrap=dma_get_iommu_domain
 
 DRIVERS := ../../../drivers
 NVDIMM_SRC := $(DRIVERS)/nvdimm
 ACPI_SRC := $(DRIVERS)/acpi/nfit
 DAX_SRC := $(DRIVERS)/dax
+IBCORE := $(DRIVERS)/infiniband/core
 ccflags-y := -I$(src)/$(NVDIMM_SRC)/
 
 obj-$(CONFIG_LIBNVDIMM) += libnvdimm.o
@@ -33,6 +35,7 @@ obj-$(CONFIG_DAX) += dax.o
 endif
 obj-$(CONFIG_DEV_DAX) += device_dax.o
 obj-$(CONFIG_DEV_DAX_PMEM) += dax_pmem.o
+obj-$(CONFIG_INFINIBAND) += ib_core.o
 
 nfit-y := $(ACPI_SRC)/core.o
 nfit-$(CONFIG_X86_MCE) += $(ACPI_SRC)/mce.o
@@ -75,4 +78,32 @@ libnvdimm-$(CONFIG_NVDIMM_PFN) += $(NVDIMM_SRC)/pfn_devs.o
 libnvdimm-$(CONFIG_NVDIMM_DAX) += $(NVDIMM_SRC)/dax_devs.o
 libnvdimm-y += config_check.o
 
+ib_core-y := $(IBCORE)/packer.o
+ib_core-y += $(IBCORE)/ud_header.o
+ib_core-y += $(IBCORE)/verbs.o
+ib_core-y += $(IBCORE)/cq.o
+ib_core-y += $(IBCORE)/rw.o
+ib_core-y += $(IBCORE)/sysfs.o
+ib_core-y += $(IBCORE)/device.o
+ib_core-y += $(IBCORE)/fmr_pool.o
+ib_core-y += $(IBCORE)/cache.o
+ib_core-y += $(IBCORE)/netlink.o
+ib_core-y += $(IBCORE)/roce_gid_mgmt.o
+ib_core-y += $(IBCORE)/mr_pool.o
+ib_core-y += $(IBCORE)/addr.o
+ib_core-y += $(IBCORE)/sa_query.o
+ib_core-y += $(IBCORE)/multicast.o
+ib_core-y += $(IBCORE)/mad.o
+ib_core-y += $(IBCORE)/smi.o
+ib_core-y += $(IBCORE)/agent.o
+ib_core-y += $(IBCORE)/mad_rmpp.o
+ib_core-y += $(IBCORE)/security.o
+ib_core-y += $(IBCORE)/nldev.o
+
+ib_core-$(CONFIG_INFINIBAND_USER_MEM) += $(IBCORE)/umem.o
+ib_core-$(CONFIG_INFINIBAND_ON_DEMAND_PAGING) += $(IBCORE)/umem_odp.o
+ib_core-$(CONFIG_INFINIBAND_ON_DEMAND_PAGING) += $(IBCORE)/umem_rbtree.o
+ib_core-$(CONFIG_CGROUP_RDMA) += $(IBCORE)/cgroup.o
+ib_core-y += config_check.o
+
 obj-m += test/
diff --git a/tools/testing/nvdimm/config_check.c b/tools/testing/nvdimm/config_check.c
index 7dc5a0af9b54..33e7c805bfd6 100644
--- a/tools/testing/nvdimm/config_check.c
+++ b/tools/testing/nvdimm/config_check.c
@@ -14,4 +14,6 @@ void check(void)
BUILD_BUG_ON(!IS_MODULE(CONFIG_ACPI_NFIT));
BUILD_BUG_ON(!IS_MODULE(CONFIG_DEV_DAX));
BUILD_BUG_ON(!IS_MODULE(CONFIG_DEV_DAX_PMEM));
+   BUILD_BUG_ON(!IS_ENABLED(CONFIG_INFINIBAND_USER_MEM));
+   BUILD_BUG_ON(!IS_MODULE(CONFIG_INFINIBAND));
 }
diff --git a/tools/testing/nvdimm/test/iomap.c b/tools/testing/nvdimm/test/iomap.c
index e1f75a1914a1..1e439b2b01e7 100644
--- a/tools/testing/nvdimm/test/iomap.c
+++ b/tools/testing/nvdimm/test/iomap.c
@@ -17,6 +17,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -388,4 +389,17 @@ union acpi_object * __wrap_acpi_evaluate_dsm(acpi_handle handle, const guid_t *g
 }
 EXPORT_SYMBOL(__wrap_acpi_evaluate_dsm);
 
+/*
+ * This assumes that any iommu api routine we would call with this
+ * domain checks for NULL ops and either returns an error or does
+ * nothing.
+ */
+struct iommu_domain *__wrap_dma_get_iommu_domain(struct device *dev)
+{
+   static struct iommu_domain domain;
+
+   return &domain;
+}
+EXPORT_SYMBOL(__wrap_dma_get_iommu_domain);
+
 MODULE_LICENSE("GPL v2");



[PATCH v8 11/14] iommu: up-level sg_num_pages() from amd-iommu

2017-10-10 Thread Dan Williams
iommu_sg_num_pages() is a helper that walks a scatterlist and counts
pages taking segment boundaries and iommu_num_pages() into account.
Up-level it for determining the IOVA range that dma_map_ops established
at dma_map_sg() time. The intent is to iommu_unmap() the IOVA range in
advance of freeing IOVA range.

Cc: Joerg Roedel <j...@8bytes.org>
Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
---
 drivers/iommu/amd_iommu.c |   30 ++
 drivers/iommu/iommu.c |   27 +++
 include/linux/iommu.h |2 ++
 3 files changed, 31 insertions(+), 28 deletions(-)

diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index c8e1a45af182..4795b0823469 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -2459,32 +2459,6 @@ static void unmap_page(struct device *dev, dma_addr_t dma_addr, size_t size,
__unmap_single(dma_dom, dma_addr, size, dir);
 }
 
-static int sg_num_pages(struct device *dev,
-   struct scatterlist *sglist,
-   int nelems)
-{
-   unsigned long mask, boundary_size;
-   struct scatterlist *s;
-   int i, npages = 0;
-
-   mask  = dma_get_seg_boundary(dev);
-   boundary_size = mask + 1 ? ALIGN(mask + 1, PAGE_SIZE) >> PAGE_SHIFT :
-  1UL << (BITS_PER_LONG - PAGE_SHIFT);
-
-   for_each_sg(sglist, s, nelems, i) {
-   int p, n;
-
-   s->dma_address = npages << PAGE_SHIFT;
-   p = npages % boundary_size;
-   n = iommu_num_pages(sg_phys(s), s->length, PAGE_SIZE);
-   if (p + n > boundary_size)
-   npages += boundary_size - p;
-   npages += n;
-   }
-
-   return npages;
-}
-
 /*
  * The exported map_sg function for dma_ops (handles scatter-gather
  * lists).
@@ -2507,7 +2481,7 @@ static int map_sg(struct device *dev, struct scatterlist *sglist,
dma_dom  = to_dma_ops_domain(domain);
dma_mask = *dev->dma_mask;
 
-   npages = sg_num_pages(dev, sglist, nelems);
+   npages = iommu_sg_num_pages(dev, sglist, nelems);
 
address = dma_ops_alloc_iova(dev, dma_dom, npages, dma_mask);
if (address == AMD_IOMMU_MAPPING_ERROR)
@@ -2585,7 +2559,7 @@ static void unmap_sg(struct device *dev, struct scatterlist *sglist,
 
startaddr = sg_dma_address(sglist) & PAGE_MASK;
dma_dom   = to_dma_ops_domain(domain);
-   npages= sg_num_pages(dev, sglist, nelems);
+   npages= iommu_sg_num_pages(dev, sglist, nelems);
 
__unmap_single(dma_dom, startaddr, npages << PAGE_SHIFT, dir);
 }
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 3de5c0bcb5cc..cfe6eeea3578 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -33,6 +33,7 @@
 #include 
 #include 
 #include 
+#include 
 
 static struct kset *iommu_group_kset;
 static DEFINE_IDA(iommu_group_ida);
@@ -1631,6 +1632,32 @@ size_t iommu_unmap_fast(struct iommu_domain *domain,
 }
 EXPORT_SYMBOL_GPL(iommu_unmap_fast);
 
+int iommu_sg_num_pages(struct device *dev, struct scatterlist *sglist,
+   int nelems)
+{
+   unsigned long mask, boundary_size;
+   struct scatterlist *s;
+   int i, npages = 0;
+
+   mask = dma_get_seg_boundary(dev);
+   boundary_size = mask + 1 ? ALIGN(mask + 1, PAGE_SIZE) >> PAGE_SHIFT
+   : 1UL << (BITS_PER_LONG - PAGE_SHIFT);
+
+   for_each_sg(sglist, s, nelems, i) {
+   int p, n;
+
+   s->dma_address = npages << PAGE_SHIFT;
+   p = npages % boundary_size;
+   n = iommu_num_pages(sg_phys(s), s->length, PAGE_SIZE);
+   if (p + n > boundary_size)
+   npages += boundary_size - p;
+   npages += n;
+   }
+
+   return npages;
+}
+EXPORT_SYMBOL_GPL(iommu_sg_num_pages);
+
 size_t default_iommu_map_sg(struct iommu_domain *domain, unsigned long iova,
 struct scatterlist *sg, unsigned int nents, int prot)
 {
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index a7f2ac689d29..5b2d20e1475a 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -303,6 +303,8 @@ extern size_t iommu_unmap(struct iommu_domain *domain, unsigned long iova,
  size_t size);
 extern size_t iommu_unmap_fast(struct iommu_domain *domain,
   unsigned long iova, size_t size);
+extern int iommu_sg_num_pages(struct device *dev, struct scatterlist *sglist,
+   int nelems);
 extern size_t default_iommu_map_sg(struct iommu_domain *domain, unsigned long iova,
struct scatterlist *sg,unsigned int nents,
int prot);



[PATCH v8 10/14] device-dax: wire up ->lease_direct()

2017-10-10 Thread Dan Williams
The only event that will break a lease_direct lease in the device-dax
case is the device shutdown path where the physical pages might get
assigned to another device.

Cc: Jan Kara <j...@suse.cz>
Cc: Jeff Moyer <jmo...@redhat.com>
Cc: Christoph Hellwig <h...@lst.de>
Cc: Ross Zwisler <ross.zwis...@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
---
 drivers/dax/Kconfig   |1 +
 drivers/dax/device.c  |4 
 fs/Kconfig|4 
 fs/Makefile   |3 ++-
 fs/mapdirect.c|3 ++-
 include/linux/mapdirect.h |5 -
 6 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig
index b79aa8f7a497..be03d4dbe646 100644
--- a/drivers/dax/Kconfig
+++ b/drivers/dax/Kconfig
@@ -8,6 +8,7 @@ if DAX
 config DEV_DAX
tristate "Device DAX: direct access mapping device"
depends on TRANSPARENT_HUGEPAGE
+   depends on FILE_LOCKING
help
  Support raw access to differentiated (persistence, bandwidth,
  latency...) memory via an mmap(2) capable character
diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index e9f3b3e4bbf4..fa75004185c4 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -10,6 +10,7 @@
  * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
  * General Public License for more details.
  */
+#include 
 #include 
 #include 
 #include 
@@ -430,6 +431,7 @@ static int dev_dax_fault(struct vm_fault *vmf)
 static const struct vm_operations_struct dax_vm_ops = {
.fault = dev_dax_fault,
.huge_fault = dev_dax_huge_fault,
+   .lease_direct = map_direct_lease,
 };
 
 static int dax_mmap(struct file *filp, struct vm_area_struct *vma)
@@ -540,8 +542,10 @@ static void kill_dev_dax(struct dev_dax *dev_dax)
 {
struct dax_device *dax_dev = dev_dax->dax_dev;
struct inode *inode = dax_inode(dax_dev);
+   const bool wait = true;
 
kill_dax(dax_dev);
+   break_layout(inode, wait);
unmap_mapping_range(inode->i_mapping, 0, 0, 1);
 }
 
diff --git a/fs/Kconfig b/fs/Kconfig
index a7b31a96a753..3668cfb046d5 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -59,6 +59,10 @@ config FS_DAX_PMD
depends on ZONE_DEVICE
depends on TRANSPARENT_HUGEPAGE
 
+config DAX_MAP_DIRECT
+   bool
+   default FS_DAX || DEV_DAX
+
 endif # BLOCK
 
 # Posix ACL utility routines
diff --git a/fs/Makefile b/fs/Makefile
index c0e791d235d8..21b8fb104656 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -29,7 +29,8 @@ obj-$(CONFIG_TIMERFD) += timerfd.o
 obj-$(CONFIG_EVENTFD)  += eventfd.o
 obj-$(CONFIG_USERFAULTFD)  += userfaultfd.o
 obj-$(CONFIG_AIO)   += aio.o
-obj-$(CONFIG_FS_DAX)   += dax.o mapdirect.o
+obj-$(CONFIG_FS_DAX)   += dax.o
+obj-$(CONFIG_DAX_MAP_DIRECT)   += mapdirect.o
 obj-$(CONFIG_FS_ENCRYPTION)+= crypto/
 obj-$(CONFIG_FILE_LOCKING)  += locks.o
 obj-$(CONFIG_COMPAT)   += compat.o compat_ioctl.o
diff --git a/fs/mapdirect.c b/fs/mapdirect.c
index c6954033fc1a..dd4a16f9ffc6 100644
--- a/fs/mapdirect.c
+++ b/fs/mapdirect.c
@@ -218,7 +218,7 @@ static const struct lock_manager_operations lease_direct_lm_ops = {
.lm_change = lease_direct_lm_change,
 };
 
-static struct lease_direct *map_direct_lease(struct vm_area_struct *vma,
+struct lease_direct *map_direct_lease(struct vm_area_struct *vma,
void (*lds_break_fn)(void *), void *lds_owner)
 {
struct file *file = vma->vm_file;
@@ -272,6 +272,7 @@ static struct lease_direct *map_direct_lease(struct vm_area_struct *vma,
kfree(lds);
return ERR_PTR(rc);
 }
+EXPORT_SYMBOL_GPL(map_direct_lease);
 
 struct lease_direct *generic_map_direct_lease(struct vm_area_struct *vma,
void (*break_fn)(void *), void *owner)
diff --git a/include/linux/mapdirect.h b/include/linux/mapdirect.h
index e0df6ac5795a..6695fdcf8009 100644
--- a/include/linux/mapdirect.h
+++ b/include/linux/mapdirect.h
@@ -26,13 +26,15 @@ struct lease_direct {
struct lease_direct_state *lds;
 };
 
-#if IS_ENABLED(CONFIG_FS_DAX)
+#if IS_ENABLED(CONFIG_DAX_MAP_DIRECT)
 struct map_direct_state *map_direct_register(int fd, struct vm_area_struct *vma);
 bool test_map_direct_valid(struct map_direct_state *mds);
 void generic_map_direct_open(struct vm_area_struct *vma);
 void generic_map_direct_close(struct vm_area_struct *vma);
 struct lease_direct *generic_map_direct_lease(struct vm_area_struct *vma,
void (*ld_break_fn)(void *), void *ld_owner);
+struct lease_direct *map_direct_lease(struct vm_area_struct *vma,
+   void (*lds_break_fn)(void *), void *lds_owner);
 void map_direct_lease_destroy(struct lease_direct *ld);
 #else
 static inline struct map_direct_state *map_direct_register(int fd,
@@ -47,6 +49,7 @@ static inline bool test_map_direct_valid(struct 
map_direct_

[PATCH v8 08/14] fs, mapdirect: introduce ->lease_direct()

2017-10-10 Thread Dan Williams
Provide a vma operation that registers a lease that is broken by
break_layout(). This is motivated by a need to stop in-progress RDMA
when the block-map of a DAX-file changes. I.e. since DAX gives
direct-access to filesystem blocks we can not allow those blocks to move
or change state while they are under active RDMA. So, if the filesystem
determines it needs to move blocks it can revoke device access before
proceeding.

Cc: Jan Kara <j...@suse.cz>
Cc: Jeff Moyer <jmo...@redhat.com>
Cc: Christoph Hellwig <h...@lst.de>
Cc: Dave Chinner <da...@fromorbit.com>
Cc: "Darrick J. Wong" <darrick.w...@oracle.com>
Cc: Ross Zwisler <ross.zwis...@linux.intel.com>
Cc: Jeff Layton <jlay...@poochiereds.net>
Cc: "J. Bruce Fields" <bfie...@fieldses.org>
Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
---
 fs/mapdirect.c|  144 +
 include/linux/mapdirect.h |   14 
 include/linux/mm.h|8 +++
 3 files changed, 166 insertions(+)

diff --git a/fs/mapdirect.c b/fs/mapdirect.c
index 9f4dd7395dcd..c6954033fc1a 100644
--- a/fs/mapdirect.c
+++ b/fs/mapdirect.c
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -32,12 +33,25 @@ struct map_direct_state {
struct vm_area_struct *mds_vma;
 };
 
+struct lease_direct_state {
+   void *lds_owner;
+   struct file *lds_file;
+   unsigned long lds_state;
+   void (*lds_break_fn)(void *lds_owner);
+   struct delayed_work lds_work;
+};
+
 bool test_map_direct_valid(struct map_direct_state *mds)
 {
return test_bit(MAPDIRECT_VALID, &mds->mds_state);
 }
 EXPORT_SYMBOL_GPL(test_map_direct_valid);
 
+static bool test_map_direct_broken(struct map_direct_state *mds)
+{
+   return test_bit(MAPDIRECT_BREAK, &mds->mds_state);
+}
+
 static void put_map_direct(struct map_direct_state *mds)
 {
if (!atomic_dec_and_test(&mds->mds_ref))
@@ -168,6 +182,136 @@ static const struct lock_manager_operations map_direct_lm_ops = {
.lm_setup = map_direct_lm_setup,
 };
 
+static void lease_direct_invalidate(struct work_struct *work)
+{
+   struct lease_direct_state *lds;
+   void *owner;
+
+   lds = container_of(work, typeof(*lds), lds_work.work);
+   owner = lds;
+   lds->lds_break_fn(lds->lds_owner);
+   vfs_setlease(lds->lds_file, F_UNLCK, NULL, &owner);
+}
+
+static bool lease_direct_lm_break(struct file_lock *fl)
+{
+   struct lease_direct_state *lds = fl->fl_owner;
+
+   if (!test_and_set_bit(MAPDIRECT_BREAK, &lds->lds_state))
+   schedule_delayed_work(&lds->lds_work, lease_break_time * HZ);
+
+   /* Tell the core lease code to wait for delayed work completion */
+   fl->fl_break_time = 0;
+
+   return false;
+}
+
+static int lease_direct_lm_change(struct file_lock *fl, int arg,
+   struct list_head *dispose)
+{
+   WARN_ON(!(arg & F_UNLCK));
+   return lease_modify(fl, arg, dispose);
+}
+
+static const struct lock_manager_operations lease_direct_lm_ops = {
+   .lm_break = lease_direct_lm_break,
+   .lm_change = lease_direct_lm_change,
+};
+
+static struct lease_direct *map_direct_lease(struct vm_area_struct *vma,
+   void (*lds_break_fn)(void *), void *lds_owner)
+{
+   struct file *file = vma->vm_file;
+   struct lease_direct_state *lds;
+   struct lease_direct *ld;
+   struct file_lock *fl;
+   int rc = -ENOMEM;
+   void *owner;
+
+   ld = kzalloc(sizeof(*ld) + sizeof(*lds), GFP_KERNEL);
+   if (!ld)
+   return ERR_PTR(-ENOMEM);
+   INIT_LIST_HEAD(&ld->list);
+   lds = (struct lease_direct_state *)(ld + 1);
+   owner = lds;
+   ld->lds = lds;
+   lds->lds_break_fn = lds_break_fn;
+   lds->lds_owner = lds_owner;
+   INIT_DELAYED_WORK(&lds->lds_work, lease_direct_invalidate);
+   lds->lds_file = get_file(file);
+
+   fl = locks_alloc_lock();
+   if (!fl)
+   goto err_lock_alloc;
+
+   locks_init_lock(fl);
+   fl->fl_lmops = &lease_direct_lm_ops;
+   fl->fl_flags = FL_LAYOUT;
+   fl->fl_type = F_RDLCK;
+   fl->fl_end = OFFSET_MAX;
+   fl->fl_owner = lds;
+   fl->fl_pid = current->tgid;
+   fl->fl_file = file;
+
+   rc = vfs_setlease(file, fl->fl_type, &fl, &owner);
+   if (rc)
+   goto err_setlease;
+   if (fl) {
+   WARN_ON(1);
+   owner = lds;
+   vfs_setlease(file, F_UNLCK, NULL, );
+   owner = NULL;
+   rc = -ENXIO;
+   goto err_setlease;
+   }
+
+   return ld;
+err_setlease:
+   locks_free_lock(fl);
+err_lock_alloc:
+   kfree(lds);
+   return ERR_PTR(rc);
+}
+
+struct lease_direct *generic_map_direct_lease(struct vm_area_struct *vma,
+   void (*break_fn)(void *), void *owner)
+{
+   struct 

[PATCH v8 09/14] xfs: wire up ->lease_direct()

2017-10-10 Thread Dan Williams
A 'lease_direct' lease requires that the vma have a valid MAP_DIRECT
mapping established. For xfs we use the generic_map_direct_lease()
handler for ->lease_direct(). It establishes a new lease and then checks
if the MAP_DIRECT mapping has been broken. We want to be sure that the
process will receive notification that the MAP_DIRECT mapping is being
torn down so it knows why other code paths are throwing failures.

For example in the RDMA/ibverbs case we want ibv_reg_mr() to fail if the
MAP_DIRECT mapping is invalid or in the process of being invalidated.

Cc: Jan Kara <j...@suse.cz>
Cc: Jeff Moyer <jmo...@redhat.com>
Cc: Christoph Hellwig <h...@lst.de>
Cc: Dave Chinner <da...@fromorbit.com>
Cc: "Darrick J. Wong" <darrick.w...@oracle.com>
Cc: Ross Zwisler <ross.zwis...@linux.intel.com>
Cc: Jeff Layton <jlay...@poochiereds.net>
Cc: "J. Bruce Fields" <bfie...@fieldses.org>
Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
---
 fs/xfs/xfs_file.c |5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 4bee027c9366..bc512a9a8df5 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1157,6 +1157,7 @@ static const struct vm_operations_struct xfs_file_vm_direct_ops = {
 
.open   = generic_map_direct_open,
.close  = generic_map_direct_close,
+   .lease_direct   = generic_map_direct_lease,
 };
 
 static const struct vm_operations_struct xfs_file_vm_ops = {
@@ -1209,8 +1210,8 @@ xfs_file_mmap_direct(
vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE;
 
/*
-* generic_map_direct_{open,close} expect ->vm_private_data is
-* set to the result of map_direct_register
+* generic_map_direct_{open,close,lease} expect
+* ->vm_private_data is set to the result of map_direct_register
 */
vma->vm_private_data = mds;
return 0;



[PATCH v8 07/14] iommu, dma-mapping: introduce dma_get_iommu_domain()

2017-10-10 Thread Dan Williams
Add a dma-mapping api helper to retrieve the generic iommu_domain for a
device.  The motivation for this interface is making RDMA transfers to
DAX mappings safe. If the DAX file's block map changes we need to be able to
reliably stop accesses to blocks that have been freed or re-assigned to
a new file. With the iommu_domain and a callback from the DAX filesystem
the kernel can safely revoke access to a DMA device. The process that
performed the RDMA memory registration is also notified of this
revocation event, but the kernel can not otherwise be in the position of
waiting for userspace to quiesce the device.

Since PMEM+DAX is currently only enabled for x86, we only update the x86
iommu drivers.

Cc: Marek Szyprowski <m.szyprow...@samsung.com>
Cc: Robin Murphy <robin.mur...@arm.com>
Cc: Greg Kroah-Hartman <gre...@linuxfoundation.org>
Cc: Joerg Roedel <j...@8bytes.org>
Cc: David Woodhouse <dw...@infradead.org>
Cc: Ashok Raj <ashok@intel.com>
Cc: Jan Kara <j...@suse.cz>
Cc: Jeff Moyer <jmo...@redhat.com>
Cc: Christoph Hellwig <h...@lst.de>
Cc: Dave Chinner <da...@fromorbit.com>
Cc: "Darrick J. Wong" <darrick.w...@oracle.com>
Cc: Ross Zwisler <ross.zwis...@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
---
 drivers/base/dma-mapping.c  |   10 ++
 drivers/iommu/amd_iommu.c   |   10 ++
 drivers/iommu/intel-iommu.c |   15 +++
 include/linux/dma-mapping.h |3 +++
 4 files changed, 38 insertions(+)

diff --git a/drivers/base/dma-mapping.c b/drivers/base/dma-mapping.c
index e584eddef0a7..fdb9764f95a4 100644
--- a/drivers/base/dma-mapping.c
+++ b/drivers/base/dma-mapping.c
@@ -369,3 +369,13 @@ void dma_deconfigure(struct device *dev)
of_dma_deconfigure(dev);
acpi_dma_deconfigure(dev);
 }
+
+struct iommu_domain *dma_get_iommu_domain(struct device *dev)
+{
+   const struct dma_map_ops *ops = get_dma_ops(dev);
+
+   if (ops && ops->get_iommu)
+   return ops->get_iommu(dev);
+   return NULL;
+}
+EXPORT_SYMBOL(dma_get_iommu_domain);
diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index 51f8215877f5..c8e1a45af182 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -2271,6 +2271,15 @@ static struct protection_domain *get_domain(struct device *dev)
return domain;
 }
 
+static struct iommu_domain *amd_dma_get_iommu(struct device *dev)
+{
+   struct protection_domain *domain = get_domain(dev);
+
+   if (IS_ERR(domain))
+   return NULL;
+   return &domain->domain;
+}
+
 static void update_device_table(struct protection_domain *domain)
 {
struct iommu_dev_data *dev_data;
@@ -2689,6 +2698,7 @@ static const struct dma_map_ops amd_iommu_dma_ops = {
.unmap_sg   = unmap_sg,
.dma_supported  = amd_iommu_dma_supported,
.mapping_error  = amd_iommu_mapping_error,
+   .get_iommu  = amd_dma_get_iommu,
 };
 
 static int init_reserved_iova_ranges(void)
diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 6784a05dd6b2..f3f4939cebad 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -3578,6 +3578,20 @@ static int iommu_no_mapping(struct device *dev)
return 0;
 }
 
+static struct iommu_domain *intel_dma_get_iommu(struct device *dev)
+{
+   struct dmar_domain *domain;
+
+   if (iommu_no_mapping(dev))
+   return NULL;
+
+   domain = get_valid_domain_for_dev(dev);
+   if (!domain)
+   return NULL;
+
+   return &domain->domain;
+}
+
 static dma_addr_t __intel_map_single(struct device *dev, phys_addr_t paddr,
 size_t size, int dir, u64 dma_mask)
 {
@@ -3872,6 +3886,7 @@ const struct dma_map_ops intel_dma_ops = {
.map_page = intel_map_page,
.unmap_page = intel_unmap_page,
.mapping_error = intel_mapping_error,
+   .get_iommu = intel_dma_get_iommu,
 #ifdef CONFIG_X86
.dma_supported = x86_dma_supported,
 #endif
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index 29ce9815da87..aa62df1d0d72 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -128,6 +128,7 @@ struct dma_map_ops {
   enum dma_data_direction dir);
int (*mapping_error)(struct device *dev, dma_addr_t dma_addr);
int (*dma_supported)(struct device *dev, u64 mask);
+   struct iommu_domain *(*get_iommu)(struct device *dev);
 #ifdef ARCH_HAS_DMA_GET_REQUIRED_MASK
u64 (*get_required_mask)(struct device *dev);
 #endif
@@ -221,6 +222,8 @@ static inline const struct dma_map_ops *get_dma_ops(struct device *dev)
 }
 #endif
 
+extern struct iommu_domain *dma_get_iommu_domain(struct device *dev);
+
 static inline dma_addr_t dma_map_single_attrs(struct device *dev, void *ptr,

[PATCH v8 06/14] xfs: wire up MAP_DIRECT

2017-10-10 Thread Dan Williams
MAP_DIRECT is an mmap(2) flag with the following semantics:

  MAP_DIRECT
  When specified with MAP_SHARED_VALIDATE, sets up a file lease with the
  same lifetime as the mapping. Unlike a typical F_RDLCK lease this lease
  is broken when a "lease breaker" attempts to write(2), change the block
  map (fallocate), or change the size of the file. Otherwise the mechanism
  of a lease break is identical to the typical lease break case where the
  lease needs to be removed (munmap) within the number of seconds
  specified by /proc/sys/fs/lease-break-time. If the lease holder fails to
  remove the lease in time the kernel will invalidate the mapping and
  force all future accesses to the mapping to trigger SIGBUS.

  In addition to lease break timeouts causing faults in the mapping to
  result in SIGBUS, other states of the file will trigger SIGBUS at fault
  time:

  * The fault would trigger the filesystem to allocate blocks
  * The fault would trigger the filesystem to perform extent conversion

  In other words, MAP_DIRECT expects and enforces a fully allocated file
  where faults can be satisfied without modifying block map metadata.

  An unprivileged process may establish a MAP_DIRECT mapping on a file
  whose UID (owner) matches the filesystem UID of the  process. A process
  with the CAP_LEASE capability may establish a MAP_DIRECT mapping on
  arbitrary files

  ERRORS
  EACCES Beyond the typical mmap(2) conditions that trigger EACCES
  MAP_DIRECT also requires the permission to set a file lease.

  EOPNOTSUPP The filesystem explicitly does not support the flag

  EPERM The file does not permit MAP_DIRECT mappings. Potential reasons
  are that DAX access is not available or the file has reflink extents.

  SIGBUS Attempted to write a MAP_DIRECT mapping at a file offset that
 might require block-map updates, or the lease timed out and the
 kernel invalidated the mapping.
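
  For illustration, a minimal userspace sketch of the intended flow.
  This is a sketch only: it assumes the MAP_SHARED_VALIDATE and
  MAP_DIRECT definitions from this series are visible via <sys/mman.h>,
  and /mnt/dax/file is a hypothetical, fully preallocated DAX file.

  #include <sys/mman.h>
  #include <fcntl.h>
  #include <stdio.h>
  #include <unistd.h>

  int main(void)
  {
          size_t len = 2UL << 20;
          int fd = open("/mnt/dax/file", O_RDWR);
          void *p;

          if (fd < 0)
                  return 1;
          p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_SHARED_VALIDATE | MAP_DIRECT, fd, 0);
          if (p == MAP_FAILED) {
                  /* EOPNOTSUPP: fs lacks support; EPERM: no DAX, or
                   * the file has reflink extents (see ERRORS above) */
                  perror("mmap");
                  return 1;
          }
          /* ... userspace-flushed stores, RDMA registration, etc ... */
          munmap(p, len);         /* dropping the mapping drops the lease */
          close(fd);
          return 0;
  }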

Cc: Jan Kara <j...@suse.cz>
Cc: Arnd Bergmann <a...@arndb.de>
Cc: Jeff Moyer <jmo...@redhat.com>
Cc: Christoph Hellwig <h...@lst.de>
Cc: Dave Chinner <da...@fromorbit.com>
Cc: Alexander Viro <v...@zeniv.linux.org.uk>
Cc: "Darrick J. Wong" <darrick.w...@oracle.com>
Cc: Ross Zwisler <ross.zwis...@linux.intel.com>
Cc: Jeff Layton <jlay...@poochiereds.net>
Cc: "J. Bruce Fields" <bfie...@fieldses.org>
Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
---
 fs/xfs/Kconfig  |2 -
 fs/xfs/xfs_file.c   |  103 ++-
 include/linux/mman.h|3 +
 include/uapi/asm-generic/mman.h |1 
 4 files changed, 106 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/Kconfig b/fs/xfs/Kconfig
index f62fc6629abb..f8765653a438 100644
--- a/fs/xfs/Kconfig
+++ b/fs/xfs/Kconfig
@@ -112,4 +112,4 @@ config XFS_ASSERT_FATAL
 
 config XFS_LAYOUT
def_bool y
-   depends on EXPORTFS_BLOCK_OPS
+   depends on EXPORTFS_BLOCK_OPS || FS_DAX
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index ebdd0bd2b261..4bee027c9366 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -40,12 +40,22 @@
 #include "xfs_iomap.h"
 #include "xfs_reflink.h"
 
+#include 
 #include 
 #include 
 #include 
+#include 
 #include 
 
 static const struct vm_operations_struct xfs_file_vm_ops;
+static const struct vm_operations_struct xfs_file_vm_direct_ops;
+
+static bool
+xfs_vma_is_direct(
+   struct vm_area_struct   *vma)
+{
+   return vma->vm_ops == _file_vm_direct_ops;
+}
 
 /*
  * Clear the specified ranges to zero through either the pagecache or DAX.
@@ -1009,6 +1019,22 @@ xfs_file_llseek(
 }
 
 /*
+ * MAP_DIRECT faults can only be serviced while the FL_LAYOUT lease is
+ * valid. See map_direct_invalidate.
+ */
+static int
+xfs_can_fault_direct(
+   struct vm_area_struct   *vma)
+{
+   if (!xfs_vma_is_direct(vma))
+   return 0;
+
+   if (!test_map_direct_valid(vma->vm_private_data))
+   return VM_FAULT_SIGBUS;
+   return 0;
+}
+
+/*
  * Locking for serialisation of IO during page faults. This results in a lock
  * ordering of:
  *
@@ -1024,7 +1050,8 @@ __xfs_filemap_fault(
enum page_entry_sizepe_size,
boolwrite_fault)
 {
-   struct inode*inode = file_inode(vmf->vma->vm_file);
+   struct vm_area_struct   *vma = vmf->vma;
+   struct inode*inode = file_inode(vma->vm_file);
struct xfs_inode*ip = XFS_I(inode);
int ret;
 
@@ -1032,10 +1059,14 @@ __xfs_filemap_fault(
 
if (write_fault) {
sb_start_pagefault(inode->i_sb);
-   file_update_time(vmf->vma->vm_file);
+   file_update_time(vma->vm_file);
}
 
xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
+   ret = xfs_can_fault_direct(vma);
+   if (re

[PATCH v8 04/14] xfs: prepare xfs_break_layouts() for reuse with MAP_DIRECT

2017-10-10 Thread Dan Williams
Move xfs_break_layouts() to its own compilation unit so that it can be
used for both pnfs layouts and MAP_DIRECT mappings.

Cc: Jan Kara <j...@suse.cz>
Cc: Jeff Moyer <jmo...@redhat.com>
Cc: Christoph Hellwig <h...@lst.de>
Cc: Dave Chinner <da...@fromorbit.com>
Cc: "Darrick J. Wong" <darrick.w...@oracle.com>
Cc: Ross Zwisler <ross.zwis...@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
---
 fs/xfs/Kconfig  |4 
 fs/xfs/Makefile |1 +
 fs/xfs/xfs_layout.c |   42 ++
 fs/xfs/xfs_layout.h |   13 +
 fs/xfs/xfs_pnfs.c   |   30 --
 fs/xfs/xfs_pnfs.h   |   10 ++
 6 files changed, 62 insertions(+), 38 deletions(-)
 create mode 100644 fs/xfs/xfs_layout.c
 create mode 100644 fs/xfs/xfs_layout.h

diff --git a/fs/xfs/Kconfig b/fs/xfs/Kconfig
index 1b98cfa342ab..f62fc6629abb 100644
--- a/fs/xfs/Kconfig
+++ b/fs/xfs/Kconfig
@@ -109,3 +109,7 @@ config XFS_ASSERT_FATAL
  result in warnings.
 
  This behavior can be modified at runtime via sysfs.
+
+config XFS_LAYOUT
+   def_bool y
+   depends on EXPORTFS_BLOCK_OPS
diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index a6e955bfead8..d44135107490 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -135,3 +135,4 @@ xfs-$(CONFIG_XFS_POSIX_ACL) += xfs_acl.o
 xfs-$(CONFIG_SYSCTL)   += xfs_sysctl.o
 xfs-$(CONFIG_COMPAT)   += xfs_ioctl32.o
 xfs-$(CONFIG_EXPORTFS_BLOCK_OPS)   += xfs_pnfs.o
+xfs-$(CONFIG_XFS_LAYOUT)   += xfs_layout.o
diff --git a/fs/xfs/xfs_layout.c b/fs/xfs/xfs_layout.c
new file mode 100644
index ..71d95e1a910a
--- /dev/null
+++ b/fs/xfs/xfs_layout.c
@@ -0,0 +1,42 @@
+/*
+ * Copyright (c) 2014 Christoph Hellwig.
+ */
+#include "xfs.h"
+#include "xfs_format.h"
+#include "xfs_log_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_sb.h"
+#include "xfs_mount.h"
+#include "xfs_inode.h"
+
+#include 
+
+/*
+ * Ensure that we do not have any outstanding pNFS layouts that can be used by
+ * clients to directly read from or write to this inode.  This must be called
+ * before every operation that can remove blocks from the extent map.
+ * Additionally we call it during the write operation, where aren't concerned
+ * about exposing unallocated blocks but just want to provide basic
+ * synchronization between a local writer and pNFS clients.  mmap writes would
+ * also benefit from this sort of synchronization, but due to the tricky 
locking
+ * rules in the page fault path we don't bother.
+ */
+int
+xfs_break_layouts(
+   struct inode*inode,
+   uint*iolock)
+{
+   struct xfs_inode*ip = XFS_I(inode);
+   int error;
+
+   ASSERT(xfs_isilocked(ip, XFS_IOLOCK_SHARED|XFS_IOLOCK_EXCL));
+
+   while ((error = break_layout(inode, false) == -EWOULDBLOCK)) {
+   xfs_iunlock(ip, *iolock);
+   error = break_layout(inode, true);
+   *iolock = XFS_IOLOCK_EXCL;
+   xfs_ilock(ip, *iolock);
+   }
+
+   return error;
+}
diff --git a/fs/xfs/xfs_layout.h b/fs/xfs/xfs_layout.h
new file mode 100644
index ..f848ee78cc93
--- /dev/null
+++ b/fs/xfs/xfs_layout.h
@@ -0,0 +1,13 @@
+#ifndef _XFS_LAYOUT_H
+#define _XFS_LAYOUT_H 1
+
+#ifdef CONFIG_XFS_LAYOUT
+int xfs_break_layouts(struct inode *inode, uint *iolock);
+#else
+static inline int
+xfs_break_layouts(struct inode *inode, uint *iolock)
+{
+   return 0;
+}
+#endif /* CONFIG_XFS_LAYOUT */
+#endif /* _XFS_LAYOUT_H */
diff --git a/fs/xfs/xfs_pnfs.c b/fs/xfs/xfs_pnfs.c
index 2f2dc3c09ad0..8ec72220e73b 100644
--- a/fs/xfs/xfs_pnfs.c
+++ b/fs/xfs/xfs_pnfs.c
@@ -20,36 +20,6 @@
 #include "xfs_pnfs.h"
 
 /*
- * Ensure that we do not have any outstanding pNFS layouts that can be used by
- * clients to directly read from or write to this inode.  This must be called
- * before every operation that can remove blocks from the extent map.
- * Additionally we call it during the write operation, where aren't concerned
- * about exposing unallocated blocks but just want to provide basic
- * synchronization between a local writer and pNFS clients.  mmap writes would
- * also benefit from this sort of synchronization, but due to the tricky 
locking
- * rules in the page fault path we don't bother.
- */
-int
-xfs_break_layouts(
-   struct inode*inode,
-   uint*iolock)
-{
-   struct xfs_inode*ip = XFS_I(inode);
-   int error;
-
-   ASSERT(xfs_isilocked(ip, XFS_IOLOCK_SHARED|XFS_IOLOCK_EXCL));
-
-   while ((error = break_layout(inode, false) == -EWOULDBLOCK)) {
-   xfs_iunlock(ip, *iolock);
-   error = break_layout(inode, true);
-   *iolock = XFS_IOLOCK_EXCL;
-   

[PATCH v8 00/14] MAP_DIRECT for DAX RDMA and userspace flush

2017-10-10 Thread Dan Williams
Changes since v7 [1]:
* Fix IOVA reuse race by leaving the dma scatterlist mapped until
  unregistration time. Use iommu_unmap() in ib_umem_lease_break() to
  force-invalidate the ibverbs memory registration. (David Woodhouse)

* Introduce iomap_can_allocate() as a way to check if any layouts are
  present in the mmap write-fault path to prevent block map changes, and
  start the leak break process when an allocating write-fault occurs.
  This also removes the i_mapdcount bloat of 'struct inode' from v7.
  (Dave Chinner)

* Provide generic_map_direct_{open,close,lease} to cleanup the
  filesystem wiring to implement MAP_DIRECT support (Dave Chinner)

* Abandon (defer to a potential new fcntl()) support for using
  MAP_DIRECT on non-DAX files. With this change we can validate the
  inode is MAP_DIRECT capable just once at mmap time rather than every
  fault.  (Dave Chinner)

* Arrange for lease_direct leases to also wait the
  /proc/sys/fs/lease-break-time period before calling break_fn. For
  example, allow the lease-holder time to quiesce RDMA operations before
  the iommu starts throwing io-faults.

* Switch intel-iommu to use iommu_num_sg_pages().

[1]: https://lists.01.org/pipermail/linux-nvdimm/2017-October/012707.html

---

MAP_DIRECT is a mechanism that allows an application to establish a
mapping where the kernel will not change the block-map, or otherwise
dirty the block-map metadata of a file without notification. It supports
a "flush from userspace" model where persistent memory applications can
bypass the overhead of ongoing coordination of writes with the
filesystem, and it provides safety to RDMA operations involving DAX
mappings.

The kernel always has the ability to revoke access and convert the file
back to normal operation after performing a "lease break". Similar to
fcntl leases, there is no way for userspace to cancel the lease break
process once it has started; it can only delay it via the
/proc/sys/fs/lease-break-time setting.

MAP_DIRECT enables XFS to supplant the device-dax interface for
mmap-write access to persistent memory with no ongoing coordination with
the filesystem via fsync/msync syscalls.

---

Dan Williams (14):
  mm: introduce MAP_SHARED_VALIDATE, a mechanism to safely define new mmap 
flags
  fs, mm: pass fd to ->mmap_validate()
  fs: MAP_DIRECT core
  xfs: prepare xfs_break_layouts() for reuse with MAP_DIRECT
  fs, xfs, iomap: introduce iomap_can_allocate()
  xfs: wire up MAP_DIRECT
  iommu, dma-mapping: introduce dma_get_iommu_domain()
  fs, mapdirect: introduce ->lease_direct()
  xfs: wire up ->lease_direct()
  device-dax: wire up ->lease_direct()
  iommu: up-level sg_num_pages() from amd-iommu
  iommu/vt-d: use iommu_num_sg_pages
  IB/core: use MAP_DIRECT to fix / enable RDMA to DAX mappings
  tools/testing/nvdimm: enable rdma unit tests


 arch/alpha/include/uapi/asm/mman.h   |1 
 arch/mips/include/uapi/asm/mman.h|1 
 arch/mips/kernel/vdso.c  |2 
 arch/parisc/include/uapi/asm/mman.h  |1 
 arch/tile/mm/elf.c   |3 
 arch/x86/mm/mpx.c|3 
 arch/xtensa/include/uapi/asm/mman.h  |1 
 drivers/base/dma-mapping.c   |   10 +
 drivers/dax/Kconfig  |1 
 drivers/dax/device.c |4 
 drivers/infiniband/core/umem.c   |   90 +-
 drivers/iommu/amd_iommu.c|   40 +--
 drivers/iommu/intel-iommu.c  |   30 +-
 drivers/iommu/iommu.c|   27 ++
 fs/Kconfig   |5 
 fs/Makefile  |1 
 fs/aio.c |2 
 fs/mapdirect.c   |  382 ++
 fs/xfs/Kconfig   |4 
 fs/xfs/Makefile  |1 
 fs/xfs/xfs_file.c|  103 +++
 fs/xfs/xfs_iomap.c   |3 
 fs/xfs/xfs_layout.c  |   45 +++
 fs/xfs/xfs_layout.h  |   13 +
 fs/xfs/xfs_pnfs.c|   30 --
 fs/xfs/xfs_pnfs.h|   10 -
 include/linux/dma-mapping.h  |3 
 include/linux/fs.h   |2 
 include/linux/iomap.h|   10 +
 include/linux/iommu.h|2 
 include/linux/mapdirect.h|   57 
 include/linux/mm.h   |   17 +
 include/linux/mman.h |   42 +++
 include/rdma/ib_umem.h   |8 +
 include/uapi/asm-generic/mman-common.h   |1 
 include/uapi/asm-generic/mman.h  |1 
 ipc/shm.c  

[PATCH v8 01/14] mm: introduce MAP_SHARED_VALIDATE, a mechanism to safely define new mmap flags

2017-10-10 Thread Dan Williams
The mmap(2) syscall suffers from the ABI anti-pattern of not validating
unknown flags. However, proposals like MAP_SYNC and MAP_DIRECT need a
mechanism to define new behavior that is known to fail on older kernels
without the support. Define a new MAP_SHARED_VALIDATE flag pattern that
is guaranteed to fail on all legacy mmap implementations.

It is worth noting that the original proposal was for a standalone
MAP_VALIDATE flag. However, when that  could not be supported by all
archs Linus observed:

I see why you *think* you want a bitmap. You think you want
a bitmap because you want to make MAP_VALIDATE be part of MAP_SYNC
etc, so that people can do

ret = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED
| MAP_SYNC, fd, 0);

and "know" that MAP_SYNC actually takes.

And I'm saying that whole wish is bogus. You're fundamentally
depending on special semantics, just make it explicit. It's already
not portable, so don't try to make it so.

Rename that MAP_VALIDATE as MAP_SHARED_VALIDATE, make it have a value
of 0x3, and make people do

ret = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED_VALIDATE
| MAP_SYNC, fd, 0);

and then the kernel side is easier too (none of that random garbage
playing games with looking at the "MAP_VALIDATE bit", but just another
case statement in that map type thing.

Boom. Done.

Similar to ->fallocate() we also want the ability to validate the
support for new flags on a per ->mmap() 'struct file_operations'
instance basis.  Towards that end arrange for flags to be generically
validated against a mmap_supported_mask exported by 'struct
file_operations'. By default all existing flags are implicitly
supported, but new flags require MAP_SHARED_VALIDATE and
per-instance-opt-in.
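
As a hedged sketch of that validation (illustrative only; LEGACY_MAP_MASK
stands in for "every pre-existing flag bit", and the real mm/mmap.c hunk
in this patch may differ in detail):

    switch (flags & MAP_TYPE) {
    case MAP_SHARED:
            /* legacy: unknown flag bits are silently ignored */
            break;
    case MAP_SHARED_VALIDATE:
            /* new: every flag bit must be understood, either by core
             * mm or by the backing file_operations instance */
            if (flags & ~(LEGACY_MAP_MASK |
                          file->f_op->mmap_supported_mask))
                    return -EOPNOTSUPP;
            break;
    case MAP_PRIVATE:
            break;
    default:
            return -EINVAL;
    }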

Cc: Jan Kara <j...@suse.cz>
Cc: Arnd Bergmann <a...@arndb.de>
Cc: Andy Lutomirski <l...@kernel.org>
Cc: Andrew Morton <a...@linux-foundation.org>
Suggested-by: Christoph Hellwig <h...@lst.de>
Suggested-by: Linus Torvalds <torva...@linux-foundation.org>
Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
---
 arch/alpha/include/uapi/asm/mman.h   |1 +
 arch/mips/include/uapi/asm/mman.h|1 +
 arch/mips/kernel/vdso.c  |2 +
 arch/parisc/include/uapi/asm/mman.h  |1 +
 arch/tile/mm/elf.c   |3 +-
 arch/xtensa/include/uapi/asm/mman.h  |1 +
 include/linux/fs.h   |2 +
 include/linux/mm.h   |2 +
 include/linux/mman.h |   39 ++
 include/uapi/asm-generic/mman-common.h   |1 +
 mm/mmap.c|   21 --
 tools/include/uapi/asm-generic/mman-common.h |1 +
 12 files changed, 69 insertions(+), 6 deletions(-)

diff --git a/arch/alpha/include/uapi/asm/mman.h 
b/arch/alpha/include/uapi/asm/mman.h
index 3b26cc62dadb..92823f24890b 100644
--- a/arch/alpha/include/uapi/asm/mman.h
+++ b/arch/alpha/include/uapi/asm/mman.h
@@ -14,6 +14,7 @@
 #define MAP_TYPE   0x0f/* Mask for type of mapping (OSF/1 is 
_wrong_) */
 #define MAP_FIXED  0x100   /* Interpret addr exactly */
 #define MAP_ANONYMOUS  0x10/* don't use a file */
+#define MAP_SHARED_VALIDATE 0x3/* share + validate extension 
flags */
 
 /* not used by linux, but here to make sure we don't clash with OSF/1 defines 
*/
 #define _MAP_HASSEMAPHORE 0x0200
diff --git a/arch/mips/include/uapi/asm/mman.h 
b/arch/mips/include/uapi/asm/mman.h
index da3216007fe0..c77689076577 100644
--- a/arch/mips/include/uapi/asm/mman.h
+++ b/arch/mips/include/uapi/asm/mman.h
@@ -30,6 +30,7 @@
 #define MAP_PRIVATE0x002   /* Changes are private */
 #define MAP_TYPE   0x00f   /* Mask for type of mapping */
 #define MAP_FIXED  0x010   /* Interpret addr exactly */
+#define MAP_SHARED_VALIDATE 0x3/* share + validate extension 
flags */
 
 /* not used by linux, but here to make sure we don't clash with ABI defines */
 #define MAP_RENAME 0x020   /* Assign page to file */
diff --git a/arch/mips/kernel/vdso.c b/arch/mips/kernel/vdso.c
index 019035d7225c..cf10654477a9 100644
--- a/arch/mips/kernel/vdso.c
+++ b/arch/mips/kernel/vdso.c
@@ -110,7 +110,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, 
int uses_interp)
base = mmap_region(NULL, STACK_TOP, PAGE_SIZE,
   VM_READ|VM_WRITE|VM_EXEC|
   VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
-  0, NULL);
+  0, NULL, 0);
if (IS_ERR_VALUE(base)) {
ret = base;
goto out;
diff --git a/arch/parisc/include/uapi/asm/mman.h 
b/arch/parisc/include/uapi/asm/mman.h
index 775b5d5e41a1..36b6

[PATCH v8 05/14] fs, xfs, iomap: introduce iomap_can_allocate()

2017-10-10 Thread Dan Williams
In preparation for using FL_LAYOUT leases to allow coordination between
the kernel and processes doing userspace flushes / RDMA with DAX
mappings, add this helper that can be used to detect when block-map
updates are not allowed.

This is targeted to be used in an ->iomap_begin() implementation where
we may have various filesystem locks held and can not synchronously wait
for any FL_LAYOUT leases to be released. In particular an iomap mmap
fault handler running under mmap_sem can not unlock that semaphore and
wait for these leases to be unlocked. Instead, this signals the lease
holder(s) that a break is requested and immediately returns with an
error.

Cc: Jan Kara <j...@suse.cz>
Cc: Jeff Moyer <jmo...@redhat.com>
Cc: Christoph Hellwig <h...@lst.de>
Cc: "Darrick J. Wong" <darrick.w...@oracle.com>
Cc: Ross Zwisler <ross.zwis...@linux.intel.com>
Suggested-by: Dave Chinner <da...@fromorbit.com>
Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
---
 fs/xfs/xfs_iomap.c|3 +++
 fs/xfs/xfs_layout.c   |5 -
 include/linux/iomap.h |   10 ++
 3 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index a1909bc064e9..b3cda11e9515 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1052,6 +1052,9 @@ xfs_file_iomap_begin(
error = -EAGAIN;
goto out_unlock;
}
+   error = iomap_can_allocate(inode);
+   if (error)
+   goto out_unlock;
/*
 * We cap the maximum length we map here to MAX_WRITEBACK_PAGES
 * pages to keep the chunks of work done where somewhat 
symmetric
diff --git a/fs/xfs/xfs_layout.c b/fs/xfs/xfs_layout.c
index 71d95e1a910a..88c533bf5b7c 100644
--- a/fs/xfs/xfs_layout.c
+++ b/fs/xfs/xfs_layout.c
@@ -19,7 +19,10 @@
  * about exposing unallocated blocks but just want to provide basic
  * synchronization between a local writer and pNFS clients.  mmap writes would
  * also benefit from this sort of synchronization, but due to the tricky 
locking
- * rules in the page fault path we don't bother.
+ * rules in the page fault path all we can do is start the lease break
+ * timeout. See usage of iomap_can_allocate in xfs_file_iomap_begin to
+ * prevent write-faults from allocating blocks or performing extent
+ * conversion.
  */
 int
 xfs_break_layouts(
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index f64dc6ce5161..e24b4e81d41a 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -2,6 +2,7 @@
 #define LINUX_IOMAP_H 1
 
 #include 
+#include 
 
 struct fiemap_extent_info;
 struct inode;
@@ -88,6 +89,15 @@ loff_t iomap_seek_hole(struct inode *inode, loff_t offset,
const struct iomap_ops *ops);
 loff_t iomap_seek_data(struct inode *inode, loff_t offset,
const struct iomap_ops *ops);
+/*
+ * Check if there are any file layout leases preventing block map
+ * changes and if so start the lease break process, but do not wait for
+ * it to complete (return -EWOULDBLOCK);
+ */
+static inline int iomap_can_allocate(struct inode *inode)
+{
+   return break_layout(inode, false);
+}
 
 /*
  * Flags for direct I/O ->end_io:



[PATCH v8 03/14] fs: MAP_DIRECT core

2017-10-10 Thread Dan Williams
Introduce a set of helper apis for filesystems to establish FL_LAYOUT
leases to protect against writes and block map updates while a
MAP_DIRECT mapping is established. While the lease protects against the
syscall write path and fallocate it does not protect against allocating
write-faults, so this relies on i_mapdcount to disable block map updates
from write faults.

Like the pnfs case MAP_DIRECT does its own timeout of the lease since we
need to have a process context for running map_direct_invalidate().
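
For orientation, a hedged sketch of how a filesystem consumes these
helpers (this mirrors the xfs wiring in patch 6; the fault handler name
is hypothetical):

    static const struct vm_operations_struct example_vm_direct_ops = {
            .open   = generic_map_direct_open,  /* from this patch */
            .close  = generic_map_direct_close, /* from this patch */
            /* hypothetical handler; it must consult
             * test_map_direct_valid() and return VM_FAULT_SIGBUS once
             * the lease has been broken */
            .fault  = example_filemap_fault,
    };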

Cc: Jan Kara <j...@suse.cz>
Cc: Jeff Moyer <jmo...@redhat.com>
Cc: Christoph Hellwig <h...@lst.de>
Cc: Dave Chinner <da...@fromorbit.com>
Cc: "Darrick J. Wong" <darrick.w...@oracle.com>
Cc: Ross Zwisler <ross.zwis...@linux.intel.com>
Cc: Jeff Layton <jlay...@poochiereds.net>
Cc: "J. Bruce Fields" <bfie...@fieldses.org>
Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
---
 fs/Kconfig|1 
 fs/Makefile   |2 
 fs/mapdirect.c|  237 +
 include/linux/mapdirect.h |   40 
 4 files changed, 279 insertions(+), 1 deletion(-)
 create mode 100644 fs/mapdirect.c
 create mode 100644 include/linux/mapdirect.h

diff --git a/fs/Kconfig b/fs/Kconfig
index 7aee6d699fd6..a7b31a96a753 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -37,6 +37,7 @@ source "fs/f2fs/Kconfig"
 config FS_DAX
bool "Direct Access (DAX) support"
depends on MMU
+   depends on FILE_LOCKING
depends on !(ARM || MIPS || SPARC)
select FS_IOMAP
select DAX
diff --git a/fs/Makefile b/fs/Makefile
index 7bbaca9c67b1..c0e791d235d8 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -29,7 +29,7 @@ obj-$(CONFIG_TIMERFD) += timerfd.o
 obj-$(CONFIG_EVENTFD)  += eventfd.o
 obj-$(CONFIG_USERFAULTFD)  += userfaultfd.o
 obj-$(CONFIG_AIO)   += aio.o
-obj-$(CONFIG_FS_DAX)   += dax.o
+obj-$(CONFIG_FS_DAX)   += dax.o mapdirect.o
 obj-$(CONFIG_FS_ENCRYPTION)+= crypto/
 obj-$(CONFIG_FILE_LOCKING)  += locks.o
 obj-$(CONFIG_COMPAT)   += compat.o compat_ioctl.o
diff --git a/fs/mapdirect.c b/fs/mapdirect.c
new file mode 100644
index ..9f4dd7395dcd
--- /dev/null
+++ b/fs/mapdirect.c
@@ -0,0 +1,237 @@
+/*
+ * Copyright(c) 2017 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define MAPDIRECT_BREAK 0
+#define MAPDIRECT_VALID 1
+
+struct map_direct_state {
+   atomic_t mds_ref;
+   atomic_t mds_vmaref;
+   unsigned long mds_state;
+   struct inode *mds_inode;
+   struct delayed_work mds_work;
+   struct fasync_struct *mds_fa;
+   struct vm_area_struct *mds_vma;
+};
+
+bool test_map_direct_valid(struct map_direct_state *mds)
+{
+   return test_bit(MAPDIRECT_VALID, &mds->mds_state);
+}
+EXPORT_SYMBOL_GPL(test_map_direct_valid);
+
+static void put_map_direct(struct map_direct_state *mds)
+{
+   if (!atomic_dec_and_test(&mds->mds_ref))
+   return;
+   kfree(mds);
+}
+
+static void put_map_direct_vma(struct map_direct_state *mds)
+{
+   struct vm_area_struct *vma = mds->mds_vma;
+   struct file *file = vma->vm_file;
+   struct inode *inode = file_inode(file);
+   void *owner = mds;
+
+   if (!atomic_dec_and_test(&mds->mds_vmaref))
+   return;
+
+   /*
+* Flush in-flight+forced lm_break events that may be
+* referencing this dying vma.
+*/
+   mds->mds_vma = NULL;
+   set_bit(MAPDIRECT_BREAK, &mds->mds_state);
+   vfs_setlease(vma->vm_file, F_UNLCK, NULL, &owner);
+   flush_delayed_work(&mds->mds_work);
+   iput(inode);
+
+   put_map_direct(mds);
+}
+
+void generic_map_direct_close(struct vm_area_struct *vma)
+{
+   put_map_direct_vma(vma->vm_private_data);
+}
+EXPORT_SYMBOL_GPL(generic_map_direct_close);
+
+static void get_map_direct_vma(struct map_direct_state *mds)
+{
+   atomic_inc(&mds->mds_vmaref);
+}
+
+void generic_map_direct_open(struct vm_area_struct *vma)
+{
+   get_map_direct_vma(vma->vm_private_data);
+}
+EXPORT_SYMBOL_GPL(generic_map_direct_open);
+
+static void map_direct_invalidate(struct work_struct *work)
+{
+   struct map_direct_state *mds;
+   struct vm_area_struct *vma;
+   struct inode *inode;
+   void *owner;
+
+   mds = container_of(work, typeof(*mds), mds_work.

[PATCH v8 02/14] fs, mm: pass fd to ->mmap_validate()

2017-10-10 Thread Dan Williams
The MAP_DIRECT mechanism for mmap intends to use a file lease to prevent
block map changes while the file is mapped. It requires the fd to setup
an fasync_struct for signalling lease break events to the lease holder.

Cc: Jan Kara <j...@suse.cz>
Cc: Jeff Moyer <jmo...@redhat.com>
Cc: Christoph Hellwig <h...@lst.de>
Cc: Dave Chinner <da...@fromorbit.com>
Cc: "Darrick J. Wong" <darrick.w...@oracle.com>
Cc: Ross Zwisler <ross.zwis...@linux.intel.com>
Cc: Andrew Morton <a...@linux-foundation.org>
Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
---
 arch/mips/kernel/vdso.c |2 +-
 arch/tile/mm/elf.c  |2 +-
 arch/x86/mm/mpx.c   |3 ++-
 fs/aio.c|2 +-
 include/linux/fs.h  |2 +-
 include/linux/mm.h  |9 +
 ipc/shm.c   |3 ++-
 mm/internal.h   |2 +-
 mm/mmap.c   |   13 +++--
 mm/nommu.c  |5 +++--
 mm/util.c   |7 ---
 11 files changed, 28 insertions(+), 22 deletions(-)

diff --git a/arch/mips/kernel/vdso.c b/arch/mips/kernel/vdso.c
index cf10654477a9..ab26c7ac0316 100644
--- a/arch/mips/kernel/vdso.c
+++ b/arch/mips/kernel/vdso.c
@@ -110,7 +110,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, 
int uses_interp)
base = mmap_region(NULL, STACK_TOP, PAGE_SIZE,
   VM_READ|VM_WRITE|VM_EXEC|
   VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
-  0, NULL, 0);
+  0, NULL, 0, -1);
if (IS_ERR_VALUE(base)) {
ret = base;
goto out;
diff --git a/arch/tile/mm/elf.c b/arch/tile/mm/elf.c
index 5ffcbe76aef9..61a9588e141a 100644
--- a/arch/tile/mm/elf.c
+++ b/arch/tile/mm/elf.c
@@ -144,7 +144,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm,
addr = mmap_region(NULL, addr, INTRPT_SIZE,
   VM_READ|VM_EXEC|
   VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC, 0,
-  NULL, 0);
+  NULL, 0, -1);
if (addr > (unsigned long) -PAGE_SIZE)
retval = (int) addr;
}
diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
index 9ceaa955d2ba..a8baa94a496b 100644
--- a/arch/x86/mm/mpx.c
+++ b/arch/x86/mm/mpx.c
@@ -52,7 +52,8 @@ static unsigned long mpx_mmap(unsigned long len)
 
down_write(&mm->mmap_sem);
addr = do_mmap(NULL, 0, len, PROT_READ | PROT_WRITE,
-  MAP_ANONYMOUS | MAP_PRIVATE, VM_MPX, 0, &populate, NULL);
+   MAP_ANONYMOUS | MAP_PRIVATE, VM_MPX, 0, &populate,
+   NULL, -1);
up_write(&mm->mmap_sem);
if (populate)
mm_populate(addr, populate);
diff --git a/fs/aio.c b/fs/aio.c
index 5a2487217072..d10ca6db2ee6 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -519,7 +519,7 @@ static int aio_setup_ring(struct kioctx *ctx, unsigned int 
nr_events)
 
ctx->mmap_base = do_mmap_pgoff(ctx->aio_ring_file, 0, ctx->mmap_size,
   PROT_READ | PROT_WRITE,
-  MAP_SHARED, 0, &unused, NULL);
+  MAP_SHARED, 0, &unused, NULL, -1);
up_write(&mm->mmap_sem);
if (IS_ERR((void *)ctx->mmap_base)) {
ctx->mmap_size = 0;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 51538958f7f5..c2b9bf3dc4e9 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1702,7 +1702,7 @@ struct file_operations {
long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
int (*mmap) (struct file *, struct vm_area_struct *);
int (*mmap_validate) (struct file *, struct vm_area_struct *,
-   unsigned long);
+   unsigned long, int);
int (*open) (struct inode *, struct file *);
int (*flush) (struct file *, fl_owner_t id);
int (*release) (struct inode *, struct file *);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5c4c98e4adc9..0afa19feb755 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2133,11 +2133,11 @@ extern unsigned long get_unmapped_area(struct file *, 
unsigned long, unsigned lo
 
 extern unsigned long mmap_region(struct file *file, unsigned long addr,
unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
-   struct list_head *uf, unsigned long map_flags);
+   struct list_head *uf, unsigned long map_flags, int fd);
 extern unsigned long do_mmap(struct file *file, unsigned long addr,
unsigned long len, unsigned long prot, unsigned long flags,
vm_flags_t vm_flags, unsigned long pgoff, unsigned long *populate,
-   struct list_head *uf);
+   struct list_head *uf, int fd);
 extern int do_munmap(struct mm_struct *, unsigned long, size_t,
   

Re: [PATCH v8 2/2] IB/core: use MAP_DIRECT to fix / enable RDMA to DAX mappings

2017-10-08 Thread Dan Williams
On Sat, Oct 7, 2017 at 11:45 PM, kbuild test robot <l...@intel.com> wrote:
> Hi Dan,
>
> [auto build test ERROR on rdma/master]
> [also build test ERROR on v4.14-rc3 next-20170929]
> [if your patch is applied to the wrong git tree, please drop us a note to 
> help improve the system]

This was a fixed-up resend of patch [v7 11/12]. It's not clear how to
teach the kbuild robot to be aware of patch-replies to individual
patches in a series, i.e. reworked patches sent without resending the
complete series.

> url:
> https://github.com/0day-ci/linux/commits/Dan-Williams/iommu-up-level-sg_num_pages-from-amd-iommu/20171008-133505
> base:   https://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma.git 
> master
> config: i386-randconfig-n0-201741 (attached as .config)
> compiler: gcc-4.8 (Debian 4.8.4-1) 4.8.4
> reproduce:
> # save the attached .config to linux build tree
> make ARCH=i386
>
> All errors (new ones prefixed by >>):
>
>>> drivers/infiniband/core/umem.c:39:29: fatal error: linux/mapdirect.h: No 
>>> such file or directory
> #include <linux/mapdirect.h>

mapdirect.h indeed does not exist when missing the earlier patches in
the series. It would be slick if the 0day-robot read the
"in-reply-to" header and auto replaced a patch in a series, but that
would be a feature approaching magic.

>compilation terminated.
>
> vim +39 drivers/infiniband/core/umem.c
>
>   > 39  #include <linux/mapdirect.h>
> 40  #include 
> 41  #include 
> 42  #include 
> 43  #include 
> 44
>
> ---
> 0-DAY kernel test infrastructureOpen Source Technology Center
> https://lists.01.org/pipermail/kbuild-all   Intel Corporation


[PATCH v8 2/2] IB/core: use MAP_DIRECT to fix / enable RDMA to DAX mappings

2017-10-07 Thread Dan Williams
Currently the ibverbs core in the kernel is completely unaware of the
dangers of filesystem-DAX mappings. Specifically, the filesystem is free
to move file blocks at will. In the case of DAX, it means that RDMA to a
given file offset can dynamically switch to another file offset, another
file, or free space with no notification to RDMA device to cease
operations. Historically, this lack of communication between the ibverbs
core and filesystem was not a problem because RDMA always targeted
dynamically allocated page cache, so at least the RDMA device would have
valid memory to target even if the file was being modified. With DAX we
need to add coordination since RDMA is bypassing page-cache and going
direct to on-media pages of the file. RDMA to DAX can cause damage if
filesystem blocks move / change state.

Use the new ->lease_direct() operation to get a notification when the
filesystem is invalidating the block map of the file and needs RDMA
operations to stop. Given that the kernel can not be in a position where
it needs to wait indefinitely for userspace to stop a device we need a
mechanism where the kernel can force-revoke access. Towards that end, use
the dma_get_iommu_domain() to both check if the device has domain
mappings that can be invalidated and retrieve the iommu_domain for use
with iommu_unmap.

Once we have that assurance that we can block in-flight I/O when the
file's block map changes then we can safely allow RDMA to DAX.

Cc: Sean Hefty <sean.he...@intel.com>
Cc: Doug Ledford <dledf...@redhat.com>
Cc: Hal Rosenstock <hal.rosenst...@gmail.com>
Cc: Jan Kara <j...@suse.cz>
Cc: Jeff Moyer <jmo...@redhat.com>
Cc: Christoph Hellwig <h...@lst.de>
Cc: Dave Chinner <da...@fromorbit.com>
Cc: Joerg Roedel <j...@8bytes.org>
Cc: David Woodhouse <dw...@infradead.org>
Cc: Ashok Raj <ashok@intel.com>
Cc: "Darrick J. Wong" <darrick.w...@oracle.com>
Cc: Ross Zwisler <ross.zwis...@linux.intel.com>
Cc: Jeff Layton <jlay...@poochiereds.net>
Cc: "J. Bruce Fields" <bfie...@fieldses.org>
Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
---
Changes since v7:
* Switch from dma_has_iommu() to dma_get_iommu_domain().
* Switch from dma_unmap_sg() at lease break time, to iommu_unmap() so
  that the IOVA remains allocated while the device might still be
  sending transactions.

 drivers/infiniband/core/umem.c |   90 +++-
 include/rdma/ib_umem.h |8 
 2 files changed, 86 insertions(+), 12 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 21e60b1e2ff4..5e4598982359 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -36,6 +36,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -46,10 +47,16 @@
 
 static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int 
dirty)
 {
+   struct lease_direct *ld, *_ld;
struct scatterlist *sg;
struct page *page;
int i;
 
+   list_for_each_entry_safe(ld, _ld, &umem->leases, list) {
+   list_del_init(&ld->list);
+   map_direct_lease_destroy(ld);
+   }
+
if (umem->nmap > 0)
ib_dma_unmap_sg(dev, umem->sg_head.sgl,
umem->npages,
@@ -64,10 +71,20 @@ static void __ib_umem_release(struct ib_device *dev, struct 
ib_umem *umem, int d
}
 
sg_free_table(&umem->sg_head);
-   return;
 
 }
 
+static void ib_umem_lease_break(void *__umem)
+{
+   struct ib_umem *umem = __umem;
+   struct ib_device *idev = umem->context->device;
+   struct device *dev = idev->dma_device;
+   struct scatterlist *sgl = umem->sg_head.sgl;
+
+   iommu_unmap(umem->iommu, sg_dma_address(sgl) & PAGE_MASK,
+   iommu_sg_num_pages(dev, sgl, umem->npages));
+}
+
 /**
  * ib_umem_get - Pin and DMA map userspace memory.
  *
@@ -96,7 +113,10 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, 
unsigned long addr,
struct scatterlist *sg, *sg_list_start;
int need_release = 0;
unsigned int gup_flags = FOLL_WRITE;
+   struct vm_area_struct *vma_prev = NULL;
+   struct device *dma_dev;
 
+   dma_dev = context->device->dma_device;
if (dmasync)
dma_attrs |= DMA_ATTR_WRITE_BARRIER;
 
@@ -120,6 +140,7 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, 
unsigned long addr,
umem->address= addr;
umem->page_shift = PAGE_SHIFT;
umem->pid= get_task_pid(current, PIDTYPE_PID);
+   INIT_LIST_HEAD(&umem->leases);
/*
 * We ask for writable memory if any of the following
 * access flags are set.  "Local write" and "remote write"
@@ -147,19 +168,21 @@ struct ib_umem *ib_umem_get(struct ib_uc

[PATCH v8 1/2] iommu: up-level sg_num_pages() from amd-iommu

2017-10-07 Thread Dan Williams
iommu_sg_num_pages() is a helper that walks a scattlerlist and counts
pages taking segment boundaries and iommu_num_pages() into account.
Up-level it for determining the IOVA range that dma_map_ops established
at dma_map_sg() time. The intent is to iommu_unmap() the IOVA range in
advance of freeing IOVA range.

Cc: Joerg Roedel <j...@8bytes.org>
Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
---
New patch in v8.

 drivers/iommu/amd_iommu.c |   30 ++
 drivers/iommu/iommu.c |   27 +++
 include/linux/iommu.h |2 ++
 3 files changed, 31 insertions(+), 28 deletions(-)

diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index c8e1a45af182..4795b0823469 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -2459,32 +2459,6 @@ static void unmap_page(struct device *dev, dma_addr_t 
dma_addr, size_t size,
__unmap_single(dma_dom, dma_addr, size, dir);
 }
 
-static int sg_num_pages(struct device *dev,
-   struct scatterlist *sglist,
-   int nelems)
-{
-   unsigned long mask, boundary_size;
-   struct scatterlist *s;
-   int i, npages = 0;
-
-   mask  = dma_get_seg_boundary(dev);
-   boundary_size = mask + 1 ? ALIGN(mask + 1, PAGE_SIZE) >> PAGE_SHIFT :
-  1UL << (BITS_PER_LONG - PAGE_SHIFT);
-
-   for_each_sg(sglist, s, nelems, i) {
-   int p, n;
-
-   s->dma_address = npages << PAGE_SHIFT;
-   p = npages % boundary_size;
-   n = iommu_num_pages(sg_phys(s), s->length, PAGE_SIZE);
-   if (p + n > boundary_size)
-   npages += boundary_size - p;
-   npages += n;
-   }
-
-   return npages;
-}
-
 /*
  * The exported map_sg function for dma_ops (handles scatter-gather
  * lists).
@@ -2507,7 +2481,7 @@ static int map_sg(struct device *dev, struct scatterlist 
*sglist,
dma_dom  = to_dma_ops_domain(domain);
dma_mask = *dev->dma_mask;
 
-   npages = sg_num_pages(dev, sglist, nelems);
+   npages = iommu_sg_num_pages(dev, sglist, nelems);
 
address = dma_ops_alloc_iova(dev, dma_dom, npages, dma_mask);
if (address == AMD_IOMMU_MAPPING_ERROR)
@@ -2585,7 +2559,7 @@ static void unmap_sg(struct device *dev, struct 
scatterlist *sglist,
 
startaddr = sg_dma_address(sglist) & PAGE_MASK;
dma_dom   = to_dma_ops_domain(domain);
-   npages= sg_num_pages(dev, sglist, nelems);
+   npages= iommu_sg_num_pages(dev, sglist, nelems);
 
__unmap_single(dma_dom, startaddr, npages << PAGE_SHIFT, dir);
 }
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 3de5c0bcb5cc..cfe6eeea3578 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -33,6 +33,7 @@
 #include 
 #include 
 #include 
+#include 
 
 static struct kset *iommu_group_kset;
 static DEFINE_IDA(iommu_group_ida);
@@ -1631,6 +1632,32 @@ size_t iommu_unmap_fast(struct iommu_domain *domain,
 }
 EXPORT_SYMBOL_GPL(iommu_unmap_fast);
 
+int iommu_sg_num_pages(struct device *dev, struct scatterlist *sglist,
+   int nelems)
+{
+   unsigned long mask, boundary_size;
+   struct scatterlist *s;
+   int i, npages = 0;
+
+   mask = dma_get_seg_boundary(dev);
+   boundary_size = mask + 1 ? ALIGN(mask + 1, PAGE_SIZE) >> PAGE_SHIFT
+   : 1UL << (BITS_PER_LONG - PAGE_SHIFT);
+
+   for_each_sg(sglist, s, nelems, i) {
+   int p, n;
+
+   s->dma_address = npages << PAGE_SHIFT;
+   p = npages % boundary_size;
+   n = iommu_num_pages(sg_phys(s), s->length, PAGE_SIZE);
+   if (p + n > boundary_size)
+   npages += boundary_size - p;
+   npages += n;
+   }
+
+   return npages;
+}
+EXPORT_SYMBOL_GPL(iommu_sg_num_pages);
+
 size_t default_iommu_map_sg(struct iommu_domain *domain, unsigned long iova,
 struct scatterlist *sg, unsigned int nents, int prot)
 {
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index a7f2ac689d29..5b2d20e1475a 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -303,6 +303,8 @@ extern size_t iommu_unmap(struct iommu_domain *domain, 
unsigned long iova,
  size_t size);
 extern size_t iommu_unmap_fast(struct iommu_domain *domain,
   unsigned long iova, size_t size);
+extern int iommu_sg_num_pages(struct device *dev, struct scatterlist *sglist,
+   int nelems);
 extern size_t default_iommu_map_sg(struct iommu_domain *domain, unsigned long 
iova,
struct scatterlist *sg,unsigned int nents,
int prot);
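
For intuition, a hedged worked example of the count, assuming 4K pages
and a 64K segment boundary (mask 0xffff, i.e. boundary_size = 16 pages):

    sg[0]: phys 0x10000, length 20K -> iommu_num_pages() = 5,
           dma_address = page 0
    sg[1]: phys 0x20800, length 4K  -> iommu_num_pages() = 2, because
           the 0x800 intra-page offset makes the buffer straddle two
           pages; dma_address = page 5

Neither entry crosses the 16-page boundary, so no padding pages are
inserted and iommu_sg_num_pages() returns 7. A caller can recompute the
same count later to iommu_unmap() the whole IOVA range, which is exactly
what ib_umem_lease_break() does in patch 2/2.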



Re: DMA error when sg->offset value is greater than PAGE_SIZE in Intel IOMMU

2017-09-27 Thread Dan Williams
On Wed, Sep 27, 2017 at 9:31 AM, Casey Leedom <lee...@chelsio.com> wrote:
> | From: Dan Williams <dan.j.willi...@intel.com>
> | Sent: Tuesday, September 26, 2017 9:10 AM
> |
> | On Tue, Sep 26, 2017 at 9:06 AM, Casey Leedom <lee...@chelsio.com> wrote:
> | > | From: Robin Murphy <robin.mur...@arm.com>
> | > | Sent: Tuesday, September 26, 2017 7:22 AM
> | > |...
> | > ...
> | >   Regardless, it seems that you agree that there's an issue with the Intel
> | > I/O MMU support code with regard to the legal values which a (struct
> | > scatterlist) can take on?  I still can't find any documentation for this
> | > and, personally, I'm a bit baffled by a Page-oriented Scatter/Gather List
> | > representation where [Offset, Offset+Length) can reside outside the Page.
> |
> | Consider the case where the page represents a huge page, then an
> | offset greater than PAGE_SIZE (up to HPAGE_SIZE) makes sense.
>
>   Okay, but whatever the underlaying Page Size is, should [Offset,
> Offset+Length) completely reside within the referenced Page?  I'm just
> trying to understand the Invariance Conditions which are assumed by all of
> the code which processes Scatter/gather Lists ...

As far as I can see "Offset can be greater than PAGE_SIZE" is the only
safe assumption for core code.
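
A hedged concrete example: with a 2M compound page backing a network
buffer, core code can legitimately hand a driver an entry like

    sg_page(sg): head page, page_to_phys() == 0x200000 (2M aligned)
    sg->offset = 0x5000     (20K, five small pages into the buffer)
    sg->length = 0x2000

    sg_phys(sg) = 0x200000 + 0x5000 = 0x205000

so mapping code that derives the target physical address from sg_page()
alone quietly maps the head of the huge page instead of the data 20K
further in.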


Re: DMA error when sg->offset value is greater than PAGE_SIZE in Intel IOMMU

2017-09-25 Thread Dan Williams
On Mon, Sep 25, 2017 at 1:05 PM, Casey Leedom <lee...@chelsio.com> wrote:
> | From: Dan Williams <dan.j.willi...@intel.com>
> | Sent: Monday, September 25, 2017 12:31 PM
> | ...
> | IIUC it looks like this has been broken ever since commit e1605495c716
> | "intel-iommu: Introduce domain_sg_mapping() to speed up
> | intel_map_sg()". I.e. it looks like the calculation for pte_val should
> | be:
> |
> | pteval = (page_to_phys(sg_page(sg)) + sg->offset) | prot;
>
> Hhmmm, shouldn't that be:
>
> pteval = (page_to_phys(sg_page(sg)) + (sg->offset>>PAGE_SHIFT)) | prot;

Yes, I think you're right. We do want to mask off the page-unaligned
portion of sg->offset.


Re: DMA error when sg->offset value is greater than PAGE_SIZE in Intel IOMMU

2017-09-25 Thread Dan Williams
On Mon, Sep 25, 2017 at 10:46 AM, Casey Leedom <lee...@chelsio.com> wrote:
> | From: Robin Murphy <robin.mur...@arm.com>
> | Sent: Wednesday, September 20, 2017 3:12 AM
> |
> | On 20/09/17 09:01, Herbert Xu wrote:
> | >
> | > Harsh Jain <ha...@chelsio.com> wrote:
> | >>
> | >> While debugging DMA mapping error in chelsio crypto driver we
> | >> observed that when scatter/gather list received by driver has
> | >> some entry with page->offset > 4096 (PAGE_SIZE). It starts
> | >> giving DMA error.  Without IOMMU it works fine.
> | >
> | > This is not a bug.  The network stack can and will feed us such
> | > SG lists.
> | >
> | >> 2) It cannot be driver's responsibilty to update received sg
> | >> entries to adjust offset and page because we are not the only
> | >> one who directly uses received sg list.
> | >
> | > No the driver must deal with this.  Having said that, if we can
> | > improve our driver helper interface to make this easier then we
> | > should do that too.  What we certainly shouldn't do is to take a
> | > whack-a-mole approach like this patch does.
> |
> | AFAICS this is entirely on intel-iommu - from a brief look it appears
> | that all the IOVA calculations would handle the offset correctly, but
> | then __domain_mapping() blindly uses sg_page() for the physical address,
> | so if offset is larger than a page it would end up with the DMA mapping
> | covering the wrong part of the buffer.
> |
> | Does the diff below help?
> |
> | Robin.
> |
> | ->8-
> | diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> | index b3914fce8254..2ed43d928135 100644
> | --- a/drivers/iommu/intel-iommu.c
> | +++ b/drivers/iommu/intel-iommu.c
> | @@ -2253,7 +2253,7 @@ static int __domain_mapping(struct dmar_domain 
> *domain, unsigned long iov_pfn,
> |  sg_res = aligned_nrpages(sg->offset, sg->length);
> |  sg->dma_address = ((dma_addr_t)iov_pfn << 
> VTD_PAGE_SHIFT) + sg->offset;
> |  sg->dma_length = sg->length;
> | -   pteval = page_to_phys(sg_page(sg)) | prot;
> | +   pteval = (sg_phys(sg) & PAGE_MASK) | prot;

This breaks on platforms where sizeof(phys_addr_t) > sizeof(unsigned
long). I.e. it's not always safe to assume that PAGE_MASK is the
correct width.

> |  phys_pfn = pteval >> VTD_PAGE_SHIFT;
> |  }
>
>   Adding some likely people to the Cc list so they can comment on this.
> Dan Williams submitted that specific piece of code in kernel.org:3e6110fd54
> ... but there are lots of similar bits in that function.  Hopefully one of
> the Intel I/O MMU Gurus will have a better idea of what may be going wrong
> here.  In the mean time I've asked our team to gather far more detailed
> debug traces showing the exact Scatter/Gather Lists we're getting, what they
> get translated to in the DMA Mappings, and what DMA Addresses were seeing in
> error.

IIUC it looks like this has been broken ever since commit e1605495c716
"intel-iommu: Introduce domain_sg_mapping() to speed up
intel_map_sg()". I.e. it looks like the calculation for pte_val should
be:

pteval = (page_to_phys(sg_page(sg)) + sg->offset) | prot;


Re: [PATCH v1] ACPI: Switch to use generic UUID API

2017-05-05 Thread Dan Williams
On Thu, May 4, 2017 at 2:21 AM, Andy Shevchenko
<andriy.shevche...@linux.intel.com> wrote:
> acpi_evaluate_dsm() and friends take a pointer to a raw buffer of 16
> bytes. Instead we convert them to use uuid_le type. At the same time we
> convert current users.
>
> acpi_str_to_uuid() becomes useless after the conversion and it's safe to
> get rid of it.
>
> The conversion fixes a potential bug in int340x_thermal as well since
> we have to use memcmp() on binary data.
>
> Cc: Rafael J. Wysocki <r...@rjwysocki.net>
> Cc: Mika Westerberg <mika.westerb...@linux.intel.com>
> Cc: Borislav Petkov <b...@suse.de>
> Cc: Dan Williams <dan.j.willi...@intel.com>
> Cc: Amir Goldstein <amir7...@gmail.com>
> Cc: Jarkko Sakkinen <jarkko.sakki...@linux.intel.com>
> Cc: Jani Nikula <jani.nik...@linux.intel.com>
> Cc: Ben Skeggs <bske...@redhat.com>
> Cc: Benjamin Tissoires <benjamin.tissoi...@redhat.com>
> Cc: Joerg Roedel <j...@8bytes.org>
> Cc: Adrian Hunter <adrian.hun...@intel.com>
> Cc: Yisen Zhuang <yisen.zhu...@huawei.com>
> Cc: Bjorn Helgaas <bhelg...@google.com>
> Cc: Zhang Rui <rui.zh...@intel.com>
> Cc: Felipe Balbi <ba...@kernel.org>
> Cc: Mathias Nyman <mathias.ny...@intel.com>
> Cc: Heikki Krogerus <heikki.kroge...@linux.intel.com>
> Cc: Liam Girdwood <lgirdw...@gmail.com>
> Cc: Mark Brown <broo...@kernel.org>
> Signed-off-by: Andy Shevchenko <andriy.shevche...@linux.intel.com>
[..]
> diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
> index 0f7982a1caaf..bd3e45ede056 100644
> --- a/drivers/acpi/nfit/core.c
> +++ b/drivers/acpi/nfit/core.c
> @@ -74,11 +74,11 @@ struct nfit_table_prev {
> struct list_head flushes;
>  };
>
> -static u8 nfit_uuid[NFIT_UUID_MAX][16];
> +static uuid_le nfit_uuid[NFIT_UUID_MAX];
>
> -const u8 *to_nfit_uuid(enum nfit_uuids id)
> +const uuid_le *to_nfit_uuid(enum nfit_uuids id)
>  {
> -   return nfit_uuid[id];
> +   return &nfit_uuid[id];
>  }
>  EXPORT_SYMBOL(to_nfit_uuid);
>
> @@ -207,7 +207,7 @@ int acpi_nfit_ctl(struct nvdimm_bus_descriptor *nd_desc, 
> struct nvdimm *nvdimm,
> u32 offset, fw_status = 0;
> acpi_handle handle;
> unsigned int func;
> -   const u8 *uuid;
> +   const uuid_le *uuid;
> int rc, i;
>
> func = cmd;
> @@ -394,7 +394,7 @@ int nfit_spa_type(struct acpi_nfit_system_address *spa)
> int i;
>
> for (i = 0; i < NFIT_UUID_MAX; i++)
> -   if (memcmp(to_nfit_uuid(i), spa->range_guid, 16) == 0)
> +   if (!uuid_le_cmp_pp(to_nfit_uuid(i), (uuid_le 
> *)spa->range_guid))

What is _cmp_pp? Why not _compare?

Other than that, looks ok to me.


Re: [PATCH v5 3/9] dma-mapping: add dma_{map,unmap}_resource

2016-03-11 Thread Dan Williams
On Fri, Mar 11, 2016 at 5:46 AM, Robin Murphy <robin.mur...@arm.com> wrote:
> Hi Dan,
>
>
> On 11/03/16 06:47, Dan Williams wrote:
>>
>> On Thu, Mar 10, 2016 at 8:05 AM, Niklas Söderlund
>> <niklas.soderl...@ragnatech.se> wrote:
>>>
>>> Hi Christoph,
>>>
>>> On 2016-03-07 23:38:47 -0800, Christoph Hellwig wrote:
>>>>
>>>> Please add some documentation on where/how this should be used.  It's
>>>> not a very obvious interface.
>>>
>>>
>>> Good idea, I have added the following to Documentation/DMA-API.txt and
>>> folded it in to this patch. Do you feel it's adequate and do you know
>>> anywhere else I should add documentation?
>>>
>>> diff --git a/Documentation/DMA-API.txt b/Documentation/DMA-API.txt
>>> index 45ef3f2..248556a 100644
>>> --- a/Documentation/DMA-API.txt
>>> +++ b/Documentation/DMA-API.txt
>>> @@ -277,14 +277,29 @@ and  parameters are provided to do partial
>>> page mapping, it is
>>>   recommended that you never use these unless you really know what the
>>>   cache width is.
>>>
>>> +dma_addr_t
>>> +dma_map_resource(struct device *dev, phys_addr_t phys_addr, size_t size,
>>> +enum dma_data_direction dir, struct dma_attrs *attrs)
>>> +
>>> +Maps a MMIO region so it can be accessed by the device and returns the
>>> +DMA address of the memory. API should only be used to map device MMIO,
>>> +mapping of RAM is not permitted.
>>> +
>>
>>
>> I think it is confusing to use the dma_ prefix for this peer-to-peer
>> mmio functionality.  dma_addr_t is a device's view of host memory.
>> Something like bus_addr_t bus_map_resource().  Doesn't this routine
>> also need the source device in addition to the target device?  The
>> resource address is from the perspective of the host cpu, it may be a
>> different address space in the view of two devices relative to each
>> other.
>
>
> Hmm, the trouble with that is that when the DMA master is behind an IOMMU,
> the address space as seen by the device is dynamic and whatever we decide it
> to be, so there is no distinction between a "DMA" address and a "bus"
> address.
>
> In practice the dmaengine API has clearly worked for however long with slave
> MMIO addresses being a dma_addr_t, and it doesn't look like anyone objected
> to the change to phys_addr_t in -next either. If nothing is using bus_addr_t
> anyway, what's the right thing to do? Looking up through higher abstraction
> layers, we have the likes of struct snd_dmaengine_dai_dma_data also
> expecting the slave address to be a dma_addr_t, leading to things like the
> direct casting in bcm2835_i2s_probe() for the non-IOMMU dma != phys != bus
> case that could also be cleaned up with this proposed interface.
>

So the "bus_addr_t" reaction was prompted by the recent activity of
RDMA developers looking to re-use the devm_memremap_pages() api.  That
enabling is looking at how to setup peer-to-peer PCI-E cycles for an
RDMA device to deliver data to another local device without taking a
round trip through host memory.

I understand the history of the dmaengine-slave implementation, but it
seems we're getting to a point where we need a less overloaded
identifier than "dma" for the case of devices talking to each other.


Re: [PATCH v5 3/9] dma-mapping: add dma_{map,unmap}_resource

2016-03-10 Thread Dan Williams
On Thu, Mar 10, 2016 at 8:05 AM, Niklas Söderlund
<niklas.soderl...@ragnatech.se> wrote:
> Hi Christoph,
>
> On 2016-03-07 23:38:47 -0800, Christoph Hellwig wrote:
>> Please add some documentation on where/how this should be used.  It's
>> not a very obvious interface.
>
> Good idea, I have added the following to Documentation/DMA-API.txt and
> folded it in to this patch. Do you feel it's adequate and do you know
> anywhere else I should add documentation?
>
> diff --git a/Documentation/DMA-API.txt b/Documentation/DMA-API.txt
> index 45ef3f2..248556a 100644
> --- a/Documentation/DMA-API.txt
> +++ b/Documentation/DMA-API.txt
> @@ -277,14 +277,29 @@ and  parameters are provided to do partial page 
> mapping, it is
>  recommended that you never use these unless you really know what the
>  cache width is.
>
> +dma_addr_t
> +dma_map_resource(struct device *dev, phys_addr_t phys_addr, size_t size,
> +enum dma_data_direction dir, struct dma_attrs *attrs)
> +
> +Maps a MMIO region so it can be accessed by the device and returns the
> +DMA address of the memory. API should only be used to map device MMIO,
> +mapping of RAM is not permitted.
> +

I think it is confusing to use the dma_ prefix for this peer-to-peer
mmio functionality.  dma_addr_t is a device's view of host memory.
Something like bus_addr_t bus_map_resource().  Doesn't this routine
also need the source device in addition to the target device?  The
resource address is from the perspective of the host cpu, it may be a
different address space in the view of two devices relative to each
other.
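
For context, a hedged sketch of how a dmaengine driver would consume the
interface quoted above (signature as posted in this thread; "master_dev"
and "fifo_phys" are illustrative names):

    /* map a slave device's 4-byte MMIO FIFO for the DMA master */
    dma_addr_t fifo_dma;

    fifo_dma = dma_map_resource(master_dev, fifo_phys, 4,
                                DMA_TO_DEVICE, NULL);
    if (dma_mapping_error(master_dev, fifo_dma))
            return -EIO;
    /* ... program the controller with fifo_dma as the destination ... */
    dma_unmap_resource(master_dev, fifo_dma, 4, DMA_TO_DEVICE, NULL);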


Re: [arch] WARNING: CPU: 0 PID: 0 at kernel/memremap.c:31 memremap()

2015-07-25 Thread Dan Williams
[ note: this patch is in a dev branch for test coverage, safe to
disregard for now ]

On Wed, Jul 22, 2015 at 4:32 PM, kernel test robot
fengguang...@intel.com wrote:
 Greetings,

 0day kernel testing robot got the below dmesg and the first bad commit is

 git://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm.git pmem-api

 commit 163f9409a57082aed03fbeeb321fbf18bdaf5f42
 Author: Dan Williams dan.j.willi...@intel.com
 AuthorDate: Wed Jul 22 18:09:01 2015 -0400
 Commit: Dan Williams dan.j.willi...@intel.com
 CommitDate: Wed Jul 22 18:09:01 2015 -0400

 arch: introduce memremap(), replace ioremap_cache()

 Existing users of ioremap_cache() are mapping memory that is known in
 advance to not have i/o side effects.  These users are forced to cast
 away the __iomem annotation, or otherwise neglect to fix the sparse
 errors thrown when dereferencing pointers to this memory.  Provide
 memremap() as a non __iomem annotated ioremap_*() in the case when
 ioremap is otherwise a pointer to memory.

 The ARCH_HAS_MEMREMAP kconfig symbol is introduced for archs to assert
 that it is safe to recast / reuse the return value from ioremap as a
 normal pointer to memory.  In other words, archs that mandate specific
 accessors for __iomem are not memremap() capable and drivers that care,
 like pmem, can add a dependency to disable themselves on these archs.

 Note, that memremap is a break from the ioremap implementation pattern
 of adding a new memremap_type() for each mapping type.  Instead,
 the implementation defines flags that are passed to the central
 memremap() implementation.

 Outside of ioremap() and ioremap_nocache(), the expectation is that most
 calls to ioremap_type() are seeking memory-like semantics (e.g.
 speculative reads, and prefetching permitted).  These callsites can be
 moved to memremap() over time.

 Cc: Arnd Bergmann a...@arndb.de
 Acked-by: Andy Shevchenko andy.shevche...@gmail.com
 Signed-off-by: Dan Williams dan.j.willi...@intel.com
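
 As a hedged illustration of the conversion this commit describes
 (hypothetical driver code; MEMREMAP_WB is the flag name assumed here):

    /* before: ioremap_cache() forces an __iomem cast-away */
    void __iomem *io = ioremap_cache(phys, size);
    void *tbl = (void __force *)io;

    /* after: memremap() hands back a plain pointer */
    void *tbl = memremap(phys, size, MEMREMAP_WB);
    if (!tbl)
            return -ENOMEM;
    /* ordinary loads/stores; release with memunmap(tbl) */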

 +-------------------------------------------------+------------+------------+------------+
 |                                                 | dc5d38e432 | 163f9409a5 | 5dfe2d5864 |
 +-------------------------------------------------+------------+------------+------------+
 | boot_successes                                  | 63         | 0          | 0          |
 | boot_failures                                   | 0          | 26         | 19         |
 | WARNING:at_kernel/memremap.c:#memremap()        | 0          | 26         | 19         |
 | backtrace:acpi_load_tables                      | 0          | 26         | 19         |
 | backtrace:acpi_early_init                       | 0          | 26         | 19         |
 | IP-Config:Auto-configuration_of_network_failed  | 0          | 0          | 2          |
 +-------------------------------------------------+------------+------------+------------+

 [0.227312] ACPI: Core revision 20150619
 [0.227690] memremap: acpi_os_map_iomem
 [0.228021] [ cut here ]
 [0.228406] WARNING: CPU: 0 PID: 0 at kernel/memremap.c:31 
 memremap+0x73/0x159()
 [0.229202] memremap attempted on unknown/mixed range 0x0ffe 
 size: 4096
 [0.229829] Modules linked in:
 [0.230106] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 
 4.2.0-rc3-7-g163f940 #1
 [0.230718]    c175bf04 c1489068 c175bf30 c175bf20 
 c103a1ae 001f
 [0.231456]  c10f6407 1000 0001 0001 c175bf38 c103a1f0 
 0009 c175bf30
 [0.232242]  c16b3ecb c175bf4c c175bf68 c10f6407 c16b3f04 001f 
 c16b3ecb c175bf54
 [0.232985] Call Trace:
 [0.233196]  [c1489068] dump_stack+0x48/0x60
 [0.233570]  [c103a1ae] warn_slowpath_common+0x89/0xa0
 [0.234016]  [c10f6407] ? memremap+0x73/0x159
 [0.234396]  [c103a1f0] warn_slowpath_fmt+0x2b/0x2f
 [0.234816]  [c10f6407] memremap+0x73/0x159
 [0.235190]  [c1486181] acpi_os_map_iomem+0x10b/0x15f
 [0.235660]  [c14861e2] acpi_os_map_memory+0xd/0xf
 [0.236089]  [c1293d5c] acpi_tb_acquire_table+0x35/0x5a
 [0.236538]  [c1293e41] acpi_tb_validate_table+0x22/0x37
 [0.237001]  [c1841255] acpi_load_tables+0x38/0x146
 [0.237424]  [c184080e] acpi_early_init+0x64/0xd4
 [0.237830]  [c1817b14] start_kernel+0x3e1/0x447
 [0.238232]  [c18172bb] i386_start_kernel+0x85/0x89
 [0.238734] ---[ end trace cb88537fdc8fa200 ]---
 [0.239133] ACPI Exception: AE_NO_ACPI_TABLES, While loading namespace 
 from ACPI tables (20150619/tbxfload-80)

 git bisect start 5dfe2d5864e91244e7befaa2317519ea15dc9c89 
 52721d9d3334c1cb1f76219a161084094ec634dc --
 git bisect good cfc652fabeee43adf800889a5a4a935a9af090a7  # 07:04 20+ 
  0  arch, drivers: don't include asm/io.h directly, use linux/io.h instead
 git bisect good dc5d38e432ff171125f746d80a037692feb16fc9  # 07:11 22+ 
  0  intel_iommu: fix